* [PATCHSET v3 0/5] Support for RWF_UNCACHED
@ 2019-12-11 15:29 Jens Axboe
  2019-12-11 15:29 ` [PATCH 1/5] fs: add read support " Jens Axboe
                   ` (6 more replies)
  0 siblings, 7 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-11 15:29 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block; +Cc: willy, clm, torvalds, david

Recently someone asked me how io_uring buffered IO compares to mmaped
IO in terms of performance. So I ran some tests with buffered IO, and
found the experience to be somewhat painful. The test case is pretty
basic, random reads over a dataset that's 10x the size of RAM.
Performance starts out fine, and then the page cache fills up and we
hit a throughput cliff. CPU usage of the IO threads goes up, and we have
kswapd spending 100% of a core trying to keep up. Seeing that, I was
reminded of the many complaints I hear about buffered IO, and the fact
that most of the folks complaining will ultimately bite the bullet and
move to O_DIRECT to just get the kernel out of the way.

But I don't think it needs to be like that. Switching to O_DIRECT isn't
always easily doable. The buffers have different lifetimes, sizes, and
alignment constraints, etc. On top of that, mixing buffered and O_DIRECT
can be painful.

Seems to me that we have an opportunity to provide something that sits
somewhere in between buffered and O_DIRECT, and this is where
RWF_UNCACHED enters the picture. If this flag is set on IO, we get the
following behavior:

- If the data is in cache, it remains in cache and the copy (in or out)
  is served to/from that.

- If the data is NOT in cache, we add it while performing the IO. When
  the IO is done, we remove it again.
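
To make the proposed semantics concrete from userspace, here is a
minimal sketch of issuing an uncached buffered read with preadv2(2).
It assumes the RWF_UNCACHED value proposed in this series (0x40) and
falls back to a plain buffered read on kernels that reject the flag;
it's illustration only, not part of the patchset:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* value proposed in this series */
#endif

int main(int argc, char **argv)
{
	char buf[64 * 1024];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	ssize_t ret;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Buffered read that won't leave newly instantiated pages cached */
	ret = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
	if (ret < 0 && (errno == EOPNOTSUPP || errno == EINVAL)) {
		/* kernel without RWF_UNCACHED: do a normal buffered read */
		ret = preadv2(fd, &iov, 1, 0, 0);
	}
	if (ret < 0)
		perror("preadv2");
	else
		printf("read %zd bytes\n", ret);

	close(fd);
	return 0;
}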

With this, I can do 100% smooth buffered reads or writes without pushing
the kernel to the state where kswapd is sweating bullets. In fact it
doesn't even register.
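
As a rough way to observe the drop-behind from userspace, mincore(2)
on a mapping of the range that was read reports how many pages are
resident without faulting anything in. A small helper sketch (again
just illustration, not part of the series):

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count resident page cache pages in the first 'len' bytes of 'fd' */
static long resident_pages(int fd, size_t len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t nr = (len + page_size - 1) / page_size;
	long resident = 0;
	unsigned char *vec;
	size_t i;
	void *map;

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return -1;
	vec = malloc(nr);
	if (vec) {
		/* mincore() reports residency without touching the pages */
		if (mincore(map, len, vec) == 0)
			for (i = 0; i < nr; i++)
				resident += vec[i] & 1;
		free(vec);
	}
	munmap(map, len);
	return resident;
}

Calling this before and after an RWF_UNCACHED read of a cold range
should show the count staying near zero, while a normal buffered read
of the same range leaves the pages resident.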

Comments appreciated! This should work on any standard file system,
using either the generic helpers or iomap. I have tested ext4 and xfs
for the right read/write behavior, but no further validation has been
done yet. Patches are against current git, and can also be found here:

https://git.kernel.dk/cgit/linux-block/log/?h=buffered-uncached

 fs/ceph/file.c          |  2 +-
 fs/dax.c                |  2 +-
 fs/ext4/file.c          |  2 +-
 fs/iomap/apply.c        | 26 ++++++++++-
 fs/iomap/buffered-io.c  | 54 ++++++++++++++++-------
 fs/iomap/direct-io.c    |  3 +-
 fs/iomap/fiemap.c       |  5 ++-
 fs/iomap/seek.c         |  6 ++-
 fs/iomap/swapfile.c     |  2 +-
 fs/nfs/file.c           |  2 +-
 include/linux/fs.h      |  7 ++-
 include/linux/iomap.h   | 10 ++++-
 include/uapi/linux/fs.h |  5 ++-
 mm/filemap.c            | 95 ++++++++++++++++++++++++++++++++++++-----
 14 files changed, 181 insertions(+), 40 deletions(-)

Changes since v2:
- Rework the write side according to Chinner's suggestions. Much cleaner
  this way. It does mean that we invalidate the full write region if just
  ONE page (or more) had to be created, where before it was more granular.
  I don't think that's a concern, and on the plus side, we now no longer
  have to chunk invalidations into 15/16 pages at a time.
- Cleanups

Changes since v1:
- Switch to pagevecs for write_drop_cached_pages()
- Use page_offset() instead of manual shift
- Ensure we hold a reference on the page between calling ->write_end()
  and checking the mapping on the locked page
- Fix XFS multi-page streamed writes, we'd drop the UNCACHED flag after
  the first page

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 1/5] fs: add read support for RWF_UNCACHED
  2019-12-11 15:29 [PATCHSET v3 0/5] Support for RWF_UNCACHED Jens Axboe
@ 2019-12-11 15:29 ` Jens Axboe
  2019-12-11 15:29 ` [PATCH 2/5] mm: make generic_perform_write() take a struct kiocb Jens Axboe
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-11 15:29 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block
  Cc: willy, clm, torvalds, david, Jens Axboe

If RWF_UNCACHED is set for io_uring (or preadv2(2)), we'll drop the
cache for buffered reads if we are the ones instantiating it. If the
data is already cached, we leave it cached.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h      |  3 +++
 include/uapi/linux/fs.h |  5 ++++-
 mm/filemap.c            | 46 ++++++++++++++++++++++++++++++++++++-----
 3 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98e0349adb52..092ea2a4319b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -314,6 +314,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_UNCACHED		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -3418,6 +3419,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 		ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
 	if (flags & RWF_APPEND)
 		ki->ki_flags |= IOCB_APPEND;
+	if (flags & RWF_UNCACHED)
+		ki->ki_flags |= IOCB_UNCACHED;
 	return 0;
 }
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 379a612f8f1d..357ebb0e0c5d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -299,8 +299,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* drop cache after reading or writing data */
+#define RWF_UNCACHED	((__force __kernel_rwf_t)0x00000040)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_UNCACHED)
 
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index bf6aa30be58d..ed23a11b3e34 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -933,8 +933,8 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 }
 EXPORT_SYMBOL(add_to_page_cache_locked);
 
-int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
-				pgoff_t offset, gfp_t gfp_mask)
+static int __add_to_page_cache(struct page *page, struct address_space *mapping,
+			       pgoff_t offset, gfp_t gfp_mask, bool lru)
 {
 	void *shadow = NULL;
 	int ret;
@@ -956,9 +956,17 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 		WARN_ON_ONCE(PageActive(page));
 		if (!(gfp_mask & __GFP_WRITE) && shadow)
 			workingset_refault(page, shadow);
-		lru_cache_add(page);
+		if (lru)
+			lru_cache_add(page);
 	}
 	return ret;
+
+}
+
+int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
+				pgoff_t offset, gfp_t gfp_mask)
+{
+	return __add_to_page_cache(page, mapping, offset, gfp_mask, true);
 }
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
@@ -2032,6 +2040,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	offset = *ppos & ~PAGE_MASK;
 
 	for (;;) {
+		bool drop_page = false;
 		struct page *page;
 		pgoff_t end_index;
 		loff_t isize;
@@ -2048,6 +2057,9 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		if (!page) {
 			if (iocb->ki_flags & IOCB_NOWAIT)
 				goto would_block;
+			/* UNCACHED implies no read-ahead */
+			if (iocb->ki_flags & IOCB_UNCACHED)
+				goto no_cached_page;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -2147,6 +2159,26 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		offset &= ~PAGE_MASK;
 		prev_offset = offset;
 
+		/*
+		 * If we're dropping this page due to drop-behind, then
+		 * lock it first. Ignore errors here, we can just leave it
+		 * in the page cache. Note that we didn't add this page to
+		 * the LRU when we added it to the page cache. So if we
+		 * fail to lock it or fail to remove it, add it to the LRU.
+		 */
+		if (drop_page) {
+			bool addlru = true;
+
+			if (!lock_page_killable(page)) {
+				if (page->mapping == mapping)
+					addlru = !remove_mapping(mapping, page);
+				else
+					addlru = false;
+				unlock_page(page);
+			}
+			if (addlru)
+				lru_cache_add(page);
+		}
 		put_page(page);
 		written += ret;
 		if (!iov_iter_count(iter))
@@ -2234,8 +2266,12 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			error = -ENOMEM;
 			goto out;
 		}
-		error = add_to_page_cache_lru(page, mapping, index,
-				mapping_gfp_constraint(mapping, GFP_KERNEL));
+		if (iocb->ki_flags & IOCB_UNCACHED)
+			drop_page = true;
+
+		error = __add_to_page_cache(page, mapping, index,
+				mapping_gfp_constraint(mapping, GFP_KERNEL),
+				!drop_page);
 		if (error) {
 			put_page(page);
 			if (error == -EEXIST) {
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 2/5] mm: make generic_perform_write() take a struct kiocb
  2019-12-11 15:29 [PATCHSET v3 0/5] Support for RWF_UNCACHED Jens Axboe
  2019-12-11 15:29 ` [PATCH 1/5] fs: add read support " Jens Axboe
@ 2019-12-11 15:29 ` Jens Axboe
  2019-12-11 15:29 ` [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED Jens Axboe
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-11 15:29 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block
  Cc: willy, clm, torvalds, david, Jens Axboe

Right now all callers pass in iocb->ki_pos, so just pass in the iocb. This
is in preparation for using the iocb flags in generic_perform_write().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/ceph/file.c     | 2 +-
 fs/ext4/file.c     | 2 +-
 fs/nfs/file.c      | 2 +-
 include/linux/fs.h | 3 ++-
 mm/filemap.c       | 8 +++++---
 5 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 11929d2bb594..096c009f188f 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1538,7 +1538,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		 * are pending vmtruncate. So write and vmtruncate
 		 * can not run at the same time
 		 */
-		written = generic_perform_write(file, from, pos);
+		written = generic_perform_write(file, from, iocb);
 		if (likely(written >= 0))
 			iocb->ki_pos = pos + written;
 		ceph_end_io_write(inode);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6a7293a5cda2..9ffb857765d5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -249,7 +249,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 		goto out;
 
 	current->backing_dev_info = inode_to_bdi(inode);
-	ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
+	ret = generic_perform_write(iocb->ki_filp, from, iocb);
 	current->backing_dev_info = NULL;
 
 out:
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 8eb731d9be3e..d8f51a702a4e 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -624,7 +624,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 	result = generic_write_checks(iocb, from);
 	if (result > 0) {
 		current->backing_dev_info = inode_to_bdi(inode);
-		result = generic_perform_write(file, from, iocb->ki_pos);
+		result = generic_perform_write(file, from, iocb);
 		current->backing_dev_info = NULL;
 	}
 	nfs_end_io_write(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 092ea2a4319b..bf58db1bc032 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3103,7 +3103,8 @@ extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_direct_write(struct kiocb *, struct iov_iter *);
-extern ssize_t generic_perform_write(struct file *, struct iov_iter *, loff_t);
+extern ssize_t generic_perform_write(struct file *, struct iov_iter *,
+				     struct kiocb *);
 
 ssize_t vfs_iter_read(struct file *file, struct iov_iter *iter, loff_t *ppos,
 		rwf_t flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index ed23a11b3e34..fe37bd2b2630 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3302,10 +3302,11 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 EXPORT_SYMBOL(grab_cache_page_write_begin);
 
 ssize_t generic_perform_write(struct file *file,
-				struct iov_iter *i, loff_t pos)
+				struct iov_iter *i, struct kiocb *iocb)
 {
 	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
+	loff_t pos = iocb->ki_pos;
 	long status = 0;
 	ssize_t written = 0;
 	unsigned int flags = 0;
@@ -3439,7 +3440,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
 			goto out;
 
-		status = generic_perform_write(file, from, pos = iocb->ki_pos);
+		pos = iocb->ki_pos;
+		status = generic_perform_write(file, from, iocb);
 		/*
 		 * If generic_perform_write() returned a synchronous error
 		 * then we want to return the number of bytes which were
@@ -3471,7 +3473,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			 */
 		}
 	} else {
-		written = generic_perform_write(file, from, iocb->ki_pos);
+		written = generic_perform_write(file, from, iocb);
 		if (likely(written > 0))
 			iocb->ki_pos += written;
 	}
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED
  2019-12-11 15:29 [PATCHSET v3 0/5] Support for RWF_UNCACHED Jens Axboe
  2019-12-11 15:29 ` [PATCH 1/5] fs: add read support " Jens Axboe
  2019-12-11 15:29 ` [PATCH 2/5] mm: make generic_perform_write() take a struct kiocb Jens Axboe
@ 2019-12-11 15:29 ` Jens Axboe
  2019-12-11 15:29 ` [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor Jens Axboe
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-11 15:29 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block
  Cc: willy, clm, torvalds, david, Jens Axboe

If RWF_UNCACHED is set for io_uring (or pwritev2(2)), we'll drop the
cache we instantiated for buffered writes. Pages that were already in
the page cache are left alone. This provides similar semantics to
reads with RWF_UNCACHED set.
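
For io_uring the flag rides along in the sqe's per-IO rw_flags field,
just like the pwritev2(2) flags argument. A minimal liburing sketch,
assuming the RWF_UNCACHED value from this series (illustration only,
not part of the patchset):

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* value proposed in this series */
#endif

int main(int argc, char **argv)
{
	static char buf[4096] = "hello, uncached write\n";
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;
	struct io_uring ring;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
	if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_writev(sqe, fd, &iov, 1, 0);
	sqe->rw_flags = RWF_UNCACHED;	/* per-IO flag, as with pwritev2 */

	io_uring_submit(&ring);
	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		printf("write returned %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}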

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h |  1 +
 mm/filemap.c       | 41 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index bf58db1bc032..5ea5fc167524 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -285,6 +285,7 @@ enum positive_aop_returns {
 #define AOP_FLAG_NOFS			0x0002 /* used by filesystem to direct
 						* helper code (eg buffer layer)
 						* to clear GFP_FS from alloc */
+#define AOP_FLAG_UNCACHED		0x0004
 
 /*
  * oh the beauties of C type declarations.
diff --git a/mm/filemap.c b/mm/filemap.c
index fe37bd2b2630..4dadd1a4ca7c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3287,10 +3287,12 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 					pgoff_t index, unsigned flags)
 {
 	struct page *page;
-	int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT;
+	int fgp_flags = FGP_LOCK|FGP_WRITE;
 
 	if (flags & AOP_FLAG_NOFS)
 		fgp_flags |= FGP_NOFS;
+	if (!(flags & AOP_FLAG_UNCACHED))
+		fgp_flags |= FGP_CREAT;
 
 	page = pagecache_get_page(mapping, index, fgp_flags,
 			mapping_gfp_mask(mapping));
@@ -3311,6 +3313,9 @@ ssize_t generic_perform_write(struct file *file,
 	ssize_t written = 0;
 	unsigned int flags = 0;
 
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		flags |= AOP_FLAG_UNCACHED;
+
 	do {
 		struct page *page;
 		unsigned long offset;	/* Offset into pagecache page */
@@ -3343,10 +3348,16 @@ ssize_t generic_perform_write(struct file *file,
 			break;
 		}
 
+retry:
 		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
-		if (unlikely(status < 0))
+		if (unlikely(status < 0)) {
+			if (status == -ENOMEM && (flags & AOP_FLAG_UNCACHED)) {
+				flags &= ~AOP_FLAG_UNCACHED;
+				goto retry;
+			}
 			break;
+		}
 
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
@@ -3382,6 +3393,32 @@ ssize_t generic_perform_write(struct file *file,
 		balance_dirty_pages_ratelimited(mapping);
 	} while (iov_iter_count(i));
 
+	if (written && (iocb->ki_flags & IOCB_UNCACHED)) {
+		loff_t end;
+
+		pos = iocb->ki_pos;
+		end = pos + written;
+
+		status = filemap_write_and_wait_range(mapping, pos, end);
+		if (status)
+			goto out;
+
+		/*
+		 * No pages were created for this range, we're done
+		 */
+		if (flags & AOP_FLAG_UNCACHED)
+			 goto out;
+
+		/*
+		 * Try to invalidate cache pages for the range we just wrote.
+		 * We don't care if invalidation fails as the write has still
+		 * worked and leaving clean uptodate pages in the page cache
+		 * isn't a corruption vector for uncached IO.
+		 */
+		 invalidate_inode_pages2_range(mapping,
+					pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
+	}
+out:
 	return written ? written : status;
 }
 EXPORT_SYMBOL(generic_perform_write);
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor
  2019-12-11 15:29 [PATCHSET v3 0/5] Support for RWF_UNCACHED Jens Axboe
                   ` (2 preceding siblings ...)
  2019-12-11 15:29 ` [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED Jens Axboe
@ 2019-12-11 15:29 ` Jens Axboe
  2019-12-11 17:19     ` Linus Torvalds
  2019-12-11 15:29 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 67+ messages in thread
From: Jens Axboe @ 2019-12-11 15:29 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block
  Cc: willy, clm, torvalds, david, Jens Axboe

This is in preparation for passing in a flag to the iomap_actor, which
currently doesn't support that.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/dax.c               |  2 +-
 fs/iomap/apply.c       |  2 +-
 fs/iomap/buffered-io.c | 17 ++++++++++-------
 fs/iomap/direct-io.c   |  3 ++-
 fs/iomap/fiemap.c      |  5 +++--
 fs/iomap/seek.c        |  6 ++++--
 fs/iomap/swapfile.c    |  2 +-
 include/linux/iomap.h  |  5 +++--
 8 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1f1f0201cad1..30a20b994140 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1091,7 +1091,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct block_device *bdev = iomap->bdev;
 	struct dax_device *dax_dev = iomap->dax_dev;
diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 76925b40b5fd..562536da8a13 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -77,7 +77,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	 * iomap into the actors so that they don't need to have special
 	 * handling for the two cases.
 	 */
-	written = actor(inode, pos, length, data, &iomap,
+	written = actor(inode, pos, length, data, flags, &iomap,
 			srcmap.type != IOMAP_HOLE ? &srcmap : &iomap);
 
 	/*
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 828444e14d09..9b5b770ca4c7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -249,7 +249,7 @@ static inline bool iomap_block_needs_zeroing(struct inode *inode,
 
 static loff_t
 iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
 	struct page *page = ctx->cur_page;
@@ -397,7 +397,8 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
 
 static loff_t
 iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
 	loff_t done, ret;
@@ -417,7 +418,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 			ctx->cur_page_in_bio = false;
 		}
 		ret = iomap_readpage_actor(inode, pos + done, length - done,
-				ctx, iomap, srcmap);
+				ctx, 0, iomap, srcmap);
 	}
 
 	return done;
@@ -797,7 +798,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
 
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iov_iter *i = data;
 	long status = 0;
@@ -897,7 +898,7 @@ EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
 static loff_t
 iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	long status = 0;
 	ssize_t written = 0;
@@ -983,7 +984,8 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
 
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	bool *did_zero = data;
 	loff_t written = 0;
@@ -1053,7 +1055,8 @@ EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
 static loff_t
 iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct page *page = data;
 	int ret;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 23837926c0c5..2525997b09aa 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -365,7 +365,8 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 
 static loff_t
 iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct iomap_dio *dio = data;
 
diff --git a/fs/iomap/fiemap.c b/fs/iomap/fiemap.c
index bccf305ea9ce..04de960259d0 100644
--- a/fs/iomap/fiemap.c
+++ b/fs/iomap/fiemap.c
@@ -44,7 +44,7 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
 
 static loff_t
 iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap, struct iomap *srcmap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct fiemap_ctx *ctx = data;
 	loff_t ret = length;
@@ -111,7 +111,8 @@ EXPORT_SYMBOL_GPL(iomap_fiemap);
 
 static loff_t
 iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap, struct iomap *srcmap)
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	sector_t *bno = data, addr;
 
diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
index 89f61d93c0bc..a5cbf04e8cb3 100644
--- a/fs/iomap/seek.c
+++ b/fs/iomap/seek.c
@@ -119,7 +119,8 @@ page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
 
 static loff_t
 iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap, struct iomap *srcmap)
+		      void *data, unsigned flags, struct iomap *iomap,
+		      struct iomap *srcmap)
 {
 	switch (iomap->type) {
 	case IOMAP_UNWRITTEN:
@@ -165,7 +166,8 @@ EXPORT_SYMBOL_GPL(iomap_seek_hole);
 
 static loff_t
 iomap_seek_data_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap, struct iomap *srcmap)
+		      void *data, unsigned flags, struct iomap *iomap,
+		      struct iomap *srcmap)
 {
 	switch (iomap->type) {
 	case IOMAP_HOLE:
diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index a648dbf6991e..774bfc3e59e1 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -76,7 +76,7 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi)
  * distinction between written and unwritten extents.
  */
 static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
-		loff_t count, void *data, struct iomap *iomap,
+		loff_t count, void *data, unsigned flags, struct iomap *iomap,
 		struct iomap *srcmap)
 {
 	struct iomap_swapfile_info *isi = data;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b09463dae0d..61fcaa3904d4 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -113,7 +113,7 @@ struct iomap_page_ops {
 };
 
 /*
- * Flags for iomap_begin / iomap_end.  No flag implies a read.
+ * Flags for iomap_begin / iomap_end / actor.  No flag implies a read.
  */
 #define IOMAP_WRITE		(1 << 0) /* writing, must allocate blocks */
 #define IOMAP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
@@ -146,7 +146,8 @@ struct iomap_ops {
  * Main iomap iterator function.
  */
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
-		void *data, struct iomap *iomap, struct iomap *srcmap);
+		void *data, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap);
 
 loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 		unsigned flags, const struct iomap_ops *ops, void *data,
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-11 15:29 [PATCHSET v3 0/5] Support for RWF_UNCACHED Jens Axboe
                   ` (3 preceding siblings ...)
  2019-12-11 15:29 ` [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor Jens Axboe
@ 2019-12-11 15:29 ` Jens Axboe
  2019-12-11 17:19   ` Matthew Wilcox
  2019-12-12 22:34   ` Dave Chinner
  2019-12-11 17:37   ` Linus Torvalds
  2019-12-12 10:44 ` Martin Steigerwald
  6 siblings, 2 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-11 15:29 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-block
  Cc: willy, clm, torvalds, david, Jens Axboe

This adds support for RWF_UNCACHED for file systems using iomap to
perform buffered writes. We use the generic infrastructure for this,
by tracking pages we created and calling write_drop_cached_pages()
to issue writeback and prune those pages.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
 fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++--------
 include/linux/iomap.h  |  5 +++++
 3 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 562536da8a13..966826ad4bb9 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -90,5 +90,29 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 				     flags, &iomap);
 	}
 
+	if (written && (flags & IOMAP_UNCACHED)) {
+		struct address_space *mapping = inode->i_mapping;
+
+		end = pos + written;
+		ret = filemap_write_and_wait_range(mapping, pos, end);
+		if (ret)
+			goto out;
+
+		/*
+		 * No pages were created for this range, we're done
+		 */
+		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
+			goto out;
+
+		/*
+		 * Try to invalidate cache pages for the range we just wrote.
+		 * We don't care if invalidation fails as the write has still
+		 * worked and leaving clean uptodate pages in the page cache
+		 * isn't a corruption vector for uncached IO.
+		 */
+		invalidate_inode_pages2_range(mapping,
+				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
+	}
+out:
 	return written ? written : ret;
 }
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9b5b770ca4c7..09440f114506 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -566,6 +566,7 @@ EXPORT_SYMBOL_GPL(iomap_migrate_page);
 
 enum {
 	IOMAP_WRITE_F_UNSHARE		= (1 << 0),
+	IOMAP_WRITE_F_UNCACHED		= (1 << 1),
 };
 
 static void
@@ -643,6 +644,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
 {
 	const struct iomap_page_ops *page_ops = iomap->page_ops;
+	unsigned aop_flags;
 	struct page *page;
 	int status = 0;
 
@@ -659,8 +661,11 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 			return status;
 	}
 
+	aop_flags = AOP_FLAG_NOFS;
+	if (flags & IOMAP_UNCACHED)
+		aop_flags |= AOP_FLAG_UNCACHED;
 	page = grab_cache_page_write_begin(inode->i_mapping, pos >> PAGE_SHIFT,
-			AOP_FLAG_NOFS);
+						aop_flags);
 	if (!page) {
 		status = -ENOMEM;
 		goto out_no_page;
@@ -670,9 +675,14 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 		iomap_read_inline_data(inode, page, srcmap);
 	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
-	else
-		status = __iomap_write_begin(inode, pos, len, flags, page,
+	else {
+		unsigned wb_flags = 0;
+
+		if (flags & IOMAP_UNCACHED)
+			wb_flags = IOMAP_WRITE_F_UNCACHED;
+		status = __iomap_write_begin(inode, pos, len, wb_flags, page,
 				srcmap);
+	}
 
 	if (unlikely(status))
 		goto out_unlock;
@@ -832,10 +842,17 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			break;
 		}
 
-		status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap,
-				srcmap);
-		if (unlikely(status))
+retry:
+		status = iomap_write_begin(inode, pos, bytes, flags, &page,
+						iomap, srcmap);
+		if (unlikely(status)) {
+			if (status == -ENOMEM && (flags & IOMAP_UNCACHED)) {
+				iomap->flags |= IOMAP_F_PAGE_CREATE;
+				flags &= ~IOMAP_UNCACHED;
+				goto retry;
+			}
 			break;
+		}
 
 		if (mapping_writably_mapped(inode->i_mapping))
 			flush_dcache_page(page);
@@ -882,10 +899,14 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
 {
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
 	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
+	unsigned flags = IOMAP_WRITE;
+
+	if (iocb->ki_flags & IOCB_UNCACHED)
+		flags |= IOMAP_UNCACHED;
 
 	while (iov_iter_count(iter)) {
-		ret = iomap_apply(inode, pos, iov_iter_count(iter),
-				IOMAP_WRITE, ops, iter, iomap_write_actor);
+		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags,
+					ops, iter, iomap_write_actor);
 		if (ret <= 0)
 			break;
 		pos += ret;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 61fcaa3904d4..b5b5cf781eea 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -48,12 +48,16 @@ struct vm_fault;
  *
  * IOMAP_F_BUFFER_HEAD indicates that the file system requires the use of
  * buffer heads for this mapping.
+ *
+ * IOMAP_F_PAGE_CREATE indicates that pages had to be allocated to satisfy
+ * this operation.
  */
 #define IOMAP_F_NEW		0x01
 #define IOMAP_F_DIRTY		0x02
 #define IOMAP_F_SHARED		0x04
 #define IOMAP_F_MERGED		0x08
 #define IOMAP_F_BUFFER_HEAD	0x10
+#define IOMAP_F_PAGE_CREATE	0x20
 
 /*
  * Flags set by the core iomap code during operations:
@@ -121,6 +125,7 @@ struct iomap_page_ops {
 #define IOMAP_FAULT		(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT		(1 << 4) /* direct I/O */
 #define IOMAP_NOWAIT		(1 << 5) /* do not block */
+#define IOMAP_UNCACHED		(1 << 6)
 
 struct iomap_ops {
 	/*
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-11 15:29 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
@ 2019-12-11 17:19   ` Matthew Wilcox
  2019-12-11 18:05     ` Jens Axboe
  2019-12-12 22:34   ` Dave Chinner
  1 sibling, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2019-12-11 17:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block, clm, torvalds, david

On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
> @@ -670,9 +675,14 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  		iomap_read_inline_data(inode, page, srcmap);
>  	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>  		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
> -	else
> -		status = __iomap_write_begin(inode, pos, len, flags, page,
> +	else {
> +		unsigned wb_flags = 0;
> +
> +		if (flags & IOMAP_UNCACHED)
> +			wb_flags = IOMAP_WRITE_F_UNCACHED;
> +		status = __iomap_write_begin(inode, pos, len, wb_flags, page,
>  				srcmap);

I think if you do an uncached write to a currently shared extent, you
just lost the IOMAP_WRITE_F_UNSHARE flag?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor
  2019-12-11 15:29 ` [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor Jens Axboe
@ 2019-12-11 17:19     ` Linus Torvalds
  0 siblings, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-12-11 17:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner

On Wed, Dec 11, 2019 at 7:29 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> This is in preparation for passing in a flag to the iomap_actor, which
> currently doesn't support that.

This really looks like we should use a struct for passing the arguments, no?

Now on 64-bit, the iomap_actor() has seven arguments, which
already means that it's passing some of them on the stack on most
architectures.

On 32-bit, it's even worse, because two of the arguments are "loff_t",
which means that they are 2 words each, so you have 9 words of
arguments. I don't know a single architecture that does register
passing for things like that.

If you were to change the calling convention _first_ to do a "struct
iomap_actor" or whatever, then adding the "flags" field would be a
trivial addition.
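
As a rough illustration of the calling convention being suggested
here (names invented for illustration, not something in this series),
the per-call state could be bundled into one struct:

struct iomap_actor_args {
	struct inode	*inode;
	loff_t		pos;
	loff_t		length;
	void		*data;
	unsigned	flags;
	struct iomap	*iomap;
	struct iomap	*srcmap;
};

typedef loff_t (*iomap_actor_t)(struct iomap_actor_args *args);

Each actor then takes a single pointer, and adding a field such as
"flags" later doesn't have to touch every actor's signature.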

               Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 15:29 [PATCHSET v3 0/5] Support for RWF_UNCACHED Jens Axboe
@ 2019-12-11 17:37   ` Linus Torvalds
  2019-12-11 15:29 ` [PATCH 2/5] mm: make generic_perform_write() take a struct kiocb Jens Axboe
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-12-11 17:37 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner

On Wed, Dec 11, 2019 at 7:29 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Comments appreciated! This should work on any standard file system,
> using either the generic helpers or iomap. I have tested ext4 and xfs
> for the right read/write behavior, but no further validation has been
> done yet. Patches are against current git, and can also be found here:

I don't think this is conceptually wrong, but the implementation smells a bit.

I commented on the trivial part (the horrendous argument list to
iomap_actor), but I wonder how much of the explicit invalidation is
actually needed?

Because active invalidation really is a horrible horrible thing to do.
It immediately means that you can't use this interface for normal
everyday things that may actually cache perfectly fine.

What happens if you simply never _activate_ the page? Then they should
get invalidated on their own, without impacting any other load - but
only when there is _some_ memory pressure. They'll just stay on the
inactive lru list, and get re-used quickly.

Note that there are two ways to activate a page: the "accessed while
on the inactive list" will activate it, but these days we also have a
"pre-activate" path in the workingset code (see workingset_refault()).

Even if you might also want an explicit invalidate path, I would like
to hear what it looks like if you instead of - or in addition to -
invalidating, have a "don't activate" flag.

We don't have all _that_ many places where we activate pages, and they
should be easy to find (just grep for "SetPageActive()"), although the
call chain may make it a bit painful to add a "don't do it for this
access" kind of things.

But I think most of the regular IO call chains come through
"mark_page_accessed()". So _that_ is the part you want to avoid (and
maybe the workingset code). And that should be fairly straightforward,
I think.
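
Purely as a sketch of that kind of per-access opt-out (IOCB_NOACCESS
is an invented name here, not something in this series), the buffered
read path could gate the activation call roughly like this:

	/* Hypothetical: let the caller opt out of LRU activation */
	if (!(iocb->ki_flags & IOCB_NOACCESS))
		mark_page_accessed(page);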

In fact, that you say that just a pure random read case causes lots of
kswapd activity makes me think that maybe we've screwed up page
activation in general, and never noticed (because if you have enough
memory, you don't really see it that often)? So this might not be an
io_ring issue, but an issue in general.

                     Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 17:37   ` Linus Torvalds
  (?)
@ 2019-12-11 17:56   ` Jens Axboe
  2019-12-11 19:14       ` Linus Torvalds
  2019-12-11 19:34     ` Jens Axboe
  -1 siblings, 2 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-11 17:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner

On 12/11/19 10:37 AM, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 7:29 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Comments appreciated! This should work on any standard file system,
>> using either the generic helpers or iomap. I have tested ext4 and xfs
>> for the right read/write behavior, but no further validation has been
>> done yet. Patches are against current git, and can also be found here:
> 
> I don't think this is conceptually wrong, but the implementation
> smells a bit.
> 
> I commented on the trivial part (the horrendous argument list to
> iomap_actor), but I wonder how much of the explicit invalidation is
> actually needed?

Agree on the other email on that part, if we continue on this path, then
I'll clean that up and shove the arguments in an actor struct.

> Because active invalidation really is a horrible horrible thing to do.
> It immediately means that you can't use this interface for normal
> everyday things that may actually cache perfectly fine.
> 
> What happens if you simply never _activate_ the page? Then they should
> get invalidated on their own, without impacting any other load - but
> only when there is _some_ memory pressure. They'll just stay on the
> inactive lru list, and get re-used quickly.
> 
> Note that there are two ways to activate a page: the "accessed while
> on the inactive list" will activate it, but these days we also have a
> "pre-activate" path in the workingset code (see workingset_refault()).
> 
> Even if you might also want an explicit invalidate path, I would like
> to hear what it looks like if you instead of - or in addition to -
> invalidating, have a "don't activate" flag.
> 
> We don't have all _that_ many places where we activate pages, and they
> should be easy to find (just grep for "SetPageActive()"), although the
> call chain may make it a bit painful to add a "don't do it for this
> access" kind of things.
> 
> But I think most of the regular IO call chains come through
> "mark_page_accessed()". So _that_ is the part you want to avoid (and
> maybe the workingset code). And that should be fairly straightforward,
> I think.

Sure, I can give that a go and see how that behaves.

> In fact, that you say that just a pure random read case causes lots of
> kswapd activity makes me think that maybe we've screwed up page
> activation in general, and never noticed (because if you have enough
> memory, you don't really see it that often)? So this might not be an
> io_ring issue, but an issue in general.

This is very much not an io_uring issue, you can see exactly the same
kind of behavior with normal buffered reads or mmap'ed IO. I do wonder
if streamed reads are as bad in terms of making kswapd go crazy, I
forget if I tested that explicitly as well.

I'll run some streamed and random read testing on both and see how they
behave, then report back.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-11 17:19   ` Matthew Wilcox
@ 2019-12-11 18:05     ` Jens Axboe
  0 siblings, 0 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-11 18:05 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, linux-block, clm, torvalds, david

On 12/11/19 10:19 AM, Matthew Wilcox wrote:
> On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
>> @@ -670,9 +675,14 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>>  		iomap_read_inline_data(inode, page, srcmap);
>>  	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>>  		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
>> -	else
>> -		status = __iomap_write_begin(inode, pos, len, flags, page,
>> +	else {
>> +		unsigned wb_flags = 0;
>> +
>> +		if (flags & IOMAP_UNCACHED)
>> +			wb_flags = IOMAP_WRITE_F_UNCACHED;
>> +		status = __iomap_write_begin(inode, pos, len, wb_flags, page,
>>  				srcmap);
> 
> I think if you do an uncached write to a currently shared extent, you
> just lost the IOMAP_WRITE_F_UNSHARE flag?

Oops good catch, I'll fix that up.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 17:56   ` Jens Axboe
@ 2019-12-11 19:14       ` Linus Torvalds
  2019-12-11 19:34     ` Jens Axboe
  1 sibling, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-12-11 19:14 UTC (permalink / raw)
  To: Jens Axboe, Johannes Weiner
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner

[ Adding Johannes Weiner to the cc, I think he's looked at the working
set and the inactive/active LRU lists the most ]

On Wed, Dec 11, 2019 at 9:56 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> > In fact, that you say that just a pure random read case causes lots of
> > kswapd activity makes me think that maybe we've screwed up page
> > activation in general, and never noticed (because if you have enough
> > memory, you don't really see it that often)? So this might not be an
> > io_ring issue, but an issue in general.
>
> This is very much not an io_uring issue, you can see exactly the same
> kind of behavior with normal buffered reads or mmap'ed IO. I do wonder
> if streamed reads are as bad in terms of making kswapd go crazy, I
> forget if I tested that explicitly as well.

We definitely used to have people test things like "read the same
much-bigger-than-memory file over and over", and it wasn't supposed to
be all _that_ painful, because the pages never activated, and they got
moved out of the cache quickly and didn't disturb other activities
(other than the constant IO, of course, which can be a big deal in
itself).

But maybe that was just the streaming case. With read-around and
random accesses, maybe we end up activating too much (and maybe we
always did).

But I wouldn't be surprised if we've lost that as people went from
having 16-32MB to having that many GB instead - simply because a lot
of loads are basically entirely cached, and the few things that are
not tend to be explicitly uncached (ie O_DIRECT etc).

I think the workingset changes actually were maybe kind of related to
this - the inactive list can become too small to ever give people time
to do a good job of picking the _right_ thing to activate.

So this might either be the reverse situation - maybe we let the
inactive list grow too large, and then even a big random load will
activate pages that really shouldn't be activated? Or it might be
related to the workingset issue in that we've activated pages too
eagerly and not ever moved things back to the inactive list (which
then in some situations causes the inactive list to be very small).

Who knows. But this is definitely an area that I suspect hasn't gotten
all that much attention simply because memory has become much more
plentiful, and a lot of regular loads basically have enough memory
that almost all IO is cached anyway, and the old "you needed to be
more clever about balancing swap/inactive/active even under normal
loads" thing may have gone away a bit.

These days, even if you do somewhat badly in that balancing act, a lot
of loads probably won't notice that much. Either there is still so
much memory for caching that the added IO isn't really ever dominant,
or you had such a big load to begin with that it was long since
rewritten to use O_DIRECT.

            Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 17:56   ` Jens Axboe
  2019-12-11 19:14       ` Linus Torvalds
@ 2019-12-11 19:34     ` Jens Axboe
  2019-12-11 20:03         ` Linus Torvalds
  2019-12-11 20:04       ` Jens Axboe
  1 sibling, 2 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-11 19:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On 12/11/19 10:56 AM, Jens Axboe wrote:
>> But I think most of the regular IO call chains come through
>> "mark_page_accessed()". So _that_ is the part you want to avoid (and
>> maybe the workingset code). And that should be fairly straightforward,
>> I think.
> 
> Sure, I can give that a go and see how that behaves.

Before doing that, I ran a streamed read test instead of just random
reads, and the behavior is roughly the same. kswapd consumes a bit less
CPU, but it's still very active once the page cache has been filled. For
specifics on the setup, I deliberately boot the box with 32G of RAM, and
the dataset is 320G. My initial tests were with 1 320G file, but
Johannes complained about that so I went to 32 10G files instead. That's
what I'm currently using.

For the random test case, top of profile for kswapd is:

+   33.49%  kswapd0  [kernel.vmlinux]  [k] xas_create
+    7.93%  kswapd0  [kernel.vmlinux]  [k] __isolate_lru_page
+    7.18%  kswapd0  [kernel.vmlinux]  [k] unlock_page
+    5.90%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
+    5.64%  kswapd0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
+    5.57%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
+    3.48%  kswapd0  [kernel.vmlinux]  [k] __remove_mapping
+    3.35%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages
+    3.14%  kswapd0  [kernel.vmlinux]  [k] __delete_from_page_cache

Next I ran with NOT calling mark_page_accessed() to see if that makes a
difference. See patch below, I just applied this on top of this patchset
and added a new RWF_NOACCESS flag for it for ease of testing. I verified
that we are indeed skipping the mark_page_accessed() call in
generic_file_buffered_read().

I can't tell a difference in the results, there's no discernable
difference between NOT calling mark_page_accessed() or calling it.
Behavior seems about the same, in terms of pre and post page cache full,
and kswapd still churns a lot once the page cache is filled up.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 19:34     ` Jens Axboe
@ 2019-12-11 20:03         ` Linus Torvalds
  2019-12-11 20:04       ` Jens Axboe
  1 sibling, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-12-11 20:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On Wed, Dec 11, 2019 at 11:34 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> I can't tell a difference in the results, there's no discernable
> difference between NOT calling mark_page_accessed() or calling it.
> Behavior seems about the same, in terms of pre and post page cache full,
> and kswapd still churns a lot once the page cache is filled up.

Yeah, that sounds like a bug. I'm sure the RWF_UNCACHED flag fixes it
when you do the IO that way, but it seems to be a bug regardless.

Does /proc/meminfo have everything inactive for file data (ie the
"Active(file)" line is basically zero)?

Maybe pages got activated other ways (eg a problem with the workingset
code)? You said "See patch below", but there wasn't any.

That said, it's also entirely possible that even with everything in
the inactive list, we might try to shrink other things first for
whatever odd reason..

The fact that you see xas_create() so prominently would imply
perhaps add_to_swap_cache(), which certainly implies that the page
shrinking isn't hitting the file pages...

               Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 19:34     ` Jens Axboe
  2019-12-11 20:03         ` Linus Torvalds
@ 2019-12-11 20:04       ` Jens Axboe
  1 sibling, 0 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-11 20:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On 12/11/19 12:34 PM, Jens Axboe wrote:
> On 12/11/19 10:56 AM, Jens Axboe wrote:
>>> But I think most of the regular IO call chains come through
>>> "mark_page_accessed()". So _that_ is the part you want to avoid (and
>>> maybe the workingset code). And that should be fairly straightforward,
>>> I think.
>>
>> Sure, I can give that a go and see how that behaves.
> 
> Before doing that, I ran a streamed read test instead of just random
> reads, and the behavior is roughly the same. kswapd consumes a bit less
> CPU, but it's still very active once the page cache has been filled. For
> specifics on the setup, I deliberately boot the box with 32G of RAM, and
> the dataset is 320G. My initial tests were with 1 320G file, but
> Johannes complained about that so I went to 32 10G files instead. That's
> what I'm currently using.
> 
> For the random test case, top of profile for kswapd is:
> 
> +   33.49%  kswapd0  [kernel.vmlinux]  [k] xas_create
> +    7.93%  kswapd0  [kernel.vmlinux]  [k] __isolate_lru_page
> +    7.18%  kswapd0  [kernel.vmlinux]  [k] unlock_page
> +    5.90%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
> +    5.64%  kswapd0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
> +    5.57%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
> +    3.48%  kswapd0  [kernel.vmlinux]  [k] __remove_mapping
> +    3.35%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages
> +    3.14%  kswapd0  [kernel.vmlinux]  [k] __delete_from_page_cache

Here's the profile for the !mark_page_accessed() run, looks very much
the same:

+   32.84%  kswapd0  [kernel.vmlinux]  [k] xas_create
+    8.05%  kswapd0  [kernel.vmlinux]  [k] unlock_page
+    7.68%  kswapd0  [kernel.vmlinux]  [k] __isolate_lru_page
+    6.08%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
+    5.96%  kswapd0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
+    5.56%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
+    4.02%  kswapd0  [kernel.vmlinux]  [k] __remove_mapping
+    3.70%  kswapd0  [kernel.vmlinux]  [k] __delete_from_page_cache
+    3.55%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 20:03         ` Linus Torvalds
  (?)
@ 2019-12-11 20:08         ` Jens Axboe
  2019-12-11 20:18             ` Linus Torvalds
  2019-12-11 20:43           ` Matthew Wilcox
  -1 siblings, 2 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-11 20:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On 12/11/19 1:03 PM, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 11:34 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> I can't tell a difference in the results; there's no discernible
>> difference between NOT calling mark_page_accessed() and calling it.
>> Behavior seems about the same, in terms of pre and post page cache full,
>> and kswapd still churns a lot once the page cache is filled up.
> 
> Yeah, that sounds like a bug. I'm sure the RWF_UNCACHED flag fixes it
> when you do the IO that way, but it seems to be a bug regardless.

Hard to disagree with that.

> Does /proc/meminfo have everything inactive for file data (ie the
> "Active(file)" line is basically zero?).

$ cat /proc/meminfo | grep -i active
Active:           134136 kB
Inactive:       28683916 kB
Active(anon):      97064 kB
Inactive(anon):        4 kB
Active(file):      37072 kB
Inactive(file): 28683912 kB

This is after a run with RWF_NOACCESS.

> Maybe pages got activated other ways (eg a problem with the workingset
> code)? You said "See patch below", but there wasn't any.

Oops, now below.

> 
> That said, it's also entirely possible that even with everything in
> the inactive list, we might try to shrink other things first for
> whatever odd reason..
> 
> The fact that you see that xas_create() so prominently would imply
> perhaps add_to_swap_cache(), which certainly implies that the page
> shrinking isn't hitting the file pages...

That's presumably misleading, as it's just lookups. But yes,
confusing...

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5ea5fc167524..b2ecc66f5bd5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -316,6 +316,7 @@ enum rw_hint {
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
 #define IOCB_UNCACHED		(1 << 8)
+#define IOCB_NOACCESS		(1 << 9)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -3423,6 +3424,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 		ki->ki_flags |= IOCB_APPEND;
 	if (flags & RWF_UNCACHED)
 		ki->ki_flags |= IOCB_UNCACHED;
+	if (flags & RWF_NOACCESS)
+		ki->ki_flags |= IOCB_NOACCESS;
 	return 0;
 }
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 357ebb0e0c5d..f20f0048d5c5 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -302,8 +302,10 @@ typedef int __bitwise __kernel_rwf_t;
 /* drop cache after reading or writing data */
 #define RWF_UNCACHED	((__force __kernel_rwf_t)0x00000040)
 
+#define RWF_NOACCESS	((__force __kernel_rwf_t)0x00000080)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND | RWF_UNCACHED)
+			 RWF_APPEND | RWF_UNCACHED | RWF_NOACCESS)
 
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index 4dadd1a4ca7c..c37b0e221a8a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2058,7 +2058,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			if (iocb->ki_flags & IOCB_NOWAIT)
 				goto would_block;
 			/* UNCACHED implies no read-ahead */
-			if (iocb->ki_flags & IOCB_UNCACHED)
+			if (iocb->ki_flags & (IOCB_UNCACHED|IOCB_NOACCESS))
 				goto no_cached_page;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
@@ -2144,7 +2144,8 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		 * When a sequential read accesses a page several times,
 		 * only mark it as accessed the first time.
 		 */
-		if (prev_index != index || offset != prev_offset)
+		if ((prev_index != index || offset != prev_offset) &&
+		    !(iocb->ki_flags & IOCB_NOACCESS))
 			mark_page_accessed(page);
 		prev_index = index;
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 20:08         ` Jens Axboe
@ 2019-12-11 20:18             ` Linus Torvalds
  2019-12-11 20:43           ` Matthew Wilcox
  1 sibling, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-12-11 20:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> $ cat /proc/meminfo | grep -i active
> Active:           134136 kB
> Inactive:       28683916 kB
> Active(anon):      97064 kB
> Inactive(anon):        4 kB
> Active(file):      37072 kB
> Inactive(file): 28683912 kB

Yeah, that should not put pressure on some swap activity. We have 28
GB of basically free inactive file data, and the VM is doing something
very very bad if it then doesn't just quickly free it with no real
drama.

In fact, I don't think it should even trigger kswapd at all, it should
all be direct reclaim. Of course, some of the mm people hate that with
a passion, but this does look like a prime example of why it should
just be done.

MM people - mind giving this a look?  Jens, if you have that NOACCESS
flag in a git tree too and a trivial way to recreate your load, that
would be good for people to be able to just try things out.

                     Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 20:08         ` Jens Axboe
  2019-12-11 20:18             ` Linus Torvalds
@ 2019-12-11 20:43           ` Matthew Wilcox
  1 sibling, 0 replies; 67+ messages in thread
From: Matthew Wilcox @ 2019-12-11 20:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Linux-MM, linux-fsdevel, linux-block,
	Chris Mason, Dave Chinner, Johannes Weiner

On Wed, Dec 11, 2019 at 01:08:32PM -0700, Jens Axboe wrote:
> > The fact that you see that xas_create() so prominently would imply
> > perhaps add_to_swap_cache(), which certainly implies that the page
> > shrinking isn't hitting the file pages...
> 
> That's presumably misleading, as it's just lookups. But yes,
> confusing...

While xas_create() could be called directly, it's more often called
through xas_store() ... which would be called if we're storing a shadow
entry to replace a page, which this workload would presumably be doing.
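
A simplified sketch of that path, modeled loosely on page_cache_delete() in
mm/filemap.c (locking, huge page handling and most accounting omitted;
illustrative only, not the actual kernel code):

/*
 * Illustrative sketch only, not the real mm/filemap.c code: roughly what
 * replacing a page cache page with a shadow entry looks like through the
 * XArray API. Locking, huge pages and most accounting are omitted.
 */
#include <linux/pagemap.h>
#include <linux/xarray.h>

static void replace_page_with_shadow(struct address_space *mapping,
				     struct page *page, void *shadow)
{
	XA_STATE(xas, &mapping->i_pages, page->index);

	/*
	 * Storing a non-NULL entry makes xas_store() walk down to the
	 * correct node first (xas_create()/xas_descend()), which is why
	 * xas_create() shows up so high in the kswapd profile.
	 */
	xas_store(&xas, shadow);
	xas_init_marks(&xas);

	page->mapping = NULL;
	mapping->nrpages--;
}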

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 20:18             ` Linus Torvalds
  (?)
@ 2019-12-11 21:04             ` Johannes Weiner
  2019-12-12  1:30               ` Jens Axboe
  -1 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2019-12-11 21:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner

On Wed, Dec 11, 2019 at 12:18:38PM -0800, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > $ cat /proc/meminfo | grep -i active
> > Active:           134136 kB
> > Inactive:       28683916 kB
> > Active(anon):      97064 kB
> > Inactive(anon):        4 kB
> > Active(file):      37072 kB
> > Inactive(file): 28683912 kB
> 
> Yeah, that should not put pressure on some swap activity. We have 28
> GB of basically free inactive file data, and the VM is doing something
> very very bad if it then doesn't just quickly free it with no real
> drama.

I was looking at this with Jens offline last week. One thing to note
is the rate of IO that Jens is working with: combined with the low
cache hit rate, it was pushing upwards of half a million pages through
the page cache each second.

There isn't anything obvious sticking out in the kswapd profile: it's
dominated by cache tree deletions (or rather replacing pages with
shadow entries, hence the misleading xas_store()), tree lock
contention, etc. - all work that a direct reclaimer would have to do
as well, with one exception: RWF_UNCACHED doesn't need to go through
the LRU list, and 8-9% of kswapd cycles alone are going into
physically getting pages off the list. (And I suspect part of that is
also contention over the LRU lock as kswapd gets overwhelmed and
direct reclaim kicks in).

Jens, how much throughput difference does kswapd vs RWF_UNCACHED make?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 20:18             ` Linus Torvalds
  (?)
  (?)
@ 2019-12-11 23:41             ` Jens Axboe
  2019-12-12  1:08                 ` Linus Torvalds
  2019-12-12  1:09               ` Jens Axboe
  -1 siblings, 2 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-11 23:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On 12/11/19 1:18 PM, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> $ cat /proc/meminfo | grep -i active
>> Active:           134136 kB
>> Inactive:       28683916 kB
>> Active(anon):      97064 kB
>> Inactive(anon):        4 kB
>> Active(file):      37072 kB
>> Inactive(file): 28683912 kB
> 
> Yeah, that should not put pressure on some swap activity. We have 28
> GB of basically free inactive file data, and the VM is doing something
> very very bad if it then doesn't just quickly free it with no real
> drama.
> 
> In fact, I don't think it should even trigger kswapd at all, it should
> all be direct reclaim. Of course, some of the mm people hate that with
> a passion, but this does look like a prime example of why it should
> just be done.

For giggles, I ran just a single thread on the file set. We're only
doing about 100K IOPS at that point, yet when the page cache fills,
kswapd still eats 10% cpu. That seems like a lot for something that
slow.

> MM people - mind giving this a look?  Jens, if you have that NOACCESS
> flag in a git tree too and a trivial way to recreate your load, that
> would be good for people to be able to just try things out.

I've pushed the NOACCESS thing to my buffered-uncached branch as well,
and fio has a 'noaccess' branch that enables it for pvsync2 (which is
preadv2/pwritev2) and the io_uring engine.

Here's what I did to reproduce:

- Boot the box with 32G of memory.
- On a fast device, create 10x RAM size of files. I used 32 files, each
  10G. Mine are in /data, and they are named file[1-32].
- Run a buffered read workload on those files.

For pvsync2, something ala:

$ cat job.fio
[test]
ioengine=pvsync2
#uncached=1
#noaccess=1
iodepth=4
bs=4k
group_reporting=1
rw=randread
norandommap
buffered=1
directory=/data
filename=file1:file2:file3:file4:file5:file6:file7:file8:file9:file10:file11:file12:file13:file14:file15:file16:file17:file18:file19:file20:file21:file22:file23:file24:file25:file26:file27:file28:file29:file30:file31:file32

If you want to use more than one thread, add:

numjobs=4

for 4 threads. Uncomment the 'uncached=1' and/or 'noaccess=1' to enable
either RWF_UNCACHED or RWF_NOACCESS.
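
For reference, here's a minimal standalone sketch of issuing the same kind of
read without fio. It assumes a kernel with this patchset (plus the NOACCESS
test patch) applied; the RWF_UNCACHED (0x40) and RWF_NOACCESS (0x80) values
are taken from the patches in this thread and are not upstream UAPI, so treat
it as illustrative only:

/*
 * Illustrative only: a single 4k buffered read via preadv2(2) with the
 * RWF_UNCACHED flag from this patchset. The flag values below come from
 * the patches in this thread and are not part of any released UAPI.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* from this patchset */
#endif
#ifndef RWF_NOACCESS
#define RWF_NOACCESS	0x00000080	/* from the NOACCESS test patch */
#endif

int main(int argc, char **argv)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	const char *file = argc > 1 ? argv[1] : "/data/file1";
	int fd = open(file, O_RDONLY);
	ssize_t ret;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* the last argument carries the per-IO RWF_* flags */
	ret = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
	if (ret < 0)
		perror("preadv2(RWF_UNCACHED)");
	else
		printf("read %zd bytes from %s uncached\n", ret, file);
	close(fd);
	return 0;
}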

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 23:41             ` Jens Axboe
@ 2019-12-12  1:08                 ` Linus Torvalds
  2019-12-12  1:09               ` Jens Axboe
  1 sibling, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-12-12  1:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On Wed, Dec 11, 2019 at 3:41 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Here's what I did to reproduce:

Gaah. I have a fairly fast ssd (but it's "consumer fast" - Samsung 970
Pro - I'm assuming yours is a different class), but I encrypt my disk,
so I only get about 15k iops and I see kcryptd in my profiles, not
kswapd.

I didn't even try to worry about RWF_UNCACHED or RWF_NOACCESS, since I
wanted to see the problem. But I think that with my setup, I can't
really see it just due to my IO being slow ;(

I do see xas_create and kswapd0, but it's 0.42%. I'm assuming it's the
very first signs of this problem, but it's certainly not all that
noticeable.

               Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 23:41             ` Jens Axboe
  2019-12-12  1:08                 ` Linus Torvalds
@ 2019-12-12  1:09               ` Jens Axboe
  2019-12-12  2:03                 ` Jens Axboe
  2019-12-12 22:18                 ` Dave Chinner
  1 sibling, 2 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-12  1:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On 12/11/19 4:41 PM, Jens Axboe wrote:
> On 12/11/19 1:18 PM, Linus Torvalds wrote:
>> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> $ cat /proc/meminfo | grep -i active
>>> Active:           134136 kB
>>> Inactive:       28683916 kB
>>> Active(anon):      97064 kB
>>> Inactive(anon):        4 kB
>>> Active(file):      37072 kB
>>> Inactive(file): 28683912 kB
>>
>> Yeah, that should not put pressure on some swap activity. We have 28
>> GB of basically free inactive file data, and the VM is doing something
>> very very bad if it then doesn't just quickly free it with no real
>> drama.
>>
>> In fact, I don't think it should even trigger kswapd at all, it should
>> all be direct reclaim. Of course, some of the mm people hate that with
>> a passion, but this does look like a prime example of why it should
>> just be done.
> 
> For giggles, I ran just a single thread on the file set. We're only
> doing about 100K IOPS at that point, yet when the page cache fills,
> kswapd still eats 10% cpu. That seems like a lot for something that
> slow.

Warning, the below is from the really crazy department...

Anyway, I took a closer look at the profiles for the uncached case.
We're spending a lot of time doing memsets (this is the xa_node init,
from the radix tree constructor), and call_rcu for the node free later
on. All wasted time, and something that meant we weren't as close to the
performance of O_DIRECT as we could be.

So Chris and I started talking about this, and pondered "what would
happen if we simply bypassed the page cache completely?". Case in point,
see below incremental patch. We still do the page cache lookup, and use
that page to copy from if it's there. If the page isn't there, allocate
one and do IO to it, but DON'T add it to the page cache. With that,
we're almost at O_DIRECT levels of performance for the 4k read case,
within 1-2%. I think 512b would look awesome, but we're reading full
pages, so that won't really help us much. Compared to the previous
uncached method, this is 30% faster on this device. That's substantial.

Obviously this has issues with truncate that would need to be resolved,
and it's definitely dirtier. But the performance is very enticing...


diff --git a/mm/filemap.c b/mm/filemap.c
index c37b0e221a8a..9834c394fffd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -933,8 +933,8 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 }
 EXPORT_SYMBOL(add_to_page_cache_locked);
 
-static int __add_to_page_cache(struct page *page, struct address_space *mapping,
-			       pgoff_t offset, gfp_t gfp_mask, bool lru)
+int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
+				pgoff_t offset, gfp_t gfp_mask)
 {
 	void *shadow = NULL;
 	int ret;
@@ -956,18 +956,11 @@ static int __add_to_page_cache(struct page *page, struct address_space *mapping,
 		WARN_ON_ONCE(PageActive(page));
 		if (!(gfp_mask & __GFP_WRITE) && shadow)
 			workingset_refault(page, shadow);
-		if (lru)
-			lru_cache_add(page);
+		lru_cache_add(page);
 	}
 	return ret;
 
 }
-
-int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
-				pgoff_t offset, gfp_t gfp_mask)
-{
-	return __add_to_page_cache(page, mapping, offset, gfp_mask, true);
-}
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
 #ifdef CONFIG_NUMA
@@ -1998,6 +1991,13 @@ static void shrink_readahead_size_eio(struct file *filp,
 	ra->ra_pages /= 4;
 }
 
+static void buffered_put_page(struct page *page, bool clear_mapping)
+{
+	if (clear_mapping)
+		page->mapping = NULL;
+	put_page(page);
+}
+
 /**
  * generic_file_buffered_read - generic file read routine
  * @iocb:	the iocb to read
@@ -2040,7 +2040,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	offset = *ppos & ~PAGE_MASK;
 
 	for (;;) {
-		bool drop_page = false;
+		bool clear_mapping = false;
 		struct page *page;
 		pgoff_t end_index;
 		loff_t isize;
@@ -2118,7 +2118,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_SHIFT;
 		if (unlikely(!isize || index > end_index)) {
-			put_page(page);
+			buffered_put_page(page, clear_mapping);
 			goto out;
 		}
 
@@ -2127,7 +2127,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		if (index == end_index) {
 			nr = ((isize - 1) & ~PAGE_MASK) + 1;
 			if (nr <= offset) {
-				put_page(page);
+				buffered_put_page(page, clear_mapping);
 				goto out;
 			}
 		}
@@ -2160,27 +2160,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		offset &= ~PAGE_MASK;
 		prev_offset = offset;
 
-		/*
-		 * If we're dropping this page due to drop-behind, then
-		 * lock it first. Ignore errors here, we can just leave it
-		 * in the page cache. Note that we didn't add this page to
-		 * the LRU when we added it to the page cache. So if we
-		 * fail removing it, or lock it, add to the LRU.
-		 */
-		if (drop_page) {
-			bool addlru = true;
-
-			if (!lock_page_killable(page)) {
-				if (page->mapping == mapping)
-					addlru = !remove_mapping(mapping, page);
-				else
-					addlru = false;
-				unlock_page(page);
-			}
-			if (addlru)
-				lru_cache_add(page);
-		}
-		put_page(page);
+		buffered_put_page(page, clear_mapping);
 		written += ret;
 		if (!iov_iter_count(iter))
 			goto out;
@@ -2222,7 +2202,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 		if (unlikely(error)) {
 			if (error == AOP_TRUNCATED_PAGE) {
-				put_page(page);
+				buffered_put_page(page, clear_mapping);
 				error = 0;
 				goto find_page;
 			}
@@ -2239,7 +2219,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 					 * invalidate_mapping_pages got it
 					 */
 					unlock_page(page);
-					put_page(page);
+					buffered_put_page(page, clear_mapping);
 					goto find_page;
 				}
 				unlock_page(page);
@@ -2254,7 +2234,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 readpage_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
-		put_page(page);
+		buffered_put_page(page, clear_mapping);
 		goto out;
 
 no_cached_page:
@@ -2267,12 +2247,16 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			error = -ENOMEM;
 			goto out;
 		}
-		if (iocb->ki_flags & IOCB_UNCACHED)
-			drop_page = true;
+		if (iocb->ki_flags & IOCB_UNCACHED) {
+			__SetPageLocked(page);
+			page->mapping = mapping;
+			page->index = index;
+			clear_mapping = true;
+			goto readpage;
+		}
 
-		error = __add_to_page_cache(page, mapping, index,
-				mapping_gfp_constraint(mapping, GFP_KERNEL),
-				!drop_page);
+		error = add_to_page_cache(page, mapping, index,
+				mapping_gfp_constraint(mapping, GFP_KERNEL));
 		if (error) {
 			put_page(page);
 			if (error == -EEXIST) {

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  1:08                 ` Linus Torvalds
  (?)
@ 2019-12-12  1:11                 ` Jens Axboe
  2019-12-12  1:22                     ` Linus Torvalds
  -1 siblings, 1 reply; 67+ messages in thread
From: Jens Axboe @ 2019-12-12  1:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On 12/11/19 6:08 PM, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 3:41 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Here's what I did to reproduce:
> 
> Gaah. I have a fairly fast ssd (but it's "consumer fast" - Samsung 970
> Pro - I'm assuming yours is a different class), but I encrypt my disk,
> so I only get about 15k iops and I see kcryptd in my profiles, not
> kswapd.
> 
> I didn't even try to worry about RWF_UNCACHED or RWF_NOACCESS, since I
> wanted to see the problem. But I think that with my setup, I can't
> really see it just due to my IO being slow ;(
> 
> I do see xas_create and kswapd0, but it's 0.42%. I'm assuming it's the
> very first signs of this problem, but it's certainly not all that
> noticeable.

15K is likely too slow to really show an issue, I'm afraid. The 970
is no slouch, but your crypt setup will likely hamper it a lot. You
don't have a non-encrypted partition on it? I always set aside a
decently sized empty partition on my laptop to be able to run these
kinds of things without having to resort to a test box.

Latency is where it's down the most, you'll likely do better with
more threads to get the IOPS up.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  1:11                 ` Jens Axboe
@ 2019-12-12  1:22                     ` Linus Torvalds
  0 siblings, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-12-12  1:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On Wed, Dec 11, 2019 at 5:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> 15K is likely too slow to really show an issue, I'm afraid. The 970
> is no slouch, but your crypt setup will likely hamper it a lot. You
> don't have a non-encrypted partition on it?

No. I normally don't need all that much disk, so I've never upgraded
my ssd from the 512G size.

Which means that it's actually half full or so, and I never felt like
"I should keep an unencrypted partition for IO testing", since I don't
generally _do_ any IO testing.

I can get my load up with "numjobs=8" and get my iops up to the 100k
range, though.

But kswapd doesn't much seem to care; the CPU percentage actually goes
_down_ to 0.39% when I try that. Probably simply because now my CPUs
are busy, so they are running at 4.7GHz instead of the 800MHz "mostly
idle" state ...

I guess I should be happy. It does mean that the situation you see
isn't exactly the normal case. I understand why you want to do the
non-cached case, but the case I think is the worrisome one is the
regular buffered one, so that's what I'm testing (not even trying the
noaccess patches).

So from your report I went "uhhuh, that sounds like a bug". And it
appears that it largely isn't - you're seeing it because of pushing
the IO subsystem by another order of magnitude (and then I agree that
"under those kinds of IO loads, caching just won't help")

                   Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  1:22                     ` Linus Torvalds
  (?)
@ 2019-12-12  1:29                     ` Jens Axboe
  2019-12-12  1:41                         ` Linus Torvalds
  2019-12-12  1:41                       ` Jens Axboe
  -1 siblings, 2 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-12  1:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On 12/11/19 6:22 PM, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 5:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> 15K is likely too slow to really show an issue, I'm afraid. The 970
>> is no slouch, but your crypt setup will likely hamper it a lot. You
>> don't have a non-encrypted partition on it?
> 
> No. I normally don't need all that much disk, so I've never upgraded
> my ssd from the 512G size.
> 
> Which means that it's actually half full or so, and I never felt like
> "I should keep an unencrypted partition for IO testing", since I don't
> generally _do_ any IO testing.
> 
> I can get my load up with "numjobs=8" and get my iops up to the 100k
> range, though.
> 
> But kswapd doesn't much seem to care; the CPU percentage actually goes
> _down_ to 0.39% when I try that. Probably simply because now my CPUs
> are busy, so they are running at 4.7GHz instead of the 800MHz "mostly
> idle" state ...
> 
> I guess I should be happy. It does mean that the situation you see
> isn't exactly the normal case. I understand why you want to do the
> non-cached case, but the case I think is the worrisome one is the
> regular buffered one, so that's what I'm testing (not even trying the
> noaccess patches).
> 
> So from your report I went "uhhuh, that sounds like a bug". And it
> appears that it largely isn't - you're seeing it because of pushing
> the IO subsystem by another order of magnitude (and then I agree that
> "under those kinds of IO loads, caching just won't help")

I'd very much argue that it IS a bug, maybe just doesn't show on your
system. My test box is a pretty standard 2 socket system, 24 cores / 48
threads, 2 nodes. The last numbers I sent were 100K IOPS, so nothing
crazy, and granted that's only 10% kswapd cpu time, but that still seems
very high for those kinds of rates. I'm surprised you see essentially no
kswapd time for the same data rate.

We'll keep poking here, I know Johannes is spending some time looking
into the reclaim side.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 21:04             ` Johannes Weiner
@ 2019-12-12  1:30               ` Jens Axboe
  0 siblings, 0 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-12  1:30 UTC (permalink / raw)
  To: Johannes Weiner, Linus Torvalds
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner

On 12/11/19 2:04 PM, Johannes Weiner wrote:
> On Wed, Dec 11, 2019 at 12:18:38PM -0800, Linus Torvalds wrote:
>> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> $ cat /proc/meminfo | grep -i active
>>> Active:           134136 kB
>>> Inactive:       28683916 kB
>>> Active(anon):      97064 kB
>>> Inactive(anon):        4 kB
>>> Active(file):      37072 kB
>>> Inactive(file): 28683912 kB
>>
>> Yeah, that should not put pressure on some swap activity. We have 28
>> GB of basically free inactive file data, and the VM is doing something
>> very very bad if it then doesn't just quickly free it with no real
>> drama.
> 
> I was looking at this with Jens offline last week. One thing to note
> is the rate of IO that Jens is working with: combined with the low
> cache hit rate, it was pushing upwards of half a million pages through
> the page cache each second.
> 
> There isn't anything obvious sticking out in the kswapd profile: it's
> dominated by cache tree deletions (or rather replacing pages with
> shadow entries, hence the misleading xas_store()), tree lock
> contention, etc. - all work that a direct reclaimer would have to do
> as well, with one exception: RWF_UNCACHED doesn't need to go through
> the LRU list, and 8-9% of kswapd cycles alone are going into
> physically getting pages off the list. (And I suspect part of that is
> also contention over the LRU lock as kswapd gets overwhelmed and
> direct reclaim kicks in).
> 
> Jens, how much throughput difference does kswapd vs RWF_UNCACHED make?

It's not huge, like 5-10%. The CPU usage is the most noticeable,
particularly at the higher IO rates.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  1:29                     ` Jens Axboe
@ 2019-12-12  1:41                         ` Linus Torvalds
  2019-12-12  1:41                       ` Jens Axboe
  1 sibling, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-12-12  1:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On Wed, Dec 11, 2019 at 5:29 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> I'd very much argue that it IS a bug, maybe just doesn't show on your
> system.

Oh, I agree. But I also understand why people hadn't noticed, and I
don't think it's all that critical - because if you do 1M iops cached,
you're doing something really really strange.

I too can see xas_create using 30% CPU time, but that's when I do a
perf record on just kswapd - and when I actually look at it on a
system level, it looks nowhere near that bad.

So I think people should look at this. Part of it might be for Willy:
does that xas_create() need to be that expensive? I hate how "perf"
callchains work, but it does look like it is probably
page_cache_delete -> xas_store -> xas_create that is the bulk of the
cost there.

Replacing the real page with the shadow entry shouldn't be that big of
a deal, I would really hope.

Willy, that used to be a __radix_tree_lookup -> __radix_tree_replace
thing, is there perhaps some simple optimization that could be done on
the XArray case here?

                Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  1:29                     ` Jens Axboe
  2019-12-12  1:41                         ` Linus Torvalds
@ 2019-12-12  1:41                       ` Jens Axboe
  2019-12-12  1:49                           ` Linus Torvalds
  1 sibling, 1 reply; 67+ messages in thread
From: Jens Axboe @ 2019-12-12  1:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On 12/11/19 6:29 PM, Jens Axboe wrote:
> On 12/11/19 6:22 PM, Linus Torvalds wrote:
>> On Wed, Dec 11, 2019 at 5:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> 15K is likely too slow to really show an issue, I'm afraid. The 970
>>> is no slouch, but your crypt setup will likely hamper it a lot. You
>>> don't have a non-encrypted partition on it?
>>
>> No. I normally don't need all that much disk, so I've never upgraded
>> my ssd from the 512G size.
>>
>> Which means that it's actually half full or so, and I never felt like
>> "I should keep an unencrypted partition for IO testing", since I don't
>> generally _do_ any IO testing.
>>
>> I can get my load up with "numjobs=8" and get my iops up to the 100k
>> range, though.
>>
>> But kswapd doesn't much seem to care; the CPU percentage actually goes
>> _down_ to 0.39% when I try that. Probably simply because now my CPUs
>> are busy, so they are running at 4.7GHz instead of the 800MHz "mostly
>> idle" state ...
>>
>> I guess I should be happy. It does mean that the situation you see
>> isn't exactly the normal case. I understand why you want to do the
>> non-cached case, but the case I think is the worrisome one is the
>> regular buffered one, so that's what I'm testing (not even trying the
>> noaccess patches).
>>
>> So from your report I went "uhhuh, that sounds like a bug". And it
>> appears that it largely isn't - you're seeing it because of pushing
>> the IO subsystem by another order of magnitude (and then I agree that
>> "under those kinds of IO loads, caching just won't help")
> 
> I'd very much argue that it IS a bug, maybe just doesn't show on your
> system. My test box is a pretty standard 2 socket system, 24 cores / 48
> threads, 2 nodes. The last numbers I sent were 100K IOPS, so nothing
> crazy, and granted that's only 10% kswapd cpu time, but that still seems
> very high for those kinds of rates. I'm surprised you see essentially no
> kswapd time for the same data rate.
> 
> We'll keep poking here, I know Johannes is spending some time looking
> into the reclaim side.

Out of curiosity, just tried it on my laptop, which also has some
samsung drive. Using 8 jobs, I get around 100K IOPS too, and this
is my top listing:

23308 axboe     20   0  623156   1304      8 D  10.3  0.0   0:03.81 fio
23309 axboe     20   0  623160   1304      8 D  10.3  0.0   0:03.81 fio
23311 axboe     20   0  623168   1304      8 D  10.3  0.0   0:03.82 fio
23313 axboe     20   0  623176   1304      8 D  10.3  0.0   0:03.82 fio
23314 axboe     20   0  623180   1304      8 D  10.3  0.0   0:03.81 fio
  162 root      20   0       0      0      0 S   9.9  0.0   0:12.97 kswapd0
23307 axboe     20   0  623152   1304      8 D   9.9  0.0   0:03.84 fio
23310 axboe     20   0  623164   1304      8 D   9.9  0.0   0:03.81 fio
23312 axboe     20   0  623172   1304      8 D   9.9  0.0   0:03.80 fio

kswapd is between 9-11% the whole time, and the profile looks very
similar to what I saw on my test box:

    35.79%  kswapd0  [kernel.vmlinux]  [k] xas_create
     9.97%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
     9.94%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages
     7.78%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
     3.78%  kswapd0  [kernel.vmlinux]  [k] xas_clear_mark
     3.08%  kswapd0  [kernel.vmlinux]  [k] workingset_eviction
     2.48%  kswapd0  [kernel.vmlinux]  [k] __isolate_lru_page
     2.06%  kswapd0  [kernel.vmlinux]  [k] page_mapping
     1.95%  kswapd0  [kernel.vmlinux]  [k] __remove_mapping

So now I'm even more puzzled why your (desktop?) doesn't show it, it
must be more potent than my x1 laptop. But for me, the laptop and 2
socket test box show EXACTLY the same behavior, laptop is just too slow
to make it really pathological.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  1:41                       ` Jens Axboe
@ 2019-12-12  1:49                           ` Linus Torvalds
  0 siblings, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-12-12  1:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On Wed, Dec 11, 2019 at 5:41 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> So now I'm even more puzzled why your (desktop?) doesn't show it, it
> must be more potent than my x1 laptop. But for me, the laptop and 2
> socket test box show EXACTLY the same behavior, laptop is just too slow
> to make it really pathological.

I see the exact same thing.

And now, go do "perf record -a sleep 100" while this is going on, to
see the big picture.

Suddenly that number is a lot smaller. Because the kswapd cost is less
than 10% of one cpu, and the xas_create is 30% of that. IOW, it's not
all that dominant until you zoom in on it. Considering the amount of
IO going on, it's kind of understandable.

But as mentioned, it does look like something that should be fixed.
Maybe it's even something where it's a bad XArray case? Willy?

              Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  1:41                         ` Linus Torvalds
  (?)
@ 2019-12-12  1:56                         ` Matthew Wilcox
  2019-12-12  2:47                             ` Linus Torvalds
  -1 siblings, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2019-12-12  1:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Linux-MM, linux-fsdevel, linux-block, Chris Mason,
	Dave Chinner, Johannes Weiner

On Wed, Dec 11, 2019 at 05:41:16PM -0800, Linus Torvalds wrote:
> I too can see xas_create using 30% CPU time, but that's when I do a
> perf record on just kswapd - and when I actually look at it on a
> system level, it looks nowhere near that bad.
> 
> So I think people should look at this. Part of it might be for Willy:
> does that xas_create() need to be that expensive? I hate how "perf"
> callchains work, but it does look like it is probably
> page_cache_delete -> xas_store -> xas_create that is the bulk of the
> cost there.
> 
> Replacing the real page with the shadow entry shouldn't be that big of
> a deal, I would really hope.
> 
> Willy, that used to be a __radix_tree_lookup -> __radix_tree_replace
> thing, is there perhaps some simple optimization that could be done on
> the XArray case here?

It _should_ be the same order of complexity.  Since there's already
a page in the page cache, xas_create() just walks its way down to the
right node calling xas_descend() and then xas_store() does the equivalent
of __radix_tree_replace().  I don't see a bug that would make it more
expensive than the old code ... a 10GB file is going to have four levels
of radix tree node, so it shouldn't even spend that long looking for
the right node.
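
As a quick sanity check on the "four levels" figure: a 10GB file is about 2.6
million 4k pages, and with the default 64-slot XArray nodes (XA_CHUNK_SHIFT ==
6) that does work out to four levels, e.g.:

/*
 * Back-of-the-envelope check of the tree depth for a 10GB file,
 * assuming 4k pages and the default 64-slot XArray nodes.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long pages = (10ULL << 30) >> 12;	/* 10GB of 4k pages */
	unsigned long long span = 64;			/* indices covered by one node */
	int levels = 1;

	while (span < pages) {
		span *= 64;
		levels++;
	}
	printf("%llu pages -> %d level(s) of nodes\n", pages, levels);
	return 0;
}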

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  1:09               ` Jens Axboe
@ 2019-12-12  2:03                 ` Jens Axboe
  2019-12-12  2:10                   ` Jens Axboe
  2019-12-12  2:21                   ` Matthew Wilcox
  2019-12-12 22:18                 ` Dave Chinner
  1 sibling, 2 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-12  2:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On 12/11/19 6:09 PM, Jens Axboe wrote:
> On 12/11/19 4:41 PM, Jens Axboe wrote:
>> On 12/11/19 1:18 PM, Linus Torvalds wrote:
>>> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> $ cat /proc/meminfo | grep -i active
>>>> Active:           134136 kB
>>>> Inactive:       28683916 kB
>>>> Active(anon):      97064 kB
>>>> Inactive(anon):        4 kB
>>>> Active(file):      37072 kB
>>>> Inactive(file): 28683912 kB
>>>
>>> Yeah, that should not put pressure on some swap activity. We have 28
>>> GB of basically free inactive file data, and the VM is doing something
>>> very very bad if it then doesn't just quickly free it with no real
>>> drama.
>>>
>>> In fact, I don't think it should even trigger kswapd at all, it should
>>> all be direct reclaim. Of course, some of the mm people hate that with
>>> a passion, but this does look like a prime example of why it should
>>> just be done.
>>
>> For giggles, I ran just a single thread on the file set. We're only
>> doing about 100K IOPS at that point, yet when the page cache fills,
>> kswapd still eats 10% cpu. That seems like a lot for something that
>> slow.
> 
> Warning, the below is from the really crazy department...
> 
> Anyway, I took a closer look at the profiles for the uncached case.
> We're spending a lot of time doing memsets (this is the xa_node init,
> from the radix tree constructor), and call_rcu for the node free later
> on. All wasted time, and something that meant we weren't as close to the
> performance of O_DIRECT as we could be.
> 
> So Chris and I started talking about this, and pondered "what would
> happen if we simply bypassed the page cache completely?". Case in point,
> see below incremental patch. We still do the page cache lookup, and use
> that page to copy from if it's there. If the page isn't there, allocate
> one and do IO to it, but DON'T add it to the page cache. With that,
> we're almost at O_DIRECT levels of performance for the 4k read case,
> without 1-2%. I think 512b would look awesome, but we're reading full
> pages, so that won't really help us much. Compared to the previous
> uncached method, this is 30% faster on this device. That's substantial.
> 
> Obviously this has issues with truncate that would need to be resolved,
> and it's definitely dirtier. But the performance is very enticing...

Tested and cleaned a bit, and added truncate protection through
inode_dio_begin()/inode_dio_end().

https://git.kernel.dk/cgit/linux-block/commit/?h=buffered-uncached&id=6dac80bc340dabdcbfb4230b9331e52510acca87

This is much faster than the previous page cache dance, and I _think_
we're ok as long as we block truncate and hole punching.

Comments?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  2:03                 ` Jens Axboe
@ 2019-12-12  2:10                   ` Jens Axboe
  2019-12-12  2:21                   ` Matthew Wilcox
  1 sibling, 0 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-12  2:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-MM, linux-fsdevel, linux-block, Matthew Wilcox,
	Chris Mason, Dave Chinner, Johannes Weiner

On 12/11/19 7:03 PM, Jens Axboe wrote:
> On 12/11/19 6:09 PM, Jens Axboe wrote:
>> On 12/11/19 4:41 PM, Jens Axboe wrote:
>>> On 12/11/19 1:18 PM, Linus Torvalds wrote:
>>>> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> $ cat /proc/meminfo | grep -i active
>>>>> Active:           134136 kB
>>>>> Inactive:       28683916 kB
>>>>> Active(anon):      97064 kB
>>>>> Inactive(anon):        4 kB
>>>>> Active(file):      37072 kB
>>>>> Inactive(file): 28683912 kB
>>>>
>>>> Yeah, that should not put pressure on some swap activity. We have 28
>>>> GB of basically free inactive file data, and the VM is doing something
>>>> very very bad if it then doesn't just quickly free it with no real
>>>> drama.
>>>>
>>>> In fact, I don't think it should even trigger kswapd at all, it should
>>>> all be direct reclaim. Of course, some of the mm people hate that with
>>>> a passion, but this does look like a prime example of why it should
>>>> just be done.
>>>
>>> For giggles, I ran just a single thread on the file set. We're only
>>> doing about 100K IOPS at that point, yet when the page cache fills,
>>> kswapd still eats 10% cpu. That seems like a lot for something that
>>> slow.
>>
>> Warning, the below is from the really crazy department...
>>
>> Anyway, I took a closer look at the profiles for the uncached case.
>> We're spending a lot of time doing memsets (this is the xa_node init,
>> from the radix tree constructor), and call_rcu for the node free later
>> on. All wasted time, and something that meant we weren't as close to the
>> performance of O_DIRECT as we could be.
>>
>> So Chris and I started talking about this, and pondered "what would
>> happen if we simply bypassed the page cache completely?". Case in point,
>> see below incremental patch. We still do the page cache lookup, and use
>> that page to copy from if it's there. If the page isn't there, allocate
>> one and do IO to it, but DON'T add it to the page cache. With that,
>> we're almost at O_DIRECT levels of performance for the 4k read case,
>> without 1-2%. I think 512b would look awesome, but we're reading full
>> pages, so that won't really help us much. Compared to the previous
>> uncached method, this is 30% faster on this device. That's substantial.
>>
>> Obviously this has issues with truncate that would need to be resolved,
>> and it's definitely dirtier. But the performance is very enticing...
> 
> Tested and cleaned a bit, and added truncate protection through
> inode_dio_begin()/inode_dio_end().
> 
> https://git.kernel.dk/cgit/linux-block/commit/?h=buffered-uncached&id=6dac80bc340dabdcbfb4230b9331e52510acca87
> 
> This is much faster than the previous page cache dance, and I _think_
> we're ok as long as we block truncate and hole punching.
> 
> Comments?

The incremental was unnecessarily hard to read, so here's the original
RWF_UNCACHED with it folded in instead for easier reading:

commit d1c9eec7b958760ea4d4e75483b0f2e3cfec53d5
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu Dec 5 17:45:00 2019 -0700

    fs: add read support for RWF_UNCACHED
    
    If RWF_UNCACHED is set for io_uring (or preadv2(2)), we'll use private
    pages for the buffered reads. These pages will never be inserted into
    the page cache, and they are simply dropped when we have done the copy at
    the end of IO.
    
    If pages in the read range are already in the page cache, then use those
    for just copying the data instead of starting IO on private pages.
    
    A previous solution used the page cache even for non-cached ranges, but
    the cost of doing so was too high. Removing nodes at the end is
    expensive, even with LRU bypass. On top of that, repeatedly
    instantiating new xarray nodes is very costly, as it needs to memset 576
    bytes of data, and freeing said nodes involves an RCU call per node as
    well. All that adds up, making uncached somewhat slower than O_DIRECT.
    
    With the current solution, we're basically at O_DIRECT levels of
    performance for RWF_UNCACHED IO.
    
    Protect against truncate the same way O_DIRECT does, by calling
    inode_dio_begin() to elevate the inode->i_dio_count.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98e0349adb52..092ea2a4319b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -314,6 +314,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_UNCACHED		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -3418,6 +3419,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 		ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
 	if (flags & RWF_APPEND)
 		ki->ki_flags |= IOCB_APPEND;
+	if (flags & RWF_UNCACHED)
+		ki->ki_flags |= IOCB_UNCACHED;
 	return 0;
 }
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 379a612f8f1d..357ebb0e0c5d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -299,8 +299,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* drop cache after reading or writing data */
+#define RWF_UNCACHED	((__force __kernel_rwf_t)0x00000040)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_UNCACHED)
 
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index bf6aa30be58d..5d299d69b185 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1990,6 +1990,13 @@ static void shrink_readahead_size_eio(struct file *filp,
 	ra->ra_pages /= 4;
 }
 
+static void buffered_put_page(struct page *page, bool clear_mapping)
+{
+	if (clear_mapping)
+		page->mapping = NULL;
+	put_page(page);
+}
+
 /**
  * generic_file_buffered_read - generic file read routine
  * @iocb:	the iocb to read
@@ -2013,6 +2020,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
 	struct file_ra_state *ra = &filp->f_ra;
+	bool did_dio_begin = false;
 	loff_t *ppos = &iocb->ki_pos;
 	pgoff_t index;
 	pgoff_t last_index;
@@ -2032,6 +2040,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	offset = *ppos & ~PAGE_MASK;
 
 	for (;;) {
+		bool clear_mapping = false;
 		struct page *page;
 		pgoff_t end_index;
 		loff_t isize;
@@ -2048,6 +2057,13 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		if (!page) {
 			if (iocb->ki_flags & IOCB_NOWAIT)
 				goto would_block;
+			/* UNCACHED implies no read-ahead */
+			if (iocb->ki_flags & IOCB_UNCACHED) {
+				did_dio_begin = true;
+				/* block truncate for UNCACHED reads */
+				inode_dio_begin(inode);
+				goto no_cached_page;
+			}
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -2106,7 +2122,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_SHIFT;
 		if (unlikely(!isize || index > end_index)) {
-			put_page(page);
+			buffered_put_page(page, clear_mapping);
 			goto out;
 		}
 
@@ -2115,7 +2131,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		if (index == end_index) {
 			nr = ((isize - 1) & ~PAGE_MASK) + 1;
 			if (nr <= offset) {
-				put_page(page);
+				buffered_put_page(page, clear_mapping);
 				goto out;
 			}
 		}
@@ -2147,7 +2163,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		offset &= ~PAGE_MASK;
 		prev_offset = offset;
 
-		put_page(page);
+		buffered_put_page(page, clear_mapping);
 		written += ret;
 		if (!iov_iter_count(iter))
 			goto out;
@@ -2189,7 +2205,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 		if (unlikely(error)) {
 			if (error == AOP_TRUNCATED_PAGE) {
-				put_page(page);
+				buffered_put_page(page, clear_mapping);
 				error = 0;
 				goto find_page;
 			}
@@ -2206,7 +2222,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 					 * invalidate_mapping_pages got it
 					 */
 					unlock_page(page);
-					put_page(page);
+					buffered_put_page(page, clear_mapping);
 					goto find_page;
 				}
 				unlock_page(page);
@@ -2221,7 +2237,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 readpage_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
-		put_page(page);
+		buffered_put_page(page, clear_mapping);
 		goto out;
 
 no_cached_page:
@@ -2234,7 +2250,15 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			error = -ENOMEM;
 			goto out;
 		}
-		error = add_to_page_cache_lru(page, mapping, index,
+		if (iocb->ki_flags & IOCB_UNCACHED) {
+			__SetPageLocked(page);
+			page->mapping = mapping;
+			page->index = index;
+			clear_mapping = true;
+			goto readpage;
+		}
+
+		error = add_to_page_cache(page, mapping, index,
 				mapping_gfp_constraint(mapping, GFP_KERNEL));
 		if (error) {
 			put_page(page);
@@ -2250,6 +2274,8 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 would_block:
 	error = -EAGAIN;
 out:
+	if (did_dio_begin)
+		inode_dio_end(inode);
 	ra->prev_pos = prev_index;
 	ra->prev_pos <<= PAGE_SHIFT;
 	ra->prev_pos |= prev_offset;
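
For reference, a minimal userspace sketch of driving this read path via
preadv2(2). RWF_UNCACHED isn't in any released uapi header, so the value
from the patch above is defined locally; everything else is plain glibc:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* value proposed in this patch */
#endif

int main(int argc, char **argv)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	ssize_t ret;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* 4k read at offset 0; if the page wasn't already cached, it is
	 * read into a private page and dropped after the copy */
	ret = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
	if (ret < 0)
		perror("preadv2");
	else
		printf("read %zd bytes uncached\n", ret);
	close(fd);
	return ret < 0;
}

On a kernel without the patch, preadv2() fails with EOPNOTSUPP since the
flag isn't in RWF_SUPPORTED.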

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  2:03                 ` Jens Axboe
  2019-12-12  2:10                   ` Jens Axboe
@ 2019-12-12  2:21                   ` Matthew Wilcox
  2019-12-12  2:38                     ` Jens Axboe
  1 sibling, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2019-12-12  2:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Linux-MM, linux-fsdevel, linux-block,
	Chris Mason, Dave Chinner, Johannes Weiner

On Wed, Dec 11, 2019 at 07:03:40PM -0700, Jens Axboe wrote:
> Tested and cleaned a bit, and added truncate protection through
> inode_dio_begin()/inode_dio_end().
> 
> https://git.kernel.dk/cgit/linux-block/commit/?h=buffered-uncached&id=6dac80bc340dabdcbfb4230b9331e52510acca87
> 
> This is much faster than the previous page cache dance, and I _think_
> we're ok as long as we block truncate and hole punching.

What about races between UNCACHED and regular buffered IO?  Seems like
we might end up sending overlapping writes to the device?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  2:21                   ` Matthew Wilcox
@ 2019-12-12  2:38                     ` Jens Axboe
  0 siblings, 0 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-12  2:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Linux-MM, linux-fsdevel, linux-block,
	Chris Mason, Dave Chinner, Johannes Weiner

On 12/11/19 7:21 PM, Matthew Wilcox wrote:
> On Wed, Dec 11, 2019 at 07:03:40PM -0700, Jens Axboe wrote:
>> Tested and cleaned a bit, and added truncate protection through
>> inode_dio_begin()/inode_dio_end().
>>
>> https://git.kernel.dk/cgit/linux-block/commit/?h=buffered-uncached&id=6dac80bc340dabdcbfb4230b9331e52510acca87
>>
>> This is much faster than the previous page cache dance, and I _think_
>> we're ok as long as we block truncate and hole punching.
> 
> What about races between UNCACHED and regular buffered IO?  Seems like
> we might end up sending overlapping writes to the device?

The UNCACHED out-of-page-cache IO is just for reads, not writes. The
write part is still using the same approach as earlier, though now I bet
it looks kinda slow compared to the read side...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  1:56                         ` Matthew Wilcox
@ 2019-12-12  2:47                             ` Linus Torvalds
  0 siblings, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-12-12  2:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, Linux-MM, linux-fsdevel, linux-block, Chris Mason,
	Dave Chinner, Johannes Weiner

On Wed, Dec 11, 2019 at 5:56 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> It _should_ be the same order of complexity.  Since there's already
> a page in the page cache, xas_create() just walks its way down to the
> right node calling xas_descend() and then xas_store() does the equivalent
> of __radix_tree_replace().  I don't see a bug that would make it more
> expensive than the old code ... a 10GB file is going to have four levels
> of radix tree node, so it shouldn't even spend that long looking for
> the right node.

The profile seems to say that 85% of the cost of xas_create() is two
instructions:

# lib/xarray.c:143:     return (index >> node->shift) & XA_CHUNK_MASK;
        movzbl  (%rsi), %ecx    # MEM[(unsigned char *)node_13], MEM[(unsigned char *)node_13]
...
# ./include/linux/xarray.h:1145:        return rcu_dereference_check(node->slots[offset],
# ./include/linux/compiler.h:199:       __READ_ONCE_SIZE;
        movq    (%rax), %rax    # MEM[(volatile __u64 *)_80], <retval>

where that first load is "node->shift", and the second load seems to
be just the node dereference.

I think both of them are basically just xas_descend(), but it's a bit
hard to read the several levels of inlining.

I suspect the real issue is that going through kswapd means we've lost
almost all locality, and with the random walking the LRU list is
basically entirely unordered so you don't get any advantage of the
xarray having (otherwise) possibly good cache patterns.

So we're just missing in the caches all the time, making it expensive.

              Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-11 15:29 [PATCHSET v3 0/5] Support for RWF_UNCACHED Jens Axboe
                   ` (5 preceding siblings ...)
  2019-12-11 17:37   ` Linus Torvalds
@ 2019-12-12 10:44 ` Martin Steigerwald
  2019-12-12 15:16   ` Jens Axboe
  6 siblings, 1 reply; 67+ messages in thread
From: Martin Steigerwald @ 2019-12-12 10:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds, david

Hi Jens.

Jens Axboe - 11.12.19, 16:29:38 CET:
> Recently someone asked me how io_uring buffered IO compares to mmaped
> IO in terms of performance. So I ran some tests with buffered IO, and
> found the experience to be somewhat painful. The test case is pretty
> basic, random reads over a dataset that's 10x the size of RAM.
> Performance starts out fine, and then the page cache fills up and we
> hit a throughput cliff. CPU usage of the IO threads go up, and we have
> kswapd spending 100% of a core trying to keep up. Seeing that, I was
> reminded of the many complaints I here about buffered IO, and the
> fact that most of the folks complaining will ultimately bite the
> bullet and move to O_DIRECT to just get the kernel out of the way.
> 
> But I don't think it needs to be like that. Switching to O_DIRECT
> isn't always easily doable. The buffers have different life times,
> size and alignment constraints, etc. On top of that, mixing buffered
> and O_DIRECT can be painful.
> 
> Seems to me that we have an opportunity to provide something that sits
> somewhere in between buffered and O_DIRECT, and this is where
> RWF_UNCACHED enters the picture. If this flag is set on IO, we get
> the following behavior:
> 
> - If the data is in cache, it remains in cache and the copy (in or
> out) is served to/from that.
> 
> - If the data is NOT in cache, we add it while performing the IO. When
> the IO is done, we remove it again.
> 
> With this, I can do 100% smooth buffered reads or writes without
> pushing the kernel to the state where kswapd is sweating bullets. In
> fact it doesn't even register.

A question from a user or Linux Performance trainer perspective:

How does this compare with posix_fadvise() with POSIX_FADV_DONTNEED, which
for example the nocache¹ command uses? Excerpt from the posix_fadvise(2)
manpage:

       POSIX_FADV_DONTNEED
              The specified data will not be accessed  in  the  near
              future.

              POSIX_FADV_DONTNEED  attempts to free cached pages as‐
              sociated with the specified region.  This  is  useful,
              for  example,  while streaming large files.  A program
              may periodically request the  kernel  to  free  cached
              data  that  has already been used, so that more useful
              cached pages are not discarded instead.

[1] packaged in Debian as nocache, or available here:
https://github.com/Feh/nocache

Anyway, it would be nice to have some option in rsync… I still have not
changed my backup script to call rsync via nocache.

Thanks,
Martin

> Comments appreciated! This should work on any standard file system,
> using either the generic helpers or iomap. I have tested ext4 and xfs
> for the right read/write behavior, but no further validation has been
> done yet. Patches are against current git, and can also be found here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=buffered-uncached
> 
>  fs/ceph/file.c          |  2 +-
>  fs/dax.c                |  2 +-
>  fs/ext4/file.c          |  2 +-
>  fs/iomap/apply.c        | 26 ++++++++++-
>  fs/iomap/buffered-io.c  | 54 ++++++++++++++++-------
>  fs/iomap/direct-io.c    |  3 +-
>  fs/iomap/fiemap.c       |  5 ++-
>  fs/iomap/seek.c         |  6 ++-
>  fs/iomap/swapfile.c     |  2 +-
>  fs/nfs/file.c           |  2 +-
>  include/linux/fs.h      |  7 ++-
>  include/linux/iomap.h   | 10 ++++-
>  include/uapi/linux/fs.h |  5 ++-
>  mm/filemap.c            | 95
> ++++++++++++++++++++++++++++++++++++----- 14 files changed, 181
> insertions(+), 40 deletions(-)
> 
> Changes since v2:
> - Rework the write side according to Chinners suggestions. Much
> cleaner this way. It does mean that we invalidate the full write
> region if just ONE page (or more) had to be created, where before it
> was more granular. I don't think that's a concern, and on the plus
> side, we now no longer have to chunk invalidations into 15/16 pages
> at the time.
> - Cleanups
> 
> Changes since v1:
> - Switch to pagevecs for write_drop_cached_pages()
> - Use page_offset() instead of manual shift
> - Ensure we hold a reference on the page between calling ->write_end()
> and checking the mapping on the locked page
> - Fix XFS multi-page streamed writes, we'd drop the UNCACHED flag
> after the first page


-- 
Martin



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12 10:44 ` Martin Steigerwald
@ 2019-12-12 15:16   ` Jens Axboe
  2019-12-12 21:45     ` Martin Steigerwald
  2019-12-12 22:18       ` Linus Torvalds
  0 siblings, 2 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-12 15:16 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds, david

On 12/12/19 3:44 AM, Martin Steigerwald wrote:
> Hi Jens.
> 
> Jens Axboe - 11.12.19, 16:29:38 CET:
>> Recently someone asked me how io_uring buffered IO compares to mmaped
>> IO in terms of performance. So I ran some tests with buffered IO, and
>> found the experience to be somewhat painful. The test case is pretty
>> basic, random reads over a dataset that's 10x the size of RAM.
>> Performance starts out fine, and then the page cache fills up and we
>> hit a throughput cliff. CPU usage of the IO threads go up, and we have
>> kswapd spending 100% of a core trying to keep up. Seeing that, I was
>> reminded of the many complaints I here about buffered IO, and the
>> fact that most of the folks complaining will ultimately bite the
>> bullet and move to O_DIRECT to just get the kernel out of the way.
>>
>> But I don't think it needs to be like that. Switching to O_DIRECT
>> isn't always easily doable. The buffers have different life times,
>> size and alignment constraints, etc. On top of that, mixing buffered
>> and O_DIRECT can be painful.
>>
>> Seems to me that we have an opportunity to provide something that sits
>> somewhere in between buffered and O_DIRECT, and this is where
>> RWF_UNCACHED enters the picture. If this flag is set on IO, we get
>> the following behavior:
>>
>> - If the data is in cache, it remains in cache and the copy (in or
>> out) is served to/from that.
>>
>> - If the data is NOT in cache, we add it while performing the IO. When
>> the IO is done, we remove it again.
>>
>> With this, I can do 100% smooth buffered reads or writes without
>> pushing the kernel to the state where kswapd is sweating bullets. In
>> fact it doesn't even register.
> 
> A question from a user or Linux Performance trainer perspective:
> 
> How does this compare with posix_fadvise() with POSIX_FADV_DONTNEED that 
> for example the nocache¹ command is using? Excerpt from manpage 
> posix_fadvice(2):
> 
>        POSIX_FADV_DONTNEED
>               The specified data will not be accessed  in  the  near
>               future.
> 
>               POSIX_FADV_DONTNEED  attempts to free cached pages as‐
>               sociated with the specified region.  This  is  useful,
>               for  example,  while streaming large files.  A program
>               may periodically request the  kernel  to  free  cached
>               data  that  has already been used, so that more useful
>               cached pages are not discarded instead.
> 
> [1] packaged in Debian as nocache or available herehttps://github.com/
> Feh/nocache
> 
> In any way, would be nice to have some option in rsync… I still did not 
> change my backup script to call rsync via nocache.

I don't know the nocache tool, but I'm guessing it just does the writes
(or reads) and then uses FADV_DONTNEED to drop behind those pages?
That's fine for slower use cases, it won't work very well for fast IO.
The write side currently works pretty much like that internally, whereas
the read side doesn't use the page cache at all.
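
For comparison, this is roughly what that drop-behind pattern looks like
with the existing interfaces. A sketch only, not taken from the nocache
source:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static ssize_t write_dropbehind(int fd, const void *buf, size_t len,
				off_t off)
{
	ssize_t ret = pwrite(fd, buf, len, off);

	if (ret > 0) {
		/* kick off writeback and wait for it, so DONTNEED can
		 * actually drop the now-clean pages */
		sync_file_range(fd, off, ret,
				SYNC_FILE_RANGE_WAIT_BEFORE |
				SYNC_FILE_RANGE_WRITE |
				SYNC_FILE_RANGE_WAIT_AFTER);
		posix_fadvise(fd, off, ret, POSIX_FADV_DONTNEED);
	}
	return ret;
}

That's two extra syscalls per write, and the RWF_UNCACHED write side does
essentially the same thing internally in a single call.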

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  2:47                             ` Linus Torvalds
  (?)
@ 2019-12-12 17:52                             ` Matthew Wilcox
  2019-12-12 18:29                                 ` Linus Torvalds
  -1 siblings, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2019-12-12 17:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Linux-MM, linux-fsdevel, linux-block, Chris Mason,
	Dave Chinner, Johannes Weiner

On Wed, Dec 11, 2019 at 06:47:25PM -0800, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 5:56 PM Matthew Wilcox <willy@infradead.org> wrote:
> > It _should_ be the same order of complexity.  Since there's already
> > a page in the page cache, xas_create() just walks its way down to the
> > right node calling xas_descend() and then xas_store() does the equivalent
> > of __radix_tree_replace().  I don't see a bug that would make it more
> > expensive than the old code ... a 10GB file is going to have four levels
> > of radix tree node, so it shouldn't even spend that long looking for
> > the right node.
> 
> The profile seems to say that 85% of the cost of xas_create() is two
> instructions:
> 
> # lib/xarray.c:143:     return (index >> node->shift) & XA_CHUNK_MASK;
>         movzbl  (%rsi), %ecx    # MEM[(unsigned char *)node_13],
> MEM[(unsigned char *)node_13]
> ...
> # ./include/linux/xarray.h:1145:        return
> rcu_dereference_check(node->slots[offset],
> # ./include/linux/compiler.h:199:       __READ_ONCE_SIZE;
>         movq    (%rax), %rax    # MEM[(volatile __u64 *)_80], <retval>
> 
> where that first load is "node->shift", and the second load seems to
> be just the node dereference.
> 
> I think both of them are basically just xas_descend(), but it's a bit
> hard to read the several levels of inlining.

Yep, that's basically the two cache misses you get for each level of
the tree.  I'd expect the node->shift to be slightly more costly because
sometimes the node->slots[offset] is going to be in the same cacheline
or the next cacheline after node->shift.

> I suspect the real issue is that going through kswapd means we've lost
> almost all locality, and with the random walking the LRU list is
> basically entirely unordered so you don't get any advantage of the
> xarray having (otherwise) possibly good cache patterns.
> 
> So we're just missing in the caches all the time, making it expensive.

I have some bad ideas for improving it.

1. We could semi-sort the pages on the LRU list.  If we know we're going
to remove a bunch of pages, we could take a batch of them off the list,
sort them and remove them in-order.  This probably wouldn't be terribly
effective.

2. We could change struct page to point to the xa_node that holds them.
Looking up the page mapping would be page->xa_node->array and then
offsetof(i_pages) to get the mapping.

3. Once the maple tree is ready, a 10GB file can actually be held more
efficiently in a maple tree than in the radix tree.  Because each level
of the tree is 128 bytes and (Intel) CPUs fetch a pair of cachelines,
there's only one cache miss per level of the tree.  So despite the tree
being deeper, there are fewer cache misses.  With an average of 6 pointers
per level of the tree, terminating in a dense node, we'd see:

: 6 * 6 * 6 * 6 * 6 * 6 * 6 * 15
: 4199040
(a 10GB file contains 2621440 4kB pages)

which is still eight cache misses, so to see an improvement, we'd need to
implement another planned improvement, which is allocating a large, dense
leaf node of 4kB.  That would reduce the height of the tree by 2:

: 6 * 6 * 6 * 6 * 6 * 500
: 3888000

4. For consecutive accesses, I'm working on allocating larger pages
(16kB, 64kB, 256kB, ...).  That will help by reducing the number of
deletions from the page cache by a factor of 4/16/64/...

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12 17:52                             ` Matthew Wilcox
@ 2019-12-12 18:29                                 ` Linus Torvalds
  0 siblings, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-12-12 18:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, Linux-MM, linux-fsdevel, linux-block, Chris Mason,
	Dave Chinner, Johannes Weiner

On Thu, Dec 12, 2019 at 9:52 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> 1. We could semi-sort the pages on the LRU list.  If we know we're going
> to remove a bunch of pages, we could take a batch of them off the list,
> sort them and remove them in-order.  This probably wouldn't be terribly
> effective.

I don't think the sorting is relevant.

Once you batch things, you already would get most of the locality
advantage in the cache if it exists (and the batch isn't insanely
large so that one batch already causes cache overflows).

The problem - I suspect - is that we don't batch at all. Or rather,
the "batching" does exist at a high level, but it's so high that
there's just tons of stuff going on between single pages. It is at the
shrink_page_list() level, which is pretty high up and basically does
one page at a time with locking and a lot of tests for each page, and
then we do "__remove_mapping()" (which does some more work) one at a
time before we actually get to __delete_from_page_cache().

So it's "batched", but it's in a huge loop, and even at that huge loop
level the batch size is fairly small. We limit it to SWAP_CLUSTER_MAX,
which is just 32.

Thinking about it, that SWAP_CLUSTER_MAX may make sense in some other
circumstances, but not necessarily in the "shrink clean inactive
pages" thing. I wonder if we could just batch clean pages a _lot_ more
aggressively. Yes, our batching loop is still very big and it might
not help at an L1 level, but it might help in the L2, at least.

In kswapd, when we have 28 GB of pages on the inactive list, a batch
of 32 pages at a time is pretty small ;)

> 2. We could change struct page to point to the xa_node that holds them.
> Looking up the page mapping would be page->xa_node->array and then
> offsetof(i_pages) to get the mapping.

I don't think we have space in 'struct page', and I'm pretty sure we
don't want to grow it. That's one of the more common data structures
in the kernel.

                         Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12 18:29                                 ` Linus Torvalds
  (?)
@ 2019-12-12 20:05                                 ` Matthew Wilcox
  -1 siblings, 0 replies; 67+ messages in thread
From: Matthew Wilcox @ 2019-12-12 20:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Linux-MM, linux-fsdevel, linux-block, Chris Mason,
	Dave Chinner, Johannes Weiner

On Thu, Dec 12, 2019 at 10:29:02AM -0800, Linus Torvalds wrote:
> On Thu, Dec 12, 2019 at 9:52 AM Matthew Wilcox <willy@infradead.org> wrote:
> > 1. We could semi-sort the pages on the LRU list.  If we know we're going
> > to remove a bunch of pages, we could take a batch of them off the list,
> > sort them and remove them in-order.  This probably wouldn't be terribly
> > effective.
> 
> I don't think the sorting is relevant.
> 
> Once you batch things, you already would get most of the locality
> advantage in the cache if it exists (and the batch isn't insanely
> large so that one batch already causes cache overflows).
> 
> The problem - I suspect - is that we don't batch at all. Or rather,
> the "batching" does exist at a high level, but it's so high that
> there's just tons of stuff going on between single pages. It is at the
> shrink_page_list() level, which is pretty high up and basically does
> one page at a time with locking and a lot of tests for each page, and
> then we do "__remove_mapping()" (which does some more work) one at a
> time before we actually get to __delete_from_page_cache().
> 
> So it's "batched", but it's in a huge loop, and even at that huge loop
> level the batch size is fairly small. We limit it to SWAP_CLUSTER_MAX,
> which is just 32.
> 
> Thinking about it, that SWAP_CLUSTER_MAX may make sense in some other
> circumstances, but not necessarily in the "shrink clean inactive
> pages" thing. I wonder if we could just batch clean pages a _lot_ more
> aggressively. Yes, our batching loop is still very big and it might
> not help at an L1 level, but it might help in the L2, at least.
> 
> In kswapd, when we have 28 GB of pages on the inactive list, a batch
> of 32 pages at a time is pretty small ;)

Yeah, that's pretty poor.  I just read through it, and even if pages are
in order on the page list, they're not going to batch nicely.  It'd be
nice to accumulate them and call delete_from_page_cache_batch(), but we
need to put shadow entries in to replace them, so we'd need a variant
of that which took two pagevecs.

> > 2. We could change struct page to point to the xa_node that holds them.
> > Looking up the page mapping would be page->xa_node->array and then
> > offsetof(i_pages) to get the mapping.
> 
> I don't think we have space in 'struct page', and I'm pretty sure we
> don't want to grow it. That's one of the more common data structures
> in the kernel.

Oh, I wasn't clear.  I meant replace page->mapping with page->xa_node.
We could still get from page to mapping, but it would be an extra
dereference.  I did say it was a _bad_ idea.
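
To make the extra dereference concrete, a sketch assuming a hypothetical
page->xa_node field (xa_node->array and the i_pages embedding are real,
the field itself is not):

static inline struct address_space *page_mapping_via_node(struct page *page)
{
	/* assumed field, replacing page->mapping */
	struct xa_node *node = READ_ONCE(page->xa_node);

	/* i_pages is the xarray embedded in struct address_space */
	return container_of(node->array, struct address_space, i_pages);
}

So the mapping lookup goes from one load to two, which is the extra
dereference mentioned above.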

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12 15:16   ` Jens Axboe
@ 2019-12-12 21:45     ` Martin Steigerwald
  2019-12-12 22:15       ` Jens Axboe
  2019-12-12 22:18       ` Linus Torvalds
  1 sibling, 1 reply; 67+ messages in thread
From: Martin Steigerwald @ 2019-12-12 21:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds, david

Jens Axboe - 12.12.19, 16:16:31 CET:
> On 12/12/19 3:44 AM, Martin Steigerwald wrote:
> > Jens Axboe - 11.12.19, 16:29:38 CET:
> >> Recently someone asked me how io_uring buffered IO compares to
> >> mmaped
> >> IO in terms of performance. So I ran some tests with buffered IO,
> >> and
> >> found the experience to be somewhat painful. The test case is
> >> pretty
> >> basic, random reads over a dataset that's 10x the size of RAM.
> >> Performance starts out fine, and then the page cache fills up and
> >> we
> >> hit a throughput cliff. CPU usage of the IO threads go up, and we
> >> have kswapd spending 100% of a core trying to keep up. Seeing
> >> that, I was reminded of the many complaints I here about buffered
> >> IO, and the fact that most of the folks complaining will
> >> ultimately bite the bullet and move to O_DIRECT to just get the
> >> kernel out of the way.
> >> 
> >> But I don't think it needs to be like that. Switching to O_DIRECT
> >> isn't always easily doable. The buffers have different life times,
> >> size and alignment constraints, etc. On top of that, mixing
> >> buffered
> >> and O_DIRECT can be painful.
> >> 
> >> Seems to me that we have an opportunity to provide something that
> >> sits somewhere in between buffered and O_DIRECT, and this is where
> >> RWF_UNCACHED enters the picture. If this flag is set on IO, we get
> >> the following behavior:
> >> 
> >> - If the data is in cache, it remains in cache and the copy (in or
> >> out) is served to/from that.
> >> 
> >> - If the data is NOT in cache, we add it while performing the IO.
> >> When the IO is done, we remove it again.
> >> 
> >> With this, I can do 100% smooth buffered reads or writes without
> >> pushing the kernel to the state where kswapd is sweating bullets.
> >> In
> >> fact it doesn't even register.
> > 
> > A question from a user or Linux Performance trainer perspective:
> > 
> > How does this compare with posix_fadvise() with POSIX_FADV_DONTNEED
> > that for example the nocache¹ command is using? Excerpt from
> > manpage> 
> > posix_fadvice(2):
> >        POSIX_FADV_DONTNEED
> >        
> >               The specified data will not be accessed  in  the  near
> >               future.
> >               
> >               POSIX_FADV_DONTNEED  attempts to free cached pages as‐
> >               sociated with the specified region.  This  is  useful,
> >               for  example,  while streaming large files.  A program
> >               may periodically request the  kernel  to  free  cached
> >               data  that  has already been used, so that more useful
> >               cached pages are not discarded instead.
> > 
> > [1] packaged in Debian as nocache or available
> > herehttps://github.com/ Feh/nocache
> > 
> > In any way, would be nice to have some option in rsync… I still did
> > not change my backup script to call rsync via nocache.
> 
> I don't know the nocache tool, but I'm guessing it just does the
> writes (or reads) and then uses FADV_DONTNEED to drop behind those
> pages? That's fine for slower use cases, it won't work very well for
> fast IO. The write side currently works pretty much like that
> internally, whereas the read side doesn't use the page cache at all.

Yes, it does that. And yeah I saw you changed the read side to bypass
the cache entirely.

Also, as I understand it, this is primarily for asynchronous IO using
io_uring?

-- 
Martin



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12 21:45     ` Martin Steigerwald
@ 2019-12-12 22:15       ` Jens Axboe
  0 siblings, 0 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-12 22:15 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds, david

On 12/12/19 2:45 PM, Martin Steigerwald wrote:
> Jens Axboe - 12.12.19, 16:16:31 CET:
>> On 12/12/19 3:44 AM, Martin Steigerwald wrote:
>>> Jens Axboe - 11.12.19, 16:29:38 CET:
>>>> Recently someone asked me how io_uring buffered IO compares to
>>>> mmaped
>>>> IO in terms of performance. So I ran some tests with buffered IO,
>>>> and
>>>> found the experience to be somewhat painful. The test case is
>>>> pretty
>>>> basic, random reads over a dataset that's 10x the size of RAM.
>>>> Performance starts out fine, and then the page cache fills up and
>>>> we
>>>> hit a throughput cliff. CPU usage of the IO threads go up, and we
>>>> have kswapd spending 100% of a core trying to keep up. Seeing
>>>> that, I was reminded of the many complaints I here about buffered
>>>> IO, and the fact that most of the folks complaining will
>>>> ultimately bite the bullet and move to O_DIRECT to just get the
>>>> kernel out of the way.
>>>>
>>>> But I don't think it needs to be like that. Switching to O_DIRECT
>>>> isn't always easily doable. The buffers have different life times,
>>>> size and alignment constraints, etc. On top of that, mixing
>>>> buffered
>>>> and O_DIRECT can be painful.
>>>>
>>>> Seems to me that we have an opportunity to provide something that
>>>> sits somewhere in between buffered and O_DIRECT, and this is where
>>>> RWF_UNCACHED enters the picture. If this flag is set on IO, we get
>>>> the following behavior:
>>>>
>>>> - If the data is in cache, it remains in cache and the copy (in or
>>>> out) is served to/from that.
>>>>
>>>> - If the data is NOT in cache, we add it while performing the IO.
>>>> When the IO is done, we remove it again.
>>>>
>>>> With this, I can do 100% smooth buffered reads or writes without
>>>> pushing the kernel to the state where kswapd is sweating bullets.
>>>> In
>>>> fact it doesn't even register.
>>>
>>> A question from a user or Linux Performance trainer perspective:
>>>
>>> How does this compare with posix_fadvise() with POSIX_FADV_DONTNEED
>>> that for example the nocache¹ command is using? Excerpt from
>>> manpage> 
>>> posix_fadvice(2):
>>>        POSIX_FADV_DONTNEED
>>>        
>>>               The specified data will not be accessed  in  the  near
>>>               future.
>>>               
>>>               POSIX_FADV_DONTNEED  attempts to free cached pages as‐
>>>               sociated with the specified region.  This  is  useful,
>>>               for  example,  while streaming large files.  A program
>>>               may periodically request the  kernel  to  free  cached
>>>               data  that  has already been used, so that more useful
>>>               cached pages are not discarded instead.
>>>
>>> [1] packaged in Debian as nocache or available
>>> herehttps://github.com/ Feh/nocache
>>>
>>> In any way, would be nice to have some option in rsync… I still did
>>> not change my backup script to call rsync via nocache.
>>
>> I don't know the nocache tool, but I'm guessing it just does the
>> writes (or reads) and then uses FADV_DONTNEED to drop behind those
>> pages? That's fine for slower use cases, it won't work very well for
>> fast IO. The write side currently works pretty much like that
>> internally, whereas the read side doesn't use the page cache at all.
> 
> Yes, it does that. And yeah I saw you changed the read site to bypass 
> the cache entirely.
> 
> Also as I understand it this is for asynchronous using io uring 
> primarily?

Or preadv2/pwritev2, they also allow passing in RWF_* flags.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12 15:16   ` Jens Axboe
@ 2019-12-12 22:18       ` Linus Torvalds
  2019-12-12 22:18       ` Linus Torvalds
  1 sibling, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2019-12-12 22:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin Steigerwald, Linux-MM, linux-fsdevel, linux-block,
	Matthew Wilcox, Chris Mason, Dave Chinner

On Thu, Dec 12, 2019 at 7:16 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> I don't know the nocache tool, but I'm guessing it just does the writes
> (or reads) and then uses FADV_DONTNEED to drop behind those pages?
> That's fine for slower use cases, it won't work very well for fast IO.
> The write side currently works pretty much like that internally, whereas
> the read side doesn't use the page cache at all.

Well, I think that if we have this RWF/IOC_UNCACHED flag, maybe we
should make FADV_NOREUSE file descriptors just use it even for regular
read/write..

Right now FADV_NOREUSE is a no-op for the kernel, I think, because we
historically didn't have the facility.

But FADV_DONTNEED is different. That's for dropping pages after use
(ie invalidate_mapping_pages())
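
A hypothetical sketch of what the FADV_NOREUSE wiring could look like, just
to illustrate the idea; neither FMODE_NOREUSE nor any of this is in the
current patches:

/* mm/fadvise.c: generic_fadvise() */
	case POSIX_FADV_NOREUSE:
		file->f_mode |= FMODE_NOREUSE;	/* assumed new fmode bit */
		break;

/* and wherever the kiocb is set up for buffered reads/writes: */
	if (file->f_mode & FMODE_NOREUSE)
		iocb->ki_flags |= IOCB_UNCACHED;

That way an application (or a wrapper like nocache) could opt a whole file
descriptor into the uncached behavior without touching every read/write
call site.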

               Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12  1:09               ` Jens Axboe
  2019-12-12  2:03                 ` Jens Axboe
@ 2019-12-12 22:18                 ` Dave Chinner
  2019-12-13  1:32                   ` Chris Mason
  1 sibling, 1 reply; 67+ messages in thread
From: Dave Chinner @ 2019-12-12 22:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Linux-MM, linux-fsdevel, linux-block,
	Matthew Wilcox, Chris Mason, Johannes Weiner

On Wed, Dec 11, 2019 at 06:09:14PM -0700, Jens Axboe wrote:
> On 12/11/19 4:41 PM, Jens Axboe wrote:
> > On 12/11/19 1:18 PM, Linus Torvalds wrote:
> >> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>
> >>> $ cat /proc/meminfo | grep -i active
> >>> Active:           134136 kB
> >>> Inactive:       28683916 kB
> >>> Active(anon):      97064 kB
> >>> Inactive(anon):        4 kB
> >>> Active(file):      37072 kB
> >>> Inactive(file): 28683912 kB
> >>
> >> Yeah, that should not put pressure on some swap activity. We have 28
> >> GB of basically free inactive file data, and the VM is doing something
> >> very very bad if it then doesn't just quickly free it with no real
> >> drama.
> >>
> >> In fact, I don't think it should even trigger kswapd at all, it should
> >> all be direct reclaim. Of course, some of the mm people hate that with
> >> a passion, but this does look like a prime example of why it should
> >> just be done.
> > 
> > For giggles, I ran just a single thread on the file set. We're only
> > doing about 100K IOPS at that point, yet when the page cache fills,
> > kswapd still eats 10% cpu. That seems like a lot for something that
> > slow.
> 
> Warning, the below is from the really crazy department...
> 
> Anyway, I took a closer look at the profiles for the uncached case.
> We're spending a lot of time doing memsets (this is the xa_node init,
> from the radix tree constructor), and call_rcu for the node free later
> on. All wasted time, and something that meant we weren't as close to the
> performance of O_DIRECT as we could be.
> 
> So Chris and I started talking about this, and pondered "what would
> happen if we simply bypassed the page cache completely?". Case in point,
> see below incremental patch. We still do the page cache lookup, and use
> that page to copy from if it's there. If the page isn't there, allocate
> one and do IO to it, but DON'T add it to the page cache. With that,
> we're almost at O_DIRECT levels of performance for the 4k read case,
> within 1-2%. I think 512b would look awesome, but we're reading full
> pages, so that won't really help us much. Compared to the previous
> uncached method, this is 30% faster on this device. That's substantial.

Interesting idea, but this seems like it is just direct IO with
kernel pages and a memcpy() rather than just mapping user pages, but
has none of the advantages of direct IO in that we can run reads and
writes concurrently because it's going through the buffered IO path.

It also needs all the special DIO truncate/hole punch serialisation
mechanisms to be propagated into the buffered IO path - the
requirement for inode_dio_wait() serialisation is something I'm
trying to remove from XFS, not have to add into more paths. And it
introduces the same issues with other buffered read/mmap access to
the same file ranges as direct IO has.

> Obviously this has issues with truncate that would need to be resolved,
> and it's definitely dirtier. But the performance is very enticing...

At which point I have to ask: why are we considering repeating the
mistakes that were made with direct IO?  Yes, it might be faster
than a coherent RWF_UNCACHED IO implementation, but I don't think
making it more like O_DIRECT is worth the price.

And, ultimately, RWF_UNCACHED will never be as fast as direct IO
because it *requires* the CPU to copy the data at least once. Direct
IO is zero-copy, and so it's always going to have lower overhead
than RWF_UNCACHED, and so when CPU or memory bandwidth is the
limiting factor, O_DIRECT will always be faster.

IOWs, I think trying to make RWF_UNCACHED as fast as O_DIRECT is a
fool's game and attempting to do so is taking a step in the wrong
direction architecturally.  I'd much prefer a sane IO model for
RWF_UNCACHED that provides coherency w/ mmap and other buffered IO
than compromise these things in the chase for ultimate performance.

Speaking of IO path architecture, perhaps what we really need here
is an iomap_apply()->iomap_read_actor loop here similar to the write
side. This would allow us to bypass all the complex readahead
shenanigans that generic_file_buffered_read() has to deal with and
directly control page cache residency and build the exact IOs we
need when RWF_UNCACHED is set. This moves it much closer to the
direct IO path in terms IO setup overhead and physical IO patterns,
but still has all the benefits of being fully cache coherent....
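
Roughly the shape that would take, with the actor body elided. Only the
iomap_apply()/actor calling convention here matches current iomap; the
function names and the read-side use of IOMAP_UNCACHED are made up for
illustration:

static loff_t
iomap_uncached_read_actor(struct inode *inode, loff_t pos, loff_t length,
		void *data, struct iomap *iomap, struct iomap *srcmap)
{
	/*
	 * For each page in this extent: copy from the page cache if the
	 * page is resident, otherwise read into a private page, copy to
	 * the iov_iter in @data, and drop the page. Returns bytes
	 * consumed.
	 */
	return length;
}

static ssize_t
iomap_uncached_read(struct kiocb *iocb, struct iov_iter *iter,
		const struct iomap_ops *ops)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	loff_t pos = iocb->ki_pos;
	ssize_t written = 0;
	loff_t ret = 0;

	while (iov_iter_count(iter)) {
		ret = iomap_apply(inode, pos, iov_iter_count(iter),
				IOMAP_UNCACHED, ops, iter,
				iomap_uncached_read_actor);
		if (ret <= 0)
			break;
		pos += ret;
		written += ret;
	}
	iocb->ki_pos = pos;
	return written ? written : ret;
}

No readahead state, no xarray insert/remove per page, and the filesystem
sees IOMAP_UNCACHED in ->iomap_begin so it can map extents to suit.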

And, really, when we are talking about high end nvme drives that can
do 5-10GB/s read each, and we can put 20+ of them in a single
machine, there's no real value in doing readahead. i.e. there's
little read IO latency to hide in the first place and we such
systems have little memory bandwidth to spare to waste on readahead
IO that we don't end up using...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-11 15:29 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
  2019-12-11 17:19   ` Matthew Wilcox
@ 2019-12-12 22:34   ` Dave Chinner
  2019-12-13  0:54     ` Jens Axboe
  1 sibling, 1 reply; 67+ messages in thread
From: Dave Chinner @ 2019-12-12 22:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
> This adds support for RWF_UNCACHED for file systems using iomap to
> perform buffered writes. We use the generic infrastructure for this,
> by tracking pages we created and calling write_drop_cached_pages()
> to issue writeback and prune those pages.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
>  fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++--------
>  include/linux/iomap.h  |  5 +++++
>  3 files changed, 58 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 562536da8a13..966826ad4bb9 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -90,5 +90,29 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  				     flags, &iomap);
>  	}
>  
> +	if (written && (flags & IOMAP_UNCACHED)) {
> +		struct address_space *mapping = inode->i_mapping;
> +
> +		end = pos + written;
> +		ret = filemap_write_and_wait_range(mapping, pos, end);
> +		if (ret)
> +			goto out;
> +
> +		/*
> +		 * No pages were created for this range, we're done
> +		 */
> +		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
> +			goto out;
> +
> +		/*
> +		 * Try to invalidate cache pages for the range we just wrote.
> +		 * We don't care if invalidation fails as the write has still
> +		 * worked and leaving clean uptodate pages in the page cache
> +		 * isn't a corruption vector for uncached IO.
> +		 */
> +		invalidate_inode_pages2_range(mapping,
> +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> +	}
> +out:
>  	return written ? written : ret;
>  }

Just a thought on further optimisation for this for XFS.
IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
methods by iomap_apply().  Hence the filesystems know that it is
an uncached IO that is being done, and we can tailor allocation
strategies to suit the fact that the data is going to be written
immediately.

In this case, XFS needs to treat it the same way it treats direct
IO. That is, we do immediate unwritten extent allocation rather than
delayed allocation. This will reduce the allocation overhead and
will optimise for immediate IO locality rather than optimise for
delayed allocation.

This should just be a relatively simple change to
xfs_file_iomap_begin() along the lines of:

-	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
-			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
+	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
+	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
+	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
		/* Reserve delalloc blocks for regular writeback. */
		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
				iomap);
	}

so that it avoids delayed allocation for uncached IO...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-12 22:34   ` Dave Chinner
@ 2019-12-13  0:54     ` Jens Axboe
  2019-12-13  0:57       ` Jens Axboe
  0 siblings, 1 reply; 67+ messages in thread
From: Jens Axboe @ 2019-12-13  0:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On 12/12/19 3:34 PM, Dave Chinner wrote:
> On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
>> This adds support for RWF_UNCACHED for file systems using iomap to
>> perform buffered writes. We use the generic infrastructure for this,
>> by tracking pages we created and calling write_drop_cached_pages()
>> to issue writeback and prune those pages.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
>>  fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++--------
>>  include/linux/iomap.h  |  5 +++++
>>  3 files changed, 58 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
>> index 562536da8a13..966826ad4bb9 100644
>> --- a/fs/iomap/apply.c
>> +++ b/fs/iomap/apply.c
>> @@ -90,5 +90,29 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>>  				     flags, &iomap);
>>  	}
>>  
>> +	if (written && (flags & IOMAP_UNCACHED)) {
>> +		struct address_space *mapping = inode->i_mapping;
>> +
>> +		end = pos + written;
>> +		ret = filemap_write_and_wait_range(mapping, pos, end);
>> +		if (ret)
>> +			goto out;
>> +
>> +		/*
>> +		 * No pages were created for this range, we're done
>> +		 */
>> +		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
>> +			goto out;
>> +
>> +		/*
>> +		 * Try to invalidate cache pages for the range we just wrote.
>> +		 * We don't care if invalidation fails as the write has still
>> +		 * worked and leaving clean uptodate pages in the page cache
>> +		 * isn't a corruption vector for uncached IO.
>> +		 */
>> +		invalidate_inode_pages2_range(mapping,
>> +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
>> +	}
>> +out:
>>  	return written ? written : ret;
>>  }
> 
> Just a thought on further optimisation for this for XFS.
> IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
> methods by iomap_apply().  Hence the filesystems know that it is
> an uncached IO that is being done, and we can tailor allocation
> strategies to suit the fact that the data is going to be written
> immediately.
> 
> In this case, XFS needs to treat it the same way it treats direct
> IO. That is, we do immediate unwritten extent allocation rather than
> delayed allocation. This will reduce the allocation overhead and
> will optimise for immediate IO locality rather than optimise for
> delayed allocation.
> 
> This should just be a relatively simple change to
> xfs_file_iomap_begin() along the lines of:
> 
> -	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
> -			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
> +	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
> +	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> 		/* Reserve delalloc blocks for regular writeback. */
> 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
> 				iomap);
> 	}
> 
> so that it avoids delayed allocation for uncached IO...

That's very handy! Thanks, I'll add that to the next version. Just out
of curiosity, would you prefer this as a separate patch, or just bundle
it with the iomap buffered RWF_UNCACHED patch? I'm assuming the latter,
and I'll just mention it in the changelog.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-13  0:54     ` Jens Axboe
@ 2019-12-13  0:57       ` Jens Axboe
  2019-12-16  4:17         ` Dave Chinner
  0 siblings, 1 reply; 67+ messages in thread
From: Jens Axboe @ 2019-12-13  0:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On 12/12/19 5:54 PM, Jens Axboe wrote:
> On 12/12/19 3:34 PM, Dave Chinner wrote:
>> On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
>>> This adds support for RWF_UNCACHED for file systems using iomap to
>>> perform buffered writes. We use the generic infrastructure for this,
>>> by tracking pages we created and calling write_drop_cached_pages()
>>> to issue writeback and prune those pages.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>>  fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
>>>  fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++--------
>>>  include/linux/iomap.h  |  5 +++++
>>>  3 files changed, 58 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
>>> index 562536da8a13..966826ad4bb9 100644
>>> --- a/fs/iomap/apply.c
>>> +++ b/fs/iomap/apply.c
>>> @@ -90,5 +90,29 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>>>  				     flags, &iomap);
>>>  	}
>>>  
>>> +	if (written && (flags & IOMAP_UNCACHED)) {
>>> +		struct address_space *mapping = inode->i_mapping;
>>> +
>>> +		end = pos + written;
>>> +		ret = filemap_write_and_wait_range(mapping, pos, end);
>>> +		if (ret)
>>> +			goto out;
>>> +
>>> +		/*
>>> +		 * No pages were created for this range, we're done
>>> +		 */
>>> +		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
>>> +			goto out;
>>> +
>>> +		/*
>>> +		 * Try to invalidate cache pages for the range we just wrote.
>>> +		 * We don't care if invalidation fails as the write has still
>>> +		 * worked and leaving clean uptodate pages in the page cache
>>> +		 * isn't a corruption vector for uncached IO.
>>> +		 */
>>> +		invalidate_inode_pages2_range(mapping,
>>> +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
>>> +	}
>>> +out:
>>>  	return written ? written : ret;
>>>  }
>>
>> Just a thought on further optimisation for this for XFS.
>> IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
>> methods by iomap_apply().  Hence the filesystems know that it is
>> an uncached IO that is being done, and we can tailor allocation
>> strategies to suit the fact that the data is going to be written
>> immediately.
>>
>> In this case, XFS needs to treat it the same way it treats direct
>> IO. That is, we do immediate unwritten extent allocation rather than
>> delayed allocation. This will reduce the allocation overhead and
>> will optimise for immediate IO locality rather than optimise for
>> delayed allocation.
>>
>> This should just be a relatively simple change to
>> xfs_file_iomap_begin() along the lines of:
>>
>> -	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
>> -			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
>> +	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
>> +	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>> 		/* Reserve delalloc blocks for regular writeback. */
>> 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
>> 				iomap);
>> 	}
>>
>> so that it avoids delayed allocation for uncached IO...
> 
> That's very handy! Thanks, I'll add that to the next version. Just out
> of curiosity, would you prefer this as a separate patch, or just bundle
> it with the iomap buffered RWF_UNCACHED patch? I'm assuming the latter,
> and I'll just mention it in the changelog.

OK, since it's in XFS, it'd be a separate patch. The code you quote seems
to be something out-of-tree?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-12 22:18                 ` Dave Chinner
@ 2019-12-13  1:32                   ` Chris Mason
  2020-01-07 17:42                     ` Christoph Hellwig
  2020-02-01 10:33                     ` Andres Freund
  0 siblings, 2 replies; 67+ messages in thread
From: Chris Mason @ 2019-12-13  1:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Linus Torvalds, Linux-MM, linux-fsdevel, linux-block,
	Matthew Wilcox, Johannes Weiner

On 12 Dec 2019, at 17:18, Dave Chinner wrote:

> On Wed, Dec 11, 2019 at 06:09:14PM -0700, Jens Axboe wrote:
>>
>> So Chris and I started talking about this, and pondered "what would
>> happen if we simply bypassed the page cache completely?". Case in
>> point, see below incremental patch. We still do the page cache lookup,
>> and use that page to copy from if it's there. If the page isn't there,
>> allocate one and do IO to it, but DON'T add it to the page cache. With
>> that, we're almost at O_DIRECT levels of performance for the 4k read
>> case, within 1-2%. I think 512b would look awesome, but we're reading
>> full pages, so that won't really help us much. Compared to the previous
>> uncached method, this is 30% faster on this device. That's substantial.
>
> Interesting idea, but this seems like it is just direct IO with
> kernel pages and a memcpy() rather than mapping user pages, and it
> has none of the advantages of direct IO, such as being able to run
> reads and writes concurrently, because it's going through the
> buffered IO path.
>
> It also needs all the special DIO truncate/hole punch serialisation
> mechanisms to be propagated into the buffered IO path - the
> requirement for inode_dio_wait() serialisation is something I'm
> trying to remove from XFS, not have to add into more paths. And it
> introduces the same issues with other buffered read/mmap access to
> the same file ranges as direct IO has.
>
>> Obviously this has issues with truncate that would need to be
>> resolved, and it's definitely dirtier. But the performance is very
>> enticing...
>
> At which point I have to ask: why are we considering repeating the
> mistakes that were made with direct IO?  Yes, it might be faster
> than a coherent RWF_UNCACHED IO implementation, but I don't think
> making it more like O_DIRECT is worth the price.
>
> And, ultimately, RWF_UNCACHED will never be as fast as direct IO
> because it *requires* the CPU to copy the data at least once.

They just have different tradeoffs.  O_DIRECT actively blows away caches 
and can also force writes during reads, making RWF_UNCACHED a more 
natural fit for some applications.  There are fewer surprises, and some 
services are willing to pay for flexibility with a memcpy.  In general, 
they still want to do some cache management because it reduces p90+ 
latencies across the board, and gives them more control over which pages 
stay in cache.

Most services using buffered IO here as part of their main workload are 
pairing it with sync_file_range() and sometimes fadvise DONT_NEED.  
We've seen kswapd saturating cores with much slower flash than the fancy 
stuff Jens is using, and the solution is usually O_DIRECT or fadvise.
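
For reference, that pattern boils down to something like the sketch 
below (userspace C, error handling and real chunk management omitted, 
and the helper name is made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Write one chunk, kick writeback for just that range, then drop it. */
static void write_then_drop(int fd, const void *buf, size_t len, off_t off)
{
	if (pwrite(fd, buf, len, off) < 0)
		return;

	/* Start (and wait for) writeback of only this range */
	sync_file_range(fd, off, len,
			SYNC_FILE_RANGE_WAIT_BEFORE |
			SYNC_FILE_RANGE_WRITE |
			SYNC_FILE_RANGE_WAIT_AFTER);

	/* Pages are clean now, so ask the kernel to toss them */
	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}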

Grepping through the code shows a wonderful assortment of helpers to 
control the cache, and RWF_UNCACHED would be both cleaner and faster 
than what we have today.  I'm on the fence about asking for 
RWF_FILE_RANGE_WRITE (+/- naming) to force writes to start without 
pitching pages, but we can talk to some service owners to see how useful 
that would be.   They can always chain a sync_file_range() in io_uring, 
but an RWF_ flag would be lower overhead if it were a common pattern.
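
Just to make the io_uring variant of that chain concrete, it would be 
roughly the following (liburing sketch, assuming a kernel with 
IORING_OP_SYNC_FILE_RANGE; SQE NULL checks and completion handling are 
omitted, and the helper name is made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>

/*
 * Queue a buffered write linked to a sync_file_range() on the same
 * range, so writeback starts as soon as the write completes.
 */
static int queue_write_and_sync(struct io_uring *ring, int fd,
				const struct iovec *iov, off_t off)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_writev(sqe, fd, iov, 1, off);
	sqe->flags |= IOSQE_IO_LINK;	/* sync only runs after the write */

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_sync_file_range(sqe, fd, iov->iov_len, off,
				      SYNC_FILE_RANGE_WRITE);

	return io_uring_submit(ring);
}

A single RWF_UNCACHED write would collapse that into one SQE and take 
care of dropping the pages as well.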

With all of that said, I really agree that xfs+O_DIRECT wins on write 
concurrency.  Jens's current patches are a great first step, but I think 
that if he really loved us, Jens would carve up a concurrent pageless 
write patch series before Christmas.

> Direct
> IO is zero-copy, and so it's always going to have lower overhead
> than RWF_UNCACHED, and so when CPU or memory bandwidth is the
> limiting factor, O_DIRECT will always be faster.
>
> IOWs, I think trying to make RWF_UNCACHED as fast as O_DIRECT is a
> fool's game and attempting to do so is taking a step in the wrong
> direction architecturally.  I'd much prefer a sane IO model for
> RWF_UNCACHED that provides coherency w/ mmap and other buffered IO
> than compromise these things in the chase for ultimate performance.

No matter what I wrote in my letters to Santa this year, I agree that we 
shouldn't compromise on avoiding the warts from O_DIRECT.

-chris

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-13  0:57       ` Jens Axboe
@ 2019-12-16  4:17         ` Dave Chinner
  2019-12-17 14:31           ` Jens Axboe
  0 siblings, 1 reply; 67+ messages in thread
From: Dave Chinner @ 2019-12-16  4:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On Thu, Dec 12, 2019 at 05:57:57PM -0700, Jens Axboe wrote:
> On 12/12/19 5:54 PM, Jens Axboe wrote:
> > On 12/12/19 3:34 PM, Dave Chinner wrote:
> >> On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
> >>> This adds support for RWF_UNCACHED for file systems using iomap to
> >>> perform buffered writes. We use the generic infrastructure for this,
> >>> by tracking pages we created and calling write_drop_cached_pages()
> >>> to issue writeback and prune those pages.
> >>>
> >>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>> ---
> >>>  fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
> >>>  fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++--------
> >>>  include/linux/iomap.h  |  5 +++++
> >>>  3 files changed, 58 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> >>> index 562536da8a13..966826ad4bb9 100644
> >>> --- a/fs/iomap/apply.c
> >>> +++ b/fs/iomap/apply.c
> >>> @@ -90,5 +90,29 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> >>>  				     flags, &iomap);
> >>>  	}
> >>>  
> >>> +	if (written && (flags & IOMAP_UNCACHED)) {
> >>> +		struct address_space *mapping = inode->i_mapping;
> >>> +
> >>> +		end = pos + written;
> >>> +		ret = filemap_write_and_wait_range(mapping, pos, end);
> >>> +		if (ret)
> >>> +			goto out;
> >>> +
> >>> +		/*
> >>> +		 * No pages were created for this range, we're done
> >>> +		 */
> >>> +		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
> >>> +			goto out;
> >>> +
> >>> +		/*
> >>> +		 * Try to invalidate cache pages for the range we just wrote.
> >>> +		 * We don't care if invalidation fails as the write has still
> >>> +		 * worked and leaving clean uptodate pages in the page cache
> >>> +		 * isn't a corruption vector for uncached IO.
> >>> +		 */
> >>> +		invalidate_inode_pages2_range(mapping,
> >>> +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> >>> +	}
> >>> +out:
> >>>  	return written ? written : ret;
> >>>  }
> >>
> >> Just a thought on further optimisation for this for XFS.
> >> IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
> >> methods by iomap_apply().  Hence the filesystems know that it is
> >> an uncached IO that is being done, and we can tailor allocation
> >> strategies to suit the fact that the data is going to be written
> >> immediately.
> >>
> >> In this case, XFS needs to treat it the same way it treats direct
> >> IO. That is, we do immediate unwritten extent allocation rather than
> >> delayed allocation. This will reduce the allocation overhead and
> >> will optimise for immediate IO locality rather than optimise for
> >> delayed allocation.
> >>
> >> This should just be a relatively simple change to
> >> xfs_file_iomap_begin() along the lines of:
> >>
> >> -	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
> >> -			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> >> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
> >> +	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
> >> +	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> >> 		/* Reserve delalloc blocks for regular writeback. */
> >> 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
> >> 				iomap);
> >> 	}
> >>
> >> so that it avoids delayed allocation for uncached IO...
> > 
> > That's very handy! Thanks, I'll add that to the next version. Just out
> > of curiosity, would you prefer this as a separate patch, or just bundle
> > it with the iomap buffered RWF_UNCACHED patch? I'm assuming the latter,
> > and I'll just mention it in the changelog.
> 
> OK, since it's in XFS, it'd be a separate patch.

*nod*

> The code you quote seems
> to be something out-of-tree?

Ah, I quoted the code in the 5.4 release branch, not the 5.5-rc1
tree. I'd forgotten that the xfs_file_iomap_begin() got massively
refactored in the 5.5 merge and I hadn't updated my cscope trees. So
I'm guessing you want to go looking for the
xfs_buffered_write_iomap_begin() and add another case to this
initial branch:

        /* we can't use delayed allocations when using extent size hints */
        if (xfs_get_extsz_hint(ip))
                return xfs_direct_write_iomap_begin(inode, offset, count,
                                flags, iomap, srcmap);

To make the buffered write IO go down the direct IO allocation path...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-16  4:17         ` Dave Chinner
@ 2019-12-17 14:31           ` Jens Axboe
  2019-12-18  0:49             ` Dave Chinner
  0 siblings, 1 reply; 67+ messages in thread
From: Jens Axboe @ 2019-12-17 14:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On 12/15/19 9:17 PM, Dave Chinner wrote:
> On Thu, Dec 12, 2019 at 05:57:57PM -0700, Jens Axboe wrote:
>> On 12/12/19 5:54 PM, Jens Axboe wrote:
>>> On 12/12/19 3:34 PM, Dave Chinner wrote:
>>>> On Wed, Dec 11, 2019 at 08:29:43AM -0700, Jens Axboe wrote:
>>>>> This adds support for RWF_UNCACHED for file systems using iomap to
>>>>> perform buffered writes. We use the generic infrastructure for this,
>>>>> by tracking pages we created and calling write_drop_cached_pages()
>>>>> to issue writeback and prune those pages.
>>>>>
>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>> ---
>>>>>  fs/iomap/apply.c       | 24 ++++++++++++++++++++++++
>>>>>  fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++--------
>>>>>  include/linux/iomap.h  |  5 +++++
>>>>>  3 files changed, 58 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
>>>>> index 562536da8a13..966826ad4bb9 100644
>>>>> --- a/fs/iomap/apply.c
>>>>> +++ b/fs/iomap/apply.c
>>>>> @@ -90,5 +90,29 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>>>>>  				     flags, &iomap);
>>>>>  	}
>>>>>  
>>>>> +	if (written && (flags & IOMAP_UNCACHED)) {
>>>>> +		struct address_space *mapping = inode->i_mapping;
>>>>> +
>>>>> +		end = pos + written;
>>>>> +		ret = filemap_write_and_wait_range(mapping, pos, end);
>>>>> +		if (ret)
>>>>> +			goto out;
>>>>> +
>>>>> +		/*
>>>>> +		 * No pages were created for this range, we're done
>>>>> +		 */
>>>>> +		if (!(iomap.flags & IOMAP_F_PAGE_CREATE))
>>>>> +			goto out;
>>>>> +
>>>>> +		/*
>>>>> +		 * Try to invalidate cache pages for the range we just wrote.
>>>>> +		 * We don't care if invalidation fails as the write has still
>>>>> +		 * worked and leaving clean uptodate pages in the page cache
>>>>> +		 * isn't a corruption vector for uncached IO.
>>>>> +		 */
>>>>> +		invalidate_inode_pages2_range(mapping,
>>>>> +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
>>>>> +	}
>>>>> +out:
>>>>>  	return written ? written : ret;
>>>>>  }
>>>>
>>>> Just a thought on further optimisation for this for XFS.
>>>> IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
>>>> methods by iomap_apply().  Hence the filesystems know that it is
>>>> an uncached IO that is being done, and we can tailor allocation
>>>> strategies to suit the fact that the data is going to be written
>>>> immediately.
>>>>
>>>> In this case, XFS needs to treat it the same way it treats direct
>>>> IO. That is, we do immediate unwritten extent allocation rather than
>>>> delayed allocation. This will reduce the allocation overhead and
>>>> will optimise for immediate IO locality rather than optimise for
>>>> delayed allocation.
>>>>
>>>> This should just be a relatively simple change to
>>>> xfs_file_iomap_begin() along the lines of:
>>>>
>>>> -	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
>>>> -			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>>>> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
>>>> +	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
>>>> +	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>>>> 		/* Reserve delalloc blocks for regular writeback. */
>>>> 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
>>>> 				iomap);
>>>> 	}
>>>>
>>>> so that it avoids delayed allocation for uncached IO...
>>>
>>> That's very handy! Thanks, I'll add that to the next version. Just out
>>> of curiosity, would you prefer this as a separate patch, or just bundle
>>> it with the iomap buffered RWF_UNCACHED patch? I'm assuming the latter,
>>> and I'll just mention it in the changelog.
>>
>> OK, since it's in XFS, it'd be a separate patch.
> 
> *nod*
> 
>> The code you quote seems
>> to be something out-of-tree?
> 
> Ah, I quoted the code in the 5.4 release branch, not the 5.5-rc1
> tree. I'd forgotten that the xfs_file_iomap_begin() got massively
> refactored in the 5.5 merge and I hadn't updated my cscope trees. So
> I'm guessing you want to go looking for the
> xfs_buffered_write_iomap_begin() and add another case to this
> initial branch:
> 
>         /* we can't use delayed allocations when using extent size hints */
>         if (xfs_get_extsz_hint(ip))
>                 return xfs_direct_write_iomap_begin(inode, offset, count,
>                                 flags, iomap, srcmap);
> 
> To make the buffered write IO go down the direct IO allocation path...

Makes it even simpler! Something like this:


commit 1783722cd4b7088a3c004462c7ae610b8e42b720
Author: Jens Axboe <axboe@kernel.dk>
Date:   Tue Dec 17 07:30:04 2019 -0700

    xfs: don't do delayed allocations for uncached buffered writes
    
    This data is going to be written immediately, so don't bother trying
    to do delayed allocation for it.
    
    Suggested-by: Dave Chinner <david@fromorbit.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 28e2d1f37267..d0cd4a05d59f 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -847,8 +847,11 @@ xfs_buffered_write_iomap_begin(
 	int			allocfork = XFS_DATA_FORK;
 	int			error = 0;
 
-	/* we can't use delayed allocations when using extent size hints */
-	if (xfs_get_extsz_hint(ip))
+	/*
+	 * Don't do delayed allocations when using extent size hints, or
+	 * if we were asked to do uncached buffered writes.
+	 */
+	if (xfs_get_extsz_hint(ip) || (flags & IOMAP_UNCACHED))
 		return xfs_direct_write_iomap_begin(inode, offset, count,
 				flags, iomap, srcmap);
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-17 14:31           ` Jens Axboe
@ 2019-12-18  0:49             ` Dave Chinner
  2019-12-18  1:01               ` Jens Axboe
  0 siblings, 1 reply; 67+ messages in thread
From: Dave Chinner @ 2019-12-18  0:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On Tue, Dec 17, 2019 at 07:31:51AM -0700, Jens Axboe wrote:
> On 12/15/19 9:17 PM, Dave Chinner wrote:
> > On Thu, Dec 12, 2019 at 05:57:57PM -0700, Jens Axboe wrote:
> >> On 12/12/19 5:54 PM, Jens Axboe wrote:
> >>> On 12/12/19 3:34 PM, Dave Chinner wrote:
> >>>> Just a thought on further optimisation for this for XFS.
> >>>> IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
> >>>> methods by iomap_apply().  Hence the filesystems know that it is
> >>>> an uncached IO that is being done, and we can tailor allocation
> >>>> strategies to suit the fact that the data is going to be written
> >>>> immediately.
> >>>>
> >>>> In this case, XFS needs to treat it the same way it treats direct
> >>>> IO. That is, we do immediate unwritten extent allocation rather than
> >>>> delayed allocation. This will reduce the allocation overhead and
> >>>> will optimise for immediate IO locality rather than optimise for
> >>>> delayed allocation.
> >>>>
> >>>> This should just be a relatively simple change to
> >>>> xfs_file_iomap_begin() along the lines of:
> >>>>
> >>>> -	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
> >>>> -			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> >>>> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
> >>>> +	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
> >>>> +	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> >>>> 		/* Reserve delalloc blocks for regular writeback. */
> >>>> 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
> >>>> 				iomap);
> >>>> 	}
> >>>>
> >>>> so that it avoids delayed allocation for uncached IO...
> >>>
> >>> That's very handy! Thanks, I'll add that to the next version. Just out
> >>> of curiosity, would you prefer this as a separate patch, or just bundle
> >>> it with the iomap buffered RWF_UNCACHED patch? I'm assuming the latter,
> >>> and I'll just mention it in the changelog.
> >>
> >> OK, since it's in XFS, it'd be a separate patch.
> > 
> > *nod*
> > 
> >> The code you quote seems
> >> to be something out-of-tree?
> > 
> > Ah, I quoted the code in the 5.4 release branch, not the 5.5-rc1
> > tree. I'd forgotten that the xfs_file_iomap_begin() got massively
> > refactored in the 5.5 merge and I hadn't updated my cscope trees. So
> > I'm guessing you want to go looking for the
> > xfs_buffered_write_iomap_begin() and add another case to this
> > initial branch:
> > 
> >         /* we can't use delayed allocations when using extent size hints */
> >         if (xfs_get_extsz_hint(ip))
> >                 return xfs_direct_write_iomap_begin(inode, offset, count,
> >                                 flags, iomap, srcmap);
> > 
> > To make the buffered write IO go down the direct IO allocation path...
> 
> Makes it even simpler! Something like this:
> 
> 
> commit 1783722cd4b7088a3c004462c7ae610b8e42b720
> Author: Jens Axboe <axboe@kernel.dk>
> Date:   Tue Dec 17 07:30:04 2019 -0700
> 
>     xfs: don't do delayed allocations for uncached buffered writes
>     
>     This data is going to be written immediately, so don't bother trying
>     to do delayed allocation for it.
>     
>     Suggested-by: Dave Chinner <david@fromorbit.com>
>     Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 28e2d1f37267..d0cd4a05d59f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -847,8 +847,11 @@ xfs_buffered_write_iomap_begin(
>  	int			allocfork = XFS_DATA_FORK;
>  	int			error = 0;
>  
> -	/* we can't use delayed allocations when using extent size hints */
> -	if (xfs_get_extsz_hint(ip))
> +	/*
> +	 * Don't do delayed allocations when using extent size hints, or
> +	 * if we were asked to do uncached buffered writes.
> +	 */
> +	if (xfs_get_extsz_hint(ip) || (flags & IOMAP_UNCACHED))
>  		return xfs_direct_write_iomap_begin(inode, offset, count,
>  				flags, iomap, srcmap);
>  

Yup, that's pretty much what I was thinking. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes
  2019-12-18  0:49             ` Dave Chinner
@ 2019-12-18  1:01               ` Jens Axboe
  0 siblings, 0 replies; 67+ messages in thread
From: Jens Axboe @ 2019-12-18  1:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-block, willy, clm, torvalds

On 12/17/19 5:49 PM, Dave Chinner wrote:
> On Tue, Dec 17, 2019 at 07:31:51AM -0700, Jens Axboe wrote:
>> On 12/15/19 9:17 PM, Dave Chinner wrote:
>>> On Thu, Dec 12, 2019 at 05:57:57PM -0700, Jens Axboe wrote:
>>>> On 12/12/19 5:54 PM, Jens Axboe wrote:
>>>>> On 12/12/19 3:34 PM, Dave Chinner wrote:
>>>>>> Just a thought on further optimisation for this for XFS.
>>>>>> IOMAP_UNCACHED is being passed into the filesystem ->iomap_begin
>>>>>> methods by iomap_apply().  Hence the filesystems know that it is
>>>>>> an uncached IO that is being done, and we can tailor allocation
>>>>>> strategies to suit the fact that the data is going to be written
>>>>>> immediately.
>>>>>>
>>>>>> In this case, XFS needs to treat it the same way it treats direct
>>>>>> IO. That is, we do immediate unwritten extent allocation rather than
>>>>>> delayed allocation. This will reduce the allocation overhead and
>>>>>> will optimise for immediate IO locality rather than optimise for
>>>>>> delayed allocation.
>>>>>>
>>>>>> This should just be a relatively simple change to
>>>>>> xfs_file_iomap_begin() along the lines of:
>>>>>>
>>>>>> -	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && !(flags & IOMAP_DIRECT) &&
>>>>>> -			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>>>>>> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) &&
>>>>>> +	    !(flags & (IOMAP_DIRECT | IOMAP_UNCACHED)) &&
>>>>>> +	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>>>>>> 		/* Reserve delalloc blocks for regular writeback. */
>>>>>> 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
>>>>>> 				iomap);
>>>>>> 	}
>>>>>>
>>>>>> so that it avoids delayed allocation for uncached IO...
>>>>>
>>>>> That's very handy! Thanks, I'll add that to the next version. Just out
>>>>> of curiosity, would you prefer this as a separate patch, or just bundle
>>>>> it with the iomap buffered RWF_UNCACHED patch? I'm assuming the latter,
>>>>> and I'll just mention it in the changelog.
>>>>
>>>> OK, since it's in XFS, it'd be a separate patch.
>>>
>>> *nod*
>>>
>>>> The code you quote seems
>>>> to be something out-of-tree?
>>>
>>> Ah, I quoted the code in the 5.4 release branch, not the 5.5-rc1
>>> tree. I'd forgotten that the xfs_file_iomap_begin() got massively
>>> refactored in the 5.5 merge and I hadn't updated my cscope trees. So
>>> I'm guessing you want to go looking for the
>>> xfs_buffered_write_iomap_begin() and add another case to this
>>> initial branch:
>>>
>>>         /* we can't use delayed allocations when using extent size hints */
>>>         if (xfs_get_extsz_hint(ip))
>>>                 return xfs_direct_write_iomap_begin(inode, offset, count,
>>>                                 flags, iomap, srcmap);
>>>
>>> To make the buffered write IO go down the direct IO allocation path...
>>
>> Makes it even simpler! Something like this:
>>
>>
>> commit 1783722cd4b7088a3c004462c7ae610b8e42b720
>> Author: Jens Axboe <axboe@kernel.dk>
>> Date:   Tue Dec 17 07:30:04 2019 -0700
>>
>>     xfs: don't do delayed allocations for uncached buffered writes
>>     
>>     This data is going to be written immediately, so don't bother trying
>>     to do delayed allocation for it.
>>     
>>     Suggested-by: Dave Chinner <david@fromorbit.com>
>>     Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 28e2d1f37267..d0cd4a05d59f 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -847,8 +847,11 @@ xfs_buffered_write_iomap_begin(
>>  	int			allocfork = XFS_DATA_FORK;
>>  	int			error = 0;
>>  
>> -	/* we can't use delayed allocations when using extent size hints */
>> -	if (xfs_get_extsz_hint(ip))
>> +	/*
>> +	 * Don't do delayed allocations when using extent size hints, or
>> +	 * if we were asked to do uncached buffered writes.
>> +	 */
>> +	if (xfs_get_extsz_hint(ip) || (flags & IOMAP_UNCACHED))
>>  		return xfs_direct_write_iomap_begin(inode, offset, count,
>>  				flags, iomap, srcmap);
>>  
> 
> Yup, that's pretty much what I was thinking. :)

Perfect, thanks for checking!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-13  1:32                   ` Chris Mason
@ 2020-01-07 17:42                     ` Christoph Hellwig
  2020-01-08 14:09                       ` Chris Mason
  2020-02-01 10:33                     ` Andres Freund
  1 sibling, 1 reply; 67+ messages in thread
From: Christoph Hellwig @ 2020-01-07 17:42 UTC (permalink / raw)
  To: Chris Mason
  Cc: Dave Chinner, Jens Axboe, Linus Torvalds, Linux-MM,
	linux-fsdevel, linux-block, Matthew Wilcox, Johannes Weiner

On Fri, Dec 13, 2019 at 01:32:10AM +0000, Chris Mason wrote:
> They just have different tradeoffs.  O_DIRECT actively blows away caches 
> and can also force writes during reads, making RWF_UNCACHED a more 
> natural fit for some applications.  There are fewer surprises, and some 
> services are willing to pay for flexibility with a memcpy.  In general, 
> they still want to do some cache management because it reduces p90+ 
> latencies across the board, and gives them more control over which pages 
> stay in cache.

We can always have a variant of O_DIRECT that doesn't do that and
instead checks if data is in the cache and then also copies to / from
it in that case.  I need some time to actually look through this series,
so it might be pretty similar to the implementation, but if defined
the right way it could be concurrent for at least the fast path of no
cached pages.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2020-01-07 17:42                     ` Christoph Hellwig
@ 2020-01-08 14:09                       ` Chris Mason
  0 siblings, 0 replies; 67+ messages in thread
From: Chris Mason @ 2020-01-08 14:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Jens Axboe, Linus Torvalds, Linux-MM,
	linux-fsdevel, linux-block, Matthew Wilcox, Johannes Weiner

On 7 Jan 2020, at 12:42, Christoph Hellwig wrote:

> On Fri, Dec 13, 2019 at 01:32:10AM +0000, Chris Mason wrote:
>> They just have different tradeoffs.  O_DIRECT actively blows away
>> caches and can also force writes during reads, making RWF_UNCACHED a
>> more natural fit for some applications.  There are fewer surprises,
>> and some services are willing to pay for flexibility with a memcpy.
>> In general, they still want to do some cache management because it
>> reduces p90+ latencies across the board, and gives them more control
>> over which pages stay in cache.
>
> We can always have a variant of O_DIRECT that doesn't do that and
> instead checks if data is in the cache and then also copies to / from
> it in that case.  I need some time to actually look through this
> series,
> so it might be pretty similar to the implementation, but if defined
> the right way it could be concurrent for at least the fast path of no
> cached pages.

Yeah, I really do think we can end up with a fairly unified solution 
through iomap:

* Allowing concurrent writes (xfs DIO does this now)
* Optionally doing zero copy if alignment is good (btrfs DIO does this 
now)
* Optionally tossing pages at the end (requires a separate syscall now)
* Supporting aio via io_uring

We could just call this O_DIRECT, but I like RWF_UNCACHED as a way to 
avoid surprises for people that know and love the existing O_DIRECT 
semantics.
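
From the application side it really is just one flag on the existing 
pwritev2() path, something like this sketch (RWF_UNCACHED is the flag 
proposed in this series, so it only builds against the patched uapi 
headers):

#define _GNU_SOURCE
#include <sys/uio.h>

/*
 * Buffered write that doesn't leave newly instantiated pages behind in
 * the page cache. RWF_UNCACHED comes from this patchset's uapi change,
 * it is not in any released kernel header.
 */
static ssize_t uncached_write(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *) buf, .iov_len = len };

	return pwritev2(fd, &iov, 1, off, RWF_UNCACHED);
}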

-chris

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED
  2019-12-13  1:32                   ` Chris Mason
  2020-01-07 17:42                     ` Christoph Hellwig
@ 2020-02-01 10:33                     ` Andres Freund
  1 sibling, 0 replies; 67+ messages in thread
From: Andres Freund @ 2020-02-01 10:33 UTC (permalink / raw)
  To: Chris Mason
  Cc: Dave Chinner, Jens Axboe, Linus Torvalds, Linux-MM,
	linux-fsdevel, linux-block, Matthew Wilcox, Johannes Weiner

Hi,

On 2019-12-13 01:32:10 +0000, Chris Mason wrote:
> Grepping through the code shows a wonderful assortment of helpers to
> control the cache, and RWF_UNCACHED would be both cleaner and faster
> than what we have today.  I'm on the fence about asking for
> RWF_FILE_RANGE_WRITE (+/- naming) to force writes to start without
> pitching pages, but we can talk to some service owners to see how useful
> that would be.   They can always chain a sync_file_range() in io_uring,
> but an RWF_ flag would be lower overhead if it were a common pattern.

FWIW, for postgres something that'd allow us to do writes that

a) Doesn't remove pages from the pagecache if they're already there.
b) Doesn't delay writeback to some unpredictable point later.

   The delayed writeback causes both latency issues and often
   under-utilizes write bandwidth for a while. For most cases where we
   write, we know that we're not likely to write the same page again soon.

c) Doesn't (except maybe temporarily) bring pages into the pagecache, if
   they weren't before.

   In the cases where the page previously wasn't in the page cache and
   we wrote it out, it has very likely been resident in our cache for
   long enough that the kernel caching it for the future isn't useful.

would be really helpful. Right now we simulate that to some degree by
doing normal buffered writes followed by sync_file_range(WRITE).

For most environments always using O_DIRECT isn't really an option for
us: we can't rely on settings being tuned well enough (i.e. using a
large enough application cache), and we also want to keep supporting
setups where using a large enough postgres buffer cache isn't an
option because it'd prevent putting a number of variably used database
servers on one piece of hardware.

(There's also postgres side issues preventing us from doing O_DIRECT
performantly, partially because we so far couldn't rely on AIO, due to
also using buffered IO, but we're fixing that now.)


For us a per-request interface where we'd have to fulfill all the
requirements for O_DIRECT, but where neither reads nor writes would
cause a page to move in/out of the pagecache, would be optimal for a
good part of our IO. Especially when we still could get zero-copy IO for
the pretty common case that there's no pagecache presence for a file at
all.

That'd allow us to use zero copy writes for the common case of a file's
data fitting entirely in our cache, where we only occasionally write the
data out at checkpoints. And do zero copy reads for the cases where
we know it's unnecessary for the kernel to cache (e.g. because we are
scanning a few TB of data on a machine with less memory, because we're
re-filling our cache after a restart, or because it's a maintenance
operation doing the reading). But still rely on the kernel page cache
for other reads where the kernel caching when memory is available is a
huge benefit.  Some well tuned workloads would turn that off, to only
use O_DIRECT, but everyone else would benefit from that being the
default.

We can concoct an approximation of that behaviour with a mix of
sync_file_range() (to force writeback), RWF_NOWAIT (to see if we should
read with O_DIRECT) and mmap()/mincore()/munmap() (to determine if
writes should use O_DIRECT). But that's quite a bit of overhead.
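
For illustration, just the mmap()/mincore()/munmap() half of that is 
already something like this per write range (a sketch that assumes a 
page aligned offset and length and ignores errors):

#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Check whether any page of [off, off + len) is resident in the page
 * cache, to decide whether a write should go buffered or O_DIRECT.
 */
static bool range_in_page_cache(int fd, off_t off, size_t len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t pages = (len + page_size - 1) / page_size;
	unsigned char vec[pages];
	bool resident = false;
	void *addr;

	addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
	if (addr == MAP_FAILED)
		return false;

	if (mincore(addr, len, vec) == 0) {
		for (size_t i = 0; i < pages; i++) {
			if (vec[i] & 1) {
				resident = true;
				break;
			}
		}
	}

	munmap(addr, len);
	return resident;
}

Doing that mapping dance (plus the munmap() TLB shootdown) for every 
write is exactly the overhead a per-request flag would make go away.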

The reason that specifying this on a per-request basis would be useful
is mainly that that would allow us to avoid having to either have two
sets of FDs, or having to turn O_DIRECT on/off with fcntl.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2020-02-01 10:33 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-11 15:29 [PATCHSET v3 0/5] Support for RWF_UNCACHED Jens Axboe
2019-12-11 15:29 ` [PATCH 1/5] fs: add read support " Jens Axboe
2019-12-11 15:29 ` [PATCH 2/5] mm: make generic_perform_write() take a struct kiocb Jens Axboe
2019-12-11 15:29 ` [PATCH 3/5] mm: make buffered writes work with RWF_UNCACHED Jens Axboe
2019-12-11 15:29 ` [PATCH 4/5] iomap: pass in the write_begin/write_end flags to iomap_actor Jens Axboe
2019-12-11 17:19   ` Linus Torvalds
2019-12-11 17:19     ` Linus Torvalds
2019-12-11 15:29 ` [PATCH 5/5] iomap: support RWF_UNCACHED for buffered writes Jens Axboe
2019-12-11 17:19   ` Matthew Wilcox
2019-12-11 18:05     ` Jens Axboe
2019-12-12 22:34   ` Dave Chinner
2019-12-13  0:54     ` Jens Axboe
2019-12-13  0:57       ` Jens Axboe
2019-12-16  4:17         ` Dave Chinner
2019-12-17 14:31           ` Jens Axboe
2019-12-18  0:49             ` Dave Chinner
2019-12-18  1:01               ` Jens Axboe
2019-12-11 17:37 ` [PATCHSET v3 0/5] Support for RWF_UNCACHED Linus Torvalds
2019-12-11 17:37   ` Linus Torvalds
2019-12-11 17:56   ` Jens Axboe
2019-12-11 19:14     ` Linus Torvalds
2019-12-11 19:14       ` Linus Torvalds
2019-12-11 19:34     ` Jens Axboe
2019-12-11 20:03       ` Linus Torvalds
2019-12-11 20:03         ` Linus Torvalds
2019-12-11 20:08         ` Jens Axboe
2019-12-11 20:18           ` Linus Torvalds
2019-12-11 20:18             ` Linus Torvalds
2019-12-11 21:04             ` Johannes Weiner
2019-12-12  1:30               ` Jens Axboe
2019-12-11 23:41             ` Jens Axboe
2019-12-12  1:08               ` Linus Torvalds
2019-12-12  1:08                 ` Linus Torvalds
2019-12-12  1:11                 ` Jens Axboe
2019-12-12  1:22                   ` Linus Torvalds
2019-12-12  1:22                     ` Linus Torvalds
2019-12-12  1:29                     ` Jens Axboe
2019-12-12  1:41                       ` Linus Torvalds
2019-12-12  1:41                         ` Linus Torvalds
2019-12-12  1:56                         ` Matthew Wilcox
2019-12-12  2:47                           ` Linus Torvalds
2019-12-12  2:47                             ` Linus Torvalds
2019-12-12 17:52                             ` Matthew Wilcox
2019-12-12 18:29                               ` Linus Torvalds
2019-12-12 18:29                                 ` Linus Torvalds
2019-12-12 20:05                                 ` Matthew Wilcox
2019-12-12  1:41                       ` Jens Axboe
2019-12-12  1:49                         ` Linus Torvalds
2019-12-12  1:49                           ` Linus Torvalds
2019-12-12  1:09               ` Jens Axboe
2019-12-12  2:03                 ` Jens Axboe
2019-12-12  2:10                   ` Jens Axboe
2019-12-12  2:21                   ` Matthew Wilcox
2019-12-12  2:38                     ` Jens Axboe
2019-12-12 22:18                 ` Dave Chinner
2019-12-13  1:32                   ` Chris Mason
2020-01-07 17:42                     ` Christoph Hellwig
2020-01-08 14:09                       ` Chris Mason
2020-02-01 10:33                     ` Andres Freund
2019-12-11 20:43           ` Matthew Wilcox
2019-12-11 20:04       ` Jens Axboe
2019-12-12 10:44 ` Martin Steigerwald
2019-12-12 15:16   ` Jens Axboe
2019-12-12 21:45     ` Martin Steigerwald
2019-12-12 22:15       ` Jens Axboe
2019-12-12 22:18     ` Linus Torvalds
2019-12-12 22:18       ` Linus Torvalds
