* [PATCH v3 0/8] Create large folios in iomap buffered write path
@ 2023-06-12 20:39 Matthew Wilcox (Oracle)
  2023-06-12 20:39 ` [PATCH v3 1/8] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Matthew Wilcox (Oracle)
                   ` (8 more replies)
  0 siblings, 9 replies; 33+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-12 20:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
to improve worst-case latency.  Unfortunately, this had the effect of
limiting the performance of:

fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
        -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
        -numjobs=4 -directory=/mnt/test

The problem ends up being lock contention on the i_pages spinlock as we
clear the writeback bit on each folio (and propagate that up through
the tree).  By using larger folios, we decrease the number of folios
to be processed by a factor of 256 for this benchmark, eliminating the
lock contention.

It's also the right thing to do.  This is a project that has been on
the back burner for years; it just hasn't been important enough to do
before now.

I think it's probably best if this goes through the iomap tree since
the changes outside iomap are either to the page cache or they're
trivial.

v3:
 - Fix the handling of compound highmem pages in copy_page_from_iter_atomic()
 - Rename fgp_t to fgf_t
 - Clarify some wording in the documentation

v2:
 - Fix misplaced semicolon
 - Rename fgp_order to fgp_set_order
 - Rename FGP_ORDER to FGP_GET_ORDER
 - Add fgp_t
 - Update the documentation for ->release_folio
 - Fix iomap_invalidate_folio()
 - Update iomap_release_folio()

Matthew Wilcox (Oracle) (8):
  iov_iter: Handle compound highmem pages in
    copy_page_from_iter_atomic()
  iomap: Remove large folio handling in iomap_invalidate_folio()
  doc: Correct the description of ->release_folio
  iomap: Remove unnecessary test from iomap_release_folio()
  filemap: Add fgf_t typedef
  filemap: Allow __filemap_get_folio to allocate large folios
  iomap: Create large folios in the buffered write path
  iomap: Copy larger chunks from userspace

 Documentation/filesystems/locking.rst | 15 ++++--
 fs/btrfs/file.c                       |  6 +--
 fs/f2fs/compress.c                    |  2 +-
 fs/f2fs/f2fs.h                        |  2 +-
 fs/gfs2/bmap.c                        |  2 +-
 fs/iomap/buffered-io.c                | 43 ++++++++--------
 include/linux/iomap.h                 |  2 +-
 include/linux/pagemap.h               | 71 ++++++++++++++++++++++-----
 lib/iov_iter.c                        | 43 ++++++++++------
 mm/filemap.c                          | 61 ++++++++++++-----------
 mm/folio-compat.c                     |  2 +-
 mm/readahead.c                        | 13 -----
 12 files changed, 159 insertions(+), 103 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v3 1/8] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()
  2023-06-12 20:39 [PATCH v3 0/8] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
@ 2023-06-12 20:39 ` Matthew Wilcox (Oracle)
  2023-06-13  4:52   ` Christoph Hellwig
  2023-06-12 20:39 ` [PATCH v3 2/8] iomap: Remove large folio handling in iomap_invalidate_folio() Matthew Wilcox (Oracle)
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-12 20:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

copy_page_from_iter_atomic() already handles !highmem compound
pages correctly, but if we are passed a highmem compound page,
each base page needs to be mapped & unmapped individually.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 lib/iov_iter.c | 43 ++++++++++++++++++++++++++++---------------
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 960223ed9199..1a3fbda0c508 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -857,24 +857,37 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_zero);
 
-size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
-				  struct iov_iter *i)
+size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
+		size_t bytes, struct iov_iter *i)
 {
-	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
-	if (!page_copy_sane(page, offset, bytes)) {
-		kunmap_atomic(kaddr);
+	size_t n, copied = 0;
+
+	if (!page_copy_sane(page, offset, bytes))
 		return 0;
-	}
-	if (WARN_ON_ONCE(!i->data_source)) {
-		kunmap_atomic(kaddr);
+	if (WARN_ON_ONCE(!i->data_source))
 		return 0;
-	}
-	iterate_and_advance(i, bytes, base, len, off,
-		copyin(p + off, base, len),
-		memcpy_from_iter(i, p + off, base, len)
-	)
-	kunmap_atomic(kaddr);
-	return bytes;
+
+	do {
+		char *kaddr;
+
+		n = bytes - copied;
+		if (PageHighMem(page)) {
+			page += offset / PAGE_SIZE;
+			offset %= PAGE_SIZE;
+			n = min_t(size_t, n, PAGE_SIZE - offset);
+		}
+
+		kaddr = kmap_atomic(page) + offset;
+		iterate_and_advance(i, n, base, len, off,
+			copyin(kaddr + off, base, len),
+			memcpy_from_iter(i, kaddr + off, base, len)
+		)
+		kunmap_atomic(kaddr);
+		copied += n;
+		offset += n;
+	} while (PageHighMem(page) && copied != bytes && n > 0);
+
+	return copied;
 }
 EXPORT_SYMBOL(copy_page_from_iter_atomic);
 
-- 
2.39.2



* [PATCH v3 2/8] iomap: Remove large folio handling in iomap_invalidate_folio()
  2023-06-12 20:39 [PATCH v3 0/8] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
  2023-06-12 20:39 ` [PATCH v3 1/8] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Matthew Wilcox (Oracle)
@ 2023-06-12 20:39 ` Matthew Wilcox (Oracle)
  2023-06-12 20:39 ` [PATCH v3 3/8] doc: Correct the description of ->release_folio Matthew Wilcox (Oracle)
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 33+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-12 20:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong, Christoph Hellwig

We do not need to release the iomap_page in iomap_invalidate_folio()
to allow the folio to be split.  The splitting code will call
->release_folio() if there is still per-fs private data attached to
the folio.  At that point, we will check if the folio is still dirty
and decline to release the iomap_page.  It is possible to trigger the
warning in perfectly legitimate circumstances (eg if a disk read fails,
we do a partial write to the folio, then we truncate the folio), which
will cause those writes to be lost.

Fixes: 60d8231089f0 ("iomap: Support large folios in invalidatepage")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 063133ec77f4..08ee293c4117 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -508,11 +508,6 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 		WARN_ON_ONCE(folio_test_writeback(folio));
 		folio_cancel_dirty(folio);
 		iomap_page_release(folio);
-	} else if (folio_test_large(folio)) {
-		/* Must release the iop so the page can be split */
-		WARN_ON_ONCE(!folio_test_uptodate(folio) &&
-			     folio_test_dirty(folio));
-		iomap_page_release(folio);
 	}
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
-- 
2.39.2



* [PATCH v3 3/8] doc: Correct the description of ->release_folio
  2023-06-12 20:39 [PATCH v3 0/8] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
  2023-06-12 20:39 ` [PATCH v3 1/8] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Matthew Wilcox (Oracle)
  2023-06-12 20:39 ` [PATCH v3 2/8] iomap: Remove large folio handling in iomap_invalidate_folio() Matthew Wilcox (Oracle)
@ 2023-06-12 20:39 ` Matthew Wilcox (Oracle)
  2023-06-13  4:53   ` Christoph Hellwig
  2023-06-12 20:39 ` [PATCH v3 4/8] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-12 20:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

The filesystem ->release_folio method is called under more circumstances
now than when the documentation was written.  The second sentence
describing the interpretation of the return value is the wrong polarity
(false indicates failure, not success).  And the third sentence is also
wrong (the kernel calls try_to_free_buffers() instead).

So replace the entire paragraph with a detailed description of what the
state of the folio may be, the meaning of the gfp parameter, why the
method is being called and what the filesystem is expected to do.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 Documentation/filesystems/locking.rst | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index aa1a233b0fa8..b52ad5a79d94 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -374,10 +374,17 @@ invalidate_lock before invalidating page cache in truncate / hole punch
 path (and thus calling into ->invalidate_folio) to block races between page
 cache invalidation and page cache filling functions (fault, read, ...).
 
-->release_folio() is called when the kernel is about to try to drop the
-buffers from the folio in preparation for freeing it.  It returns false to
-indicate that the buffers are (or may be) freeable.  If ->release_folio is
-NULL, the kernel assumes that the fs has no private interest in the buffers.
+->release_folio() is called when the MM wants to make a change to the
+folio that would invalidate the filesystem's private data.  For example,
+it may be about to be removed from the address_space or split.  The folio
+is locked and not under writeback.  It may be dirty.  The gfp parameter
+is not usually used for allocation, but rather to indicate what the
+filesystem may do to attempt to free the private data.  The filesystem may
+return false to indicate that the folio's private data cannot be freed.
+If it returns true, it should have already removed the private data from
+the folio.  If a filesystem does not provide a ->release_folio method,
+the pagecache will assume that private data is buffer_heads and call
+try_to_free_buffers().
 
 ->free_folio() is called when the kernel has dropped the folio
 from the page cache.
-- 
2.39.2



* [PATCH v3 4/8] iomap: Remove unnecessary test from iomap_release_folio()
  2023-06-12 20:39 [PATCH v3 0/8] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (2 preceding siblings ...)
  2023-06-12 20:39 ` [PATCH v3 3/8] doc: Correct the description of ->release_folio Matthew Wilcox (Oracle)
@ 2023-06-12 20:39 ` Matthew Wilcox (Oracle)
  2023-06-13  4:53   ` Christoph Hellwig
  2023-06-13 16:19   ` Matthew Wilcox
  2023-06-12 20:39 ` [PATCH v3 5/8] filemap: Add fgf_t typedef Matthew Wilcox (Oracle)
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 33+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-12 20:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

The check for the folio being under writeback is unnecessary; the caller
has checked this and the folio is locked, so the folio cannot be under
writeback at this point.

The comment is somewhat misleading in that it talks about one specific
situation in which we can see a dirty folio.  There are others, so change
the comment to explain why we can't release the iomap_page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 08ee293c4117..2054b85c9d9b 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -483,12 +483,10 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
 			folio_size(folio));
 
 	/*
-	 * mm accommodates an old ext3 case where clean folios might
-	 * not have had the dirty bit cleared.  Thus, it can send actual
-	 * dirty folios to ->release_folio() via shrink_active_list();
-	 * skip those here.
+	 * If the folio is dirty, we refuse to release our metadata because
+	 * it may be partially dirty (FIXME, add a test for that).
 	 */
-	if (folio_test_dirty(folio) || folio_test_writeback(folio))
+	if (folio_test_dirty(folio))
 		return false;
 	iomap_page_release(folio);
 	return true;
-- 
2.39.2



* [PATCH v3 5/8] filemap: Add fgf_t typedef
  2023-06-12 20:39 [PATCH v3 0/8] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (3 preceding siblings ...)
  2023-06-12 20:39 ` [PATCH v3 4/8] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
@ 2023-06-12 20:39 ` Matthew Wilcox (Oracle)
  2023-06-13  4:53   ` Christoph Hellwig
  2023-06-12 20:39 ` [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-12 20:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

Similarly to gfp_t, define fgf_t as its own type to prevent various
misuses and confusion.  Leave the flags as FGP_* for now to reduce the
size of this patch; they will be converted to FGF_* later.  Move the
documentation to the definition of the type instead of burying it in the
__filemap_get_folio() documentation.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/btrfs/file.c         |  6 +++---
 fs/f2fs/compress.c      |  2 +-
 fs/f2fs/f2fs.h          |  2 +-
 fs/iomap/buffered-io.c  |  2 +-
 include/linux/pagemap.h | 48 +++++++++++++++++++++++++++++++----------
 mm/filemap.c            | 19 ++--------------
 mm/folio-compat.c       |  2 +-
 7 files changed, 46 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f649647392e0..934a92ca4785 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -876,9 +876,9 @@ static int prepare_uptodate_page(struct inode *inode,
 	return 0;
 }
 
-static unsigned int get_prepare_fgp_flags(bool nowait)
+static fgf_t get_prepare_fgp_flags(bool nowait)
 {
-	unsigned int fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
+	fgf_t fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
 
 	if (nowait)
 		fgp_flags |= FGP_NOWAIT;
@@ -910,7 +910,7 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
 	int i;
 	unsigned long index = pos >> PAGE_SHIFT;
 	gfp_t mask = get_prepare_gfp_flags(inode, nowait);
-	unsigned int fgp_flags = get_prepare_fgp_flags(nowait);
+	fgf_t fgp_flags = get_prepare_fgp_flags(nowait);
 	int err = 0;
 	int faili;
 
diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c
index 11653fa79289..b42feec69175 100644
--- a/fs/f2fs/compress.c
+++ b/fs/f2fs/compress.c
@@ -1019,7 +1019,7 @@ static int prepare_compress_overwrite(struct compress_ctx *cc,
 	struct address_space *mapping = cc->inode->i_mapping;
 	struct page *page;
 	sector_t last_block_in_bio;
-	unsigned fgp_flag = FGP_LOCK | FGP_WRITE | FGP_CREAT;
+	fgf_t fgp_flag = FGP_LOCK | FGP_WRITE | FGP_CREAT;
 	pgoff_t start_idx = start_idx_of_cluster(cc);
 	int i, ret;
 
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index d211ee89c158..13b35db3d9c6 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -2715,7 +2715,7 @@ static inline struct page *f2fs_grab_cache_page(struct address_space *mapping,
 
 static inline struct page *f2fs_pagecache_get_page(
 				struct address_space *mapping, pgoff_t index,
-				int fgp_flags, gfp_t gfp_mask)
+				fgf_t fgp_flags, gfp_t gfp_mask)
 {
 	if (time_to_inject(F2FS_M_SB(mapping), FAULT_PAGE_GET))
 		return NULL;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 2054b85c9d9b..9af357d52e56 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -467,7 +467,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  */
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
 {
-	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
+	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a56308a9d1a4..993242f0c1e1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -497,22 +497,48 @@ pgoff_t page_cache_next_miss(struct address_space *mapping,
 pgoff_t page_cache_prev_miss(struct address_space *mapping,
 			     pgoff_t index, unsigned long max_scan);
 
-#define FGP_ACCESSED		0x00000001
-#define FGP_LOCK		0x00000002
-#define FGP_CREAT		0x00000004
-#define FGP_WRITE		0x00000008
-#define FGP_NOFS		0x00000010
-#define FGP_NOWAIT		0x00000020
-#define FGP_FOR_MMAP		0x00000040
-#define FGP_STABLE		0x00000080
+/**
+ * typedef fgf_t - Flags for getting folios from the page cache.
+ *
+ * Most users of the page cache will not need to use these flags;
+ * there are convenience functions such as filemap_get_folio() and
+ * filemap_lock_folio().  For users which need more control over exactly
+ * what is done with the folios, these flags to __filemap_get_folio()
+ * are available.
+ *
+ * * %FGP_ACCESSED - The folio will be marked accessed.
+ * * %FGP_LOCK - The folio is returned locked.
+ * * %FGP_CREAT - If no folio is present then a new folio is allocated,
+ *   added to the page cache and the VM's LRU list.  The folio is
+ *   returned locked.
+ * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
+ *   folio is already in cache.  If the folio was allocated, unlock it
+ *   before returning so the caller can do the same dance.
+ * * %FGP_WRITE - The folio will be written to by the caller.
+ * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
+ * * %FGP_NOWAIT - Don't block on the folio lock.
+ * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
+ * * %FGP_WRITEBEGIN - The flags to use in a filesystem write_begin()
+ *   implementation.
+ */
+typedef unsigned int __bitwise fgf_t;
+
+#define FGP_ACCESSED		((__force fgf_t)0x00000001)
+#define FGP_LOCK		((__force fgf_t)0x00000002)
+#define FGP_CREAT		((__force fgf_t)0x00000004)
+#define FGP_WRITE		((__force fgf_t)0x00000008)
+#define FGP_NOFS		((__force fgf_t)0x00000010)
+#define FGP_NOWAIT		((__force fgf_t)0x00000020)
+#define FGP_FOR_MMAP		((__force fgf_t)0x00000040)
+#define FGP_STABLE		((__force fgf_t)0x00000080)
 
 #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
 
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		fgf_t fgp_flags, gfp_t gfp);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		fgf_t fgp_flags, gfp_t gfp);
 
 /**
  * filemap_get_folio - Find and get a folio.
@@ -586,7 +612,7 @@ static inline struct page *find_get_page(struct address_space *mapping,
 }
 
 static inline struct page *find_get_page_flags(struct address_space *mapping,
-					pgoff_t offset, int fgp_flags)
+					pgoff_t offset, fgf_t fgp_flags)
 {
 	return pagecache_get_page(mapping, offset, fgp_flags, 0);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index b4c9bd368b7e..42353b82ebf6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1887,30 +1887,15 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
  *
  * Looks up the page cache entry at @mapping & @index.
  *
- * @fgp_flags can be zero or more of these flags:
- *
- * * %FGP_ACCESSED - The folio will be marked accessed.
- * * %FGP_LOCK - The folio is returned locked.
- * * %FGP_CREAT - If no page is present then a new page is allocated using
- *   @gfp and added to the page cache and the VM's LRU list.
- *   The page is returned locked and with an increased refcount.
- * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
- *   page is already in cache.  If the page was allocated, unlock it before
- *   returning so the caller can do the same dance.
- * * %FGP_WRITE - The page will be written to by the caller.
- * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
- * * %FGP_NOWAIT - Don't get blocked by page lock.
- * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
- *
  * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
  * if the %GFP flags specified for %FGP_CREAT are atomic.
  *
- * If there is a page cache page, it is returned with an increased refcount.
+ * If this function returns a folio, it is returned with an increased refcount.
  *
  * Return: The found folio or an ERR_PTR() otherwise.
  */
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		fgf_t fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index c6f056c20503..10c3247542cb 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -92,7 +92,7 @@ EXPORT_SYMBOL(add_to_page_cache_lru);
 
 noinline
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		fgf_t fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
-- 
2.39.2



* [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-12 20:39 [PATCH v3 0/8] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (4 preceding siblings ...)
  2023-06-12 20:39 ` [PATCH v3 5/8] filemap: Add fgf_t typedef Matthew Wilcox (Oracle)
@ 2023-06-12 20:39 ` Matthew Wilcox (Oracle)
  2023-06-12 22:49   ` Dave Chinner
  2023-06-13  4:56   ` Christoph Hellwig
  2023-06-12 20:39 ` [PATCH v3 7/8] iomap: Create large folios in the buffered write path Matthew Wilcox (Oracle)
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 33+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-12 20:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

Allow callers of __filemap_get_folio() to specify a preferred folio
order in the FGP flags.  This is only honoured in the FGP_CREAT path;
if there is already a folio in the page cache that covers the index,
we will return it, no matter what its order is.  No create-around is
attempted; we will only create folios which start at the specified index.
Unmodified callers will continue to allocate order 0 folios.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 23 ++++++++++++++++++++++
 mm/filemap.c            | 42 ++++++++++++++++++++++++++++-------------
 mm/readahead.c          | 13 -------------
 3 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 993242f0c1e1..b2ed80f91e5b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -466,6 +466,19 @@ static inline void *detach_page_private(struct page *page)
 	return folio_detach_private(page_folio(page));
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
 #else
@@ -531,9 +544,19 @@ typedef unsigned int __bitwise fgf_t;
 #define FGP_NOWAIT		((__force fgf_t)0x00000020)
 #define FGP_FOR_MMAP		((__force fgf_t)0x00000040)
 #define FGP_STABLE		((__force fgf_t)0x00000080)
+#define FGF_GET_ORDER(fgf)	(((__force unsigned)fgf) >> 26)	/* top 6 bits */
 
 #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
 
+static inline fgf_t fgf_set_order(size_t size)
+{
+	unsigned int shift = ilog2(size);
+
+	if (shift <= PAGE_SHIFT)
+		return 0;
+	return (__force fgf_t)((shift - PAGE_SHIFT) << 26);
+}
+
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp);
diff --git a/mm/filemap.c b/mm/filemap.c
index 42353b82ebf6..bd66398ae072 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1937,7 +1937,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
+		unsigned order = FGF_GET_ORDER(fgp_flags);
 		int err;
+
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
 		if (fgp_flags & FGP_NOFS)
@@ -1946,26 +1948,40 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp &= ~GFP_KERNEL;
 			gfp |= GFP_NOWAIT | __GFP_NOWARN;
 		}
-
-		folio = filemap_alloc_folio(gfp, 0);
-		if (!folio)
-			return ERR_PTR(-ENOMEM);
-
 		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
 			fgp_flags |= FGP_LOCK;
 
-		/* Init accessed so avoid atomic mark_page_accessed later */
-		if (fgp_flags & FGP_ACCESSED)
-			__folio_set_referenced(folio);
+		if (!mapping_large_folio_support(mapping))
+			order = 0;
+		if (order > MAX_PAGECACHE_ORDER)
+			order = MAX_PAGECACHE_ORDER;
+		/* If we're not aligned, allocate a smaller folio */
+		if (index & ((1UL << order) - 1))
+			order = __ffs(index);
 
-		err = filemap_add_folio(mapping, folio, index, gfp);
-		if (unlikely(err)) {
+		do {
+			err = -ENOMEM;
+			if (order == 1)
+				order = 0;
+			folio = filemap_alloc_folio(gfp, order);
+			if (!folio)
+				continue;
+
+			/* Init accessed so avoid atomic mark_page_accessed later */
+			if (fgp_flags & FGP_ACCESSED)
+				__folio_set_referenced(folio);
+
+			err = filemap_add_folio(mapping, folio, index, gfp);
+			if (!err)
+				break;
 			folio_put(folio);
 			folio = NULL;
-			if (err == -EEXIST)
-				goto repeat;
-		}
+		} while (order-- > 0);
 
+		if (err == -EEXIST)
+			goto repeat;
+		if (err)
+			return ERR_PTR(err);
 		/*
 		 * filemap_add_folio locks the page, and for mmap
 		 * we expect an unlocked page.
diff --git a/mm/readahead.c b/mm/readahead.c
index 47afbca1d122..59a071badb90 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -462,19 +462,6 @@ static int try_context_readahead(struct address_space *mapping,
 	return 1;
 }
 
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size.  I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER	8
-#endif
-
 static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 		pgoff_t mark, unsigned int order, gfp_t gfp)
 {
-- 
2.39.2



* [PATCH v3 7/8] iomap: Create large folios in the buffered write path
  2023-06-12 20:39 [PATCH v3 0/8] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (5 preceding siblings ...)
  2023-06-12 20:39 ` [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
@ 2023-06-12 20:39 ` Matthew Wilcox (Oracle)
  2023-06-13  4:56   ` Christoph Hellwig
  2023-06-12 20:39 ` [PATCH v3 8/8] iomap: Copy larger chunks from userspace Matthew Wilcox (Oracle)
  2023-06-21 12:03 ` [PATCH v3 0/8] Create large folios in iomap buffered write path Wang Yugui
  8 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-12 20:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

Use the size of the write as a hint for the size of the folio to create.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/gfs2/bmap.c         | 2 +-
 fs/iomap/buffered-io.c | 6 ++++--
 include/linux/iomap.h  | 2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index c739b258a2d9..3702e5e47b0f 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -971,7 +971,7 @@ gfs2_iomap_get_folio(struct iomap_iter *iter, loff_t pos, unsigned len)
 	if (status)
 		return ERR_PTR(status);
 
-	folio = iomap_get_folio(iter, pos);
+	folio = iomap_get_folio(iter, pos, len);
 	if (IS_ERR(folio))
 		gfs2_trans_end(sdp);
 	return folio;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9af357d52e56..a5d62c9640cf 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -461,16 +461,18 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  * iomap_get_folio - get a folio reference for writing
  * @iter: iteration structure
  * @pos: start offset of write
+ * @len: length of write
  *
  * Returns a locked reference to the folio at @pos, or an error pointer if the
  * folio could not be obtained.
  */
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
+	fgp |= fgf_set_order(len);
 
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
 			fgp, mapping_gfp_mask(iter->inode->i_mapping));
@@ -596,7 +598,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
 	if (folio_ops && folio_ops->get_folio)
 		return folio_ops->get_folio(iter, pos, len);
 	else
-		return iomap_get_folio(iter, pos);
+		return iomap_get_folio(iter, pos, len);
 }
 
 static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e2b836c2e119..80facb9c9e5b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -261,7 +261,7 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
 void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v3 8/8] iomap: Copy larger chunks from userspace
  2023-06-12 20:39 [PATCH v3 0/8] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (6 preceding siblings ...)
  2023-06-12 20:39 ` [PATCH v3 7/8] iomap: Create large folios in the buffered write path Matthew Wilcox (Oracle)
@ 2023-06-12 20:39 ` Matthew Wilcox (Oracle)
  2023-06-13  4:58   ` Christoph Hellwig
  2023-06-17  7:13   ` Ritesh Harjani
  2023-06-21 12:03 ` [PATCH v3 0/8] Create large folios in iomap buffered write path Wang Yugui
  8 siblings, 2 replies; 33+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-12 20:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

If we have a large folio, we can copy in larger chunks than PAGE_SIZE.
Start at the maximum page cache size and shrink by half every time we
hit the "we are short on memory" problem.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index a5d62c9640cf..818dc350ffc5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -768,6 +768,7 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 {
 	loff_t length = iomap_length(iter);
+	size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
 	loff_t pos = iter->pos;
 	ssize_t written = 0;
 	long status = 0;
@@ -776,15 +777,13 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 
 	do {
 		struct folio *folio;
-		struct page *page;
-		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long bytes;	/* Bytes to write to page */
+		size_t offset;		/* Offset into folio */
+		unsigned long bytes;	/* Bytes to write to folio */
 		size_t copied;		/* Bytes copied from user */
 
-		offset = offset_in_page(pos);
-		bytes = min_t(unsigned long, PAGE_SIZE - offset,
-						iov_iter_count(i));
 again:
+		offset = pos & (chunk - 1);
+		bytes = min(chunk - offset, iov_iter_count(i));
 		status = balance_dirty_pages_ratelimited_flags(mapping,
 							       bdp_flags);
 		if (unlikely(status))
@@ -814,11 +813,14 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
-		page = folio_file_page(folio, pos >> PAGE_SHIFT);
+		offset = offset_in_folio(folio, pos);
+		if (bytes > folio_size(folio) - offset)
+			bytes = folio_size(folio) - offset;
+
 		if (mapping_writably_mapped(mapping))
-			flush_dcache_page(page);
+			flush_dcache_folio(folio);
 
-		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
+		copied = copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
 
 		status = iomap_write_end(iter, pos, bytes, copied, folio);
 
@@ -835,6 +837,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 			 */
 			if (copied)
 				bytes = copied;
+			if (chunk > PAGE_SIZE)
+				chunk /= 2;
 			goto again;
 		}
 		pos += status;
-- 
2.39.2



* Re: [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-12 20:39 ` [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
@ 2023-06-12 22:49   ` Dave Chinner
  2023-06-13  0:42     ` Matthew Wilcox
  2023-06-13  4:56   ` Christoph Hellwig
  1 sibling, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2023-06-12 22:49 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Christoph Hellwig,
	Darrick J . Wong

On Mon, Jun 12, 2023 at 09:39:08PM +0100, Matthew Wilcox (Oracle) wrote:
> Allow callers of __filemap_get_folio() to specify a preferred folio
> order in the FGP flags.  This is only honoured in the FGP_CREATE path;
> if there is already a folio in the page cache that covers the index,
> we will return it, no matter what its order is.  No create-around is
> attempted; we will only create folios which start at the specified index.
> Unmodified callers will continue to allocate order 0 folios.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
.....
> -		/* Init accessed so avoid atomic mark_page_accessed later */
> -		if (fgp_flags & FGP_ACCESSED)
> -			__folio_set_referenced(folio);
> +		if (!mapping_large_folio_support(mapping))
> +			order = 0;
> +		if (order > MAX_PAGECACHE_ORDER)
> +			order = MAX_PAGECACHE_ORDER;
> +		/* If we're not aligned, allocate a smaller folio */
> +		if (index & ((1UL << order) - 1))
> +			order = __ffs(index);

If I read this right, if we pass in an unaligned index, we won't get
the size of the folio we ask for?

e.g. if we want an order-4 folio (64kB) because we have a 64kB block
size in the filesystem, then we have to pass in an index that is
order-4 aligned, yes?

I ask this, because the later iomap code that asks for large folios
only passes in "pos >> PAGE_SHIFT" so it looks to me like it won't
allocate large folios for anything other than large folio aligned
writes, even if we need them.

What am I missing?

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-12 22:49   ` Dave Chinner
@ 2023-06-13  0:42     ` Matthew Wilcox
  2023-06-13  1:30       ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2023-06-13  0:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Christoph Hellwig,
	Darrick J . Wong

On Tue, Jun 13, 2023 at 08:49:05AM +1000, Dave Chinner wrote:
> On Mon, Jun 12, 2023 at 09:39:08PM +0100, Matthew Wilcox (Oracle) wrote:
> > Allow callers of __filemap_get_folio() to specify a preferred folio
> > order in the FGP flags.  This is only honoured in the FGP_CREATE path;
> > if there is already a folio in the page cache that covers the index,
> > we will return it, no matter what its order is.  No create-around is
> > attempted; we will only create folios which start at the specified index.
> > Unmodified callers will continue to allocate order 0 folios.
> .....
> > -		/* Init accessed so avoid atomic mark_page_accessed later */
> > -		if (fgp_flags & FGP_ACCESSED)
> > -			__folio_set_referenced(folio);
> > +		if (!mapping_large_folio_support(mapping))
> > +			order = 0;
> > +		if (order > MAX_PAGECACHE_ORDER)
> > +			order = MAX_PAGECACHE_ORDER;
> > +		/* If we're not aligned, allocate a smaller folio */
> > +		if (index & ((1UL << order) - 1))
> > +			order = __ffs(index);
> 
> If I read this right, if we pass in an unaligned index, we won't get
> the size of the folio we ask for?

Right.  That's implied by (but perhaps not obvious from) the changelog.
Folios are always naturally aligned in the file, so an order-4 folio
has to start at a multiple of 16.  If the index you pass in is not
a multiple of 16, we can't create an order-4 folio without starting
at an earlier index.

For a 4kB block size filesystem, that's what we want.  Applications
_generally_ don't write backwards, so creating an order-4 folio is just
wasting memory.

> e.g. if we want an order-4 folio (64kB) because we have a 64kB block
> size in the filesystem, then we have to pass in an index that is
> order-4 aligned, yes?
> 
> I ask this, because the later iomap code that asks for large folios
> only passes in "pos >> PAGE_SHIFT" so it looks to me like it won't
> allocate large folios for anything other than large folio aligned
> writes, even if we need them.
> 
> What am I missing?

Perhaps what you're missing is that this isn't trying to solve the
problem of supporting a bs > ps filesystem?  That's also a worthwhile
project, but it's not this project.  In fact, I'd say that project is
almost orthogonal to this one; for this usage we can always fall back to
smaller folios on memory pressure or misalignment.  For a bs > ps block
device, we have to allocate folios at least as large as the blocksize
and cannot fall back to smaller folios.  For a bs > ps filesystem on a
bdev with bs == ps, we can fall back (as your prototype showed).


* Re: [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-13  0:42     ` Matthew Wilcox
@ 2023-06-13  1:30       ` Dave Chinner
  2023-06-13  2:00         ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2023-06-13  1:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Christoph Hellwig,
	Darrick J . Wong

On Tue, Jun 13, 2023 at 01:42:51AM +0100, Matthew Wilcox wrote:
> On Tue, Jun 13, 2023 at 08:49:05AM +1000, Dave Chinner wrote:
> > On Mon, Jun 12, 2023 at 09:39:08PM +0100, Matthew Wilcox (Oracle) wrote:
> > > Allow callers of __filemap_get_folio() to specify a preferred folio
> > > order in the FGP flags.  This is only honoured in the FGP_CREATE path;
> > > if there is already a folio in the page cache that covers the index,
> > > we will return it, no matter what its order is.  No create-around is
> > > attempted; we will only create folios which start at the specified index.
> > > Unmodified callers will continue to allocate order 0 folios.
> > .....
> > > -		/* Init accessed so avoid atomic mark_page_accessed later */
> > > -		if (fgp_flags & FGP_ACCESSED)
> > > -			__folio_set_referenced(folio);
> > > +		if (!mapping_large_folio_support(mapping))
> > > +			order = 0;
> > > +		if (order > MAX_PAGECACHE_ORDER)
> > > +			order = MAX_PAGECACHE_ORDER;
> > > +		/* If we're not aligned, allocate a smaller folio */
> > > +		if (index & ((1UL << order) - 1))
> > > +			order = __ffs(index);
> > 
> > If I read this right, if we pass in an unaligned index, we won't get
> > the size of the folio we ask for?
> 
> Right.  That's implied by (but perhaps not obvious from) the changelog.
> Folios are always naturally aligned in the file, so an order-4 folio
> has to start at a multiple of 16.  If the index you pass in is not
> a multiple of 16, we can't create an order-4 folio without starting
> at an earlier index.
> 
> For a 4kB block size filesystem, that's what we want.  Applications
> _generally_ don't write backwards, so creating an order-4 folio is just
> wasting memory.
> 
> > e.g. if we want an order-4 folio (64kB) because we have a 64kB block
> > size in the filesystem, then we have to pass in an index that is
> > order-4 aligned, yes?
> > 
> > I ask this, because the later iomap code that asks for large folios
> > only passes in "pos >> PAGE_SHIFT" so it looks to me like it won't
> > allocate large folios for anything other than large folio aligned
> > writes, even if we need them.
> > 
> > What am I missing?
> 
> Perhaps what you're missing is that this isn't trying to solve the
> problem of supporting a bs > ps filesystem?

No, that's not what I'm asking about. I know there's other changes
needed to enforce minimum folio size/alignment for bs > ps.

What I'm asking about is when someone does a 16kB write at offset
12kB, they won't get a large folio allocated at all, right? Even
though the write is large enough to enable it?

Indeed, if we do a 1MB write at offset 4KB, we'll get 4kB at 4KB, 8KB
and 12kB (because we can't do order-1 folios), then order-2 at 16KB,
order-3 at 32kB, and so on until we hit offset 1MB where we will do
an order-0 folio allocation again (because the remaining length is
4KB). The next 1MB write will then follow the same pattern, right?

I think this ends up being sub-optimal and fairly non-obvious
behaviour from the iomap side of the fence which is
clearly asking for high-order folios to be allocated. i.e. a small
amount of allocate-around to naturally align large folios when the
page cache is otherwise empty would make a big difference to the
efficiency of non-large-folio-aligned sequential writes...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-13  1:30       ` Dave Chinner
@ 2023-06-13  2:00         ` Matthew Wilcox
  2023-06-13  7:54           ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2023-06-13  2:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Christoph Hellwig,
	Darrick J . Wong

On Tue, Jun 13, 2023 at 11:30:13AM +1000, Dave Chinner wrote:
> On Tue, Jun 13, 2023 at 01:42:51AM +0100, Matthew Wilcox wrote:
> > On Tue, Jun 13, 2023 at 08:49:05AM +1000, Dave Chinner wrote:
> > > On Mon, Jun 12, 2023 at 09:39:08PM +0100, Matthew Wilcox (Oracle) wrote:
> > > > Allow callers of __filemap_get_folio() to specify a preferred folio
> > > > order in the FGP flags.  This is only honoured in the FGP_CREATE path;
> > > > if there is already a folio in the page cache that covers the index,
> > > > we will return it, no matter what its order is.  No create-around is
> > > > attempted; we will only create folios which start at the specified index.
> > > > Unmodified callers will continue to allocate order 0 folios.
> > > .....
> > > > -		/* Init accessed so avoid atomic mark_page_accessed later */
> > > > -		if (fgp_flags & FGP_ACCESSED)
> > > > -			__folio_set_referenced(folio);
> > > > +		if (!mapping_large_folio_support(mapping))
> > > > +			order = 0;
> > > > +		if (order > MAX_PAGECACHE_ORDER)
> > > > +			order = MAX_PAGECACHE_ORDER;
> > > > +		/* If we're not aligned, allocate a smaller folio */
> > > > +		if (index & ((1UL << order) - 1))
> > > > +			order = __ffs(index);
> > > 
> > > If I read this right, if we pass in an unaligned index, we won't get
> > > the size of the folio we ask for?
> > 
> > Right.  That's implied by (but perhaps not obvious from) the changelog.
> > Folios are always naturally aligned in the file, so an order-4 folio
> > has to start at a multiple of 16.  If the index you pass in is not
> > a multiple of 16, we can't create an order-4 folio without starting
> > at an earlier index.
> > 
> > For a 4kB block size filesystem, that's what we want.  Applications
> > _generally_ don't write backwards, so creating an order-4 folio is just
> > wasting memory.
> > 
> > > e.g. if we want an order-4 folio (64kB) because we have a 64kB block
> > > size in the filesystem, then we have to pass in an index that is
> > > order-4 aligned, yes?
> > > 
> > > I ask this, because the later iomap code that asks for large folios
> > > only passes in "pos >> PAGE_SHIFT" so it looks to me like it won't
> > > allocate large folios for anything other than large folio aligned
> > > writes, even if we need them.
> > > 
> > > What am I missing?
> > 
> > Perhaps what you're missing is that this isn't trying to solve the
> > problem of supporting a bs > ps filesystem?
> 
> No, that's not what I'm asking about. I know there's other changes
> needed to enforce minimum folio size/alignment for bs > ps.

OK.  Bringing up the 64kB block size filesystem confused me.

> What I'm asking about is when someone does a 16kB write at offset
> 12kB, they won't get a large folio allocated at all, right? Even
> though the write is large enough to enable it?

Right.

> Indeed, if we do a 1MB write at offset 4KB, we'll get 4kB at 4KB, 8KB
> and 12kB (because we can't do order-1 folios), then order-2 at 16KB,
> order-3 at 32kB, and so on until we hit offset 1MB where we will do
> an order-0 folio allocation again (because the remaining length is
> 4KB). The next 1MB write will then follow the same pattern, right?

Yes.  Assuming we get another write ...

> I think this ends up being sub-optimal and fairly non-obvious
> behaviour from the iomap side of the fence which is
> clearly asking for high-order folios to be allocated. i.e. a small
> amount of allocate-around to naturally align large folios when the
> page cache is otherwise empty would make a big difference to the
> efficiency of non-large-folio-aligned sequential writes...

At this point we're arguing about what I/O pattern to optimise for.
I'm going for a "do no harm" approach where we only allocate exactly as
much memory as we did before.  You're advocating for a
higher-risk/higher-reward approach.

I'd prefer the low-risk approach for now; we can change it later!
I'd like to see some amount of per-fd write history (as we have per-fd
readahead history) to decide whether to allocate large folios ahead of
the current write position.  As with readahead, I'd like to see that even
doing single-byte writes can result in the allocation of large folios,
as long as the app has done enough of them.


* Re: [PATCH v3 1/8] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()
  2023-06-12 20:39 ` [PATCH v3 1/8] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Matthew Wilcox (Oracle)
@ 2023-06-13  4:52   ` Christoph Hellwig
  2023-07-10  3:36     ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2023-06-13  4:52 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v3 3/8] doc: Correct the description of ->release_folio
  2023-06-12 20:39 ` [PATCH v3 3/8] doc: Correct the description of ->release_folio Matthew Wilcox (Oracle)
@ 2023-06-13  4:53   ` Christoph Hellwig
  0 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2023-06-13  4:53 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v3 4/8] iomap: Remove unnecessary test from iomap_release_folio()
  2023-06-12 20:39 ` [PATCH v3 4/8] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
@ 2023-06-13  4:53   ` Christoph Hellwig
  2023-06-13 16:19   ` Matthew Wilcox
  1 sibling, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2023-06-13  4:53 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v3 5/8] filemap: Add fgf_t typedef
  2023-06-12 20:39 ` [PATCH v3 5/8] filemap: Add fgf_t typedef Matthew Wilcox (Oracle)
@ 2023-06-13  4:53   ` Christoph Hellwig
  0 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2023-06-13  4:53 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-12 20:39 ` [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
  2023-06-12 22:49   ` Dave Chinner
@ 2023-06-13  4:56   ` Christoph Hellwig
  1 sibling, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2023-06-13  4:56 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

(and I agree with keeping it simple and the folios aligned in the
page cache, and not too optimistically allocated for now)



* Re: [PATCH v3 7/8] iomap: Create large folios in the buffered write path
  2023-06-12 20:39 ` [PATCH v3 7/8] iomap: Create large folios in the buffered write path Matthew Wilcox (Oracle)
@ 2023-06-13  4:56   ` Christoph Hellwig
  0 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2023-06-13  4:56 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v3 8/8] iomap: Copy larger chunks from userspace
  2023-06-12 20:39 ` [PATCH v3 8/8] iomap: Copy larger chunks from userspace Matthew Wilcox (Oracle)
@ 2023-06-13  4:58   ` Christoph Hellwig
  2023-06-13 19:43     ` Matthew Wilcox
  2023-06-17  7:13   ` Ritesh Harjani
  1 sibling, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2023-06-13  4:58 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

On Mon, Jun 12, 2023 at 09:39:10PM +0100, Matthew Wilcox (Oracle) wrote:
> If we have a large folio, we can copy in larger chunks than PAGE_SIZE.
> Start at the maximum page cache size and shrink by half every time we
> hit the "we are short on memory" problem.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/iomap/buffered-io.c | 22 +++++++++++++---------
>  1 file changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index a5d62c9640cf..818dc350ffc5 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -768,6 +768,7 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
>  static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  {
>  	loff_t length = iomap_length(iter);
> +	size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;

This could overflow if the chunk size ends up bigger than 4GB, but
I guess that's mostly theoretical.

> -		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
> +		copied = copy_page_from_iter_atomic(&folio->page, offset, bytes, i);

Would be nice to avoid the overly long line here

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-13  2:00         ` Matthew Wilcox
@ 2023-06-13  7:54           ` Dave Chinner
  2023-06-13 13:34             ` Matthew Wilcox
  2023-06-16 17:45             ` Matthew Wilcox
  0 siblings, 2 replies; 33+ messages in thread
From: Dave Chinner @ 2023-06-13  7:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Christoph Hellwig,
	Darrick J . Wong

On Tue, Jun 13, 2023 at 03:00:14AM +0100, Matthew Wilcox wrote:
> On Tue, Jun 13, 2023 at 11:30:13AM +1000, Dave Chinner wrote:
> > Indeed, if we do a 1MB write at offset 4KB, we'll get 4kB at 4KB, 8KB
> > and 12kB (because we can't do order-1 folios), then order-2 at 16KB,
> > order-3 at 32kB, and so on until we hit offset 1MB where we will do
> > an order-0 folio allocation again (because the remaining length is
> > 4KB). The next 1MB write will then follow the same pattern, right?
> 
> Yes.  Assuming we get another write ...
> 
> > I think this ends up being sub-optimal and fairly non-obvious
> > behaviour from the iomap side of the fence which is
> > clearly asking for high-order folios to be allocated. i.e. a small
> > amount of allocate-around to naturally align large folios when the
> > page cache is otherwise empty would make a big difference to the
> > efficiency of non-large-folio-aligned sequential writes...
> 
> At this point we're arguing about what I/O pattern to optimise for.
> I'm going for a "do no harm" approach where we only allocate exactly as
> much memory as we did before.  You're advocating for a
> higher-risk/higher-reward approach.

Not really - I'm just trying to understand the behaviour the change
will result in, compared to what would be considered optimal as it's
not clearly spelled out in either the code or the commit messages.

If I hadn't looked at the code closely and saw a trace with this
sort of behaviour (i.e. I understood large folios were in use,
but not exactly how they worked), I'd be very surprised to see a
weird repeated pattern of varying folio sizes. I'd probably think
it was a bug in the implementation....

> I'd prefer the low-risk approach for now; we can change it later!

That's fine by me - just document the limitations and expected
behaviour in the code rather than expect people to have to discover
this behaviour for themselves.

> I'd like to see some amount of per-fd write history (as we have per-fd
> readahead history) to decide whether to allocate large folios ahead of
> the current write position.  As with readahead, I'd like to see that even
> doing single-byte writes can result in the allocation of large folios,
> as long as the app has done enough of them.

*nod*

We already have some hints in the iomaps that can tell you this sort
of thing. e.g. if ->iomap_begin() returns a delalloc iomap that
extends beyond the current write, we're performing a sequence of
multiple sequential writes.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-13  7:54           ` Dave Chinner
@ 2023-06-13 13:34             ` Matthew Wilcox
  2023-06-16 17:45             ` Matthew Wilcox
  1 sibling, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2023-06-13 13:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Christoph Hellwig,
	Darrick J . Wong

On Tue, Jun 13, 2023 at 05:54:35PM +1000, Dave Chinner wrote:
> On Tue, Jun 13, 2023 at 03:00:14AM +0100, Matthew Wilcox wrote:
> > On Tue, Jun 13, 2023 at 11:30:13AM +1000, Dave Chinner wrote:
> > > I think this ends up being sub-optimal and fairly non-obvious
> > > behaviour from the iomap side of the fence which is
> > > clearly asking for high-order folios to be allocated. i.e. a small
> > > amount of allocate-around to naturally align large folios when the
> > > page cache is otherwise empty would make a big difference to the
> > > efficiency of non-large-folio-aligned sequential writes...
> > 
> > At this point we're arguing about what I/O pattern to optimise for.
> > I'm going for a "do no harm" approach where we only allocate exactly as
> > much memory as we did before.  You're advocating for a
> > higher-risk/higher-reward approach.
> 
> Not really - I'm just trying to understand the behaviour the change
> will result in, compared to what would be considered optimal as it's
> not clearly spelled out in either the code or the commit messages.

I suppose it depends what you think we're optimising for.  If it's
minimum memory consumption, then the current patchset is optimal ;-) If
it's minimum number of folios allocated for a particular set of writes,
then your proposal makes sense.  I do think we should end up doing
something along the lines of your sketch; it just doesn't need to be now.

> > I'd like to see some amount of per-fd write history (as we have per-fd
> > readahead history) to decide whether to allocate large folios ahead of
> > the current write position.  As with readahead, I'd like to see that even
> > doing single-byte writes can result in the allocation of large folios,
> > as long as the app has done enough of them.
> 
> *nod*
> 
> We already have some hints in the iomaps that can tell you this sort
> of thing. e.g. if ->iomap_begin() returns a delalloc iomap that
> extends beyond the current write, we're performing a sequence of
> multiple sequential writes.....

Well, if this is something the FS is already tracking, then we can
either try to lift that functionality into the page cache, or just take
advantage of the FS knowledge.  In iomap_write_begin(), we have access
to the srcmap and the iomap, and we can pass in something other than
the length of the write as the hint to __iomap_get_folio() for the
size of the folio we would like.

I should probably clean up the kernel-doc for iomap_get_folio() ...

- * @len: length of write
+ * @len: Suggested size of folio to create.



* Re: [PATCH v3 4/8] iomap: Remove unnecessary test from iomap_release_folio()
  2023-06-12 20:39 ` [PATCH v3 4/8] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
  2023-06-13  4:53   ` Christoph Hellwig
@ 2023-06-13 16:19   ` Matthew Wilcox
  1 sibling, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2023-06-13 16:19 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig, Darrick J . Wong

On Mon, Jun 12, 2023 at 09:39:06PM +0100, Matthew Wilcox (Oracle) wrote:
> The check for the folio being under writeback is unnecessary; the caller
> has checked this and the folio is locked, so the folio cannot be under
> writeback at this point.
> 
> The comment is somewhat misleading in that it talks about one specific
> situation in which we can see a dirty folio.  There are others, so change
> the comment to explain why we can't release the iomap_page.

> +	 * If the folio is dirty, we refuse to release our metadata because
> +	 * it may be partially dirty (FIXME, add a test for that).

Argh, forgot to fix this.

        /*
         * If the folio is dirty, we refuse to release our metadata because
-        * it may be partially dirty (FIXME, add a test for that).
+        * it may be partially dirty.  Once we track per-block dirty state,
+        * we can release the metadata if every block is dirty.
         */

> -	if (folio_test_dirty(folio) || folio_test_writeback(folio))
> +	if (folio_test_dirty(folio))
>  		return false;

Now I'm wondering if we shouldn't also refuse to release the metadata if
the folio is !uptodate.  It's not a correctness issue, it's a performance
issue, and a question of whose priorities are more important.

If we do release the metadata on a partially uptodate folio, we'll
re-read the parts of the folio from storage which had previously been
successfully read.  If we don't release the metadata, we prevent the
MM from splitting the page (eg on truncate).

No obviously right answer here, IMO.



* Re: [PATCH v3 8/8] iomap: Copy larger chunks from userspace
  2023-06-13  4:58   ` Christoph Hellwig
@ 2023-06-13 19:43     ` Matthew Wilcox
  2023-07-10  3:45       ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2023-06-13 19:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Darrick J . Wong

On Mon, Jun 12, 2023 at 09:58:54PM -0700, Christoph Hellwig wrote:
> On Mon, Jun 12, 2023 at 09:39:10PM +0100, Matthew Wilcox (Oracle) wrote:
> > If we have a large folio, we can copy in larger chunks than PAGE_SIZE.
> > Start at the maximum page cache size and shrink by half every time we
> > hit the "we are short on memory" problem.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >  fs/iomap/buffered-io.c | 22 +++++++++++++---------
> >  1 file changed, 13 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index a5d62c9640cf..818dc350ffc5 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -768,6 +768,7 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
> >  static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
> >  {
> >  	loff_t length = iomap_length(iter);
> > +	size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
> 
> This could overflow if the chunk size ends up bigger than 4GB, but
> I guess that's mostly theoretical.

I don't think it can ... we currently restrict it to PMD_SIZE if THP are
enabled and order-8 if they're not.  I could add a MAX_PAGECACHE_SIZE if
needed, but PAGE_SIZE is 'unsigned long' on most if not all platforms,
so it's always the same size as size_t.  We definitely can't create
folios larger than size_t, so MAX_PAGECACHE_ORDER is never going to be
defined such that PAGE_SIZE << MAX_PAGECACHE_ORDER cannot fit in size_t.

The largest I can see it going would be on something like PowerPC with
its 16GB page size, and even there PAGE_SIZE is definitely defined as
1UL << PAGE_SHIFT.
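The no-overflow argument can also be checked mechanically; the
constants below are illustrative (4KiB pages, PMD-order maximum
folios, as on x86-64 with THP), not taken from any particular config:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values, not the kernel's: 4KiB pages, 2MiB max folios. */
#define PAGE_SHIFT		12
#define PAGE_SIZE		(1UL << PAGE_SHIFT)
#define MAX_PAGECACHE_ORDER	9

/*
 * Because PAGE_SIZE is an unsigned long, and unsigned long has the
 * same width as size_t on the platforms in question, the shift is
 * computed at full width and cannot be truncated when stored in a
 * size_t chunk.
 */
_Static_assert(sizeof(unsigned long) == sizeof(size_t),
	       "the argument assumes unsigned long == size_t");

static size_t max_chunk(void)
{
	return PAGE_SIZE << MAX_PAGECACHE_ORDER;
}
```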

> > -		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
> > +		copied = copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
> 
> Would be nice to avoid the overly long line here

The plan is to turn that into:

		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);

in the fairly near future.

> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks!


* Re: [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-13  7:54           ` Dave Chinner
  2023-06-13 13:34             ` Matthew Wilcox
@ 2023-06-16 17:45             ` Matthew Wilcox
  2023-06-16 22:40               ` Dave Chinner
  1 sibling, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2023-06-16 17:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Christoph Hellwig,
	Darrick J . Wong

On Tue, Jun 13, 2023 at 05:54:35PM +1000, Dave Chinner wrote:
> If I hadn't looked at the code closely and saw a trace with this
> sort of behaviour (i.e. I understood large folios were in use,
> but not exactly how they worked), I'd be very surprised to see a
> weird repeated pattern of varying folio sizes. I'd probably think
> it was a bug in the implementation....
> 
> > I'd prefer the low-risk approach for now; we can change it later!
> 
> That's fine by me - just document the limitations and expected
> behaviour in the code rather than expect people to have to discover
> this behaviour for themselves.

How about this?

+++ b/include/linux/pagemap.h
@@ -548,6 +548,17 @@ typedef unsigned int __bitwise fgf_t;

 #define FGP_WRITEBEGIN         (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)

+/**
+ * fgf_set_order - Encode a length in the fgf_t flags.
+ * @size: The suggested size of the folio to create.
+ *
+ * The caller of __filemap_get_folio() can use this to suggest a preferred
+ * size for the folio that is created.  If there is already a folio at
+ * the index, it will be returned, no matter what its size.  If a folio
+ * is freshly created, it may be of a different size than requested
+ * due to alignment constraints, memory pressure, or the presence of
+ * other folios at nearby indices.
+ */
 static inline fgf_t fgf_set_order(size_t size)
 {
        unsigned int shift = ilog2(size);
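As a userspace sketch of what the helper encodes (the << 26 bit
position and the names model_fgf_set_order()/MODEL_PAGE_SHIFT are
assumptions for illustration, not the kernel's definitions):

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumed 4KiB pages */

/*
 * Model of fgf_set_order(): store the suggested folio order (log2 of
 * the size, in pages) in the high bits of the fgf_t flags word, or 0
 * when a single page suffices.  The bit position 26 is an assumption.
 */
static unsigned int model_fgf_set_order(size_t size)
{
	unsigned int shift = 0;

	while ((size >> (shift + 1)) != 0)	/* open-coded ilog2(size) */
		shift++;
	if (shift <= MODEL_PAGE_SHIFT)
		return 0;
	return (shift - MODEL_PAGE_SHIFT) << 26;
}
```

Per the comment above, the encoded order is only a preference: a
freshly created folio may still be smaller than requested.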



* Re: [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-16 17:45             ` Matthew Wilcox
@ 2023-06-16 22:40               ` Dave Chinner
  0 siblings, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2023-06-16 22:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Christoph Hellwig,
	Darrick J . Wong

On Fri, Jun 16, 2023 at 06:45:43PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 13, 2023 at 05:54:35PM +1000, Dave Chinner wrote:
> > If I hadn't looked at the code closely and saw a trace with this
> > sort of behaviour (i.e. I understood large folios were in use,
> > but not exactly how they worked), I'd be very surprised to see a
> > weird repeated pattern of varying folio sizes. I'd probably think
> > it was a bug in the implementation....
> > 
> > > I'd prefer the low-risk approach for now; we can change it later!
> > 
> > That's fine by me - just document the limitations and expected
> > behaviour in the code rather than expect people to have to discover
> > this behaviour for themselves.
> 
> How about this?
> 
> +++ b/include/linux/pagemap.h
> @@ -548,6 +548,17 @@ typedef unsigned int __bitwise fgf_t;
> 
>  #define FGP_WRITEBEGIN         (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
> 
> +/**
> + * fgf_set_order - Encode a length in the fgf_t flags.
> + * @size: The suggested size of the folio to create.
> + *
> + * The caller of __filemap_get_folio() can use this to suggest a preferred
> + * size for the folio that is created.  If there is already a folio at
> + * the index, it will be returned, no matter what its size.  If a folio
> + * is freshly created, it may be of a different size than requested
> + * due to alignment constraints, memory pressure, or the presence of
> + * other folios at nearby indices.
> + */
>  static inline fgf_t fgf_set_order(size_t size)
>  {
>         unsigned int shift = ilog2(size);

Yup, I'm happy with that - "preferred size" is a good way of
describing the best effort attempt it makes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v3 8/8] iomap: Copy larger chunks from userspace
  2023-06-12 20:39 ` [PATCH v3 8/8] iomap: Copy larger chunks from userspace Matthew Wilcox (Oracle)
  2023-06-13  4:58   ` Christoph Hellwig
@ 2023-06-17  7:13   ` Ritesh Harjani
  2023-06-19 17:09     ` Matthew Wilcox
  1 sibling, 1 reply; 33+ messages in thread
From: Ritesh Harjani @ 2023-06-17  7:13 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

"Matthew Wilcox (Oracle)" <willy@infradead.org> writes:

> If we have a large folio, we can copy in larger chunks than PAGE_SIZE.
> Start at the maximum page cache size and shrink by half every time we
> hit the "we are short on memory" problem.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/iomap/buffered-io.c | 22 +++++++++++++---------
>  1 file changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index a5d62c9640cf..818dc350ffc5 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -768,6 +768,7 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
>  static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  {
>  	loff_t length = iomap_length(iter);
> +	size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
>  	loff_t pos = iter->pos;
>  	ssize_t written = 0;
>  	long status = 0;
> @@ -776,15 +777,13 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  
>  	do {
>  		struct folio *folio;
> -		struct page *page;
> -		unsigned long offset;	/* Offset into pagecache page */
> -		unsigned long bytes;	/* Bytes to write to page */
> +		size_t offset;		/* Offset into folio */
> +		unsigned long bytes;	/* Bytes to write to folio */

Why not make the type of "bytes" size_t, the same as "copied"?

>  		size_t copied;		/* Bytes copied from user */
>  
> -		offset = offset_in_page(pos);
> -		bytes = min_t(unsigned long, PAGE_SIZE - offset,
> -						iov_iter_count(i));
>  again:
> +		offset = pos & (chunk - 1);
> +		bytes = min(chunk - offset, iov_iter_count(i));
>  		status = balance_dirty_pages_ratelimited_flags(mapping,
>  							       bdp_flags);
>  		if (unlikely(status))
> @@ -814,11 +813,14 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  		if (iter->iomap.flags & IOMAP_F_STALE)
>  			break;
>  
> -		page = folio_file_page(folio, pos >> PAGE_SHIFT);
> +		offset = offset_in_folio(folio, pos);
> +		if (bytes > folio_size(folio) - offset)
> +			bytes = folio_size(folio) - offset;
> +
>  		if (mapping_writably_mapped(mapping))
> -			flush_dcache_page(page);
> +			flush_dcache_folio(folio);
>  
> -		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
> +		copied = copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
>  
>  		status = iomap_write_end(iter, pos, bytes, copied, folio);
>  
> @@ -835,6 +837,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  			 */
>  			if (copied)
>  				bytes = copied;

I think with your code change, which moves the "again" label, the
above line doing bytes = copied becomes dead code.  We recalculate
bytes after the "again" label anyway.


-ritesh


> +			if (chunk > PAGE_SIZE)
> +				chunk /= 2;
>  			goto again;
>  		}
>  		pos += status;
> -- 
> 2.39.2
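The sizing policy in the patch, which starts at the maximum page cache
chunk, halves it on a short copy, and never drops below one page, can
be modelled outside the kernel.  write_with_backoff() is a hypothetical
model, and flaky_copy() is a stand-in for copy_page_from_iter_atomic()
returning a short copy under memory pressure:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE		4096UL
#define MODEL_MAX_ORDER		9	/* illustrative, not the kernel constant */

/*
 * Model of iomap_write_iter()'s chunk sizing: try to copy up to
 * 'chunk' bytes at a time; when the copy makes no progress, halve
 * the chunk (never below one page) and retry.  The real loop also
 * bails out with -EFAULT when a single-page copy makes no progress;
 * the model omits that for brevity.
 */
static size_t write_with_backoff(size_t total,
				 size_t (*copy_fn)(size_t bytes, void *ctx),
				 void *ctx)
{
	size_t chunk = MODEL_PAGE_SIZE << MODEL_MAX_ORDER;
	size_t written = 0;

	while (written < total) {
		size_t bytes = total - written;
		size_t copied;

		if (bytes > chunk)
			bytes = chunk;
		copied = copy_fn(bytes, ctx);
		if (copied == 0) {
			if (chunk > MODEL_PAGE_SIZE)
				chunk /= 2;
			continue;	/* retry with a smaller chunk */
		}
		written += copied;
	}
	return written;
}

/*
 * Hypothetical stand-in for copy_page_from_iter_atomic(): fail the
 * first three attempts, then grant at most 64KiB per call.
 */
static size_t flaky_calls;

static size_t flaky_copy(size_t bytes, void *ctx)
{
	(void)ctx;
	if (flaky_calls++ < 3)
		return 0;
	return bytes > 65536 ? 65536 : bytes;
}
```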


* Re: [PATCH v3 8/8] iomap: Copy larger chunks from userspace
  2023-06-17  7:13   ` Ritesh Harjani
@ 2023-06-19 17:09     ` Matthew Wilcox
  2023-07-10  3:57       ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2023-06-19 17:09 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

On Sat, Jun 17, 2023 at 12:43:59PM +0530, Ritesh Harjani wrote:
> >  	do {
> >  		struct folio *folio;
> > -		struct page *page;
> > -		unsigned long offset;	/* Offset into pagecache page */
> > -		unsigned long bytes;	/* Bytes to write to page */
> > +		size_t offset;		/* Offset into folio */
> > +		unsigned long bytes;	/* Bytes to write to folio */
> 
> Why not make the type of "bytes" size_t, the same as "copied"?

Sure, makes sense.

> > @@ -835,6 +837,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
> >  			 */
> >  			if (copied)
> >  				bytes = copied;
> 
> I think with your code change, which moves the "again" label, the
> above line doing bytes = copied becomes dead code.  We recalculate
> bytes after the "again" label anyway.

Yes, you're right.  Removed.  I had a good think about whether this
forgotten removal meant an overlooked problem, but I can't see one.


* Re: [PATCH v3 0/8] Create large folios in iomap buffered write path
  2023-06-12 20:39 [PATCH v3 0/8] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (7 preceding siblings ...)
  2023-06-12 20:39 ` [PATCH v3 8/8] iomap: Copy larger chunks from userspace Matthew Wilcox (Oracle)
@ 2023-06-21 12:03 ` Wang Yugui
  2023-07-10  3:55   ` Matthew Wilcox
  8 siblings, 1 reply; 33+ messages in thread
From: Wang Yugui @ 2023-06-21 12:03 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

Hi,

> Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
> to improve worst-case latency.  Unfortunately, this had the effect of
> limiting the performance of:
> 
> fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
>         -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
>         -numjobs=4 -directory=/mnt/test
> 
> The problem ends up being lock contention on the i_pages spinlock as we
> clear the writeback bit on each folio (and propagate that up through
> the tree).  By using larger folios, we decrease the number of folios
> to be processed by a factor of 256 for this benchmark, eliminating the
> lock contention.
> 
> It's also the right thing to do.  This is a project that has been on
> the back burner for years, it just hasn't been important enough to do
> before now.
> 
> I think it's probably best if this goes through the iomap tree since
> the changes outside iomap are either to the page cache or they're
> trivial.
> 
> v3:
>  - Fix the handling of compound highmem pages in copy_page_from_iter_atomic()
>  - Rename fgp_t to fgf_t
>  - Clarify some wording in the documentation

These v3 patches break Linux 6.4-rc7.

fstests (btrfs/007 and more) fail with the v3 patches,
but pass without them.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/06/21

> v2:
>  - Fix misplaced semicolon
>  - Rename fgp_order to fgp_set_order
>  - Rename FGP_ORDER to FGP_GET_ORDER
>  - Add fgp_t
>  - Update the documentation for ->release_folio
>  - Fix iomap_invalidate_folio()
>  - Update iomap_release_folio()
> 
> Matthew Wilcox (Oracle) (8):
>   iov_iter: Handle compound highmem pages in
>     copy_page_from_iter_atomic()
>   iomap: Remove large folio handling in iomap_invalidate_folio()
>   doc: Correct the description of ->release_folio
>   iomap: Remove unnecessary test from iomap_release_folio()
>   filemap: Add fgf_t typedef
>   filemap: Allow __filemap_get_folio to allocate large folios
>   iomap: Create large folios in the buffered write path
>   iomap: Copy larger chunks from userspace
> 
>  Documentation/filesystems/locking.rst | 15 ++++--
>  fs/btrfs/file.c                       |  6 +--
>  fs/f2fs/compress.c                    |  2 +-
>  fs/f2fs/f2fs.h                        |  2 +-
>  fs/gfs2/bmap.c                        |  2 +-
>  fs/iomap/buffered-io.c                | 43 ++++++++--------
>  include/linux/iomap.h                 |  2 +-
>  include/linux/pagemap.h               | 71 ++++++++++++++++++++++-----
>  lib/iov_iter.c                        | 43 ++++++++++------
>  mm/filemap.c                          | 61 ++++++++++++-----------
>  mm/folio-compat.c                     |  2 +-
>  mm/readahead.c                        | 13 -----
>  12 files changed, 159 insertions(+), 103 deletions(-)
> 
> -- 
> 2.39.2




* Re: [PATCH v3 1/8] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()
  2023-06-13  4:52   ` Christoph Hellwig
@ 2023-07-10  3:36     ` Matthew Wilcox
  0 siblings, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2023-07-10  3:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Darrick J . Wong

On Mon, Jun 12, 2023 at 09:52:52PM -0700, Christoph Hellwig wrote:
> Looks good:
> 

I'm assuming this was intended to have
Reviewed-by: Christoph Hellwig <hch@lst.de>

after it, so I'm taking the liberty of adding that.


* Re: [PATCH v3 8/8] iomap: Copy larger chunks from userspace
  2023-06-13 19:43     ` Matthew Wilcox
@ 2023-07-10  3:45       ` Matthew Wilcox
  0 siblings, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2023-07-10  3:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Darrick J . Wong

On Tue, Jun 13, 2023 at 08:43:49PM +0100, Matthew Wilcox wrote:
> > > -		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
> > > +		copied = copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
> > 
> > Would be nice to avoid the overly long line here
> 
> The plan is to turn that into:
> 
> 		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
> 
> in the fairly near future.

Kent bugged me to add copy_folio_from_iter_atomic() now, and he's right,
so "the near future" is v4.


* Re: [PATCH v3 0/8] Create large folios in iomap buffered write path
  2023-06-21 12:03 ` [PATCH v3 0/8] Create large folios in iomap buffered write path Wang Yugui
@ 2023-07-10  3:55   ` Matthew Wilcox
  0 siblings, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2023-07-10  3:55 UTC (permalink / raw)
  To: Wang Yugui
  Cc: linux-fsdevel, linux-xfs, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

On Wed, Jun 21, 2023 at 08:03:13PM +0800, Wang Yugui wrote:
> These v3 patches break Linux 6.4-rc7.
> 
> fstests (btrfs/007 and more) fail with the v3 patches,
> but pass without them.

Sorry, I didn't see this email until just now when I was reviewing all
the comments I got on v3.

I think I found the problem, and it's in patch 1.  Please retest v4
when I send it in a few hours after it's completed a test run.



* Re: [PATCH v3 8/8] iomap: Copy larger chunks from userspace
  2023-06-19 17:09     ` Matthew Wilcox
@ 2023-07-10  3:57       ` Matthew Wilcox
  0 siblings, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2023-07-10  3:57 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

On Mon, Jun 19, 2023 at 06:09:42PM +0100, Matthew Wilcox wrote:
> > > @@ -835,6 +837,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
> > >  			 */
> > >  			if (copied)
> > >  				bytes = copied;
> > 
> > I think with your code change, which moves the "again" label, the
> > above line doing bytes = copied becomes dead code.  We recalculate
> > bytes after the "again" label anyway.
> 
> Yes, you're right.  Removed.  I had a good think about whether this
> forgotten removal meant an overlooked problem, but I can't see one.

... also, removing this means that 'goto again' has the same effect as
'continue' ... which means we can actually restructure the loop slightly
and avoid the again label, the goto and even the continue.  Patch to
follow in a few hours.


end of thread, other threads:[~2023-07-10  3:57 UTC | newest]

Thread overview: 33+ messages
-- links below jump to the message on this page --
2023-06-12 20:39 [PATCH v3 0/8] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
2023-06-12 20:39 ` [PATCH v3 1/8] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Matthew Wilcox (Oracle)
2023-06-13  4:52   ` Christoph Hellwig
2023-07-10  3:36     ` Matthew Wilcox
2023-06-12 20:39 ` [PATCH v3 2/8] iomap: Remove large folio handling in iomap_invalidate_folio() Matthew Wilcox (Oracle)
2023-06-12 20:39 ` [PATCH v3 3/8] doc: Correct the description of ->release_folio Matthew Wilcox (Oracle)
2023-06-13  4:53   ` Christoph Hellwig
2023-06-12 20:39 ` [PATCH v3 4/8] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
2023-06-13  4:53   ` Christoph Hellwig
2023-06-13 16:19   ` Matthew Wilcox
2023-06-12 20:39 ` [PATCH v3 5/8] filemap: Add fgf_t typedef Matthew Wilcox (Oracle)
2023-06-13  4:53   ` Christoph Hellwig
2023-06-12 20:39 ` [PATCH v3 6/8] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
2023-06-12 22:49   ` Dave Chinner
2023-06-13  0:42     ` Matthew Wilcox
2023-06-13  1:30       ` Dave Chinner
2023-06-13  2:00         ` Matthew Wilcox
2023-06-13  7:54           ` Dave Chinner
2023-06-13 13:34             ` Matthew Wilcox
2023-06-16 17:45             ` Matthew Wilcox
2023-06-16 22:40               ` Dave Chinner
2023-06-13  4:56   ` Christoph Hellwig
2023-06-12 20:39 ` [PATCH v3 7/8] iomap: Create large folios in the buffered write path Matthew Wilcox (Oracle)
2023-06-13  4:56   ` Christoph Hellwig
2023-06-12 20:39 ` [PATCH v3 8/8] iomap: Copy larger chunks from userspace Matthew Wilcox (Oracle)
2023-06-13  4:58   ` Christoph Hellwig
2023-06-13 19:43     ` Matthew Wilcox
2023-07-10  3:45       ` Matthew Wilcox
2023-06-17  7:13   ` Ritesh Harjani
2023-06-19 17:09     ` Matthew Wilcox
2023-07-10  3:57       ` Matthew Wilcox
2023-06-21 12:03 ` [PATCH v3 0/8] Create large folios in iomap buffered write path Wang Yugui
2023-07-10  3:55   ` Matthew Wilcox
