* [PATCH v2 0/7] Create large folios in iomap buffered write path
@ 2023-06-02 22:24 Matthew Wilcox (Oracle)
  2023-06-02 22:24 ` [PATCH v2 1/7] iomap: Remove large folio handling in iomap_invalidate_folio() Matthew Wilcox (Oracle)
                   ` (7 more replies)
  0 siblings, 8 replies; 39+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-02 22:24 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
to improve worst-case latency.  Unfortunately, this had the effect of
limiting the performance of:

fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
        -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
        -numjobs=4 -directory=/mnt/test

The problem ends up being lock contention on the i_pages spinlock as we
clear the writeback bit on each folio (and propagate that up through
the tree).  By using larger folios, we decrease the number of folios
to be processed by a factor of 256 for this benchmark, eliminating the
lock contention.

It's also the right thing to do.  This is a project that has been on
the back burner for years, it just hasn't been important enough to do
before now.

I think it's probably best if this goes through the iomap tree since
the changes outside iomap are either to the page cache or they're
trivial.

v2:
 - Fix misplaced semicolon
 - Rename fgp_order to fgp_set_order
 - Rename FGP_ORDER to FGP_GET_ORDER
 - Add fgp_t
 - Update the documentation for ->release_folio
 - Fix iomap_invalidate_folio()
 - Update iomap_release_folio()

Matthew Wilcox (Oracle) (7):
  iomap: Remove large folio handling in iomap_invalidate_folio()
  doc: Correct the description of ->release_folio
  iomap: Remove unnecessary test from iomap_release_folio()
  filemap: Add fgp_t typedef
  filemap: Allow __filemap_get_folio to allocate large folios
  iomap: Create large folios in the buffered write path
  iomap: Copy larger chunks from userspace

 Documentation/filesystems/locking.rst | 14 ++++--
 fs/btrfs/file.c                       |  6 +--
 fs/f2fs/compress.c                    |  2 +-
 fs/f2fs/f2fs.h                        |  2 +-
 fs/gfs2/bmap.c                        |  2 +-
 fs/iomap/buffered-io.c                | 43 ++++++++--------
 include/linux/iomap.h                 |  2 +-
 include/linux/pagemap.h               | 71 ++++++++++++++++++++++-----
 mm/filemap.c                          | 61 ++++++++++++-----------
 mm/folio-compat.c                     |  2 +-
 mm/readahead.c                        | 13 -----
 11 files changed, 130 insertions(+), 88 deletions(-)

-- 
2.39.2



* [PATCH v2 1/7] iomap: Remove large folio handling in iomap_invalidate_folio()
  2023-06-02 22:24 [PATCH v2 0/7] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
@ 2023-06-02 22:24 ` Matthew Wilcox (Oracle)
  2023-06-04 17:58   ` Darrick J. Wong
  2023-06-05  7:11   ` Christoph Hellwig
  2023-06-02 22:24 ` [PATCH v2 2/7] doc: Correct the description of ->release_folio Matthew Wilcox (Oracle)
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-02 22:24 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

We do not need to release the iomap_page in iomap_invalidate_folio()
to allow the folio to be split.  The splitting code will call
->release_folio() if there is still per-fs private data attached to
the folio.  At that point, we will check if the folio is still dirty
and decline to release the iomap_page.  It is possible to trigger the
warning in perfectly legitimate circumstances (e.g. if a disk read fails,
we do a partial write to the folio, then we truncate the folio), and
releasing the iomap_page at that point causes those writes to be lost.

Fixes: 60d8231089f0 ("iomap: Support large folios in invalidatepage")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 063133ec77f4..08ee293c4117 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -508,11 +508,6 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 		WARN_ON_ONCE(folio_test_writeback(folio));
 		folio_cancel_dirty(folio);
 		iomap_page_release(folio);
-	} else if (folio_test_large(folio)) {
-		/* Must release the iop so the page can be split */
-		WARN_ON_ONCE(!folio_test_uptodate(folio) &&
-			     folio_test_dirty(folio));
-		iomap_page_release(folio);
 	}
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
-- 
2.39.2
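
For concreteness, the legitimate sequence the commit message describes can
be spelled out (this is a reconstruction from the text above, not a
reproducer):

1) a read from disk fails, so the large folio is not uptodate;
2) a partial write dirties part of the folio, with the iomap_page tracking
   the sub-folio state;
3) a truncate partially invalidates the folio.  The removed branch would
   fire the warning here (the folio is dirty but not uptodate) and release
   the iomap_page anyway, discarding the state that the writes from step 2
   depend on.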



* [PATCH v2 2/7] doc: Correct the description of ->release_folio
  2023-06-02 22:24 [PATCH v2 0/7] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
  2023-06-02 22:24 ` [PATCH v2 1/7] iomap: Remove large folio handling in iomap_invalidate_folio() Matthew Wilcox (Oracle)
@ 2023-06-02 22:24 ` Matthew Wilcox (Oracle)
  2023-06-04 17:55   ` Darrick J. Wong
  2023-06-05  7:12   ` Christoph Hellwig
  2023-06-02 22:24 ` [PATCH v2 3/7] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-02 22:24 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

The filesystem ->release_folio method is called under more circumstances
now than when the documentation was written.  The second sentence
describing the interpretation of the return value is the wrong polarity
(false indicates failure, not success).  And the third sentence is also
wrong (the kernel calls try_to_free_buffers() instead).

So replace the entire paragraph with a detailed description of what the
state of the folio may be, the meaning of the gfp parameter, why the
method is being called and what the filesystem is expected to do.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 Documentation/filesystems/locking.rst | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index aa1a233b0fa8..91dc9d5bc602 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -374,10 +374,16 @@ invalidate_lock before invalidating page cache in truncate / hole punch
 path (and thus calling into ->invalidate_folio) to block races between page
 cache invalidation and page cache filling functions (fault, read, ...).
 
-->release_folio() is called when the kernel is about to try to drop the
-buffers from the folio in preparation for freeing it.  It returns false to
-indicate that the buffers are (or may be) freeable.  If ->release_folio is
-NULL, the kernel assumes that the fs has no private interest in the buffers.
+->release_folio() is called when the MM wants to make a change to the
+folio that would invalidate the filesystem's private data.  For example,
+it may be about to be removed from the address_space or split.  The folio
+is locked and not under writeback.  It may be dirty.  The gfp parameter is
+not usually used for allocation, but rather to indicate what the filesystem
+may do to attempt to free the private data.  The filesystem may
+return false to indicate that the folio's private data cannot be freed.
+If it returns true, it should have already removed the private data from
+the folio.  If a filesystem does not provide a ->release_folio method,
+the kernel will call try_to_free_buffers().
 
 ->free_folio() is called when the kernel has dropped the folio
 from the page cache.
-- 
2.39.2



* [PATCH v2 3/7] iomap: Remove unnecessary test from iomap_release_folio()
  2023-06-02 22:24 [PATCH v2 0/7] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
  2023-06-02 22:24 ` [PATCH v2 1/7] iomap: Remove large folio handling in iomap_invalidate_folio() Matthew Wilcox (Oracle)
  2023-06-02 22:24 ` [PATCH v2 2/7] doc: Correct the description of ->release_folio Matthew Wilcox (Oracle)
@ 2023-06-02 22:24 ` Matthew Wilcox (Oracle)
  2023-06-04 18:01   ` Darrick J. Wong
  2023-06-05  7:13   ` Christoph Hellwig
  2023-06-02 22:24 ` [PATCH v2 4/7] filemap: Add fgp_t typedef Matthew Wilcox (Oracle)
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-02 22:24 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

The check for the folio being under writeback is unnecessary; the caller
has checked this and the folio is locked, so the folio cannot be under
writeback at this point.

The comment is somewhat misleading in that it talks about one specific
situation in which we can see a dirty folio.  There are others, so change
the comment to explain why we can't release the iomap_page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 08ee293c4117..2054b85c9d9b 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -483,12 +483,10 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
 			folio_size(folio));
 
 	/*
-	 * mm accommodates an old ext3 case where clean folios might
-	 * not have had the dirty bit cleared.  Thus, it can send actual
-	 * dirty folios to ->release_folio() via shrink_active_list();
-	 * skip those here.
+	 * If the folio is dirty, we refuse to release our metadata because
+	 * it may be partially dirty (FIXME, add a test for that).
 	 */
-	if (folio_test_dirty(folio) || folio_test_writeback(folio))
+	if (folio_test_dirty(folio))
 		return false;
 	iomap_page_release(folio);
 	return true;
-- 
2.39.2



* [PATCH v2 4/7] filemap: Add fgp_t typedef
  2023-06-02 22:24 [PATCH v2 0/7] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (2 preceding siblings ...)
  2023-06-02 22:24 ` [PATCH v2 3/7] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
@ 2023-06-02 22:24 ` Matthew Wilcox (Oracle)
  2023-06-04 18:02   ` Darrick J. Wong
  2023-06-05  7:14   ` Christoph Hellwig
  2023-06-02 22:24 ` [PATCH v2 5/7] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-02 22:24 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

Similarly to gfp_t, define fgp_t as its own type to prevent various
misuses and confusion.  Move the documentation to the definition of the
type instead of burying it in the __filemap_get_folio() documentation.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/btrfs/file.c         |  6 +++---
 fs/f2fs/compress.c      |  2 +-
 fs/f2fs/f2fs.h          |  2 +-
 fs/iomap/buffered-io.c  |  2 +-
 include/linux/pagemap.h | 48 +++++++++++++++++++++++++++++++----------
 mm/filemap.c            | 19 ++--------------
 mm/folio-compat.c       |  2 +-
 7 files changed, 46 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f649647392e0..4bbbbdafb472 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -876,9 +876,9 @@ static int prepare_uptodate_page(struct inode *inode,
 	return 0;
 }
 
-static unsigned int get_prepare_fgp_flags(bool nowait)
+static fgp_t get_prepare_fgp_flags(bool nowait)
 {
-	unsigned int fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
+	fgp_t fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
 
 	if (nowait)
 		fgp_flags |= FGP_NOWAIT;
@@ -910,7 +910,7 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
 	int i;
 	unsigned long index = pos >> PAGE_SHIFT;
 	gfp_t mask = get_prepare_gfp_flags(inode, nowait);
-	unsigned int fgp_flags = get_prepare_fgp_flags(nowait);
+	fgp_t fgp_flags = get_prepare_fgp_flags(nowait);
 	int err = 0;
 	int faili;
 
diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c
index 11653fa79289..ee358f5f2a86 100644
--- a/fs/f2fs/compress.c
+++ b/fs/f2fs/compress.c
@@ -1019,7 +1019,7 @@ static int prepare_compress_overwrite(struct compress_ctx *cc,
 	struct address_space *mapping = cc->inode->i_mapping;
 	struct page *page;
 	sector_t last_block_in_bio;
-	unsigned fgp_flag = FGP_LOCK | FGP_WRITE | FGP_CREAT;
+	fgp_t fgp_flag = FGP_LOCK | FGP_WRITE | FGP_CREAT;
 	pgoff_t start_idx = start_idx_of_cluster(cc);
 	int i, ret;
 
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index d211ee89c158..a03c3df75f7c 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -2715,7 +2715,7 @@ static inline struct page *f2fs_grab_cache_page(struct address_space *mapping,
 
 static inline struct page *f2fs_pagecache_get_page(
 				struct address_space *mapping, pgoff_t index,
-				int fgp_flags, gfp_t gfp_mask)
+				fgp_t fgp_flags, gfp_t gfp_mask)
 {
 	if (time_to_inject(F2FS_M_SB(mapping), FAULT_PAGE_GET))
 		return NULL;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 2054b85c9d9b..30d53b50ee0f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -467,7 +467,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  */
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
 {
-	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
+	fgp_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a56308a9d1a4..7ab57a2bb576 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -497,22 +497,48 @@ pgoff_t page_cache_next_miss(struct address_space *mapping,
 pgoff_t page_cache_prev_miss(struct address_space *mapping,
 			     pgoff_t index, unsigned long max_scan);
 
-#define FGP_ACCESSED		0x00000001
-#define FGP_LOCK		0x00000002
-#define FGP_CREAT		0x00000004
-#define FGP_WRITE		0x00000008
-#define FGP_NOFS		0x00000010
-#define FGP_NOWAIT		0x00000020
-#define FGP_FOR_MMAP		0x00000040
-#define FGP_STABLE		0x00000080
+/**
+ * typedef fgp_t - Flags for getting folios from the page cache.
+ *
+ * Most users of the page cache will not need to use these flags;
+ * there are convenience functions such as filemap_get_folio() and
+ * filemap_lock_folio().  For users which need more control over exactly
+ * what is done with the folios, these flags to __filemap_get_folio()
+ * are available.
+ *
+ * * %FGP_ACCESSED - The folio will be marked accessed.
+ * * %FGP_LOCK - The folio is returned locked.
+ * * %FGP_CREAT - If no folio is present then a new folio is allocated,
+ *   added to the page cache and the VM's LRU list.  The folio is
+ *   returned locked.
+ * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
+ *   folio is already in cache.  If the folio was allocated, unlock it
+ *   before returning so the caller can do the same dance.
+ * * %FGP_WRITE - The folio will be written to by the caller.
+ * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
+ * * %FGP_NOWAIT - Don't block on the folio lock.
+ * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
+ * * %FGP_WRITEBEGIN - The flags to use in a filesystem write_begin()
+ *   implementation.
+ */
+typedef unsigned int __bitwise fgp_t;
+
+#define FGP_ACCESSED		((__force fgp_t)0x00000001)
+#define FGP_LOCK		((__force fgp_t)0x00000002)
+#define FGP_CREAT		((__force fgp_t)0x00000004)
+#define FGP_WRITE		((__force fgp_t)0x00000008)
+#define FGP_NOFS		((__force fgp_t)0x00000010)
+#define FGP_NOWAIT		((__force fgp_t)0x00000020)
+#define FGP_FOR_MMAP		((__force fgp_t)0x00000040)
+#define FGP_STABLE		((__force fgp_t)0x00000080)
 
 #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
 
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		fgp_t fgp_flags, gfp_t gfp);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		fgp_t fgp_flags, gfp_t gfp);
 
 /**
  * filemap_get_folio - Find and get a folio.
@@ -586,7 +612,7 @@ static inline struct page *find_get_page(struct address_space *mapping,
 }
 
 static inline struct page *find_get_page_flags(struct address_space *mapping,
-					pgoff_t offset, int fgp_flags)
+					pgoff_t offset, fgp_t fgp_flags)
 {
 	return pagecache_get_page(mapping, offset, fgp_flags, 0);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index b4c9bd368b7e..eb89a815f2f8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1887,30 +1887,15 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
  *
  * Looks up the page cache entry at @mapping & @index.
  *
- * @fgp_flags can be zero or more of these flags:
- *
- * * %FGP_ACCESSED - The folio will be marked accessed.
- * * %FGP_LOCK - The folio is returned locked.
- * * %FGP_CREAT - If no page is present then a new page is allocated using
- *   @gfp and added to the page cache and the VM's LRU list.
- *   The page is returned locked and with an increased refcount.
- * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
- *   page is already in cache.  If the page was allocated, unlock it before
- *   returning so the caller can do the same dance.
- * * %FGP_WRITE - The page will be written to by the caller.
- * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
- * * %FGP_NOWAIT - Don't get blocked by page lock.
- * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
- *
  * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
  * if the %GFP flags specified for %FGP_CREAT are atomic.
  *
- * If there is a page cache page, it is returned with an increased refcount.
+ * If this function returns a folio, it is returned with an increased refcount.
  *
  * Return: The found folio or an ERR_PTR() otherwise.
  */
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		fgp_t fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index c6f056c20503..ae1a737d2533 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -92,7 +92,7 @@ EXPORT_SYMBOL(add_to_page_cache_lru);
 
 noinline
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		fgp_t fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
-- 
2.39.2
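
As an aside on the __bitwise pattern: the sketch below is a minimal,
self-contained userspace illustration (the __CHECKER__ fallbacks mirror the
kernel's compiler_types.h; nothing else here is kernel code).  Under a
normal compiler the annotations expand to nothing, so it builds and runs as
plain C; under sparse, passing anything but an fgp_t where one is expected
draws a warning, which an "int fgp_flags" parameter never could.

#ifdef __CHECKER__
#define __bitwise	__attribute__((bitwise))
#define __force		__attribute__((force))
#else
#define __bitwise
#define __force
#endif

#include <stdio.h>

typedef unsigned int __bitwise fgp_t;

#define FGP_LOCK	((__force fgp_t)0x00000002)
#define FGP_CREAT	((__force fgp_t)0x00000004)

static void get_folio(fgp_t fgp_flags)
{
	printf("fgp_flags = %#x\n", (__force unsigned int)fgp_flags);
}

int main(void)
{
	get_folio(FGP_LOCK | FGP_CREAT);	/* both operands are fgp_t: fine */
	/* get_folio(2); -- sparse would warn: plain int used as fgp_t */
	return 0;
}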



* [PATCH v2 5/7] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-02 22:24 [PATCH v2 0/7] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (3 preceding siblings ...)
  2023-06-02 22:24 ` [PATCH v2 4/7] filemap: Add fgp_t typedef Matthew Wilcox (Oracle)
@ 2023-06-02 22:24 ` Matthew Wilcox (Oracle)
  2023-06-04 18:09   ` Darrick J. Wong
  2023-06-05  7:16   ` Christoph Hellwig
  2023-06-02 22:24 ` [PATCH v2 6/7] iomap: Create large folios in the buffered write path Matthew Wilcox (Oracle)
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-02 22:24 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

Allow callers of __filemap_get_folio() to specify a preferred folio
order in the FGP flags.  This is only honoured in the FGP_CREAT path;
if there is already a folio in the page cache that covers the index,
we will return it, no matter what its order is.  No create-around is
attempted; we will only create folios which start at the specified index.
Unmodified callers will continue to allocate order 0 folios.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 23 ++++++++++++++++++++++
 mm/filemap.c            | 42 ++++++++++++++++++++++++++++-------------
 mm/readahead.c          | 13 -------------
 3 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7ab57a2bb576..667ce668f438 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -466,6 +466,19 @@ static inline void *detach_page_private(struct page *page)
 	return folio_detach_private(page_folio(page));
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
 #else
@@ -531,9 +544,19 @@ typedef unsigned int __bitwise fgp_t;
 #define FGP_NOWAIT		((__force fgp_t)0x00000020)
 #define FGP_FOR_MMAP		((__force fgp_t)0x00000040)
 #define FGP_STABLE		((__force fgp_t)0x00000080)
+#define FGP_GET_ORDER(fgp)	(((__force unsigned)fgp) >> 26)	/* top 6 bits */
 
 #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
 
+static inline fgp_t fgp_set_order(size_t size)
+{
+	unsigned int shift = ilog2(size);
+
+	if (shift <= PAGE_SHIFT)
+		return 0;
+	return (__force fgp_t)((shift - PAGE_SHIFT) << 26);
+}
+
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		fgp_t fgp_flags, gfp_t gfp);
diff --git a/mm/filemap.c b/mm/filemap.c
index eb89a815f2f8..10ea9321c36e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1937,7 +1937,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
+		unsigned order = FGP_GET_ORDER(fgp_flags);
 		int err;
+
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
 		if (fgp_flags & FGP_NOFS)
@@ -1946,26 +1948,40 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp &= ~GFP_KERNEL;
 			gfp |= GFP_NOWAIT | __GFP_NOWARN;
 		}
-
-		folio = filemap_alloc_folio(gfp, 0);
-		if (!folio)
-			return ERR_PTR(-ENOMEM);
-
 		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
 			fgp_flags |= FGP_LOCK;
 
-		/* Init accessed so avoid atomic mark_page_accessed later */
-		if (fgp_flags & FGP_ACCESSED)
-			__folio_set_referenced(folio);
+		if (!mapping_large_folio_support(mapping))
+			order = 0;
+		if (order > MAX_PAGECACHE_ORDER)
+			order = MAX_PAGECACHE_ORDER;
+		/* If we're not aligned, allocate a smaller folio */
+		if (index & ((1UL << order) - 1))
+			order = __ffs(index);
 
-		err = filemap_add_folio(mapping, folio, index, gfp);
-		if (unlikely(err)) {
+		do {
+			err = -ENOMEM;
+			if (order == 1)
+				order = 0;
+			folio = filemap_alloc_folio(gfp, order);
+			if (!folio)
+				continue;
+
+			/* Init accessed so avoid atomic mark_page_accessed later */
+			if (fgp_flags & FGP_ACCESSED)
+				__folio_set_referenced(folio);
+
+			err = filemap_add_folio(mapping, folio, index, gfp);
+			if (!err)
+				break;
 			folio_put(folio);
 			folio = NULL;
-			if (err == -EEXIST)
-				goto repeat;
-		}
+		} while (order-- > 0);
 
+		if (err == -EEXIST)
+			goto repeat;
+		if (err)
+			return ERR_PTR(err);
 		/*
 		 * filemap_add_folio locks the page, and for mmap
 		 * we expect an unlocked page.
diff --git a/mm/readahead.c b/mm/readahead.c
index 47afbca1d122..59a071badb90 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -462,19 +462,6 @@ static int try_context_readahead(struct address_space *mapping,
 	return 1;
 }
 
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size.  I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER	8
-#endif
-
 static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 		pgoff_t mark, unsigned int order, gfp_t gfp)
 {
-- 
2.39.2
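
To see the new encoding in action, here is a self-contained sketch of
fgp_set_order()/FGP_GET_ORDER() (PAGE_SHIFT and ilog2() are re-created
here purely for illustration; the kernel's definitions are authoritative):

#include <stdio.h>
#include <stddef.h>

#define PAGE_SHIFT	12	/* assume 4kB pages */

typedef unsigned int fgp_t;

#define FGP_GET_ORDER(fgp)	(((unsigned)(fgp)) >> 26)	/* top 6 bits */

static unsigned int ilog2(size_t n)	/* floor(log2(n)), n > 0 */
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

static fgp_t fgp_set_order(size_t size)
{
	unsigned int shift = ilog2(size);

	if (shift <= PAGE_SHIFT)
		return 0;	/* writes within one page keep order 0 */
	return (fgp_t)((shift - PAGE_SHIFT) << 26);
}

int main(void)
{
	/* a 1MiB write requests an order-8 (256-page) folio */
	printf("%u\n", FGP_GET_ORDER(fgp_set_order(1024 * 1024)));
	/* a 100-byte write requests the default order 0 */
	printf("%u\n", FGP_GET_ORDER(fgp_set_order(100)));
	return 0;
}

__filemap_get_folio() then clamps the requested order to
MAX_PAGECACHE_ORDER and to the alignment of the index, and falls back to
smaller orders if the allocation fails, as the diff above shows.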



* [PATCH v2 6/7] iomap: Create large folios in the buffered write path
  2023-06-02 22:24 [PATCH v2 0/7] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (4 preceding siblings ...)
  2023-06-02 22:24 ` [PATCH v2 5/7] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
@ 2023-06-02 22:24 ` Matthew Wilcox (Oracle)
  2023-06-04 18:10   ` Darrick J. Wong
  2023-06-05  7:16   ` Christoph Hellwig
  2023-06-02 22:24 ` [PATCH v2 7/7] iomap: Copy larger chunks from userspace Matthew Wilcox (Oracle)
  2023-06-04  0:19 ` [PATCH v2 0/7] Create large folios in iomap buffered write path Wang Yugui
  7 siblings, 2 replies; 39+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-02 22:24 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

Use the size of the write as a hint for the size of the folio to create.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/gfs2/bmap.c         | 2 +-
 fs/iomap/buffered-io.c | 6 ++++--
 include/linux/iomap.h  | 2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index c739b258a2d9..3702e5e47b0f 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -971,7 +971,7 @@ gfs2_iomap_get_folio(struct iomap_iter *iter, loff_t pos, unsigned len)
 	if (status)
 		return ERR_PTR(status);
 
-	folio = iomap_get_folio(iter, pos);
+	folio = iomap_get_folio(iter, pos, len);
 	if (IS_ERR(folio))
 		gfs2_trans_end(sdp);
 	return folio;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 30d53b50ee0f..a10f9c037515 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -461,16 +461,18 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  * iomap_get_folio - get a folio reference for writing
  * @iter: iteration structure
  * @pos: start offset of write
+ * @len: length of write
  *
  * Returns a locked reference to the folio at @pos, or an error pointer if the
  * folio could not be obtained.
  */
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgp_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
+	fgp |= fgp_set_order(len);
 
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
 			fgp, mapping_gfp_mask(iter->inode->i_mapping));
@@ -596,7 +598,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
 	if (folio_ops && folio_ops->get_folio)
 		return folio_ops->get_folio(iter, pos, len);
 	else
-		return iomap_get_folio(iter, pos);
+		return iomap_get_folio(iter, pos, len);
 }
 
 static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e2b836c2e119..80facb9c9e5b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -261,7 +261,7 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
 void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
-- 
2.39.2



* [PATCH v2 7/7] iomap: Copy larger chunks from userspace
  2023-06-02 22:24 [PATCH v2 0/7] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (5 preceding siblings ...)
  2023-06-02 22:24 ` [PATCH v2 6/7] iomap: Create large folios in the buffered write path Matthew Wilcox (Oracle)
@ 2023-06-02 22:24 ` Matthew Wilcox (Oracle)
  2023-06-04 18:29   ` Darrick J. Wong
  2023-06-04  0:19 ` [PATCH v2 0/7] Create large folios in iomap buffered write path Wang Yugui
  7 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-02 22:24 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

If we have a large folio, we can copy in larger chunks than PAGE_SIZE.
Start at the maximum page cache size and shrink by half every time we
hit the "we are short on memory" problem.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index a10f9c037515..10434b07e0f9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -768,6 +768,7 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 {
 	loff_t length = iomap_length(iter);
+	size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
 	loff_t pos = iter->pos;
 	ssize_t written = 0;
 	long status = 0;
@@ -776,15 +777,13 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 
 	do {
 		struct folio *folio;
-		struct page *page;
-		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long bytes;	/* Bytes to write to page */
+		size_t offset;		/* Offset into folio */
+		unsigned long bytes;	/* Bytes to write to folio */
 		size_t copied;		/* Bytes copied from user */
 
-		offset = offset_in_page(pos);
-		bytes = min_t(unsigned long, PAGE_SIZE - offset,
-						iov_iter_count(i));
 again:
+		offset = pos & (chunk - 1);
+		bytes = min(chunk - offset, iov_iter_count(i));
 		status = balance_dirty_pages_ratelimited_flags(mapping,
 							       bdp_flags);
 		if (unlikely(status))
@@ -814,11 +813,14 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
-		page = folio_file_page(folio, pos >> PAGE_SHIFT);
+		offset = offset_in_folio(folio, pos);
+		if (bytes > folio_size(folio) - offset)
+			bytes = folio_size(folio) - offset;
+
 		if (mapping_writably_mapped(mapping))
-			flush_dcache_page(page);
+			flush_dcache_folio(folio);
 
-		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
+		copied = copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
 
 		status = iomap_write_end(iter, pos, bytes, copied, folio);
 
@@ -835,6 +837,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 			 */
 			if (copied)
 				bytes = copied;
+			if (chunk > PAGE_SIZE)
+				chunk /= 2;
 			goto again;
 		}
 		pos += status;
-- 
2.39.2
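
The retry strategy is easiest to see in isolation.  The sketch below is a
userspace model of the loop above: try_copy() stands in for
copy_page_from_iter_atomic() making no progress under memory pressure, and
the constants are assumptions for illustration only.

#include <stdio.h>
#include <stddef.h>

#define PAGE_SIZE		4096UL
#define MAX_PAGECACHE_ORDER	9

static int failures = 2;	/* pretend the first two copies stall */

static size_t try_copy(size_t bytes)
{
	if (failures) {
		failures--;
		return 0;
	}
	return bytes;
}

int main(void)
{
	size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;	/* start big */
	size_t pos = 0, remaining = 3 * 1024 * 1024;

	while (remaining) {
		size_t offset = pos & (chunk - 1);
		size_t bytes = chunk - offset;
		size_t copied;

		if (bytes > remaining)
			bytes = remaining;
		copied = try_copy(bytes);
		if (!copied) {
			/* no progress: halve the chunk, never below a page */
			if (chunk > PAGE_SIZE)
				chunk /= 2;
			continue;
		}
		printf("copied %zu bytes at %zu (chunk %zu)\n",
		       copied, pos, chunk);
		pos += copied;
		remaining -= copied;
	}
	return 0;
}

Note that, as in the patch, the chunk only ever shrinks within a single
write; after a short copy the rest of the write proceeds at the smaller
size.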



* Re: [PATCH v2 0/7] Create large folios in iomap buffered write path
  2023-06-02 22:24 [PATCH v2 0/7] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (6 preceding siblings ...)
  2023-06-02 22:24 ` [PATCH v2 7/7] iomap: Copy larger chunks from userspace Matthew Wilcox (Oracle)
@ 2023-06-04  0:19 ` Wang Yugui
  7 siblings, 0 replies; 39+ messages in thread
From: Wang Yugui @ 2023-06-04  0:19 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Dave Chinner, Christoph Hellwig,
	Darrick J . Wong

[-- Attachment #1: Type: text/plain, Size: 3301 bytes --]

Hi,

> Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
> to improve worst-case latency.  Unfortunately, this had the effect of
> limiting the performance of:
> 
> fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
>         -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
>         -numjobs=4 -directory=/mnt/test
> 
> The problem ends up being lock contention on the i_pages spinlock as we
> clear the writeback bit on each folio (and propagate that up through
> the tree).  By using larger folios, we decrease the number of folios
> to be processed by a factor of 256 for this benchmark, eliminating the
> lock contention.
> 
> It's also the right thing to do.  This is a project that has been on
> the back burner for years, it just hasn't been important enough to do
> before now.
> 
> I think it's probably best if this goes through the iomap tree since
> the changes outside iomap are either to the page cache or they're
> trivial.
> 
> v2:
>  - Fix misplaced semicolon
>  - Rename fgp_order to fgp_set_order
>  - Rename FGP_ORDER to FGP_GET_ORDER
>  - Add fgp_t
>  - Update the documentation for ->release_folio
>  - Fix iomap_invalidate_folio()
>  - Update iomap_release_folio()

It works well on top of 6.4-rc4 here.
fio write bandwidth: about 8000MiB/s

And I tried to backport it to 6.1.y; it works well there too.
1) 8 patches were selected as dependencies:
    d7b64041164c :Dave Chinner: iomap: write iomap validity checks

    7a70a5085ed0 :Andreas Gruenbacher: iomap: Add __iomap_put_folio helper
    80baab88bb93 :Andreas Gruenbacher: iomap/gfs2: Unlock and put folio in page_done handler
    40405dddd98a :Andreas Gruenbacher: iomap: Rename page_done handler to put_folio

    98321b5139f9 :Andreas Gruenbacher: iomap: Add iomap_get_folio helper

    9060bc4d3aca :Andreas Gruenbacher: iomap/gfs2: Get page in page_prepare handler
    07c22b56685d :Andreas Gruenbacher: iomap: Add __iomap_get_folio helper
    c82abc239464 :Andreas Gruenbacher: iomap: Rename page_prepare handler to get_folio

2) patches 4/5/6 were rebased (see the attached files)

A better backport is welcome.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/06/04



> Matthew Wilcox (Oracle) (7):
>   iomap: Remove large folio handling in iomap_invalidate_folio()
>   doc: Correct the description of ->release_folio
>   iomap: Remove unnecessary test from iomap_release_folio()
>   filemap: Add fgp_t typedef
>   filemap: Allow __filemap_get_folio to allocate large folios
>   iomap: Create large folios in the buffered write path
>   iomap: Copy larger chunks from userspace
> 
>  Documentation/filesystems/locking.rst | 14 ++++--
>  fs/btrfs/file.c                       |  6 +--
>  fs/f2fs/compress.c                    |  2 +-
>  fs/f2fs/f2fs.h                        |  2 +-
>  fs/gfs2/bmap.c                        |  2 +-
>  fs/iomap/buffered-io.c                | 43 ++++++++--------
>  include/linux/iomap.h                 |  2 +-
>  include/linux/pagemap.h               | 71 ++++++++++++++++++++++-----
>  mm/filemap.c                          | 61 ++++++++++++-----------
>  mm/folio-compat.c                     |  2 +-
>  mm/readahead.c                        | 13 -----
>  11 files changed, 130 insertions(+), 88 deletions(-)
> 
> -- 
> 2.39.2


[-- Attachment #2: 0006-iomap-Create-large-folios-in-the-buffered-write-path.patch --]
[-- Type: application/octet-stream, Size: 3082 bytes --]

From 6901e07ceeb77f9bfe9b38fc46ff20437d293007 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Fri, 2 Jun 2023 23:24:43 +0100
Subject: [PATCH] iomap: Create large folios in the buffered write path

Use the size of the write as a hint for the size of the folio to create.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/gfs2/bmap.c         | 2 +-
 fs/iomap/buffered-io.c | 6 ++++--
 include/linux/iomap.h  | 2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index c739b258a2d9..3702e5e47b0f 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -971,7 +971,7 @@ gfs2_iomap_get_folio(struct iomap_iter *iter, loff_t pos, unsigned len)
 	if (status)
 		return ERR_PTR(status);
 
-	folio = iomap_get_folio(iter, pos);
+	folio = iomap_get_folio(iter, pos, len);
 	if (IS_ERR(folio))
 		gfs2_trans_end(sdp);
 	return folio;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 30d53b50ee0f..a10f9c037515 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -461,17 +461,19 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  * iomap_get_folio - get a folio reference for writing
  * @iter: iteration structure
  * @pos: start offset of write
+ * @len: length of write
  *
  * Returns a locked reference to the folio at @pos, or an error pointer if the
  * folio could not be obtained.
  */
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgp_t fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
 	struct folio *folio;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
+	fgp |= fgp_set_order(len);
 
 	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
 			fgp, mapping_gfp_mask(iter->inode->i_mapping));
@@ -596,7 +598,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
 	if (page_ops && page_ops->get_folio)
 		return page_ops->get_folio(iter, pos, len);
 	else
-		return iomap_get_folio(iter, pos);
+		return iomap_get_folio(iter, pos, len);
 }
 
 static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e2b836c2e119..80facb9c9e5b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -261,7 +261,7 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
 void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
-- 
2.36.2


[-- Attachment #3: 0004-filemap-Add-fgp_t-typedef.patch --]
[-- Type: application/octet-stream, Size: 7780 bytes --]

From 22b61cbd1429c25dbcd0548034894be2011e1544 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Fri, 2 Jun 2023 23:24:41 +0100
Subject: [PATCH] filemap: Add fgp_t typedef

Similarly to gfp_t, define fgp_t as its own type to prevent various
misuses and confusion.  Move the documentation to the definition of the
type instead of burying it in the __filemap_get_folio() documentation.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/btrfs/file.c         |  6 +++---
 fs/f2fs/compress.c      |  2 +-
 fs/f2fs/f2fs.h          |  2 +-
 fs/iomap/buffered-io.c  |  2 +-
 include/linux/pagemap.h | 32 +++++++++++++++++++++++++++++---
 mm/filemap.c            | 21 ++-------------------
 mm/folio-compat.c       |  2 +-
 7 files changed, 46 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f649647392e0..4bbbbdafb472 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -876,9 +876,9 @@ static int prepare_uptodate_page(struct inode *inode,
 	return 0;
 }
 
-static unsigned int get_prepare_fgp_flags(bool nowait)
+static fgp_t get_prepare_fgp_flags(bool nowait)
 {
-	unsigned int fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
+	fgp_t fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
 
 	if (nowait)
 		fgp_flags |= FGP_NOWAIT;
@@ -910,7 +910,7 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
 	int i;
 	unsigned long index = pos >> PAGE_SHIFT;
 	gfp_t mask = get_prepare_gfp_flags(inode, nowait);
-	unsigned int fgp_flags = get_prepare_fgp_flags(nowait);
+	fgp_t fgp_flags = get_prepare_fgp_flags(nowait);
 	int err = 0;
 	int faili;
 
diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c
index 11653fa79289..ee358f5f2a86 100644
--- a/fs/f2fs/compress.c
+++ b/fs/f2fs/compress.c
@@ -1019,7 +1019,7 @@ static int prepare_compress_overwrite(struct compress_ctx *cc,
 	struct address_space *mapping = cc->inode->i_mapping;
 	struct page *page;
 	sector_t last_block_in_bio;
-	unsigned fgp_flag = FGP_LOCK | FGP_WRITE | FGP_CREAT;
+	fgp_t fgp_flag = FGP_LOCK | FGP_WRITE | FGP_CREAT;
 	pgoff_t start_idx = start_idx_of_cluster(cc);
 	int i, ret;
 
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index d211ee89c158..a03c3df75f7c 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -2715,7 +2715,7 @@ static inline struct page *f2fs_grab_cache_page(struct address_space *mapping,
 
 static inline struct page *f2fs_pagecache_get_page(
 				struct address_space *mapping, pgoff_t index,
-				int fgp_flags, gfp_t gfp_mask)
+				fgp_t fgp_flags, gfp_t gfp_mask)
 {
 	if (time_to_inject(F2FS_M_SB(mapping), FAULT_PAGE_GET)) {
 		f2fs_show_injection_info(F2FS_M_SB(mapping), FAULT_PAGE_GET);
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 2054b85c9d9b..30d53b50ee0f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -467,7 +467,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  */
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
 {
-	unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
+	fgp_t fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
 	struct folio *folio;
 
 	if (iter->flags & IOMAP_NOWAIT)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a56308a9d1a4..7ab57a2bb576 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -497,6 +497,32 @@ pgoff_t page_cache_next_miss(struct address_space *mapping,
 pgoff_t page_cache_prev_miss(struct address_space *mapping,
 			     pgoff_t index, unsigned long max_scan);
 
+/**
+ * typedef fgp_t - Flags for getting folios from the page cache.
+ *
+ * Most users of the page cache will not need to use these flags;
+ * there are convenience functions such as filemap_get_folio() and
+ * filemap_lock_folio().  For users which need more control over exactly
+ * what is done with the folios, these flags to __filemap_get_folio()
+ * are available.
+ *
+ * * %FGP_ACCESSED - The folio will be marked accessed.
+ * * %FGP_LOCK - The folio is returned locked.
+ * * %FGP_CREAT - If no folio is present then a new folio is allocated,
+ *   added to the page cache and the VM's LRU list.  The folio is
+ *   returned locked.
+ * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
+ *   folio is already in cache.  If the folio was allocated, unlock it
+ *   before returning so the caller can do the same dance.
+ * * %FGP_WRITE - The folio will be written to by the caller.
+ * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
+ * * %FGP_NOWAIT - Don't block on the folio lock.
+ * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
+ * * %FGP_WRITEBEGIN - The flags to use in a filesystem write_begin()
+ *   implementation.
+ */
+typedef unsigned int fgp_t;
+
 #define FGP_ACCESSED		0x00000001
 #define FGP_LOCK		0x00000002
 #define FGP_CREAT		0x00000004
@@ -509,9 +535,9 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
 #define FGP_STABLE		0x00000200
 
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		fgp_t fgp_flags, gfp_t gfp);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		fgp_t fgp_flags, gfp_t gfp);
 
 /**
  * filemap_get_folio - Find and get a folio.
@@ -586,7 +612,7 @@ static inline struct page *find_get_page(struct address_space *mapping,
 }
 
 static inline struct page *find_get_page_flags(struct address_space *mapping,
-					pgoff_t offset, int fgp_flags)
+					pgoff_t offset, fgp_t fgp_flags)
 {
 	return pagecache_get_page(mapping, offset, fgp_flags, 0);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index b4c9bd368b7e..eb89a815f2f8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1882,32 +1882,15 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
  *
  * Looks up the page cache entry at @mapping & @index.
  *
- * @fgp_flags can be zero or more of these flags:
- *
- * * %FGP_ACCESSED - The folio will be marked accessed.
- * * %FGP_LOCK - The folio is returned locked.
- * * %FGP_ENTRY - If there is a shadow / swap / DAX entry, return it
- *   instead of allocating a new folio to replace it.
- * * %FGP_CREAT - If no page is present then a new page is allocated using
- *   @gfp and added to the page cache and the VM's LRU list.
- *   The page is returned locked and with an increased refcount.
- * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
- *   page is already in cache.  If the page was allocated, unlock it before
- *   returning so the caller can do the same dance.
- * * %FGP_WRITE - The page will be written to by the caller.
- * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
- * * %FGP_NOWAIT - Don't get blocked by page lock.
- * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
- *
  * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
  * if the %GFP flags specified for %FGP_CREAT are atomic.
  *
- * If there is a page cache page, it is returned with an increased refcount.
+ * If this function returns a folio, it is returned with an increased refcount.
  *
  * Return: The found folio or %NULL otherwise.
  */
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		fgp_t fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index c6f056c20503..ae1a737d2533 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -92,7 +92,7 @@ EXPORT_SYMBOL(add_to_page_cache_lru);
 
 noinline
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		fgp_t fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
-- 
2.36.2


[-- Attachment #4: 0005-filemap-Allow-__filemap_get_folio-to-allocate-large-.patch --]
[-- Type: application/octet-stream, Size: 5125 bytes --]

From a11bda0f117bc62a5fadba867a77637caf8100a3 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Fri, 2 Jun 2023 23:24:42 +0100
Subject: [PATCH] filemap: Allow __filemap_get_folio to allocate large folios

Allow callers of __filemap_get_folio() to specify a preferred folio
order in the FGP flags.  This is only honoured in the FGP_CREAT path;
if there is already a folio in the page cache that covers the index,
we will return it, no matter what its order is.  No create-around is
attempted; we will only create folios which start at the specified index.
Unmodified callers will continue to allocate order 0 folios.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 23 ++++++++++++++++++++++
 mm/filemap.c            | 42 ++++++++++++++++++++++++++++-------------
 mm/readahead.c          | 13 -------------
 3 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7ab57a2bb576..667ce668f438 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -466,6 +466,19 @@ static inline void *detach_page_private(struct page *page)
 	return folio_detach_private(page_folio(page));
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
 #else
@@ -533,6 +546,16 @@ typedef unsigned int fgp_t;
 #define FGP_HEAD		0x00000080
 #define FGP_ENTRY		0x00000100
 #define FGP_STABLE		0x00000200
+#define FGP_GET_ORDER(fgp)	(((__force unsigned)fgp) >> 26)	/* top 6 bits */
+
+static inline fgp_t fgp_set_order(size_t size)
+{
+	unsigned int shift = ilog2(size);
+
+	if (shift <= PAGE_SHIFT)
+		return 0;
+	return (__force fgp_t)((shift - PAGE_SHIFT) << 26);
+}
 
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		fgp_t fgp_flags, gfp_t gfp);
diff --git a/mm/filemap.c b/mm/filemap.c
index eb89a815f2f8..10ea9321c36e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1937,7 +1937,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
+		unsigned order = FGP_GET_ORDER(fgp_flags);
 		int err;
+
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
 		if (fgp_flags & FGP_NOFS)
@@ -1946,26 +1948,40 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp &= ~GFP_KERNEL;
 			gfp |= GFP_NOWAIT | __GFP_NOWARN;
 		}
-
-		folio = filemap_alloc_folio(gfp, 0);
-		if (!folio)
-			return NULL;
-
 		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
 			fgp_flags |= FGP_LOCK;
 
-		/* Init accessed so avoid atomic mark_page_accessed later */
-		if (fgp_flags & FGP_ACCESSED)
-			__folio_set_referenced(folio);
+		if (!mapping_large_folio_support(mapping))
+			order = 0;
+		if (order > MAX_PAGECACHE_ORDER)
+			order = MAX_PAGECACHE_ORDER;
+		/* If we're not aligned, allocate a smaller folio */
+		if (index & ((1UL << order) - 1))
+			order = __ffs(index);
 
-		err = filemap_add_folio(mapping, folio, index, gfp);
-		if (unlikely(err)) {
+		do {
+			err = -ENOMEM;
+			if (order == 1)
+				order = 0;
+			folio = filemap_alloc_folio(gfp, order);
+			if (!folio)
+				continue;
+
+			/* Init accessed so avoid atomic mark_page_accessed later */
+			if (fgp_flags & FGP_ACCESSED)
+				__folio_set_referenced(folio);
+
+			err = filemap_add_folio(mapping, folio, index, gfp);
+			if (!err)
+				break;
 			folio_put(folio);
 			folio = NULL;
-			if (err == -EEXIST)
-				goto repeat;
-		}
+		} while (order-- > 0);
 
+		if (err == -EEXIST)
+			goto repeat;
+		if (err)
+			return ERR_PTR(err);
 		/*
 		 * filemap_add_folio locks the page, and for mmap
 		 * we expect an unlocked page.
diff --git a/mm/readahead.c b/mm/readahead.c
index 47afbca1d122..59a071badb90 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -462,19 +462,6 @@ static int try_context_readahead(struct address_space *mapping,
 	return 1;
 }
 
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size.  I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER	8
-#endif
-
 static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 		pgoff_t mark, unsigned int order, gfp_t gfp)
 {
-- 
2.36.2



* Re: [PATCH v2 2/7] doc: Correct the description of ->release_folio
  2023-06-02 22:24 ` [PATCH v2 2/7] doc: Correct the description of ->release_folio Matthew Wilcox (Oracle)
@ 2023-06-04 17:55   ` Darrick J. Wong
  2023-06-04 20:10     ` Matthew Wilcox
  2023-06-05  7:12   ` Christoph Hellwig
  1 sibling, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2023-06-04 17:55 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Fri, Jun 02, 2023 at 11:24:39PM +0100, Matthew Wilcox (Oracle) wrote:
> The filesystem ->release_folio method is called under more circumstances
> now than when the documentation was written.  The second sentence
> describing the interpretation of the return value is the wrong polarity
> (false indicates failure, not success).  And the third sentence is also
> wrong (the kernel calls try_to_free_buffers() instead).
> 
> So replace the entire paragraph with a detailed description of what the
> state of the folio may be, the meaning of the gfp parameter, why the
> method is being called and what the filesystem is expected to do.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  Documentation/filesystems/locking.rst | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index aa1a233b0fa8..91dc9d5bc602 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -374,10 +374,16 @@ invalidate_lock before invalidating page cache in truncate / hole punch
>  path (and thus calling into ->invalidate_folio) to block races between page
>  cache invalidation and page cache filling functions (fault, read, ...).
>  
> -->release_folio() is called when the kernel is about to try to drop the
> -buffers from the folio in preparation for freeing it.  It returns false to
> -indicate that the buffers are (or may be) freeable.  If ->release_folio is
> -NULL, the kernel assumes that the fs has no private interest in the buffers.
> +->release_folio() is called when the MM wants to make a change to the
> +folio that would invalidate the filesystem's private data.  For example,
> +it may be about to be removed from the address_space or split.  The folio
> +is locked and not under writeback.  It may be dirty.  The gfp parameter is
> +not usually used for allocation, but rather to indicate what the filesystem
> +may do to attempt to free the private data.  The filesystem may
> +return false to indicate that the folio's private data cannot be freed.
> +If it returns true, it should have already removed the private data from
> +the folio.  If a filesystem does not provide a ->release_folio method,
> +the kernel will call try_to_free_buffers().

the MM?  Since you changed that above... :)

With that nit fixed,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  
>  ->free_folio() is called when the kernel has dropped the folio
>  from the page cache.
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread
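
To make the contract described above concrete, here is a minimal sketch
of a ->release_folio implementation for a hypothetical filesystem that
attaches a private structure to each folio.  The myfs_* names are
illustrative only, not from this series; the shape just follows the
documented rules (folio locked, not under writeback, possibly dirty;
return true only once the private data is actually gone):

/* Illustrative sketch, not part of this series. */
static bool myfs_release_folio(struct folio *folio, gfp_t gfp)
{
	struct myfs_private *priv = folio_get_private(folio);

	if (!priv)
		return true;	/* nothing attached, trivially freeable */
	if (folio_test_dirty(folio))
		return false;	/* private data still describes dirty state */

	folio_detach_private(folio);
	myfs_free_private(priv);	/* hypothetical helper */
	return true;
}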

* Re: [PATCH v2 1/7] iomap: Remove large folio handling in iomap_invalidate_folio()
  2023-06-02 22:24 ` [PATCH v2 1/7] iomap: Remove large folio handling in iomap_invalidate_folio() Matthew Wilcox (Oracle)
@ 2023-06-04 17:58   ` Darrick J. Wong
  2023-06-05  7:11   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2023-06-04 17:58 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Fri, Jun 02, 2023 at 11:24:38PM +0100, Matthew Wilcox (Oracle) wrote:
> We do not need to release the iomap_page in iomap_invalidate_folio()
> to allow the folio to be split.  The splitting code will call
> ->release_folio() if there is still per-fs private data attached to
> the folio.  At that point, we will check if the folio is still dirty
> and decline to release the iomap_page.  It is possible to trigger the
> warning in perfectly legitimate circumstances (eg if a disk read fails,
> we do a partial write to the folio, then we truncate the folio), which
> will cause those writes to be lost.
> 
> Fixes: 60d8231089f0 ("iomap: Support large folios in invalidatepage")
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Sounds reasonable to me...
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/iomap/buffered-io.c | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 063133ec77f4..08ee293c4117 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -508,11 +508,6 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
>  		WARN_ON_ONCE(folio_test_writeback(folio));
>  		folio_cancel_dirty(folio);
>  		iomap_page_release(folio);
> -	} else if (folio_test_large(folio)) {
> -		/* Must release the iop so the page can be split */
> -		WARN_ON_ONCE(!folio_test_uptodate(folio) &&
> -			     folio_test_dirty(folio));
> -		iomap_page_release(folio);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 3/7] iomap: Remove unnecessary test from iomap_release_folio()
  2023-06-02 22:24 ` [PATCH v2 3/7] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
@ 2023-06-04 18:01   ` Darrick J. Wong
  2023-06-04 21:39     ` Matthew Wilcox
  2023-06-05  7:13   ` Christoph Hellwig
  1 sibling, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2023-06-04 18:01 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Fri, Jun 02, 2023 at 11:24:40PM +0100, Matthew Wilcox (Oracle) wrote:
> The check for the folio being under writeback is unnecessary; the caller
> has checked this and the folio is locked, so the folio cannot be under
> writeback at this point.

Do we need a debug assertion here to validate that filemap_release_folio
has already filtered out folios unergoing writeback?  The documentation
change in the next patch might be fine since you're the pagecache
maintainer.

> The comment is somewhat misleading in that it talks about one specific
> situation in which we can see a dirty folio.  There are others, so change
> the comment to explain why we can't release the iomap_page.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/iomap/buffered-io.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 08ee293c4117..2054b85c9d9b 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -483,12 +483,10 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
>  			folio_size(folio));
>  
>  	/*
> -	 * mm accommodates an old ext3 case where clean folios might
> -	 * not have had the dirty bit cleared.  Thus, it can send actual
> -	 * dirty folios to ->release_folio() via shrink_active_list();
> -	 * skip those here.
> +	 * If the folio is dirty, we refuse to release our metadata because
> +	 * it may be partially dirty (FIXME, add a test for that).

Er... is this FIXME reflective of incomplete code?

--D

>  	 */
> -	if (folio_test_dirty(folio) || folio_test_writeback(folio))
> +	if (folio_test_dirty(folio))
>  		return false;
>  	iomap_page_release(folio);
>  	return true;
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread
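
For context on why the writeback test is redundant: the call arrives
via filemap_release_folio(), which bails out before ever invoking the
aops method.  Paraphrasing mm/filemap.c (not a verbatim quote):

	if (folio_test_writeback(folio))
		return false;		/* ->release_folio() never sees it */
	if (mapping && mapping->a_ops->release_folio)
		return mapping->a_ops->release_folio(folio, gfp);
	return try_to_free_buffers(folio);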

* Re: [PATCH v2 4/7] filemap: Add fgp_t typedef
  2023-06-02 22:24 ` [PATCH v2 4/7] filemap: Add fgp_t typedef Matthew Wilcox (Oracle)
@ 2023-06-04 18:02   ` Darrick J. Wong
  2023-06-05  7:14   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2023-06-04 18:02 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Fri, Jun 02, 2023 at 11:24:41PM +0100, Matthew Wilcox (Oracle) wrote:
> Similarly to gfp_t, define fgp_t as its own type to prevent various
> misuses and confusion.  Move the documentation to the definition of the
> type instead of burying it in the __filemap_get_folio() documentation.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

This looks like a pretty straightforward change
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/btrfs/file.c         |  6 +++---
>  fs/f2fs/compress.c      |  2 +-
>  fs/f2fs/f2fs.h          |  2 +-
>  fs/iomap/buffered-io.c  |  2 +-
>  include/linux/pagemap.h | 48 +++++++++++++++++++++++++++++++----------
>  mm/filemap.c            | 19 ++--------------
>  mm/folio-compat.c       |  2 +-
>  7 files changed, 46 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index f649647392e0..4bbbbdafb472 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -876,9 +876,9 @@ static int prepare_uptodate_page(struct inode *inode,
>  	return 0;
>  }
>  
> -static unsigned int get_prepare_fgp_flags(bool nowait)
> +static fgp_t get_prepare_fgp_flags(bool nowait)
>  {
> -	unsigned int fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
> +	fgp_t fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
>  
>  	if (nowait)
>  		fgp_flags |= FGP_NOWAIT;
> @@ -910,7 +910,7 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
>  	int i;
>  	unsigned long index = pos >> PAGE_SHIFT;
>  	gfp_t mask = get_prepare_gfp_flags(inode, nowait);
> -	unsigned int fgp_flags = get_prepare_fgp_flags(nowait);
> +	fgp_t fgp_flags = get_prepare_fgp_flags(nowait);
>  	int err = 0;
>  	int faili;
>  
> diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c
> index 11653fa79289..ee358f5f2a86 100644
> --- a/fs/f2fs/compress.c
> +++ b/fs/f2fs/compress.c
> @@ -1019,7 +1019,7 @@ static int prepare_compress_overwrite(struct compress_ctx *cc,
>  	struct address_space *mapping = cc->inode->i_mapping;
>  	struct page *page;
>  	sector_t last_block_in_bio;
> -	unsigned fgp_flag = FGP_LOCK | FGP_WRITE | FGP_CREAT;
> +	fgp_t fgp_flag = FGP_LOCK | FGP_WRITE | FGP_CREAT;
>  	pgoff_t start_idx = start_idx_of_cluster(cc);
>  	int i, ret;
>  
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index d211ee89c158..a03c3df75f7c 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -2715,7 +2715,7 @@ static inline struct page *f2fs_grab_cache_page(struct address_space *mapping,
>  
>  static inline struct page *f2fs_pagecache_get_page(
>  				struct address_space *mapping, pgoff_t index,
> -				int fgp_flags, gfp_t gfp_mask)
> +				fgp_t fgp_flags, gfp_t gfp_mask)
>  {
>  	if (time_to_inject(F2FS_M_SB(mapping), FAULT_PAGE_GET))
>  		return NULL;
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 2054b85c9d9b..30d53b50ee0f 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -467,7 +467,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>   */
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
>  {
> -	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
> +	fgp_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
>  
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index a56308a9d1a4..7ab57a2bb576 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -497,22 +497,48 @@ pgoff_t page_cache_next_miss(struct address_space *mapping,
>  pgoff_t page_cache_prev_miss(struct address_space *mapping,
>  			     pgoff_t index, unsigned long max_scan);
>  
> -#define FGP_ACCESSED		0x00000001
> -#define FGP_LOCK		0x00000002
> -#define FGP_CREAT		0x00000004
> -#define FGP_WRITE		0x00000008
> -#define FGP_NOFS		0x00000010
> -#define FGP_NOWAIT		0x00000020
> -#define FGP_FOR_MMAP		0x00000040
> -#define FGP_STABLE		0x00000080
> +/**
> + * typedef fgp_t - Flags for getting folios from the page cache.
> + *
> + * Most users of the page cache will not need to use these flags;
> + * there are convenience functions such as filemap_get_folio() and
> + * filemap_lock_folio().  For users which need more control over exactly
> + * what is done with the folios, these flags to __filemap_get_folio()
> + * are available.
> + *
> + * * %FGP_ACCESSED - The folio will be marked accessed.
> + * * %FGP_LOCK - The folio is returned locked.
> + * * %FGP_CREAT - If no folio is present then a new folio is allocated,
> + *   added to the page cache and the VM's LRU list.  The folio is
> + *   returned locked.
> + * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
> + *   folio is already in cache.  If the folio was allocated, unlock it
> + *   before returning so the caller can do the same dance.
> + * * %FGP_WRITE - The folio will be written to by the caller.
> + * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
> + * * %FGP_NOWAIT - Don't block on the folio lock.
> + * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
> + * * %FGP_WRITEBEGIN - The flags to use in a filesystem write_begin()
> + *   implementation.
> + */
> +typedef unsigned int __bitwise fgp_t;
> +
> +#define FGP_ACCESSED		((__force fgp_t)0x00000001)
> +#define FGP_LOCK		((__force fgp_t)0x00000002)
> +#define FGP_CREAT		((__force fgp_t)0x00000004)
> +#define FGP_WRITE		((__force fgp_t)0x00000008)
> +#define FGP_NOFS		((__force fgp_t)0x00000010)
> +#define FGP_NOWAIT		((__force fgp_t)0x00000020)
> +#define FGP_FOR_MMAP		((__force fgp_t)0x00000040)
> +#define FGP_STABLE		((__force fgp_t)0x00000080)
>  
>  #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
>  
>  void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
>  struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> -		int fgp_flags, gfp_t gfp);
> +		fgp_t fgp_flags, gfp_t gfp);
>  struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
> -		int fgp_flags, gfp_t gfp);
> +		fgp_t fgp_flags, gfp_t gfp);
>  
>  /**
>   * filemap_get_folio - Find and get a folio.
> @@ -586,7 +612,7 @@ static inline struct page *find_get_page(struct address_space *mapping,
>  }
>  
>  static inline struct page *find_get_page_flags(struct address_space *mapping,
> -					pgoff_t offset, int fgp_flags)
> +					pgoff_t offset, fgp_t fgp_flags)
>  {
>  	return pagecache_get_page(mapping, offset, fgp_flags, 0);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b4c9bd368b7e..eb89a815f2f8 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1887,30 +1887,15 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
>   *
>   * Looks up the page cache entry at @mapping & @index.
>   *
> - * @fgp_flags can be zero or more of these flags:
> - *
> - * * %FGP_ACCESSED - The folio will be marked accessed.
> - * * %FGP_LOCK - The folio is returned locked.
> - * * %FGP_CREAT - If no page is present then a new page is allocated using
> - *   @gfp and added to the page cache and the VM's LRU list.
> - *   The page is returned locked and with an increased refcount.
> - * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
> - *   page is already in cache.  If the page was allocated, unlock it before
> - *   returning so the caller can do the same dance.
> - * * %FGP_WRITE - The page will be written to by the caller.
> - * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
> - * * %FGP_NOWAIT - Don't get blocked by page lock.
> - * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
> - *
>   * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
>   * if the %GFP flags specified for %FGP_CREAT are atomic.
>   *
> - * If there is a page cache page, it is returned with an increased refcount.
> + * If this function returns a folio, it is returned with an increased refcount.
>   *
>   * Return: The found folio or an ERR_PTR() otherwise.
>   */
>  struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> -		int fgp_flags, gfp_t gfp)
> +		fgp_t fgp_flags, gfp_t gfp)
>  {
>  	struct folio *folio;
>  
> diff --git a/mm/folio-compat.c b/mm/folio-compat.c
> index c6f056c20503..ae1a737d2533 100644
> --- a/mm/folio-compat.c
> +++ b/mm/folio-compat.c
> @@ -92,7 +92,7 @@ EXPORT_SYMBOL(add_to_page_cache_lru);
>  
>  noinline
>  struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
> -		int fgp_flags, gfp_t gfp)
> +		fgp_t fgp_flags, gfp_t gfp)
>  {
>  	struct folio *folio;
>  
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread
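
The payoff of the __bitwise annotation is that sparse (make C=1) now
warns at any call site that mixes an fgp_t with a plain integer or with
another flag type.  A toy illustration of the pattern, assuming the
fgp_t and FGP_* definitions from the patch above (take_fgp() is a
made-up stand-in for __filemap_get_folio()):

static void take_fgp(fgp_t flags)
{
	(void)flags;
}

static void example(void)
{
	take_fgp(FGP_LOCK | FGP_CREAT);	/* fine */
	take_fgp(2);			/* sparse: incorrect type in argument */
	take_fgp(GFP_NOFS);		/* gfp_t/fgp_t confusion, previously
					 * silent, now diagnosed */
}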

* Re: [PATCH v2 5/7] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-02 22:24 ` [PATCH v2 5/7] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
@ 2023-06-04 18:09   ` Darrick J. Wong
  2023-06-04 21:48     ` Matthew Wilcox
  2023-06-05  7:16   ` Christoph Hellwig
  1 sibling, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2023-06-04 18:09 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Fri, Jun 02, 2023 at 11:24:42PM +0100, Matthew Wilcox (Oracle) wrote:
> Allow callers of __filemap_get_folio() to specify a preferred folio
> order in the FGP flags.  This is only honoured in the FGP_CREATE path;
> if there is already a folio in the page cache that covers the index,
> we will return it, no matter what its order is.  No create-around is
> attempted; we will only create folios which start at the specified index.
> Unmodified callers will continue to allocate order 0 folios.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/pagemap.h | 23 ++++++++++++++++++++++
>  mm/filemap.c            | 42 ++++++++++++++++++++++++++++-------------
>  mm/readahead.c          | 13 -------------
>  3 files changed, 52 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 7ab57a2bb576..667ce668f438 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -466,6 +466,19 @@ static inline void *detach_page_private(struct page *page)
>  	return folio_detach_private(page_folio(page));
>  }
>  
> +/*
> + * There are some parts of the kernel which assume that PMD entries
> + * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
> + * limit the maximum allocation order to PMD size.  I'm not aware of any
> + * assumptions about maximum order if THP are disabled, but 8 seems like
> + * a good order (that's 1MB if you're using 4kB pages)
> + */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> +#else
> +#define MAX_PAGECACHE_ORDER	8
> +#endif
> +
>  #ifdef CONFIG_NUMA
>  struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
>  #else
> @@ -531,9 +544,19 @@ typedef unsigned int __bitwise fgp_t;
>  #define FGP_NOWAIT		((__force fgp_t)0x00000020)
>  #define FGP_FOR_MMAP		((__force fgp_t)0x00000040)
>  #define FGP_STABLE		((__force fgp_t)0x00000080)
> +#define FGP_GET_ORDER(fgp)	(((__force unsigned)fgp) >> 26)	/* top 6 bits */
>  
>  #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
>  
> +static inline fgp_t fgp_set_order(size_t size)
> +{
> +	unsigned int shift = ilog2(size);
> +
> +	if (shift <= PAGE_SHIFT)
> +		return 0;
> +	return (__force fgp_t)((shift - PAGE_SHIFT) << 26);
> +}
> +
>  void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
>  struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		fgp_t fgp_flags, gfp_t gfp);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index eb89a815f2f8..10ea9321c36e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1937,7 +1937,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		folio_wait_stable(folio);
>  no_page:
>  	if (!folio && (fgp_flags & FGP_CREAT)) {
> +		unsigned order = FGP_GET_ORDER(fgp_flags);
>  		int err;
> +
>  		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
>  			gfp |= __GFP_WRITE;
>  		if (fgp_flags & FGP_NOFS)
> @@ -1946,26 +1948,40 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  			gfp &= ~GFP_KERNEL;
>  			gfp |= GFP_NOWAIT | __GFP_NOWARN;
>  		}
> -
> -		folio = filemap_alloc_folio(gfp, 0);
> -		if (!folio)
> -			return ERR_PTR(-ENOMEM);
> -
>  		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
>  			fgp_flags |= FGP_LOCK;
>  
> -		/* Init accessed so avoid atomic mark_page_accessed later */
> -		if (fgp_flags & FGP_ACCESSED)
> -			__folio_set_referenced(folio);
> +		if (!mapping_large_folio_support(mapping))
> +			order = 0;
> +		if (order > MAX_PAGECACHE_ORDER)
> +			order = MAX_PAGECACHE_ORDER;
> +		/* If we're not aligned, allocate a smaller folio */
> +		if (index & ((1UL << order) - 1))
> +			order = __ffs(index);
>  
> -		err = filemap_add_folio(mapping, folio, index, gfp);
> -		if (unlikely(err)) {
> +		do {
> +			err = -ENOMEM;
> +			if (order == 1)
> +				order = 0;

Doesn't this interrupt the scale-down progression 2M -> 1M -> 512K ->
256K -> 128K -> 64K -> 32K -> 16K -> 4k?  What if I want 8k folios?

--D

> +			folio = filemap_alloc_folio(gfp, order);
> +			if (!folio)
> +				continue;
> +
> +			/* Init accessed so avoid atomic mark_page_accessed later */
> +			if (fgp_flags & FGP_ACCESSED)
> +				__folio_set_referenced(folio);
> +
> +			err = filemap_add_folio(mapping, folio, index, gfp);
> +			if (!err)
> +				break;
>  			folio_put(folio);
>  			folio = NULL;
> -			if (err == -EEXIST)
> -				goto repeat;
> -		}
> +		} while (order-- > 0);
>  
> +		if (err == -EEXIST)
> +			goto repeat;
> +		if (err)
> +			return ERR_PTR(err);
>  		/*
>  		 * filemap_add_folio locks the page, and for mmap
>  		 * we expect an unlocked page.
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 47afbca1d122..59a071badb90 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -462,19 +462,6 @@ static int try_context_readahead(struct address_space *mapping,
>  	return 1;
>  }
>  
> -/*
> - * There are some parts of the kernel which assume that PMD entries
> - * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
> - * limit the maximum allocation order to PMD size.  I'm not aware of any
> - * assumptions about maximum order if THP are disabled, but 8 seems like
> - * a good order (that's 1MB if you're using 4kB pages)
> - */
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> -#else
> -#define MAX_PAGECACHE_ORDER	8
> -#endif
> -
>  static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
>  		pgoff_t mark, unsigned int order, gfp_t gfp)
>  {
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread
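
A quick worked example of the round trip through the new helpers,
assuming 4KiB pages (PAGE_SHIFT == 12):

	/* 64KiB write: ilog2(65536) = 16, so fgp_set_order() stores
	 * 16 - 12 = 4 in bits 26..31, and FGP_GET_ORDER() recovers it. */
	fgp_t fgp = FGP_WRITEBEGIN | FGP_NOFS | fgp_set_order(SZ_64K);
	unsigned int order = FGP_GET_ORDER(fgp);	/* 4 */

	/* ilog2() rounds down: a 96KiB write also asks for an order-4
	 * (64KiB) folio, and any size <= PAGE_SIZE encodes order 0. */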

* Re: [PATCH v2 6/7] iomap: Create large folios in the buffered write path
  2023-06-02 22:24 ` [PATCH v2 6/7] iomap: Create large folios in the buffered write path Matthew Wilcox (Oracle)
@ 2023-06-04 18:10   ` Darrick J. Wong
  2023-06-05  7:16   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2023-06-04 18:10 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Fri, Jun 02, 2023 at 11:24:43PM +0100, Matthew Wilcox (Oracle) wrote:
> Use the size of the write as a hint for the size of the folio to create.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Looks good to me,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/gfs2/bmap.c         | 2 +-
>  fs/iomap/buffered-io.c | 6 ++++--
>  include/linux/iomap.h  | 2 +-
>  3 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index c739b258a2d9..3702e5e47b0f 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -971,7 +971,7 @@ gfs2_iomap_get_folio(struct iomap_iter *iter, loff_t pos, unsigned len)
>  	if (status)
>  		return ERR_PTR(status);
>  
> -	folio = iomap_get_folio(iter, pos);
> +	folio = iomap_get_folio(iter, pos, len);
>  	if (IS_ERR(folio))
>  		gfs2_trans_end(sdp);
>  	return folio;
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 30d53b50ee0f..a10f9c037515 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -461,16 +461,18 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>   * iomap_get_folio - get a folio reference for writing
>   * @iter: iteration structure
>   * @pos: start offset of write
> + * @len: length of write
>   *
>   * Returns a locked reference to the folio at @pos, or an error pointer if the
>   * folio could not be obtained.
>   */
> -struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  {
>  	fgp_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
>  
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
> +	fgp |= fgp_set_order(len);
>  
>  	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
>  			fgp, mapping_gfp_mask(iter->inode->i_mapping));
> @@ -596,7 +598,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
>  	if (folio_ops && folio_ops->get_folio)
>  		return folio_ops->get_folio(iter, pos, len);
>  	else
> -		return iomap_get_folio(iter, pos);
> +		return iomap_get_folio(iter, pos, len);
>  }
>  
>  static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index e2b836c2e119..80facb9c9e5b 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -261,7 +261,7 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
>  int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
>  void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
>  bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
> -struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
> +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
>  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iomap: Copy larger chunks from userspace
  2023-06-02 22:24 ` [PATCH v2 7/7] iomap: Copy larger chunks from userspace Matthew Wilcox (Oracle)
@ 2023-06-04 18:29   ` Darrick J. Wong
  2023-06-04 22:11     ` Matthew Wilcox
  0 siblings, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2023-06-04 18:29 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Fri, Jun 02, 2023 at 11:24:44PM +0100, Matthew Wilcox (Oracle) wrote:
> If we have a large folio, we can copy in larger chunks than PAGE_SIZE.
> Start at the maximum page cache size and shrink by half every time we
> hit the "we are short on memory" problem.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/iomap/buffered-io.c | 22 +++++++++++++---------
>  1 file changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index a10f9c037515..10434b07e0f9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -768,6 +768,7 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
>  static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  {
>  	loff_t length = iomap_length(iter);
> +	size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
>  	loff_t pos = iter->pos;
>  	ssize_t written = 0;
>  	long status = 0;
> @@ -776,15 +777,13 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  
>  	do {
>  		struct folio *folio;
> -		struct page *page;
> -		unsigned long offset;	/* Offset into pagecache page */
> -		unsigned long bytes;	/* Bytes to write to page */
> +		size_t offset;		/* Offset into folio */
> +		size_t bytes;		/* Bytes to write to folio */
>  		size_t copied;		/* Bytes copied from user */
>  
> -		offset = offset_in_page(pos);
> -		bytes = min_t(unsigned long, PAGE_SIZE - offset,
> -						iov_iter_count(i));
>  again:
> +		offset = pos & (chunk - 1);
> +		bytes = min(chunk - offset, iov_iter_count(i));
>  		status = balance_dirty_pages_ratelimited_flags(mapping,
>  							       bdp_flags);
>  		if (unlikely(status))
> @@ -814,11 +813,14 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  		if (iter->iomap.flags & IOMAP_F_STALE)
>  			break;
>  
> -		page = folio_file_page(folio, pos >> PAGE_SHIFT);
> +		offset = offset_in_folio(folio, pos);
> +		if (bytes > folio_size(folio) - offset)
> +			bytes = folio_size(folio) - offset;
> +
>  		if (mapping_writably_mapped(mapping))
> -			flush_dcache_page(page);
> +			flush_dcache_folio(folio);
>  
> -		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
> +		copied = copy_page_from_iter_atomic(&folio->page, offset, bytes, i);

I think I've gotten lost in the weeds.  Does copy_page_from_iter_atomic
actually know how to deal with a multipage folio?  AFAICT it takes a
page, kmaps it, and copies @bytes starting at @offset in the page.  If
a caller feeds it a multipage folio, does that all work correctly?  Or
will the pagecache split multipage folios as needed to make it work
right?

If we create a 64k folio at pos 0 and then want to write a byte at pos
40k, does __filemap_get_folio break up the 64k folio so that the folio
returned by iomap_get_folio starts at 40k?  Or can the iter code handle
jumping ten pages into a 16-page folio and I just can't see it?

(Allergies suddenly went from 0 to 9, engage braindead mode...)

--D

>  
>  		status = iomap_write_end(iter, pos, bytes, copied, folio);
>  
> @@ -835,6 +837,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  			 */
>  			if (copied)
>  				bytes = copied;
> +			if (chunk > PAGE_SIZE)
> +				chunk /= 2;
>  			goto again;
>  		}
>  		pos += status;
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread
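
For intuition about the chunk arithmetic under discussion: chunk is a
power of two, so pos & (chunk - 1) is the offset of pos within the
current chunk-aligned window, bytes never crosses a chunk boundary, and
halving chunk on a failed copy retries the same pos against a smaller
target folio.  A standalone userspace sketch of that progression
(illustrative numbers only, not kernel code):

#include <stdio.h>

int main(void)
{
	size_t page = 4096, chunk = page << 9;	/* 2MiB, PMD order on x86 */
	long long pos = 40 * 1024 + 100;	/* byte 100 of the 40KiB page */
	size_t want = 1 << 20;			/* 1MiB left in the iov_iter */

	while (chunk >= page) {
		size_t offset = pos & (chunk - 1);
		size_t bytes = chunk - offset < want ? chunk - offset : want;

		printf("chunk %7zu: offset %7zu, copy up to %7zu bytes\n",
		       chunk, offset, bytes);
		chunk /= 2;	/* the retry path after a short copy */
	}
	return 0;
}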

* Re: [PATCH v2 2/7] doc: Correct the description of ->release_folio
  2023-06-04 17:55   ` Darrick J. Wong
@ 2023-06-04 20:10     ` Matthew Wilcox
  2023-06-04 20:33       ` Darrick J. Wong
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2023-06-04 20:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Sun, Jun 04, 2023 at 10:55:48AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 02, 2023 at 11:24:39PM +0100, Matthew Wilcox (Oracle) wrote:
> > -->release_folio() is called when the kernel is about to try to drop the
> > -buffers from the folio in preparation for freeing it.  It returns false to
> > -indicate that the buffers are (or may be) freeable.  If ->release_folio is
> > -NULL, the kernel assumes that the fs has no private interest in the buffers.
> > +->release_folio() is called when the MM wants to make a change to the
> > +folio that would invalidate the filesystem's private data.  For example,
> > +it may be about to be removed from the address_space or split.  The folio
> > +is locked and not under writeback.  It may be dirty.  The gfp parameter is
> > +not usually used for allocation, but rather to indicate what the filesystem
> > +may do to attempt to free the private data.  The filesystem may
> > +return false to indicate that the folio's private data cannot be freed.
> > +If it returns true, it should have already removed the private data from
> > +the folio.  If a filesystem does not provide a ->release_folio method,
> > +the kernel will call try_to_free_buffers().
> 
> the MM?  Since you changed that above... :)
> 
> With that nit fixed,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Well, is it the MM?  At this point, the decision is made by
filemap_release_folio(), which is the VFS, in my opinion ;-)

But I'm happy to use "the MM".  Or "the pagecache".

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/7] doc: Correct the description of ->release_folio
  2023-06-04 20:10     ` Matthew Wilcox
@ 2023-06-04 20:33       ` Darrick J. Wong
  2023-06-05 13:11         ` Matthew Wilcox
  0 siblings, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2023-06-04 20:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Sun, Jun 04, 2023 at 09:10:55PM +0100, Matthew Wilcox wrote:
> On Sun, Jun 04, 2023 at 10:55:48AM -0700, Darrick J. Wong wrote:
> > On Fri, Jun 02, 2023 at 11:24:39PM +0100, Matthew Wilcox (Oracle) wrote:
> > > -->release_folio() is called when the kernel is about to try to drop the
> > > -buffers from the folio in preparation for freeing it.  It returns false to
> > > -indicate that the buffers are (or may be) freeable.  If ->release_folio is
> > > -NULL, the kernel assumes that the fs has no private interest in the buffers.
> > > +->release_folio() is called when the MM wants to make a change to the
> > > +folio that would invalidate the filesystem's private data.  For example,
> > > +it may be about to be removed from the address_space or split.  The folio
> > > +is locked and not under writeback.  It may be dirty.  The gfp parameter is
> > > +not usually used for allocation, but rather to indicate what the filesystem
> > > +may do to attempt to free the private data.  The filesystem may
> > > +return false to indicate that the folio's private data cannot be freed.
> > > +If it returns true, it should have already removed the private data from
> > > +the folio.  If a filesystem does not provide a ->release_folio method,
> > > +the kernel will call try_to_free_buffers().
> > 
> > the MM?  Since you changed that above... :)
> > 
> > With that nit fixed,
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> Well, is it the MM?  At this point, the decision is made by
> filemap_release_folio(), which is the VFS, in my opinion ;-)

It's in mm/filemap.c, which I think makes it squarely the pagecache/mm,
not the vfs.

"Yuh better stop your train, rabbit!" etc ;)

> But I'm happy to use "the MM".  Or "the pagecache".

Sounds good to me.

--D

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 3/7] iomap: Remove unnecessary test from iomap_release_folio()
  2023-06-04 18:01   ` Darrick J. Wong
@ 2023-06-04 21:39     ` Matthew Wilcox
  2023-06-05 21:10       ` Ritesh Harjani
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2023-06-04 21:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Ritesh Harjani

On Sun, Jun 04, 2023 at 11:01:41AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 02, 2023 at 11:24:40PM +0100, Matthew Wilcox (Oracle) wrote:
> > The check for the folio being under writeback is unnecessary; the caller
> > has checked this and the folio is locked, so the folio cannot be under
> > writeback at this point.
> 
> Do we need a debug assertion here to validate that filemap_release_folio
> has already filtered out folios undergoing writeback?  The documentation
> change in the next patch might be fine since you're the pagecache
> maintainer.

I don't think so?  We don't usually include asserts in filesystems that
the VFS is living up to its promises.

> >  	/*
> > -	 * mm accommodates an old ext3 case where clean folios might
> > -	 * not have had the dirty bit cleared.  Thus, it can send actual
> > -	 * dirty folios to ->release_folio() via shrink_active_list();
> > -	 * skip those here.
> > +	 * If the folio is dirty, we refuse to release our metadata because
> > +	 * it may be partially dirty (FIXME, add a test for that).
> 
> Er... is this FIXME reflective of incomplete code?

It's a note to Ritesh ;-)

Once we have per-block dirty bits, if all the dirty bits are set, we
can free the iomap_page without losing any information.  But I don't
want to introduce a dependency between this and Ritesh's work.

^ permalink raw reply	[flat|nested] 39+ messages in thread
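
A sketch of the optimization being referred to, once per-block dirty
tracking lands.  The iop_every_block_dirty() helper is assumed here,
standing in for whatever the per-block series ends up providing; the
point is that a uniformly dirty folio loses no information when the
iop goes away:

	if (folio_test_dirty(folio)) {
		if (!iop_every_block_dirty(folio))	/* hypothetical */
			return false;	/* partially dirty: keep the iop */
		/* all blocks dirty: the folio dirty bit says it all */
	}
	iomap_page_release(folio);
	return true;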

* Re: [PATCH v2 5/7] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-04 18:09   ` Darrick J. Wong
@ 2023-06-04 21:48     ` Matthew Wilcox
  2023-06-05 15:21       ` Darrick J. Wong
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2023-06-04 21:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Sun, Jun 04, 2023 at 11:09:25AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 02, 2023 at 11:24:42PM +0100, Matthew Wilcox (Oracle) wrote:
> > +		do {
> > +			err = -ENOMEM;
> > +			if (order == 1)
> > +				order = 0;
> 
> Doesn't this interrupt the scale-down progression 2M -> 1M -> 512K ->
> 256K -> 128K -> 64K -> 32K -> 16K -> 4k?  What if I want 8k folios?

You can't have order-1 file/anon folios.  We have deferred_list in the
third page, so we have to have at least three pages in every large folio.
I forget exactly what it's used for; maybe there's a way to do without
it, but for now that's the rule.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iomap: Copy larger chunks from userspace
  2023-06-04 18:29   ` Darrick J. Wong
@ 2023-06-04 22:11     ` Matthew Wilcox
  2023-06-05  8:25       ` Yin, Fengwei
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2023-06-04 22:11 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Sun, Jun 04, 2023 at 11:29:52AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 02, 2023 at 11:24:44PM +0100, Matthew Wilcox (Oracle) wrote:
> > -		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
> > +		copied = copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
> 
> I think I've gotten lost in the weeds.  Does copy_page_from_iter_atomic
> actually know how to deal with a multipage folio?  AFAICT it takes a
> page, kmaps it, and copies @bytes starting at @offset in the page.  If
> a caller feeds it a multipage folio, does that all work correctly?  Or
> will the pagecache split multipage folios as needed to make it work
> right?

It's a smidgen inefficient, but it does work.  First, it calls
page_copy_sane() to check that offset & n fit within the compound page
(ie this all predates folios).

... Oh.  copy_page_from_iter() handles this correctly.
copy_page_from_iter_atomic() doesn't.  I'll have to fix this
first.  Looks like Al fixed copy_page_from_iter() in c03f05f183cd
and didn't fix copy_page_from_iter_atomic().

> If we create a 64k folio at pos 0 and then want to write a byte at pos
> 40k, does __filemap_get_folio break up the 64k folio so that the folio
> returned by iomap_get_folio starts at 40k?  Or can the iter code handle
> jumping ten pages into a 16-page folio and I just can't see it?

Well ... it handles it fine unless it's highmem.  p is kaddr + offset,
so if offset is 40k, it works correctly on !highmem.

^ permalink raw reply	[flat|nested] 39+ messages in thread
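
The highmem caveat spelled out: kmap_atomic() maps exactly one page, so
kaddr + offset is only valid while offset stays below PAGE_SIZE.  On
!highmem kernels the whole folio is contiguous in the direct map, which
is why the large offset happens to work there.  Schematically:

	/* Wrong on highmem once offset >= PAGE_SIZE: only one page mapped. */
	char *p = kmap_atomic(&folio->page) + offset;

	/* Correct everywhere: map the page that actually contains offset. */
	char *q = kmap_atomic(folio_page(folio, offset / PAGE_SIZE))
			+ offset % PAGE_SIZE;
	/* ... copy at most PAGE_SIZE - offset % PAGE_SIZE bytes ... */
	kunmap_atomic(q);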

* Re: [PATCH v2 1/7] iomap: Remove large folio handling in iomap_invalidate_folio()
  2023-06-02 22:24 ` [PATCH v2 1/7] iomap: Remove large folio handling in iomap_invalidate_folio() Matthew Wilcox (Oracle)
  2023-06-04 17:58   ` Darrick J. Wong
@ 2023-06-05  7:11   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2023-06-05  7:11 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/7] doc: Correct the description of ->release_folio
  2023-06-02 22:24 ` [PATCH v2 2/7] doc: Correct the description of ->release_folio Matthew Wilcox (Oracle)
  2023-06-04 17:55   ` Darrick J. Wong
@ 2023-06-05  7:12   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2023-06-05  7:12 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

Never mind the schism on the nature of the holy trinity of
vfs/mm/pagecache, the substance looks good here:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 3/7] iomap: Remove unnecessary test from iomap_release_folio()
  2023-06-02 22:24 ` [PATCH v2 3/7] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
  2023-06-04 18:01   ` Darrick J. Wong
@ 2023-06-05  7:13   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2023-06-05  7:13 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

> -	 * dirty folios to ->release_folio() via shrink_active_list();
> -	 * skip those here.
> +	 * If the folio is dirty, we refuse to release our metadata because
> +	 * it may be partially dirty (FIXME, add a test for that).

I'd prefer not to add this FIXME as we're fine without the per-block
dirty tracking.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/7] filemap: Add fgp_t typedef
  2023-06-02 22:24 ` [PATCH v2 4/7] filemap: Add fgp_t typedef Matthew Wilcox (Oracle)
  2023-06-04 18:02   ` Darrick J. Wong
@ 2023-06-05  7:14   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2023-06-05  7:14 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 5/7] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-02 22:24 ` [PATCH v2 5/7] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
  2023-06-04 18:09   ` Darrick J. Wong
@ 2023-06-05  7:16   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2023-06-05  7:16 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

Still not a huge fan of the dense encoding in the flags, but technically
this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/7] iomap: Create large folios in the buffered write path
  2023-06-02 22:24 ` [PATCH v2 6/7] iomap: Create large folios in the buffered write path Matthew Wilcox (Oracle)
  2023-06-04 18:10   ` Darrick J. Wong
@ 2023-06-05  7:16   ` Christoph Hellwig
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2023-06-05  7:16 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iomap: Copy larger chunks from userspace
  2023-06-04 22:11     ` Matthew Wilcox
@ 2023-06-05  8:25       ` Yin, Fengwei
  2023-06-06 18:07         ` Matthew Wilcox
  0 siblings, 1 reply; 39+ messages in thread
From: Yin, Fengwei @ 2023-06-05  8:25 UTC (permalink / raw)
  To: Matthew Wilcox, Darrick J. Wong
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig



On 6/5/2023 6:11 AM, Matthew Wilcox wrote:
> On Sun, Jun 04, 2023 at 11:29:52AM -0700, Darrick J. Wong wrote:
>> On Fri, Jun 02, 2023 at 11:24:44PM +0100, Matthew Wilcox (Oracle) wrote:
>>> -		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
>>> +		copied = copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
>>
>> I think I've gotten lost in the weeds.  Does copy_page_from_iter_atomic
>> actually know how to deal with a multipage folio?  AFAICT it takes a
>> page, kmaps it, and copies @bytes starting at @offset in the page.  If
>> a caller feeds it a multipage folio, does that all work correctly?  Or
>> will the pagecache split multipage folios as needed to make it work
>> right?
> 
> It's a smidgen inefficient, but it does work.  First, it calls
> page_copy_sane() to check that offset & n fit within the compound page
> (ie this all predates folios).
> 
> ... Oh.  copy_page_from_iter() handles this correctly.
> copy_page_from_iter_atomic() doesn't.  I'll have to fix this
> first.  Looks like Al fixed copy_page_from_iter() in c03f05f183cd
> and didn't fix copy_page_from_iter_atomic().
> 
>> If we create a 64k folio at pos 0 and then want to write a byte at pos
>> 40k, does __filemap_get_folio break up the 64k folio so that the folio
>> returned by iomap_get_folio starts at 40k?  Or can the iter code handle
>> jumping ten pages into a 16-page folio and I just can't see it?
> 
> Well ... it handles it fine unless it's highmem.  p is kaddr + offset,
> so if offset is 40k, it works correctly on !highmem.
So is it better to have implementations for !highmem and highmem? And for
!highmem, we don't need the kmap_local_page()/kunmap_local() and chunk
size per copy is not limited to PAGE_SIZE. Thanks.


Regards
Yin, Fengwei

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/7] doc: Correct the description of ->release_folio
  2023-06-04 20:33       ` Darrick J. Wong
@ 2023-06-05 13:11         ` Matthew Wilcox
  2023-06-05 15:07           ` Darrick J. Wong
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2023-06-05 13:11 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Sun, Jun 04, 2023 at 01:33:06PM -0700, Darrick J. Wong wrote:
> On Sun, Jun 04, 2023 at 09:10:55PM +0100, Matthew Wilcox wrote:
> > On Sun, Jun 04, 2023 at 10:55:48AM -0700, Darrick J. Wong wrote:
> > > On Fri, Jun 02, 2023 at 11:24:39PM +0100, Matthew Wilcox (Oracle) wrote:
> > > > -->release_folio() is called when the kernel is about to try to drop the
> > > > -buffers from the folio in preparation for freeing it.  It returns false to
> > > > -indicate that the buffers are (or may be) freeable.  If ->release_folio is
> > > > -NULL, the kernel assumes that the fs has no private interest in the buffers.
> > > > +->release_folio() is called when the MM wants to make a change to the
> > > > +folio that would invalidate the filesystem's private data.  For example,
> > > > +it may be about to be removed from the address_space or split.  The folio
> > > > +is locked and not under writeback.  It may be dirty.  The gfp parameter is
> > > > +not usually used for allocation, but rather to indicate what the filesystem
> > > > +may do to attempt to free the private data.  The filesystem may
> > > > +return false to indicate that the folio's private data cannot be freed.
> > > > +If it returns true, it should have already removed the private data from
> > > > +the folio.  If a filesystem does not provide a ->release_folio method,
> > > > +the kernel will call try_to_free_buffers().
> > > 
> > > the MM?  Since you changed that above... :)
> > > 
> > > With that nit fixed,
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > 
> > Well, is it the MM?  At this point, the decision is made by
> > filemap_release_folio(), which is the VFS, in my opinion ;-)
> 
> It's in mm/filemap.c, which I think makes it squarely the pagecache/mm,
> not the vfs.

Changed this to:

If a filesystem does not provide a ->release_folio method,
the pagecache will assume that private data is buffer_heads and call
try_to_free_buffers().


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 2/7] doc: Correct the description of ->release_folio
  2023-06-05 13:11         ` Matthew Wilcox
@ 2023-06-05 15:07           ` Darrick J. Wong
  0 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2023-06-05 15:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Mon, Jun 05, 2023 at 02:11:39PM +0100, Matthew Wilcox wrote:
> On Sun, Jun 04, 2023 at 01:33:06PM -0700, Darrick J. Wong wrote:
> > On Sun, Jun 04, 2023 at 09:10:55PM +0100, Matthew Wilcox wrote:
> > > On Sun, Jun 04, 2023 at 10:55:48AM -0700, Darrick J. Wong wrote:
> > > > On Fri, Jun 02, 2023 at 11:24:39PM +0100, Matthew Wilcox (Oracle) wrote:
> > > > > -->release_folio() is called when the kernel is about to try to drop the
> > > > > -buffers from the folio in preparation for freeing it.  It returns false to
> > > > > -indicate that the buffers are (or may be) freeable.  If ->release_folio is
> > > > > -NULL, the kernel assumes that the fs has no private interest in the buffers.
> > > > > +->release_folio() is called when the MM wants to make a change to the
> > > > > +folio that would invalidate the filesystem's private data.  For example,
> > > > > +it may be about to be removed from the address_space or split.  The folio
> > > > > +is locked and not under writeback.  It may be dirty.  The gfp parameter is
> > > > > +not usually used for allocation, but rather to indicate what the filesystem
> > > > > +may do to attempt to free the private data.  The filesystem may
> > > > > +return false to indicate that the folio's private data cannot be freed.
> > > > > +If it returns true, it should have already removed the private data from
> > > > > +the folio.  If a filesystem does not provide a ->release_folio method,
> > > > > +the kernel will call try_to_free_buffers().
> > > > 
> > > > the MM?  Since you changed that above... :)
> > > > 
> > > > With that nit fixed,
> > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Well, is it the MM?  At this point, the decision is made by
> > > filemap_release_folio(), which is the VFS, in my opinion ;-)
> > 
> > It's in mm/filemap.c, which I think makes it squarely the pagecache/mm,
> > not the vfs.
> 
> Changed this to:
> 
> If a filesystem does not provide a ->release_folio method,
> the pagecache will assume that private data is buffer_heads and call
> try_to_free_buffers().

Holy schism resolved;
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 5/7] filemap: Allow __filemap_get_folio to allocate large folios
  2023-06-04 21:48     ` Matthew Wilcox
@ 2023-06-05 15:21       ` Darrick J. Wong
  0 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2023-06-05 15:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

On Sun, Jun 04, 2023 at 10:48:47PM +0100, Matthew Wilcox wrote:
> On Sun, Jun 04, 2023 at 11:09:25AM -0700, Darrick J. Wong wrote:
> > On Fri, Jun 02, 2023 at 11:24:42PM +0100, Matthew Wilcox (Oracle) wrote:
> > > +		do {
> > > +			err = -ENOMEM;
> > > +			if (order == 1)
> > > +				order = 0;
> > 
> > Doesn't this interrupt the scale-down progression 2M -> 1M -> 512K ->
> > 256K -> 128K -> 64K -> 32K -> 16K -> 4k?  What if I want 8k folios?
> 
> You can't have order-1 file/anon folios.  We have deferred_list in the
> third page, so we have to have at least three pages in every large folio.
> I forget exactly what it's used for; maybe there's a way to do without
> it, but for now that's the rule.

Ahahaha, ok.  I hadn't realized/remembered that.

/me wonders if that ought to be captured in a header as some static
inline clamping function instead of opencoded, but afaict there's only
four places around the kernel that do/need this.

Really it's a pity we can't do

	for order in 9 8 7 6 5 4 3 2 0; do

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

^ permalink raw reply	[flat|nested] 39+ messages in thread
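
The clamping helper mused about above might look something like this
(name and location hypothetical, not part of this series); it simply
centralizes the MAX_PAGECACHE_ORDER cap and the no-order-1 rule:

static inline unsigned int pagecache_clamp_order(unsigned int order)
{
	if (order > MAX_PAGECACHE_ORDER)
		order = MAX_PAGECACHE_ORDER;
	if (order == 1)		/* order-1 file folios are not allowed */
		order = 0;
	return order;
}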

* Re: [PATCH v2 3/7] iomap: Remove unnecessary test from iomap_release_folio()
  2023-06-04 21:39     ` Matthew Wilcox
@ 2023-06-05 21:10       ` Ritesh Harjani
  0 siblings, 0 replies; 39+ messages in thread
From: Ritesh Harjani @ 2023-06-05 21:10 UTC (permalink / raw)
  To: Matthew Wilcox, Darrick J. Wong
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner, Christoph Hellwig

Matthew Wilcox <willy@infradead.org> writes:

> On Sun, Jun 04, 2023 at 11:01:41AM -0700, Darrick J. Wong wrote:
>> On Fri, Jun 02, 2023 at 11:24:40PM +0100, Matthew Wilcox (Oracle) wrote:
>> > The check for the folio being under writeback is unnecessary; the caller
>> > has checked this and the folio is locked, so the folio cannot be under
>> > writeback at this point.
>>
>> Do we need a debug assertion here to validate that filemap_release_folio
>> has already filtered out folios undergoing writeback?  The documentation
>> change in the next patch might be fine since you're the pagecache
>> maintainer.
>
> I don't think so?  We don't usually include asserts in filesystems that
> the VFS is living up to its promises.
>
>> >  	/*
>> > -	 * mm accommodates an old ext3 case where clean folios might
>> > -	 * not have had the dirty bit cleared.  Thus, it can send actual
>> > -	 * dirty folios to ->release_folio() via shrink_active_list();
>> > -	 * skip those here.
>> > +	 * If the folio is dirty, we refuse to release our metadata because
>> > +	 * it may be partially dirty (FIXME, add a test for that).
>>
>> Er... is this FIXME reflective of incomplete code?
>
> It's a note to Ritesh ;-)
>
> Once we have per-block dirty bits, if all the dirty bits are set, we
> can free the iomap_page without losing any information.  But I don't
> want to introduce a dependency between this and Ritesh's work.

Sure Matthew, noted.
However, this looks more like an optimization which can be done for
cases where we have all bits as dirty, rather than a dependency.

-ritesh

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iomap: Copy larger chunks from userspace
  2023-06-05  8:25       ` Yin, Fengwei
@ 2023-06-06 18:07         ` Matthew Wilcox
  2023-06-07  2:21           ` Yin Fengwei
  2023-06-07  6:40           ` Yin Fengwei
  0 siblings, 2 replies; 39+ messages in thread
From: Matthew Wilcox @ 2023-06-06 18:07 UTC (permalink / raw)
  To: Yin, Fengwei
  Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, Wang Yugui,
	Dave Chinner, Christoph Hellwig, Al Viro

On Mon, Jun 05, 2023 at 04:25:22PM +0800, Yin, Fengwei wrote:
> On 6/5/2023 6:11 AM, Matthew Wilcox wrote:
> > On Sun, Jun 04, 2023 at 11:29:52AM -0700, Darrick J. Wong wrote:
> >> On Fri, Jun 02, 2023 at 11:24:44PM +0100, Matthew Wilcox (Oracle) wrote:
> >>> -		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
> >>> +		copied = copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
> >>
> >> I think I've gotten lost in the weeds.  Does copy_page_from_iter_atomic
> >> actually know how to deal with a multipage folio?  AFAICT it takes a
> >> page, kmaps it, and copies @bytes starting at @offset in the page.  If
> >> a caller feeds it a multipage folio, does that all work correctly?  Or
> >> will the pagecache split multipage folios as needed to make it work
> >> right?
> > 
> > It's a smidgen inefficient, but it does work.  First, it calls
> > page_copy_sane() to check that offset & n fit within the compound page
> > (ie this all predates folios).
> > 
> > ... Oh.  copy_page_from_iter() handles this correctly.
> > copy_page_from_iter_atomic() doesn't.  I'll have to fix this
> > first.  Looks like Al fixed copy_page_from_iter() in c03f05f183cd
> > and didn't fix copy_page_from_iter_atomic().
> > 
> >> If we create a 64k folio at pos 0 and then want to write a byte at pos
> >> 40k, does __filemap_get_folio break up the 64k folio so that the folio
> >> returned by iomap_get_folio starts at 40k?  Or can the iter code handle
> >> jumping ten pages into a 16-page folio and I just can't see it?
> > 
> > Well ... it handles it fine unless it's highmem.  p is kaddr + offset,
> > so if offset is 40k, it works correctly on !highmem.
> So is it better to have implementations for !highmem and highmem? And for
> !highmem, we don't need the kmap_local_page()/kunmap_local() and chunk
> size per copy is not limited to PAGE_SIZE. Thanks.

No, that's not needed; we can handle that just fine.  Maybe this can
use kmap_local_page() instead of kmap_atomic().  Al, what do you think?
I haven't tested this yet; need to figure out a qemu config with highmem ...

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 960223ed9199..d3d6a0789625 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -857,24 +857,36 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_zero);
 
-size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
-				  struct iov_iter *i)
+size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
+		size_t bytes, struct iov_iter *i)
 {
-	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
-	if (!page_copy_sane(page, offset, bytes)) {
-		kunmap_atomic(kaddr);
+	size_t n = bytes, copied = 0;
+
+	if (!page_copy_sane(page, offset, bytes))
 		return 0;
-	}
-	if (WARN_ON_ONCE(!i->data_source)) {
-		kunmap_atomic(kaddr);
+	if (WARN_ON_ONCE(!i->data_source))
 		return 0;
+
+	page += offset / PAGE_SIZE;
+	offset %= PAGE_SIZE;
+	if (PageHighMem(page))
+		n = min_t(size_t, bytes, PAGE_SIZE);
+	while (1) {
+		char *kaddr = kmap_atomic(page) + offset;
+		iterate_and_advance(i, n, base, len, off,
+			copyin(kaddr + off, base, len),
+			memcpy_from_iter(i, kaddr + off, base, len)
+		)
+		kunmap_atomic(kaddr);
+		copied += n;
+		if (!PageHighMem(page) || copied == bytes || n == 0)
+			break;
+		offset += n;
+		page += offset / PAGE_SIZE;
+		offset %= PAGE_SIZE;
+		n = min_t(size_t, bytes - copied, PAGE_SIZE);
 	}
-	iterate_and_advance(i, bytes, base, len, off,
-		copyin(p + off, base, len),
-		memcpy_from_iter(i, p + off, base, len)
-	)
-	kunmap_atomic(kaddr);
-	return bytes;
+	return copied;
 }
 EXPORT_SYMBOL(copy_page_from_iter_atomic);
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iomap: Copy larger chunks from userspace
  2023-06-06 18:07         ` Matthew Wilcox
@ 2023-06-07  2:21           ` Yin Fengwei
  2023-06-07  5:33             ` Yin, Fengwei
  2023-06-07 15:55             ` Matthew Wilcox
  2023-06-07  6:40           ` Yin Fengwei
  1 sibling, 2 replies; 39+ messages in thread
From: Yin Fengwei @ 2023-06-07  2:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, Wang Yugui,
	Dave Chinner, Christoph Hellwig, Al Viro



On 6/7/23 02:07, Matthew Wilcox wrote:
> On Mon, Jun 05, 2023 at 04:25:22PM +0800, Yin, Fengwei wrote:
>> On 6/5/2023 6:11 AM, Matthew Wilcox wrote:
>>> On Sun, Jun 04, 2023 at 11:29:52AM -0700, Darrick J. Wong wrote:
>>>> On Fri, Jun 02, 2023 at 11:24:44PM +0100, Matthew Wilcox (Oracle) wrote:
>>>>> -		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
>>>>> +		copied = copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
>>>>
>>>> I think I've gotten lost in the weeds.  Does copy_page_from_iter_atomic
>>>> actually know how to deal with a multipage folio?  AFAICT it takes a
>>>> page, kmaps it, and copies @bytes starting at @offset in the page.  If
>>>> a caller feeds it a multipage folio, does that all work correctly?  Or
>>>> will the pagecache split multipage folios as needed to make it work
>>>> right?
>>>
>>> It's a smidgen inefficient, but it does work.  First, it calls
>>> page_copy_sane() to check that offset & n fit within the compound page
>>> (ie this all predates folios).
>>>
>>> ... Oh.  copy_page_from_iter() handles this correctly.
>>> copy_page_from_iter_atomic() doesn't.  I'll have to fix this
>>> first.  Looks like Al fixed copy_page_from_iter() in c03f05f183cd
>>> and didn't fix copy_page_from_iter_atomic().
>>>
>>>> If we create a 64k folio at pos 0 and then want to write a byte at pos
>>>> 40k, does __filemap_get_folio break up the 64k folio so that the folio
>>>> returned by iomap_get_folio starts at 40k?  Or can the iter code handle
>>>> jumping ten pages into a 16-page folio and I just can't see it?
>>>
>>> Well ... it handles it fine unless it's highmem.  p is kaddr + offset,
>>> so if offset is 40k, it works correctly on !highmem.
>> So is it better to have implementations for !highmem and highmem? And for
>> !highmem, we don't need the kmap_local_page()/kunmap_local() and chunk
>> size per copy is not limited to PAGE_SIZE. Thanks.
> 
> No, that's not needed; we can handle that just fine.  Maybe this can
> use kmap_local_page() instead of kmap_atomic().  Al, what do you think?
> I haven't tested this yet; need to figure out a qemu config with highmem ...
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 960223ed9199..d3d6a0789625 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -857,24 +857,36 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
>  }
>  EXPORT_SYMBOL(iov_iter_zero);
>  
> -size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
> -				  struct iov_iter *i)
> +size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
> +		size_t bytes, struct iov_iter *i)
>  {
> -	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
> -	if (!page_copy_sane(page, offset, bytes)) {
> -		kunmap_atomic(kaddr);
> +	size_t n = bytes, copied = 0;
> +
> +	if (!page_copy_sane(page, offset, bytes))
>  		return 0;
> -	}
> -	if (WARN_ON_ONCE(!i->data_source)) {
> -		kunmap_atomic(kaddr);
> +	if (WARN_ON_ONCE(!i->data_source))
>  		return 0;
> +
> +	page += offset / PAGE_SIZE;
> +	offset %= PAGE_SIZE;
> +	if (PageHighMem(page))
> +		n = min_t(size_t, bytes, PAGE_SIZE);
This is smart.

> +	while (1) {
> +		char *kaddr = kmap_atomic(page) + offset;
> +		iterate_and_advance(i, n, base, len, off,
> +			copyin(kaddr + off, base, len),
> +			memcpy_from_iter(i, kaddr + off, base, len)
> +		)
> +		kunmap_atomic(kaddr);
> +		copied += n;
> +		if (!PageHighMem(page) || copied == bytes || n == 0)
> +			break;
My understanding is copied == bytes could cover !PageHighMem(page).

> +		offset += n;
> +		page += offset / PAGE_SIZE;
Should be page += n / PAGE_SIZE? Thanks.


Regards
Yin, Fengwei

> +		offset %= PAGE_SIZE;
> +		n = min_t(size_t, bytes - copied, PAGE_SIZE);
>  	}
> -	iterate_and_advance(i, bytes, base, len, off,
> -		copyin(p + off, base, len),
> -		memcpy_from_iter(i, p + off, base, len)
> -	)
> -	kunmap_atomic(kaddr);
> -	return bytes;
> +	return copied;
>  }
>  EXPORT_SYMBOL(copy_page_from_iter_atomic);
>  

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iomap: Copy larger chunks from userspace
  2023-06-07  2:21           ` Yin Fengwei
@ 2023-06-07  5:33             ` Yin, Fengwei
  2023-06-07 15:55             ` Matthew Wilcox
  1 sibling, 0 replies; 39+ messages in thread
From: Yin, Fengwei @ 2023-06-07  5:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, Wang Yugui,
	Dave Chinner, Christoph Hellwig, Al Viro



On 6/7/2023 10:21 AM, Yin Fengwei wrote:
> 
> 
> On 6/7/23 02:07, Matthew Wilcox wrote:
>> On Mon, Jun 05, 2023 at 04:25:22PM +0800, Yin, Fengwei wrote:
>>> On 6/5/2023 6:11 AM, Matthew Wilcox wrote:
>>>> On Sun, Jun 04, 2023 at 11:29:52AM -0700, Darrick J. Wong wrote:
>>>>> On Fri, Jun 02, 2023 at 11:24:44PM +0100, Matthew Wilcox (Oracle) wrote:
>>>>>> -		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
>>>>>> +		copied = copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
>>>>>
>>>>> I think I've gotten lost in the weeds.  Does copy_page_from_iter_atomic
>>>>> actually know how to deal with a multipage folio?  AFAICT it takes a
>>>>> page, kmaps it, and copies @bytes starting at @offset in the page.  If
>>>>> a caller feeds it a multipage folio, does that all work correctly?  Or
>>>>> will the pagecache split multipage folios as needed to make it work
>>>>> right?
>>>>
>>>> It's a smidgen inefficient, but it does work.  First, it calls
>>>> page_copy_sane() to check that offset & n fit within the compound page
>>>> (ie this all predates folios).
>>>>
>>>> ... Oh.  copy_page_from_iter() handles this correctly.
>>>> copy_page_from_iter_atomic() doesn't.  I'll have to fix this
>>>> first.  Looks like Al fixed copy_page_from_iter() in c03f05f183cd
>>>> and didn't fix copy_page_from_iter_atomic().
>>>>
>>>>> If we create a 64k folio at pos 0 and then want to write a byte at pos
>>>>> 40k, does __filemap_get_folio break up the 64k folio so that the folio
>>>>> returned by iomap_get_folio starts at 40k?  Or can the iter code handle
>>>>> jumping ten pages into a 16-page folio and I just can't see it?
>>>>
>>>> Well ... it handles it fine unless it's highmem.  p is kaddr + offset,
>>>> so if offset is 40k, it works correctly on !highmem.
>>> So is it better to have implementations for !highmem and highmem? And for
>>> !highmem, we don't need the kmap_local_page()/kunmap_local() and chunk
>>> size per copy is not limited to PAGE_SIZE. Thanks.
>>
>> No, that's not needed; we can handle that just fine.  Maybe this can
>> use kmap_local_page() instead of kmap_atomic().  Al, what do you think?
>> I haven't tested this yet; need to figure out a qemu config with highmem ...
>>
>> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
>> index 960223ed9199..d3d6a0789625 100644
>> --- a/lib/iov_iter.c
>> +++ b/lib/iov_iter.c
>> @@ -857,24 +857,36 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
>>  }
>>  EXPORT_SYMBOL(iov_iter_zero);
>>  
>> -size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
>> -				  struct iov_iter *i)
>> +size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
>> +		size_t bytes, struct iov_iter *i)
>>  {
>> -	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
>> -	if (!page_copy_sane(page, offset, bytes)) {
>> -		kunmap_atomic(kaddr);
>> +	size_t n = bytes, copied = 0;
>> +
>> +	if (!page_copy_sane(page, offset, bytes))
>>  		return 0;
>> -	}
>> -	if (WARN_ON_ONCE(!i->data_source)) {
>> -		kunmap_atomic(kaddr);
>> +	if (WARN_ON_ONCE(!i->data_source))
>>  		return 0;
>> +
>> +	page += offset / PAGE_SIZE;
>> +	offset %= PAGE_SIZE;
>> +	if (PageHighMem(page))
>> +		n = min_t(size_t, bytes, PAGE_SIZE);
> This is smart.
> 
>> +	while (1) {
>> +		char *kaddr = kmap_atomic(page) + offset;
>> +		iterate_and_advance(i, n, base, len, off,
>> +			copyin(kaddr + off, base, len),
>> +			memcpy_from_iter(i, kaddr + off, base, len)
>> +		)
>> +		kunmap_atomic(kaddr);
>> +		copied += n;
>> +		if (!PageHighMem(page) || copied == bytes || n == 0)
>> +			break;
> My understanding is copied == bytes could cover !PageHighMem(page).
> 
>> +		offset += n;
>> +		page += offset / PAGE_SIZE;
> Should be page += n / PAGE_SIZE? Thanks.
offset / PAGE_SIZE is correct. Sorry for the noise.
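A quick userspace check with made-up values convinced me: offset is
always reduced below PAGE_SIZE before the increment, so after
offset += n the quotient counts the page boundary crossed, which
n / PAGE_SIZE would miss:

#include <assert.h>

#define PAGE_SIZE 4096u

int main(void)
{
	unsigned offset = 2048, n = 3072;	/* crosses one page boundary */

	offset += n;				/* 5120 */
	assert(offset / PAGE_SIZE == 1);	/* advance one page: correct */
	assert(n / PAGE_SIZE == 0);		/* would advance zero pages */
	offset %= PAGE_SIZE;			/* new offset is 1024 */
	return 0;
}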

Regards
Yin, Fengwei

> 
> 
> Regards
> Yin, Fengwei
> 
>> +		offset %= PAGE_SIZE;
>> +		n = min_t(size_t, bytes - copied, PAGE_SIZE);
>>  	}
>> -	iterate_and_advance(i, bytes, base, len, off,
>> -		copyin(p + off, base, len),
>> -		memcpy_from_iter(i, p + off, base, len)
>> -	)
>> -	kunmap_atomic(kaddr);
>> -	return bytes;
>> +	return copied;
>>  }
>>  EXPORT_SYMBOL(copy_page_from_iter_atomic);
>>  

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iomap: Copy larger chunks from userspace
  2023-06-06 18:07         ` Matthew Wilcox
  2023-06-07  2:21           ` Yin Fengwei
@ 2023-06-07  6:40           ` Yin Fengwei
  2023-06-07 15:56             ` Matthew Wilcox
  1 sibling, 1 reply; 39+ messages in thread
From: Yin Fengwei @ 2023-06-07  6:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, Wang Yugui,
	Dave Chinner, Christoph Hellwig, Al Viro



On 6/7/23 02:07, Matthew Wilcox wrote:
> On Mon, Jun 05, 2023 at 04:25:22PM +0800, Yin, Fengwei wrote:
>> On 6/5/2023 6:11 AM, Matthew Wilcox wrote:
>>> On Sun, Jun 04, 2023 at 11:29:52AM -0700, Darrick J. Wong wrote:
>>>> On Fri, Jun 02, 2023 at 11:24:44PM +0100, Matthew Wilcox (Oracle) wrote:
>>>>> -		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
>>>>> +		copied = copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
>>>>
>>>> I think I've gotten lost in the weeds.  Does copy_page_from_iter_atomic
>>>> actually know how to deal with a multipage folio?  AFAICT it takes a
>>>> page, kmaps it, and copies @bytes starting at @offset in the page.  If
>>>> a caller feeds it a multipage folio, does that all work correctly?  Or
>>>> will the pagecache split multipage folios as needed to make it work
>>>> right?
>>>
>>> It's a smidgen inefficient, but it does work.  First, it calls
>>> page_copy_sane() to check that offset & n fit within the compound page
>>> (ie this all predates folios).
>>>
>>> ... Oh.  copy_page_from_iter() handles this correctly.
>>> copy_page_from_iter_atomic() doesn't.  I'll have to fix this
>>> first.  Looks like Al fixed copy_page_from_iter() in c03f05f183cd
>>> and didn't fix copy_page_from_iter_atomic().
>>>
>>>> If we create a 64k folio at pos 0 and then want to write a byte at pos
>>>> 40k, does __filemap_get_folio break up the 64k folio so that the folio
>>>> returned by iomap_get_folio starts at 40k?  Or can the iter code handle
>>>> jumping ten pages into a 16-page folio and I just can't see it?
>>>
>>> Well ... it handles it fine unless it's highmem.  p is kaddr + offset,
>>> so if offset is 40k, it works correctly on !highmem.
>> So is it better to have implementations for !highmem and highmem? And for
>> !highmem, we don't need the kmap_local_page()/kunmap_local() and chunk
>> size per copy is not limited to PAGE_SIZE. Thanks.
> 
> No, that's not needed; we can handle that just fine.  Maybe this can
> use kmap_local_page() instead of kmap_atomic().  Al, what do you think?
> I haven't tested this yet; need to figure out a qemu config with highmem ...
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 960223ed9199..d3d6a0789625 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -857,24 +857,36 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
>  }
>  EXPORT_SYMBOL(iov_iter_zero);
>  
> -size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
> -				  struct iov_iter *i)
> +size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
> +		size_t bytes, struct iov_iter *i)
>  {
> -	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
> -	if (!page_copy_sane(page, offset, bytes)) {
> -		kunmap_atomic(kaddr);
> +	size_t n = bytes, copied = 0;
> +
> +	if (!page_copy_sane(page, offset, bytes))
>  		return 0;
> -	}
> -	if (WARN_ON_ONCE(!i->data_source)) {
> -		kunmap_atomic(kaddr);
> +	if (WARN_ON_ONCE(!i->data_source))
>  		return 0;
> +
> +	page += offset / PAGE_SIZE;
> +	offset %= PAGE_SIZE;
> +	if (PageHighMem(page))
> +		n = min_t(size_t, bytes, PAGE_SIZE);
Should be PAGE_SIZE - offset instead of PAGE_SIZE? Otherwise, with e.g.
offset == 2048 and bytes == 8192, the first chunk would be 4096 bytes
and the copy would run past the end of the one page we kmapped.

> +	while (1) {
> +		char *kaddr = kmap_atomic(page) + offset;
> +		iterate_and_advance(i, n, base, len, off,
> +			copyin(kaddr + off, base, len),
> +			memcpy_from_iter(i, kaddr + off, base, len)
> +		)
> +		kunmap_atomic(kaddr);
> +		copied += n;
> +		if (!PageHighMem(page) || copied == bytes || n == 0)
> +			break;
> +		offset += n;
> +		page += offset / PAGE_SIZE;
> +		offset %= PAGE_SIZE;
> +		n = min_t(size_t, bytes - copied, PAGE_SIZE);

Should be PAGE_SIZE - offset instead of PAGE_SIZE? Thanks.


Regards
Yin, Fengwei

>  	}
> -	iterate_and_advance(i, bytes, base, len, off,
> -		copyin(p + off, base, len),
> -		memcpy_from_iter(i, p + off, base, len)
> -	)
> -	kunmap_atomic(kaddr);
> -	return bytes;
> +	return copied;
>  }
>  EXPORT_SYMBOL(copy_page_from_iter_atomic);
>  

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iomap: Copy larger chunks from userspace
  2023-06-07  2:21           ` Yin Fengwei
  2023-06-07  5:33             ` Yin, Fengwei
@ 2023-06-07 15:55             ` Matthew Wilcox
  2023-06-08  1:22               ` Yin Fengwei
  1 sibling, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2023-06-07 15:55 UTC (permalink / raw)
  To: Yin Fengwei
  Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, Wang Yugui,
	Dave Chinner, Christoph Hellwig, Al Viro

On Wed, Jun 07, 2023 at 10:21:41AM +0800, Yin Fengwei wrote:
> On 6/7/23 02:07, Matthew Wilcox wrote:
> > +++ b/lib/iov_iter.c
> > @@ -857,24 +857,36 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
> >  }
> >  EXPORT_SYMBOL(iov_iter_zero);
> >  
> > -size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
> > -				  struct iov_iter *i)
> > +size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
> > +		size_t bytes, struct iov_iter *i)
> >  {
> > -	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
> > -	if (!page_copy_sane(page, offset, bytes)) {
> > -		kunmap_atomic(kaddr);
> > +	size_t n = bytes, copied = 0;
> > +
> > +	if (!page_copy_sane(page, offset, bytes))
> >  		return 0;
> > -	}
> > -	if (WARN_ON_ONCE(!i->data_source)) {
> > -		kunmap_atomic(kaddr);
> > +	if (WARN_ON_ONCE(!i->data_source))
> >  		return 0;
> > +
> > +	page += offset / PAGE_SIZE;
> > +	offset %= PAGE_SIZE;
> > +	if (PageHighMem(page))
> > +		n = min_t(size_t, bytes, PAGE_SIZE);
> This is smart.

Thanks ;-)

> > +	while (1) {
> > +		char *kaddr = kmap_atomic(page) + offset;
> > +		iterate_and_advance(i, n, base, len, off,
> > +			copyin(kaddr + off, base, len),
> > +			memcpy_from_iter(i, kaddr + off, base, len)
> > +		)
> > +		kunmap_atomic(kaddr);
> > +		copied += n;
> > +		if (!PageHighMem(page) || copied == bytes || n == 0)
> > +			break;
> My understanding is copied == bytes could cover !PageHighMem(page).

It could!  But the PageHighMem test serves two purposes.  One is that
it tells the human reader that this is all because of HighMem.  The
other is that on !HIGHMEM systems it compiles to false:

PAGEFLAG_FALSE(HighMem, highmem)

static inline int Page##uname(const struct page *page) { return 0; }

So it tells the _compiler_ that all of this code is ignorable and
it can optimise it out.  Now, you and I know that it will always
be true, but it lets the compiler remove the test.  Hopefully the
compiler can also see that:

	while (1) {
		...
		if (true)
			break;
	}

means it can optimise away the entire loop structure and just produce
the same code that it always did.
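
As a userspace sketch of what the optimiser sees (page_is_highmem()
here is a made-up stand-in for what PAGEFLAG_FALSE turns PageHighMem()
into on a !CONFIG_HIGHMEM build):

#include <stddef.h>
#include <string.h>

/* stand-in for PageHighMem() when PAGEFLAG_FALSE makes it return 0 */
static inline int page_is_highmem(const void *page)
{
	return 0;
}

size_t copy_chunks(char *dst, const char *src, size_t bytes)
{
	size_t copied = 0, n = bytes;

	while (1) {
		memcpy(dst + copied, src + copied, n);
		copied += n;
		/* constant-false predicate: the compiler can delete the
		 * loop structure and emit a single straight-line copy */
		if (!page_is_highmem(dst) || copied == bytes)
			break;
		n = bytes - copied;
	}
	return copied;
}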

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iomap: Copy larger chunks from userspace
  2023-06-07  6:40           ` Yin Fengwei
@ 2023-06-07 15:56             ` Matthew Wilcox
  0 siblings, 0 replies; 39+ messages in thread
From: Matthew Wilcox @ 2023-06-07 15:56 UTC (permalink / raw)
  To: Yin Fengwei
  Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, Wang Yugui,
	Dave Chinner, Christoph Hellwig, Al Viro

On Wed, Jun 07, 2023 at 02:40:02PM +0800, Yin Fengwei wrote:
> > +	page += offset / PAGE_SIZE;
> > +	offset %= PAGE_SIZE;
> > +	if (PageHighMem(page))
> > +		n = min_t(size_t, bytes, PAGE_SIZE);
> Should be PAGE_SIZE - offset instead of PAGE_SIZE?

Yes, it should.  Thanks.
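
i.e. with both chunk-size computations clamped to the page boundary,
something like this (still untested, so treat it as a sketch rather
than the final patch):

size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
		size_t bytes, struct iov_iter *i)
{
	size_t n = bytes, copied = 0;

	if (!page_copy_sane(page, offset, bytes))
		return 0;
	if (WARN_ON_ONCE(!i->data_source))
		return 0;

	page += offset / PAGE_SIZE;
	offset %= PAGE_SIZE;
	if (PageHighMem(page))
		n = min_t(size_t, bytes, PAGE_SIZE - offset);
	while (1) {
		char *kaddr = kmap_atomic(page) + offset;
		iterate_and_advance(i, n, base, len, off,
			copyin(kaddr + off, base, len),
			memcpy_from_iter(i, kaddr + off, base, len)
		)
		kunmap_atomic(kaddr);
		copied += n;
		if (!PageHighMem(page) || copied == bytes || n == 0)
			break;
		offset += n;
		page += offset / PAGE_SIZE;
		offset %= PAGE_SIZE;
		/* we copied up to the page boundary, so offset is now 0 */
		n = min_t(size_t, bytes - copied, PAGE_SIZE - offset);
	}
	return copied;
}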


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/7] iomap: Copy larger chunks from userspace
  2023-06-07 15:55             ` Matthew Wilcox
@ 2023-06-08  1:22               ` Yin Fengwei
  0 siblings, 0 replies; 39+ messages in thread
From: Yin Fengwei @ 2023-06-08  1:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, linux-fsdevel, linux-xfs, Wang Yugui,
	Dave Chinner, Christoph Hellwig, Al Viro



On 6/7/23 23:55, Matthew Wilcox wrote:
> On Wed, Jun 07, 2023 at 10:21:41AM +0800, Yin Fengwei wrote:
>> On 6/7/23 02:07, Matthew Wilcox wrote:
>>> +++ b/lib/iov_iter.c
>>> @@ -857,24 +857,36 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
>>>  }
>>>  EXPORT_SYMBOL(iov_iter_zero);
>>>  
>>> -size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
>>> -				  struct iov_iter *i)
>>> +size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
>>> +		size_t bytes, struct iov_iter *i)
>>>  {
>>> -	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
>>> -	if (!page_copy_sane(page, offset, bytes)) {
>>> -		kunmap_atomic(kaddr);
>>> +	size_t n = bytes, copied = 0;
>>> +
>>> +	if (!page_copy_sane(page, offset, bytes))
>>>  		return 0;
>>> -	}
>>> -	if (WARN_ON_ONCE(!i->data_source)) {
>>> -		kunmap_atomic(kaddr);
>>> +	if (WARN_ON_ONCE(!i->data_source))
>>>  		return 0;
>>> +
>>> +	page += offset / PAGE_SIZE;
>>> +	offset %= PAGE_SIZE;
>>> +	if (PageHighMem(page))
>>> +		n = min_t(size_t, bytes, PAGE_SIZE);
>> This is smart.
> 
> Thanks ;-)
> 
>>> +	while (1) {
>>> +		char *kaddr = kmap_atomic(page) + offset;
>>> +		iterate_and_advance(i, n, base, len, off,
>>> +			copyin(kaddr + off, base, len),
>>> +			memcpy_from_iter(i, kaddr + off, base, len)
>>> +		)
>>> +		kunmap_atomic(kaddr);
>>> +		copied += n;
>>> +		if (!PageHighMem(page) || copied == bytes || n == 0)
>>> +			break;
>> My understanding is copied == bytes could cover !PageHighMem(page).
> 
> It could!  But the PageHighMem test serves two purposes.  One is that
> it tells the human reader that this is all because of HighMem.  The
> other is that on !HIGHMEM systems it compiles to false:
> 
> PAGEFLAG_FALSE(HighMem, highmem)
> 
> static inline int Page##uname(const struct page *page) { return 0; }
> 
> So it tells the _compiler_ that all of this code is ignorable and
> it can optimise it out.  Now, you and I know that it will always
> be true, but it lets the compiler remove the test.  Hopefully the
> compiler can also see that:
> 
> 	while (1) {
> 		...
> 		if (true)
> 			break;
> 	}
> 
> means it can optimise away the entire loop structure and just produce
> the same code that it always did.
I had thought about the first purpose, but the second purpose is a new
thing I learned from this thread. Thanks a lot for the detailed
explanation.


Regards
Yin, Fengwei


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2023-06-08  1:23 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-02 22:24 [PATCH v2 0/7] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
2023-06-02 22:24 ` [PATCH v2 1/7] iomap: Remove large folio handling in iomap_invalidate_folio() Matthew Wilcox (Oracle)
2023-06-04 17:58   ` Darrick J. Wong
2023-06-05  7:11   ` Christoph Hellwig
2023-06-02 22:24 ` [PATCH v2 2/7] doc: Correct the description of ->release_folio Matthew Wilcox (Oracle)
2023-06-04 17:55   ` Darrick J. Wong
2023-06-04 20:10     ` Matthew Wilcox
2023-06-04 20:33       ` Darrick J. Wong
2023-06-05 13:11         ` Matthew Wilcox
2023-06-05 15:07           ` Darrick J. Wong
2023-06-05  7:12   ` Christoph Hellwig
2023-06-02 22:24 ` [PATCH v2 3/7] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
2023-06-04 18:01   ` Darrick J. Wong
2023-06-04 21:39     ` Matthew Wilcox
2023-06-05 21:10       ` Ritesh Harjani
2023-06-05  7:13   ` Christoph Hellwig
2023-06-02 22:24 ` [PATCH v2 4/7] filemap: Add fgp_t typedef Matthew Wilcox (Oracle)
2023-06-04 18:02   ` Darrick J. Wong
2023-06-05  7:14   ` Christoph Hellwig
2023-06-02 22:24 ` [PATCH v2 5/7] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
2023-06-04 18:09   ` Darrick J. Wong
2023-06-04 21:48     ` Matthew Wilcox
2023-06-05 15:21       ` Darrick J. Wong
2023-06-05  7:16   ` Christoph Hellwig
2023-06-02 22:24 ` [PATCH v2 6/7] iomap: Create large folios in the buffered write path Matthew Wilcox (Oracle)
2023-06-04 18:10   ` Darrick J. Wong
2023-06-05  7:16   ` Christoph Hellwig
2023-06-02 22:24 ` [PATCH v2 7/7] iomap: Copy larger chunks from userspace Matthew Wilcox (Oracle)
2023-06-04 18:29   ` Darrick J. Wong
2023-06-04 22:11     ` Matthew Wilcox
2023-06-05  8:25       ` Yin, Fengwei
2023-06-06 18:07         ` Matthew Wilcox
2023-06-07  2:21           ` Yin Fengwei
2023-06-07  5:33             ` Yin, Fengwei
2023-06-07 15:55             ` Matthew Wilcox
2023-06-08  1:22               ` Yin Fengwei
2023-06-07  6:40           ` Yin Fengwei
2023-06-07 15:56             ` Matthew Wilcox
2023-06-04  0:19 ` [PATCH v2 0/7] Create large folios in iomap buffered write path Wang Yugui
