* [RFC v6 00/10] Turn iomap_page_ops into iomap_folio_ops
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC
  To: Christoph Hellwig, Darrick J . Wong, Alexander Viro, Matthew Wilcox
  Cc: Andreas Gruenbacher, linux-xfs, linux-fsdevel, linux-ext4, cluster-devel

Here's an updated version of this patch queue.  Changes since v5 [*]:

* A new iomap-internal __iomap_get_folio() helper was added.

* The previous iomap-internal iomap_put_folio() helper was renamed to
  __iomap_put_folio() to mirror __iomap_get_folio().

* The comment describing struct iomap_folio_ops was still referring to
  pages instead of folios in two places; this has been fixed.

Is this good enough for iomap-for-next now?

Thanks,
Andreas

[*] https://lore.kernel.org/linux-xfs/20221231150919.659533-1-agruenba@redhat.com/

Andreas Gruenbacher (10):
  iomap: Add __iomap_put_folio helper
  iomap/gfs2: Unlock and put folio in page_done handler
  iomap: Rename page_done handler to put_folio
  iomap: Add iomap_get_folio helper
  iomap/gfs2: Get page in page_prepare handler
  iomap: Add __iomap_get_folio helper
  iomap: Rename page_prepare handler to get_folio
  iomap/xfs: Eliminate the iomap_valid handler
  iomap: Rename page_ops to folio_ops
  xfs: Make xfs_iomap_folio_ops static

 fs/gfs2/bmap.c         |  38 ++++++++++-----
 fs/iomap/buffered-io.c | 105 +++++++++++++++++++++++------------------
 fs/xfs/xfs_iomap.c     |  41 +++++++++++-----
 include/linux/iomap.h  |  50 +++++++++-----------
 4 files changed, 134 insertions(+), 100 deletions(-)
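
For orientation, here is a sketch of the interface this series converges
on, reconstructed from the hunks below.  The struct itself is only renamed
from iomap_page_ops to iomap_folio_ops in patch 09, so the final name is
an assumption here:

struct iomap_folio_ops {
	/*
	 * Return a locked folio, or an ERR_PTR on failure.  Returning
	 * ERR_PTR(-ESTALE) signals that the cached iomap is no longer
	 * valid and the operation should be retried with a fresh one.
	 */
	struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
			unsigned len);

	/* Responsible for unlocking and putting @folio. */
	void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
			struct folio *folio);
};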

-- 
2.38.1


* [RFC v6 01/10] iomap: Add __iomap_put_folio helper
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC
  To: Christoph Hellwig, Darrick J . Wong, Alexander Viro, Matthew Wilcox
  Cc: Andreas Gruenbacher, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

Add an __iomap_put_folio() helper to encapsulate unlocking the folio,
calling ->page_done(), and putting the folio.  Use the new helper in
iomap_write_begin() and iomap_write_end().

No functional change intended; this prepares for the improvements that
follow.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 356193e44cf0..c045689b6af8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -575,6 +575,19 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	return 0;
 }
 
+static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
+		struct folio *folio)
+{
+	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
+
+	if (folio)
+		folio_unlock(folio);
+	if (page_ops && page_ops->page_done)
+		page_ops->page_done(iter->inode, pos, ret, &folio->page);
+	if (folio)
+		folio_put(folio);
+}
+
 static int iomap_write_begin_inline(const struct iomap_iter *iter,
 		struct folio *folio)
 {
@@ -616,7 +629,8 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 			fgp, mapping_gfp_mask(iter->inode->i_mapping));
 	if (!folio) {
 		status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
-		goto out_no_page;
+		__iomap_put_folio(iter, pos, 0, NULL);
+		return status;
 	}
 
 	/*
@@ -656,13 +670,9 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 	return 0;
 
 out_unlock:
-	folio_unlock(folio);
-	folio_put(folio);
+	__iomap_put_folio(iter, pos, 0, folio);
 	iomap_write_failed(iter->inode, pos, len);
 
-out_no_page:
-	if (page_ops && page_ops->page_done)
-		page_ops->page_done(iter->inode, pos, 0, NULL);
 	return status;
 }
 
@@ -712,7 +722,6 @@ static size_t iomap_write_end_inline(const struct iomap_iter *iter,
 static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
-	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	loff_t old_size = iter->inode->i_size;
 	size_t ret;
@@ -735,14 +744,10 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 		i_size_write(iter->inode, pos + ret);
 		iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
 	}
-	folio_unlock(folio);
+	__iomap_put_folio(iter, pos, ret, folio);
 
 	if (old_size < pos)
 		pagecache_isize_extended(iter->inode, old_size, pos);
-	if (page_ops && page_ops->page_done)
-		page_ops->page_done(iter->inode, pos, ret, &folio->page);
-	folio_put(folio);
-
 	if (ret < len)
 		iomap_write_failed(iter->inode, pos + ret, len - ret);
 	return ret;
-- 
2.38.1


* [RFC v6 02/10] iomap/gfs2: Unlock and put folio in page_done handler
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC
  To: Christoph Hellwig, Darrick J . Wong, Alexander Viro, Matthew Wilcox
  Cc: Andreas Gruenbacher, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

When an iomap defines a ->page_done() handler in its page_ops, delegate
unlocking the folio and putting the folio reference to that handler.

This makes it possible to fix a race between journaled data writes and
folio writeback in gfs2: before this change, gfs2_iomap_page_done() was
called after unlocking the folio, so writeback could start writing back
the folio's buffers before they could be marked for writing to the
journal.  Also, try_to_free_buffers() could free the buffers before
gfs2_iomap_page_done() was done adding them to the current transaction.
With this change, gfs2_iomap_page_done() adds the buffers to the current
transaction while the folio is still locked, so the problems described
above can no longer occur.

The only current user of ->page_done() is gfs2, so other filesystems are
not affected.  To catch any out-of-tree users, switch the ->page_done()
handler from taking a page to taking a folio.
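
Under the new contract, the handler itself performs the unlock and put.
A minimal sketch for a hypothetical filesystem (gfs2's actual handler is
in the diff below; example_undo_prepare() stands in for whatever cleanup
a filesystem does when no folio was obtained):

static void example_iomap_page_done(struct inode *inode, loff_t pos,
				    unsigned copied, struct folio *folio)
{
	if (!folio) {
		/* no folio was obtained; only undo the preparation */
		example_undo_prepare(inode);
		return;
	}

	/* account for the written range while the folio is still locked */

	folio_unlock(folio);
	folio_put(folio);
}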

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/gfs2/bmap.c         | 15 ++++++++++++---
 fs/iomap/buffered-io.c |  8 ++++----
 include/linux/iomap.h  |  7 ++++---
 3 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index e7537fd305dd..46206286ad42 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -968,14 +968,23 @@ static int gfs2_iomap_page_prepare(struct inode *inode, loff_t pos,
 }
 
 static void gfs2_iomap_page_done(struct inode *inode, loff_t pos,
-				 unsigned copied, struct page *page)
+				 unsigned copied, struct folio *folio)
 {
 	struct gfs2_trans *tr = current->journal_info;
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 
-	if (page && !gfs2_is_stuffed(ip))
-		gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
+	if (!folio) {
+		gfs2_trans_end(sdp);
+		return;
+	}
+
+	if (!gfs2_is_stuffed(ip))
+		gfs2_page_add_databufs(ip, &folio->page, offset_in_page(pos),
+				       copied);
+
+	folio_unlock(folio);
+	folio_put(folio);
 
 	if (tr->tr_num_buf_new)
 		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index c045689b6af8..a9082078e4ed 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -580,12 +580,12 @@ static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
 {
 	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
 
-	if (folio)
+	if (page_ops && page_ops->page_done) {
+		page_ops->page_done(iter->inode, pos, ret, folio);
+	} else if (folio) {
 		folio_unlock(folio);
-	if (page_ops && page_ops->page_done)
-		page_ops->page_done(iter->inode, pos, ret, &folio->page);
-	if (folio)
 		folio_put(folio);
+	}
 }
 
 static int iomap_write_begin_inline(const struct iomap_iter *iter,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0983dfc9a203..743e2a909162 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -131,13 +131,14 @@ static inline bool iomap_inline_data_valid(const struct iomap *iomap)
  * associated with them.
  *
  * When page_prepare succeeds, page_done will always be called to do any
- * cleanup work necessary.  In that page_done call, @page will be NULL if the
- * associated page could not be obtained.
+ * cleanup work necessary.  In that page_done call, @folio will be NULL if the
+ * associated folio could not be obtained.  When folio is not NULL, page_done
+ * is responsible for unlocking and putting the folio.
  */
 struct iomap_page_ops {
 	int (*page_prepare)(struct inode *inode, loff_t pos, unsigned len);
 	void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
-			struct page *page);
+			struct folio *folio);
 
 	/*
 	 * Check that the cached iomap still maps correctly to the filesystem's
-- 
2.38.1


* [RFC v6 03/10] iomap: Rename page_done handler to put_folio
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC
  To: Christoph Hellwig, Darrick J . Wong, Alexander Viro, Matthew Wilcox
  Cc: Andreas Gruenbacher, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

The ->page_done() handler in struct iomap_page_ops is now somewhat
misnamed in that it mainly deals with unlocking and putting a folio, so
rename it to ->put_folio().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/gfs2/bmap.c         |  4 ++--
 fs/iomap/buffered-io.c |  4 ++--
 include/linux/iomap.h  | 12 ++++++------
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 46206286ad42..0c041459677b 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -967,7 +967,7 @@ static int gfs2_iomap_page_prepare(struct inode *inode, loff_t pos,
 	return gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
 }
 
-static void gfs2_iomap_page_done(struct inode *inode, loff_t pos,
+static void gfs2_iomap_put_folio(struct inode *inode, loff_t pos,
 				 unsigned copied, struct folio *folio)
 {
 	struct gfs2_trans *tr = current->journal_info;
@@ -994,7 +994,7 @@ static void gfs2_iomap_page_done(struct inode *inode, loff_t pos,
 
 static const struct iomap_page_ops gfs2_iomap_page_ops = {
 	.page_prepare = gfs2_iomap_page_prepare,
-	.page_done = gfs2_iomap_page_done,
+	.put_folio = gfs2_iomap_put_folio,
 };
 
 static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index a9082078e4ed..d4b444e44861 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -580,8 +580,8 @@ static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
 {
 	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
 
-	if (page_ops && page_ops->page_done) {
-		page_ops->page_done(iter->inode, pos, ret, folio);
+	if (page_ops && page_ops->put_folio) {
+		page_ops->put_folio(iter->inode, pos, ret, folio);
 	} else if (folio) {
 		folio_unlock(folio);
 		folio_put(folio);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 743e2a909162..ecf815b34d51 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -126,18 +126,18 @@ static inline bool iomap_inline_data_valid(const struct iomap *iomap)
 
 /*
  * When a filesystem sets page_ops in an iomap mapping it returns, page_prepare
- * and page_done will be called for each page written to.  This only applies to
- * buffered writes as unbuffered writes will not typically have pages
+ * and put_folio will be called for each folio written to.  This only applies
+ * to buffered writes as unbuffered writes will not typically have folios
  * associated with them.
  *
- * When page_prepare succeeds, page_done will always be called to do any
- * cleanup work necessary.  In that page_done call, @folio will be NULL if the
- * associated folio could not be obtained.  When folio is not NULL, page_done
+ * When page_prepare succeeds, put_folio will always be called to do any
+ * cleanup work necessary.  In that put_folio call, @folio will be NULL if the
+ * associated folio could not be obtained.  When folio is not NULL, put_folio
  * is responsible for unlocking and putting the folio.
  */
 struct iomap_page_ops {
 	int (*page_prepare)(struct inode *inode, loff_t pos, unsigned len);
-	void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
+	void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
 			struct folio *folio);
 
 	/*
-- 
2.38.1


* [RFC v6 04/10] iomap: Add iomap_get_folio helper
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC
  To: Christoph Hellwig, Darrick J . Wong, Alexander Viro, Matthew Wilcox
  Cc: Andreas Gruenbacher, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

Add an iomap_get_folio() helper that gets a folio reference based on
an iomap iterator and an offset into the address space.  Use it in
iomap_write_begin().
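
A usage sketch: unlike a bare __filemap_get_folio() call, iomap_get_folio()
never returns NULL; failure is reported as an error pointer (-EAGAIN under
IOMAP_NOWAIT, -ENOMEM otherwise), so callers test with IS_ERR().  The
caller below is hypothetical:

static int example_use_folio(struct iomap_iter *iter, loff_t pos)
{
	struct folio *folio = iomap_get_folio(iter, pos);

	if (IS_ERR(folio))
		return PTR_ERR(folio);	/* -EAGAIN or -ENOMEM */

	/* ... operate on the locked folio ... */

	folio_unlock(folio);
	folio_put(folio);
	return 0;
}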

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 39 ++++++++++++++++++++++++++++++---------
 include/linux/iomap.h  |  1 +
 2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d4b444e44861..de4a8e5f721a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -457,6 +457,33 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 }
 EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
 
+/**
+ * iomap_get_folio - get a folio reference for writing
+ * @iter: iteration structure
+ * @pos: start offset of write
+ *
+ * Returns a locked reference to the folio at @pos, or an error pointer if the
+ * folio could not be obtained.
+ */
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
+{
+	unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
+	struct folio *folio;
+
+	if (iter->flags & IOMAP_NOWAIT)
+		fgp |= FGP_NOWAIT;
+
+	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
+			fgp, mapping_gfp_mask(iter->inode->i_mapping));
+	if (folio)
+		return folio;
+
+	if (iter->flags & IOMAP_NOWAIT)
+		return ERR_PTR(-EAGAIN);
+	return ERR_PTR(-ENOMEM);
+}
+EXPORT_SYMBOL_GPL(iomap_get_folio);
+
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
 {
 	trace_iomap_release_folio(folio->mapping->host, folio_pos(folio),
@@ -603,12 +630,8 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	struct folio *folio;
-	unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
 	int status = 0;
 
-	if (iter->flags & IOMAP_NOWAIT)
-		fgp |= FGP_NOWAIT;
-
 	BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
 	if (srcmap != &iter->iomap)
 		BUG_ON(pos + len > srcmap->offset + srcmap->length);
@@ -625,12 +648,10 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 			return status;
 	}
 
-	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
-			fgp, mapping_gfp_mask(iter->inode->i_mapping));
-	if (!folio) {
-		status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
+	folio = iomap_get_folio(iter, pos);
+	if (IS_ERR(folio)) {
 		__iomap_put_folio(iter, pos, 0, NULL);
-		return status;
+		return PTR_ERR(folio);
 	}
 
 	/*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index ecf815b34d51..188d14e786a4 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -261,6 +261,7 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
 void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
-- 
2.38.1


* [RFC v6 05/10] iomap/gfs2: Get page in page_prepare handler
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC
  To: Christoph Hellwig, Darrick J . Wong, Alexander Viro, Matthew Wilcox
  Cc: Andreas Gruenbacher, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

Change the iomap ->page_prepare() handler to get and return a locked
folio instead of doing that in iomap_write_begin().  This makes it
possible to recover from out-of-memory situations in ->page_prepare(),
which eliminates the corresponding error handling code in
iomap_write_begin().  In addition, the ->put_folio() handler is no
longer called with a NULL folio.

Filesystems are expected to use the iomap_get_folio() helper for getting
locked folios in their ->page_prepare() handlers.
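
A minimal sketch of such a ->page_prepare() handler for a hypothetical
filesystem (gfs2's actual implementation follows in the diff below;
example_begin_write() and example_end_write() stand in for per-write
setup and teardown such as starting and ending a transaction):

static struct folio *
example_iomap_page_prepare(struct iomap_iter *iter, loff_t pos,
			   unsigned len)
{
	struct folio *folio;
	int error;

	error = example_begin_write(iter->inode, pos, len);
	if (error)
		return ERR_PTR(error);

	folio = iomap_get_folio(iter, pos);
	if (IS_ERR(folio))
		example_end_write(iter->inode);	/* undo the setup */
	return folio;
}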

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/gfs2/bmap.c         | 21 +++++++++++++--------
 fs/iomap/buffered-io.c | 17 ++++++-----------
 include/linux/iomap.h  |  9 +++++----
 3 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 0c041459677b..41349e09558b 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -956,15 +956,25 @@ static int __gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,
 	goto out;
 }
 
-static int gfs2_iomap_page_prepare(struct inode *inode, loff_t pos,
-				   unsigned len)
+static struct folio *
+gfs2_iomap_page_prepare(struct iomap_iter *iter, loff_t pos, unsigned len)
 {
+	struct inode *inode = iter->inode;
 	unsigned int blockmask = i_blocksize(inode) - 1;
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	unsigned int blocks;
+	struct folio *folio;
+	int status;
 
 	blocks = ((pos & blockmask) + len + blockmask) >> inode->i_blkbits;
-	return gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
+	status = gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
+	if (status)
+		return ERR_PTR(status);
+
+	folio = iomap_get_folio(iter, pos);
+	if (IS_ERR(folio))
+		gfs2_trans_end(sdp);
+	return folio;
 }
 
 static void gfs2_iomap_put_folio(struct inode *inode, loff_t pos,
@@ -974,11 +984,6 @@ static void gfs2_iomap_put_folio(struct inode *inode, loff_t pos,
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 
-	if (!folio) {
-		gfs2_trans_end(sdp);
-		return;
-	}
-
 	if (!gfs2_is_stuffed(ip))
 		gfs2_page_add_databufs(ip, &folio->page, offset_in_page(pos),
 				       copied);
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index de4a8e5f721a..418519dea2ce 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -609,7 +609,7 @@ static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
 
 	if (page_ops && page_ops->put_folio) {
 		page_ops->put_folio(iter->inode, pos, ret, folio);
-	} else if (folio) {
+	} else {
 		folio_unlock(folio);
 		folio_put(folio);
 	}
@@ -642,17 +642,12 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 	if (!mapping_large_folio_support(iter->inode->i_mapping))
 		len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
 
-	if (page_ops && page_ops->page_prepare) {
-		status = page_ops->page_prepare(iter->inode, pos, len);
-		if (status)
-			return status;
-	}
-
-	folio = iomap_get_folio(iter, pos);
-	if (IS_ERR(folio)) {
-		__iomap_put_folio(iter, pos, 0, NULL);
+	if (page_ops && page_ops->page_prepare)
+		folio = page_ops->page_prepare(iter, pos, len);
+	else
+		folio = iomap_get_folio(iter, pos);
+	if (IS_ERR(folio))
 		return PTR_ERR(folio);
-	}
 
 	/*
 	 * Now we have a locked folio, before we do anything with it we need to
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 188d14e786a4..d50501781856 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -13,6 +13,7 @@
 struct address_space;
 struct fiemap_extent_info;
 struct inode;
+struct iomap_iter;
 struct iomap_dio;
 struct iomap_writepage_ctx;
 struct iov_iter;
@@ -131,12 +132,12 @@ static inline bool iomap_inline_data_valid(const struct iomap *iomap)
  * associated with them.
  *
  * When page_prepare succeeds, put_folio will always be called to do any
- * cleanup work necessary.  In that put_folio call, @folio will be NULL if the
- * associated folio could not be obtained.  When folio is not NULL, put_folio
- * is responsible for unlocking and putting the folio.
+ * cleanup work necessary.  put_folio is responsible for unlocking and putting
+ * @folio.
  */
 struct iomap_page_ops {
-	int (*page_prepare)(struct inode *inode, loff_t pos, unsigned len);
+	struct folio *(*page_prepare)(struct iomap_iter *iter, loff_t pos,
+			unsigned len);
 	void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
 			struct folio *folio);
 
-- 
2.38.1


* [RFC v6 06/10] iomap: Add __iomap_get_folio helper
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC
  To: Christoph Hellwig, Darrick J . Wong, Alexander Viro, Matthew Wilcox
  Cc: Andreas Gruenbacher, linux-xfs, linux-fsdevel, linux-ext4, cluster-devel

Add an __iomap_get_folio() helper as the counterpart of the existing
__iomap_put_folio() helper.  Use the new helper in iomap_write_begin().
Not a functional change.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/buffered-io.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 418519dea2ce..666107c3a385 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -602,6 +602,17 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	return 0;
 }
 
+static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
+		size_t len)
+{
+	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
+
+	if (page_ops && page_ops->page_prepare)
+		return page_ops->page_prepare(iter, pos, len);
+	else
+		return iomap_get_folio(iter, pos);
+}
+
 static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
 		struct folio *folio)
 {
@@ -642,10 +653,7 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 	if (!mapping_large_folio_support(iter->inode->i_mapping))
 		len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
 
-	if (page_ops && page_ops->page_prepare)
-		folio = page_ops->page_prepare(iter, pos, len);
-	else
-		folio = iomap_get_folio(iter, pos);
+	folio = __iomap_get_folio(iter, pos, len);
 	if (IS_ERR(folio))
 		return PTR_ERR(folio);
 
-- 
2.38.1


* [RFC v6 07/10] iomap: Rename page_prepare handler to get_folio
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC
  To: Christoph Hellwig, Darrick J . Wong, Alexander Viro, Matthew Wilcox
  Cc: Andreas Gruenbacher, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

The ->page_prepare() handler in struct iomap_page_ops is now somewhat
misnamed, so rename it to ->get_folio().

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/gfs2/bmap.c         | 6 +++---
 fs/iomap/buffered-io.c | 4 ++--
 include/linux/iomap.h  | 6 +++---
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 41349e09558b..d3adb715ac8c 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -957,7 +957,7 @@ static int __gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,
 }
 
 static struct folio *
-gfs2_iomap_page_prepare(struct iomap_iter *iter, loff_t pos, unsigned len)
+gfs2_iomap_get_folio(struct iomap_iter *iter, loff_t pos, unsigned len)
 {
 	struct inode *inode = iter->inode;
 	unsigned int blockmask = i_blocksize(inode) - 1;
@@ -998,7 +998,7 @@ static void gfs2_iomap_put_folio(struct inode *inode, loff_t pos,
 }
 
 static const struct iomap_page_ops gfs2_iomap_page_ops = {
-	.page_prepare = gfs2_iomap_page_prepare,
+	.get_folio = gfs2_iomap_get_folio,
 	.put_folio = gfs2_iomap_put_folio,
 };
 
@@ -1291,7 +1291,7 @@ int gfs2_alloc_extent(struct inode *inode, u64 lblock, u64 *dblock,
 /*
  * NOTE: Never call gfs2_block_zero_range with an open transaction because it
  * uses iomap write to perform its actions, which begin their own transactions
- * (iomap_begin, page_prepare, etc.)
+ * (iomap_begin, get_folio, etc.)
  */
 static int gfs2_block_zero_range(struct inode *inode, loff_t from,
 				 unsigned int length)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 666107c3a385..006ddf933948 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -607,8 +607,8 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
 {
 	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
 
-	if (page_ops && page_ops->page_prepare)
-		return page_ops->page_prepare(iter, pos, len);
+	if (page_ops && page_ops->get_folio)
+		return page_ops->get_folio(iter, pos, len);
 	else
 		return iomap_get_folio(iter, pos);
 }
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index d50501781856..da226032aedc 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -126,17 +126,17 @@ static inline bool iomap_inline_data_valid(const struct iomap *iomap)
 }
 
 /*
- * When a filesystem sets page_ops in an iomap mapping it returns, page_prepare
+ * When a filesystem sets page_ops in an iomap mapping it returns, get_folio
  * and put_folio will be called for each folio written to.  This only applies
  * to buffered writes as unbuffered writes will not typically have folios
  * associated with them.
  *
- * When page_prepare succeeds, put_folio will always be called to do any
+ * When get_folio succeeds, put_folio will always be called to do any
  * cleanup work necessary.  put_folio is responsible for unlocking and putting
  * @folio.
  */
 struct iomap_page_ops {
-	struct folio *(*page_prepare)(struct iomap_iter *iter, loff_t pos,
+	struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
 			unsigned len);
 	void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
 			struct folio *folio);
-- 
2.38.1


* [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC
  To: Christoph Hellwig, Darrick J . Wong, Alexander Viro, Matthew Wilcox
  Cc: Andreas Gruenbacher, linux-xfs, linux-fsdevel, linux-ext4, cluster-devel

Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
handler and validating the mapping there.
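
The resulting pattern, sketched for a hypothetical filesystem (xfs's
actual implementation is in the diff below; example_seq() stands in for a
filesystem-internal sequence-number helper): ->get_folio() obtains the
locked folio first and only then checks the validity cookie, returning
ERR_PTR(-ESTALE) so that iomap_write_begin() can set IOMAP_F_STALE and
retry with a fresh mapping.

static struct folio *
example_get_folio(struct iomap_iter *iter, loff_t pos, unsigned len)
{
	struct folio *folio;

	folio = iomap_get_folio(iter, pos);
	if (IS_ERR(folio))
		return folio;

	/* the extent map may have changed while waiting for the folio lock */
	if (iter->iomap.validity_cookie != example_seq(iter->inode)) {
		folio_unlock(folio);
		folio_put(folio);
		return ERR_PTR(-ESTALE);
	}
	return folio;
}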

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/buffered-io.c | 26 +++++---------------------
 fs/xfs/xfs_iomap.c     | 37 ++++++++++++++++++++++++++-----------
 include/linux/iomap.h  | 23 ++++++-----------------
 3 files changed, 37 insertions(+), 49 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 006ddf933948..72dfbc3cb086 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -638,10 +638,9 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
 static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 		size_t len, struct folio **foliop)
 {
-	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	struct folio *folio;
-	int status = 0;
+	int status;
 
 	BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
 	if (srcmap != &iter->iomap)
@@ -654,27 +653,12 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 		len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
 
 	folio = __iomap_get_folio(iter, pos, len);
-	if (IS_ERR(folio))
-		return PTR_ERR(folio);
-
-	/*
-	 * Now we have a locked folio, before we do anything with it we need to
-	 * check that the iomap we have cached is not stale. The inode extent
-	 * mapping can change due to concurrent IO in flight (e.g.
-	 * IOMAP_UNWRITTEN state can change and memory reclaim could have
-	 * reclaimed a previously partially written page at this index after IO
-	 * completion before this write reaches this file offset) and hence we
-	 * could do the wrong thing here (zero a page range incorrectly or fail
-	 * to zero) and corrupt data.
-	 */
-	if (page_ops && page_ops->iomap_valid) {
-		bool iomap_valid = page_ops->iomap_valid(iter->inode,
-							&iter->iomap);
-		if (!iomap_valid) {
+	if (IS_ERR(folio)) {
+		if (folio == ERR_PTR(-ESTALE)) {
 			iter->iomap.flags |= IOMAP_F_STALE;
-			status = 0;
-			goto out_unlock;
+			return 0;
 		}
+		return PTR_ERR(folio);
 	}
 
 	if (pos + len > folio_pos(folio) + folio_size(folio))
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 669c1bc5c3a7..d0bf99539180 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -62,29 +62,44 @@ xfs_iomap_inode_sequence(
 	return cookie | READ_ONCE(ip->i_df.if_seq);
 }
 
-/*
- * Check that the iomap passed to us is still valid for the given offset and
- * length.
- */
-static bool
-xfs_iomap_valid(
-	struct inode		*inode,
-	const struct iomap	*iomap)
+static struct folio *
+xfs_get_folio(
+	struct iomap_iter	*iter,
+	loff_t			pos,
+	unsigned		len)
 {
+	struct inode		*inode = iter->inode;
+	struct iomap		*iomap = &iter->iomap;
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct folio		*folio;
 
+	folio = iomap_get_folio(iter, pos);
+	if (IS_ERR(folio))
+		return folio;
+
+	/*
+	 * Now that we have a locked folio, we need to check that the iomap we
+	 * have cached is not stale.  The inode extent mapping can change due to
+	 * concurrent IO in flight (e.g., IOMAP_UNWRITTEN state can change and
+	 * memory reclaim could have reclaimed a previously partially written
+	 * page at this index after IO completion before this write reaches
+	 * this file offset) and hence we could do the wrong thing here (zero a
+	 * page range incorrectly or fail to zero) and corrupt data.
+	 */
 	if (iomap->validity_cookie !=
 			xfs_iomap_inode_sequence(ip, iomap->flags)) {
 		trace_xfs_iomap_invalid(ip, iomap);
-		return false;
+		folio_unlock(folio);
+		folio_put(folio);
+		return ERR_PTR(-ESTALE);
 	}
 
 	XFS_ERRORTAG_DELAY(ip->i_mount, XFS_ERRTAG_WRITE_DELAY_MS);
-	return true;
+	return folio;
 }
 
 const struct iomap_page_ops xfs_iomap_page_ops = {
-	.iomap_valid		= xfs_iomap_valid,
+	.get_folio		= xfs_get_folio,
 };
 
 int
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index da226032aedc..0ae2cddbedd6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -134,29 +134,18 @@ static inline bool iomap_inline_data_valid(const struct iomap *iomap)
  * When get_folio succeeds, put_folio will always be called to do any
  * cleanup work necessary.  put_folio is responsible for unlocking and putting
  * @folio.
+ *
+ * When an iomap is created, the filesystem can store internal state (e.g., a
+ * sequence number) in iomap->validity_cookie.  The get_folio handler can use
+ * this validity cookie to detect when the iomap needs to be refreshed because
+ * it is no longer up to date.  In that case, the function should return
+ * ERR_PTR(-ESTALE) to retry the operation with a fresh mapping.
  */
 struct iomap_page_ops {
 	struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
 			unsigned len);
 	void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
 			struct folio *folio);
-
-	/*
-	 * Check that the cached iomap still maps correctly to the filesystem's
-	 * internal extent map. FS internal extent maps can change while iomap
-	 * is iterating a cached iomap, so this hook allows iomap to detect that
-	 * the iomap needs to be refreshed during a long running write
-	 * operation.
-	 *
-	 * The filesystem can store internal state (e.g. a sequence number) in
-	 * iomap->validity_cookie when the iomap is first mapped to be able to
-	 * detect changes between mapping time and whenever .iomap_valid() is
-	 * called.
-	 *
-	 * This is called with the folio over the specified file position held
-	 * locked by the iomap code.
-	 */
-	bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
 };
 
 /*
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
@ 2023-01-08 19:40   ` Andreas Gruenbacher
  0 siblings, 0 replies; 82+ messages in thread
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
handler and validating the mapping there.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/iomap/buffered-io.c | 26 +++++---------------------
 fs/xfs/xfs_iomap.c     | 37 ++++++++++++++++++++++++++-----------
 include/linux/iomap.h  | 23 ++++++-----------------
 3 files changed, 37 insertions(+), 49 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 006ddf933948..72dfbc3cb086 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -638,10 +638,9 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
 static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 		size_t len, struct folio **foliop)
 {
-	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	struct folio *folio;
-	int status = 0;
+	int status;
 
 	BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
 	if (srcmap != &iter->iomap)
@@ -654,27 +653,12 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 		len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
 
 	folio = __iomap_get_folio(iter, pos, len);
-	if (IS_ERR(folio))
-		return PTR_ERR(folio);
-
-	/*
-	 * Now we have a locked folio, before we do anything with it we need to
-	 * check that the iomap we have cached is not stale. The inode extent
-	 * mapping can change due to concurrent IO in flight (e.g.
-	 * IOMAP_UNWRITTEN state can change and memory reclaim could have
-	 * reclaimed a previously partially written page at this index after IO
-	 * completion before this write reaches this file offset) and hence we
-	 * could do the wrong thing here (zero a page range incorrectly or fail
-	 * to zero) and corrupt data.
-	 */
-	if (page_ops && page_ops->iomap_valid) {
-		bool iomap_valid = page_ops->iomap_valid(iter->inode,
-							&iter->iomap);
-		if (!iomap_valid) {
+	if (IS_ERR(folio)) {
+		if (folio == ERR_PTR(-ESTALE)) {
 			iter->iomap.flags |= IOMAP_F_STALE;
-			status = 0;
-			goto out_unlock;
+			return 0;
 		}
+		return PTR_ERR(folio);
 	}
 
 	if (pos + len > folio_pos(folio) + folio_size(folio))
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 669c1bc5c3a7..d0bf99539180 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -62,29 +62,44 @@ xfs_iomap_inode_sequence(
 	return cookie | READ_ONCE(ip->i_df.if_seq);
 }
 
-/*
- * Check that the iomap passed to us is still valid for the given offset and
- * length.
- */
-static bool
-xfs_iomap_valid(
-	struct inode		*inode,
-	const struct iomap	*iomap)
+static struct folio *
+xfs_get_folio(
+	struct iomap_iter	*iter,
+	loff_t			pos,
+	unsigned		len)
 {
+	struct inode		*inode = iter->inode;
+	struct iomap		*iomap = &iter->iomap;
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct folio		*folio;
 
+	folio = iomap_get_folio(iter, pos);
+	if (IS_ERR(folio))
+		return folio;
+
+	/*
+	 * Now that we have a locked folio, we need to check that the iomap we
+	 * have cached is not stale.  The inode extent mapping can change due to
+	 * concurrent IO in flight (e.g., IOMAP_UNWRITTEN state can change and
+	 * memory reclaim could have reclaimed a previously partially written
+	 * page at this index after IO completion before this write reaches
+	 * this file offset) and hence we could do the wrong thing here (zero a
+	 * page range incorrectly or fail to zero) and corrupt data.
+	 */
 	if (iomap->validity_cookie !=
 			xfs_iomap_inode_sequence(ip, iomap->flags)) {
 		trace_xfs_iomap_invalid(ip, iomap);
-		return false;
+		folio_unlock(folio);
+		folio_put(folio);
+		return ERR_PTR(-ESTALE);
 	}
 
 	XFS_ERRORTAG_DELAY(ip->i_mount, XFS_ERRTAG_WRITE_DELAY_MS);
-	return true;
+	return folio;
 }
 
 const struct iomap_page_ops xfs_iomap_page_ops = {
-	.iomap_valid		= xfs_iomap_valid,
+	.get_folio		= xfs_get_folio,
 };
 
 int
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index da226032aedc..0ae2cddbedd6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -134,29 +134,18 @@ static inline bool iomap_inline_data_valid(const struct iomap *iomap)
  * When get_folio succeeds, put_folio will always be called to do any
  * cleanup work necessary.  put_folio is responsible for unlocking and putting
  * @folio.
+ *
+ * When an iomap is created, the filesystem can store internal state (e.g., a
+ * sequence number) in iomap->validity_cookie.  The get_folio handler can use
+ * this validity cookie to detect when the iomap needs to be refreshed because
+ * it is no longer up to date.  In that case, the function should return
+ * ERR_PTR(-ESTALE) to retry the operation with a fresh mapping.
  */
 struct iomap_page_ops {
 	struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
 			unsigned len);
 	void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
 			struct folio *folio);
-
-	/*
-	 * Check that the cached iomap still maps correctly to the filesystem's
-	 * internal extent map. FS internal extent maps can change while iomap
-	 * is iterating a cached iomap, so this hook allows iomap to detect that
-	 * the iomap needs to be refreshed during a long running write
-	 * operation.
-	 *
-	 * The filesystem can store internal state (e.g. a sequence number) in
-	 * iomap->validity_cookie when the iomap is first mapped to be able to
-	 * detect changes between mapping time and whenever .iomap_valid() is
-	 * called.
-	 *
-	 * This is called with the folio over the specified file position held
-	 * locked by the iomap code.
-	 */
-	bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
 };
 
 /*
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC v6 09/10] iomap: Rename page_ops to folio_ops
  2023-01-08 19:40 ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-01-08 19:40   ` Andreas Gruenbacher
  -1 siblings, 0 replies; 82+ messages in thread
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC (permalink / raw)
  To: Christoph Hellwig, Darrick J . Wong, Alexander Viro, Matthew Wilcox
  Cc: Andreas Gruenbacher, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

The operations in struct page_ops all operate on folios, so rename
struct page_ops to struct folio_ops.
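
After the rename, a filesystem wires up the handlers like this (a
sketch with assumed myfs_* names, mirroring the gfs2 and xfs hunks
below):

    static const struct iomap_folio_ops myfs_iomap_folio_ops = {
            .get_folio = myfs_get_folio,
            .put_folio = myfs_put_folio,
    };

    /* in ->iomap_begin() */
    iomap->folio_ops = &myfs_iomap_folio_ops;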

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/gfs2/bmap.c         |  4 ++--
 fs/iomap/buffered-io.c | 12 ++++++------
 fs/xfs/xfs_iomap.c     |  4 ++--
 include/linux/iomap.h  |  8 ++++----
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index d3adb715ac8c..e191ecfb1fde 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -997,7 +997,7 @@ static void gfs2_iomap_put_folio(struct inode *inode, loff_t pos,
 	gfs2_trans_end(sdp);
 }
 
-static const struct iomap_page_ops gfs2_iomap_page_ops = {
+static const struct iomap_folio_ops gfs2_iomap_folio_ops = {
 	.get_folio = gfs2_iomap_get_folio,
 	.put_folio = gfs2_iomap_put_folio,
 };
@@ -1075,7 +1075,7 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	}
 
 	if (gfs2_is_stuffed(ip) || gfs2_is_jdata(ip))
-		iomap->page_ops = &gfs2_iomap_page_ops;
+		iomap->folio_ops = &gfs2_iomap_folio_ops;
 	return 0;
 
 out_trans_end:
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 72dfbc3cb086..dacc7c80b20d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -605,10 +605,10 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
 		size_t len)
 {
-	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
+	const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
 
-	if (page_ops && page_ops->get_folio)
-		return page_ops->get_folio(iter, pos, len);
+	if (folio_ops && folio_ops->get_folio)
+		return folio_ops->get_folio(iter, pos, len);
 	else
 		return iomap_get_folio(iter, pos);
 }
@@ -616,10 +616,10 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
 static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
 		struct folio *folio)
 {
-	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
+	const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
 
-	if (page_ops && page_ops->put_folio) {
-		page_ops->put_folio(iter->inode, pos, ret, folio);
+	if (folio_ops && folio_ops->put_folio) {
+		folio_ops->put_folio(iter->inode, pos, ret, folio);
 	} else {
 		folio_unlock(folio);
 		folio_put(folio);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index d0bf99539180..5bddf31e21eb 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -98,7 +98,7 @@ xfs_get_folio(
 	return folio;
 }
 
-const struct iomap_page_ops xfs_iomap_page_ops = {
+const struct iomap_folio_ops xfs_iomap_folio_ops = {
 	.get_folio		= xfs_get_folio,
 };
 
@@ -148,7 +148,7 @@ xfs_bmbt_to_iomap(
 		iomap->flags |= IOMAP_F_DIRTY;
 
 	iomap->validity_cookie = sequence_cookie;
-	iomap->page_ops = &xfs_iomap_page_ops;
+	iomap->folio_ops = &xfs_iomap_folio_ops;
 	return 0;
 }
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0ae2cddbedd6..3e6c34b03c89 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -86,7 +86,7 @@ struct vm_fault;
  */
 #define IOMAP_NULL_ADDR -1ULL	/* addr is not valid */
 
-struct iomap_page_ops;
+struct iomap_folio_ops;
 
 struct iomap {
 	u64			addr; /* disk offset of mapping, bytes */
@@ -98,7 +98,7 @@ struct iomap {
 	struct dax_device	*dax_dev; /* dax_dev for dax operations */
 	void			*inline_data;
 	void			*private; /* filesystem private */
-	const struct iomap_page_ops *page_ops;
+	const struct iomap_folio_ops *folio_ops;
 	u64			validity_cookie; /* used with .iomap_valid() */
 };
 
@@ -126,7 +126,7 @@ static inline bool iomap_inline_data_valid(const struct iomap *iomap)
 }
 
 /*
- * When a filesystem sets page_ops in an iomap mapping it returns, get_folio
+ * When a filesystem sets folio_ops in an iomap mapping it returns, get_folio
  * and put_folio will be called for each folio written to.  This only applies
  * to buffered writes as unbuffered writes will not typically have folios
  * associated with them.
@@ -141,7 +141,7 @@ static inline bool iomap_inline_data_valid(const struct iomap *iomap)
  * it is no longer up to date.  In that case, the function should return
  * ERR_PTR(-ESTALE) to retry the operation with a fresh mapping.
  */
-struct iomap_page_ops {
+struct iomap_folio_ops {
 	struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
 			unsigned len);
 	void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Cluster-devel] [RFC v6 09/10] iomap: Rename page_ops to folio_ops
@ 2023-01-08 19:40   ` Andreas Gruenbacher
  0 siblings, 0 replies; 82+ messages in thread
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC (permalink / raw)
  To: cluster-devel.redhat.com

The operations in struct page_ops all operate on folios, so rename
struct page_ops to struct folio_ops.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/gfs2/bmap.c         |  4 ++--
 fs/iomap/buffered-io.c | 12 ++++++------
 fs/xfs/xfs_iomap.c     |  4 ++--
 include/linux/iomap.h  |  8 ++++----
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index d3adb715ac8c..e191ecfb1fde 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -997,7 +997,7 @@ static void gfs2_iomap_put_folio(struct inode *inode, loff_t pos,
 	gfs2_trans_end(sdp);
 }
 
-static const struct iomap_page_ops gfs2_iomap_page_ops = {
+static const struct iomap_folio_ops gfs2_iomap_folio_ops = {
 	.get_folio = gfs2_iomap_get_folio,
 	.put_folio = gfs2_iomap_put_folio,
 };
@@ -1075,7 +1075,7 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 	}
 
 	if (gfs2_is_stuffed(ip) || gfs2_is_jdata(ip))
-		iomap->page_ops = &gfs2_iomap_page_ops;
+		iomap->folio_ops = &gfs2_iomap_folio_ops;
 	return 0;
 
 out_trans_end:
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 72dfbc3cb086..dacc7c80b20d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -605,10 +605,10 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
 		size_t len)
 {
-	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
+	const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
 
-	if (page_ops && page_ops->get_folio)
-		return page_ops->get_folio(iter, pos, len);
+	if (folio_ops && folio_ops->get_folio)
+		return folio_ops->get_folio(iter, pos, len);
 	else
 		return iomap_get_folio(iter, pos);
 }
@@ -616,10 +616,10 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
 static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
 		struct folio *folio)
 {
-	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
+	const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
 
-	if (page_ops && page_ops->put_folio) {
-		page_ops->put_folio(iter->inode, pos, ret, folio);
+	if (folio_ops && folio_ops->put_folio) {
+		folio_ops->put_folio(iter->inode, pos, ret, folio);
 	} else {
 		folio_unlock(folio);
 		folio_put(folio);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index d0bf99539180..5bddf31e21eb 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -98,7 +98,7 @@ xfs_get_folio(
 	return folio;
 }
 
-const struct iomap_page_ops xfs_iomap_page_ops = {
+const struct iomap_folio_ops xfs_iomap_folio_ops = {
 	.get_folio		= xfs_get_folio,
 };
 
@@ -148,7 +148,7 @@ xfs_bmbt_to_iomap(
 		iomap->flags |= IOMAP_F_DIRTY;
 
 	iomap->validity_cookie = sequence_cookie;
-	iomap->page_ops = &xfs_iomap_page_ops;
+	iomap->folio_ops = &xfs_iomap_folio_ops;
 	return 0;
 }
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0ae2cddbedd6..3e6c34b03c89 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -86,7 +86,7 @@ struct vm_fault;
  */
 #define IOMAP_NULL_ADDR -1ULL	/* addr is not valid */
 
-struct iomap_page_ops;
+struct iomap_folio_ops;
 
 struct iomap {
 	u64			addr; /* disk offset of mapping, bytes */
@@ -98,7 +98,7 @@ struct iomap {
 	struct dax_device	*dax_dev; /* dax_dev for dax operations */
 	void			*inline_data;
 	void			*private; /* filesystem private */
-	const struct iomap_page_ops *page_ops;
+	const struct iomap_folio_ops *folio_ops;
 	u64			validity_cookie; /* used with .iomap_valid() */
 };
 
@@ -126,7 +126,7 @@ static inline bool iomap_inline_data_valid(const struct iomap *iomap)
 }
 
 /*
- * When a filesystem sets page_ops in an iomap mapping it returns, get_folio
+ * When a filesystem sets folio_ops in an iomap mapping it returns, get_folio
  * and put_folio will be called for each folio written to.  This only applies
  * to buffered writes as unbuffered writes will not typically have folios
  * associated with them.
@@ -141,7 +141,7 @@ static inline bool iomap_inline_data_valid(const struct iomap *iomap)
  * it is no longer up to date.  In that case, the function should return
  * ERR_PTR(-ESTALE) to retry the operation with a fresh mapping.
  */
-struct iomap_page_ops {
+struct iomap_folio_ops {
 	struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
 			unsigned len);
 	void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC v6 10/10] xfs: Make xfs_iomap_folio_ops static
  2023-01-08 19:40 ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-01-08 19:40   ` Andreas Gruenbacher
  -1 siblings, 0 replies; 82+ messages in thread
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC (permalink / raw)
  To: Christoph Hellwig, Darrick J . Wong, Alexander Viro, Matthew Wilcox
  Cc: Andreas Gruenbacher, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

Variable xfs_iomap_folio_ops isn't used outside xfs_iomap.c, so it
should be static.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5bddf31e21eb..7d1795a9c742 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -98,7 +98,7 @@ xfs_get_folio(
 	return folio;
 }
 
-const struct iomap_folio_ops xfs_iomap_folio_ops = {
+static const struct iomap_folio_ops xfs_iomap_folio_ops = {
 	.get_folio		= xfs_get_folio,
 };
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Cluster-devel] [RFC v6 10/10] xfs: Make xfs_iomap_folio_ops static
@ 2023-01-08 19:40   ` Andreas Gruenbacher
  0 siblings, 0 replies; 82+ messages in thread
From: Andreas Gruenbacher @ 2023-01-08 19:40 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Variable xfs_iomap_folio_ops isn't used outside xfs_iomap.c, so it
should be static.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5bddf31e21eb..7d1795a9c742 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -98,7 +98,7 @@ xfs_get_folio(
 	return folio;
 }
 
-const struct iomap_folio_ops xfs_iomap_folio_ops = {
+static const struct iomap_folio_ops xfs_iomap_folio_ops = {
 	.get_folio		= xfs_get_folio,
 };
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-01-08 21:33     ` Dave Chinner
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2023-01-08 21:33 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, Darrick J . Wong, Alexander Viro,
	Matthew Wilcox, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

On Sun, Jan 08, 2023 at 08:40:28PM +0100, Andreas Gruenbacher wrote:
> Add an iomap_get_folio() helper that gets a folio reference based on
> an iomap iterator and an offset into the address space.  Use it in
> iomap_write_begin().
> 
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/buffered-io.c | 39 ++++++++++++++++++++++++++++++---------
>  include/linux/iomap.h  |  1 +
>  2 files changed, 31 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d4b444e44861..de4a8e5f721a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -457,6 +457,33 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
>  }
>  EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>  
> +/**
> + * iomap_get_folio - get a folio reference for writing
> + * @iter: iteration structure
> + * @pos: start offset of write
> + *
> + * Returns a locked reference to the folio at @pos, or an error pointer if the
> + * folio could not be obtained.
> + */
> +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> +{
> +	unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
> +	struct folio *folio;
> +
> +	if (iter->flags & IOMAP_NOWAIT)
> +		fgp |= FGP_NOWAIT;
> +
> +	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> +			fgp, mapping_gfp_mask(iter->inode->i_mapping));
> +	if (folio)
> +		return folio;
> +
> +	if (iter->flags & IOMAP_NOWAIT)
> +		return ERR_PTR(-EAGAIN);
> +	return ERR_PTR(-ENOMEM);
> +}
> +EXPORT_SYMBOL_GPL(iomap_get_folio);

Hmmmm.

This is where things start to get complex. I have sent a patch to
fix a problem with iomap_zero_range() failing to zero cached dirty
pages over UNWRITTEN extents, and that requires making FGP_CREAT
optional. This is an iomap bug, and needs to be fixed in the core
iomap code:

https://lore.kernel.org/linux-xfs/20221201005214.3836105-1-david@fromorbit.com/

Essentially, we need to pass fgp flags to iomap_write_begin()
so the callers can supply a 0 or FGP_CREAT appropriately. This
allows iomap_write_begin() to act only on pre-cached pages rather
than always instantiating a new page if one does not exist in cache.

This allows iomap_write_begin() to return a NULL folio
successfully, and this is perfectly OK for callers that pass in fgp
= 0 as they are expected to handle a NULL folio return indicating
there was no cached data over the range...
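
Roughly, the caller side would then look like this (a sketch against
the linked patch; the fgp argument to iomap_write_begin() is not in
mainline):

    status = iomap_write_begin(iter, pos, bytes, 0, &folio);
    if (status)
            return status;
    if (!folio) {
            /*
             * fgp == 0 means FGP_CREAT was not set: there is no folio
             * cached over this range, so skip ahead rather than
             * instantiating a new one.
             */
            goto next_range;        /* hypothetical label */
    }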

Exposing the folio allocation as an external interface makes bug
fixes like this rather messy - it's taking a core abstraction (iomap
hides all the folio and page cache manipulations from the
filesystem) and punching a big hole in it by requiring filesystems
to actually allocate page cache folios on behalf of the iomap
core.

Given that I recently got major push-back for fixing an XFS-only bug
by walking the page cache directly instead of abstracting it via the
iomap core, punching an even bigger hole in the abstraction layer to
fix a GFS2-only problem is just as bad....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Cluster-devel] [RFC v6 04/10] iomap: Add iomap_get_folio helper
@ 2023-01-08 21:33     ` Dave Chinner
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2023-01-08 21:33 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Sun, Jan 08, 2023 at 08:40:28PM +0100, Andreas Gruenbacher wrote:
> Add an iomap_get_folio() helper that gets a folio reference based on
> an iomap iterator and an offset into the address space.  Use it in
> iomap_write_begin().
> 
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/buffered-io.c | 39 ++++++++++++++++++++++++++++++---------
>  include/linux/iomap.h  |  1 +
>  2 files changed, 31 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d4b444e44861..de4a8e5f721a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -457,6 +457,33 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
>  }
>  EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>  
> +/**
> + * iomap_get_folio - get a folio reference for writing
> + * @iter: iteration structure
> + * @pos: start offset of write
> + *
> + * Returns a locked reference to the folio at @pos, or an error pointer if the
> + * folio could not be obtained.
> + */
> +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> +{
> +	unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
> +	struct folio *folio;
> +
> +	if (iter->flags & IOMAP_NOWAIT)
> +		fgp |= FGP_NOWAIT;
> +
> +	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> +			fgp, mapping_gfp_mask(iter->inode->i_mapping));
> +	if (folio)
> +		return folio;
> +
> +	if (iter->flags & IOMAP_NOWAIT)
> +		return ERR_PTR(-EAGAIN);
> +	return ERR_PTR(-ENOMEM);
> +}
> +EXPORT_SYMBOL_GPL(iomap_get_folio);

Hmmmm.

This is where things start to get complex. I have sent a patch to
fix a problem with iomap_zero_range() failing to zero cached dirty
pages over UNWRITTEN extents, and that requires making FGP_CREAT
optional. This is an iomap bug, and needs to be fixed in the core
iomap code:

https://lore.kernel.org/linux-xfs/20221201005214.3836105-1-david at fromorbit.com/

Essentially, we need to pass fgp flags to iomap_write_begin()
so the callers can supply a 0 or FGP_CREAT appropriately. This
allows iomap_write_begin() to act only on pre-cached pages rather
than always instantiating a new page if one does not exist in cache.

This allows iomap_write_begin() to return a NULL folio
successfully, and this is perfectly OK for callers that pass in fgp
= 0 as they are expected to handle a NULL folio return indicating
there was no cached data over the range...

Exposing the folio allocation as an external interface makes bug
fixes like this rather messy - it's taking a core abstraction (iomap
hides all the folio and page cache manipulations from the
filesystem) and punching a big hole in it by requiring filesystems
to actually allocate page cache folios on behalf of the iomap
core.

Given that I recently got major push-back for fixing an XFS-only bug
by walking the page cache directly instead of abstracting it via the
iomap core, punching an even bigger hole in the abstraction layer to
fix a GFS2-only problem is just as bad....

-Dave.
-- 
Dave Chinner
david at fromorbit.com


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
  2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-01-08 21:59     ` Dave Chinner
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2023-01-08 21:59 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, Darrick J . Wong, Alexander Viro,
	Matthew Wilcox, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel

On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> handler and validating the mapping there.
> 
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

I think this is wrong.

The ->iomap_valid() function handles a fundamental architectural
issue with cached iomaps: the iomap can become stale at any time
whilst it is in use by the iomap core code.

The current problem it solves in the iomap_write_begin() path has to
do with writeback and memory reclaim races over unwritten extents,
but the general case is that we must be able to check the iomap
at any point in time to assess its validity.

Indeed, we also have this same "iomap valid check" functionality in the
writeback code as cached iomaps can become stale due to racing
writeback, truncation, etc. But you wouldn't know it by looking at the iomap
writeback code - this is currently hidden by XFS by embedding
the checks into the iomap writeback ->map_blocks function.

That is, the first thing that xfs_map_blocks() does is check if the
cached iomap is valid, and if it is valid it returns immediately and
the iomap writeback code uses it without question.
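
Concretely, that check is roughly the first thing in xfs_map_blocks()
today (paraphrased from fs/xfs/xfs_aops.c):

    if (xfs_imap_valid(wpc, ip, offset))
            return 0;       /* cached iomap is still valid, reuse it */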

The reason that this is embedded like this is that the iomap did not
have a validity cookie field in it, and so the validity information
was wrapped around the outside of the iomap_writepage_ctx and the
filesystem has to decode it from that private wrapping structure.

However, the validity information in the structure wrapper is
identical to the iomap validity cookie, and so the direction I've
been working towards is to replace this implicit, hidden cached
iomap validity check with an explicit ->iomap_valid call and then
only call ->map_blocks if the validity check fails (or is not
implemented).
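
In code, that direction would look roughly like this in the iomap
writeback path (a sketch only; iomap_writeback_ops has no iomap_valid
hook today):

    /* before reusing the cached wpc->iomap for this block */
    if (wpc->iomap.length && wpc->ops->iomap_valid &&
        wpc->ops->iomap_valid(inode, &wpc->iomap)) {
            /* cached iomap is still valid, keep using it */
    } else {
            error = wpc->ops->map_blocks(wpc, inode, pos);
    }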

I want to use the same code for all the iomap validity checks in all
the iomap core code - this is an iomap issue; the conditions where
we need to check for iomap validity differ depending on
the iomap context being run, and the checks are not necessarily
dependent on first having locked a folio.

Yes, the validity cookie needs to be decoded by the filesystem, but
that does not dictate where the validity checking needs to be done
by the iomap core.

Hence I think removing ->iomap_valid is a big step backwards for the
iomap core code - the iomap core needs to be able to formally verify
the iomap is valid at any point in time, not just at the point in
time a folio in the page cache has been locked...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
@ 2023-01-08 21:59     ` Dave Chinner
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2023-01-08 21:59 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> handler and validating the mapping there.
> 
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

I think this is wrong.

The ->iomap_valid() function handles a fundamental architectural
issue with cached iomaps: the iomap can become stale at any time
whilst it is in use by the iomap core code.

The current problem it solves in the iomap_write_begin() path has to
do with writeback and memory reclaim races over unwritten extents,
but the general case is that we must be able to check the iomap
at any point in time to assess its validity.

Indeed, we also have this same "iomap valid check" functionality in the
writeback code as cached iomaps can become stale due to racing
writeback, truncation, etc. But you wouldn't know it by looking at the iomap
writeback code - this is currently hidden by XFS by embedding
the checks into the iomap writeback ->map_blocks function.

That is, the first thing that xfs_map_blocks() does is check if the
cached iomap is valid, and if it is valid it returns immediately and
the iomap writeback code uses it without question.

The reason that this is embedded like this is that the iomap did not
have a validity cookie field in it, and so the validity information
was wrapped around the outside of the iomap_writepage_ctx and the
filesystem has to decode it from that private wrapping structure.

However, the validity information in the structure wrapper is
identical to the iomap validity cookie, and so the direction I've
been working towards is to replace this implicit, hidden cached
iomap validity check with an explicit ->iomap_valid call and then
only call ->map_blocks if the validity check fails (or is not
implemented).

I want to use the same code for all the iomap validity checks in all
the iomap core code - this is an iomap issue; the conditions where
we need to check for iomap validity differ depending on
the iomap context being run, and the checks are not necessarily
dependent on first having locked a folio.

Yes, the validity cookie needs to be decoded by the filesystem, but
that does not dictate where the validity checking needs to be done
by the iomap core.

Hence I think removing ->iomap_valid is a big step backwards for the
iomap core code - the iomap core needs to be able to formally verify
the iomap is valid at any point in time, not just at the point in
time a folio in the page cache has been locked...

-Dave.
-- 
Dave Chinner
david at fromorbit.com


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-01-09 12:46     ` Andreas Gruenbacher
  -1 siblings, 0 replies; 82+ messages in thread
From: Andreas Gruenbacher @ 2023-01-09 12:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J . Wong, Alexander Viro,
	Matthew Wilcox, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

On Sun, Jan 8, 2023 at 10:33 PM Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Jan 08, 2023 at 08:40:28PM +0100, Andreas Gruenbacher wrote:
> > Add an iomap_get_folio() helper that gets a folio reference based on
> > an iomap iterator and an offset into the address space.  Use it in
> > iomap_write_begin().
> >
> > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/iomap/buffered-io.c | 39 ++++++++++++++++++++++++++++++---------
> >  include/linux/iomap.h  |  1 +
> >  2 files changed, 31 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index d4b444e44861..de4a8e5f721a 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -457,6 +457,33 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
> >  }
> >  EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
> >
> > +/**
> > + * iomap_get_folio - get a folio reference for writing
> > + * @iter: iteration structure
> > + * @pos: start offset of write
> > + *
> > + * Returns a locked reference to the folio at @pos, or an error pointer if the
> > + * folio could not be obtained.
> > + */
> > +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> > +{
> > +     unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
> > +     struct folio *folio;
> > +
> > +     if (iter->flags & IOMAP_NOWAIT)
> > +             fgp |= FGP_NOWAIT;
> > +
> > +     folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> > +                     fgp, mapping_gfp_mask(iter->inode->i_mapping));
> > +     if (folio)
> > +             return folio;
> > +
> > +     if (iter->flags & IOMAP_NOWAIT)
> > +             return ERR_PTR(-EAGAIN);
> > +     return ERR_PTR(-ENOMEM);
> > +}
> > +EXPORT_SYMBOL_GPL(iomap_get_folio);
>
> Hmmmm.
>
> This is where things start to get complex. I have sent a patch to
> fix a problem with iomap_zero_range() failing to zero cached dirty
> pages over UNWRITTEN extents, and that requires making FGP_CREAT
> optional. This is an iomap bug, and needs to be fixed in the core
> iomap code:
>
> https://lore.kernel.org/linux-xfs/20221201005214.3836105-1-david@fromorbit.com/
>
> Essentially, we need to pass fgp flags to iomap_write_begin()
> so the callers can supply a 0 or FGP_CREAT appropriately. This
> allows iomap_write_begin() to act only on pre-cached pages rather
> than always instantiating a new page if one does not exist in cache.
>
> This allows iomap_write_begin() to return a NULL folio
> successfully, and this is perfectly OK for callers that pass in fgp
> = 0 as they are expected to handle a NULL folio return indicating
> there was no cached data over the range...
>
> Exposing the folio allocation as an external interface makes bug
> fixes like this rather messy - it's taking a core abstraction (iomap
> hides all the folio and page cache manipulations from the
> filesystem) and punching a big hole in it by requiring filesystems
> to actually allocate page cache folios on behalf of the iomap
> core.
>
> Given that I recently got major push-back for fixing an XFS-only bug
> by walking the page cache directly instead of abstracting it via the
> iomap core, punching an even bigger hole in the abstraction layer to
> fix a GFS2-only problem is just as bad....

We can handle that by adding a new IOMAP_NOCREATE iterator flag and
checking for that in iomap_get_folio().  Your patch then turns into
the below.

Thanks,
Andreas

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index dacc7c80b20d..34b335a89527 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -470,6 +470,8 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
 	unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
 	struct folio *folio;
 
+	if (!(iter->flags & IOMAP_NOCREATE))
+		fgp |= FGP_CREAT;
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
 
@@ -478,6 +480,8 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
 	if (folio)
 		return folio;
 
+	if (iter->flags & IOMAP_NOCREATE)
+		return ERR_PTR(-ENODATA);
 	if (iter->flags & IOMAP_NOWAIT)
 		return ERR_PTR(-EAGAIN);
 	return ERR_PTR(-ENOMEM);
@@ -1162,8 +1166,12 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	loff_t written = 0;
 
 	/* already zeroed?  we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if (srcmap->type == IOMAP_HOLE)
 		return length;
+	/* only do page cache lookups over unwritten extents */
+	iter->flags &= ~IOMAP_NOCREATE;
+	if (srcmap->type == IOMAP_UNWRITTEN)
+		iter->flags |= IOMAP_NOCREATE;
 
 	do {
 		struct folio *folio;
@@ -1172,8 +1180,19 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		size_t bytes = min_t(u64, SIZE_MAX, length);
 
 		status = iomap_write_begin(iter, pos, bytes, &folio);
-		if (status)
+		if (status) {
+			if (status == -ENODATA) {
+				/*
+				 * No folio was found, so skip to the start of
+				 * the next potential entry in the page cache
+				 * and continue from there.
+				 */
+				if (bytes > PAGE_SIZE - offset_in_page(pos))
+					bytes = PAGE_SIZE - offset_in_page(pos);
+				goto loop_continue;
+			}
 			return status;
+		}
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
@@ -1181,6 +1200,19 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
 
+		/*
+		 * If the folio over an unwritten extent is clean, then we
+		 * aren't going to touch the data in it at all. We don't want to
+		 * mark it dirty or change the uptodate state of data in the
+		 * page, so we just unlock it and skip to the next range over
+		 * the unwritten extent we need to check.
+		 */
+		if (srcmap->type == IOMAP_UNWRITTEN &&
+		    !folio_test_dirty(folio)) {
+			folio_unlock(folio);
+			goto loop_continue;
+		}
+
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
@@ -1188,6 +1220,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (WARN_ON_ONCE(bytes == 0))
 			return -EIO;
 
+loop_continue:
 		pos += bytes;
 		length -= bytes;
 		written += bytes;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 515318dfbc38..87b9d9aba4bb 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -841,17 +841,7 @@ xfs_setattr_size(
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
-	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-						     newsize);
-		if (error)
-			return error;
+	} else if (newsize != oldsize) {
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 3e6c34b03c89..55f195866f00 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -164,6 +164,7 @@ struct iomap_folio_ops {
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
+#define IOMAP_NOCREATE		(1 << 9) /* look up folios without FGP_CREAT */
 
 struct iomap_ops {
 	/*
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [Cluster-devel] [RFC v6 04/10] iomap: Add iomap_get_folio helper
@ 2023-01-09 12:46     ` Andreas Gruenbacher
  0 siblings, 0 replies; 82+ messages in thread
From: Andreas Gruenbacher @ 2023-01-09 12:46 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Sun, Jan 8, 2023 at 10:33 PM Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Jan 08, 2023 at 08:40:28PM +0100, Andreas Gruenbacher wrote:
> > Add an iomap_get_folio() helper that gets a folio reference based on
> > an iomap iterator and an offset into the address space.  Use it in
> > iomap_write_begin().
> >
> > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/iomap/buffered-io.c | 39 ++++++++++++++++++++++++++++++---------
> >  include/linux/iomap.h  |  1 +
> >  2 files changed, 31 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index d4b444e44861..de4a8e5f721a 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -457,6 +457,33 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
> >  }
> >  EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
> >
> > +/**
> > + * iomap_get_folio - get a folio reference for writing
> > + * @iter: iteration structure
> > + * @pos: start offset of write
> > + *
> > + * Returns a locked reference to the folio at @pos, or an error pointer if the
> > + * folio could not be obtained.
> > + */
> > +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> > +{
> > +     unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
> > +     struct folio *folio;
> > +
> > +     if (iter->flags & IOMAP_NOWAIT)
> > +             fgp |= FGP_NOWAIT;
> > +
> > +     folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> > +                     fgp, mapping_gfp_mask(iter->inode->i_mapping));
> > +     if (folio)
> > +             return folio;
> > +
> > +     if (iter->flags & IOMAP_NOWAIT)
> > +             return ERR_PTR(-EAGAIN);
> > +     return ERR_PTR(-ENOMEM);
> > +}
> > +EXPORT_SYMBOL_GPL(iomap_get_folio);
>
> Hmmmm.
>
> This is where things start to get complex. I have sent a patch to
> fix a problem with iomap_zero_range() failing to zero cached dirty
> pages over UNWRITTEN extents, and that requires making FGP_CREAT
> optional. This is an iomap bug, and needs to be fixed in the core
> iomap code:
>
> https://lore.kernel.org/linux-xfs/20221201005214.3836105-1-david at fromorbit.com/
>
> Essentially, we need to pass fgp flags to iomap_write_begin()
> so the callers can supply a 0 or FGP_CREAT appropriately. This
> allows iomap_write_begin() to act only on pre-cached pages rather
> than always instantiating a new page if one does not exist in cache.
>
> This allows iomap_write_begin() to return a NULL folio
> successfully, and this is perfectly OK for callers that pass in fgp
> = 0 as they are expected to handle a NULL folio return indicating
> there was no cached data over the range...
>
> Exposing the folio allocation as an external interface makes bug
> fixes like this rather messy - it's taking a core abstraction (iomap
> hides all the folio and page cache manipulations from the
> filesystem) and punching a big hole in it by requiring filesystems
> to actually allocate page cache folios on behalf of the iomap
> core.
>
> Given that I recently got major push-back for fixing an XFS-only bug
> by walking the page cache directly instead of abstracting it via the
> iomap core, punching an even bigger hole in the abstraction layer to
> fix a GFS2-only problem is just as bad....

We can handle that by adding a new IOMAP_NOCREATE iterator flag and
checking for that in iomap_get_folio().  Your patch then turns into
the below.

Thanks,
Andreas

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index dacc7c80b20d..34b335a89527 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -470,6 +470,8 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
 	unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
 	struct folio *folio;
 
+	if (!(iter->flags & IOMAP_NOCREATE))
+		fgp |= FGP_CREAT;
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
 
@@ -478,6 +480,8 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
 	if (folio)
 		return folio;
 
+	if (iter->flags & IOMAP_NOCREATE)
+		return ERR_PTR(-ENODATA);
 	if (iter->flags & IOMAP_NOWAIT)
 		return ERR_PTR(-EAGAIN);
 	return ERR_PTR(-ENOMEM);
@@ -1162,8 +1166,12 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	loff_t written = 0;
 
 	/* already zeroed?  we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if (srcmap->type == IOMAP_HOLE)
 		return length;
+	/* only do page cache lookups over unwritten extents */
+	iter->flags &= ~IOMAP_NOCREATE;
+	if (srcmap->type == IOMAP_UNWRITTEN)
+		iter->flags |= IOMAP_NOCREATE;
 
 	do {
 		struct folio *folio;
@@ -1172,8 +1180,19 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		size_t bytes = min_t(u64, SIZE_MAX, length);
 
 		status = iomap_write_begin(iter, pos, bytes, &folio);
-		if (status)
+		if (status) {
+			if (status == -ENODATA) {
+				/*
+				 * No folio was found, so skip to the start of
+				 * the next potential entry in the page cache
+				 * and continue from there.
+				 */
+				if (bytes > PAGE_SIZE - offset_in_page(pos))
+					bytes = PAGE_SIZE - offset_in_page(pos);
+				goto loop_continue;
+			}
 			return status;
+		}
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
@@ -1181,6 +1200,19 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
 
+		/*
+		 * If the folio over an unwritten extent is clean, then we
+		 * aren't going to touch the data in it at all. We don't want to
+		 * mark it dirty or change the uptodate state of data in the
+		 * page, so we just unlock it and skip to the next range over
+		 * the unwritten extent we need to check.
+		 */
+		if (srcmap->type == IOMAP_UNWRITTEN &&
+		    !folio_test_dirty(folio)) {
+			folio_unlock(folio);
+			goto loop_continue;
+		}
+
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
@@ -1188,6 +1220,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (WARN_ON_ONCE(bytes == 0))
 			return -EIO;
 
+loop_continue:
 		pos += bytes;
 		length -= bytes;
 		written += bytes;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 515318dfbc38..87b9d9aba4bb 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -841,17 +841,7 @@ xfs_setattr_size(
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
-	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-						     newsize);
-		if (error)
-			return error;
+	} else if (newsize != oldsize) {
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 3e6c34b03c89..55f195866f00 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -164,6 +164,7 @@ struct iomap_folio_ops {
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
+#define IOMAP_NOCREATE		(1 << 9) /* look up folios without FGP_CREAT */
 
 struct iomap_ops {
 	/*
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
  2023-01-08 21:59     ` [Cluster-devel] " Dave Chinner
@ 2023-01-09 18:45       ` Andreas Gruenbacher
  -1 siblings, 0 replies; 82+ messages in thread
From: Andreas Gruenbacher @ 2023-01-09 18:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J . Wong, Alexander Viro,
	Matthew Wilcox, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel

On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> > Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> > handler and validating the mapping there.
> >
> > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
>
> I think this is wrong.
>
> The ->iomap_valid() function handles a fundamental architectural
> issue with cached iomaps: the iomap can become stale at any time
> whilst it is in use by the iomap core code.
>
> The current problem it solves in the iomap_write_begin() path has to
> do with writeback and memory reclaim races over unwritten extents,
> but the general case is that we must be able to check the iomap
> at any point in time to assess its validity.
>
> Indeed, we also have this same "iomap valid check" functionality in the
> writeback code as cached iomaps can become stale due to racing
> writeback, truncation, etc. But you wouldn't know it by looking at the iomap
> writeback code - this is currently hidden by XFS by embedding
> the checks into the iomap writeback ->map_blocks function.
>
> That is, the first thing that xfs_map_blocks() does is check if the
> cached iomap is valid, and if it is valid it returns immediately and
> the iomap writeback code uses it without question.
>
> The reason that this is embedded like this is that the iomap did not
> have a validity cookie field in it, and so the validity information
> was wrapped around the outside of the iomap_writepage_ctx and the
> filesystem has to decode it from that private wrapping structure.
>
> However, the validity information in the structure wrapper is
> identical to the iomap validity cookie,

Then could that part of the xfs code be converted to use
iomap->validity_cookie so that struct iomap_writepage_ctx can be
eliminated?
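
(Sketched out, xfs_map_blocks() could then start with something like

    if (wpc->iomap.validity_cookie ==
        xfs_iomap_inode_sequence(ip, wpc->iomap.flags))
            return 0;       /* cached mapping is still valid */

instead of carrying the data_seq/cow_seq fields in the private struct
xfs_writepage_ctx wrapper; those field names are from my reading of
fs/xfs/xfs_aops.c.)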

> and so the direction I've
> been working towards is to replace this implicit, hidden cached
> iomap validity check with an explicit ->iomap_valid call and then
> only call ->map_blocks if the validity check fails (or is not
> implemented).
>
> I want to use the same code for all the iomap validity checks in all
> the iomap core code - this is an iomap issue; the conditions where
> we need to check for iomap validity differ depending on
> the iomap context being run, and the checks are not necessarily
> dependent on first having locked a folio.
>
> Yes, the validity cookie needs to be decoded by the filesystem, but
> that does not dictate where the validity checking needs to be done
> by the iomap core.
>
> Hence I think removing ->iomap_valid is a big step backwards for the
> iomap core code - the iomap core needs to be able to formally verify
> the iomap is valid at any point in time, not just at the point in
> time a folio in the page cache has been locked...

We don't need to validate an iomap "at any time". There are two
specific places in the code where we need to check (iomap_write_begin()
and the writeback ->map_blocks path), and we're not going to end up
with ten more such places tomorrow. I'd prefer to keep those
filesystem internals in the filesystem-specific code instead of
exposing them to the iomap layer. But that's just me ...

If we ignore this particular commit for now, do you have any
objections to the patches in this series? If not, it would be great if
we could add the other patches to iomap-for-next.

By the way, I'm still not sure if gfs2 is affected by this whole iomap
validation drama given that it neither implements unwritten extents
nor delayed allocation. This is a mess.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
@ 2023-01-09 18:45       ` Andreas Gruenbacher
  0 siblings, 0 replies; 82+ messages in thread
From: Andreas Gruenbacher @ 2023-01-09 18:45 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> > Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> > handler and validating the mapping there.
> >
> > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
>
> I think this is wrong.
>
> The ->iomap_valid() function handles a fundamental architectural
> issue with cached iomaps: the iomap can become stale at any time
> whilst it is in use by the iomap core code.
>
> The current problem it solves in the iomap_write_begin() path has to
> do with writeback and memory reclaim races over unwritten extents,
> but the general case is that we must be able to check the iomap
> at any point in time to assess its validity.
>
> Indeed, we also have this same "iomap valid check" functionality in the
> writeback code as cached iomaps can become stale due to racing
> writeback, truncation, etc. But you wouldn't know it by looking at the iomap
> writeback code - this is currently hidden by XFS by embedding
> the checks into the iomap writeback ->map_blocks function.
>
> That is, the first thing that xfs_map_blocks() does is check if the
> cached iomap is valid, and if it is valid it returns immediately and
> the iomap writeback code uses it without question.
>
> The reason that this is embedded like this is that the iomap did not
> have a validity cookie field in it, and so the validity information
> was wrapped around the outside of the iomap_writepage_ctx and the
> filesystem has to decode it from that private wrapping structure.
>
> However, the validity information iin the structure wrapper is
> indentical to the iomap validity cookie,

Then could that part of the xfs code be converted to use
iomap->validity_cookie so that struct iomap_writepage_ctx can be
eliminated?

> and so the direction I've
> been working towards is to replace this implicit, hidden cached
> iomap validity check with an explicit ->iomap_valid call and then
> only call ->map_blocks if the validity check fails (or is not
> implemented).
>
> I want to use the same code for all the iomap validity checks in all
> the iomap core code - this is an iomap issue, the conditions where
> we need to check for iomap validity are different for depending on
> the iomap context being run, and the checks are not necessarily
> dependent on first having locked a folio.
>
> Yes, the validity cookie needs to be decoded by the filesystem, but
> that does not dictate where the validity checking needs to be done
> by the iomap core.
>
> Hence I think removing ->iomap_valid is a big step backwards for the
> iomap core code - the iomap core needs to be able to formally verify
> the iomap is valid at any point in time, not just at the point in
> time a folio in the page cache has been locked...

We don't need to validate an iomap "at any time". It's two specific
places in the code in which we need to check, and we're not going to
end up with ten more such places tomorrow. I'd prefer to keep those
filesystem internals in the filesystem specific code instead of
exposing them to the iomap layer. But that's just me ...

If we ignore this particular commit for now, do you have any
objections to the patches in this series? If not, it would be great if
we could add the other patches to iomap-for-next.

By the way, I'm still not sure if gfs2 is affected by this whole iomap
validation drama given that it neither implements unwritten extents
nor delayed allocation. This is a mess.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
  2023-01-09 18:45       ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-01-09 22:54         ` Dave Chinner
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2023-01-09 22:54 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, Darrick J . Wong, Alexander Viro,
	Matthew Wilcox, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel

On Mon, Jan 09, 2023 at 07:45:27PM +0100, Andreas Gruenbacher wrote:
> On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> > > Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> > > handler and validating the mapping there.
> > >
> > > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> >
> > I think this is wrong.
> >
> > The ->iomap_valid() function handles a fundamental architectural
> > issue with cached iomaps: the iomap can become stale at any time
> > whilst it is in use by the iomap core code.
> >
> > The current problem it solves in the iomap_write_begin() path has to
> > do with writeback and memory reclaim races over unwritten extents,
> > but the general case is that we must be able to check the iomap
> > at any point in time to assess its validity.
> >
> > Indeed, we also have this same "iomap valid check" functionality in the
> > writeback code as cached iomaps can become stale due to racing
> > writeback, truncate, etc. But you wouldn't know it by looking at the iomap
> > writeback code - this is currently hidden by XFS by embedding
> > the checks into the iomap writeback ->map_blocks function.
> >
> > That is, the first thing that xfs_map_blocks() does is check if the
> > cached iomap is valid, and if it is valid it returns immediately and
> > the iomap writeback code uses it without question.
> >
> > The reason that this is embedded like this is that the iomap did not
> > have a validity cookie field in it, and so the validity information
> > was wrapped around the outside of the iomap_writepage_ctx and the
> > filesystem has to decode it from that private wrapping structure.
> >
> > However, the validity information in the structure wrapper is
> > identical to the iomap validity cookie,
> 
> Then could that part of the xfs code be converted to use
> iomap->validity_cookie so that struct iomap_writepage_ctx can be
> eliminated?

Yes, that is the plan.

> 
> > and so the direction I've
> > been working towards is to replace this implicit, hidden cached
> > iomap validity check with an explicit ->iomap_valid call and then
> > only call ->map_blocks if the validity check fails (or is not
> > implemented).
> >
> > I want to use the same code for all the iomap validity checks in all
> > the iomap core code - this is an iomap issue, the conditions where
> > we need to check for iomap validity are different depending on
> > the iomap context being run, and the checks are not necessarily
> > dependent on first having locked a folio.
> >
> > Yes, the validity cookie needs to be decoded by the filesystem, but
> > that does not dictate where the validity checking needs to be done
> > by the iomap core.
> >
> > Hence I think removing ->iomap_valid is a big step backwards for the
> > iomap core code - the iomap core needs to be able to formally verify
> > the iomap is valid at any point in time, not just at the point in
> > time a folio in the page cache has been locked...
> 
> We don't need to validate an iomap "at any time". It's two specific
> places in the code in which we need to check, and we're not going to
> end up with ten more such places tomorrow.

Not immediately, but that doesn't change the fact this is not a
filesystem specific issue - it's an inherent characteristic of
cached iomaps and unsynchronised extent state changes that occur
outside exclusive inode->i_rwsem IO context (e.g. in writeback and
IO completion contexts).

Racing mmap + buffered writes can expose these state changes as the
iomap buffered write IO path is not serialised against the iomap
mmap IO path except via folio locks. Hence a mmap page fault can
invalidate a cached buffered write iomap by causing a hole ->
unwritten, hole -> delalloc or hole -> written conversion in the
middle of the buffered write range. The buffered write still has a
hole mapping cached for that entire range, and it is now incorrect.

If the mmap write happens to change extent state at the trailing
edge of a partial buffered write, data corruption will occur if we
race just right with writeback and memory reclaim. I'm pretty sure
that this corruption can be reproduced on gfs2 if we try hard enough
- generic/346 triggers the mmap/write race condition; all that is
needed from that point is for writeback and page reclaim to hit at
exactly the right time...
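
To spell the race out (an illustration of the sequence described
above, not kernel code):

/*
 * buffered write                      concurrent mmap fault
 * --------------                      ---------------------
 * iomap_begin() caches a hole
 *   mapping for the write range
 *                                     fault fills part of that range:
 *                                     hole -> unwritten/delalloc/written
 * write path keeps using the cached,
 *   now stale, hole mapping for the
 *   trailing edge of the write
 * writeback + reclaim hit the folio
 *   at just the wrong time -> corruption
 */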

> I'd prefer to keep those
> filesystem internals in the filesystem specific code instead of
> exposing them to the iomap layer. But that's just me ...

My point is that there is nothing XFS specific about these stale
cached iomap race conditions, nor is it specifically related to
folio locking. The folio locking inversions w.r.t. iomap caching and
the interactions with writeback and reclaim are simply the
manifestation that brought the issue to our attention.

This is why I think hiding iomap validation in filesystem specific page
cache allocation/lookup functions is entirely the wrong layer to be
doing iomap validity checks. Especially as it prevents us from
adding more validity checks in the core infrastructure when we need
them in future.

AFAIC, an iomap must carry with it a method for checking
that it is still valid. We need it in the write path, we need it in
the writeback path. If we want to relax the restrictions on clone
operations (e.g. shared locking on the source file), we'll need to
be able to detect stale cached iomaps in those paths, too. And I
haven't really thought through all the implications of shared
locking on buffered writes yet, but that may well require more
checks in other places as well.
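
For concreteness, the hook being argued over looks roughly like this
today (paraphrased from the current code; details may differ):

struct iomap_page_ops {
	...
	/*
	 * Check that the cached iomap still matches the inode's extent
	 * state; called with the folio locked in iomap_write_begin().
	 */
	bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
};

and the buffered write path reacts to a stale mapping by marking it
and retrying:

	if (page_ops && page_ops->iomap_valid &&
	    !page_ops->iomap_valid(iter->inode, &iter->iomap)) {
		iter->iomap.flags |= IOMAP_F_STALE;
		status = 0;
		goto out_unlock;
	}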

> If we ignore this particular commit for now, do you have any
> objections to the patches in this series? If not, it would be great if
> we could add the other patches to iomap-for-next.

I still don't like moving page cache operations into individual
filesystems, but for the moment I can live with the IOMAP_NOCREATE
hack to drill iomap state through the filesystem without the
filesystem being aware of it.

> By the way, I'm still not sure if gfs2 is affected by this whole iomap
> validation drama given that it neither implements unwritten extents
> nor delayed allocation. This is a mess.

See above - I'm pretty sure it will be, but it may be very difficult
to expose. After all, it's taken several years before anyone noticed
this issue with XFS, even though we were aware of the issue of stale
cached iomaps causing data corruption in the writeback path....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
  2023-01-09 22:54         ` [Cluster-devel] " Dave Chinner
@ 2023-01-10  1:09           ` Andreas Grünbacher
  -1 siblings, 0 replies; 82+ messages in thread
From: Andreas Grünbacher @ 2023-01-10  1:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andreas Gruenbacher, Christoph Hellwig, Darrick J . Wong,
	Alexander Viro, Matthew Wilcox, linux-xfs, linux-fsdevel,
	linux-ext4, cluster-devel

On Mon, Jan 9, 2023 at 11:58 PM Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Jan 09, 2023 at 07:45:27PM +0100, Andreas Gruenbacher wrote:
> > On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner <david@fromorbit.com> wrote:
> > > On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> > > > Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> > > > handler and validating the mapping there.
> > > >
> > > > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> > >
> > > I think this is wrong.
> > >
> > > The ->iomap_valid() function handles a fundamental architectural
> > > issue with cached iomaps: the iomap can become stale at any time
> > > whilst it is in use by the iomap core code.
> > >
> > > The current problem it solves in the iomap_write_begin() path has to
> > > do with writeback and memory reclaim races over unwritten extents,
> > > but the general case is that we must be able to check the iomap
> > > at any point in time to assess its validity.
> > >
> > > Indeed, we also have this same "iomap valid check" functionality in the
> > > writeback code as cached iomaps can become stale due to racing
> > > writeback, truncate, etc. But you wouldn't know it by looking at the iomap
> > > writeback code - this is currently hidden by XFS by embedding
> > > the checks into the iomap writeback ->map_blocks function.
> > >
> > > That is, the first thing that xfs_map_blocks() does is check if the
> > > cached iomap is valid, and if it is valid it returns immediately and
> > > the iomap writeback code uses it without question.
> > >
> > > The reason that this is embedded like this is that the iomap did not
> > > have a validity cookie field in it, and so the validity information
> > > was wrapped around the outside of the iomap_writepage_ctx and the
> > > filesystem has to decode it from that private wrapping structure.
> > >
> > > However, the validity information in the structure wrapper is
> > > identical to the iomap validity cookie,
> >
> > Then could that part of the xfs code be converted to use
> > iomap->validity_cookie so that struct iomap_writepage_ctx can be
> > eliminated?
>
> Yes, that is the plan.
>
> >
> > > and so the direction I've
> > > been working towards is to replace this implicit, hidden cached
> > > iomap validity check with an explicit ->iomap_valid call and then
> > > only call ->map_blocks if the validity check fails (or is not
> > > implemented).
> > >
> > > I want to use the same code for all the iomap validity checks in all
> > > the iomap core code - this is an iomap issue, the conditions where
> > > we need to check for iomap validity are different depending on
> > > the iomap context being run, and the checks are not necessarily
> > > dependent on first having locked a folio.
> > >
> > > Yes, the validity cookie needs to be decoded by the filesystem, but
> > > that does not dictate where the validity checking needs to be done
> > > by the iomap core.
> > >
> > > Hence I think removing ->iomap_valid is a big step backwards for the
> > > iomap core code - the iomap core needs to be able to formally verify
> > > the iomap is valid at any point in time, not just at the point in
> > > time a folio in the page cache has been locked...
> >
> > We don't need to validate an iomap "at any time". It's two specific
> > places in the code in which we need to check, and we're not going to
> > end up with ten more such places tomorrow.
>
> Not immediately, but that doesn't change the fact this is not a
> filesystem specific issue - it's an inherent characteristic of
> cached iomaps and unsynchronised extent state changes that occur
> outside exclusive inode->i_rwsem IO context (e.g. in writeback and
> IO completion contexts).
>
> Racing mmap + buffered writes can expose these state changes as the
> iomap buffered write IO path is not serialised against the iomap
> mmap IO path except via folio locks. Hence a mmap page fault can
> invalidate a cached buffered write iomap by causing a hole ->
> unwritten, hole -> delalloc or hole -> written conversion in the
> middle of the buffered write range. The buffered write still has a
> hole mapping cached for that entire range, and it is now incorrect.
>
> If the mmap write happens to change extent state at the trailing
> edge of a partial buffered write, data corruption will occur if we
> race just right with writeback and memory reclaim. I'm pretty sure
> that this corruption can be reproduced on gfs2 if we try hard enough
> - generic/346 triggers the mmap/write race condition; all that is
> needed from that point is for writeback and page reclaim to hit at
> exactly the right time...
>
> > I'd prefer to keep those
> > filesystem internals in the filesystem specific code instead of
> > exposing them to the iomap layer. But that's just me ...
>
> My point is that there is nothing XFS specific about these stale
> cached iomap race conditions, nor is it specifically related to
> folio locking. The folio locking inversions w.r.t. iomap caching and
> the interactions with writeback and reclaim are simply the
> manifestation that brought the issue to our attention.
>
> This is why I think hiding iomap validation in filesystem specific page
> cache allocation/lookup functions is entirely the wrong layer to be
> doing iomap validity checks. Especially as it prevents us from
> adding more validity checks in the core infrastructure when we need
> them in future.
>
> AFAIC, an iomap must carry with it a method for checking
> that it is still valid. We need it in the write path, we need it in
> the writeback path. If we want to relax the restrictions on clone
> operations (e.g. shared locking on the source file), we'll need to
> be able to detect stale cached iomaps in those paths, too. And I
> haven't really thought through all the implications of shared
> locking on buffered writes yet, but that may well require more
> checks in other places as well.
>
> > If we ignore this particular commit for now, do you have any
> > objections to the patches in this series? If not, it would be great if
> > we could add the other patches to iomap-for-next.
>
> I still don't like moving page cache operations into individual
> filesystems, but for the moment I can live with the IOMAP_NOCREATE
> hack to drill iomap state through the filesystem without the
> filesystem being aware of it.

Alright, works for me. Darrick?

> > By the way, I'm still not sure if gfs2 is affected by this whole iomap
> > validation drama given that it neither implements unwritten extents
> > nor delayed allocation. This is a mess.
>
> See above - I'm pretty sure it will be, but it may be very difficult
> to expose. After all, it's taken several years before anyone noticed
> this issue with XFS, even though we were aware of the issue of stale
> cached iomaps causing data corruption in the writeback path....

Okay, that's all pretty ugly. Thanks a lot for the detailed explanation.

Cheers,
Andreas

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-09 12:46     ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-01-10  8:46       ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2023-01-10  8:46 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Dave Chinner, Christoph Hellwig, Darrick J . Wong,
	Alexander Viro, Matthew Wilcox, linux-xfs, linux-fsdevel,
	linux-ext4, cluster-devel, Christoph Hellwig

On Mon, Jan 09, 2023 at 01:46:42PM +0100, Andreas Gruenbacher wrote:
> We can handle that by adding a new IOMAP_NOCREATE iterator flag and
> checking for that in iomap_get_folio().  Your patch then turns into
> the below.

Exactly.  And as I already pointed out in reply to Dave's original
patch what we really should be doing is returning an ERR_PTR from
__filemap_get_folio instead of reverse-engineering the expected
error code.
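
In other words, the caller side could become something like this
(sketch, assuming __filemap_get_folio() learns to return ERR_PTRs):

	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
			fgp, mapping_gfp_mask(iter->inode->i_mapping));
	if (IS_ERR(folio))
		return folio;	/* e.g. -EAGAIN for FGP_NOWAIT, -ENOMEM */

instead of mapping a NULL return back to an errno by hand:

	if (!folio) {
		if (iter->flags & IOMAP_NOWAIT)
			return ERR_PTR(-EAGAIN);
		return ERR_PTR(-ENOMEM);
	}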

The only big question is if we should apply Dave's patch first as a bug
fix before this series, and I suspect we should do that.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 06/10] iomap: Add __iomap_get_folio helper
  2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-01-10  8:48     ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2023-01-10  8:48 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, Darrick J . Wong, Alexander Viro,
	Matthew Wilcox, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel

On Sun, Jan 08, 2023 at 08:40:30PM +0100, Andreas Gruenbacher wrote:
> +static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
> +		size_t len)
> +{
> +	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> +
> +	if (page_ops && page_ops->page_prepare)
> +		return page_ops->page_prepare(iter, pos, len);
> +	else
> +		return iomap_get_folio(iter, pos);

Nit: No need for an else after the return.
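
That is, simply:

static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
		size_t len)
{
	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;

	if (page_ops && page_ops->page_prepare)
		return page_ops->page_prepare(iter, pos, len);
	return iomap_get_folio(iter, pos);
}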

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
  2023-01-08 21:59     ` [Cluster-devel] " Dave Chinner
@ 2023-01-10  8:51       ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2023-01-10  8:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andreas Gruenbacher, Christoph Hellwig, Darrick J . Wong,
	Alexander Viro, Matthew Wilcox, linux-xfs, linux-fsdevel,
	linux-ext4, cluster-devel

On Mon, Jan 09, 2023 at 08:59:11AM +1100, Dave Chinner wrote:
> Indeed, we also have this same "iomap valid check" functionality in the
> writeback code as cached iomaps can become stale due to racing
> writeback, truncate, etc. But you wouldn't know it by looking at the iomap
> writeback code - this is currently hidden by XFS by embedding
> the checks into the iomap writeback ->map_blocks function.

And that's in many ways a good thing, as it avoids various callouts
that are expensive and confusing.  Just like how this patch gets it
right by not having a mess of badly interacting callbacks, but
one that ensures that the page is ready.

> Hence I think removing ->iomap_valid is a big step backwards for the
> iomap core code - the iomap core needs to be able to formally verify
> the iomap is valid at any point in time, not just at the point in
> time a folio in the page cache has been locked...

For using it anywhere else but the buffered write path it is in the
wrong place to start with, and notwithstanding my above concern I
can't really think of a good place and prototype for such a valid
callback to actually cover all use cases.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
  2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-01-10  8:52     ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2023-01-10  8:52 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, Darrick J . Wong, Alexander Viro,
	Matthew Wilcox, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-10  8:46       ` [Cluster-devel] " Christoph Hellwig
@ 2023-01-10  9:07         ` Andreas Grünbacher
  -1 siblings, 0 replies; 82+ messages in thread
From: Andreas Grünbacher @ 2023-01-10  9:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Gruenbacher, Dave Chinner, Darrick J . Wong,
	Alexander Viro, Matthew Wilcox, linux-xfs, linux-fsdevel,
	linux-ext4, cluster-devel, Christoph Hellwig

On Tue, Jan 10, 2023 at 9:52 AM Christoph Hellwig
<hch@infradead.org> wrote:
> On Mon, Jan 09, 2023 at 01:46:42PM +0100, Andreas Gruenbacher wrote:
> > We can handle that by adding a new IOMAP_NOCREATE iterator flag and
> > checking for that in iomap_get_folio().  Your patch then turns into
> > the below.
>
> Exactly.  And as I already pointed out in reply to Dave's original
> patch what we really should be doing is returning an ERR_PTR from
> __filemap_get_folio instead of reverse-engineering the expected
> error code.
>
> The only big question is if we should apply Dave's patch first as a bug
> fix before this series, and I suspect we should do that.

Sounds fine. I assume Dave is going to send an update.

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-10  8:46       ` [Cluster-devel] " Christoph Hellwig
@ 2023-01-10 13:34         ` Matthew Wilcox
  -1 siblings, 0 replies; 82+ messages in thread
From: Matthew Wilcox @ 2023-01-10 13:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Gruenbacher, Dave Chinner, Darrick J . Wong,
	Alexander Viro, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

On Tue, Jan 10, 2023 at 12:46:45AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 09, 2023 at 01:46:42PM +0100, Andreas Gruenbacher wrote:
> > We can handle that by adding a new IOMAP_NOCREATE iterator flag and
> > checking for that in iomap_get_folio().  Your patch then turns into
> > the below.
> 
> Exactly.  And as I already pointed out in reply to Dave's original
> patch what we really should be doing is returning an ERR_PTR from
> __filemap_get_folio instead of reverse-engineering the expected
> error code.

Ouch, we have a nasty problem.

If somebody passes FGP_ENTRY, we can return a shadow entry.  And the
encodings for shadow entries overlap with the encodings for ERR_PTR,
meaning that some shadow entries will look like errors.  The way I
solved this in the XArray code is by shifting the error values by
two bits and encoding errors as XA_ERROR(-ENOMEM) (for example).

I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS,
but so far we haven't, and I'd like to make that decision intentionally.
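
For reference, the encodings in play (as the xarray and err.h code
define them; worth double-checking against the current headers):

/*
 * xa_mk_value(v):  (void *)((v << 1) | 1)    - shadow entries, low bit set
 * XA_ERROR(err):   (void *)((err << 2) | 2)  - xarray internal entry
 * ERR_PTR(err):    (void *)(long)err         - last 4095 byte values
 *
 * ERR_PTR(-EPERM) == (void *)-1 has its low bit set, so it cannot be
 * told apart from a shadow entry, while XA_ERROR() values can.
 */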

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-10 13:34         ` [Cluster-devel] " Matthew Wilcox
@ 2023-01-10 15:24           ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2023-01-10 15:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Andreas Gruenbacher, Dave Chinner,
	Darrick J . Wong, Alexander Viro, linux-xfs, linux-fsdevel,
	linux-ext4, cluster-devel, Christoph Hellwig

On Tue, Jan 10, 2023 at 01:34:16PM +0000, Matthew Wilcox wrote:
> > Exactly.  And as I already pointed out in reply to Dave's original
> > patch what we really should be doing is returning an ERR_PTR from
> > __filemap_get_folio instead of reverse-engineering the expected
> > error code.
> 
> Ouch, we have a nasty problem.
> 
> If somebody passes FGP_ENTRY, we can return a shadow entry.  And the
> encodings for shadow entries overlap with the encodings for ERR_PTR,
> meaning that some shadow entries will look like errors.  The way I
> solved this in the XArray code is by shifting the error values by
> two bits and encoding errors as XA_ERROR(-ENOMEM) (for example).
> 
> I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS,
> but so far we haven't, and I'd like to make that decision intentionally.

So what would be an alternative way to tell the callers why no folio
was found instead of trying to reverse engineer that?  Return an errno
and the folio by reference?  That would work, but the calling conventions
would be awful.
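
For comparison, that alternative would be something like (hypothetical
signature, just to show the shape):

	int __filemap_get_folio_err(struct address_space *mapping,
			pgoff_t index, int fgp_flags, gfp_t gfp,
			struct folio **foliop);

where every caller grows an extra output argument and a second local
variable to carry the folio next to the error.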

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-10 15:24           ` [Cluster-devel] " Christoph Hellwig
@ 2023-01-11 19:36             ` Matthew Wilcox
  -1 siblings, 0 replies; 82+ messages in thread
From: Matthew Wilcox @ 2023-01-11 19:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Gruenbacher, Dave Chinner, Darrick J . Wong,
	Alexander Viro, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

On Tue, Jan 10, 2023 at 07:24:27AM -0800, Christoph Hellwig wrote:
> On Tue, Jan 10, 2023 at 01:34:16PM +0000, Matthew Wilcox wrote:
> > > Exactly.  And as I already pointed out in reply to Dave's original
> > > patch what we really should be doing is returning an ERR_PTR from
> > > __filemap_get_folio instead of reverse-engineering the expected
> > > error code.
> > 
> > Ouch, we have a nasty problem.
> > 
> > If somebody passes FGP_ENTRY, we can return a shadow entry.  And the
> > encodings for shadow entries overlap with the encodings for ERR_PTR,
> > meaning that some shadow entries will look like errors.  The way I
> > solved this in the XArray code is by shifting the error values by
> > two bits and encoding errors as XA_ERROR(-ENOMEM) (for example).
> > 
> > I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS,
> > but so far we haven't, and I'd like to make that decision intentionally.
> 
> So what would be an alternative way to tell the callers why no folio
> was found instead of trying to reverse engineer that?  Return an errno
> and the folio by reference?  That would work, but the calling conventions
> would be awful.

Agreed.  How about an xa_filemap_get_folio()?

(there are a number of things to fix here; haven't decided if XA_ERROR
should return void *, or whether I should use a separate 'entry' and
'folio' until I know the entry is actually a folio ...)

Usage would seem pretty straightforward:

	folio = xa_filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
			fgp, mapping_gfp_mask(iter->inode->i_mapping));
	status = xa_err(folio);
	if (status)
		goto out_no_page;

diff --git a/mm/filemap.c b/mm/filemap.c
index 7bf8442bcfaa..7d489f96c690 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1800,40 +1800,25 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
 }
 
 /**
- * __filemap_get_folio - Find and get a reference to a folio.
+ * xa_filemap_get_folio - Find and get a reference to a folio.
  * @mapping: The address_space to search.
  * @index: The page index.
  * @fgp_flags: %FGP flags modify how the folio is returned.
  * @gfp: Memory allocation flags to use if %FGP_CREAT is specified.
  *
- * Looks up the page cache entry at @mapping & @index.
- *
- * @fgp_flags can be zero or more of these flags:
- *
- * * %FGP_ACCESSED - The folio will be marked accessed.
- * * %FGP_LOCK - The folio is returned locked.
- * * %FGP_ENTRY - If there is a shadow / swap / DAX entry, return it
- *   instead of allocating a new folio to replace it.
- * * %FGP_CREAT - If no page is present then a new page is allocated using
- *   @gfp and added to the page cache and the VM's LRU list.
- *   The page is returned locked and with an increased refcount.
- * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
- *   page is already in cache.  If the page was allocated, unlock it before
- *   returning so the caller can do the same dance.
- * * %FGP_WRITE - The page will be written to by the caller.
- * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
- * * %FGP_NOWAIT - Don't get blocked by page lock.
- * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
- *
- * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
- * if the %GFP flags specified for %FGP_CREAT are atomic.
+ * Looks up the page cache entry at @mapping & @index.  See
+ * __filemap_get_folio() for a detailed description.
  *
- * If there is a page cache page, it is returned with an increased refcount.
+ * This differs from __filemap_get_folio() in that it will return an
+ * XArray error instead of NULL if something goes wrong, allowing the
+ * advanced user to distinguish why the failure happened.  We can't use an
+ * ERR_PTR() because its encodings overlap with shadow/swap/dax entries.
  *
- * Return: The found folio or %NULL otherwise.
+ * Return: The entry in the page cache or an xa_err() if there is no entry
+ * or it could not be appropiately locked.
  */
-struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+struct folio *xa_filemap_get_folio(struct address_space *mapping,
+		pgoff_t index, int fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
@@ -1851,7 +1836,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		if (fgp_flags & FGP_NOWAIT) {
 			if (!folio_trylock(folio)) {
 				folio_put(folio);
-				return NULL;
+				return (struct folio *)XA_ERROR(-EAGAIN);
 			}
 		} else {
 			folio_lock(folio);
@@ -1890,7 +1875,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 
 		folio = filemap_alloc_folio(gfp, 0);
 		if (!folio)
-			return NULL;
+			return (struct folio *)XA_ERROR(-ENOMEM);
 
 		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
 			fgp_flags |= FGP_LOCK;
@@ -1902,19 +1887,65 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		err = filemap_add_folio(mapping, folio, index, gfp);
 		if (unlikely(err)) {
 			folio_put(folio);
-			folio = NULL;
 			if (err == -EEXIST)
 				goto repeat;
+			folio = (struct folio *)XA_ERROR(err);
+		} else {
+			/*
+			 * filemap_add_folio locks the page, and for mmap
+			 * we expect an unlocked page.
+			 */
+			if (fgp_flags & FGP_FOR_MMAP)
+				folio_unlock(folio);
 		}
-
-		/*
-		 * filemap_add_folio locks the page, and for mmap
-		 * we expect an unlocked page.
-		 */
-		if (folio && (fgp_flags & FGP_FOR_MMAP))
-			folio_unlock(folio);
 	}
 
+	if (!folio)
+		folio = (struct folio *)XA_ERROR(-ENODATA);
+	return folio;
+}
+EXPORT_SYMBOL_GPL(xa_filemap_get_folio);
+
+/**
+ * __filemap_get_folio - Find and get a reference to a folio.
+ * @mapping: The address_space to search.
+ * @index: The page index.
+ * @fgp: %FGP flags modify how the folio is returned.
+ * @gfp: Memory allocation flags to use if %FGP_CREAT is specified.
+ *
+ * Looks up the page cache entry at @mapping & @index.
+ *
+ * @fgp_flags can be zero or more of these flags:
+ *
+ * * %FGP_ACCESSED - The folio will be marked accessed.
+ * * %FGP_LOCK - The folio is returned locked.
+ * * %FGP_ENTRY - If there is a shadow / swap / DAX entry, return it
+ *   instead of allocating a new folio to replace it.
+ * * %FGP_CREAT - If no page is present then a new page is allocated using
+ *   @gfp and added to the page cache and the VM's LRU list.
+ *   The page is returned locked and with an increased refcount.
+ * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
+ *   page is already in cache.  If the page was allocated, unlock it before
+ *   returning so the caller can do the same dance.
+ * * %FGP_WRITE - The page will be written to by the caller.
+ * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
+ * * %FGP_NOWAIT - Don't get blocked by page lock.
+ * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
+ *
+ * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
+ * if the %GFP flags specified for %FGP_CREAT are atomic.
+ *
+ * If there is a page cache page, it is returned with an increased refcount.
+ *
+ * Return: The found folio or %NULL otherwise.
+ */
+struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
+		int fgp, gfp_t gfp)
+{
+	struct folio *folio = xa_filemap_get_folio(mapping, index, fgp, gfp);
+
+	if (xa_is_err(folio))
+		return NULL;
 	return folio;
 }
 EXPORT_SYMBOL(__filemap_get_folio);

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-11 19:36             ` [Cluster-devel] " Matthew Wilcox
@ 2023-01-11 20:52               ` Dave Chinner
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2023-01-11 20:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Andreas Gruenbacher, Dave Chinner,
	Darrick J . Wong, Alexander Viro, linux-xfs, linux-fsdevel,
	linux-ext4, cluster-devel, Christoph Hellwig

On Wed, Jan 11, 2023 at 07:36:26PM +0000, Matthew Wilcox wrote:
> On Tue, Jan 10, 2023 at 07:24:27AM -0800, Christoph Hellwig wrote:
> > On Tue, Jan 10, 2023 at 01:34:16PM +0000, Matthew Wilcox wrote:
> > > > Exactly.  And as I already pointed out in reply to Dave's original
> > > > patch what we really should be doing is returning an ERR_PTR from
> > > > __filemap_get_folio instead of reverse-engineering the expected
> > > > error code.
> > > 
> > > Ouch, we have a nasty problem.
> > > 
> > > If somebody passes FGP_ENTRY, we can return a shadow entry.  And the
> > > encodings for shadow entries overlap with the encodings for ERR_PTR,
> > > meaning that some shadow entries will look like errors.  The way I
> > > solved this in the XArray code is by shifting the error values by
> > > two bits and encoding errors as XA_ERROR(-ENOMEM) (for example).
> > > 
> > > I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS,
> > > but so far we haven't, and I'd like to make that decision intentionally.
> > 
> > So what would be an alternative way to tell the callers why no folio
> > was found instead of trying to reverse engineer that?  Return an errno
> and the folio by reference?  That would work, but the calling conventions
> > would be awful.
> 
> Agreed.  How about an xa_filemap_get_folio()?
> 
> (there are a number of things to fix here; haven't decided if XA_ERROR
> should return void *, or whether I should use a separate 'entry' and
> 'folio' until I know the entry is actually a folio ...)

That's awful. Exposing internal implementation details in the API
that is supposed to abstract away the internal implementation
details from users doesn't seem like a great idea to me.

Exactly what are we trying to fix here?  Do we really need to punch
a hole through the abstraction layers like this just to remove half
a dozen lines of -slow path- context specific error handling from a
single caller?

If there's half a dozen cases that need this sort of handling, then
maybe it's the right thing to do. But for a single calling context
that only needs to add a null return check in one specific case?
There's absolutely no need to make generic infrastructure violate
layering abstractions to handle that...

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-11 20:52               ` [Cluster-devel] " Dave Chinner
@ 2023-01-12  8:41                 ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2023-01-12  8:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Christoph Hellwig, Andreas Gruenbacher,
	Dave Chinner, Darrick J . Wong, Alexander Viro, linux-xfs,
	linux-fsdevel, linux-ext4, cluster-devel, Christoph Hellwig

On Thu, Jan 12, 2023 at 07:52:41AM +1100, Dave Chinner wrote:
> Exposing internal implementation details in the API
> that is supposed to abstract away the internal implementation
> details from users doesn't seem like a great idea to me.

While I somewhat agree with the concern of leaking the xarray
internals, at least they are clearly documented and easy to find.

> Exactly what are we trying to fix here?  Do we really need to punch
> a hole through the abstraction layers like this just to remove half
> a dozen lines of -slow path- context specific error handling from a
> single caller?

The current code, on the other hand (and it's getting worse with your
fix), leaks completely undocumented internal decision making.  So what
this fixes is a real leak of internal logic out of __filemap_get_folio
into the callers.

So as far as I'm concerned we really do need the helper, and anyone
using !FGP_CREAT or FGP_NOWAIT should be using it.  The only question
to me is if exposing the xarray internals is worth it vs the
less optimal calling conventions of needing an extra argument for
the error code.
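
Purely for comparison, the extra-argument variant would look something
like this (a sketch, not code from anyone's tree):

	struct folio *__filemap_get_folio(struct address_space *mapping,
			pgoff_t index, int fgp_flags, gfp_t gfp, int *errp);

	folio = __filemap_get_folio(mapping, index, fgp, gfp, &err);
	if (!folio)
		goto out_no_page;	/* err says why: -EAGAIN, -ENOMEM, ... */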

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-10 13:34         ` [Cluster-devel] " Matthew Wilcox
@ 2023-01-15 17:01           ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2023-01-15 17:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Andreas Gruenbacher, Dave Chinner,
	Alexander Viro, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

On Tue, Jan 10, 2023 at 01:34:16PM +0000, Matthew Wilcox wrote:
> On Tue, Jan 10, 2023 at 12:46:45AM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 09, 2023 at 01:46:42PM +0100, Andreas Gruenbacher wrote:
> > > We can handle that by adding a new IOMAP_NOCREATE iterator flag and
> > > checking for that in iomap_get_folio().  Your patch then turns into
> > > the below.
> > 
> > Exactly.  And as I already pointed out in reply to Dave's original
> > patch what we really should be doing is returning an ERR_PTR from
> > __filemap_get_folio instead of reverse-engineering the expected
> > error code.
> 
> Ouch, we have a nasty problem.
> 
> If somebody passes FGP_ENTRY, we can return a shadow entry.  And the
> encodings for shadow entries overlap with the encodings for ERR_PTR,
> meaning that some shadow entries will look like errors.  The way I
> solved this in the XArray code is by shifting the error values by
> two bits and encoding errors as XA_ERROR(-ENOMEM) (for example).
> 
> I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS,
> but so far we haven't, and I'd like to make that decision intentionally.

Sorry, I'm not following this at all -- where in buffered-io.c does
anyone pass FGP_ENTRY?  Andreas' code doesn't seem to introduce it
either...?

--D

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-15 17:01           ` [Cluster-devel] " Darrick J. Wong
@ 2023-01-15 17:06             ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2023-01-15 17:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Andreas Gruenbacher, Dave Chinner,
	Alexander Viro, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

On Sun, Jan 15, 2023 at 09:01:22AM -0800, Darrick J. Wong wrote:
> On Tue, Jan 10, 2023 at 01:34:16PM +0000, Matthew Wilcox wrote:
> > On Tue, Jan 10, 2023 at 12:46:45AM -0800, Christoph Hellwig wrote:
> > > On Mon, Jan 09, 2023 at 01:46:42PM +0100, Andreas Gruenbacher wrote:
> > > > We can handle that by adding a new IOMAP_NOCREATE iterator flag and
> > > > checking for that in iomap_get_folio().  Your patch then turns into
> > > > the below.
> > > 
> > > Exactly.  And as I already pointed out in reply to Dave's original
> > > patch what we really should be doing is returning an ERR_PTR from
> > > __filemap_get_folio instead of reverse-engineering the expected
> > > error code.
> > 
> > Ouch, we have a nasty problem.
> > 
> > If somebody passes FGP_ENTRY, we can return a shadow entry.  And the
> > encodings for shadow entries overlap with the encodings for ERR_PTR,
> > meaning that some shadow entries will look like errors.  The way I
> > solved this in the XArray code is by shifting the error values by
> > two bits and encoding errors as XA_ERROR(-ENOMEM) (for example).
> > 
> > I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS,
> > but so far we haven't, and I'd like to make that decision intentionally.
> 
> Sorry, I'm not following this at all -- where in buffered-io.c does
> anyone pass FGP_ENTRY?  Andreas' code doesn't seem to introduce it
> either...?

Oh, never mind, I worked out that the conflict is between iomap not
passing FGP_ENTRY and wanting a pointer or a negative errno; and someone
who does FGP_ENTRY, in which case the xarray value can be confused for a
negative errno.

OFC now I wonder, can we simply say that the return value is "The found
folio or NULL if you set FGP_ENTRY; or the found folio or a negative
errno if you don't" ?

--D

> --D

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
  2023-01-10  1:09           ` [Cluster-devel] " Andreas Grünbacher
@ 2023-01-15 17:29             ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2023-01-15 17:29 UTC (permalink / raw)
  To: Andreas Grünbacher
  Cc: Dave Chinner, Andreas Gruenbacher, Christoph Hellwig,
	Alexander Viro, Matthew Wilcox, linux-xfs, linux-fsdevel,
	linux-ext4, cluster-devel

On Tue, Jan 10, 2023 at 02:09:07AM +0100, Andreas Grünbacher wrote:
> On Mon, Jan 9, 2023 at 11:58 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Jan 09, 2023 at 07:45:27PM +0100, Andreas Gruenbacher wrote:
> > > On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote:
> > > > > Eliminate the ->iomap_valid() handler by switching to a ->get_folio()
> > > > > handler and validating the mapping there.
> > > > >
> > > > > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> > > >
> > > > I think this is wrong.
> > > >
> > > > The ->iomap_valid() function handles a fundamental architectural
> > > > issue with cached iomaps: the iomap can become stale at any time
> > > > whilst it is in use by the iomap core code.
> > > >
> > > > The current problem it solves in the iomap_write_begin() path has to
> > > > do with writeback and memory reclaim races over unwritten extents,
> > > > but the general case is that we must be able to check the iomap
> > > > at any point in time to assess its validity.
> > > >
> > > > Indeed, we also have this same "iomap valid check" functionality in the
> > > > writeback code as cached iomaps can become stale due to racing
> > > > writeback, truncation, etc. But you wouldn't know it by looking at the iomap
> > > > writeback code - this is currently hidden by XFS by embedding
> > > > the checks into the iomap writeback ->map_blocks function.
> > > >
> > > > That is, the first thing that xfs_map_blocks() does is check if the
> > > > cached iomap is valid, and if it is valid it returns immediately and
> > > > the iomap writeback code uses it without question.
> > > >
> > > > The reason that this is embedded like this is that the iomap did not
> > > > have a validity cookie field in it, and so the validity information
> > > > was wrapped around the outside of the iomap_writepage_ctx and the
> > > > filesystem has to decode it from that private wrapping structure.
> > > >
> > > > However, the validity information in the structure wrapper is
> > > > identical to the iomap validity cookie,
> > >
> > > Then could that part of the xfs code be converted to use
> > > iomap->validity_cookie so that struct iomap_writepage_ctx can be
> > > eliminated?
> >
> > Yes, that is the plan.
> >
> > >
> > > > and so the direction I've
> > > > been working towards is to replace this implicit, hidden cached
> > > > iomap validity check with an explicit ->iomap_valid call and then
> > > > only call ->map_blocks if the validity check fails (or is not
> > > > implemented).
> > > >
> > > > I want to use the same code for all the iomap validity checks in all
> > > > the iomap core code - this is an iomap issue, the conditions where
> > > > we need to check for iomap validity differ depending on
> > > > the iomap context being run, and the checks are not necessarily
> > > > dependent on first having locked a folio.
> > > >
> > > > Yes, the validity cookie needs to be decoded by the filesystem, but
> > > > that does not dictate where the validity checking needs to be done
> > > > by the iomap core.
> > > >
> > > > Hence I think removing ->iomap_valid is a big step backwards for the
> > > > iomap core code - the iomap core needs to be able to formally verify
> > > > the iomap is valid at any point in time, not just at the point in
> > > > time a folio in the page cache has been locked...
> > >
> > > We don't need to validate an iomap "at any time". There are two specific
> > > places in the code where we need to check, and we're not going to
> > > end up with ten more such places tomorrow.
> >
> > Not immediately, but that doesn't change the fact this is not a
> > filesystem specific issue - it's an inherent characteristic of
> > cached iomaps and unsynchronised extent state changes that occur
> > outside exclusive inode->i_rwsem IO context (e.g. in writeback and
> > IO completion contexts).
> >
> > Racing mmap + buffered writes can expose these state changes as the
> > iomap buffered write IO path is not serialised against the iomap
> > mmap IO path except via folio locks. Hence a mmap page fault can
> > invalidate a cached buffered write iomap by causing a hole ->
> > unwritten, hole -> delalloc or hole -> written conversion in the
> > middle of the buffered write range. The buffered write still has a
> > hole mapping cached for that entire range, and it is now incorrect.
> >
> > If the mmap write happens to change extent state at the trailing
> > edge of a partial buffered write, data corruption will occur if we
> > race just right with writeback and memory reclaim. I'm pretty sure
> > that this corruption can be reproduced on gfs2 if we try hard enough
> > - generic/346 triggers the mmap/write race condition, all that is
> > needed from that point is for writeback and reclaiming pages at
> > exactly the right time...
> >
> > > I'd prefer to keep those
> > > filesystem internals in the filesystem specific code instead of
> > > exposing them to the iomap layer. But that's just me ...
> >
> > My point is that there is nothing XFS specific about these stale
> > cached iomap race conditions, nor is it specifically related to
> > folio locking. The folio locking inversions w.r.t. iomap caching and
> > the interactions with writeback and reclaim are simply the
> > manifestation that brought the issue to our attention.
> >
> > This is why I think hiding iomap validation in filesystem specific page
> > cache allocation/lookup functions is entirely the wrong layer to be
> > doing iomap validity checks. Especially as it prevents us from
> > adding more validity checks in the core infrastructure when we need
> > them in future.
> >
> > AFAIC, an iomap must carry with it a method for checking
> > that it is still valid. We need it in the write path, we need it in
> > the writeback path. If we want to relax the restrictions on clone
> > operations (e.g. shared locking on the source file), we'll need to
> > be able to detect stale cached iomaps in those paths, too. And I
> > haven't really thought through all the implications of shared
> > locking on buffered writes yet, but that may well require more
> > checks in other places as well.
> >
> > > If we ignore this particular commit for now, do you have any
> > > objections to the patches in this series? If not, it would be great if
> > > we could add the other patches to iomap-for-next.
> >
> > I still don't like moving page cache operations into individual
> > filesystems, but for the moment I can live with the IOMAP_NOCREATE
> > hack to drill iomap state through the filesystem without the
> > filesystem being aware of it.
> 
> Alright, works for me. Darrick?

Works for me too.

I've wondered if IOMAP_NOCREATE could be useful for more things (e.g.
determining if part of a file has been cached) though I've not thought
of a good usecase for that.  Maybe something along the lines of a
"userspace wants us to redirty this critical file after fsync returned
EIO" type thing?

> > > By the way, I'm still not sure if gfs2 is affected by this whole iomap
> > > validation drama given that it neither implements unwritten extents
> > > nor delayed allocation. This is a mess.
> >
> > See above - I'm pretty sure it will be, but it may be very difficult
> > to expose. After all, it's taken several years before anyone noticed
> > this issue with XFS, even though we were aware of the issue of stale
> > cached iomaps causing data corruption in the writeback path....
> 
> Okay, that's all pretty ugly. Thanks a lot for the detailed explanation.

I don't have any objections to pulling everything except patches 8 and
10 for testing this week.  I find myself more in agreement with
Christoph and Andreas that whoever gets the folio is also responsible
for knowing if revalidating the mapping is necessary and then doing it.
However, I still have enough questions about the mapping revalidation to
make that a separate discussion.

Questions, namely:

1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
don't think it does, but OTOH zone pointer management might complicate
that.

2. How about porting the writeback iomap validation to use this
mechanism?  (I suspect Dave might already be working on this...)

3. Do we need to revalidate mappings for directio writes?  I think the
answer is no (for xfs) because the ->iomap_begin call will allocate
whatever blocks are needed and truncate/punch/reflink block on the
iolock while the directio writes are pending, so you'll never end up
with a stale mapping.  But I don't know if that statement applies
generally...

--D

> Cheers,
> Andreas
> 
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-15 17:06             ` [Cluster-devel] " Darrick J. Wong
@ 2023-01-16  5:46               ` Matthew Wilcox
  -1 siblings, 0 replies; 82+ messages in thread
From: Matthew Wilcox @ 2023-01-16  5:46 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Andreas Gruenbacher, Dave Chinner,
	Alexander Viro, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

On Sun, Jan 15, 2023 at 09:06:50AM -0800, Darrick J. Wong wrote:
> On Sun, Jan 15, 2023 at 09:01:22AM -0800, Darrick J. Wong wrote:
> > On Tue, Jan 10, 2023 at 01:34:16PM +0000, Matthew Wilcox wrote:
> > > On Tue, Jan 10, 2023 at 12:46:45AM -0800, Christoph Hellwig wrote:
> > > > On Mon, Jan 09, 2023 at 01:46:42PM +0100, Andreas Gruenbacher wrote:
> > > > > We can handle that by adding a new IOMAP_NOCREATE iterator flag and
> > > > > checking for that in iomap_get_folio().  Your patch then turns into
> > > > > the below.
> > > > 
> > > > Exactly.  And as I already pointed out in reply to Dave's original
> > > > patch what we really should be doing is returning an ERR_PTR from
> > > > __filemap_get_folio instead of reverse-engineering the expected
> > > > error code.
> > > 
> > > Ouch, we have a nasty problem.
> > > 
> > > If somebody passes FGP_ENTRY, we can return a shadow entry.  And the
> > > encodings for shadow entries overlap with the encodings for ERR_PTR,
> > > meaning that some shadow entries will look like errors.  The way I
> > > solved this in the XArray code is by shifting the error values by
> > > two bits and encoding errors as XA_ERROR(-ENOMEM) (for example).
> > > 
> > > I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS,
> > > but so far we haven't, and I'd like to make that decision intentionally.
> > 
> > Sorry, I'm not following this at all -- where in buffered-io.c does
> > anyone pass FGP_ENTRY?  Andreas' code doesn't seem to introduce it
> > either...?
> 
> Oh, never mind, I worked out that the conflict is between iomap not
> passing FGP_ENTRY and wanting a pointer or a negative errno; and someone
> who does FGP_ENTRY, in which case the xarray value can be confused for a
> negative errno.
> 
> OFC now I wonder, can we simply say that the return value is "The found
> folio or NULL if you set FGP_ENTRY; or the found folio or a negative
> errno if you don't" ?

Erm ... I would rather not!

Part of me remembers that x86-64 has the rather nice calling convention
of being able to return a struct containing two values in two registers:

: Integer return values up to 64 bits in size are stored in RAX while
: values up to 128 bit are stored in RAX and RDX.

so maybe we can return:

struct OptionFolio {
	int err;
	struct folio *folio;
};
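
which a caller could then unpack along these lines (sketch only, the
function name is invented):

	struct OptionFolio ret = filemap_get_folio_opt(mapping, index, fgp, gfp);

	if (ret.err)
		return ret.err;
	folio = ret.folio;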

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-16  5:46               ` [Cluster-devel] " Matthew Wilcox
@ 2023-01-16  7:34                 ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2023-01-16  7:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Christoph Hellwig, Andreas Gruenbacher,
	Dave Chinner, Alexander Viro, linux-xfs, linux-fsdevel,
	linux-ext4, cluster-devel, Christoph Hellwig

On Mon, Jan 16, 2023 at 05:46:01AM +0000, Matthew Wilcox wrote:
> > OFC now I wonder, can we simply say that the return value is "The found
> > folio or NULL if you set FGP_ENTRY; or the found folio or a negative
> > errno if you don't" ?
> 
> Erm ... I would rather not!

Agreed.

> 
> Part of me remembers that x86-64 has the rather nice calling convention
> of being able to return a struct containing two values in two registers:

We could do that.  But while reading what Darrick wrote I came up with
another idea I quite like.  Just split the FGP_ENTRY handling into
a separate helper.  The logic and use cases are quite different from
the normal page cache lookup, and the returning of the xarray entry
is exactly the kind of layering violation that Dave is complaining
about.  So what about just splitting that use case into a separate
self contained helper?
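
With such a split, every caller of the entry-returning helper has to
handle three cases, roughly (illustrative only, mirroring the
conversions in the patch below):

	folio = __filemap_get_folio_entry(mapping, index, FGP_LOCK);
	if (!folio) {
		/* nothing in the cache */
	} else if (xa_is_value(folio)) {
		/* shadow / swap / DAX entry, not a folio */
	} else {
		/* a real folio, locked because of FGP_LOCK */
	}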

---
From b4d10f98ea57f8480c03c0b00abad6f2b7186f56 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Mon, 16 Jan 2023 08:26:57 +0100
Subject: mm: replace FGP_ENTRY with a new __filemap_get_folio_entry helper

Split the xarray entry returning logic into a separate helper.  This will
allow returning ERR_PTRs from __filemap_get_folio, and also isolates the
logic that needs to know about xarray internals into a separate
function.  This causes some code duplication, but as most flags to
__filemap_get_folio are not applicable for the users that care about an
entry, that amount is very limited.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/pagemap.h |  6 +++--
 mm/filemap.c            | 50 ++++++++++++++++++++++++++++++++++++-----
 mm/huge_memory.c        |  4 ++--
 mm/shmem.c              |  5 ++---
 mm/swap_state.c         |  2 +-
 5 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4b3a7124c76712..e06c14b610caf2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -504,8 +504,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
 #define FGP_NOFS		0x00000010
 #define FGP_NOWAIT		0x00000020
 #define FGP_FOR_MMAP		0x00000040
-#define FGP_ENTRY		0x00000080
-#define FGP_STABLE		0x00000100
+#define FGP_STABLE		0x00000080
 
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		int fgp_flags, gfp_t gfp);
@@ -546,6 +545,9 @@ static inline struct folio *filemap_lock_folio(struct address_space *mapping,
 	return __filemap_get_folio(mapping, index, FGP_LOCK, 0);
 }
 
+struct folio *__filemap_get_folio_entry(struct address_space *mapping,
+		pgoff_t index, int fgp_flags);
+
 /**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search
diff --git a/mm/filemap.c b/mm/filemap.c
index c4d4ace9cc7003..d04613347b3e71 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1887,8 +1887,6 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
  *
  * * %FGP_ACCESSED - The folio will be marked accessed.
  * * %FGP_LOCK - The folio is returned locked.
- * * %FGP_ENTRY - If there is a shadow / swap / DAX entry, return it
- *   instead of allocating a new folio to replace it.
  * * %FGP_CREAT - If no page is present then a new page is allocated using
  *   @gfp and added to the page cache and the VM's LRU list.
  *   The page is returned locked and with an increased refcount.
@@ -1914,11 +1912,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 
 repeat:
 	folio = mapping_get_entry(mapping, index);
-	if (xa_is_value(folio)) {
-		if (fgp_flags & FGP_ENTRY)
-			return folio;
+	if (xa_is_value(folio))
 		folio = NULL;
-	}
 	if (!folio)
 		goto no_page;
 
@@ -1994,6 +1989,49 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 }
 EXPORT_SYMBOL(__filemap_get_folio);
 
+
+/**
+ * __filemap_get_folio_entry - Find and get a reference to a folio.
+ * @mapping: The address_space to search.
+ * @index: The page index.
+ * @fgp_flags: %FGP flags modify how the folio is returned.
+ *
+ * Looks up the page cache entry at @mapping & @index.  If there is a shadow /
+ * swap / DAX entry, return it instead of allocating a new folio to replace it.
+ *
+ * @fgp_flags can be zero or more of these flags:
+ *
+ * * %FGP_LOCK - The folio is returned locked.
+ *
+ * If there is a page cache page, it is returned with an increased refcount.
+ *
+ * Return: The found folio, the shadow / swap / DAX entry, or %NULL otherwise.
+ */
+struct folio *__filemap_get_folio_entry(struct address_space *mapping,
+		pgoff_t index, int fgp_flags)
+{
+	struct folio *folio;
+
+	if (WARN_ON_ONCE(fgp_flags & ~FGP_LOCK))
+		return NULL;
+
+repeat:
+	folio = mapping_get_entry(mapping, index);
+	if (folio && !xa_is_value(folio) && (fgp_flags & FGP_LOCK)) {
+		folio_lock(folio);
+
+		/* Has the page been truncated? */
+		if (unlikely(folio->mapping != mapping)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			goto repeat;
+		}
+		VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
+	}
+
+	return folio;
+}
+
 static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
 		xa_mark_t mark)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index abe6cfd92ffa0e..88b517c338a6db 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3088,10 +3088,10 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 	mapping = candidate->f_mapping;
 
 	for (index = off_start; index < off_end; index += nr_pages) {
-		struct folio *folio = __filemap_get_folio(mapping, index,
-						FGP_ENTRY, 0);
+		struct folio *folio;
 
 		nr_pages = 1;
+		folio = __filemap_get_folio_entry(mapping, index, 0);
 		if (xa_is_value(folio) || !folio)
 			continue;
 
diff --git a/mm/shmem.c b/mm/shmem.c
index c301487be5fb40..0a36563ef7a0c1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -888,8 +888,7 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
 	 * At first avoid shmem_get_folio(,,,SGP_READ): that fails
 	 * beyond i_size, and reports fallocated pages as holes.
 	 */
-	folio = __filemap_get_folio(inode->i_mapping, index,
-					FGP_ENTRY | FGP_LOCK, 0);
+	folio = __filemap_get_folio_entry(inode->i_mapping, index, FGP_LOCK);
 	if (!xa_is_value(folio))
 		return folio;
 	/*
@@ -1860,7 +1859,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	sbinfo = SHMEM_SB(inode->i_sb);
 	charge_mm = vma ? vma->vm_mm : NULL;
 
-	folio = __filemap_get_folio(mapping, index, FGP_ENTRY | FGP_LOCK, 0);
+	folio = __filemap_get_folio_entry(mapping, index, FGP_LOCK);
 	if (folio && vma && userfaultfd_minor(vma)) {
 		if (!xa_is_value(folio)) {
 			folio_unlock(folio);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2927507b43d819..1f45241987aea2 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -384,7 +384,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
 {
 	swp_entry_t swp;
 	struct swap_info_struct *si;
-	struct folio *folio = __filemap_get_folio(mapping, index, FGP_ENTRY, 0);
+	struct folio *folio = __filemap_get_folio_entry(mapping, index, 0);
 
 	if (!xa_is_value(folio))
 		goto out;
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-16  7:34                 ` [Cluster-devel] " Christoph Hellwig
@ 2023-01-16 13:18                   ` Matthew Wilcox
  -1 siblings, 0 replies; 82+ messages in thread
From: Matthew Wilcox @ 2023-01-16 13:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Andreas Gruenbacher, Dave Chinner,
	Alexander Viro, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel, Christoph Hellwig

On Sun, Jan 15, 2023 at 11:34:26PM -0800, Christoph Hellwig wrote:
> We could do that.  But while reading what Darrick wrote I came up with
> another idea I quite like.  Just split the FGP_ENTRY handling into
> a separate helper.  The logic and use cases are quite different from
> the normal page cache lookup, and the returning of the xarray entry
> is exactly the kind of layering violation that Dave is complaining
> about.  So what about just splitting that use case into a separate
> self contained helper?

Essentially reverting 44835d20b2a0.  Although we retain the merging of
the lock & get functions via the use of FGP flags.  Let me think about
it for a day.

> ---
> From b4d10f98ea57f8480c03c0b00abad6f2b7186f56 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Mon, 16 Jan 2023 08:26:57 +0100
> Subject: mm: replace FGP_ENTRY with a new __filemap_get_folio_entry helper
> 
> Split the xarray entry returning logic into a separate helper.  This will
> allow returning ERR_PTRs from __filemap_get_folio, and also isolate the
> logic that needs to know about xarray internals into a separate
> function.  This causes some code duplication, but as most flags to
> __filemap_get_folio are not applicable to the users that care about an
> entry, that amount is very limited.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/linux/pagemap.h |  6 +++--
>  mm/filemap.c            | 50 ++++++++++++++++++++++++++++++++++++-----
>  mm/huge_memory.c        |  4 ++--
>  mm/shmem.c              |  5 ++---
>  mm/swap_state.c         |  2 +-
>  5 files changed, 53 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 4b3a7124c76712..e06c14b610caf2 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -504,8 +504,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
>  #define FGP_NOFS		0x00000010
>  #define FGP_NOWAIT		0x00000020
>  #define FGP_FOR_MMAP		0x00000040
> -#define FGP_ENTRY		0x00000080
> -#define FGP_STABLE		0x00000100
> +#define FGP_STABLE		0x00000080
>  
>  struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		int fgp_flags, gfp_t gfp);
> @@ -546,6 +545,9 @@ static inline struct folio *filemap_lock_folio(struct address_space *mapping,
>  	return __filemap_get_folio(mapping, index, FGP_LOCK, 0);
>  }
>  
> +struct folio *__filemap_get_folio_entry(struct address_space *mapping,
> +		pgoff_t index, int fgp_flags);
> +
>  /**
>   * find_get_page - find and get a page reference
>   * @mapping: the address_space to search
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c4d4ace9cc7003..d04613347b3e71 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1887,8 +1887,6 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
>   *
>   * * %FGP_ACCESSED - The folio will be marked accessed.
>   * * %FGP_LOCK - The folio is returned locked.
> - * * %FGP_ENTRY - If there is a shadow / swap / DAX entry, return it
> - *   instead of allocating a new folio to replace it.
>   * * %FGP_CREAT - If no page is present then a new page is allocated using
>   *   @gfp and added to the page cache and the VM's LRU list.
>   *   The page is returned locked and with an increased refcount.
> @@ -1914,11 +1912,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  
>  repeat:
>  	folio = mapping_get_entry(mapping, index);
> -	if (xa_is_value(folio)) {
> -		if (fgp_flags & FGP_ENTRY)
> -			return folio;
> +	if (xa_is_value(folio))
>  		folio = NULL;
> -	}
>  	if (!folio)
>  		goto no_page;
>  
> @@ -1994,6 +1989,49 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  }
>  EXPORT_SYMBOL(__filemap_get_folio);
>  
> +
> +/**
> + * __filemap_get_folio_entry - Find and get a reference to a folio.
> + * @mapping: The address_space to search.
> + * @index: The page index.
> + * @fgp_flags: %FGP flags modify how the folio is returned.
> + *
> + * Looks up the page cache entry at @mapping & @index.  If there is a shadow /
> + * swap / DAX entry, return it instead of allocating a new folio to replace it.
> + *
> + * @fgp_flags can be zero or more of these flags:
> + *
> + * * %FGP_LOCK - The folio is returned locked.
> + *
> + * If there is a page cache page, it is returned with an increased refcount.
> + *
> + * Return: The found folio or %NULL otherwise.
> + */
> +struct folio *__filemap_get_folio_entry(struct address_space *mapping,
> +		pgoff_t index, int fgp_flags)
> +{
> +	struct folio *folio;
> +
> +	if (WARN_ON_ONCE(fgp_flags & ~FGP_LOCK))
> +		return NULL;
> +
> +repeat:
> +	folio = mapping_get_entry(mapping, index);
> +	if (folio && !xa_is_value(folio) && (fgp_flags & FGP_LOCK)) {
> +		folio_lock(folio);
> +
> +		/* Has the page been truncated? */
> +		if (unlikely(folio->mapping != mapping)) {
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			goto repeat;
> +		}
> +		VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
> +	}
> +
> +	return folio;
> +}
> +
>  static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
>  		xa_mark_t mark)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index abe6cfd92ffa0e..88b517c338a6db 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3088,10 +3088,10 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
>  	mapping = candidate->f_mapping;
>  
>  	for (index = off_start; index < off_end; index += nr_pages) {
> -		struct folio *folio = __filemap_get_folio(mapping, index,
> -						FGP_ENTRY, 0);
> +		struct folio *folio;
>  
>  		nr_pages = 1;
> +		folio = __filemap_get_folio_entry(mapping, index, 0);
>  		if (xa_is_value(folio) || !folio)
>  			continue;
>  
> diff --git a/mm/shmem.c b/mm/shmem.c
> index c301487be5fb40..0a36563ef7a0c1 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -888,8 +888,7 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
>  	 * At first avoid shmem_get_folio(,,,SGP_READ): that fails
>  	 * beyond i_size, and reports fallocated pages as holes.
>  	 */
> -	folio = __filemap_get_folio(inode->i_mapping, index,
> -					FGP_ENTRY | FGP_LOCK, 0);
> +	folio = __filemap_get_folio_entry(inode->i_mapping, index, FGP_LOCK);
>  	if (!xa_is_value(folio))
>  		return folio;
>  	/*
> @@ -1860,7 +1859,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  	sbinfo = SHMEM_SB(inode->i_sb);
>  	charge_mm = vma ? vma->vm_mm : NULL;
>  
> -	folio = __filemap_get_folio(mapping, index, FGP_ENTRY | FGP_LOCK, 0);
> +	folio = __filemap_get_folio_entry(mapping, index, FGP_LOCK);
>  	if (folio && vma && userfaultfd_minor(vma)) {
>  		if (!xa_is_value(folio)) {
>  			folio_unlock(folio);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 2927507b43d819..1f45241987aea2 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -384,7 +384,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
>  {
>  	swp_entry_t swp;
>  	struct swap_info_struct *si;
> -	struct folio *folio = __filemap_get_folio(mapping, index, FGP_ENTRY, 0);
> +	struct folio *folio = __filemap_get_folio_entry(mapping, index, 0);
>  
>  	if (!xa_is_value(folio))
>  		goto out;
> -- 
> 2.39.0
> 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 04/10] iomap: Add iomap_get_folio helper
  2023-01-16 13:18                   ` [Cluster-devel] " Matthew Wilcox
@ 2023-01-16 16:02                     ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2023-01-16 16:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Darrick J. Wong, Andreas Gruenbacher,
	Dave Chinner, Alexander Viro, linux-xfs, linux-fsdevel,
	linux-ext4, cluster-devel, Christoph Hellwig

On Mon, Jan 16, 2023 at 01:18:07PM +0000, Matthew Wilcox wrote:
> Essentially reverting 44835d20b2a0.

Yep.

> Although we retain the merging of
> the lock & get functions via the use of FGP flags.  Let me think about
> it for a day.

Yes.  But looking at the code again I wonder if even that is needed.
Out of the users of FGP_ENTRY / __filemap_get_folio_entry:

 - split_huge_pages_in_file really should not be using it at all,
   given that it checks for xa_is_value and treats that as !folio
 - one doesn't pass FGP_LOCK and could just use filemap_get_entry
 - the other two are in shmem, so we could move the locking logic
   there (and maybe in future optimize it in the callers)

That would be something like this, although it should be split into
two or three patches:

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 29e1f9e76eb6dd..ecd1ff40a80621 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -504,9 +504,9 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
 #define FGP_NOFS		0x00000010
 #define FGP_NOWAIT		0x00000020
 #define FGP_FOR_MMAP		0x00000040
-#define FGP_ENTRY		0x00000080
-#define FGP_STABLE		0x00000100
+#define FGP_STABLE		0x00000080
 
+void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		int fgp_flags, gfp_t gfp);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
diff --git a/mm/filemap.c b/mm/filemap.c
index c4d4ace9cc7003..85bd86c44e14d2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1832,7 +1832,7 @@ EXPORT_SYMBOL(page_cache_prev_miss);
  */
 
 /*
- * mapping_get_entry - Get a page cache entry.
+ * filemap_get_entry - Get a page cache entry.
  * @mapping: the address_space to search
  * @index: The page cache index.
  *
@@ -1843,7 +1843,7 @@ EXPORT_SYMBOL(page_cache_prev_miss);
  *
  * Return: The folio, swap or shadow entry, %NULL if nothing is found.
  */
-static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
+void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
 	struct folio *folio;
@@ -1887,8 +1887,6 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
  *
  * * %FGP_ACCESSED - The folio will be marked accessed.
  * * %FGP_LOCK - The folio is returned locked.
- * * %FGP_ENTRY - If there is a shadow / swap / DAX entry, return it
- *   instead of allocating a new folio to replace it.
  * * %FGP_CREAT - If no page is present then a new page is allocated using
  *   @gfp and added to the page cache and the VM's LRU list.
  *   The page is returned locked and with an increased refcount.
@@ -1913,12 +1911,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 	struct folio *folio;
 
 repeat:
-	folio = mapping_get_entry(mapping, index);
-	if (xa_is_value(folio)) {
-		if (fgp_flags & FGP_ENTRY)
-			return folio;
+	folio = filemap_get_entry(mapping, index);
+	if (xa_is_value(folio))
 		folio = NULL;
-	}
 	if (!folio)
 		goto no_page;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index abe6cfd92ffa0e..b182eb99044e9a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3088,11 +3088,10 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 	mapping = candidate->f_mapping;
 
 	for (index = off_start; index < off_end; index += nr_pages) {
-		struct folio *folio = __filemap_get_folio(mapping, index,
-						FGP_ENTRY, 0);
+		struct folio *folio = filemap_get_folio(mapping, index);
 
 		nr_pages = 1;
-		if (xa_is_value(folio) || !folio)
+		if (!folio)
 			continue;
 
 		if (!folio_test_large(folio))
diff --git a/mm/shmem.c b/mm/shmem.c
index 028675cd97d445..4650192dbcb91b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -880,6 +880,28 @@ void shmem_unlock_mapping(struct address_space *mapping)
 	}
 }
 
+static struct folio *shmem_get_entry(struct address_space *mapping,
+		pgoff_t index)
+{
+	struct folio *folio;
+
+repeat:
+	folio = filemap_get_entry(mapping, index);
+	if (folio && !xa_is_value(folio)) {
+		folio_lock(folio);
+
+		/* Has the page been truncated? */
+		if (unlikely(folio->mapping != mapping)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			goto repeat;
+		}
+		VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
+	}
+
+	return folio;
+}
+
 static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
 {
 	struct folio *folio;
@@ -888,8 +910,7 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
 	 * At first avoid shmem_get_folio(,,,SGP_READ): that fails
 	 * beyond i_size, and reports fallocated pages as holes.
 	 */
-	folio = __filemap_get_folio(inode->i_mapping, index,
-					FGP_ENTRY | FGP_LOCK, 0);
+	folio = shmem_get_entry(inode->i_mapping, index);
 	if (!xa_is_value(folio))
 		return folio;
 	/*
@@ -1860,7 +1881,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	sbinfo = SHMEM_SB(inode->i_sb);
 	charge_mm = vma ? vma->vm_mm : NULL;
 
-	folio = __filemap_get_folio(mapping, index, FGP_ENTRY | FGP_LOCK, 0);
+	folio = shmem_get_entry(mapping, index);
 	if (folio && vma && userfaultfd_minor(vma)) {
 		if (!xa_is_value(folio)) {
 			folio_unlock(folio);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2927507b43d819..e7f2083ad7e40a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -384,7 +384,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
 {
 	swp_entry_t swp;
 	struct swap_info_struct *si;
-	struct folio *folio = __filemap_get_folio(mapping, index, FGP_ENTRY, 0);
+	struct folio *folio = filemap_get_entry(mapping, index);
 
 	if (!xa_is_value(folio))
 		goto out;

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
  2023-01-15 17:29             ` [Cluster-devel] " Darrick J. Wong
@ 2023-01-18  7:21               ` Christoph Hellwig
  -1 siblings, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2023-01-18  7:21 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andreas Grünbacher, Dave Chinner, Andreas Gruenbacher,
	Christoph Hellwig, Alexander Viro, Damien Le Moal,
	Matthew Wilcox, linux-xfs, linux-fsdevel, linux-ext4,
	cluster-devel

On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> I don't have any objections to pulling everything except patches 8 and
> 10 for testing this week. 

That would be great.  I now have a series to return the ERR_PTR
from __filemap_get_folio which will cause a minor conflict, but
I think that's easy enough for Linus to handle.
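
(For illustration, a hedged sketch of the caller-side pattern such a
series implies, not its actual diff: callers switch from NULL checks
to IS_ERR().)

	folio = __filemap_get_folio(mapping, index, FGP_LOCK | FGP_CREAT,
			mapping_gfp_mask(mapping));
	if (IS_ERR(folio))	/* was: if (!folio) */
		return PTR_ERR(folio);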

> 
> 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
> don't think it does, but OTOH zone pointer management might complicate
> that.

Adding Damien.

> 2. How about porting the writeback iomap validation to use this
> mechanism?  (I suspect Dave might already be working on this...)

What is "this mechanism"?  Do you mean the here removed ->iomap_valid
?   writeback calls into ->map_blocks for every block while under the
folio lock, so the validation can (and for XFS currently is) done
in that.  Moving it out into a separate method with extra indirect
functiona call overhead and interactions between the methods seems
like a retrograde step to me.
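
For illustration, a hedged sketch of that pattern; example_seq_changed()
and example_lookup_mapping() are hypothetical stand-ins, not the actual
XFS helpers:

static int example_map_blocks(struct iomap_writepage_ctx *wpc,
		struct inode *inode, loff_t offset)
{
	/*
	 * Runs for each block with the folio locked.  Reuse the cached
	 * mapping only if it still covers this offset and the extent
	 * tree has not changed since the mapping was looked up.  On the
	 * first call wpc->iomap is zeroed, so we always do the lookup.
	 */
	if (offset >= wpc->iomap.offset &&
	    offset < wpc->iomap.offset + wpc->iomap.length &&
	    !example_seq_changed(inode, wpc))
		return 0;

	return example_lookup_mapping(inode, offset, &wpc->iomap);
}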

> 2. Do we need to revalidate mappings for directio writes?  I think the
> answer is no (for xfs) because the ->iomap_begin call will allocate
> whatever blocks are needed and truncate/punch/reflink block on the
> iolock while the directio writes are pending, so you'll never end up
> with a stale mapping.

Yes.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
  2023-01-18  7:21               ` [Cluster-devel] " Christoph Hellwig
@ 2023-01-18  9:11                 ` Damien Le Moal
  -1 siblings, 0 replies; 82+ messages in thread
From: Damien Le Moal @ 2023-01-18  9:11 UTC (permalink / raw)
  To: Christoph Hellwig, Darrick J. Wong
  Cc: Andreas Grünbacher, Dave Chinner, Andreas Gruenbacher,
	Alexander Viro, Damien Le Moal, Matthew Wilcox, linux-xfs,
	linux-fsdevel, linux-ext4, cluster-devel

On 1/18/23 16:21, Christoph Hellwig wrote:
> On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
>> I don't have any objections to pulling everything except patches 8 and
>> 10 for testing this week. 
> 
> That would be great.  I now have a series to return the ERR_PTR
> from __filemap_get_folio which will cause a minor conflict, but
> I think that's easy enough for Linux to handle.
> 
>>
>> 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
>> don't think it does, but OTOH zone pointer management might complicate
>> that.
> 
> Adding Damien.

zonefs has a static mapping of file blocks that never changes and is fully
populated up to the file's maximum size from mount. So zonefs is not using the
iomap_valid page operation. In fact, zonefs is not even using struct
iomap_page_ops.
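
For illustration only, a sketch of such a static mapping (hypothetical
helpers, not the actual zonefs code):

static int example_static_iomap_begin(struct inode *inode, loff_t offset,
		loff_t length, unsigned int flags, struct iomap *iomap,
		struct iomap *srcmap)
{
	/* Layout is fixed at mount time: no locking, nothing to revalidate. */
	iomap->type = IOMAP_MAPPED;
	iomap->offset = 0;
	iomap->length = example_max_size(inode);	/* hypothetical */
	iomap->addr = example_zone_start(inode);	/* hypothetical */
	iomap->bdev = inode->i_sb->s_bdev;
	return 0;
}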

> 
>> 2. How about porting the writeback iomap validation to use this
>> mechanism?  (I suspect Dave might already be working on this...)
> 
> What is "this mechanism"?  Do you mean the here removed ->iomap_valid
> ?   writeback calls into ->map_blocks for every block while under the
> folio lock, so the validation can (and for XFS currently is) done
> in that.  Moving it out into a separate method with extra indirect
> functiona call overhead and interactions between the methods seems
> like a retrograde step to me.
> 
>> 2. Do we need to revalidate mappings for directio writes?  I think the
>> answer is no (for xfs) because the ->iomap_begin call will allocate
>> whatever blocks are needed and truncate/punch/reflink block on the
>> iolock while the directio writes are pending, so you'll never end up
>> with a stale mapping.
> 
> Yes.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
  2023-01-18  7:21               ` [Cluster-devel] " Christoph Hellwig
@ 2023-01-18 19:04                 ` Darrick J. Wong
  -1 siblings, 0 replies; 82+ messages in thread
From: Darrick J. Wong @ 2023-01-18 19:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Grünbacher, Dave Chinner, Andreas Gruenbacher,
	Alexander Viro, Damien Le Moal, Matthew Wilcox, linux-xfs,
	linux-fsdevel, linux-ext4, cluster-devel

On Tue, Jan 17, 2023 at 11:21:38PM -0800, Christoph Hellwig wrote:
> On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> > I don't have any objections to pulling everything except patches 8 and
> > 10 for testing this week. 
> 
> That would be great.  I now have a series to return the ERR_PTR
> from __filemap_get_folio which will cause a minor conflict, but
> I think that's easy enough for Linus to handle.

Ok, done.

> > 
> > 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
> > don't think it does, but OTOH zone pointer management might complicate
> > that.
> 
> Adding Damien.
> 
> > 2. How about porting the writeback iomap validation to use this
> > mechanism?  (I suspect Dave might already be working on this...)
> 
> What is "this mechanism"?  Do you mean the here removed ->iomap_valid
> ?   writeback calls into ->map_blocks for every block while under the
> folio lock, so the validation can (and for XFS currently is) done
> in that.  Moving it out into a separate method with extra indirect
> functiona call overhead and interactions between the methods seems
> like a retrograde step to me.

Sorry, I should've been more specific -- can xfs writeback use the
validity cookie in struct iomap and thereby get rid of struct
xfs_writepage_ctx entirely?

> > 2. Do we need to revalidate mappings for directio writes?  I think the
> > answer is no (for xfs) because the ->iomap_begin call will allocate
> > whatever blocks are needed and truncate/punch/reflink block on the
> > iolock while the directio writes are pending, so you'll never end up
> > with a stale mapping.
> 
> Yes.

Er... yes as in "Yes, we *do* need to revalidate directio writes", or
"Yes, your reasoning is correct"?

--D

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
  2023-01-18 19:04                 ` [Cluster-devel] " Darrick J. Wong
@ 2023-01-18 19:57                   ` Andreas Grünbacher
  -1 siblings, 0 replies; 82+ messages in thread
From: Andreas Grünbacher @ 2023-01-18 19:57 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Dave Chinner, Andreas Gruenbacher,
	Alexander Viro, Damien Le Moal, Matthew Wilcox, linux-xfs,
	linux-fsdevel, linux-ext4, cluster-devel

On Wed, 18 Jan 2023 at 20:04, Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Jan 17, 2023 at 11:21:38PM -0800, Christoph Hellwig wrote:
> > On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> > > I don't have any objections to pulling everything except patches 8 and
> > > 10 for testing this week.
> >
> > That would be great.  I now have a series to return the ERR_PTR
> > from __filemap_get_folio which will cause a minor conflict, but
> > I think that's easy enough for Linus to handle.
>
> Ok, done.
>
> > >
> > > 1. Does zonefs need to revalidate mappings?  The mappings are 1:1 so I
> > > don't think it does, but OTOH zone pointer management might complicate
> > > that.
> >
> > Adding Damien.
> >
> > > 2. How about porting the writeback iomap validation to use this
> > > mechanism?  (I suspect Dave might already be working on this...)
> >
> > What is "this mechanism"?  Do you mean the ->iomap_valid handler removed
> > here?  Writeback calls into ->map_blocks for every block while under the
> > folio lock, so the validation can be (and for XFS currently is) done
> > there.  Moving it out into a separate method with extra indirect
> > function call overhead and interactions between the methods seems
> > like a retrograde step to me.
>
> Sorry, I should've been more specific -- can xfs writeback use the
> validity cookie in struct iomap and thereby get rid of struct
> xfs_writepage_ctx entirely?

Already asked and answered in the same thread:

https://lore.kernel.org/linux-fsdevel/20230109225453.GQ1971568@dread.disaster.area/

> > > 2. Do we need to revalidate mappings for directio writes?  I think the
> > > answer is no (for xfs) because the ->iomap_begin call will allocate
> > > whatever blocks are needed and truncate/punch/reflink block on the
> > > iolock while the directio writes are pending, so you'll never end up
> > > with a stale mapping.
> >
> > Yes.
>
> Er... yes as in "Yes, we *do* need to revalidate directio writes", or
> "Yes, your reasoning is correct"?
>
> --D

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
  2023-01-15 17:29             ` [Cluster-devel] " Darrick J. Wong
@ 2023-01-18 21:42               ` Dave Chinner
  -1 siblings, 0 replies; 82+ messages in thread
From: Dave Chinner @ 2023-01-18 21:42 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andreas Grünbacher, Andreas Gruenbacher, Christoph Hellwig,
	Alexander Viro, Matthew Wilcox, linux-xfs, linux-fsdevel,
	linux-ext4, cluster-devel

On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote:
> 2. Do we need to revalidate mappings for directio writes?  I think the
> answer is no (for xfs) because the ->iomap_begin call will allocate
> whatever blocks are needed and truncate/punch/reflink block on the
> iolock while the directio writes are pending, so you'll never end up
> with a stale mapping.  But I don't know if that statement applies
> generally...

The issue is not truncate/punch/reflink for either DIO or buffered
IO - the issue that leads to stale iomaps is async extent state.
i.e. IO completion doing unwritten extent conversion.

For DIO, AIO doesn't hold the IOLOCK at all when completion is run
(like buffered writeback), but non-AIO DIO writes hold the IOLOCK
shared while waiting for completion. This means that we can have DIO
submission and completion still running concurrently, and so stale
iomaps are a definite possibility.

From my notes when I looked at this:

1. the race condition for a DIO write mapping going stale is an
overlapping DIO completion converting the block from unwritten
to written, and then the DIO write incorrectly issuing sub-block
zeroing because the mapping is now stale.

2. DIO read into a hole or unwritten extent zeroes the entire range
in the user buffer in one operation. If this is a large range, this
could race with small DIO writes within that range that have
completed.

3. There is a window between DIO write completion doing unwritten
extent conversion (by ->end_io) and the page cache being
invalidated, during which buffered read maps can be stale and
incorrect read behaviour can be exposed to userspace.

These all stem from IO having overlapping ranges, which is largely
unsupported but can't be entirely prevented (e.g. backup
applications running in the background). Largely the problems are
confined to sub-block IOs, i.e. when sub-block DIO writes to the
same block are being performed, we have the possibility that one
write completes whilst the other is deciding what to zero, unaware
that the range is now MAPPED rather than UNWRITTEN.

We currently avoid issues with sub-block dio writes by using
IOMAP_DIO_OVERWRITE_ONLY with shared locking. This ensures that the
unaligned IO fits entirely within a MAPPED extent so no sub-block
zeroing is required. If allocation or sub-block zeroing is required,
then we force the filesystem to fall back to exclusive IO locking
and wait for all concurrent DIO in flight to complete so that it
can't race with any other DIO write that might cause the map to
become stale while we are doing the zeroing.
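
A hedged sketch of that shape, with invented lock helpers and ops
rather than the exact XFS code:

static ssize_t example_dio_write_unaligned(struct inode *inode,
		struct kiocb *iocb, struct iov_iter *from)
{
	ssize_t ret;

	/*
	 * Optimistic path: shared locking, and tell iomap to proceed
	 * only if this is a pure overwrite of already-MAPPED blocks.
	 */
	example_ilock_shared(inode);			/* hypothetical */
	ret = iomap_dio_rw(iocb, from, &example_iomap_ops, &example_dio_ops,
			IOMAP_DIO_OVERWRITE_ONLY, NULL, 0);
	example_iunlock_shared(inode);
	if (ret != -EAGAIN)
		return ret;

	/*
	 * Allocation or sub-block zeroing is required: take the lock
	 * exclusively and drain all in-flight DIO so that no completion
	 * can convert an extent from UNWRITTEN to MAPPED while we are
	 * deciding what to zero.
	 */
	example_ilock_excl(inode);			/* hypothetical */
	inode_dio_wait(inode);
	ret = iomap_dio_rw(iocb, from, &example_iomap_ops, &example_dio_ops,
			0, NULL, 0);
	example_iunlock_excl(inode);
	return ret;
}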

This does not avoid potential issues with DIO write vs buffered
read, nor DIO write vs mmap IO. It's not totally clear to me
whether we need ->iomap_valid checks in the buffered read paths
to avoid the completion races with DIO writes, but there are windows
there where cached iomaps could be considered stale....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 05/10] iomap/gfs2: Get page in page_prepare handler
  2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
@ 2023-01-31 19:37     ` Matthew Wilcox
  -1 siblings, 0 replies; 82+ messages in thread
From: Matthew Wilcox @ 2023-01-31 19:37 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, Darrick J . Wong, Alexander Viro, linux-xfs,
	linux-fsdevel, linux-ext4, cluster-devel, Christoph Hellwig

On Sun, Jan 08, 2023 at 08:40:29PM +0100, Andreas Gruenbacher wrote:
> +static struct folio *
> +gfs2_iomap_page_prepare(struct iomap_iter *iter, loff_t pos, unsigned len)
>  {
> +	struct inode *inode = iter->inode;
>  	unsigned int blockmask = i_blocksize(inode) - 1;
>  	struct gfs2_sbd *sdp = GFS2_SB(inode);
>  	unsigned int blocks;
> +	struct folio *folio;
> +	int status;
>  
>  	blocks = ((pos & blockmask) + len + blockmask) >> inode->i_blkbits;
> -	return gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
> +	status = gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
> +	if (status)
> +		return ERR_PTR(status);
> +
> +	folio = iomap_get_folio(iter, pos);
> +	if (IS_ERR(folio))
> +		gfs2_trans_end(sdp);
> +	return folio;
>  }

Hi Andreas,

I didn't think to mention this at the time, but I was reading through
buffered-io.c and this jumped out at me.  For filesystems which support
folios, we pass the entire length of the write (or at least the
remaining length of the iomap).  That's intended to allow us to decide
how large a folio to allocate at some point in the future.

For GFS2, we do this:

        if (!mapping_large_folio_support(iter->inode->i_mapping))
                len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));

I'd like to drop that and pass the full length of the write to
->get_folio().  It looks like you'll have to clamp it yourself at this
point.  I am kind of curious why you do one transaction per page --
I would have thought you'd rather do one transaction for the entire write.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC v6 05/10] iomap/gfs2: Get page in page_prepare handler
  2023-01-31 19:37     ` [Cluster-devel] " Matthew Wilcox
@ 2023-01-31 21:33       ` Andreas Gruenbacher
  -1 siblings, 0 replies; 82+ messages in thread
From: Andreas Gruenbacher @ 2023-01-31 21:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Darrick J . Wong, Alexander Viro, linux-xfs,
	linux-fsdevel, linux-ext4, cluster-devel, Christoph Hellwig

On Tue, Jan 31, 2023 at 8:37 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Sun, Jan 08, 2023 at 08:40:29PM +0100, Andreas Gruenbacher wrote:
> > +static struct folio *
> > +gfs2_iomap_page_prepare(struct iomap_iter *iter, loff_t pos, unsigned len)
> >  {
> > +     struct inode *inode = iter->inode;
> >       unsigned int blockmask = i_blocksize(inode) - 1;
> >       struct gfs2_sbd *sdp = GFS2_SB(inode);
> >       unsigned int blocks;
> > +     struct folio *folio;
> > +     int status;
> >
> >       blocks = ((pos & blockmask) + len + blockmask) >> inode->i_blkbits;
> > -     return gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
> > +     status = gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
> > +     if (status)
> > +             return ERR_PTR(status);
> > +
> > +     folio = iomap_get_folio(iter, pos);
> > +     if (IS_ERR(folio))
> > +             gfs2_trans_end(sdp);
> > +     return folio;
> >  }
>
> Hi Andreas,

Hello,

> I didn't think to mention this at the time, but I was reading through
> buffered-io.c and this jumped out at me.  For filesystems which support
> folios, we pass the entire length of the write (or at least the
> remaining length of the iomap).  That's intended to allow us to decide
> how large a folio to allocate at some point in the future.
>
> For GFS2, we do this:
>
>         if (!mapping_large_folio_support(iter->inode->i_mapping))
>                 len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
>
> I'd like to drop that and pass the full length of the write to
> ->get_folio().  It looks like you'll have to clamp it yourself at this
> point.

Sounds reasonable to me.

I see that gfs2_page_add_databufs() hasn't been folio-ized yet, but it
looks like it might just work anyway. So gfs2_iomap_get_folio() ...
gfs2_iomap_put_folio() should, in principle, work for requests bigger
than PAGE_SIZE.

Is there a reasonable way of trying it out?

We still want to keep the transaction size somewhat reasonable, but
the maximum size gfs2_iomap_begin() will return for a write is 509
blocks on a 4k-block filesystem, or slightly less than 2 MiB, which
should be fine.
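
(Checking that arithmetic: 509 blocks * 4096 bytes = 2,084,864
bytes, which is just under 2 MiB = 2,097,152 bytes.)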

>  I am kind of curious why you do one transaction per page --
> I would have thought you'd rather do one transaction for the entire write.

Only for journaled data writes. We could probably do bigger
transactions even in that case, but we'd rather get rid of data
journaling than encourage it, so we're also not spending a lot of time
on optimizing this case.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2023-01-31 21:34 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-08 19:40 [RFC v6 00/10] Turn iomap_page_ops into iomap_folio_ops Andreas Gruenbacher
2023-01-08 19:40 ` [Cluster-devel] " Andreas Gruenbacher
2023-01-08 19:40 ` [RFC v6 01/10] iomap: Add __iomap_put_folio helper Andreas Gruenbacher
2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
2023-01-08 19:40 ` [RFC v6 02/10] iomap/gfs2: Unlock and put folio in page_done handler Andreas Gruenbacher
2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
2023-01-08 19:40 ` [RFC v6 03/10] iomap: Rename page_done handler to put_folio Andreas Gruenbacher
2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
2023-01-08 19:40 ` [RFC v6 04/10] iomap: Add iomap_get_folio helper Andreas Gruenbacher
2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
2023-01-08 21:33   ` Dave Chinner
2023-01-08 21:33     ` [Cluster-devel] " Dave Chinner
2023-01-09 12:46   ` Andreas Gruenbacher
2023-01-09 12:46     ` [Cluster-devel] " Andreas Gruenbacher
2023-01-10  8:46     ` Christoph Hellwig
2023-01-10  8:46       ` [Cluster-devel] " Christoph Hellwig
2023-01-10  9:07       ` Andreas Grünbacher
2023-01-10  9:07         ` [Cluster-devel] " Andreas Grünbacher
2023-01-10 13:34       ` Matthew Wilcox
2023-01-10 13:34         ` [Cluster-devel] " Matthew Wilcox
2023-01-10 15:24         ` Christoph Hellwig
2023-01-10 15:24           ` [Cluster-devel] " Christoph Hellwig
2023-01-11 19:36           ` Matthew Wilcox
2023-01-11 19:36             ` [Cluster-devel] " Matthew Wilcox
2023-01-11 20:52             ` Dave Chinner
2023-01-11 20:52               ` [Cluster-devel] " Dave Chinner
2023-01-12  8:41               ` Christoph Hellwig
2023-01-12  8:41                 ` [Cluster-devel] " Christoph Hellwig
2023-01-15 17:01         ` Darrick J. Wong
2023-01-15 17:01           ` [Cluster-devel] " Darrick J. Wong
2023-01-15 17:06           ` Darrick J. Wong
2023-01-15 17:06             ` [Cluster-devel] " Darrick J. Wong
2023-01-16  5:46             ` Matthew Wilcox
2023-01-16  5:46               ` [Cluster-devel] " Matthew Wilcox
2023-01-16  7:34               ` Christoph Hellwig
2023-01-16  7:34                 ` [Cluster-devel] " Christoph Hellwig
2023-01-16 13:18                 ` Matthew Wilcox
2023-01-16 13:18                   ` [Cluster-devel] " Matthew Wilcox
2023-01-16 16:02                   ` Christoph Hellwig
2023-01-16 16:02                     ` [Cluster-devel] " Christoph Hellwig
2023-01-08 19:40 ` [RFC v6 05/10] iomap/gfs2: Get page in page_prepare handler Andreas Gruenbacher
2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
2023-01-31 19:37   ` Matthew Wilcox
2023-01-31 19:37     ` [Cluster-devel] " Matthew Wilcox
2023-01-31 21:33     ` Andreas Gruenbacher
2023-01-31 21:33       ` [Cluster-devel] " Andreas Gruenbacher
2023-01-08 19:40 ` [RFC v6 06/10] iomap: Add __iomap_get_folio helper Andreas Gruenbacher
2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
2023-01-10  8:48   ` Christoph Hellwig
2023-01-10  8:48     ` [Cluster-devel] " Christoph Hellwig
2023-01-08 19:40 ` [RFC v6 07/10] iomap: Rename page_prepare handler to get_folio Andreas Gruenbacher
2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
2023-01-08 19:40 ` [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler Andreas Gruenbacher
2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
2023-01-08 21:59   ` Dave Chinner
2023-01-08 21:59     ` [Cluster-devel] " Dave Chinner
2023-01-09 18:45     ` Andreas Gruenbacher
2023-01-09 18:45       ` [Cluster-devel] " Andreas Gruenbacher
2023-01-09 22:54       ` Dave Chinner
2023-01-09 22:54         ` [Cluster-devel] " Dave Chinner
2023-01-10  1:09         ` Andreas Grünbacher
2023-01-10  1:09           ` [Cluster-devel] " Andreas Grünbacher
2023-01-15 17:29           ` Darrick J. Wong
2023-01-15 17:29             ` [Cluster-devel] " Darrick J. Wong
2023-01-18  7:21             ` Christoph Hellwig
2023-01-18  7:21               ` [Cluster-devel] " Christoph Hellwig
2023-01-18  9:11               ` Damien Le Moal
2023-01-18  9:11                 ` [Cluster-devel] " Damien Le Moal
2023-01-18 19:04               ` Darrick J. Wong
2023-01-18 19:04                 ` [Cluster-devel] " Darrick J. Wong
2023-01-18 19:57                 ` Andreas Grünbacher
2023-01-18 19:57                   ` [Cluster-devel] " Andreas Grünbacher
2023-01-18 21:42             ` Dave Chinner
2023-01-18 21:42               ` [Cluster-devel] " Dave Chinner
2023-01-10  8:51     ` Christoph Hellwig
2023-01-10  8:51       ` [Cluster-devel] " Christoph Hellwig
2023-01-10  8:52   ` Christoph Hellwig
2023-01-10  8:52     ` [Cluster-devel] " Christoph Hellwig
2023-01-08 19:40 ` [RFC v6 09/10] iomap: Rename page_ops to folio_ops Andreas Gruenbacher
2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
2023-01-08 19:40 ` [RFC v6 10/10] xfs: Make xfs_iomap_folio_ops static Andreas Gruenbacher
2023-01-08 19:40   ` [Cluster-devel] " Andreas Gruenbacher
