* [PATCHv7 0/6] iomap: Add support for per-block dirty state to improve write performance
@ 2023-06-05 10:55 Ritesh Harjani (IBM)
  2023-06-05 10:55 ` [PATCHv7 1/6] iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free() Ritesh Harjani (IBM)
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-05 10:55 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo,
	Disha Goel, Ritesh Harjani (IBM)

Hello All,

Please find PATCHv7, which adds per-block dirty state tracking to iomap.
As discussed earlier, this is required to improve write performance and reduce
write amplification in cases where either the blocksize is less than the
pagesize (such as the Power platform with a 64k pagesize) or where we have a
large folio (such as xfs, which already supports large folios).

PATCHv6 -> PATCHv7
==================
1. Fixed the __maybe_unused annotation.
2. Added patch 4:
   iomap: Refactor iomap_write_delalloc_punch() function out

RFCv5 -> PATCHv6:
=================
1. Addressed review comments from Brian, Christoph and Matthew.
   @Christoph:
     - I have renamed the higher level functions such as iop_alloc/iop_free() to
       iomap_iop_alloc/free() in v6.
     - As for the low level bitmap accessor functions, I couldn't find any better
       naming than iop_test/set/clear_**. I could have gone for
       iomap_iop__test/set/clear_** or iomap__iop_test/set/clear_**, but
       I wasn't convinced by either of the above, as both also make the function
       names longer.
       Besides, the iop_test/set/clear_** accessor functions for uptodate and
       dirty state tracking make sense, as we are sure we have a valid iop in
       those cases. Please do let me know if this looks ok to you.
2. I tried testing gfs2 (initially with no patches) with xfstests, but I always
   ended up in one deadlock or another (I couldn't spend any time debugging it).
   I also ran it with -x log, but it still always failed for me.
   @Andreas:
   - Could you please suggest how I can test gfs2 with these patches? I see gfs2
     can have a smaller blocksize and it uses the iomap buffered I/O path. It
     would be good if we could get these patches tested on it too.

3. I have now run a good amount of fstests on these patches on the platforms
   below, and I haven't found any new failures in my testing so far.
   arm64 (64k pagesize): with 4k -g quick
   Power: with 4k -g auto
   x86: 1k, 4k with -g auto and adv_auto

From my testing so far these patches look stable to me. If they look good to
reviewers as well, do you think this series can be queued to linux-next for
wider testing?

Performance numbers copied from the last patch's commit message
================================================================
Performance testing of the below fio workload reveals a ~16x performance
improvement using nvme with XFS (4k blocksize) on Power (64k pagesize).
FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.

1. <test_randwrite.fio>
[global]
	ioengine=psync
	rw=randwrite
	overwrite=1
	pre_read=1
	direct=0
	bs=4k
	size=1G
	dir=./
	numjobs=8
	fdatasync=1
	runtime=60
	iodepth=64
	group_reporting=1

[fio-run]

2. Also, our internal performance team reported that this patch improves
   their database workload performance by around 83% (with XFS on Power).


Ritesh Harjani (IBM) (6):
  iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free()
  iomap: Move folio_detach_private() in iomap_iop_free() to the end
  iomap: Refactor some iop related accessor functions
  iomap: Refactor iomap_write_delalloc_punch() function out
  iomap: Allocate iop in ->write_begin() early
  iomap: Add per-block dirty state tracking to improve performance

 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 312 ++++++++++++++++++++++++++++++-----------
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/file.c       |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 236 insertions(+), 83 deletions(-)

--
2.40.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCHv7 1/6] iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free()
  2023-06-05 10:55 [PATCHv7 0/6] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
@ 2023-06-05 10:55 ` Ritesh Harjani (IBM)
  2023-06-05 22:36   ` Darrick J. Wong
  2023-06-05 10:55 ` [PATCHv7 2/6] iomap: Move folio_detach_private() in iomap_iop_free() to the end Ritesh Harjani (IBM)
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-05 10:55 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo,
	Disha Goel, Ritesh Harjani (IBM)

This patch renames the iomap_page_create/release() functions to
iomap_iop_alloc/free(). Later patches add more functions for handling the
iop structure, all following the iomap_iop_** naming convention.
Hence iomap_iop_alloc/free() makes more sense and is consistent with the
rest of the APIs.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 063133ec77f4..4567bdd4fff9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -43,8 +43,8 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 
 static struct bio_set iomap_ioend_bioset;
 
-static struct iomap_page *
-iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
+static struct iomap_page *iomap_iop_alloc(struct inode *inode,
+				struct folio *folio, unsigned int flags)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -69,7 +69,7 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 	return iop;
 }
 
-static void iomap_page_release(struct folio *folio)
+static void iomap_iop_free(struct folio *folio)
 {
 	struct iomap_page *iop = folio_detach_private(folio);
 	struct inode *inode = folio->mapping->host;
@@ -231,7 +231,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iomap_page_create(iter->inode, folio, iter->flags);
+		iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
 	else
 		iop = to_iomap_page(folio);
 
@@ -269,7 +269,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		return iomap_read_inline_data(iter, folio);
 
 	/* zero post-eof blocks as the page may be mapped */
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
+	iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
@@ -490,7 +490,7 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
 	 */
 	if (folio_test_dirty(folio) || folio_test_writeback(folio))
 		return false;
-	iomap_page_release(folio);
+	iomap_iop_free(folio);
 	return true;
 }
 EXPORT_SYMBOL_GPL(iomap_release_folio);
@@ -507,12 +507,12 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 	if (offset == 0 && len == folio_size(folio)) {
 		WARN_ON_ONCE(folio_test_writeback(folio));
 		folio_cancel_dirty(folio);
-		iomap_page_release(folio);
+		iomap_iop_free(folio);
 	} else if (folio_test_large(folio)) {
 		/* Must release the iop so the page can be split */
 		WARN_ON_ONCE(!folio_test_uptodate(folio) &&
 			     folio_test_dirty(folio));
-		iomap_page_release(folio);
+		iomap_iop_free(folio);
 	}
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
@@ -559,7 +559,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 		return 0;
 	folio_clear_error(folio);
 
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
+	iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
+
 	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
 		return -EAGAIN;
 
@@ -1612,7 +1613,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iomap_page_create(inode, folio, 0);
+	struct iomap_page *iop = iomap_iop_alloc(inode, folio, 0);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCHv7 2/6] iomap: Move folio_detach_private() in iomap_iop_free() to the end
  2023-06-05 10:55 [PATCHv7 0/6] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
  2023-06-05 10:55 ` [PATCHv7 1/6] iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free() Ritesh Harjani (IBM)
@ 2023-06-05 10:55 ` Ritesh Harjani (IBM)
  2023-06-05 10:55 ` [PATCHv7 3/6] iomap: Refactor some iop related accessor functions Ritesh Harjani (IBM)
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-05 10:55 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo,
	Disha Goel, Ritesh Harjani (IBM)

In later patches we will add other accessor APIs which take the inode and
folio to operate on struct iomap_page. Since those functions need the
folio's private data (the iomap_page), this patch moves the detaching of
the folio's private data to the end, just before calling kfree(iop).

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4567bdd4fff9..6fffda355c45 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -71,7 +71,7 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
 
 static void iomap_iop_free(struct folio *folio)
 {
-	struct iomap_page *iop = folio_detach_private(folio);
+	struct iomap_page *iop = to_iomap_page(folio);
 	struct inode *inode = folio->mapping->host;
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
@@ -81,6 +81,7 @@ static void iomap_iop_free(struct folio *folio)
 	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
 	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
 			folio_test_uptodate(folio));
+	folio_detach_private(folio);
 	kfree(iop);
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCHv7 3/6] iomap: Refactor some iop related accessor functions
  2023-06-05 10:55 [PATCHv7 0/6] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
  2023-06-05 10:55 ` [PATCHv7 1/6] iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free() Ritesh Harjani (IBM)
  2023-06-05 10:55 ` [PATCHv7 2/6] iomap: Move folio_detach_private() in iomap_iop_free() to the end Ritesh Harjani (IBM)
@ 2023-06-05 10:55 ` Ritesh Harjani (IBM)
  2023-06-05 14:15   ` Andreas Gruenbacher
  2023-06-05 22:54   ` Darrick J. Wong
  2023-06-05 10:55 ` [PATCHv7 4/6] iomap: Refactor iomap_write_delalloc_punch() function out Ritesh Harjani (IBM)
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-05 10:55 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo,
	Disha Goel, Ritesh Harjani (IBM)

We will eventually use the iomap_iop_** function naming throughout the
rest of the buffered-io iomap code. This patch updates the function
arguments and naming from iomap_set_range_uptodate() to
iomap_iop_set_range_uptodate(). iop_set_range_uptodate() then becomes an
accessor function used by the iomap_iop_** functions.
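
In other words, callers no longer look up the iop themselves; a rough
before/after sketch of the calling pattern (the full definitions are in the
hunks below):

	/* before: callers passed the iop in explicitly */
	struct iomap_page *iop = to_iomap_page(folio);
	iomap_set_range_uptodate(folio, iop, poff, plen);

	/* after: the wrapper handles the !iop case internally */
	iomap_iop_set_range_uptodate(iter->inode, folio, poff, plen);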

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 111 +++++++++++++++++++++++------------------
 1 file changed, 63 insertions(+), 48 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6fffda355c45..136f57ccd0be 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -24,14 +24,14 @@
 #define IOEND_BATCH_SIZE	4096
 
 /*
- * Structure allocated for each folio when block size < folio size
- * to track sub-folio uptodate status and I/O completions.
+ * Structure allocated for each folio to track per-block uptodate state
+ * and I/O completions.
  */
 struct iomap_page {
 	atomic_t		read_bytes_pending;
 	atomic_t		write_bytes_pending;
-	spinlock_t		uptodate_lock;
-	unsigned long		uptodate[];
+	spinlock_t		state_lock;
+	unsigned long		state[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct folio *folio)
@@ -43,6 +43,48 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 
 static struct bio_set iomap_ioend_bioset;
 
+static bool iop_test_full_uptodate(struct folio *folio)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	struct inode *inode = folio->mapping->host;
+
+	return bitmap_full(iop->state, i_blocks_per_folio(inode, folio));
+}
+
+static bool iop_test_block_uptodate(struct folio *folio, unsigned int block)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+
+	return test_bit(block, iop->state);
+}
+
+static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
+				   size_t off, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	unsigned int first_blk = off >> inode->i_blkbits;
+	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iop->state_lock, flags);
+	bitmap_set(iop->state, first_blk, nr_blks);
+	if (iop_test_full_uptodate(folio))
+		folio_mark_uptodate(folio);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iomap_iop_set_range_uptodate(struct inode *inode,
+		struct folio *folio, size_t off, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+
+	if (iop)
+		iop_set_range_uptodate(inode, folio, off, len);
+	else
+		folio_mark_uptodate(folio);
+}
+
 static struct iomap_page *iomap_iop_alloc(struct inode *inode,
 				struct folio *folio, unsigned int flags)
 {
@@ -58,12 +100,12 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;
 
-	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
 		      gfp);
 	if (iop) {
-		spin_lock_init(&iop->uptodate_lock);
+		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
-			bitmap_fill(iop->uptodate, nr_blocks);
+			bitmap_fill(iop->state, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -72,14 +114,12 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
 static void iomap_iop_free(struct folio *folio)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
-	struct inode *inode = folio->mapping->host;
-	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
 	if (!iop)
 		return;
 	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
 	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
-	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
+	WARN_ON_ONCE(iop_test_full_uptodate(folio) !=
 			folio_test_uptodate(folio));
 	folio_detach_private(folio);
 	kfree(iop);
@@ -111,7 +151,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* move forward for each leading block marked uptodate */
 		for (i = first; i <= last; i++) {
-			if (!test_bit(i, iop->uptodate))
+			if (!iop_test_block_uptodate(folio, i))
 				break;
 			*pos += block_size;
 			poff += block_size;
@@ -121,7 +161,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* truncate len if we find any trailing uptodate block(s) */
 		for ( ; i <= last; i++) {
-			if (test_bit(i, iop->uptodate)) {
+			if (iop_test_block_uptodate(folio, i)) {
 				plen -= (last - i + 1) * block_size;
 				last = i - 1;
 				break;
@@ -145,30 +185,6 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	*lenp = plen;
 }
 
-static void iomap_iop_set_range_uptodate(struct folio *folio,
-		struct iomap_page *iop, size_t off, size_t len)
-{
-	struct inode *inode = folio->mapping->host;
-	unsigned first = off >> inode->i_blkbits;
-	unsigned last = (off + len - 1) >> inode->i_blkbits;
-	unsigned long flags;
-
-	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	bitmap_set(iop->uptodate, first, last - first + 1);
-	if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
-		folio_mark_uptodate(folio);
-	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
-}
-
-static void iomap_set_range_uptodate(struct folio *folio,
-		struct iomap_page *iop, size_t off, size_t len)
-{
-	if (iop)
-		iomap_iop_set_range_uptodate(folio, iop, off, len);
-	else
-		folio_mark_uptodate(folio);
-}
-
 static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 		size_t len, int error)
 {
@@ -178,7 +194,8 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 		folio_clear_uptodate(folio);
 		folio_set_error(folio);
 	} else {
-		iomap_set_range_uptodate(folio, iop, offset, len);
+		iomap_iop_set_range_uptodate(folio->mapping->host, folio,
+					     offset, len);
 	}
 
 	if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
@@ -214,7 +231,6 @@ struct iomap_readpage_ctx {
 static int iomap_read_inline_data(const struct iomap_iter *iter,
 		struct folio *folio)
 {
-	struct iomap_page *iop;
 	const struct iomap *iomap = iomap_iter_srcmap(iter);
 	size_t size = i_size_read(iter->inode) - iomap->offset;
 	size_t poff = offset_in_page(iomap->offset);
@@ -232,15 +248,14 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
-	else
-		iop = to_iomap_page(folio);
+		iomap_iop_alloc(iter->inode, folio, iter->flags);
 
 	addr = kmap_local_folio(folio, offset);
 	memcpy(addr, iomap->inline_data, size);
 	memset(addr + size, 0, PAGE_SIZE - poff - size);
 	kunmap_local(addr);
-	iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
+	iomap_iop_set_range_uptodate(iter->inode, folio, offset,
+				     PAGE_SIZE - poff);
 	return 0;
 }
 
@@ -277,7 +292,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 
 	if (iomap_block_needs_zeroing(iter, pos)) {
 		folio_zero_range(folio, poff, plen);
-		iomap_set_range_uptodate(folio, iop, poff, plen);
+		iomap_iop_set_range_uptodate(iter->inode, folio, poff, plen);
 		goto done;
 	}
 
@@ -452,7 +467,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 	last = (from + count - 1) >> inode->i_blkbits;
 
 	for (i = first; i <= last; i++)
-		if (!test_bit(i, iop->uptodate))
+		if (!iop_test_block_uptodate(folio, i))
 			return false;
 	return true;
 }
@@ -591,7 +606,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 			if (status)
 				return status;
 		}
-		iomap_set_range_uptodate(folio, iop, poff, plen);
+		iomap_iop_set_range_uptodate(iter->inode, folio, poff, plen);
 	} while ((block_start += plen) < block_end);
 
 	return 0;
@@ -698,7 +713,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
-	struct iomap_page *iop = to_iomap_page(folio);
 	flush_dcache_folio(folio);
 
 	/*
@@ -714,7 +728,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	 */
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
-	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
+	iomap_iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos),
+				     len);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
@@ -1630,7 +1645,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !test_bit(i, iop->uptodate))
+		if (iop && !iop_test_block_uptodate(folio, i))
 			continue;
 
 		error = wpc->ops->map_blocks(wpc, inode, pos);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCHv7 4/6] iomap: Refactor iomap_write_delalloc_punch() function out
  2023-06-05 10:55 [PATCHv7 0/6] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
                   ` (2 preceding siblings ...)
  2023-06-05 10:55 ` [PATCHv7 3/6] iomap: Refactor some iop related accessor functions Ritesh Harjani (IBM)
@ 2023-06-05 10:55 ` Ritesh Harjani (IBM)
  2023-06-05 22:55   ` Darrick J. Wong
  2023-06-05 10:55 ` [PATCHv7 5/6] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
  2023-06-05 10:55 ` [PATCHv7 6/6] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
  5 siblings, 1 reply; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-05 10:55 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo,
	Disha Goel, Ritesh Harjani (IBM)

This patch factors iomap_write_delalloc_punch() out of
iomap_write_delalloc_scan(). No functional change in this patch.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 54 ++++++++++++++++++++++++++----------------
 1 file changed, 34 insertions(+), 20 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 136f57ccd0be..f55a339f99ec 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -894,6 +894,33 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
+static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
+		loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte,
+		int (*punch)(struct inode *inode, loff_t offset, loff_t length))
+{
+	int ret = 0;
+
+	if (!folio_test_dirty(folio))
+		return ret;
+
+	/* if dirty, punch up to offset */
+	if (start_byte > *punch_start_byte) {
+		ret = punch(inode, *punch_start_byte,
+				start_byte - *punch_start_byte);
+		if (ret)
+			goto out;
+	}
+	/*
+	 * Make sure the next punch start is correctly bound to
+	 * the end of this data range, not the end of the folio.
+	 */
+	*punch_start_byte = min_t(loff_t, end_byte,
+				  folio_next_index(folio) << PAGE_SHIFT);
+
+out:
+	return ret;
+}
+
 /*
  * Scan the data range passed to us for dirty page cache folios. If we find a
  * dirty folio, punch out the preceeding range and update the offset from which
@@ -917,6 +944,7 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 {
 	while (start_byte < end_byte) {
 		struct folio	*folio;
+		int ret;
 
 		/* grab locked page */
 		folio = filemap_lock_folio(inode->i_mapping,
@@ -927,26 +955,12 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 			continue;
 		}
 
-		/* if dirty, punch up to offset */
-		if (folio_test_dirty(folio)) {
-			if (start_byte > *punch_start_byte) {
-				int	error;
-
-				error = punch(inode, *punch_start_byte,
-						start_byte - *punch_start_byte);
-				if (error) {
-					folio_unlock(folio);
-					folio_put(folio);
-					return error;
-				}
-			}
-
-			/*
-			 * Make sure the next punch start is correctly bound to
-			 * the end of this data range, not the end of the folio.
-			 */
-			*punch_start_byte = min_t(loff_t, end_byte,
-					folio_next_index(folio) << PAGE_SHIFT);
+		ret = iomap_write_delalloc_punch(inode, folio, punch_start_byte,
+						 start_byte, end_byte, punch);
+		if (ret) {
+			folio_unlock(folio);
+			folio_put(folio);
+			return ret;
 		}
 
 		/* move offset to start of next folio in range */
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCHv7 5/6] iomap: Allocate iop in ->write_begin() early
  2023-06-05 10:55 [PATCHv7 0/6] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
                   ` (3 preceding siblings ...)
  2023-06-05 10:55 ` [PATCHv7 4/6] iomap: Refactor iomap_write_delalloc_punch() function out Ritesh Harjani (IBM)
@ 2023-06-05 10:55 ` Ritesh Harjani (IBM)
  2023-06-05 22:58   ` Darrick J. Wong
  2023-06-05 10:55 ` [PATCHv7 6/6] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
  5 siblings, 1 reply; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-05 10:55 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo,
	Disha Goel, Ritesh Harjani (IBM)

We don't need to allocate an iop in ->write_begin() for writes where the
position and length completely overlap the given folio. Therefore, such
cases are skipped.

Currently, when the folio is uptodate, we only allocate the iop at
writeback time (in iomap_writepage_map()). This has been fine until now,
but once we add support for a per-block dirty state bitmap in the iop, it
could cause a performance degradation. The reason is that if we don't
allocate the iop during ->write_begin(), we will never mark the necessary
dirty bits in the ->write_end() call, and we will then have to mark all
the bits as dirty at writeback time, which causes the same write
amplification and performance problems as we have today.
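
To make the two cases concrete, here is a short sketch of the check added
below, annotated with a hypothetical example (64k folio, 4k fs blocks):

	/*
	 * Example: a 64k write covering the whole folio dirties every block,
	 * so per-block tracking buys nothing and iop allocation is skipped.
	 * A 4k write in the middle of the folio does allocate the iop here,
	 * so that ->write_end() can mark just that one block dirty instead
	 * of writeback having to treat all 16 blocks as dirty.
	 */
	if (pos <= folio_pos(folio) &&
	    pos + len >= folio_pos(folio) + folio_size(folio))
		return 0;

	iop = iomap_iop_alloc(iter->inode, folio, iter->flags);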

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f55a339f99ec..2a97d73edb96 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -571,15 +571,24 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	size_t from = offset_in_folio(folio, pos), to = from + len;
 	size_t poff, plen;
 
-	if (folio_test_uptodate(folio))
+	/*
+	 * If the write completely overlaps the current folio, then
+	 * entire folio will be dirtied so there is no need for
+	 * per-block state tracking structures to be attached to this folio.
+	 */
+	if (pos <= folio_pos(folio) &&
+	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;
-	folio_clear_error(folio);
 
 	iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
 
 	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
 		return -EAGAIN;
 
+	if (folio_test_uptodate(folio))
+		return 0;
+	folio_clear_error(folio);
+
 	do {
 		iomap_adjust_read_range(iter->inode, folio, &block_start,
 				block_end - block_start, &poff, &plen);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCHv7 6/6] iomap: Add per-block dirty state tracking to improve performance
  2023-06-05 10:55 [PATCHv7 0/6] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
                   ` (4 preceding siblings ...)
  2023-06-05 10:55 ` [PATCHv7 5/6] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
@ 2023-06-05 10:55 ` Ritesh Harjani (IBM)
  2023-06-05 23:10   ` Darrick J. Wong
  5 siblings, 1 reply; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-06-05 10:55 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo,
	Disha Goel, Ritesh Harjani (IBM),
	Aravinda Herle

When the filesystem blocksize is less than the folio size (either with
mapping_large_folio_support() or with blocksize < pagesize) and the folio
is uptodate in the pagecache, then even a single-byte write can cause the
entire folio to be written to disk during writeback. This happens because
we currently don't have a mechanism to track per-block dirty state within
struct iomap_page; we only track uptodate state.

This patch implements support for tracking per-block dirty state in the
iomap_page->state bitmap. This should help improve filesystem write
performance and reduce write amplification.
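
A short sketch of the resulting iop->state layout and the dirty-bit lookup
(this mirrors the comment and the iop_test_block_dirty() helper in the diff
below):

	/*
	 * For a folio with nr_blocks fs blocks, iop->state now carries:
	 *   bits [0, nr_blocks)             - per-block uptodate state
	 *   bits [nr_blocks, 2 * nr_blocks) - per-block dirty state
	 */
	static bool iop_test_block_dirty(struct folio *folio, int block)
	{
		struct iomap_page *iop = to_iomap_page(folio);
		struct inode *inode = folio->mapping->host;
		unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);

		/* the dirty bits start right after the uptodate bits */
		return test_bit(block + blks_per_folio, iop->state);
	}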

Performance testing of the below fio workload reveals a ~16x performance
improvement using nvme with XFS (4k blocksize) on Power (64k pagesize).
FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.

1. <test_randwrite.fio>
[global]
	ioengine=psync
	rw=randwrite
	overwrite=1
	pre_read=1
	direct=0
	bs=4k
	size=1G
	dir=./
	numjobs=8
	fdatasync=1
	runtime=60
	iodepth=64
	group_reporting=1

[fio-run]

2. Also, our internal performance team reported that this patch improves
   their database workload performance by around 83% (with XFS on Power).

Reported-by: Aravinda Herle <araherle@in.ibm.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 120 +++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/file.c       |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 120 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index a5f4be6b9213..75efec3c3b71 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -746,7 +746,7 @@ static const struct address_space_operations gfs2_aops = {
 	.writepages = gfs2_writepages,
 	.read_folio = gfs2_read_folio,
 	.readahead = gfs2_readahead,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.bmap = gfs2_bmap,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 2a97d73edb96..e7d114b5b918 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -85,6 +85,63 @@ static void iomap_iop_set_range_uptodate(struct inode *inode,
 		folio_mark_uptodate(folio);
 }

+static bool iop_test_block_dirty(struct folio *folio, int block)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	struct inode *inode = folio->mapping->host;
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+
+	return test_bit(block + blks_per_folio, iop->state);
+}
+
+static void iop_set_range_dirty(struct inode *inode, struct folio *folio,
+				size_t off, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = off >> inode->i_blkbits;
+	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iop->state_lock, flags);
+	bitmap_set(iop->state, first_blk + blks_per_folio, nr_blks);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iomap_iop_set_range_dirty(struct inode *inode, struct folio *folio,
+				size_t off, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+
+	if (iop)
+		iop_set_range_dirty(inode, folio, off, len);
+}
+
+static void iop_clear_range_dirty(struct inode *inode, struct folio *folio,
+				  size_t off, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = off >> inode->i_blkbits;
+	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iop->state_lock, flags);
+	bitmap_clear(iop->state, first_blk + blks_per_folio, nr_blks);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iomap_iop_clear_range_dirty(struct inode *inode,
+				struct folio *folio, size_t off, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+
+	if (iop)
+		iop_clear_range_dirty(inode, folio, off, len);
+}
+
 static struct iomap_page *iomap_iop_alloc(struct inode *inode,
 				struct folio *folio, unsigned int flags)
 {
@@ -100,12 +157,20 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;

-	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
+	/*
+	 * iop->state tracks two sets of state flags when the
+	 * filesystem block size is smaller than the folio size.
+	 * The first state tracks per-block uptodate and the
+	 * second tracks per-block dirty state.
+	 */
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
 		      gfp);
 	if (iop) {
 		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
-			bitmap_fill(iop->state, nr_blocks);
+			bitmap_set(iop->state, 0, nr_blocks);
+		if (folio_test_dirty(folio))
+			bitmap_set(iop->state, nr_blocks, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -533,6 +598,17 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);

+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
+{
+	struct inode *inode = mapping->host;
+	size_t len = folio_size(folio);
+
+	iomap_iop_alloc(inode, folio, 0);
+	iomap_iop_set_range_dirty(inode, folio, 0, len);
+	return filemap_dirty_folio(mapping, folio);
+}
+EXPORT_SYMBOL_GPL(iomap_dirty_folio);
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -739,6 +815,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 		return 0;
 	iomap_iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos),
 				     len);
+	iomap_iop_set_range_dirty(inode, folio, offset_in_folio(folio, pos),
+				  copied);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
@@ -908,6 +986,10 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
 		int (*punch)(struct inode *inode, loff_t offset, loff_t length))
 {
 	int ret = 0;
+	struct iomap_page *iop;
+	unsigned int first_blk, last_blk, i;
+	loff_t last_byte;
+	u8 blkbits = inode->i_blkbits;

 	if (!folio_test_dirty(folio))
 		return ret;
@@ -919,6 +1001,29 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
 		if (ret)
 			goto out;
 	}
+	/*
+	 * When we have per-block dirty tracking, there can be
+	 * blocks within a folio which are marked uptodate
+	 * but not dirty. In that case it is necessary to punch
+	 * out such blocks to avoid leaking any delalloc blocks.
+	 */
+	iop = to_iomap_page(folio);
+	if (!iop)
+		goto skip_iop_punch;
+
+	last_byte = min_t(loff_t, end_byte - 1,
+		(folio_next_index(folio) << PAGE_SHIFT) - 1);
+	first_blk = offset_in_folio(folio, start_byte) >> blkbits;
+	last_blk = offset_in_folio(folio, last_byte) >> blkbits;
+	for (i = first_blk; i <= last_blk; i++) {
+		if (!iop_test_block_dirty(folio, i)) {
+			ret = punch(inode, i << blkbits, 1 << blkbits);
+			if (ret)
+				goto out;
+		}
+	}
+
+skip_iop_punch:
 	/*
 	 * Make sure the next punch start is correctly bound to
 	 * the end of this data range, not the end of the folio.
@@ -1652,7 +1757,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iomap_iop_alloc(inode, folio, 0);
+	struct iomap_page *iop = to_iomap_page(folio);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1660,6 +1765,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	int error = 0, count = 0, i;
 	LIST_HEAD(submit_list);

+	if (!iop && nblocks > 1) {
+		iop = iomap_iop_alloc(inode, folio, 0);
+		iomap_iop_set_range_dirty(inode, folio, 0, folio_size(folio));
+	}
+
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);

 	/*
@@ -1668,7 +1778,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !iop_test_block_uptodate(folio, i))
+		if (iop && !iop_test_block_dirty(folio, i))
 			continue;

 		error = wpc->ops->map_blocks(wpc, inode, pos);
@@ -1712,6 +1822,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		}
 	}

+	iomap_iop_clear_range_dirty(inode, folio, 0,
+				    end_pos - folio_pos(folio));
 	folio_start_writeback(folio);
 	folio_unlock(folio);

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2ef78aa1d3f6..77c7332ae197 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
 	.read_folio		= xfs_vm_read_folio,
 	.readahead		= xfs_vm_readahead,
 	.writepages		= xfs_vm_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.bmap			= xfs_vm_bmap,
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 132f01d3461f..e508c8e97372 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
 	.read_folio		= zonefs_read_folio,
 	.readahead		= zonefs_readahead,
 	.writepages		= zonefs_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.migrate_folio		= filemap_migrate_folio,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e2b836c2e119..eb9335c46bf3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
--
2.40.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCHv7 3/6] iomap: Refactor some iop related accessor functions
  2023-06-05 10:55 ` [PATCHv7 3/6] iomap: Refactor some iop related accessor functions Ritesh Harjani (IBM)
@ 2023-06-05 14:15   ` Andreas Gruenbacher
  2023-06-05 14:31     ` Matthew Wilcox
  2023-06-05 21:00     ` Ritesh Harjani
  2023-06-05 22:54   ` Darrick J. Wong
  1 sibling, 2 replies; 23+ messages in thread
From: Andreas Gruenbacher @ 2023-06-05 14:15 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Ojaswin Mujoo, Disha Goel

On Mon, Jun 5, 2023 at 12:55 PM Ritesh Harjani (IBM)
<ritesh.list@gmail.com> wrote:
> We would eventually use iomap_iop_** function naming by the rest of the
> buffered-io iomap code. This patch update function arguments and naming
> from iomap_set_range_uptodate() -> iomap_iop_set_range_uptodate().
> iop_set_range_uptodate() then becomes an accessor function used by
> iomap_iop_** functions.
>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/iomap/buffered-io.c | 111 +++++++++++++++++++++++------------------
>  1 file changed, 63 insertions(+), 48 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 6fffda355c45..136f57ccd0be 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -24,14 +24,14 @@
>  #define IOEND_BATCH_SIZE       4096
>
>  /*
> - * Structure allocated for each folio when block size < folio size
> - * to track sub-folio uptodate status and I/O completions.
> + * Structure allocated for each folio to track per-block uptodate state
> + * and I/O completions.
>   */
>  struct iomap_page {
>         atomic_t                read_bytes_pending;
>         atomic_t                write_bytes_pending;
> -       spinlock_t              uptodate_lock;
> -       unsigned long           uptodate[];
> +       spinlock_t              state_lock;
> +       unsigned long           state[];
>  };
>
>  static inline struct iomap_page *to_iomap_page(struct folio *folio)
> @@ -43,6 +43,48 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
>
>  static struct bio_set iomap_ioend_bioset;
>
> +static bool iop_test_full_uptodate(struct folio *folio)
> +{
> +       struct iomap_page *iop = to_iomap_page(folio);
> +       struct inode *inode = folio->mapping->host;
> +
> +       return bitmap_full(iop->state, i_blocks_per_folio(inode, folio));
> +}

Can this be called iop_test_fully_uptodate(), please?

> +
> +static bool iop_test_block_uptodate(struct folio *folio, unsigned int block)
> +{
> +       struct iomap_page *iop = to_iomap_page(folio);
> +
> +       return test_bit(block, iop->state);
> +}
> +
> +static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
> +                                  size_t off, size_t len)
> +{
> +       struct iomap_page *iop = to_iomap_page(folio);

Note that to_iomap_page() does folio_test_private() followed by
folio_get_private(), which doesn't really make sense in places where
we know that iop is defined. Maybe we want to split that up.

> +       unsigned int first_blk = off >> inode->i_blkbits;
> +       unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> +       unsigned int nr_blks = last_blk - first_blk + 1;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&iop->state_lock, flags);
> +       bitmap_set(iop->state, first_blk, nr_blks);
> +       if (iop_test_full_uptodate(folio))
> +               folio_mark_uptodate(folio);
> +       spin_unlock_irqrestore(&iop->state_lock, flags);
> +}
> +
> +static void iomap_iop_set_range_uptodate(struct inode *inode,
> +               struct folio *folio, size_t off, size_t len)
> +{
> +       struct iomap_page *iop = to_iomap_page(folio);
> +
> +       if (iop)
> +               iop_set_range_uptodate(inode, folio, off, len);
> +       else
> +               folio_mark_uptodate(folio);
> +}

This patch passes the inode into iomap_iop_set_range_uptodate() and
removes the iop argument. Can this be done in a separate patch,
please?

We have a few places like the above, where we look up the iop without
using it. Would a function like folio_has_iop() make more sense?

> +
>  static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>                                 struct folio *folio, unsigned int flags)
>  {
> @@ -58,12 +100,12 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>         else
>                 gfp = GFP_NOFS | __GFP_NOFAIL;
>
> -       iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
> +       iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
>                       gfp);
>         if (iop) {
> -               spin_lock_init(&iop->uptodate_lock);
> +               spin_lock_init(&iop->state_lock);
>                 if (folio_test_uptodate(folio))
> -                       bitmap_fill(iop->uptodate, nr_blocks);
> +                       bitmap_fill(iop->state, nr_blocks);
>                 folio_attach_private(folio, iop);
>         }
>         return iop;
> @@ -72,14 +114,12 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>  static void iomap_iop_free(struct folio *folio)
>  {
>         struct iomap_page *iop = to_iomap_page(folio);
> -       struct inode *inode = folio->mapping->host;
> -       unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>
>         if (!iop)
>                 return;
>         WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
>         WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
> -       WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> +       WARN_ON_ONCE(iop_test_full_uptodate(folio) !=
>                         folio_test_uptodate(folio));
>         folio_detach_private(folio);
>         kfree(iop);
> @@ -111,7 +151,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>
>                 /* move forward for each leading block marked uptodate */
>                 for (i = first; i <= last; i++) {
> -                       if (!test_bit(i, iop->uptodate))
> +                       if (!iop_test_block_uptodate(folio, i))
>                                 break;
>                         *pos += block_size;
>                         poff += block_size;
> @@ -121,7 +161,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>
>                 /* truncate len if we find any trailing uptodate block(s) */
>                 for ( ; i <= last; i++) {
> -                       if (test_bit(i, iop->uptodate)) {
> +                       if (iop_test_block_uptodate(folio, i)) {
>                                 plen -= (last - i + 1) * block_size;
>                                 last = i - 1;
>                                 break;
> @@ -145,30 +185,6 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>         *lenp = plen;
>  }
>
> -static void iomap_iop_set_range_uptodate(struct folio *folio,
> -               struct iomap_page *iop, size_t off, size_t len)
> -{
> -       struct inode *inode = folio->mapping->host;
> -       unsigned first = off >> inode->i_blkbits;
> -       unsigned last = (off + len - 1) >> inode->i_blkbits;
> -       unsigned long flags;
> -
> -       spin_lock_irqsave(&iop->uptodate_lock, flags);
> -       bitmap_set(iop->uptodate, first, last - first + 1);
> -       if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
> -               folio_mark_uptodate(folio);
> -       spin_unlock_irqrestore(&iop->uptodate_lock, flags);
> -}
> -
> -static void iomap_set_range_uptodate(struct folio *folio,
> -               struct iomap_page *iop, size_t off, size_t len)
> -{
> -       if (iop)
> -               iomap_iop_set_range_uptodate(folio, iop, off, len);
> -       else
> -               folio_mark_uptodate(folio);
> -}
> -
>  static void iomap_finish_folio_read(struct folio *folio, size_t offset,
>                 size_t len, int error)
>  {
> @@ -178,7 +194,8 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
>                 folio_clear_uptodate(folio);
>                 folio_set_error(folio);
>         } else {
> -               iomap_set_range_uptodate(folio, iop, offset, len);
> +               iomap_iop_set_range_uptodate(folio->mapping->host, folio,
> +                                            offset, len);
>         }
>
>         if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
> @@ -214,7 +231,6 @@ struct iomap_readpage_ctx {
>  static int iomap_read_inline_data(const struct iomap_iter *iter,
>                 struct folio *folio)
>  {
> -       struct iomap_page *iop;
>         const struct iomap *iomap = iomap_iter_srcmap(iter);
>         size_t size = i_size_read(iter->inode) - iomap->offset;
>         size_t poff = offset_in_page(iomap->offset);
> @@ -232,15 +248,14 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
>         if (WARN_ON_ONCE(size > iomap->length))
>                 return -EIO;
>         if (offset > 0)
> -               iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
> -       else
> -               iop = to_iomap_page(folio);
> +               iomap_iop_alloc(iter->inode, folio, iter->flags);
>
>         addr = kmap_local_folio(folio, offset);
>         memcpy(addr, iomap->inline_data, size);
>         memset(addr + size, 0, PAGE_SIZE - poff - size);
>         kunmap_local(addr);
> -       iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
> +       iomap_iop_set_range_uptodate(iter->inode, folio, offset,
> +                                    PAGE_SIZE - poff);
>         return 0;
>  }
>
> @@ -277,7 +292,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>
>         if (iomap_block_needs_zeroing(iter, pos)) {
>                 folio_zero_range(folio, poff, plen);
> -               iomap_set_range_uptodate(folio, iop, poff, plen);
> +               iomap_iop_set_range_uptodate(iter->inode, folio, poff, plen);
>                 goto done;
>         }
>
> @@ -452,7 +467,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
>         last = (from + count - 1) >> inode->i_blkbits;
>
>         for (i = first; i <= last; i++)
> -               if (!test_bit(i, iop->uptodate))
> +               if (!iop_test_block_uptodate(folio, i))
>                         return false;
>         return true;
>  }
> @@ -591,7 +606,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>                         if (status)
>                                 return status;
>                 }
> -               iomap_set_range_uptodate(folio, iop, poff, plen);
> +               iomap_iop_set_range_uptodate(iter->inode, folio, poff, plen);
>         } while ((block_start += plen) < block_end);
>
>         return 0;
> @@ -698,7 +713,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
>  static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>                 size_t copied, struct folio *folio)
>  {
> -       struct iomap_page *iop = to_iomap_page(folio);
>         flush_dcache_folio(folio);
>
>         /*
> @@ -714,7 +728,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>          */
>         if (unlikely(copied < len && !folio_test_uptodate(folio)))
>                 return 0;
> -       iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
> +       iomap_iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos),
> +                                    len);
>         filemap_dirty_folio(inode->i_mapping, folio);
>         return copied;
>  }
> @@ -1630,7 +1645,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>          * invalid, grab a new one.
>          */
>         for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -               if (iop && !test_bit(i, iop->uptodate))
> +               if (iop && !iop_test_block_uptodate(folio, i))
>                         continue;
>
>                 error = wpc->ops->map_blocks(wpc, inode, pos);
> --
> 2.40.1
>

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv7 3/6] iomap: Refactor some iop related accessor functions
  2023-06-05 14:15   ` Andreas Gruenbacher
@ 2023-06-05 14:31     ` Matthew Wilcox
  2023-06-05 20:48       ` Ritesh Harjani
  2023-06-05 21:00     ` Ritesh Harjani
  1 sibling, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2023-06-05 14:31 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Ritesh Harjani (IBM),
	linux-xfs, linux-fsdevel, Dave Chinner, Brian Foster,
	Christoph Hellwig, Ojaswin Mujoo, Disha Goel

On Mon, Jun 05, 2023 at 04:15:31PM +0200, Andreas Gruenbacher wrote:
> Note that to_iomap_page() does folio_test_private() followed by
> folio_get_private(), which doesn't really make sense in places where
> we know that iop is defined. Maybe we want to split that up.

The plan is to retire the folio private flag entirely.  I just haven't
got round to cleaning up iomap yet.  For folios which we know to be
file-backed, we can just test whether folio->private is NULL or not.

So I'd do this as:

-	struct iomap_page *iop = to_iomap_page(folio);
+	struct iomap_page *iop = folio->private;

and not even use folio_get_private() or to_iomap_page() any more.

> > +       unsigned int first_blk = off >> inode->i_blkbits;
> > +       unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> > +       unsigned int nr_blks = last_blk - first_blk + 1;
> > +       unsigned long flags;
> > +
> > +       spin_lock_irqsave(&iop->state_lock, flags);
> > +       bitmap_set(iop->state, first_blk, nr_blks);
> > +       if (iop_test_full_uptodate(folio))
> > +               folio_mark_uptodate(folio);
> > +       spin_unlock_irqrestore(&iop->state_lock, flags);
> > +}
> > +
> > +static void iomap_iop_set_range_uptodate(struct inode *inode,
> > +               struct folio *folio, size_t off, size_t len)
> > +{
> > +       struct iomap_page *iop = to_iomap_page(folio);
> > +
> > +       if (iop)
> > +               iop_set_range_uptodate(inode, folio, off, len);
> > +       else
> > +               folio_mark_uptodate(folio);
> > +}
> 
> This patch passes the inode into iomap_iop_set_range_uptodate() and
> removes the iop argument. Can this be done in a separate patch,
> please?
> 
> We have a few places like the above, where we look up the iop without
> using it. Would a function like folio_has_iop() make more sense?

I think this is all a symptom of the unnecessary splitting of
iomap_iop_set_range_uptodate and iop_set_range_uptodate.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv7 3/6] iomap: Refactor some iop related accessor functions
  2023-06-05 14:31     ` Matthew Wilcox
@ 2023-06-05 20:48       ` Ritesh Harjani
  0 siblings, 0 replies; 23+ messages in thread
From: Ritesh Harjani @ 2023-06-05 20:48 UTC (permalink / raw)
  To: Matthew Wilcox, Andreas Gruenbacher
  Cc: linux-xfs, linux-fsdevel, Dave Chinner, Brian Foster,
	Christoph Hellwig, Ojaswin Mujoo, Disha Goel

Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Jun 05, 2023 at 04:15:31PM +0200, Andreas Gruenbacher wrote:
>> Note that to_iomap_page() does folio_test_private() followed by
>> folio_get_private(), which doesn't really make sense in places where
>> we know that iop is defined. Maybe we want to split that up.
>
> The plan is to retire the folio private flag entirely.  I just haven't
> got round to cleaning up iomap yet.  For folios which we know to be
> file-backed, we can just test whether folio->private is NULL or not.
>
> So I'd do this as:
>
> -	struct iomap_page *iop = to_iomap_page(folio);
> +	struct iomap_page *iop = folio->private;
>
> and not even use folio_get_private() or to_iomap_page() any more.
>

In that case, shouldn't we just modify the to_iomap_page(folio) function
to simply return folio_get_private(folio) or folio->private?
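
That is, something like this minimal sketch:

	static inline struct iomap_page *to_iomap_page(struct folio *folio)
	{
		return folio->private;
	}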

>> > +       unsigned int first_blk = off >> inode->i_blkbits;
>> > +       unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
>> > +       unsigned int nr_blks = last_blk - first_blk + 1;
>> > +       unsigned long flags;
>> > +
>> > +       spin_lock_irqsave(&iop->state_lock, flags);
>> > +       bitmap_set(iop->state, first_blk, nr_blks);
>> > +       if (iop_test_full_uptodate(folio))
>> > +               folio_mark_uptodate(folio);
>> > +       spin_unlock_irqrestore(&iop->state_lock, flags);
>> > +}
>> > +
>> > +static void iomap_iop_set_range_uptodate(struct inode *inode,
>> > +               struct folio *folio, size_t off, size_t len)
>> > +{
>> > +       struct iomap_page *iop = to_iomap_page(folio);
>> > +
>> > +       if (iop)
>> > +               iop_set_range_uptodate(inode, folio, off, len);
>> > +       else
>> > +               folio_mark_uptodate(folio);
>> > +}
>>
>> This patch passes the inode into iomap_iop_set_range_uptodate() and
>> removes the iop argument. Can this be done in a separate patch,
>> please?
>>
>> We have a few places like the above, where we look up the iop without
>> using it. Would a function like folio_has_iop() make more sense?
>
> I think this is all a symptom of the unnecessary splitting of
> iomap_iop_set_range_uptodate and iop_set_range_uptodate.

Thanks for the review. The problem with not splitting this would be a lot
of variable initialization for the !iop case as well.
Hence in one of the previous versions it was discussed that splitting it
into a helper routine for the iop case would be a better approach.

-ritesh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv7 3/6] iomap: Refactor some iop related accessor functions
  2023-06-05 14:15   ` Andreas Gruenbacher
  2023-06-05 14:31     ` Matthew Wilcox
@ 2023-06-05 21:00     ` Ritesh Harjani
  2023-06-05 21:50       ` Andreas Grünbacher
  1 sibling, 1 reply; 23+ messages in thread
From: Ritesh Harjani @ 2023-06-05 21:00 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Ojaswin Mujoo, Disha Goel

Andreas Gruenbacher <agruenba@redhat.com> writes:

> On Mon, Jun 5, 2023 at 12:55 PM Ritesh Harjani (IBM)
> <ritesh.list@gmail.com> wrote:
>> We would eventually use iomap_iop_** function naming in the rest of the
>> buffered-io iomap code. This patch updates function arguments and naming
>> from iomap_set_range_uptodate() -> iomap_iop_set_range_uptodate().
>> iop_set_range_uptodate() then becomes an accessor function used by
>> iomap_iop_** functions.
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/iomap/buffered-io.c | 111 +++++++++++++++++++++++------------------
>>  1 file changed, 63 insertions(+), 48 deletions(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 6fffda355c45..136f57ccd0be 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -24,14 +24,14 @@
>>  #define IOEND_BATCH_SIZE       4096
>>
>>  /*
>> - * Structure allocated for each folio when block size < folio size
>> - * to track sub-folio uptodate status and I/O completions.
>> + * Structure allocated for each folio to track per-block uptodate state
>> + * and I/O completions.
>>   */
>>  struct iomap_page {
>>         atomic_t                read_bytes_pending;
>>         atomic_t                write_bytes_pending;
>> -       spinlock_t              uptodate_lock;
>> -       unsigned long           uptodate[];
>> +       spinlock_t              state_lock;
>> +       unsigned long           state[];
>>  };
>>
>>  static inline struct iomap_page *to_iomap_page(struct folio *folio)
>> @@ -43,6 +43,48 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
>>
>>  static struct bio_set iomap_ioend_bioset;
>>
>> +static bool iop_test_full_uptodate(struct folio *folio)
>> +{
>> +       struct iomap_page *iop = to_iomap_page(folio);
>> +       struct inode *inode = folio->mapping->host;
>> +
>> +       return bitmap_full(iop->state, i_blocks_per_folio(inode, folio));
>> +}
>
> Can this be called iop_test_fully_uptodate(), please?
>

IMHO, iop_test_full_uptodate() looks fine. It is similar in naming to
the bitmap_full() function.

>> +
>> +static bool iop_test_block_uptodate(struct folio *folio, unsigned int block)
>> +{
>> +       struct iomap_page *iop = to_iomap_page(folio);
>> +
>> +       return test_bit(block, iop->state);
>> +}
>> +
>> +static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
>> +                                  size_t off, size_t len)
>> +{
>> +       struct iomap_page *iop = to_iomap_page(folio);
>
> Note that to_iomap_page() does folio_test_private() followed by
> folio_get_private(), which doesn't really make sense in places where
> we know that iop is defined. Maybe we want to split that up.
>

I think in mm we have the PG_private flag which gets set as a page flag.
So folio_test_private() actually checks whether the PG_private flag is
set or not (I guess it could be there to overload the use of folio->private).

For a file folio, maybe we can directly return folio_get_private(folio)
from to_iomap_page(folio)?

>> +       unsigned int first_blk = off >> inode->i_blkbits;
>> +       unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
>> +       unsigned int nr_blks = last_blk - first_blk + 1;
>> +       unsigned long flags;
>> +
>> +       spin_lock_irqsave(&iop->state_lock, flags);
>> +       bitmap_set(iop->state, first_blk, nr_blks);
>> +       if (iop_test_full_uptodate(folio))
>> +               folio_mark_uptodate(folio);
>> +       spin_unlock_irqrestore(&iop->state_lock, flags);
>> +}
>> +
>> +static void iomap_iop_set_range_uptodate(struct inode *inode,
>> +               struct folio *folio, size_t off, size_t len)
>> +{
>> +       struct iomap_page *iop = to_iomap_page(folio);
>> +
>> +       if (iop)
>> +               iop_set_range_uptodate(inode, folio, off, len);
>> +       else
>> +               folio_mark_uptodate(folio);
>> +}
>
> This patch passes the inode into iomap_iop_set_range_uptodate() and
> removes the iop argument. Can this be done in a separate patch,
> please?
>
> We have a few places like the above, where we look up the iop without
> using it. Would a function like folio_has_iop() make more sense?
>

Just realized that we should also rename
to_iomap_page(folio) -> iomap_get_iop(folio).

For your comment, we can use that as -

+static void iomap_iop_set_range_uptodate(struct inode *inode,
+               struct folio *folio, size_t off, size_t len)
+{
+       if (iomap_get_iop(folio))
+               iop_set_range_uptodate(inode, folio, off, len);
+       else
+               folio_mark_uptodate(folio);
+}


Does this look ok?

-ritesh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv7 3/6] iomap: Refactor some iop related accessor functions
  2023-06-05 21:00     ` Ritesh Harjani
@ 2023-06-05 21:50       ` Andreas Grünbacher
  0 siblings, 0 replies; 23+ messages in thread
From: Andreas Grünbacher @ 2023-06-05 21:50 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Andreas Gruenbacher, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Dave Chinner, Brian Foster, Christoph Hellwig, Ojaswin Mujoo,
	Disha Goel

On Mon, Jun 5, 2023 at 11:05 PM Ritesh Harjani
<ritesh.list@gmail.com> wrote:
> Andreas Gruenbacher <agruenba@redhat.com> writes:
> > On Mon, Jun 5, 2023 at 12:55 PM Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
> >> We would eventually use iomap_iop_** function naming in the rest of the
> >> buffered-io iomap code. This patch updates function arguments and naming
> >> from iomap_set_range_uptodate() -> iomap_iop_set_range_uptodate().
> >> iop_set_range_uptodate() then becomes an accessor function used by
> >> iomap_iop_** functions.
> >>
> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> >> ---
> >>  fs/iomap/buffered-io.c | 111 +++++++++++++++++++++++------------------
> >>  1 file changed, 63 insertions(+), 48 deletions(-)
> >>
> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >> index 6fffda355c45..136f57ccd0be 100644
> >> --- a/fs/iomap/buffered-io.c
> >> +++ b/fs/iomap/buffered-io.c
> >> @@ -24,14 +24,14 @@
> >>  #define IOEND_BATCH_SIZE       4096
> >>
> >>  /*
> >> - * Structure allocated for each folio when block size < folio size
> >> - * to track sub-folio uptodate status and I/O completions.
> >> + * Structure allocated for each folio to track per-block uptodate state
> >> + * and I/O completions.
> >>   */
> >>  struct iomap_page {
> >>         atomic_t                read_bytes_pending;
> >>         atomic_t                write_bytes_pending;
> >> -       spinlock_t              uptodate_lock;
> >> -       unsigned long           uptodate[];
> >> +       spinlock_t              state_lock;
> >> +       unsigned long           state[];
> >>  };
> >>
> >>  static inline struct iomap_page *to_iomap_page(struct folio *folio)
> >> @@ -43,6 +43,48 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
> >>
> >>  static struct bio_set iomap_ioend_bioset;
> >>
> >> +static bool iop_test_full_uptodate(struct folio *folio)
> >> +{
> >> +       struct iomap_page *iop = to_iomap_page(folio);
> >> +       struct inode *inode = folio->mapping->host;
> >> +
> >> +       return bitmap_full(iop->state, i_blocks_per_folio(inode, folio));
> >> +}
> >
> > Can this be called iop_test_fully_uptodate(), please?
> >
>
> IMHO, iop_test_full_uptodate() looks fine. It is similar in naming to
> the bitmap_full() function.

Nah, it really isn't fine, it's "the bitmap is full" vs. "the iop is
fully uptodate"; you can't say "the iop is full uptodate".

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv7 1/6] iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free()
  2023-06-05 10:55 ` [PATCHv7 1/6] iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free() Ritesh Harjani (IBM)
@ 2023-06-05 22:36   ` Darrick J. Wong
  2023-06-06  4:20     ` Ritesh Harjani
  0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2023-06-05 22:36 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel

On Mon, Jun 05, 2023 at 04:25:01PM +0530, Ritesh Harjani (IBM) wrote:
> This patch renames the iomap_page_create/release() functions to
> iomap_iop_alloc/free() calls. Later patches adds more functions for
> handling iop structure with iomap_iop_** naming conventions.
> Hence iomap_iop_alloc/free() makes more sense to be consistent with all
> APIs.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/iomap/buffered-io.c | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 063133ec77f4..4567bdd4fff9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -43,8 +43,8 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
>  
>  static struct bio_set iomap_ioend_bioset;
>  
> -static struct iomap_page *
> -iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
> +static struct iomap_page *iomap_iop_alloc(struct inode *inode,

Personally I preferred iop_alloc, but as I wasn't around to make that
point during the v6 review I'll let this slide.  iomap_iop_* it is.

(I invoke maintainer privilege and will rename the structure to
iomap_folio and iop->iof since the objects no longer track /only/ a
single page state.)
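
(Purely for illustration, a hypothetical sketch where only the names
change -- struct iomap_page becomes struct iomap_folio and local "iop"
variables become "iof"; fields shown as they stand in this patch:)

struct iomap_folio {			/* was: struct iomap_page */
	atomic_t		read_bytes_pending;
	atomic_t		write_bytes_pending;
	spinlock_t		uptodate_lock;
	unsigned long		uptodate[];
};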

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> +				struct folio *folio, unsigned int flags)
>  {
>  	struct iomap_page *iop = to_iomap_page(folio);
>  	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> @@ -69,7 +69,7 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
>  	return iop;
>  }
>  
> -static void iomap_page_release(struct folio *folio)
> +static void iomap_iop_free(struct folio *folio)
>  {
>  	struct iomap_page *iop = folio_detach_private(folio);
>  	struct inode *inode = folio->mapping->host;
> @@ -231,7 +231,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
>  	if (WARN_ON_ONCE(size > iomap->length))
>  		return -EIO;
>  	if (offset > 0)
> -		iop = iomap_page_create(iter->inode, folio, iter->flags);
> +		iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
>  	else
>  		iop = to_iomap_page(folio);
>  
> @@ -269,7 +269,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  		return iomap_read_inline_data(iter, folio);
>  
>  	/* zero post-eof blocks as the page may be mapped */
> -	iop = iomap_page_create(iter->inode, folio, iter->flags);
> +	iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
>  	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
>  	if (plen == 0)
>  		goto done;
> @@ -490,7 +490,7 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
>  	 */
>  	if (folio_test_dirty(folio) || folio_test_writeback(folio))
>  		return false;
> -	iomap_page_release(folio);
> +	iomap_iop_free(folio);
>  	return true;
>  }
>  EXPORT_SYMBOL_GPL(iomap_release_folio);
> @@ -507,12 +507,12 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
>  	if (offset == 0 && len == folio_size(folio)) {
>  		WARN_ON_ONCE(folio_test_writeback(folio));
>  		folio_cancel_dirty(folio);
> -		iomap_page_release(folio);
> +		iomap_iop_free(folio);
>  	} else if (folio_test_large(folio)) {
>  		/* Must release the iop so the page can be split */
>  		WARN_ON_ONCE(!folio_test_uptodate(folio) &&
>  			     folio_test_dirty(folio));
> -		iomap_page_release(folio);
> +		iomap_iop_free(folio);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
> @@ -559,7 +559,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  		return 0;
>  	folio_clear_error(folio);
>  
> -	iop = iomap_page_create(iter->inode, folio, iter->flags);
> +	iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
> +
>  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
>  		return -EAGAIN;
>  
> @@ -1612,7 +1613,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode,
>  		struct folio *folio, u64 end_pos)
>  {
> -	struct iomap_page *iop = iomap_page_create(inode, folio, 0);
> +	struct iomap_page *iop = iomap_iop_alloc(inode, folio, 0);
>  	struct iomap_ioend *ioend, *next;
>  	unsigned len = i_blocksize(inode);
>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv7 3/6] iomap: Refactor some iop related accessor functions
  2023-06-05 10:55 ` [PATCHv7 3/6] iomap: Refactor some iop related accessor functions Ritesh Harjani (IBM)
  2023-06-05 14:15   ` Andreas Gruenbacher
@ 2023-06-05 22:54   ` Darrick J. Wong
  2023-06-05 23:51     ` Ritesh Harjani
  1 sibling, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2023-06-05 22:54 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel

On Mon, Jun 05, 2023 at 04:25:03PM +0530, Ritesh Harjani (IBM) wrote:
> We would eventually use iomap_iop_** function naming in the rest of the
> buffered-io iomap code. This patch updates function arguments and naming
> from iomap_set_range_uptodate() -> iomap_iop_set_range_uptodate().
> iop_set_range_uptodate() then becomes an accessor function used by
> iomap_iop_** functions.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/iomap/buffered-io.c | 111 +++++++++++++++++++++++------------------
>  1 file changed, 63 insertions(+), 48 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 6fffda355c45..136f57ccd0be 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -24,14 +24,14 @@
>  #define IOEND_BATCH_SIZE	4096
>  
>  /*
> - * Structure allocated for each folio when block size < folio size
> - * to track sub-folio uptodate status and I/O completions.
> + * Structure allocated for each folio to track per-block uptodate state
> + * and I/O completions.
>   */
>  struct iomap_page {
>  	atomic_t		read_bytes_pending;
>  	atomic_t		write_bytes_pending;
> -	spinlock_t		uptodate_lock;
> -	unsigned long		uptodate[];
> +	spinlock_t		state_lock;
> +	unsigned long		state[];
>  };
>  
>  static inline struct iomap_page *to_iomap_page(struct folio *folio)
> @@ -43,6 +43,48 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
>  
>  static struct bio_set iomap_ioend_bioset;
>  
> +static bool iop_test_full_uptodate(struct folio *folio)

Same comment as Andreas, I think this works better with 'fully', e.g.

iop_test_fully_uptodate()

Why don't you pass the iomap_page directly into this function?  Doesn't
that eliminate the need for iomap_iop_free to keep folio->private set
until the very end?

static inline bool
iomap_iop_is_fully_uptodate(const struct iomap_page *iop,
			    const struct folio *folio)

Same sort of thing for the second function -- we already extracted
folio->private and checked it wasn't null, so we don't need to do that
again.

static inline bool
iomap_iop_is_block_uptodate(const struct iomap_page *iop,
			   const struct folio *folio,
			   unsigned int block)
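
For illustration, those might look something like the following (just a
sketch of the suggested form, not the patch as posted):

static inline bool
iomap_iop_is_fully_uptodate(const struct iomap_page *iop,
			    const struct folio *folio)
{
	struct inode *inode = folio->mapping->host;

	return bitmap_full(iop->state, i_blocks_per_folio(inode, folio));
}

static inline bool
iomap_iop_is_block_uptodate(const struct iomap_page *iop,
			    const struct folio *folio,
			    unsigned int block)
{
	return test_bit(block, iop->state);
}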

> +{
> +	struct iomap_page *iop = to_iomap_page(folio);
> +	struct inode *inode = folio->mapping->host;
> +
> +	return bitmap_full(iop->state, i_blocks_per_folio(inode, folio));
> +}
> +
> +static bool iop_test_block_uptodate(struct folio *folio, unsigned int block)
> +{
> +	struct iomap_page *iop = to_iomap_page(folio);
> +
> +	return test_bit(block, iop->state);
> +}
> +
> +static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
> +				   size_t off, size_t len)
> +{
> +	struct iomap_page *iop = to_iomap_page(folio);
> +	unsigned int first_blk = off >> inode->i_blkbits;
> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> +	unsigned int nr_blks = last_blk - first_blk + 1;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&iop->state_lock, flags);
> +	bitmap_set(iop->state, first_blk, nr_blks);
> +	if (iop_test_full_uptodate(folio))
> +		folio_mark_uptodate(folio);
> +	spin_unlock_irqrestore(&iop->state_lock, flags);
> +}
> +
> +static void iomap_iop_set_range_uptodate(struct inode *inode,

I don't understand why iomap_set_range_uptodate is now
iomap_iop_set_range_uptodate; it doesn't take an iomap_page object as an
argument...?

I thought I understood that iomap_FOO operates on a folio and a range,
whereas iomap_iop_FOO operates on sub-blocks within a folio?  And that
you were renaming the iop_* functions to iomap_iop_*?

I'm also not sure why iop_set_range_uptodate needs to be passed the
struct inode; can't it extract that from folio->mapping->host, like
current upstream does?

Generally I don't understand why this part of the patch is needed at
all.  Wasn't the point merely to rename uptodate_* to state_* and
introduce the iomap_iop_test_*_uptodate helpers?

--D

> +		struct folio *folio, size_t off, size_t len)
> +{
> +	struct iomap_page *iop = to_iomap_page(folio);
> +
> +	if (iop)
> +		iop_set_range_uptodate(inode, folio, off, len);
> +	else
> +		folio_mark_uptodate(folio);
> +}
> +
>  static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>  				struct folio *folio, unsigned int flags)
>  {
> @@ -58,12 +100,12 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>  	else
>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>  
> -	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
>  		      gfp);
>  	if (iop) {
> -		spin_lock_init(&iop->uptodate_lock);
> +		spin_lock_init(&iop->state_lock);
>  		if (folio_test_uptodate(folio))
> -			bitmap_fill(iop->uptodate, nr_blocks);
> +			bitmap_fill(iop->state, nr_blocks);
>  		folio_attach_private(folio, iop);
>  	}
>  	return iop;
> @@ -72,14 +114,12 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>  static void iomap_iop_free(struct folio *folio)
>  {
>  	struct iomap_page *iop = to_iomap_page(folio);
> -	struct inode *inode = folio->mapping->host;
> -	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>  
>  	if (!iop)
>  		return;
>  	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
>  	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
> -	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> +	WARN_ON_ONCE(iop_test_full_uptodate(folio) !=
>  			folio_test_uptodate(folio));
>  	folio_detach_private(folio);
>  	kfree(iop);
> @@ -111,7 +151,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>  
>  		/* move forward for each leading block marked uptodate */
>  		for (i = first; i <= last; i++) {
> -			if (!test_bit(i, iop->uptodate))
> +			if (!iop_test_block_uptodate(folio, i))
>  				break;
>  			*pos += block_size;
>  			poff += block_size;
> @@ -121,7 +161,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>  
>  		/* truncate len if we find any trailing uptodate block(s) */
>  		for ( ; i <= last; i++) {
> -			if (test_bit(i, iop->uptodate)) {
> +			if (iop_test_block_uptodate(folio, i)) {
>  				plen -= (last - i + 1) * block_size;
>  				last = i - 1;
>  				break;
> @@ -145,30 +185,6 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>  	*lenp = plen;
>  }
>  
> -static void iomap_iop_set_range_uptodate(struct folio *folio,
> -		struct iomap_page *iop, size_t off, size_t len)
> -{
> -	struct inode *inode = folio->mapping->host;
> -	unsigned first = off >> inode->i_blkbits;
> -	unsigned last = (off + len - 1) >> inode->i_blkbits;
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&iop->uptodate_lock, flags);
> -	bitmap_set(iop->uptodate, first, last - first + 1);
> -	if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
> -		folio_mark_uptodate(folio);
> -	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
> -}
> -
> -static void iomap_set_range_uptodate(struct folio *folio,
> -		struct iomap_page *iop, size_t off, size_t len)
> -{
> -	if (iop)
> -		iomap_iop_set_range_uptodate(folio, iop, off, len);
> -	else
> -		folio_mark_uptodate(folio);
> -}
> -
>  static void iomap_finish_folio_read(struct folio *folio, size_t offset,
>  		size_t len, int error)
>  {
> @@ -178,7 +194,8 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
>  		folio_clear_uptodate(folio);
>  		folio_set_error(folio);
>  	} else {
> -		iomap_set_range_uptodate(folio, iop, offset, len);
> +		iomap_iop_set_range_uptodate(folio->mapping->host, folio,
> +					     offset, len);
>  	}
>  
>  	if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
> @@ -214,7 +231,6 @@ struct iomap_readpage_ctx {
>  static int iomap_read_inline_data(const struct iomap_iter *iter,
>  		struct folio *folio)
>  {
> -	struct iomap_page *iop;
>  	const struct iomap *iomap = iomap_iter_srcmap(iter);
>  	size_t size = i_size_read(iter->inode) - iomap->offset;
>  	size_t poff = offset_in_page(iomap->offset);
> @@ -232,15 +248,14 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
>  	if (WARN_ON_ONCE(size > iomap->length))
>  		return -EIO;
>  	if (offset > 0)
> -		iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
> -	else
> -		iop = to_iomap_page(folio);
> +		iomap_iop_alloc(iter->inode, folio, iter->flags);
>  
>  	addr = kmap_local_folio(folio, offset);
>  	memcpy(addr, iomap->inline_data, size);
>  	memset(addr + size, 0, PAGE_SIZE - poff - size);
>  	kunmap_local(addr);
> -	iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
> +	iomap_iop_set_range_uptodate(iter->inode, folio, offset,
> +				     PAGE_SIZE - poff);
>  	return 0;
>  }
>  
> @@ -277,7 +292,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  
>  	if (iomap_block_needs_zeroing(iter, pos)) {
>  		folio_zero_range(folio, poff, plen);
> -		iomap_set_range_uptodate(folio, iop, poff, plen);
> +		iomap_iop_set_range_uptodate(iter->inode, folio, poff, plen);
>  		goto done;
>  	}
>  
> @@ -452,7 +467,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
>  	last = (from + count - 1) >> inode->i_blkbits;
>  
>  	for (i = first; i <= last; i++)
> -		if (!test_bit(i, iop->uptodate))
> +		if (!iop_test_block_uptodate(folio, i))
>  			return false;
>  	return true;
>  }
> @@ -591,7 +606,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  			if (status)
>  				return status;
>  		}
> -		iomap_set_range_uptodate(folio, iop, poff, plen);
> +		iomap_iop_set_range_uptodate(iter->inode, folio, poff, plen);
>  	} while ((block_start += plen) < block_end);
>  
>  	return 0;
> @@ -698,7 +713,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
>  static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>  		size_t copied, struct folio *folio)
>  {
> -	struct iomap_page *iop = to_iomap_page(folio);
>  	flush_dcache_folio(folio);
>  
>  	/*
> @@ -714,7 +728,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>  	 */
>  	if (unlikely(copied < len && !folio_test_uptodate(folio)))
>  		return 0;
> -	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
> +	iomap_iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos),
> +				     len);
>  	filemap_dirty_folio(inode->i_mapping, folio);
>  	return copied;
>  }
> @@ -1630,7 +1645,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * invalid, grab a new one.
>  	 */
>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -		if (iop && !test_bit(i, iop->uptodate))
> +		if (iop && !iop_test_block_uptodate(folio, i))
>  			continue;
>  
>  		error = wpc->ops->map_blocks(wpc, inode, pos);
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv7 4/6] iomap: Refactor iomap_write_delalloc_punch() function out
  2023-06-05 10:55 ` [PATCHv7 4/6] iomap: Refactor iomap_write_delalloc_punch() function out Ritesh Harjani (IBM)
@ 2023-06-05 22:55   ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2023-06-05 22:55 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel

On Mon, Jun 05, 2023 at 04:25:04PM +0530, Ritesh Harjani (IBM) wrote:
> This patch moves iomap_write_delalloc_punch() out of
> iomap_write_delalloc_scan(). No functionality change in this patch.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

Seems fine on its own...
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/iomap/buffered-io.c | 54 ++++++++++++++++++++++++++----------------
>  1 file changed, 34 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 136f57ccd0be..f55a339f99ec 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -894,6 +894,33 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
>  
> +static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
> +		loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte,
> +		int (*punch)(struct inode *inode, loff_t offset, loff_t length))
> +{
> +	int ret = 0;
> +
> +	if (!folio_test_dirty(folio))
> +		return ret;
> +
> +	/* if dirty, punch up to offset */
> +	if (start_byte > *punch_start_byte) {
> +		ret = punch(inode, *punch_start_byte,
> +				start_byte - *punch_start_byte);
> +		if (ret)
> +			goto out;
> +	}
> +	/*
> +	 * Make sure the next punch start is correctly bound to
> +	 * the end of this data range, not the end of the folio.
> +	 */
> +	*punch_start_byte = min_t(loff_t, end_byte,
> +				  folio_next_index(folio) << PAGE_SHIFT);
> +
> +out:
> +	return ret;
> +}
> +
>  /*
>   * Scan the data range passed to us for dirty page cache folios. If we find a
>   * dirty folio, punch out the preceeding range and update the offset from which
> @@ -917,6 +944,7 @@ static int iomap_write_delalloc_scan(struct inode *inode,
>  {
>  	while (start_byte < end_byte) {
>  		struct folio	*folio;
> +		int ret;
>  
>  		/* grab locked page */
>  		folio = filemap_lock_folio(inode->i_mapping,
> @@ -927,26 +955,12 @@ static int iomap_write_delalloc_scan(struct inode *inode,
>  			continue;
>  		}
>  
> -		/* if dirty, punch up to offset */
> -		if (folio_test_dirty(folio)) {
> -			if (start_byte > *punch_start_byte) {
> -				int	error;
> -
> -				error = punch(inode, *punch_start_byte,
> -						start_byte - *punch_start_byte);
> -				if (error) {
> -					folio_unlock(folio);
> -					folio_put(folio);
> -					return error;
> -				}
> -			}
> -
> -			/*
> -			 * Make sure the next punch start is correctly bound to
> -			 * the end of this data range, not the end of the folio.
> -			 */
> -			*punch_start_byte = min_t(loff_t, end_byte,
> -					folio_next_index(folio) << PAGE_SHIFT);
> +		ret = iomap_write_delalloc_punch(inode, folio, punch_start_byte,
> +						 start_byte, end_byte, punch);
> +		if (ret) {
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			return ret;
>  		}
>  
>  		/* move offset to start of next folio in range */
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv7 5/6] iomap: Allocate iop in ->write_begin() early
  2023-06-05 10:55 ` [PATCHv7 5/6] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
@ 2023-06-05 22:58   ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2023-06-05 22:58 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel

On Mon, Jun 05, 2023 at 04:25:05PM +0530, Ritesh Harjani (IBM) wrote:
> We don't need to allocate an iop in ->write_begin() for writes where the
> position and length completely overlap with the given folio.
> Therefore, such cases are skipped.
> 
> Currently when the folio is uptodate, we only allocate iop at writeback
> time (in iomap_writepage_map()). This is ok until now, but when we are
> going to add support for per-block dirty state bitmap in iop, this
> could cause some performance degradation. The reason is that if we don't
> allocate iop during ->write_begin(), then we will never mark the
> necessary dirty bits in ->write_end() call. And we will have to mark all
> the bits as dirty at the writeback time, that could cause the same write
> amplification and performance problems as it is now.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

Makes sense to me, but moving on to the next patch...
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/iomap/buffered-io.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index f55a339f99ec..2a97d73edb96 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -571,15 +571,24 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  	size_t from = offset_in_folio(folio, pos), to = from + len;
>  	size_t poff, plen;
>  
> -	if (folio_test_uptodate(folio))
> +	/*
> +	 * If the write completely overlaps the current folio, then
> +	 * entire folio will be dirtied so there is no need for
> +	 * per-block state tracking structures to be attached to this folio.
> +	 */
> +	if (pos <= folio_pos(folio) &&
> +	    pos + len >= folio_pos(folio) + folio_size(folio))
>  		return 0;
> -	folio_clear_error(folio);
>  
>  	iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
>  
>  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
>  		return -EAGAIN;
>  
> +	if (folio_test_uptodate(folio))
> +		return 0;
> +	folio_clear_error(folio);
> +
>  	do {
>  		iomap_adjust_read_range(iter->inode, folio, &block_start,
>  				block_end - block_start, &poff, &plen);
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv7 6/6] iomap: Add per-block dirty state tracking to improve performance
  2023-06-05 10:55 ` [PATCHv7 6/6] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
@ 2023-06-05 23:10   ` Darrick J. Wong
  2023-06-06  0:08     ` Ritesh Harjani
  0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2023-06-05 23:10 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

On Mon, Jun 05, 2023 at 04:25:06PM +0530, Ritesh Harjani (IBM) wrote:
> When filesystem blocksize is less than folio size (either with
> mapping_large_folio_support() or with blocksize < pagesize) and when the
> folio is uptodate in pagecache, then even a byte write can cause
> an entire folio to be written to disk during writeback. This happens
> because we currently don't have a mechanism to track per-block dirty
> state within struct iomap_page. We currently only track uptodate state.
> 
> This patch implements support for tracking per-block dirty state in
> iomap_page->state bitmap. This should help improve the filesystem write
> performance and help reduce write amplification.
> 
> Performance testing of below fio workload reveals ~16x performance
> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
> 
> 1. <test_randwrite.fio>
> [global]
> 	ioengine=psync
> 	rw=randwrite
> 	overwrite=1
> 	pre_read=1
> 	direct=0
> 	bs=4k
> 	size=1G
> 	dir=./
> 	numjobs=8
> 	fdatasync=1
> 	runtime=60
> 	iodepth=64
> 	group_reporting=1
> 
> [fio-run]
> 
> 2. Also our internal performance team reported that this patch improves
>    their database workload performance by around ~83% (with XFS on Power)
> 
> Reported-by: Aravinda Herle <araherle@in.ibm.com>
> Reported-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/gfs2/aops.c         |   2 +-
>  fs/iomap/buffered-io.c | 120 +++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_aops.c      |   2 +-
>  fs/zonefs/file.c       |   2 +-
>  include/linux/iomap.h  |   1 +
>  5 files changed, 120 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index a5f4be6b9213..75efec3c3b71 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -746,7 +746,7 @@ static const struct address_space_operations gfs2_aops = {
>  	.writepages = gfs2_writepages,
>  	.read_folio = gfs2_read_folio,
>  	.readahead = gfs2_readahead,
> -	.dirty_folio = filemap_dirty_folio,
> +	.dirty_folio = iomap_dirty_folio,
>  	.release_folio = iomap_release_folio,
>  	.invalidate_folio = iomap_invalidate_folio,
>  	.bmap = gfs2_bmap,
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 2a97d73edb96..e7d114b5b918 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -85,6 +85,63 @@ static void iomap_iop_set_range_uptodate(struct inode *inode,
>  		folio_mark_uptodate(folio);
>  }
> 
> +static bool iop_test_block_dirty(struct folio *folio, int block)
> +{
> +	struct iomap_page *iop = to_iomap_page(folio);
> +	struct inode *inode = folio->mapping->host;
> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> +
> +	return test_bit(block + blks_per_folio, iop->state);
> +}
> +
> +static void iop_set_range_dirty(struct inode *inode, struct folio *folio,
> +				size_t off, size_t len)
> +{
> +	struct iomap_page *iop = to_iomap_page(folio);
> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> +	unsigned int first_blk = off >> inode->i_blkbits;
> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> +	unsigned int nr_blks = last_blk - first_blk + 1;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&iop->state_lock, flags);
> +	bitmap_set(iop->state, first_blk + blks_per_folio, nr_blks);
> +	spin_unlock_irqrestore(&iop->state_lock, flags);
> +}
> +
> +static void iomap_iop_set_range_dirty(struct inode *inode, struct folio *folio,
> +				size_t off, size_t len)
> +{
> +	struct iomap_page *iop = to_iomap_page(folio);
> +
> +	if (iop)
> +		iop_set_range_dirty(inode, folio, off, len);
> +}
> +
> +static void iop_clear_range_dirty(struct inode *inode, struct folio *folio,
> +				  size_t off, size_t len)
> +{
> +	struct iomap_page *iop = to_iomap_page(folio);
> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> +	unsigned int first_blk = off >> inode->i_blkbits;
> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> +	unsigned int nr_blks = last_blk - first_blk + 1;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&iop->state_lock, flags);
> +	bitmap_clear(iop->state, first_blk + blks_per_folio, nr_blks);
> +	spin_unlock_irqrestore(&iop->state_lock, flags);
> +}
> +
> +static void iomap_iop_clear_range_dirty(struct inode *inode,
> +				struct folio *folio, size_t off, size_t len)
> +{
> +	struct iomap_page *iop = to_iomap_page(folio);
> +
> +	if (iop)
> +		iop_clear_range_dirty(inode, folio, off, len);
> +}

Same question as patch 3 -- I was under the impression that
iomap_clear_range_dirty would take a folio/off/len, and if an iop is
necessary, then it would call iomap_iop_clear_range_dirty with the
iop/folio/off/len?
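
i.e. something along these lines (a sketch of the argument convention I
mean, not the code as posted):

static void iomap_clear_range_dirty(struct folio *folio, size_t off,
		size_t len)
{
	struct iomap_page *iop = to_iomap_page(folio);

	/* only the iop-level helper takes the iop explicitly */
	if (iop)
		iomap_iop_clear_range_dirty(iop, folio, off, len);
}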


> +
>  static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>  				struct folio *folio, unsigned int flags)
>  {
> @@ -100,12 +157,20 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>  	else
>  		gfp = GFP_NOFS | __GFP_NOFAIL;
> 
> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> +	/*
> +	 * iop->state tracks two sets of state flags when the
> +	 * filesystem block size is smaller than the folio size.
> +	 * The first state tracks per-block uptodate and the
> +	 * second tracks per-block dirty state.
> +	 */
> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
>  		      gfp);
>  	if (iop) {
>  		spin_lock_init(&iop->state_lock);
>  		if (folio_test_uptodate(folio))
> -			bitmap_fill(iop->state, nr_blocks);
> +			bitmap_set(iop->state, 0, nr_blocks);
> +		if (folio_test_dirty(folio))
> +			bitmap_set(iop->state, nr_blocks, nr_blocks);
>  		folio_attach_private(folio, iop);
>  	}
>  	return iop;
> @@ -533,6 +598,17 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
>  }
>  EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
> 
> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> +{
> +	struct inode *inode = mapping->host;
> +	size_t len = folio_size(folio);
> +
> +	iomap_iop_alloc(inode, folio, 0);

Why do we need to allocate an iop if there isn't already one?  Doesn't
it suffice to let filemap_dirty_folio handle it, even if there are
multiple blocks per folio?  I understand why __iomap_write_end would do
that for a sub-folio write, but not here, where we are always going to
mark the whole folio dirty...?

Oh, is it because everything else after this (write_end,
write_delalloc_punch, etc) might need to undo some of the work, hence
you want to have the iop allocated before we dirty the folio, no matter
which path led to the folio getting dirtied?

> +	iomap_iop_set_range_dirty(inode, folio, 0, len);
> +	return filemap_dirty_folio(mapping, folio);
> +}
> +EXPORT_SYMBOL_GPL(iomap_dirty_folio);
> +
>  static void
>  iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>  {
> @@ -739,6 +815,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>  		return 0;
>  	iomap_iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos),
>  				     len);
> +	iomap_iop_set_range_dirty(inode, folio, offset_in_folio(folio, pos),
> +				  copied);

...and hence you don't want to be allocating it now?

--D

>  	filemap_dirty_folio(inode->i_mapping, folio);
>  	return copied;
>  }
> @@ -908,6 +986,10 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
>  		int (*punch)(struct inode *inode, loff_t offset, loff_t length))
>  {
>  	int ret = 0;
> +	struct iomap_page *iop;
> +	unsigned int first_blk, last_blk, i;
> +	loff_t last_byte;
> +	u8 blkbits = inode->i_blkbits;
> 
>  	if (!folio_test_dirty(folio))
>  		return ret;
> @@ -919,6 +1001,29 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
>  		if (ret)
>  			goto out;
>  	}
> +	/*
> +	 * When we have per-block dirty tracking, there can be
> +	 * blocks within a folio which are marked uptodate
> +	 * but not dirty. In that case it is necessary to punch
> +	 * out such blocks to avoid leaking any delalloc blocks.
> +	 */
> +	iop = to_iomap_page(folio);
> +	if (!iop)
> +		goto skip_iop_punch;
> +
> +	last_byte = min_t(loff_t, end_byte - 1,
> +		(folio_next_index(folio) << PAGE_SHIFT) - 1);
> +	first_blk = offset_in_folio(folio, start_byte) >> blkbits;
> +	last_blk = offset_in_folio(folio, last_byte) >> blkbits;
> +	for (i = first_blk; i <= last_blk; i++) {
> +		if (!iop_test_block_dirty(folio, i)) {
> +			ret = punch(inode, i << blkbits, 1 << blkbits);
> +			if (ret)
> +				goto out;
> +		}
> +	}
> +
> +skip_iop_punch:
>  	/*
>  	 * Make sure the next punch start is correctly bound to
>  	 * the end of this data range, not the end of the folio.
> @@ -1652,7 +1757,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode,
>  		struct folio *folio, u64 end_pos)
>  {
> -	struct iomap_page *iop = iomap_iop_alloc(inode, folio, 0);
> +	struct iomap_page *iop = to_iomap_page(folio);
>  	struct iomap_ioend *ioend, *next;
>  	unsigned len = i_blocksize(inode);
>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1660,6 +1765,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	int error = 0, count = 0, i;
>  	LIST_HEAD(submit_list);
> 
> +	if (!iop && nblocks > 1) {
> +		iop = iomap_iop_alloc(inode, folio, 0);
> +		iomap_iop_set_range_dirty(inode, folio, 0, folio_size(folio));
> +	}
> +
>  	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
> 
>  	/*
> @@ -1668,7 +1778,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * invalid, grab a new one.
>  	 */
>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -		if (iop && !iop_test_block_uptodate(folio, i))
> +		if (iop && !iop_test_block_dirty(folio, i))
>  			continue;
> 
>  		error = wpc->ops->map_blocks(wpc, inode, pos);
> @@ -1712,6 +1822,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		}
>  	}
> 
> +	iomap_iop_clear_range_dirty(inode, folio, 0,
> +				    end_pos - folio_pos(folio));
>  	folio_start_writeback(folio);
>  	folio_unlock(folio);
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 2ef78aa1d3f6..77c7332ae197 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
>  	.read_folio		= xfs_vm_read_folio,
>  	.readahead		= xfs_vm_readahead,
>  	.writepages		= xfs_vm_writepages,
> -	.dirty_folio		= filemap_dirty_folio,
> +	.dirty_folio		= iomap_dirty_folio,
>  	.release_folio		= iomap_release_folio,
>  	.invalidate_folio	= iomap_invalidate_folio,
>  	.bmap			= xfs_vm_bmap,
> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
> index 132f01d3461f..e508c8e97372 100644
> --- a/fs/zonefs/file.c
> +++ b/fs/zonefs/file.c
> @@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
>  	.read_folio		= zonefs_read_folio,
>  	.readahead		= zonefs_readahead,
>  	.writepages		= zonefs_writepages,
> -	.dirty_folio		= filemap_dirty_folio,
> +	.dirty_folio		= iomap_dirty_folio,
>  	.release_folio		= iomap_release_folio,
>  	.invalidate_folio	= iomap_invalidate_folio,
>  	.migrate_folio		= filemap_migrate_folio,
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index e2b836c2e119..eb9335c46bf3 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
>  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>  		const struct iomap_ops *ops);
>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> --
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv7 3/6] iomap: Refactor some iop related accessor functions
  2023-06-05 22:54   ` Darrick J. Wong
@ 2023-06-05 23:51     ` Ritesh Harjani
  2023-06-06 16:03       ` Darrick J. Wong
  0 siblings, 1 reply; 23+ messages in thread
From: Ritesh Harjani @ 2023-06-05 23:51 UTC (permalink / raw)
  To: Darrick J. Wong, Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, Dave Chinner, Brian Foster,
	Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo,
	Disha Goel

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Mon, Jun 05, 2023 at 04:25:03PM +0530, Ritesh Harjani (IBM) wrote:
>> We would eventually use iomap_iop_** function naming in the rest of the
>> buffered-io iomap code. This patch updates function arguments and naming
>> from iomap_set_range_uptodate() -> iomap_iop_set_range_uptodate().
>> iop_set_range_uptodate() then becomes an accessor function used by
>> iomap_iop_** functions.
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/iomap/buffered-io.c | 111 +++++++++++++++++++++++------------------
>>  1 file changed, 63 insertions(+), 48 deletions(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 6fffda355c45..136f57ccd0be 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -24,14 +24,14 @@
>>  #define IOEND_BATCH_SIZE	4096
>>
>>  /*
>> - * Structure allocated for each folio when block size < folio size
>> - * to track sub-folio uptodate status and I/O completions.
>> + * Structure allocated for each folio to track per-block uptodate state
>> + * and I/O completions.
>>   */
>>  struct iomap_page {
>>  	atomic_t		read_bytes_pending;
>>  	atomic_t		write_bytes_pending;
>> -	spinlock_t		uptodate_lock;
>> -	unsigned long		uptodate[];
>> +	spinlock_t		state_lock;
>> +	unsigned long		state[];
>>  };
>>
>>  static inline struct iomap_page *to_iomap_page(struct folio *folio)
>> @@ -43,6 +43,48 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
>>
>>  static struct bio_set iomap_ioend_bioset;
>>
>> +static bool iop_test_full_uptodate(struct folio *folio)
>
> Same comment as Andreas, I think this works better with 'fully', e.g.
>
> iop_test_fully_uptodate()
>
> Why don't you pass the iomap_page directly into this function?  Doesn't
> that eliminate the need for iomap_iop_free to keep folio->private set
> until the very end?
>
> static inline bool
> iomap_iop_is_fully_uptodate(const struct iomap_page *iop,
> 			    const struct folio *folio)
>
> Same sort of thing for the second function -- we already extracted
> folio->private and checked it wasn't null, so we don't need to do that
> again.
>
> static inline bool
> iomap_iop_is_block_uptodate(const struct iomap_page *iop,
> 			   const struct folio *folio,
> 			   unsigned int block)
>
>> +{
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +	struct inode *inode = folio->mapping->host;
>> +
>> +	return bitmap_full(iop->state, i_blocks_per_folio(inode, folio));
>> +}
>> +
>> +static bool iop_test_block_uptodate(struct folio *folio, unsigned int block)
>> +{
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +
>> +	return test_bit(block, iop->state);
>> +}
>> +
>> +static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
>> +				   size_t off, size_t len)
>> +{
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +	unsigned int first_blk = off >> inode->i_blkbits;
>> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
>> +	unsigned int nr_blks = last_blk - first_blk + 1;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&iop->state_lock, flags);
>> +	bitmap_set(iop->state, first_blk, nr_blks);
>> +	if (iop_test_full_uptodate(folio))
>> +		folio_mark_uptodate(folio);
>> +	spin_unlock_irqrestore(&iop->state_lock, flags);
>> +}
>> +
>> +static void iomap_iop_set_range_uptodate(struct inode *inode,
>
> I don't understand why iomap_set_range_uptodate is now
> iomap_iop_set_range_uptodate; it doesn't take an iomap_page object as an
> argument...?
>
> I thought I understood that iomap_FOO operates on a folio and a range,
> whereas iomap_iop_FOO operates on sub-blocks within a folio?  And that
> you were renaming the iop_* functions to iomap_iop_*?
>
> I'm also not sure why iop_set_range_uptodate needs to be passed the
> struct inode; can't it extract that from folio->mapping->host, like
> current upstream does?
>

So, I am a bit confused about the __folio_mark_dirty() function...

i.e. __folio_mark_dirty() checks whether folio->mapping is not NULL.
That means that for marking a range of blocks dirty within the iop from
->dirty_folio(), we can't use folio->mapping->host, can we?
We have to use the inode from mapping->host (mapping is passed as a
parameter to ->dirty_folio).

I tried looking into the history of this, but I couldn't find anything.
This is also the reason I thought we would need to pass an inode to the
iop_set_range_dirty() function (because it is also called from
->dirty_folio -> iomap_dirty_folio()), and to keep things consistent I
kept the inode argument for the iop_**_uptodate functions as well.

Thoughts please?

<some code snippet>
====================

iomap_dirty_folio -> filemap_dirty_folio-> __folio_mark_dirty()

void __folio_mark_dirty(struct folio *folio, struct address_space *mapping,
			     int warn)
{
	unsigned long flags;

	xa_lock_irqsave(&mapping->i_pages, flags);
	if (folio->mapping) {	/* Race with truncate? */
		WARN_ON_ONCE(warn && !folio_test_uptodate(folio));
		folio_account_dirtied(folio, mapping);
		__xa_set_mark(&mapping->i_pages, folio_index(folio),
				PAGECACHE_TAG_DIRTY);
	}
	xa_unlock_irqrestore(&mapping->i_pages, flags);
}

-ritesh



> Generally I don't understand why this part of the patch is needed at
> all.  Wasn't the point merely to rename uptodate_* to state_* and
> introduce the iomap_iop_test_*_uptodate helpers?
>
> --D
>
>> +		struct folio *folio, size_t off, size_t len)
>> +{
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +
>> +	if (iop)
>> +		iop_set_range_uptodate(inode, folio, off, len);
>> +	else
>> +		folio_mark_uptodate(folio);
>> +}
>> +
>>  static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>>  				struct folio *folio, unsigned int flags)
>>  {
>> @@ -58,12 +100,12 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>>  	else
>>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>>
>> -	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
>> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
>>  		      gfp);
>>  	if (iop) {
>> -		spin_lock_init(&iop->uptodate_lock);
>> +		spin_lock_init(&iop->state_lock);
>>  		if (folio_test_uptodate(folio))
>> -			bitmap_fill(iop->uptodate, nr_blocks);
>> +			bitmap_fill(iop->state, nr_blocks);
>>  		folio_attach_private(folio, iop);
>>  	}
>>  	return iop;
>> @@ -72,14 +114,12 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>>  static void iomap_iop_free(struct folio *folio)
>>  {
>>  	struct iomap_page *iop = to_iomap_page(folio);
>> -	struct inode *inode = folio->mapping->host;
>> -	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>>
>>  	if (!iop)
>>  		return;
>>  	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
>>  	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
>> -	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
>> +	WARN_ON_ONCE(iop_test_full_uptodate(folio) !=
>>  			folio_test_uptodate(folio));
>>  	folio_detach_private(folio);
>>  	kfree(iop);
>> @@ -111,7 +151,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>>
>>  		/* move forward for each leading block marked uptodate */
>>  		for (i = first; i <= last; i++) {
>> -			if (!test_bit(i, iop->uptodate))
>> +			if (!iop_test_block_uptodate(folio, i))
>>  				break;
>>  			*pos += block_size;
>>  			poff += block_size;
>> @@ -121,7 +161,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>>
>>  		/* truncate len if we find any trailing uptodate block(s) */
>>  		for ( ; i <= last; i++) {
>> -			if (test_bit(i, iop->uptodate)) {
>> +			if (iop_test_block_uptodate(folio, i)) {
>>  				plen -= (last - i + 1) * block_size;
>>  				last = i - 1;
>>  				break;
>> @@ -145,30 +185,6 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>>  	*lenp = plen;
>>  }
>>
>> -static void iomap_iop_set_range_uptodate(struct folio *folio,
>> -		struct iomap_page *iop, size_t off, size_t len)
>> -{
>> -	struct inode *inode = folio->mapping->host;
>> -	unsigned first = off >> inode->i_blkbits;
>> -	unsigned last = (off + len - 1) >> inode->i_blkbits;
>> -	unsigned long flags;
>> -
>> -	spin_lock_irqsave(&iop->uptodate_lock, flags);
>> -	bitmap_set(iop->uptodate, first, last - first + 1);
>> -	if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
>> -		folio_mark_uptodate(folio);
>> -	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
>> -}
>> -
>> -static void iomap_set_range_uptodate(struct folio *folio,
>> -		struct iomap_page *iop, size_t off, size_t len)
>> -{
>> -	if (iop)
>> -		iomap_iop_set_range_uptodate(folio, iop, off, len);
>> -	else
>> -		folio_mark_uptodate(folio);
>> -}
>> -
>>  static void iomap_finish_folio_read(struct folio *folio, size_t offset,
>>  		size_t len, int error)
>>  {
>> @@ -178,7 +194,8 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
>>  		folio_clear_uptodate(folio);
>>  		folio_set_error(folio);
>>  	} else {
>> -		iomap_set_range_uptodate(folio, iop, offset, len);
>> +		iomap_iop_set_range_uptodate(folio->mapping->host, folio,
>> +					     offset, len);
>>  	}
>>
>>  	if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
>> @@ -214,7 +231,6 @@ struct iomap_readpage_ctx {
>>  static int iomap_read_inline_data(const struct iomap_iter *iter,
>>  		struct folio *folio)
>>  {
>> -	struct iomap_page *iop;
>>  	const struct iomap *iomap = iomap_iter_srcmap(iter);
>>  	size_t size = i_size_read(iter->inode) - iomap->offset;
>>  	size_t poff = offset_in_page(iomap->offset);
>> @@ -232,15 +248,14 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
>>  	if (WARN_ON_ONCE(size > iomap->length))
>>  		return -EIO;
>>  	if (offset > 0)
>> -		iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
>> -	else
>> -		iop = to_iomap_page(folio);
>> +		iomap_iop_alloc(iter->inode, folio, iter->flags);
>>
>>  	addr = kmap_local_folio(folio, offset);
>>  	memcpy(addr, iomap->inline_data, size);
>>  	memset(addr + size, 0, PAGE_SIZE - poff - size);
>>  	kunmap_local(addr);
>> -	iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
>> +	iomap_iop_set_range_uptodate(iter->inode, folio, offset,
>> +				     PAGE_SIZE - poff);
>>  	return 0;
>>  }
>>
>> @@ -277,7 +292,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>>
>>  	if (iomap_block_needs_zeroing(iter, pos)) {
>>  		folio_zero_range(folio, poff, plen);
>> -		iomap_set_range_uptodate(folio, iop, poff, plen);
>> +		iomap_iop_set_range_uptodate(iter->inode, folio, poff, plen);
>>  		goto done;
>>  	}
>>
>> @@ -452,7 +467,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
>>  	last = (from + count - 1) >> inode->i_blkbits;
>>
>>  	for (i = first; i <= last; i++)
>> -		if (!test_bit(i, iop->uptodate))
>> +		if (!iop_test_block_uptodate(folio, i))
>>  			return false;
>>  	return true;
>>  }
>> @@ -591,7 +606,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>>  			if (status)
>>  				return status;
>>  		}
>> -		iomap_set_range_uptodate(folio, iop, poff, plen);
>> +		iomap_iop_set_range_uptodate(iter->inode, folio, poff, plen);
>>  	} while ((block_start += plen) < block_end);
>>
>>  	return 0;
>> @@ -698,7 +713,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
>>  static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>>  		size_t copied, struct folio *folio)
>>  {
>> -	struct iomap_page *iop = to_iomap_page(folio);
>>  	flush_dcache_folio(folio);
>>
>>  	/*
>> @@ -714,7 +728,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>>  	 */
>>  	if (unlikely(copied < len && !folio_test_uptodate(folio)))
>>  		return 0;
>> -	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
>> +	iomap_iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos),
>> +				     len);
>>  	filemap_dirty_folio(inode->i_mapping, folio);
>>  	return copied;
>>  }
>> @@ -1630,7 +1645,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  	 * invalid, grab a new one.
>>  	 */
>>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
>> -		if (iop && !test_bit(i, iop->uptodate))
>> +		if (iop && !iop_test_block_uptodate(folio, i))
>>  			continue;
>>
>>  		error = wpc->ops->map_blocks(wpc, inode, pos);
>> --
>> 2.40.1
>>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv7 6/6] iomap: Add per-block dirty state tracking to improve performance
  2023-06-05 23:10   ` Darrick J. Wong
@ 2023-06-06  0:08     ` Ritesh Harjani
  0 siblings, 0 replies; 23+ messages in thread
From: Ritesh Harjani @ 2023-06-06  0:08 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Mon, Jun 05, 2023 at 04:25:06PM +0530, Ritesh Harjani (IBM) wrote:
>> When filesystem blocksize is less than folio size (either with
>> mapping_large_folio_support() or with blocksize < pagesize) and when the
>> folio is uptodate in pagecache, then even a byte write can cause
>> an entire folio to be written to disk during writeback. This happens
>> because we currently don't have a mechanism to track per-block dirty
>> state within struct iomap_page. We currently only track uptodate state.
>>
>> This patch implements support for tracking per-block dirty state in
>> iomap_page->state bitmap. This should help improve the filesystem write
>> performance and help reduce write amplification.
>>
>> Performance testing of below fio workload reveals ~16x performance
>> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
>> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
>>
>> 1. <test_randwrite.fio>
>> [global]
>> 	ioengine=psync
>> 	rw=randwrite
>> 	overwrite=1
>> 	pre_read=1
>> 	direct=0
>> 	bs=4k
>> 	size=1G
>> 	dir=./
>> 	numjobs=8
>> 	fdatasync=1
>> 	runtime=60
>> 	iodepth=64
>> 	group_reporting=1
>>
>> [fio-run]
>>
>> 2. Also our internal performance team reported that this patch improves
>>    their database workload performance by around ~83% (with XFS on Power)
>>
>> Reported-by: Aravinda Herle <araherle@in.ibm.com>
>> Reported-by: Brian Foster <bfoster@redhat.com>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/gfs2/aops.c         |   2 +-
>>  fs/iomap/buffered-io.c | 120 +++++++++++++++++++++++++++++++++++++++--
>>  fs/xfs/xfs_aops.c      |   2 +-
>>  fs/zonefs/file.c       |   2 +-
>>  include/linux/iomap.h  |   1 +
>>  5 files changed, 120 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
>> index a5f4be6b9213..75efec3c3b71 100644
>> --- a/fs/gfs2/aops.c
>> +++ b/fs/gfs2/aops.c
>> @@ -746,7 +746,7 @@ static const struct address_space_operations gfs2_aops = {
>>  	.writepages = gfs2_writepages,
>>  	.read_folio = gfs2_read_folio,
>>  	.readahead = gfs2_readahead,
>> -	.dirty_folio = filemap_dirty_folio,
>> +	.dirty_folio = iomap_dirty_folio,
>>  	.release_folio = iomap_release_folio,
>>  	.invalidate_folio = iomap_invalidate_folio,
>>  	.bmap = gfs2_bmap,
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 2a97d73edb96..e7d114b5b918 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -85,6 +85,63 @@ static void iomap_iop_set_range_uptodate(struct inode *inode,
>>  		folio_mark_uptodate(folio);
>>  }
>>
>> +static bool iop_test_block_dirty(struct folio *folio, int block)
>> +{
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +	struct inode *inode = folio->mapping->host;
>> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
>> +
>> +	return test_bit(block + blks_per_folio, iop->state);
>> +}
>> +
>> +static void iop_set_range_dirty(struct inode *inode, struct folio *folio,
>> +				size_t off, size_t len)
>> +{
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
>> +	unsigned int first_blk = off >> inode->i_blkbits;
>> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
>> +	unsigned int nr_blks = last_blk - first_blk + 1;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&iop->state_lock, flags);
>> +	bitmap_set(iop->state, first_blk + blks_per_folio, nr_blks);
>> +	spin_unlock_irqrestore(&iop->state_lock, flags);
>> +}
>> +
>> +static void iomap_iop_set_range_dirty(struct inode *inode, struct folio *folio,
>> +				size_t off, size_t len)
>> +{
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +
>> +	if (iop)
>> +		iop_set_range_dirty(inode, folio, off, len);
>> +}
>> +
>> +static void iop_clear_range_dirty(struct inode *inode, struct folio *folio,
>> +				  size_t off, size_t len)
>> +{
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
>> +	unsigned int first_blk = off >> inode->i_blkbits;
>> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
>> +	unsigned int nr_blks = last_blk - first_blk + 1;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&iop->state_lock, flags);
>> +	bitmap_clear(iop->state, first_blk + blks_per_folio, nr_blks);
>> +	spin_unlock_irqrestore(&iop->state_lock, flags);
>> +}
>> +
>> +static void iomap_iop_clear_range_dirty(struct inode *inode,
>> +				struct folio *folio, size_t off, size_t len)
>> +{
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +
>> +	if (iop)
>> +		iop_clear_range_dirty(inode, folio, off, len);
>> +}
>
> Same question as patch 3 -- I was under the impression that
> iomap_clear_range_dirty would take a folio/off/len, and if an iop is
> necessary, then it would call iomap_iop_clear_range_dirty with the
> iop/folio/off/len?
>
>

Sorry, a lot of renaming of the functions happened. I started with the
thought that anything iop related should start with the iomap_iop_
prefix. But what you are suggesting makes more sense.

For example, we have iomap_set_range_uptodate(), which handles both the
iop and !iop cases. So for the iop case it would then be
iomap_iop_set_range_uptodate(), and we should do the same thing for the
iomap_set_range_dirty() part as well.

Sure, this makes it clearer.

Also, I agree we can directly pass the iop to the iomap_iop_**
functions. I had assumed we need not pass it because it can be
extracted easily, but since we are extracting it in the caller anyway,
it makes sense to pass it as a function argument.
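
Just to make sure I understood, something like this for the dirty case
(a rough sketch only; naming not final):

static void iomap_iop_set_range_dirty(struct iomap_page *iop,
		struct inode *inode, struct folio *folio, size_t off,
		size_t len)
{
	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
	unsigned int first_blk = off >> inode->i_blkbits;
	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
	unsigned int nr_blks = last_blk - first_blk + 1;
	unsigned long flags;

	/* dirty bits live in the second half of iop->state */
	spin_lock_irqsave(&iop->state_lock, flags);
	bitmap_set(iop->state, first_blk + blks_per_folio, nr_blks);
	spin_unlock_irqrestore(&iop->state_lock, flags);
}

static void iomap_set_range_dirty(struct inode *inode, struct folio *folio,
		size_t off, size_t len)
{
	struct iomap_page *iop = to_iomap_page(folio);

	if (iop)
		iomap_iop_set_range_dirty(iop, inode, folio, off, len);
}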

>> +
>>  static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>>  				struct folio *folio, unsigned int flags)
>>  {
>> @@ -100,12 +157,20 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>>  	else
>>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>>
>> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
>> +	/*
>> +	 * iop->state tracks two sets of state flags when the
>> +	 * filesystem block size is smaller than the folio size.
>> +	 * The first state tracks per-block uptodate and the
>> +	 * second tracks per-block dirty state.
>> +	 */
>> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
>>  		      gfp);
>>  	if (iop) {
>>  		spin_lock_init(&iop->state_lock);
>>  		if (folio_test_uptodate(folio))
>> -			bitmap_fill(iop->state, nr_blocks);
>> +			bitmap_set(iop->state, 0, nr_blocks);
>> +		if (folio_test_dirty(folio))
>> +			bitmap_set(iop->state, nr_blocks, nr_blocks);
>>  		folio_attach_private(folio, iop);
>>  	}
>>  	return iop;
>> @@ -533,6 +598,17 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
>>  }
>>  EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
>>
>> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
>> +{
>> +	struct inode *inode = mapping->host;
>> +	size_t len = folio_size(folio);
>> +
>> +	iomap_iop_alloc(inode, folio, 0);
>
> Why do we need to allocate an iop if there isn't already one?  Doesn't
> it suffice to let filemap_dirty_folio even if there are multiple blocks
> per folio?  I understand why __iomap_write_end would do that for a
> sub-folio write, but not here where we always going to mark the whole
> folio dirty...?
>
> Oh, is it because everything else after this (write_end,
> write_delalloc_punch, etc) might need to undo some of the work, hence
> you want to have the iop allocated before we dirty the folio, no matter
> which path lead to the folio getting dirtied?
>

Frankly speaking, allocating an iop is always the right thing to do
here. Not allocating one for special cases (e.g. when all blocks are
dirty) is an optimization that defers the allocation to writeback time.

We also discussed other reasons for not doing this for mmapped writes
in v5 [1], and I suggested that we look into such optimizations in a
separate series after this one.

[1]: https://lore.kernel.org/linux-xfs/87ttw5ugse.fsf@doe.com/

-ritesh


>> +	iomap_iop_set_range_dirty(inode, folio, 0, len);
>> +	return filemap_dirty_folio(mapping, folio);
>> +}
>> +EXPORT_SYMBOL_GPL(iomap_dirty_folio);
>> +
>>  static void
>>  iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>>  {
>> @@ -739,6 +815,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>>  		return 0;
>>  	iomap_iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos),
>>  				     len);
>> +	iomap_iop_set_range_dirty(inode, folio, offset_in_folio(folio, pos),
>> +				  copied);
>
> ...and hence you don't want to be allocating it now?
>
> --D
>
>>  	filemap_dirty_folio(inode->i_mapping, folio);
>>  	return copied;
>>  }
>> @@ -908,6 +986,10 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
>>  		int (*punch)(struct inode *inode, loff_t offset, loff_t length))
>>  {
>>  	int ret = 0;
>> +	struct iomap_page *iop;
>> +	unsigned int first_blk, last_blk, i;
>> +	loff_t last_byte;
>> +	u8 blkbits = inode->i_blkbits;
>>
>>  	if (!folio_test_dirty(folio))
>>  		return ret;
>> @@ -919,6 +1001,29 @@ static int iomap_write_delalloc_punch(struct inode *inode, struct folio *folio,
>>  		if (ret)
>>  			goto out;
>>  	}
>> +	/*
>> +	 * When we have per-block dirty tracking, there can be
>> +	 * blocks within a folio which are marked uptodate
>> +	 * but not dirty. In that case it is necessary to punch
>> +	 * out such blocks to avoid leaking any delalloc blocks.
>> +	 */
>> +	iop = to_iomap_page(folio);
>> +	if (!iop)
>> +		goto skip_iop_punch;
>> +
>> +	last_byte = min_t(loff_t, end_byte - 1,
>> +		(folio_next_index(folio) << PAGE_SHIFT) - 1);
>> +	first_blk = offset_in_folio(folio, start_byte) >> blkbits;
>> +	last_blk = offset_in_folio(folio, last_byte) >> blkbits;
>> +	for (i = first_blk; i <= last_blk; i++) {
>> +		if (!iop_test_block_dirty(folio, i)) {
>> +			ret = punch(inode, i << blkbits, 1 << blkbits);
>> +			if (ret)
>> +				goto out;
>> +		}
>> +	}
>> +
>> +skip_iop_punch:
>>  	/*
>>  	 * Make sure the next punch start is correctly bound to
>>  	 * the end of this data range, not the end of the folio.
>> @@ -1652,7 +1757,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  		struct writeback_control *wbc, struct inode *inode,
>>  		struct folio *folio, u64 end_pos)
>>  {
>> -	struct iomap_page *iop = iomap_iop_alloc(inode, folio, 0);
>> +	struct iomap_page *iop = to_iomap_page(folio);
>>  	struct iomap_ioend *ioend, *next;
>>  	unsigned len = i_blocksize(inode);
>>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
>> @@ -1660,6 +1765,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  	int error = 0, count = 0, i;
>>  	LIST_HEAD(submit_list);
>>
>> +	if (!iop && nblocks > 1) {
>> +		iop = iomap_iop_alloc(inode, folio, 0);
>> +		iomap_iop_set_range_dirty(inode, folio, 0, folio_size(folio));
>> +	}
>> +
>>  	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
>>
>>  	/*
>> @@ -1668,7 +1778,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  	 * invalid, grab a new one.
>>  	 */
>>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
>> -		if (iop && !iop_test_block_uptodate(folio, i))
>> +		if (iop && !iop_test_block_dirty(folio, i))
>>  			continue;
>>
>>  		error = wpc->ops->map_blocks(wpc, inode, pos);
>> @@ -1712,6 +1822,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  		}
>>  	}
>>
>> +	iomap_iop_clear_range_dirty(inode, folio, 0,
>> +				    end_pos - folio_pos(folio));
>>  	folio_start_writeback(folio);
>>  	folio_unlock(folio);
>>
>> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
>> index 2ef78aa1d3f6..77c7332ae197 100644
>> --- a/fs/xfs/xfs_aops.c
>> +++ b/fs/xfs/xfs_aops.c
>> @@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
>>  	.read_folio		= xfs_vm_read_folio,
>>  	.readahead		= xfs_vm_readahead,
>>  	.writepages		= xfs_vm_writepages,
>> -	.dirty_folio		= filemap_dirty_folio,
>> +	.dirty_folio		= iomap_dirty_folio,
>>  	.release_folio		= iomap_release_folio,
>>  	.invalidate_folio	= iomap_invalidate_folio,
>>  	.bmap			= xfs_vm_bmap,
>> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
>> index 132f01d3461f..e508c8e97372 100644
>> --- a/fs/zonefs/file.c
>> +++ b/fs/zonefs/file.c
>> @@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
>>  	.read_folio		= zonefs_read_folio,
>>  	.readahead		= zonefs_readahead,
>>  	.writepages		= zonefs_writepages,
>> -	.dirty_folio		= filemap_dirty_folio,
>> +	.dirty_folio		= iomap_dirty_folio,
>>  	.release_folio		= iomap_release_folio,
>>  	.invalidate_folio	= iomap_invalidate_folio,
>>  	.migrate_folio		= filemap_migrate_folio,
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index e2b836c2e119..eb9335c46bf3 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
>>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>>  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
>> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
>>  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>>  		const struct iomap_ops *ops);
>>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
>> --
>> 2.40.1
>>


* Re: [PATCHv7 1/6] iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free()
  2023-06-05 22:36   ` Darrick J. Wong
@ 2023-06-06  4:20     ` Ritesh Harjani
  0 siblings, 0 replies; 23+ messages in thread
From: Ritesh Harjani @ 2023-06-06  4:20 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Mon, Jun 05, 2023 at 04:25:01PM +0530, Ritesh Harjani (IBM) wrote:
>> This patch renames the iomap_page_create/release() functions to
>> iomap_iop_alloc/free() calls. Later patches adds more functions for
>> handling iop structure with iomap_iop_** naming conventions.
>> Hence iomap_iop_alloc/free() makes more sense to be consistent with all
>> APIs.
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/iomap/buffered-io.c | 21 +++++++++++----------
>>  1 file changed, 11 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 063133ec77f4..4567bdd4fff9 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -43,8 +43,8 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
>>
>>  static struct bio_set iomap_ioend_bioset;
>>
>> -static struct iomap_page *
>> -iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
>> +static struct iomap_page *iomap_iop_alloc(struct inode *inode,
>
> Personally I preferred iop_alloc, but as I wasn't around to make to that
> point during the v6 review I'll let this slide.  iomap_iop_* it is.
>
> (I invoke maintainer privilege and will rename the structure to
> iomap_folio and iop->iof since the objects no longer track /only/ a
> single page state.)

Darrick,
Do you want me to rename iomap_page -> iomap_folio in this patch
itself, or would you prefer the iomap_page -> iomap_folio and
iop -> iof renames as a separate final patch in the series?

>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>

Thanks!
-ritesh


* Re: [PATCHv7 3/6] iomap: Refactor some iop related accessor functions
  2023-06-05 23:51     ` Ritesh Harjani
@ 2023-06-06 16:03       ` Darrick J. Wong
  2023-06-06 16:29         ` Matthew Wilcox
  0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2023-06-06 16:03 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Matthew Wilcox, linux-xfs, linux-fsdevel, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel

On Tue, Jun 06, 2023 at 05:21:32AM +0530, Ritesh Harjani wrote:
> "Darrick J. Wong" <djwong@kernel.org> writes:
> 
> > On Mon, Jun 05, 2023 at 04:25:03PM +0530, Ritesh Harjani (IBM) wrote:
> >> We would eventually use iomap_iop_** function naming by the rest of the
> >> buffered-io iomap code. This patch update function arguments and naming
> >> from iomap_set_range_uptodate() -> iomap_iop_set_range_uptodate().
> >> iop_set_range_uptodate() then becomes an accessor function used by
> >> iomap_iop_** functions.
> >>
> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> >> ---
> >>  fs/iomap/buffered-io.c | 111 +++++++++++++++++++++++------------------
> >>  1 file changed, 63 insertions(+), 48 deletions(-)
> >>
> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >> index 6fffda355c45..136f57ccd0be 100644
> >> --- a/fs/iomap/buffered-io.c
> >> +++ b/fs/iomap/buffered-io.c
> >> @@ -24,14 +24,14 @@
> >>  #define IOEND_BATCH_SIZE	4096
> >>
> >>  /*
> >> - * Structure allocated for each folio when block size < folio size
> >> - * to track sub-folio uptodate status and I/O completions.
> >> + * Structure allocated for each folio to track per-block uptodate state
> >> + * and I/O completions.
> >>   */
> >>  struct iomap_page {
> >>  	atomic_t		read_bytes_pending;
> >>  	atomic_t		write_bytes_pending;
> >> -	spinlock_t		uptodate_lock;
> >> -	unsigned long		uptodate[];
> >> +	spinlock_t		state_lock;
> >> +	unsigned long		state[];
> >>  };
> >>
> >>  static inline struct iomap_page *to_iomap_page(struct folio *folio)
> >> @@ -43,6 +43,48 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
> >>
> >>  static struct bio_set iomap_ioend_bioset;
> >>
> >> +static bool iop_test_full_uptodate(struct folio *folio)
> >
> > Same comment as Andreas, I think this works better with 'fully', e.g.
> >
> > iop_test_fully_uptodate()
> >
> > Why you don't pass the iomap_page directly into this function?  Doesn't
> > that eliminate the need for iomap_iop_free to keep folio->private set
> > until the very end?
> >
> > static inline bool
> > iomap_iop_is_fully_uptodate(const struct iomap_page *iop,
> > 			    const struct folio *folio)
> >
> > Same sort of thing for the second function -- we already extracted
> > folio->private and checked it wasn't null, so we don't need to do that
> > again.
> >
> > static inline bool
> > iomap_io_is_block_uptodate(const struct iomap_page *iop,
> > 			   const struct folio *folio,
> > 			   unsigned int block)
> >
> >> +{
> >> +	struct iomap_page *iop = to_iomap_page(folio);
> >> +	struct inode *inode = folio->mapping->host;
> >> +
> >> +	return bitmap_full(iop->state, i_blocks_per_folio(inode, folio));
> >> +}
> >> +
> >> +static bool iop_test_block_uptodate(struct folio *folio, unsigned int block)
> >> +{
> >> +	struct iomap_page *iop = to_iomap_page(folio);
> >> +
> >> +	return test_bit(block, iop->state);
> >> +}
> >> +
> >> +static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
> >> +				   size_t off, size_t len)
> >> +{
> >> +	struct iomap_page *iop = to_iomap_page(folio);
> >> +	unsigned int first_blk = off >> inode->i_blkbits;
> >> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> >> +	unsigned int nr_blks = last_blk - first_blk + 1;
> >> +	unsigned long flags;
> >> +
> >> +	spin_lock_irqsave(&iop->state_lock, flags);
> >> +	bitmap_set(iop->state, first_blk, nr_blks);
> >> +	if (iop_test_full_uptodate(folio))
> >> +		folio_mark_uptodate(folio);
> >> +	spin_unlock_irqrestore(&iop->state_lock, flags);
> >> +}
> >> +
> >> +static void iomap_iop_set_range_uptodate(struct inode *inode,
> >
> > I don't understand why iomap_set_range_uptodate is now
> > iomap_iop_set_range_uptodate; it doesn't take an iomap_page object as an
> > argument...?
> >
> > I thought I understood that iomap_FOO operates on a folio and a range,
> > whereas iomap_iop_FOO operates on sub-blocks within a folio?  And that
> > you were renaming the iop_* functions to iomap_iop_*?
> >
> > I'm also not sure why iop_set_range_uptodate needs to be passed the
> > struct inode; can't it extract that from folio->mapping->host, like
> > current upstream does?
> >
> 
> So, I do have a confusion in __folio_mark_dirty() function...
> 
> i.e. __folio_mark_dirty checks whether folio->mapping is not NULL.
> That means for marking range of blocks dirty within iop from
> ->dirty_folio(), we can't use folio->mapping->host is it?
> We have to use inode from mapping->host (mapping is passed as a
> parameter in ->dirty_folio).

Ah, yeah.  folio->mapping can become NULL if truncate races with us in
removing the folio from the foliocache.

For regular reads and writes this is a nonissue because those paths all
take i_rwsem and will block truncate.  However, for page_mkwrite, xfs
doesn't take mmap_invalidate_lock until after the vm_fault has been
given a folio to play with.

I think that means we can't rely on folio->mapping to be non-null, and
hence can't dereference folio->mapping->host.  That said, if the folio's
locked and folio->mapping is null, the page has been truncated so we
could just return VM_FAULT_NOPAGE, perhaps?
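
Something like this hypothetical guard early in the fault path, I mean
(assuming the folio lock is already held at that point):

	if (!folio->mapping) {
		/* raced with truncate, nothing left to write */
		folio_unlock(folio);
		return VM_FAULT_NOPAGE;
	}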

(willy might know better than I do...)

> I tried looking into the history of this, but I couldn't find any.
> This is also the reason I thought we would need to pass an inode for the
> iop_set_range_dirty() function (because it is also called from
> ->dirty_folio -> iomap_dirty_folio()) and to keep it consistent, I kept
> inode argument for iop_**_uptodate functions as well.

That seems ok to me, now that you've brought that ^^ to my attention.

--D

> Thoughts please?
> 
> <some code snippet>
> ====================
> 
> iomap_dirty_folio -> filemap_dirty_folio-> __folio_mark_dirty()
> 
> void __folio_mark_dirty(struct folio *folio, struct address_space *mapping,
> 			     int warn)
> {
> 	unsigned long flags;
> 
> 	xa_lock_irqsave(&mapping->i_pages, flags);
> 	if (folio->mapping) {	/* Race with truncate? */
> 		WARN_ON_ONCE(warn && !folio_test_uptodate(folio));
> 		folio_account_dirtied(folio, mapping);
> 		__xa_set_mark(&mapping->i_pages, folio_index(folio),
> 				PAGECACHE_TAG_DIRTY);
> 	}
> 	xa_unlock_irqrestore(&mapping->i_pages, flags);
> }
> 
> -ritesh
> 
> 
> 
> > Generally I don't understand why this part of the patch is needed at
> > all.  Wasn't the point merely to rename uptodate_* to state_* and
> > introduce the iomap_iop_test_*_uptodate helpers?
> >
> > --D
> >
> >> +		struct folio *folio, size_t off, size_t len)
> >> +{
> >> +	struct iomap_page *iop = to_iomap_page(folio);
> >> +
> >> +	if (iop)
> >> +		iop_set_range_uptodate(inode, folio, off, len);
> >> +	else
> >> +		folio_mark_uptodate(folio);
> >> +}
> >> +
> >>  static struct iomap_page *iomap_iop_alloc(struct inode *inode,
> >>  				struct folio *folio, unsigned int flags)
> >>  {
> >> @@ -58,12 +100,12 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
> >>  	else
> >>  		gfp = GFP_NOFS | __GFP_NOFAIL;
> >>
> >> -	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
> >> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> >>  		      gfp);
> >>  	if (iop) {
> >> -		spin_lock_init(&iop->uptodate_lock);
> >> +		spin_lock_init(&iop->state_lock);
> >>  		if (folio_test_uptodate(folio))
> >> -			bitmap_fill(iop->uptodate, nr_blocks);
> >> +			bitmap_fill(iop->state, nr_blocks);
> >>  		folio_attach_private(folio, iop);
> >>  	}
> >>  	return iop;
> >> @@ -72,14 +114,12 @@ static struct iomap_page *iomap_iop_alloc(struct inode *inode,
> >>  static void iomap_iop_free(struct folio *folio)
> >>  {
> >>  	struct iomap_page *iop = to_iomap_page(folio);
> >> -	struct inode *inode = folio->mapping->host;
> >> -	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> >>
> >>  	if (!iop)
> >>  		return;
> >>  	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
> >>  	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
> >> -	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> >> +	WARN_ON_ONCE(iop_test_full_uptodate(folio) !=
> >>  			folio_test_uptodate(folio));
> >>  	folio_detach_private(folio);
> >>  	kfree(iop);
> >> @@ -111,7 +151,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
> >>
> >>  		/* move forward for each leading block marked uptodate */
> >>  		for (i = first; i <= last; i++) {
> >> -			if (!test_bit(i, iop->uptodate))
> >> +			if (!iop_test_block_uptodate(folio, i))
> >>  				break;
> >>  			*pos += block_size;
> >>  			poff += block_size;
> >> @@ -121,7 +161,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
> >>
> >>  		/* truncate len if we find any trailing uptodate block(s) */
> >>  		for ( ; i <= last; i++) {
> >> -			if (test_bit(i, iop->uptodate)) {
> >> +			if (iop_test_block_uptodate(folio, i)) {
> >>  				plen -= (last - i + 1) * block_size;
> >>  				last = i - 1;
> >>  				break;
> >> @@ -145,30 +185,6 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
> >>  	*lenp = plen;
> >>  }
> >>
> >> -static void iomap_iop_set_range_uptodate(struct folio *folio,
> >> -		struct iomap_page *iop, size_t off, size_t len)
> >> -{
> >> -	struct inode *inode = folio->mapping->host;
> >> -	unsigned first = off >> inode->i_blkbits;
> >> -	unsigned last = (off + len - 1) >> inode->i_blkbits;
> >> -	unsigned long flags;
> >> -
> >> -	spin_lock_irqsave(&iop->uptodate_lock, flags);
> >> -	bitmap_set(iop->uptodate, first, last - first + 1);
> >> -	if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
> >> -		folio_mark_uptodate(folio);
> >> -	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
> >> -}
> >> -
> >> -static void iomap_set_range_uptodate(struct folio *folio,
> >> -		struct iomap_page *iop, size_t off, size_t len)
> >> -{
> >> -	if (iop)
> >> -		iomap_iop_set_range_uptodate(folio, iop, off, len);
> >> -	else
> >> -		folio_mark_uptodate(folio);
> >> -}
> >> -
> >>  static void iomap_finish_folio_read(struct folio *folio, size_t offset,
> >>  		size_t len, int error)
> >>  {
> >> @@ -178,7 +194,8 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
> >>  		folio_clear_uptodate(folio);
> >>  		folio_set_error(folio);
> >>  	} else {
> >> -		iomap_set_range_uptodate(folio, iop, offset, len);
> >> +		iomap_iop_set_range_uptodate(folio->mapping->host, folio,
> >> +					     offset, len);
> >>  	}
> >>
> >>  	if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
> >> @@ -214,7 +231,6 @@ struct iomap_readpage_ctx {
> >>  static int iomap_read_inline_data(const struct iomap_iter *iter,
> >>  		struct folio *folio)
> >>  {
> >> -	struct iomap_page *iop;
> >>  	const struct iomap *iomap = iomap_iter_srcmap(iter);
> >>  	size_t size = i_size_read(iter->inode) - iomap->offset;
> >>  	size_t poff = offset_in_page(iomap->offset);
> >> @@ -232,15 +248,14 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
> >>  	if (WARN_ON_ONCE(size > iomap->length))
> >>  		return -EIO;
> >>  	if (offset > 0)
> >> -		iop = iomap_iop_alloc(iter->inode, folio, iter->flags);
> >> -	else
> >> -		iop = to_iomap_page(folio);
> >> +		iomap_iop_alloc(iter->inode, folio, iter->flags);
> >>
> >>  	addr = kmap_local_folio(folio, offset);
> >>  	memcpy(addr, iomap->inline_data, size);
> >>  	memset(addr + size, 0, PAGE_SIZE - poff - size);
> >>  	kunmap_local(addr);
> >> -	iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
> >> +	iomap_iop_set_range_uptodate(iter->inode, folio, offset,
> >> +				     PAGE_SIZE - poff);
> >>  	return 0;
> >>  }
> >>
> >> @@ -277,7 +292,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
> >>
> >>  	if (iomap_block_needs_zeroing(iter, pos)) {
> >>  		folio_zero_range(folio, poff, plen);
> >> -		iomap_set_range_uptodate(folio, iop, poff, plen);
> >> +		iomap_iop_set_range_uptodate(iter->inode, folio, poff, plen);
> >>  		goto done;
> >>  	}
> >>
> >> @@ -452,7 +467,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
> >>  	last = (from + count - 1) >> inode->i_blkbits;
> >>
> >>  	for (i = first; i <= last; i++)
> >> -		if (!test_bit(i, iop->uptodate))
> >> +		if (!iop_test_block_uptodate(folio, i))
> >>  			return false;
> >>  	return true;
> >>  }
> >> @@ -591,7 +606,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> >>  			if (status)
> >>  				return status;
> >>  		}
> >> -		iomap_set_range_uptodate(folio, iop, poff, plen);
> >> +		iomap_iop_set_range_uptodate(iter->inode, folio, poff, plen);
> >>  	} while ((block_start += plen) < block_end);
> >>
> >>  	return 0;
> >> @@ -698,7 +713,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
> >>  static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> >>  		size_t copied, struct folio *folio)
> >>  {
> >> -	struct iomap_page *iop = to_iomap_page(folio);
> >>  	flush_dcache_folio(folio);
> >>
> >>  	/*
> >> @@ -714,7 +728,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> >>  	 */
> >>  	if (unlikely(copied < len && !folio_test_uptodate(folio)))
> >>  		return 0;
> >> -	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
> >> +	iomap_iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos),
> >> +				     len);
> >>  	filemap_dirty_folio(inode->i_mapping, folio);
> >>  	return copied;
> >>  }
> >> @@ -1630,7 +1645,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> >>  	 * invalid, grab a new one.
> >>  	 */
> >>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> >> -		if (iop && !test_bit(i, iop->uptodate))
> >> +		if (iop && !iop_test_block_uptodate(folio, i))
> >>  			continue;
> >>
> >>  		error = wpc->ops->map_blocks(wpc, inode, pos);
> >> --
> >> 2.40.1
> >>


* Re: [PATCHv7 3/6] iomap: Refactor some iop related accessor functions
  2023-06-06 16:03       ` Darrick J. Wong
@ 2023-06-06 16:29         ` Matthew Wilcox
  2023-06-07 13:08           ` Ritesh Harjani
  0 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2023-06-06 16:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ritesh Harjani, linux-xfs, linux-fsdevel, Dave Chinner,
	Brian Foster, Christoph Hellwig, Andreas Gruenbacher,
	Ojaswin Mujoo, Disha Goel

On Tue, Jun 06, 2023 at 09:03:17AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 06, 2023 at 05:21:32AM +0530, Ritesh Harjani wrote:
> > So, I do have a confusion in __folio_mark_dirty() function...
> > 
> > i.e. __folio_mark_dirty checks whether folio->mapping is not NULL.
> > That means for marking range of blocks dirty within iop from
> > ->dirty_folio(), we can't use folio->mapping->host is it?
> > We have to use inode from mapping->host (mapping is passed as a
> > parameter in ->dirty_folio).

It probably helps to read the commentary above filemap_dirty_folio().

 * The caller must ensure this doesn't race with truncation.  Most will
 * simply hold the folio lock, but e.g. zap_pte_range() calls with the
 * folio mapped and the pte lock held, which also locks out truncation.

But __folio_mark_dirty() can't rely on that!  Again, see the commentary:

 * This can also be called from mark_buffer_dirty(), which I
 * cannot prove is always protected against truncate.

iomap doesn't do bottom-up dirtying, only top-down.  So it absolutely
can rely on the VFS having taken the appropriate locks.

> Ah, yeah.  folio->mapping can become NULL if truncate races with us in
> removing the folio from the foliocache.
> 
> For regular reads and writes this is a nonissue because those paths all
> take i_rwsem and will block truncate.  However, for page_mkwrite, xfs
> doesn't take mmap_invalidate_lock until after the vm_fault has been
> given a folio to play with.

invalidate_lock isn't needed here.  You take the folio_lock, then you
call folio_mkwrite_check_truncate() to make sure it wasn't truncated
before you took the folio_lock.  Truncation will block on the folio_lock,
so you're good unless you release the folio_lock (which you don't,
you return it to the MM locked).
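
In other words the ordering is roughly what iomap_page_mkwrite()
already does today (sketching from memory, so double-check the
details):

	folio_lock(folio);
	ret = folio_mkwrite_check_truncate(folio, inode);
	if (ret < 0) {
		folio_unlock(folio);
		return block_page_mkwrite_return(ret);
	}
	/* folio is locked and not truncated; safe to dirty it */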



* Re: [PATCHv7 3/6] iomap: Refactor some iop related accessor functions
  2023-06-06 16:29         ` Matthew Wilcox
@ 2023-06-07 13:08           ` Ritesh Harjani
  0 siblings, 0 replies; 23+ messages in thread
From: Ritesh Harjani @ 2023-06-07 13:08 UTC (permalink / raw)
  To: Matthew Wilcox, Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, Dave Chinner, Brian Foster,
	Christoph Hellwig, Andreas Gruenbacher, Ojaswin Mujoo,
	Disha Goel

Matthew Wilcox <willy@infradead.org> writes:

> On Tue, Jun 06, 2023 at 09:03:17AM -0700, Darrick J. Wong wrote:
>> On Tue, Jun 06, 2023 at 05:21:32AM +0530, Ritesh Harjani wrote:
>> > So, I do have a confusion in __folio_mark_dirty() function...
>> > 
>> > i.e. __folio_mark_dirty checks whether folio->mapping is not NULL.
>> > That means for marking range of blocks dirty within iop from
>> > ->dirty_folio(), we can't use folio->mapping->host is it?
>> > We have to use inode from mapping->host (mapping is passed as a
>> > parameter in ->dirty_folio).
>
> It probably helps to read the commentary above filemap_dirty_folio().
>
>  * The caller must ensure this doesn't race with truncation.  Most will
>  * simply hold the folio lock, but e.g. zap_pte_range() calls with the
>  * folio mapped and the pte lock held, which also locks out truncation.
>
> But __folio_mark_dirty() can't rely on that!  Again, see the commentary:
>
>  * This can also be called from mark_buffer_dirty(), which I
>  * cannot prove is always protected against truncate.
>
> iomap doesn't do bottom-up dirtying, only top-down.  So it absolutely
> can rely on the VFS having taken the appropriate locks.
>

Right.

>> Ah, yeah.  folio->mapping can become NULL if truncate races with us in
>> removing the folio from the foliocache.
>> 
>> For regular reads and writes this is a nonissue because those paths all
>> take i_rwsem and will block truncate.  However, for page_mkwrite, xfs
>> doesn't take mmap_invalidate_lock until after the vm_fault has been
>> given a folio to play with.
>
> invalidate_lock isn't needed here.  You take the folio_lock, then you
> call folio_mkwrite_check_truncate() to make sure it wasn't truncated
> before you took the folio_lock.  Truncation will block on the folio_lock,
> so you're good unless you release the folio_lock (which you don't,
> you return it to the MM locked).

Ok, thanks for explaining this. So most callers hold the folio lock,
which protects against racing with truncation while calling
->dirty_folio(). The callers which cannot take the folio lock instead
hold the page table lock, which also locks out truncation.

So I can just go ahead and use folio->mapping->host in
iomap_dirty_folio() as well.

Thanks a lot, this helped. I will drop the inode function argument
then, and use folio->mapping->host instead.
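
Something like this, then (rough sketch for v8):

static void iomap_set_range_dirty(struct folio *folio, size_t off, size_t len)
{
	struct iomap_page *iop = to_iomap_page(folio);
	/* safe: ->dirty_folio() callers lock out truncate */
	struct inode *inode = folio->mapping->host;

	if (iop)
		iomap_iop_set_range_dirty(iop, inode, folio, off, len);
}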

-ritesh


Thread overview: 23+ messages
2023-06-05 10:55 [PATCHv7 0/6] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
2023-06-05 10:55 ` [PATCHv7 1/6] iomap: Rename iomap_page_create/release() to iomap_iop_alloc/free() Ritesh Harjani (IBM)
2023-06-05 22:36   ` Darrick J. Wong
2023-06-06  4:20     ` Ritesh Harjani
2023-06-05 10:55 ` [PATCHv7 2/6] iomap: Move folio_detach_private() in iomap_iop_free() to the end Ritesh Harjani (IBM)
2023-06-05 10:55 ` [PATCHv7 3/6] iomap: Refactor some iop related accessor functions Ritesh Harjani (IBM)
2023-06-05 14:15   ` Andreas Gruenbacher
2023-06-05 14:31     ` Matthew Wilcox
2023-06-05 20:48       ` Ritesh Harjani
2023-06-05 21:00     ` Ritesh Harjani
2023-06-05 21:50       ` Andreas Grünbacher
2023-06-05 22:54   ` Darrick J. Wong
2023-06-05 23:51     ` Ritesh Harjani
2023-06-06 16:03       ` Darrick J. Wong
2023-06-06 16:29         ` Matthew Wilcox
2023-06-07 13:08           ` Ritesh Harjani
2023-06-05 10:55 ` [PATCHv7 4/6] iomap: Refactor iomap_write_delalloc_punch() function out Ritesh Harjani (IBM)
2023-06-05 22:55   ` Darrick J. Wong
2023-06-05 10:55 ` [PATCHv7 5/6] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
2023-06-05 22:58   ` Darrick J. Wong
2023-06-05 10:55 ` [PATCHv7 6/6] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
2023-06-05 23:10   ` Darrick J. Wong
2023-06-06  0:08     ` Ritesh Harjani
