* [RFCv3 0/3] iomap: Add support for subpage dirty state tracking to improve write performance
@ 2023-02-26 19:43 Ritesh Harjani (IBM)
  2023-02-26 19:43 ` [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
                   ` (2 more replies)
From: Ritesh Harjani (IBM) @ 2023-02-26 19:43 UTC
  To: linux-xfs, linux-fsdevel; +Cc: Ritesh Harjani

Hello All,

Please find RFCv3 of the patchset which adds support for iomap subpage dirty
state tracking. This improves write performance and should reduce the write
amplification problem on platforms where the filesystem blocksize is smaller
than the pagesize, e.g. Power with a 64k default pagesize and a 4k filesystem
blocksize. (With 64k folios and 4k blocks, a single dirty 4k block previously
forced the whole 64k to be written back, i.e. up to 16x write amplification,
which lines up with the ~16x fio improvement reported below.)

RFCv2 -> RFCv3
===============
1. Addressed review comments on adding accessor APIs for both the uptodate
   and dirty iop bitmaps (todo-1 of rfcv2).
   Also addressed a few other review comments from Christoph & Matthew.
2. Performance testing of these patches reveals the same performance
   improvement as before, i.e. the given fio workload shows a 16x perf
   improvement on an nvme drive. (completed todo-3 of rfcv2)
3. Addressed todo-4 of rfcv2.

Few TODOs
===========
1. Test gfs2 and zonefs with these changes (todo-2 of rfcv2)
2. Look into todo-5 of rfcv2

xfstests with default options and 1k blocksize on x86 reveals no new issues,
and no surprises were observed on Power with 4k blocksize either.
(Please do suggest any specific xfstests config options (for xfs) that would
be good to test for this patch series.)


Cover letter of RFCv2 (copied below)
====================================

RFC -> RFCv2
=============
1. One of the key fixes in v2 is that earlier, when the folio got marked as
   dirty, we never marked the corresponding bits dirty in the iop bitmap.
   This patch adds iomap_dirty_folio() as the new ->dirty_folio() aops
   callback, which sets the dirty bitmap in the iop and then calls
   filemap_dirty_folio(). This was one of the review comments discussed in
   the RFC.

2. The other key fix identified in testing was that the iop structure might
   only get allocated at writeback time if the folio is uptodate
   (since it can get freed during memory pressure, or during
   truncate_inode_partial_folio() in case of a large folio). This could then
   cause nothing to get written if we have not marked the necessary bits as
   dirty in iop->state[]. Patch-1 & Patch-3 take care of that.

TODOs
======
1. I still need to work on macros which we could declare and use for easy
   reference to uptodate/dirty bits in iop->state[] bitmap (based on previous
   review comments).

2. Test xfstests on other filesystems which are using the iomap buffered write
   path (gfs2, zonefs).

3. Latest performance testing with this patch series (I am not expecting any
   surprises here; the perf improvements should be more or less similar to
   the RFC).

4. Address the remaining todo in Patch-3. I missed it and only noticed it now
   before sending, but it should be easily addressable. I can address it in
   the next revision along with the others.

5. Address one of the other review comments about what happens with a large
   folio: can we limit the size of the bitmaps if the folio is too large,
   e.g. > 2MB?

   [RH] - I can start looking into this area too, if we think these patches
   are looking good. My preference would be to work on todos 1-4 as part of
   this patch series and take up the bitmap optimization as follow-up work
   for the next part. Please do let me know your thoughts and suggestions on
   this.

Note: I have done a 4k bs test with the auto group on Power with 64k pagesize
and haven't found any surprises. I am also running a full pass of all tests
with x86 and 1k blocksize, but it hasn't completed yet. I can update the
results once it completes.

Also, as we discussed, all the dirty and uptodate bitmap tracking code for
iomap_page's state[] bitmap is still contained within the iomap
buffered-io.c file.

I would appreciate any review comments/feedback and help on this work, i.e.
adding subpage size dirty tracking to reduce the write amplification problem
and improve buffered write performance. Kindly note that w/o these patches,
the below type of workload gets severely impacted.


Performance Results from RFC [1]:
=================================
1. Performance testing of the below fio workload reveals a ~16x performance
improvement on nvme with XFS (4k blocksize) on Power (64K pagesize).
FIO-reported write bandwidth improved from ~28 MBps to ~452 MBps.

<test_randwrite.fio>
[global]
	ioengine=psync
	rw=randwrite
	overwrite=1
	pre_read=1
	direct=0
	bs=4k
	size=1G
	dir=./
	numjobs=8
	fdatasync=1
	runtime=60
	iodepth=64
	group_reporting=1

[fio-run]

2. Also, our internal performance team reported that this patch improves
   their database workload performance by around ~83% (with XFS on Power)

[1]: https://lore.kernel.org/linux-xfs/cover.1666928993.git.ritesh.list@gmail.com/


Ritesh Harjani (IBM) (3):
  iomap: Allocate iop in ->write_begin() early
  iomap: Change uptodate variable name to state
  iomap: Support subpage size dirty tracking to improve write performance

 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 166 ++++++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/super.c      |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 150 insertions(+), 23 deletions(-)

--
2.39.2



* [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early
  2023-02-26 19:43 [RFCv3 0/3] iomap: Add support for subpage dirty state tracking to improve write performance Ritesh Harjani (IBM)
@ 2023-02-26 19:43 ` Ritesh Harjani (IBM)
  2023-02-26 22:41   ` Dave Chinner
  2023-02-26 23:12   ` Matthew Wilcox
  2023-02-26 19:43 ` [RFCv3 2/3] iomap: Change uptodate variable name to state Ritesh Harjani (IBM)
  2023-02-26 19:43 ` [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance Ritesh Harjani (IBM)
From: Ritesh Harjani (IBM) @ 2023-02-26 19:43 UTC
  To: linux-xfs, linux-fsdevel; +Cc: Ritesh Harjani

Earlier, when the folio is uptodate, we only allocate the iop at writeback
time (in iomap_writepage_map()). This is ok until now, but when we are
going to add support for subpage size dirty bitmap tracking in the iop, this
could cause some performance degradation. The reason is that if we don't
allocate the iop during ->write_begin(), then we will never mark the
necessary dirty bits in the ->write_end() call. And we will have to mark all
the bits as dirty at writeback time, which could cause the same write
amplification and performance problems as now (w/o subpage dirty
bitmap tracking in the iop).

However, for all writes with a (pos, len) which completely overlaps
the given folio, there is no need to allocate an iop during
->write_begin(). So skip those cases.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 356193e44cf0..c5b51ab1184e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -535,11 +535,16 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	size_t from = offset_in_folio(folio, pos), to = from + len;
 	size_t poff, plen;
 
+	if (pos <= folio_pos(folio) &&
+	    pos + len >= folio_pos(folio) + folio_size(folio))
+		return 0;
+
+	iop = iomap_page_create(iter->inode, folio, iter->flags);
+
 	if (folio_test_uptodate(folio))
 		return 0;
 	folio_clear_error(folio);
 
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
 	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
 		return -EAGAIN;
 
-- 
2.39.2



* [RFCv3 2/3] iomap: Change uptodate variable name to state
  2023-02-26 19:43 [RFCv3 0/3] iomap: Add support for subpage dirty state tracking to improve write performance Ritesh Harjani (IBM)
  2023-02-26 19:43 ` [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
@ 2023-02-26 19:43 ` Ritesh Harjani (IBM)
  2023-02-26 23:12   ` Dave Chinner
  2023-02-26 23:23   ` Matthew Wilcox
  2023-02-26 19:43 ` [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance Ritesh Harjani (IBM)
From: Ritesh Harjani (IBM) @ 2023-02-26 19:43 UTC
  To: linux-xfs, linux-fsdevel; +Cc: Ritesh Harjani

This patch changes the struct iomap_page uptodate & uptodate_lock
member names to state and state_lock to better reflect their purpose
for the upcoming patch. It also introduces accessor functions for
updating the uptodate state bits in the iop->state bitmap. This makes it
easier to understand which bitmap type is being referred to in
different code paths.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 65 ++++++++++++++++++++++++++++++++----------
 1 file changed, 50 insertions(+), 15 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index c5b51ab1184e..e0b0be16278e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -25,13 +25,13 @@
 
 /*
  * Structure allocated for each folio when block size < folio size
- * to track sub-folio uptodate status and I/O completions.
+ * to track sub-folio uptodate state and I/O completions.
  */
 struct iomap_page {
 	atomic_t		read_bytes_pending;
 	atomic_t		write_bytes_pending;
-	spinlock_t		uptodate_lock;
-	unsigned long		uptodate[];
+	spinlock_t		state_lock;
+	unsigned long		state[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct folio *folio)
@@ -43,6 +43,38 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 
 static struct bio_set iomap_ioend_bioset;
 
+/*
+ * Accessor functions for setting/clearing/checking uptodate bits in
+ * iop->state bitmap.
+ * nrblocks is i_blocks_per_folio() which is passed in every
+ * function as the last argument for API consistency.
+ */
+static inline void iop_set_range_uptodate(struct iomap_page *iop,
+				unsigned int start, unsigned int len,
+				unsigned int nrblocks)
+{
+	bitmap_set(iop->state, start, len);
+}
+
+static inline void iop_clear_range_uptodate(struct iomap_page *iop,
+				unsigned int start, unsigned int len,
+				unsigned int nrblocks)
+{
+	bitmap_clear(iop->state, start, len);
+}
+
+static inline bool iop_test_uptodate(struct iomap_page *iop, unsigned int pos,
+				unsigned int nrblocks)
+{
+	return test_bit(pos, iop->state);
+}
+
+static inline bool iop_full_uptodate(struct iomap_page *iop,
+				unsigned int nrblocks)
+{
+	return bitmap_full(iop->state, nrblocks);
+}
+
 static struct iomap_page *
 iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 {
@@ -58,12 +90,12 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;
 
-	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
 		      gfp);
 	if (iop) {
-		spin_lock_init(&iop->uptodate_lock);
+		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
-			bitmap_fill(iop->uptodate, nr_blocks);
+			iop_set_range_uptodate(iop, 0, nr_blocks, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -79,7 +111,7 @@ static void iomap_page_release(struct folio *folio)
 		return;
 	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
 	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
-	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
+	WARN_ON_ONCE(iop_full_uptodate(iop, nr_blocks) !=
 			folio_test_uptodate(folio));
 	kfree(iop);
 }
@@ -99,6 +131,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	size_t plen = min_t(loff_t, folio_size(folio) - poff, length);
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
 	/*
 	 * If the block size is smaller than the page size, we need to check the
@@ -110,7 +143,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* move forward for each leading block marked uptodate */
 		for (i = first; i <= last; i++) {
-			if (!test_bit(i, iop->uptodate))
+			if (!iop_test_uptodate(iop, i, nr_blocks))
 				break;
 			*pos += block_size;
 			poff += block_size;
@@ -120,7 +153,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* truncate len if we find any trailing uptodate block(s) */
 		for ( ; i <= last; i++) {
-			if (test_bit(i, iop->uptodate)) {
+			if (iop_test_uptodate(iop, i, nr_blocks)) {
 				plen -= (last - i + 1) * block_size;
 				last = i - 1;
 				break;
@@ -151,12 +184,13 @@ static void iomap_iop_set_range_uptodate(struct folio *folio,
 	unsigned first = off >> inode->i_blkbits;
 	unsigned last = (off + len - 1) >> inode->i_blkbits;
 	unsigned long flags;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
-	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	bitmap_set(iop->uptodate, first, last - first + 1);
-	if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
+	spin_lock_irqsave(&iop->state_lock, flags);
+	iop_set_range_uptodate(iop, first, last - first + 1, nr_blocks);
+	if (iop_full_uptodate(iop, nr_blocks))
 		folio_mark_uptodate(folio);
-	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
 }
 
 static void iomap_set_range_uptodate(struct folio *folio,
@@ -439,6 +473,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 	struct iomap_page *iop = to_iomap_page(folio);
 	struct inode *inode = folio->mapping->host;
 	unsigned first, last, i;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
 	if (!iop)
 		return false;
@@ -451,7 +486,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 	last = (from + count - 1) >> inode->i_blkbits;
 
 	for (i = first; i <= last; i++)
-		if (!test_bit(i, iop->uptodate))
+		if (!iop_test_uptodate(iop, i, nr_blocks))
 			return false;
 	return true;
 }
@@ -1611,7 +1646,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !test_bit(i, iop->uptodate))
+		if (iop && !iop_test_uptodate(iop, i, nblocks))
 			continue;
 
 		error = wpc->ops->map_blocks(wpc, inode, pos);
-- 
2.39.2



* [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance
  2023-02-26 19:43 [RFCv3 0/3] iomap: Add support for subpage dirty state tracking to improve write performance Ritesh Harjani (IBM)
  2023-02-26 19:43 ` [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
  2023-02-26 19:43 ` [RFCv3 2/3] iomap: Change uptodate variable name to state Ritesh Harjani (IBM)
@ 2023-02-26 19:43 ` Ritesh Harjani (IBM)
  2023-02-26 23:48   ` Dave Chinner
  2023-03-02 14:02   ` Brian Foster
From: Ritesh Harjani (IBM) @ 2023-02-26 19:43 UTC
  To: linux-xfs, linux-fsdevel; +Cc: Ritesh Harjani, Aravinda Herle

On 64k pagesize platforms (especially Power and/or aarch64) with 4k
filesystem blocksize, this patch should improve performance by writing
out only the dirty subpage data.

This should also reduce write amplification since we can now track
subpage dirty status within the state bitmaps. Earlier we had to
write the entire 64k page even if only a part of it (e.g. 4k) was
updated.

Performance testing of the below fio workload reveals a ~16x performance
improvement on nvme with XFS (4k blocksize) on Power (64K pagesize).
FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.

1. <test_randwrite.fio>
[global]
	ioengine=psync
	rw=randwrite
	overwrite=1
	pre_read=1
	direct=0
	bs=4k
	size=1G
	dir=./
	numjobs=8
	fdatasync=1
	runtime=60
	iodepth=64
	group_reporting=1

[fio-run]

2. Also, our internal performance team reported that this patch improves
   their database workload performance by around ~83% (with XFS on Power)

Reported-by: Aravinda Herle <araherle@in.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 104 +++++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/super.c      |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 99 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index e782b4f1d104..b9c35288a5eb 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -741,7 +741,7 @@ static const struct address_space_operations gfs2_aops = {
 	.writepages = gfs2_writepages,
 	.read_folio = gfs2_read_folio,
 	.readahead = gfs2_readahead,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.bmap = gfs2_bmap,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e0b0be16278e..fb55183c547f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -44,8 +44,8 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 static struct bio_set iomap_ioend_bioset;

 /*
- * Accessor functions for setting/clearing/checking uptodate bits in
- * iop->state bitmap.
+ * Accessor functions for setting/clearing/checking uptodate and
+ * dirty bits in iop->state bitmap.
  * nrblocks is i_blocks_per_folio() which is passed in every
  * function as the last argument for API consistency.
  */
@@ -75,8 +75,29 @@ static inline bool iop_full_uptodate(struct iomap_page *iop,
 	return bitmap_full(iop->state, nrblocks);
 }

+static inline void iop_set_range_dirty(struct iomap_page *iop,
+				unsigned int start, unsigned int len,
+				unsigned int nrblocks)
+{
+	bitmap_set(iop->state, start + nrblocks, len);
+}
+
+static inline void iop_clear_range_dirty(struct iomap_page *iop,
+				unsigned int start, unsigned int len,
+				unsigned int nrblocks)
+{
+	bitmap_clear(iop->state, start + nrblocks, len);
+}
+
+static inline bool iop_test_dirty(struct iomap_page *iop, unsigned int pos,
+			     unsigned int nrblocks)
+{
+	return test_bit(pos + nrblocks, iop->state);
+}
+
 static struct iomap_page *
-iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
+iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags,
+		  bool is_dirty)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -90,12 +111,18 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;

-	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
+	/*
+	 * iop->state tracks 2 types of bitmaps i.e. uptodate & dirty
+	 * for bs < ps.
+	 */
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
 		      gfp);
 	if (iop) {
 		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
 			iop_set_range_uptodate(iop, 0, nr_blocks, nr_blocks);
+		if (is_dirty)
+			iop_set_range_dirty(iop, 0, nr_blocks, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -202,6 +229,48 @@ static void iomap_set_range_uptodate(struct folio *folio,
 		folio_mark_uptodate(folio);
 }

+static void iomap_iop_set_range_dirty(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len)
+{
+	struct inode *inode = folio->mapping->host;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+	unsigned first = (off >> inode->i_blkbits);
+	unsigned last = ((off + len - 1) >> inode->i_blkbits);
+	unsigned long flags;
+
+	spin_lock_irqsave(&iop->state_lock, flags);
+	iop_set_range_dirty(iop, first, last - first + 1, nr_blocks);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iomap_set_range_dirty(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len)
+{
+	if (iop)
+		iomap_iop_set_range_dirty(folio, iop, off, len);
+}
+
+static void iomap_iop_clear_range_dirty(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len)
+{
+	struct inode *inode = folio->mapping->host;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+	unsigned first = (off >> inode->i_blkbits);
+	unsigned last = ((off + len - 1) >> inode->i_blkbits);
+	unsigned long flags;
+
+	spin_lock_irqsave(&iop->state_lock, flags);
+	iop_clear_range_dirty(iop, first, last - first + 1, nr_blocks);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iomap_clear_range_dirty(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len)
+{
+	if (iop)
+		iomap_iop_clear_range_dirty(folio, iop, off, len);
+}
+
 static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 		size_t len, int error)
 {
@@ -265,7 +334,8 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iomap_page_create(iter->inode, folio, iter->flags);
+		iop = iomap_page_create(iter->inode, folio, iter->flags,
+					folio_test_dirty(folio));
 	else
 		iop = to_iomap_page(folio);

@@ -303,7 +373,8 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		return iomap_read_inline_data(iter, folio);

 	/* zero post-eof blocks as the page may be mapped */
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
+	iop = iomap_page_create(iter->inode, folio, iter->flags,
+				folio_test_dirty(folio));
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
@@ -532,6 +603,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);

+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
+{
+	unsigned int nr_blocks = i_blocks_per_folio(mapping->host, folio);
+	struct iomap_page *iop;
+
+	iop = iomap_page_create(mapping->host, folio, 0, false);
+	iomap_set_range_dirty(folio, iop, 0,
+			      nr_blocks << mapping->host->i_blkbits);
+	return filemap_dirty_folio(mapping, folio);
+}
+EXPORT_SYMBOL_GPL(iomap_dirty_folio);
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -574,7 +657,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;

-	iop = iomap_page_create(iter->inode, folio, iter->flags);
+	iop = iomap_page_create(iter->inode, folio, iter->flags,
+				folio_test_dirty(folio));

 	if (folio_test_uptodate(folio))
 		return 0;
@@ -726,6 +810,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
 	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
+	iomap_set_range_dirty(folio, iop, offset_in_folio(folio, pos), len);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
@@ -1630,7 +1715,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iomap_page_create(inode, folio, 0);
+	struct iomap_page *iop = iomap_page_create(inode, folio, 0, true);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1646,7 +1731,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !iop_test_uptodate(iop, i, nblocks))
+		if (iop && !iop_test_dirty(iop, i, nblocks))
 			continue;

 		error = wpc->ops->map_blocks(wpc, inode, pos);
@@ -1690,6 +1775,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		}
 	}

+	iomap_clear_range_dirty(folio, iop, 0, end_pos - folio_pos(folio));
 	folio_start_writeback(folio);
 	folio_unlock(folio);

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 41734202796f..7e6c54955b4f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -571,7 +571,7 @@ const struct address_space_operations xfs_address_space_operations = {
 	.read_folio		= xfs_vm_read_folio,
 	.readahead		= xfs_vm_readahead,
 	.writepages		= xfs_vm_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.bmap			= xfs_vm_bmap,
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index a9c5c3f720ad..4cefc2af87f3 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -267,7 +267,7 @@ static const struct address_space_operations zonefs_file_aops = {
 	.read_folio		= zonefs_read_folio,
 	.readahead		= zonefs_readahead,
 	.writepages		= zonefs_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.migrate_folio		= filemap_migrate_folio,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0983dfc9a203..b60562a0b893 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -262,6 +262,7 @@ void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
--
2.39.2



* Re: [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early
  2023-02-26 19:43 ` [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
@ 2023-02-26 22:41   ` Dave Chinner
  2023-02-28 17:55     ` Ritesh Harjani
  2023-02-26 23:12   ` Matthew Wilcox
From: Dave Chinner @ 2023-02-26 22:41 UTC
  To: Ritesh Harjani (IBM); +Cc: linux-xfs, linux-fsdevel

On Mon, Feb 27, 2023 at 01:13:30AM +0530, Ritesh Harjani (IBM) wrote:
> Earlier when the folio is uptodate, we only allocate iop at writeback
> time (in iomap_writepage_map()). This is ok until now, but when we are
> going to add support for subpage size dirty bitmap tracking in iop, this
> could cause some performance degradation. The reason is that if we don't
> allocate iop during ->write_begin(), then we will never mark the
> necessary dirty bits in ->write_end() call. And we will have to mark all
> the bits as dirty at the writeback time, that could cause the same write
> amplification and performance problems as it is now (w/o subpage dirty
> bitmap tracking in iop).
> 
> However, for all the writes with (pos, len) which completely overlaps
> the given folio, there is no need to allocate an iop during
> ->write_begin(). So skip those cases.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/iomap/buffered-io.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 356193e44cf0..c5b51ab1184e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -535,11 +535,16 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  	size_t from = offset_in_folio(folio, pos), to = from + len;
>  	size_t poff, plen;
>  
> +	if (pos <= folio_pos(folio) &&
> +	    pos + len >= folio_pos(folio) + folio_size(folio))
> +		return 0;

This is magic without a comment explaining why it exists. You have
that explanation in the commit message, but that doesn't help anyone
looking at the code:

	/*
	 * If the write completely overlaps the current folio, then the
	 * entire folio will be dirtied, so there is no need for
	 * sub-folio state tracking structures to be attached to this folio.
	 */

-Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early
  2023-02-26 19:43 ` [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
  2023-02-26 22:41   ` Dave Chinner
@ 2023-02-26 23:12   ` Matthew Wilcox
  2023-02-28 18:33     ` Ritesh Harjani
From: Matthew Wilcox @ 2023-02-26 23:12 UTC
  To: Ritesh Harjani (IBM); +Cc: linux-xfs, linux-fsdevel

On Mon, Feb 27, 2023 at 01:13:30AM +0530, Ritesh Harjani (IBM) wrote:
> +++ b/fs/iomap/buffered-io.c
> @@ -535,11 +535,16 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  	size_t from = offset_in_folio(folio, pos), to = from + len;
>  	size_t poff, plen;
>  
> +	if (pos <= folio_pos(folio) &&
> +	    pos + len >= folio_pos(folio) + folio_size(folio))
> +		return 0;
> +
> +	iop = iomap_page_create(iter->inode, folio, iter->flags);
> +
>  	if (folio_test_uptodate(folio))
>  		return 0;
>  	folio_clear_error(folio);
>  
> -	iop = iomap_page_create(iter->inode, folio, iter->flags);
>  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
>  		return -EAGAIN;

Don't you want to move the -EAGAIN check up too?  Otherwise an
io_uring write will dirty the entire folio rather than a block.

It occurs to me (even though I was the one who suggested the current
check) that pos <= folio_pos etc is actually a bit tighter than
necessary.  We could get away with:

	if (pos < folio_pos(folio) + block_size &&
	    pos + len > folio_pos(folio) + folio_size(folio) - block_size)

since that will also cause the entire folio to be dirtied (the write would
then start inside the first block and end inside the last block, so every
block gets at least partially written and hence dirtied).  Not sure if
it's worth it.


* Re: [RFCv3 2/3] iomap: Change uptodate variable name to state
  2023-02-26 19:43 ` [RFCv3 2/3] iomap: Change uptodate variable name to state Ritesh Harjani (IBM)
@ 2023-02-26 23:12   ` Dave Chinner
  2023-02-26 23:23   ` Matthew Wilcox
From: Dave Chinner @ 2023-02-26 23:12 UTC
  To: Ritesh Harjani (IBM); +Cc: linux-xfs, linux-fsdevel

On Mon, Feb 27, 2023 at 01:13:31AM +0530, Ritesh Harjani (IBM) wrote:
> This patch changes the struct iomap_page uptodate & uptodate_lock
> member names to state and state_lock to better reflect their purpose
> for the upcoming patch. It also introduces the accessor functions for
> updating uptodate state bits in iop->state bitmap. This makes the code
> easy to understand on when different bitmap types are getting referred
> in different code paths.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/iomap/buffered-io.c | 65 ++++++++++++++++++++++++++++++++----------
>  1 file changed, 50 insertions(+), 15 deletions(-)
....

The mechanical change itself looks fine, so from that perspective:

Reviewed-by: Dave Chinner <dchinner@redhat.com>

However, I'm wondering about the efficiency of these bit searches.

> @@ -110,7 +143,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>  
>  		/* move forward for each leading block marked uptodate */
>  		for (i = first; i <= last; i++) {
> -			if (!test_bit(i, iop->uptodate))
> +			if (!iop_test_uptodate(iop, i, nr_blocks))
>  				break;
>  			*pos += block_size;
>  			poff += block_size;

Looking at this code, it could have been written to use
find_first_zero_bit() rather than testing each bit individually...

> @@ -120,7 +153,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>  
>  		/* truncate len if we find any trailing uptodate block(s) */
>  		for ( ; i <= last; i++) {
> -			if (test_bit(i, iop->uptodate)) {
> +			if (iop_test_uptodate(iop, i, nr_blocks)) {
>  				plen -= (last - i + 1) * block_size;
>  				last = i - 1;
>  				break;

And this is find_first_bit()...

>  static void iomap_set_range_uptodate(struct folio *folio,
> @@ -439,6 +473,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
>  	struct iomap_page *iop = to_iomap_page(folio);
>  	struct inode *inode = folio->mapping->host;
>  	unsigned first, last, i;
> +	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>  
>  	if (!iop)
>  		return false;
> @@ -451,7 +486,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
>  	last = (from + count - 1) >> inode->i_blkbits;
>  
>  	for (i = first; i <= last; i++)
> -		if (!test_bit(i, iop->uptodate))
> +		if (!iop_test_uptodate(iop, i, nr_blocks))
>  			return false;

Again, find_first_zero_bit().

These seem like worthwhile optimisations in light of the heavier use
these bitmaps will get with sub-folio dirty tracking, especially
considering large folios will now use these paths. Do these
interface changes preclude the use of efficient bitmap searching
functions?
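
For example, the leading-block loop above could collapse to a single
bitmap search. A rough, untested sketch (note it reaches straight into
iop->state, so a search-style accessor would be needed to keep the
bitmap private to the accessor API):

		/* skip leading uptodate blocks with one bitmap search */
		i = find_next_zero_bit(iop->state, last + 1, first);
		*pos += (i - first) * block_size;
		poff += (i - first) * block_size;
		plen -= (i - first) * block_size;
		first = i;

The trailing-block truncation could similarly become a single
find_next_bit() over the remaining range.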

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFCv3 2/3] iomap: Change uptodate variable name to state
  2023-02-26 19:43 ` [RFCv3 2/3] iomap: Change uptodate variable name to state Ritesh Harjani (IBM)
  2023-02-26 23:12   ` Dave Chinner
@ 2023-02-26 23:23   ` Matthew Wilcox
  2023-02-28 18:38     ` Ritesh Harjani
From: Matthew Wilcox @ 2023-02-26 23:23 UTC
  To: Ritesh Harjani (IBM); +Cc: linux-xfs, linux-fsdevel

On Mon, Feb 27, 2023 at 01:13:31AM +0530, Ritesh Harjani (IBM) wrote:
> +static inline bool iop_test_uptodate(struct iomap_page *iop, unsigned int pos,
> +				unsigned int nrblocks)
> +{
> +	return test_bit(pos, iop->state);
> +}

'pos' is usually position within file, not within the folio.  That
should be called 'block' or 'start' like the other accessors.

> +static inline bool iop_full_uptodate(struct iomap_page *iop,
> +				unsigned int nrblocks)
> +{
> +	return bitmap_full(iop->state, nrblocks);
> +}

Not sure I like iop_full_uptodate() as a name.  iop_entirely_uptodate()?
iop_folio_uptodate()?  iop_all_uptodate()?



* Re: [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance
  2023-02-26 19:43 ` [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance Ritesh Harjani (IBM)
@ 2023-02-26 23:48   ` Dave Chinner
  2023-02-26 23:56     ` Matthew Wilcox
  2023-03-02 14:02   ` Brian Foster
From: Dave Chinner @ 2023-02-26 23:48 UTC
  To: Ritesh Harjani (IBM); +Cc: linux-xfs, linux-fsdevel, Aravinda Herle

On Mon, Feb 27, 2023 at 01:13:32AM +0530, Ritesh Harjani (IBM) wrote:
> On a 64k pagesize platforms (specially Power and/or aarch64) with 4k
> filesystem blocksize, this patch should improve the performance by doing
> only the subpage dirty data write.
> 
> This should also reduce the write amplification since we can now track
> subpage dirty status within state bitmaps. Earlier we had to
> write the entire 64k page even if only a part of it (e.g. 4k) was
> updated.
> 
> Performance testing of below fio workload reveals ~16x performance
> improvement on nvme with XFS (4k blocksize) on Power (64K pagesize)
> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
> 
> 1. <test_randwrite.fio>
> [global]
> 	ioengine=psync
> 	rw=randwrite
> 	overwrite=1
> 	pre_read=1
> 	direct=0
> 	bs=4k
> 	size=1G
> 	dir=./
> 	numjobs=8
> 	fdatasync=1
> 	runtime=60
> 	iodepth=64
> 	group_reporting=1
> 
> [fio-run]
> 
> 2. Also our internal performance team reported that this patch improves there
>    database workload performance by around ~83% (with XFS on Power)
> 
> Reported-by: Aravinda Herle <araherle@in.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/gfs2/aops.c         |   2 +-
>  fs/iomap/buffered-io.c | 104 +++++++++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_aops.c      |   2 +-
>  fs/zonefs/super.c      |   2 +-
>  include/linux/iomap.h  |   1 +
>  5 files changed, 99 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index e782b4f1d104..b9c35288a5eb 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -741,7 +741,7 @@ static const struct address_space_operations gfs2_aops = {
>  	.writepages = gfs2_writepages,
>  	.read_folio = gfs2_read_folio,
>  	.readahead = gfs2_readahead,
> -	.dirty_folio = filemap_dirty_folio,
> +	.dirty_folio = iomap_dirty_folio,
>  	.release_folio = iomap_release_folio,
>  	.invalidate_folio = iomap_invalidate_folio,
>  	.bmap = gfs2_bmap,
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e0b0be16278e..fb55183c547f 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -44,8 +44,8 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
>  static struct bio_set iomap_ioend_bioset;
> 
>  /*
> - * Accessor functions for setting/clearing/checking uptodate bits in
> - * iop->state bitmap.
> + * Accessor functions for setting/clearing/checking uptodate and
> + * dirty bits in iop->state bitmap.
>   * nrblocks is i_blocks_per_folio() which is passed in every
>   * function as the last argument for API consistency.
>   */
> @@ -75,8 +75,29 @@ static inline bool iop_full_uptodate(struct iomap_page *iop,
>  	return bitmap_full(iop->state, nrblocks);
>  }
> 
> +static inline void iop_set_range_dirty(struct iomap_page *iop,
> +				unsigned int start, unsigned int len,
> +				unsigned int nrblocks)
> +{
> +	bitmap_set(iop->state, start + nrblocks, len);
> +}
> +
> +static inline void iop_clear_range_dirty(struct iomap_page *iop,
> +				unsigned int start, unsigned int len,
> +				unsigned int nrblocks)
> +{
> +	bitmap_clear(iop->state, start + nrblocks, len);
> +}
> +
> +static inline bool iop_test_dirty(struct iomap_page *iop, unsigned int pos,
> +			     unsigned int nrblocks)
> +{
> +	return test_bit(pos + nrblocks, iop->state);
> +}
> +
>  static struct iomap_page *
> -iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
> +iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags,
> +		  bool is_dirty)
>  {
>  	struct iomap_page *iop = to_iomap_page(folio);
>  	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> @@ -90,12 +111,18 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
>  	else
>  		gfp = GFP_NOFS | __GFP_NOFAIL;
> 
> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> +	/*
> +	 * iop->state tracks 2 types of bitmaps i.e. uptodate & dirty
> +	 * for bs < ps.
> +	 */

Please write comments out in full. Also "bs < ps" is actually
wrong because we have large folios in the page cache and they will
need to use sub-folio state tracking if they are dirtied.

	/*
	 * iop->state tracks two sets of state flags when the
	 * filesystem block size is smaller than the folio size.
	 * The first tracks per-filesystem block uptodate state,
	 * the second tracks per-filesystem block dirty state.
	 */

> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
>  		      gfp);
>  	if (iop) {
>  		spin_lock_init(&iop->state_lock);
>  		if (folio_test_uptodate(folio))
>  			iop_set_range_uptodate(iop, 0, nr_blocks, nr_blocks);
> +		if (is_dirty)
> +			iop_set_range_dirty(iop, 0, nr_blocks, nr_blocks);
>  		folio_attach_private(folio, iop);
>  	}
>  	return iop;
> @@ -202,6 +229,48 @@ static void iomap_set_range_uptodate(struct folio *folio,
>  		folio_mark_uptodate(folio);
>  }
> 
> +static void iomap_iop_set_range_dirty(struct folio *folio,
> +		struct iomap_page *iop, size_t off, size_t len)
> +{
> +	struct inode *inode = folio->mapping->host;
> +	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> +	unsigned first = (off >> inode->i_blkbits);
> +	unsigned last = ((off + len - 1) >> inode->i_blkbits);

first_bit, last_bit if we are leaving this code unchanged.


> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&iop->state_lock, flags);
> +	iop_set_range_dirty(iop, first, last - first + 1, nr_blocks);
                                        ^^^^^^^^^^^^^^^^ nr_bits

I dislike all the magic "- 1" and "+ 1" sprinkles that end up in
this code because of the closed-range nomenclature infecting it.
If we use closed start/open ended ranges like we do all
through XFS, we have:

offset_to_start_bit()
{
	return off >> bits;			// round_down
}

offset_to_end_bit()
{
	return (off + (1 << bits) - 1) >> bits; // round_up
}

	unsigned start_bit = offset_to_start_bit(off, inode->i_blkbits);
	unsigned end_bit = offset_to_end_bit(off + len, inode->i_blkbits);
	unsigned nr_bits = end_bit - start_bit;

	iop_set_range_dirty(iop, start_bit, nr_bits, nr_blocks);

And we don't have to add magic "handle the off by one"
sprinkles everywhere that maths on closed ranges requires to get
correct...


> +	spin_unlock_irqrestore(&iop->state_lock, flags);
> +}
> +
> +static void iomap_set_range_dirty(struct folio *folio,
> +		struct iomap_page *iop, size_t off, size_t len)
> +{
> +	if (iop)
> +		iomap_iop_set_range_dirty(folio, iop, off, len);
> +}
> +
> +static void iomap_iop_clear_range_dirty(struct folio *folio,
> +		struct iomap_page *iop, size_t off, size_t len)
> +{
> +	struct inode *inode = folio->mapping->host;
> +	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> +	unsigned first = (off >> inode->i_blkbits);
> +	unsigned last = ((off + len - 1) >> inode->i_blkbits);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&iop->state_lock, flags);
> +	iop_clear_range_dirty(iop, first, last - first + 1, nr_blocks);
> +	spin_unlock_irqrestore(&iop->state_lock, flags);

Same here with the magic off by one sprinkles.

FWIW, this is exactly the same code as
iomap_iop_clear_range_uptodate(), is it not? The only difference is
the offset into the state bitmap and that is done by
iop_clear_range_dirty() vs iop_clear_range_uptodate()? Seems like a
layer of wrappers could be taken out of here simply by placing the
start offset correctly here for a common iop_clear_range() helper...

> +static void iomap_clear_range_dirty(struct folio *folio,
> +		struct iomap_page *iop, size_t off, size_t len)
> +{
> +	if (iop)
> +		iomap_iop_clear_range_dirty(folio, iop, off, len);
> +}

Or even here at this layer, and we squash out two layers of
very similar wrappers.
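
Something like this, maybe (a rough, untested sketch;
iomap_iop_clear_range() is a made-up name, and the same shape works
for the set side):

static void iomap_iop_clear_range(struct folio *folio,
		struct iomap_page *iop, size_t off, size_t len,
		unsigned int bitmap_offset)
{
	struct inode *inode = folio->mapping->host;
	unsigned int first_blk = off >> inode->i_blkbits;
	unsigned int nr_blks = (round_up(off + len, i_blocksize(inode)) >>
					inode->i_blkbits) - first_blk;
	unsigned long flags;

	/* no iop attached means no sub-folio state to clear */
	if (!iop)
		return;
	spin_lock_irqsave(&iop->state_lock, flags);
	bitmap_clear(iop->state, bitmap_offset + first_blk, nr_blks);
	spin_unlock_irqrestore(&iop->state_lock, flags);
}

with bitmap_offset being 0 for the uptodate bits and
i_blocks_per_folio(inode, folio) for the dirty bits, and using the open
ended range maths from above so none of the off by one sprinkles
survive.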

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance
  2023-02-26 23:48   ` Dave Chinner
@ 2023-02-26 23:56     ` Matthew Wilcox
  2023-02-27  0:10       ` Dave Chinner
From: Matthew Wilcox @ 2023-02-26 23:56 UTC
  To: Dave Chinner
  Cc: Ritesh Harjani (IBM), linux-xfs, linux-fsdevel, Aravinda Herle

On Mon, Feb 27, 2023 at 10:48:14AM +1100, Dave Chinner wrote:
> > +static void iomap_iop_set_range_dirty(struct folio *folio,
> > +		struct iomap_page *iop, size_t off, size_t len)
> > +{
> > +	struct inode *inode = folio->mapping->host;
> > +	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> > +	unsigned first = (off >> inode->i_blkbits);
> > +	unsigned last = ((off + len - 1) >> inode->i_blkbits);
> 
> first_bit, last_bit if we are leaving this code unchanged.

first_blk, last_blk, surely?


* Re: [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance
  2023-02-26 23:56     ` Matthew Wilcox
@ 2023-02-27  0:10       ` Dave Chinner
From: Dave Chinner @ 2023-02-27  0:10 UTC
  To: Matthew Wilcox
  Cc: Ritesh Harjani (IBM), linux-xfs, linux-fsdevel, Aravinda Herle

On Sun, Feb 26, 2023 at 11:56:05PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 27, 2023 at 10:48:14AM +1100, Dave Chinner wrote:
> > > +static void iomap_iop_set_range_dirty(struct folio *folio,
> > > +		struct iomap_page *iop, size_t off, size_t len)
> > > +{
> > > +	struct inode *inode = folio->mapping->host;
> > > +	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> > > +	unsigned first = (off >> inode->i_blkbits);
> > > +	unsigned last = ((off + len - 1) >> inode->i_blkbits);
> > 
> > first_bit, last_bit if we are leaving this code unchanged.
> 
> first_blk, last_blk, surely?

Probably.

And that's entirely my point - the variable names need to
describe what they contain...

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early
  2023-02-26 22:41   ` Dave Chinner
@ 2023-02-28 17:55     ` Ritesh Harjani
From: Ritesh Harjani @ 2023-02-28 17:55 UTC
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

Dave Chinner <david@fromorbit.com> writes:

> On Mon, Feb 27, 2023 at 01:13:30AM +0530, Ritesh Harjani (IBM) wrote:
>> Earlier when the folio is uptodate, we only allocate iop at writeback
>> time (in iomap_writepage_map()). This is ok until now, but when we are
>> going to add support for subpage size dirty bitmap tracking in iop, this
>> could cause some performance degradation. The reason is that if we don't
>> allocate iop during ->write_begin(), then we will never mark the
>> necessary dirty bits in ->write_end() call. And we will have to mark all
>> the bits as dirty at the writeback time, that could cause the same write
>> amplification and performance problems as it is now (w/o subpage dirty
>> bitmap tracking in iop).
>>
>> However, for all the writes with (pos, len) which completely overlaps
>> the given folio, there is no need to allocate an iop during
>> ->write_begin(). So skip those cases.
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/iomap/buffered-io.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 356193e44cf0..c5b51ab1184e 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -535,11 +535,16 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>>  	size_t from = offset_in_folio(folio, pos), to = from + len;
>>  	size_t poff, plen;
>>
>> +	if (pos <= folio_pos(folio) &&
>> +	    pos + len >= folio_pos(folio) + folio_size(folio))
>> +		return 0;
>
> This is magic without a comment explaining why it exists. You have
> that explanation in the commit message, but that doesn't help anyone
> looking at the code:
>
> 	/*
> 	 * If the write completely overlaps the current folio, then
> 	 * entire folio will be dirtied so there is no need for
> 	 * sub-folio state tracking structures to be attached to this folio.
> 	 */

Sure, got it. I will add a comment which explains this in the code as
well.

Thanks for the review!
-ritesh


* Re: [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early
  2023-02-26 23:12   ` Matthew Wilcox
@ 2023-02-28 18:33     ` Ritesh Harjani
  2023-02-28 18:36       ` Matthew Wilcox
From: Ritesh Harjani @ 2023-02-28 18:33 UTC
  To: Matthew Wilcox; +Cc: linux-xfs, linux-fsdevel

Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Feb 27, 2023 at 01:13:30AM +0530, Ritesh Harjani (IBM) wrote:
>> +++ b/fs/iomap/buffered-io.c
>> @@ -535,11 +535,16 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>>  	size_t from = offset_in_folio(folio, pos), to = from + len;
>>  	size_t poff, plen;
>>
>> +	if (pos <= folio_pos(folio) &&
>> +	    pos + len >= folio_pos(folio) + folio_size(folio))
>> +		return 0;
>> +
>> +	iop = iomap_page_create(iter->inode, folio, iter->flags);
>> +
>>  	if (folio_test_uptodate(folio))
>>  		return 0;
>>  	folio_clear_error(folio);
>>
>> -	iop = iomap_page_create(iter->inode, folio, iter->flags);
>>  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
>>  		return -EAGAIN;
>
> Don't you want to move the -EAGAIN check up too?  Otherwise an
> io_uring write will dirty the entire folio rather than a block.

I am not entirely convinced whether we should move this check up
(to put it just after the iop allocation). The reason is if the folio is
uptodate then it is ok to return 0 rather than -EAGAIN, because we are
anyway not going to read the folio from disk (given it is completely
uptodate).

Thoughts? Or am I missing anything here.

>
> It occurs to me (even though I was the one who suggested the current
> check) that pos <= folio_pos etc is actually a bit tighter than
> necessary.  We could get away with:
>
> 	if (pos < folio_pos(folio) + block_size &&
> 	    pos + len > folio_pos(folio) + folio_size(folio) - block_size)
>
> since that will also cause the entire folio to be dirtied.  Not sure if
> it's worth it.

I am not sure how much impact such a change would have. But I agree
that the above check is less restrictive.

Let me spend some more time thinking it through.

Thanks for the review!
-ritesh


* Re: [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early
  2023-02-28 18:33     ` Ritesh Harjani
@ 2023-02-28 18:36       ` Matthew Wilcox
  2023-03-02 18:59         ` Ritesh Harjani
From: Matthew Wilcox @ 2023-02-28 18:36 UTC
  To: Ritesh Harjani; +Cc: linux-xfs, linux-fsdevel

On Wed, Mar 01, 2023 at 12:03:48AM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Mon, Feb 27, 2023 at 01:13:30AM +0530, Ritesh Harjani (IBM) wrote:
> >> +++ b/fs/iomap/buffered-io.c
> >> @@ -535,11 +535,16 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> >>  	size_t from = offset_in_folio(folio, pos), to = from + len;
> >>  	size_t poff, plen;
> >>
> >> +	if (pos <= folio_pos(folio) &&
> >> +	    pos + len >= folio_pos(folio) + folio_size(folio))
> >> +		return 0;
> >> +
> >> +	iop = iomap_page_create(iter->inode, folio, iter->flags);
> >> +
> >>  	if (folio_test_uptodate(folio))
> >>  		return 0;
> >>  	folio_clear_error(folio);
> >>
> >> -	iop = iomap_page_create(iter->inode, folio, iter->flags);
> >>  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
> >>  		return -EAGAIN;
> >
> > Don't you want to move the -EAGAIN check up too?  Otherwise an
> > io_uring write will dirty the entire folio rather than a block.
> 
> I am not entirely convinced whether we should move this check up
> (to put it just after the iop allocation). The reason is if the folio is
> uptodate then it is ok to return 0 rather than -EAGAIN, because we are
> anyway not going to read the folio from disk (given it is completely
> uptodate).
> 
> Thoughts? Or am I missing anything here.

But then we won't have an iop, so a write will dirty the entire folio
instead of just the blocks you want to dirty.
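
i.e. an untested sketch, just reordering the hunk from this patch:

	if (pos <= folio_pos(folio) &&
	    pos + len >= folio_pos(folio) + folio_size(folio))
		return 0;

	iop = iomap_page_create(iter->inode, folio, iter->flags);
	/* NOWAIT should fail here rather than dirty the whole folio */
	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
		return -EAGAIN;

	if (folio_test_uptodate(folio))
		return 0;
	folio_clear_error(folio);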


* Re: [RFCv3 2/3] iomap: Change uptodate variable name to state
  2023-02-26 23:23   ` Matthew Wilcox
@ 2023-02-28 18:38     ` Ritesh Harjani
From: Ritesh Harjani @ 2023-02-28 18:38 UTC
  To: Matthew Wilcox; +Cc: linux-xfs, linux-fsdevel

Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Feb 27, 2023 at 01:13:31AM +0530, Ritesh Harjani (IBM) wrote:
>> +static inline bool iop_test_uptodate(struct iomap_page *iop, unsigned int pos,
>> +				unsigned int nrblocks)
>> +{
>> +	return test_bit(pos, iop->state);
>> +}
>
> 'pos' is usually position within file, not within the folio.  That
> should be called 'block' or 'start' like the other accessors.

Agreed. Will make the change in next rev.

>
>> +static inline bool iop_full_uptodate(struct iomap_page *iop,
>> +				unsigned int nrblocks)
>> +{
>> +	return bitmap_full(iop->state, nrblocks);
>> +}
>
> Not sure I like iop_full_uptodate() as a name.  iop_entirely_uptodate()?
> iop_folio_uptodate()?  iop_all_uptodate()?

I can settle for iop_all_uptodate(). But would you rather prefer
iop_uptodate_full() like bitmap_full()?

-ritesh


* Re: [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance
  2023-02-26 19:43 ` [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance Ritesh Harjani (IBM)
  2023-02-26 23:48   ` Dave Chinner
@ 2023-03-02 14:02   ` Brian Foster
  2023-04-26  9:57     ` Ritesh Harjani
From: Brian Foster @ 2023-03-02 14:02 UTC
  To: Ritesh Harjani (IBM); +Cc: linux-xfs, linux-fsdevel, Aravinda Herle

On Mon, Feb 27, 2023 at 01:13:32AM +0530, Ritesh Harjani (IBM) wrote:
> On a 64k pagesize platforms (specially Power and/or aarch64) with 4k
> filesystem blocksize, this patch should improve the performance by doing
> only the subpage dirty data write.
> 
> This should also reduce the write amplification since we can now track
> subpage dirty status within state bitmaps. Earlier we had to
> write the entire 64k page even if only a part of it (e.g. 4k) was
> updated.
> 
> Performance testing of below fio workload reveals ~16x performance
> improvement on nvme with XFS (4k blocksize) on Power (64K pagesize)
> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
> 
> 1. <test_randwrite.fio>
> [global]
> 	ioengine=psync
> 	rw=randwrite
> 	overwrite=1
> 	pre_read=1
> 	direct=0
> 	bs=4k
> 	size=1G
> 	dir=./
> 	numjobs=8
> 	fdatasync=1
> 	runtime=60
> 	iodepth=64
> 	group_reporting=1
> 
> [fio-run]
> 
> 2. Also our internal performance team reported that this patch improves there
>    database workload performance by around ~83% (with XFS on Power)
> 
> Reported-by: Aravinda Herle <araherle@in.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/gfs2/aops.c         |   2 +-
>  fs/iomap/buffered-io.c | 104 +++++++++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_aops.c      |   2 +-
>  fs/zonefs/super.c      |   2 +-
>  include/linux/iomap.h  |   1 +
>  5 files changed, 99 insertions(+), 12 deletions(-)
> 
...
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e0b0be16278e..fb55183c547f 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
...
> @@ -1630,7 +1715,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode,
>  		struct folio *folio, u64 end_pos)
>  {
> -	struct iomap_page *iop = iomap_page_create(inode, folio, 0);
> +	struct iomap_page *iop = iomap_page_create(inode, folio, 0, true);
>  	struct iomap_ioend *ioend, *next;
>  	unsigned len = i_blocksize(inode);
>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1646,7 +1731,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * invalid, grab a new one.
>  	 */
>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -		if (iop && !iop_test_uptodate(iop, i, nblocks))
> +		if (iop && !iop_test_dirty(iop, i, nblocks))
>  			continue;
> 
>  		error = wpc->ops->map_blocks(wpc, inode, pos);

Hi Ritesh,

I'm not sure if you followed any of the discussion on the imap
revalidation series that landed in the last cycle or so, but the
associated delalloc punch error handling code has a subtle dependency on
current writeback behavior and thus left a bit of a landmine for this
work. For reference, more detailed discussion starts around here [1].
The context is basically that the use of mapping seek hole/data relies
on uptodate status, which means in certain error cases the filesystem
might allocate a delalloc block for a write, but not punch it out if the
associated write happens to fail and the underlying portion of the folio
was uptodate.

This doesn't cause a problem in current mainline because writeback maps
every uptodate block in a dirty folio, and so the delalloc block will
convert at writeback time even though it wasn't written. This no longer
occurs with the change above, which means there's a vector for a stale
delalloc block to be left around in the inode. This is a free space
accounting corruption issue on XFS. Here's a quick example [2] on a 1k
FSB XFS filesystem to show exactly what I mean:

# xfs_io -fc "truncate 4k" -c "mmap 0 4k" -c "mread 0 4k" -c "pwrite 0 1" -c "pwrite -f 2k 1" -c fsync /mnt/file
# xfs_io -c "fiemap -v" /mnt/file 
/mnt/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
...
   2: [4..5]:          0..1                 2   0x7
...
(the above shows delalloc after an fsync)
# umount /mnt
  kernel:XFS: Assertion failed: xfs_is_shutdown(mp) || percpu_counter_sum(&mp->m_delalloc_blks) == 0, file: fs/xfs/xfs_super.c, line: 1068
# xfs_repair -n /dev/vdb2 
Phase 1 - find and verify superblock...
Phase 2 - using internal log
...
sb_fdblocks 20960187, counted 20960195
...
#

I suspect this means either the error handling code needs to be updated
to consider dirty state (i.e. punch delalloc if the block is !dirty), or
otherwise this needs to depend on a broader change in XFS to reclaim
delalloc blocks before inode eviction (along the lines of Dave's recent
proposal to do something like that for corrupted inodes). Of course the
caveat with the latter is that it doesn't help any other filesystems (?)
that might have similar expectations for delayed allocation and want to
use iomap.
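
For the former, I'd expect the punch side to more or less mirror the
writeback check from the hunk above. Completely untested, hand-wavy
sketch of what the per-folio portion of the scan might do (punch_block()
is a made-up stand-in for the filesystem punch callback; iop_test_dirty()
is the accessor from this series):

	/*
	 * Only blocks marked dirty in the iop bitmap are guaranteed to be
	 * written back, so punch the delalloc backing for everything else.
	 * No iop means writeback treats the whole folio as dirty.
	 */
	for (i = first_blk; i <= last_blk; i++) {
		if (iop && !iop_test_dirty(iop, i, nblocks))
			punch_block(inode, folio, i);
	}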

Brian

[1] https://lore.kernel.org/linux-fsdevel/Y3TsPzd0XzXXIzQv@bfoster/

[2] This test case depends on a local xfs_io hack to co-opt the -f flag
into inducing a write failure. A POC patch for that is available here,
if you wanted to replicate:

https://lore.kernel.org/linux-xfs/20221123181322.3710820-1-bfoster@redhat.com/


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early
  2023-02-28 18:36       ` Matthew Wilcox
@ 2023-03-02 18:59         ` Ritesh Harjani
  0 siblings, 0 replies; 18+ messages in thread
From: Ritesh Harjani @ 2023-03-02 18:59 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs, linux-fsdevel

Matthew Wilcox <willy@infradead.org> writes:

> On Wed, Mar 01, 2023 at 12:03:48AM +0530, Ritesh Harjani wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>>
>> > On Mon, Feb 27, 2023 at 01:13:30AM +0530, Ritesh Harjani (IBM) wrote:
>> >> +++ b/fs/iomap/buffered-io.c
>> >> @@ -535,11 +535,16 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>> >>  	size_t from = offset_in_folio(folio, pos), to = from + len;
>> >>  	size_t poff, plen;
>> >>
>> >> +	if (pos <= folio_pos(folio) &&
>> >> +	    pos + len >= folio_pos(folio) + folio_size(folio))
>> >> +		return 0;
>> >> +
>> >> +	iop = iomap_page_create(iter->inode, folio, iter->flags);
>> >> +
>> >>  	if (folio_test_uptodate(folio))
>> >>  		return 0;
>> >>  	folio_clear_error(folio);
>> >>
>> >> -	iop = iomap_page_create(iter->inode, folio, iter->flags);
>> >>  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
>> >>  		return -EAGAIN;
>> >
>> > Don't you want to move the -EAGAIN check up too?  Otherwise an
>> > io_uring write will dirty the entire folio rather than a block.
>>
>> I am not entirely convinced we should move this check up (to put it
>> just after the iop allocation). The reason is that if the folio is
>> uptodate, it is ok to return 0 rather than -EAGAIN, because we are not
>> going to read the folio from disk anyway (given it is completely
>> uptodate).
>>
>> Thoughts? Or am I missing anything here?
>
> But then we won't have an iop, so a write will dirty the entire folio
> instead of just the blocks you want to dirty.

Ok, I got what you are saying. Makes sense. I will give it a try.
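
Perhaps something along these lines (untested), so the IOMAP_NOWAIT case
bails out as soon as the iop allocation fails, before the uptodate
shortcut (nr_blocks is computed earlier in __iomap_write_begin()):

	if (pos <= folio_pos(folio) &&
	    pos + len >= folio_pos(folio) + folio_size(folio))
		return 0;

	iop = iomap_page_create(iter->inode, folio, iter->flags);
	/* bail before the uptodate shortcut, so NOWAIT writers retry */
	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
		return -EAGAIN;

	if (folio_test_uptodate(folio))
		return 0;
	folio_clear_error(folio);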

Thanks
-ritesh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance
  2023-03-02 14:02   ` Brian Foster
@ 2023-04-26  9:57     ` Ritesh Harjani
  0 siblings, 0 replies; 18+ messages in thread
From: Ritesh Harjani @ 2023-04-26  9:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel, Aravinda Herle

Brian Foster <bfoster@redhat.com> writes:

> On Mon, Feb 27, 2023 at 01:13:32AM +0530, Ritesh Harjani (IBM) wrote:
>> [... commit message, fio job, and diff snipped; see the copy above ...]
>
> Hi Ritesh,
>
> I'm not sure if you followed any of the discussion on the imap
> revalidation series that landed in the last cycle or so, but the
> associated delalloc punch error handling code has a subtle dependency on
> current writeback behavior and thus left a bit of a landmine for this
> work. For reference, more detailed discussion starts around here [1].
> The context is basically that the use of mapping seek hole/data relies
> on uptodate status, which means in certain error cases the filesystem
> might allocate a delalloc block for a write, but not punch it out if the
> associated write happens to fail and the underlying portion of the folio
> was uptodate.
>
> This doesn't cause a problem in current mainline because writeback maps
> every uptodate block in a dirty folio, and so the delalloc block will
> convert at writeback time even though it wasn't written. This no longer
> occurs with the change above, which means there's a vector for a stale
> delalloc block to be left around in the inode. This is a free space
> accounting corruption issue on XFS. Here's a quick example [2] on a 1k
> FSB XFS filesystem to show exactly what I mean:
>
> [... xfs_io reproducer and xfs_repair output snipped; see the copy above ...]
>
> I suspect this means either the error handling code needs to be updated
> to consider dirty state (i.e. punch delalloc if the block is !dirty), or
> otherwise this needs to depend on a broader change in XFS to reclaim
> delalloc blocks before inode eviction (along the lines of Dave's recent
> proposal to do something like that for corrupted inodes). Of course the
> caveat with the latter is that it doesn't help any other filesystems (?)
> that might have similar expectations for delayed allocation and want to
> use iomap.
>
> Brian
>
> [1] https://lore.kernel.org/linux-fsdevel/Y3TsPzd0XzXXIzQv@bfoster/
>
> [2] This test case depends on a local xfs_io hack to co-opt the -f flag
> into inducing a write failure. A POC patch for that is available here,
> if you wanted to replicate:
>
> https://lore.kernel.org/linux-xfs/20221123181322.3710820-1-bfoster@redhat.com/

Thanks a lot, Brian, for such detailed info along with a test case.
Really appreciate your insights on this! :)
Apologies for not getting to it earlier. I picked up the iomap DIO
conversion for ext2 in between. Now that it has settled, I will be
working on addressing this problem.

Yes, I did follow the patch series you pointed to, but I mostly thought
it was the stale iomap problem we were solving there. You are right that
the approach we take in that series, punching out the delalloc blocks in
case of short writes, does open up a potential bug once we add subpage
size dirty tracking to iomap. This was also discussed at the time and I
had noted it down as well, but I had completely forgotten about it. So
thanks for bringing it up again.

I looked into Dave's series. What we are doing there is identifying
uptodate folios within the [start, end) range; if there is a dirty folio
within that range, we punch out all of the delalloc blocks from
*punch_start_byte up to the start of that dirty folio. (Dirty folios
contain uptodate data which will anyway be written back at writeback
time, while uptodate non-dirty folios can be punched out to ensure we
don't leak any delalloc blocks due to short writes.)

The problem is that with subpage size blocks, a dirty folio can still
contain subblocks which are marked uptodate but not dirty. Given that we
never punch out these uptodate non-dirty subblocks from within such a
folio, delalloc blocks can leak, because at writeback time we only write
the subblocks of a folio which are marked dirty (with this patch series
adding subpage size dirty tracking). That can then cause the assertion
failure you pointed out at unmount.

I have a short fix for this problem: similar to what we do at writeback
time, consider the dirty state of every subblock within a folio during
the punch-out operation in iomap_write_delalloc_release() ->
iomap_write_delalloc_scan(). That is, when we identify a dirty folio in
the delalloc release path, we iterate over each subblock of that folio
as well and punch out the non-dirty ones. This makes sure we don't leave
any uptodate non-dirty subblocks backed by delalloc anywhere in the
given range [start, end).
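
Roughly along these lines (a completely untested sketch; the helper name
and the exact plumbing into iomap_write_delalloc_scan() are made up for
illustration, iop_test_dirty() is the accessor from this series, and
error handling from the punch callback is omitted):

	/*
	 * Called for each dirty folio found by the delalloc release scan,
	 * after the caller has punched everything up to folio_pos(folio).
	 * Blocks marked dirty in the iop bitmap will be written back, so
	 * skip past them; the accumulated non-dirty range before each
	 * dirty block gets punched out.
	 */
	static void iomap_folio_punch_non_dirty(struct inode *inode,
			struct folio *folio, loff_t *punch_start_byte,
			loff_t end_byte,
			int (*punch)(struct inode *inode, loff_t offset,
				     loff_t length))
	{
		struct iomap_page *iop = to_iomap_page(folio);
		unsigned int nblocks = i_blocks_per_folio(inode, folio);
		loff_t pos = folio_pos(folio);
		unsigned int i;

		if (!iop) {
			/* no iop: writeback treats the whole folio as dirty */
			*punch_start_byte = min_t(loff_t, end_byte,
					folio_pos(folio) + folio_size(folio));
			return;
		}

		for (i = 0; i < nblocks && pos < end_byte;
		     i++, pos += i_blocksize(inode)) {
			if (!iop_test_dirty(iop, i, nblocks))
				continue;
			/* punch everything accumulated before this dirty block */
			if (pos > *punch_start_byte)
				punch(inode, *punch_start_byte,
				      pos - *punch_start_byte);
			*punch_start_byte = pos + i_blocksize(inode);
		}
	}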

I have also addressed many of the other smaller review comments in v3.
I will include this change as well and send a new revision for review
(after some cleanups and some initial testing).

Thanks!
-ritesh

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-04-26  9:57 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-26 19:43 [RFCv3 0/3] iomap: Add support for subpage dirty state tracking to improve write performance Ritesh Harjani (IBM)
2023-02-26 19:43 ` [RFCv3 1/3] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
2023-02-26 22:41   ` Dave Chinner
2023-02-28 17:55     ` Ritesh Harjani
2023-02-26 23:12   ` Matthew Wilcox
2023-02-28 18:33     ` Ritesh Harjani
2023-02-28 18:36       ` Matthew Wilcox
2023-03-02 18:59         ` Ritesh Harjani
2023-02-26 19:43 ` [RFCv3 2/3] iomap: Change uptodate variable name to state Ritesh Harjani (IBM)
2023-02-26 23:12   ` Dave Chinner
2023-02-26 23:23   ` Matthew Wilcox
2023-02-28 18:38     ` Ritesh Harjani
2023-02-26 19:43 ` [RFCv3 3/3] iomap: Support subpage size dirty tracking to improve write performance Ritesh Harjani (IBM)
2023-02-26 23:48   ` Dave Chinner
2023-02-26 23:56     ` Matthew Wilcox
2023-02-27  0:10       ` Dave Chinner
2023-03-02 14:02   ` Brian Foster
2023-04-26  9:57     ` Ritesh Harjani
