* [RFCv2 0/3] iomap: Add support for subpage dirty state tracking to improve write performance
@ 2023-01-30 16:14 Ritesh Harjani (IBM)
  2023-01-30 16:14 ` [RFCv2 1/3] iomap: Move creation of iomap_page early in __iomap_write_begin Ritesh Harjani (IBM)
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-01-30 16:14 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel; +Cc: Aravinda Herle, Ritesh Harjani (IBM)

Hello All,

Please find the RFCv2 patchset which adds support for iomap subpage dirty state
tracking. This improves write performance and should reduce the write
amplification problem on platforms where the filesystem blocksize is smaller
than the pagesize, e.g. Power with its 64k default pagesize and a 4k
filesystem blocksize (where a small 4k write previously forced writeback of
the entire 64k page).

RFC -> RFCv2
=============
1. One of the key fixes in v2 is that earlier, when the folio got marked dirty,
   we were never marking the corresponding bits dirty in the iop bitmap.
   This version adds iomap_dirty_folio() as the new ->dirty_folio() aops
   callback, which sets the dirty bits in the iop bitmap and then calls
   filemap_dirty_folio(). This was one of the review comments discussed in the
   RFC.

2. The other key fix identified in testing is that the iop structure may only
   get allocated at writeback time if the folio is uptodate (the iop can get
   freed under memory pressure, or by truncate_inode_partial_folio() in case
   of a large folio). That could then cause nothing to get written if we have
   not marked the necessary bits as dirty in iop->state[]. Patch-1 & Patch-3
   take care of that.

TODOs
======
1. I still need to work on macros which we could declare and use for easy
   reference to uptodate/dirty bits in iop->state[] bitmap (based on previous
   review comments). A rough sketch of what I have in mind follows this list.

2. Run xfstests against the other filesystems which use the iomap buffered
   write path (gfs2, zonefs).

3. Redo the performance testing with this patch series (I am not expecting any
   surprises here; the perf improvements should be more or less the same as
   with the RFC).

4. Address the remaining TODO in Patch-3. I missed it and only noticed it now
   before sending, but it should be easily addressable. I can address it in
   the next revision along with the others.

5. Address one of the other review comments: what happens with a large folio?
   Can we limit the size of the bitmaps if the folio is too large, e.g. > 2MB?

   [RH] - I can start looking into this area too, if we think these patches
   are looking good. My preference would be to work on TODOs 1-4 as part of
   this patch series and take up the bitmap optimization as follow-up work.
   Please do let me know your thoughts and suggestions on this.
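
For TODO-1 above, here is a rough sketch of the kind of helpers I have in
mind (names are placeholders, not final; this assumes the Patch-3 layout
where bit i of iop->state[] is the uptodate bit for block i and bit
i + nr_blocks is its dirty bit):

static inline bool iop_test_block_uptodate(struct iomap_page *iop,
		unsigned int block)
{
	/* uptodate bits occupy [0, nr_blocks) of iop->state[] */
	return test_bit(block, iop->state);
}

static inline bool iop_test_block_dirty(struct iomap_page *iop,
		unsigned int block, unsigned int nr_blocks)
{
	/* dirty bits occupy [nr_blocks, 2 * nr_blocks) of iop->state[] */
	return test_bit(block + nr_blocks, iop->state);
}

With these, the writeback check in iomap_writepage_map() could read
"if (iop && !iop_test_block_dirty(iop, i, nblocks))" instead of open-coding
the i + nblocks index.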

Note: I have done a 4k blocksize run of the xfstests auto group on Power with
64k pagesize and found no surprises. I am also running a full round of all
tests on x86 with 1k blocksize, but it hasn't completed yet. I can update the
results once it completes.

Also, as we discussed, all the dirty and uptodate bitmap tracking code for
iomap_page's state[] bitmap is still contained within fs/iomap/buffered-io.c.

I would appreciate any review comments/feedback and help on this work, i.e.
adding subpage size dirty tracking to reduce the write amplification problem
and improve buffered write performance. Kindly note that without these
patches, the below type of workload gets severely impacted.


Performance Results from RFC [1]:
=================================
1. Performance testing of the below fio workload reveals a ~16x performance
improvement on NVMe with XFS (4k blocksize) on Power (64k pagesize).
FIO-reported write bandwidth improved from ~28 MBps to ~452 MBps.

<test_randwrite.fio>
[global]
	ioengine=psync
	rw=randwrite
	overwrite=1
	pre_read=1
	direct=0
	bs=4k
	size=1G
	dir=./
	numjobs=8
	fdatasync=1
	runtime=60
	iodepth=64
	group_reporting=1

[fio-run]

2. Also, our internal performance team reported that this patch improves their
   database workload performance by around ~83% (with XFS on Power).

[1]: https://lore.kernel.org/linux-xfs/cover.1666928993.git.ritesh.list@gmail.com/


Ritesh Harjani (IBM) (3):
  iomap: Move creation of iomap_page early in __iomap_write_begin
  iomap: Change uptodate variable name to state
  iomap: Support subpage size dirty tracking to improve write performance

 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 129 ++++++++++++++++++++++++++++++++++-------
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/super.c      |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 111 insertions(+), 25 deletions(-)

--
2.39.1



* [RFCv2 1/3] iomap: Move creation of iomap_page early in __iomap_write_begin
  2023-01-30 16:14 [RFCv2 0/3] iomap: Add support for subpage dirty state tracking to improve write performance Ritesh Harjani (IBM)
@ 2023-01-30 16:14 ` Ritesh Harjani (IBM)
  2023-01-30 17:02   ` Christoph Hellwig
  2023-01-30 16:14 ` [RFCv2 2/3] iomap: Change uptodate variable name to state Ritesh Harjani (IBM)
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-01-30 16:14 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel; +Cc: Aravinda Herle, Ritesh Harjani (IBM)

Before this commit[1], we used to call iomap_page_create() before
checking folio_test_uptodate() in __iomap_write_begin().

The problem is that commit[1] moved iop creation later, i.e. after checking
whether the folio is uptodate. If the folio is uptodate, it simply
returns and doesn't allocate an iop.
Now what can happen is that during __iomap_write_begin() for bs < ps,
there can be a folio which is marked uptodate but does not have an
iomap_page structure allocated.
(I think one of the reasons this can happen is memory pressure, under
which we can end up freeing the folio->private resource.)

Thus the iop structure only gets allocated at writeback time in
iomap_writepage_map(). I think this was not a problem until now, since
we only track uptodate status in the iop (no support for tracking dirty
bitmap status, which later patches will add), and we also end up setting
all the bits in iomap_page_create() if the folio is uptodate.

[1]: https://lore.kernel.org/all/20220623175157.1715274-5-shr@fb.com/
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 356193e44cf0..e9c85fcf7a1f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -527,7 +527,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 		size_t len, struct folio *folio)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
-	struct iomap_page *iop;
+	struct iomap_page *iop = iomap_page_create(iter->inode, folio,
+						   iter->flags);
 	loff_t block_size = i_blocksize(iter->inode);
 	loff_t block_start = round_down(pos, block_size);
 	loff_t block_end = round_up(pos + len, block_size);
@@ -539,7 +540,6 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 		return 0;
 	folio_clear_error(folio);
 
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
 	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
 		return -EAGAIN;
 
-- 
2.39.1



* [RFCv2 2/3] iomap: Change uptodate variable name to state
  2023-01-30 16:14 [RFCv2 0/3] iomap: Add support for subpage dirty state tracking to improve write performance Ritesh Harjani (IBM)
  2023-01-30 16:14 ` [RFCv2 1/3] iomap: Move creation of iomap_page early in __iomap_write_begin Ritesh Harjani (IBM)
@ 2023-01-30 16:14 ` Ritesh Harjani (IBM)
  2023-01-30 21:56   ` Dave Chinner
  2023-01-30 16:14 ` [RFCv2 3/3] iomap: Support subpage size dirty tracking to improve write performance Ritesh Harjani (IBM)
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-01-30 16:14 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel; +Cc: Aravinda Herle, Ritesh Harjani (IBM)

This patch just changes the struct iomap_page uptodate & uptodate_lock
member names to state and state_lock to better reflect their purpose for
the upcoming patch.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e9c85fcf7a1f..faee2852db8f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -25,13 +25,13 @@
 
 /*
  * Structure allocated for each folio when block size < folio size
- * to track sub-folio uptodate status and I/O completions.
+ * to track sub-folio uptodate state and I/O completions.
  */
 struct iomap_page {
 	atomic_t		read_bytes_pending;
 	atomic_t		write_bytes_pending;
-	spinlock_t		uptodate_lock;
-	unsigned long		uptodate[];
+	spinlock_t		state_lock;
+	unsigned long		state[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct folio *folio)
@@ -58,12 +58,12 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;
 
-	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
 		      gfp);
 	if (iop) {
-		spin_lock_init(&iop->uptodate_lock);
+		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
-			bitmap_fill(iop->uptodate, nr_blocks);
+			bitmap_fill(iop->state, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -79,7 +79,7 @@ static void iomap_page_release(struct folio *folio)
 		return;
 	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
 	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
-	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
+	WARN_ON_ONCE(bitmap_full(iop->state, nr_blocks) !=
 			folio_test_uptodate(folio));
 	kfree(iop);
 }
@@ -110,7 +110,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* move forward for each leading block marked uptodate */
 		for (i = first; i <= last; i++) {
-			if (!test_bit(i, iop->uptodate))
+			if (!test_bit(i, iop->state))
 				break;
 			*pos += block_size;
 			poff += block_size;
@@ -120,7 +120,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* truncate len if we find any trailing uptodate block(s) */
 		for ( ; i <= last; i++) {
-			if (test_bit(i, iop->uptodate)) {
+			if (test_bit(i, iop->state)) {
 				plen -= (last - i + 1) * block_size;
 				last = i - 1;
 				break;
@@ -152,11 +152,11 @@ static void iomap_iop_set_range_uptodate(struct folio *folio,
 	unsigned last = (off + len - 1) >> inode->i_blkbits;
 	unsigned long flags;
 
-	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	bitmap_set(iop->uptodate, first, last - first + 1);
-	if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
+	spin_lock_irqsave(&iop->state_lock, flags);
+	bitmap_set(iop->state, first, last - first + 1);
+	if (bitmap_full(iop->state, i_blocks_per_folio(inode, folio)))
 		folio_mark_uptodate(folio);
-	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
 }
 
 static void iomap_set_range_uptodate(struct folio *folio,
@@ -451,7 +451,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 	last = (from + count - 1) >> inode->i_blkbits;
 
 	for (i = first; i <= last; i++)
-		if (!test_bit(i, iop->uptodate))
+		if (!test_bit(i, iop->state))
 			return false;
 	return true;
 }
@@ -1606,7 +1606,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !test_bit(i, iop->uptodate))
+		if (iop && !test_bit(i, iop->state))
 			continue;
 
 		error = wpc->ops->map_blocks(wpc, inode, pos);
-- 
2.39.1



* [RFCv2 3/3] iomap: Support subpage size dirty tracking to improve write performance
  2023-01-30 16:14 [RFCv2 0/3] iomap: Add support for subpage dirty state tracking to improve write performance Ritesh Harjani (IBM)
  2023-01-30 16:14 ` [RFCv2 1/3] iomap: Move creation of iomap_page early in __iomap_write_begin Ritesh Harjani (IBM)
  2023-01-30 16:14 ` [RFCv2 2/3] iomap: Change uptodate variable name to state Ritesh Harjani (IBM)
@ 2023-01-30 16:14 ` Ritesh Harjani (IBM)
  2023-01-30 17:16   ` Christoph Hellwig
  2023-01-30 17:54   ` Matthew Wilcox
  2023-01-30 18:10 ` [RFCv2 0/3] iomap: Add support for subpage dirty state " Matthew Wilcox
  2023-02-02  4:45 ` Ritesh Harjani (IBM)
  4 siblings, 2 replies; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-01-30 16:14 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel; +Cc: Aravinda Herle, Ritesh Harjani (IBM)

On 64k pagesize platforms (especially Power and/or aarch64) with 4k
filesystem blocksize, this patch should improve write performance by
writing back only the dirty subpage data.

This should also reduce the write amplification since we can now track
subpage dirty status within state bitmaps. Earlier we had to
write the entire 64k page even if only a part of it (e.g. 4k) was
updated.
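
To illustrate the iop->state[] layout used here (e.g. a 64k folio with 4k
blocks, i.e. nr_blocks = 16):

	bit  0 .. 15 (block)              : per-block uptodate bits
	bit 16 .. 31 (block + nr_blocks)  : per-block dirty bits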

Performance testing of the below fio workload reveals a ~16x performance
improvement on NVMe with XFS (4k blocksize) on Power (64k pagesize).
FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.

<test_randwrite.fio>
[global]
	ioengine=psync
	rw=randwrite
	overwrite=1
	pre_read=1
	direct=0
	bs=4k
	size=1G
	dir=./
	numjobs=8
	fdatasync=1
	runtime=60
	iodepth=64
	group_reporting=1

[fio-run]

Reported-by: Aravinda Herle <araherle@in.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 103 +++++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/super.c      |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 98 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index e782b4f1d104..b9c35288a5eb 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -741,7 +741,7 @@ static const struct address_space_operations gfs2_aops = {
 	.writepages = gfs2_writepages,
 	.read_folio = gfs2_read_folio,
 	.readahead = gfs2_readahead,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.bmap = gfs2_bmap,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index faee2852db8f..af3e77276dab 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -44,7 +44,8 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 static struct bio_set iomap_ioend_bioset;
 
 static struct iomap_page *
-iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
+iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags,
+		  bool from_writeback)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -58,12 +59,32 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;
 
-	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
 		      gfp);
 	if (iop) {
 		spin_lock_init(&iop->state_lock);
-		if (folio_test_uptodate(folio))
-			bitmap_fill(iop->state, nr_blocks);
+		/*
+		 * iomap_page_create can get called from writeback after
+		 * a truncate_inode_partial_folio operation on a large folio.
+		 * For large folio the iop structure is freed in
+		 * iomap_invalidate_folio() to ensure we can split the folio.
+		 * That means we will have to let go of the optimization of
+		 * tracking dirty bits here and set all bits as dirty if
+		 * the folio is marked uptodate.
+		 */
+		if (from_writeback && folio_test_uptodate(folio))
+			bitmap_fill(iop->state, 2 * nr_blocks);
+		else if (folio_test_uptodate(folio)) {
+			unsigned start = offset_in_folio(folio,
+					folio_pos(folio)) >> inode->i_blkbits;
+			bitmap_set(iop->state, start, nr_blocks);
+		}
+		if (folio_test_dirty(folio)) {
+			unsigned start = offset_in_folio(folio,
+					folio_pos(folio)) >> inode->i_blkbits;
+			start = start + nr_blocks;
+			bitmap_set(iop->state, start, nr_blocks);
+		}
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -168,6 +189,48 @@ static void iomap_set_range_uptodate(struct folio *folio,
 		folio_mark_uptodate(folio);
 }
 
+static void iomap_iop_set_range_dirty(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len)
+{
+	struct inode *inode = folio->mapping->host;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+	unsigned first = (off >> inode->i_blkbits) + nr_blocks;
+	unsigned last = ((off + len - 1) >> inode->i_blkbits) + nr_blocks;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iop->state_lock, flags);
+	bitmap_set(iop->state, first, last - first + 1);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iomap_set_range_dirty(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len)
+{
+	if (iop)
+		iomap_iop_set_range_dirty(folio, iop, off, len);
+}
+
+static void iomap_iop_clear_range_dirty(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len)
+{
+	struct inode *inode = folio->mapping->host;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+	unsigned first = (off >> inode->i_blkbits) + nr_blocks;
+	unsigned last = ((off + len - 1) >> inode->i_blkbits) + nr_blocks;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iop->state_lock, flags);
+	bitmap_clear(iop->state, first, last - first + 1);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iomap_clear_range_dirty(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len)
+{
+	if (iop)
+		iomap_iop_clear_range_dirty(folio, iop, off, len);
+}
+
 static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 		size_t len, int error)
 {
@@ -231,7 +294,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iomap_page_create(iter->inode, folio, iter->flags);
+		iop = iomap_page_create(iter->inode, folio, iter->flags, false);
 	else
 		iop = to_iomap_page(folio);
 
@@ -269,7 +332,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		return iomap_read_inline_data(iter, folio);
 
 	/* zero post-eof blocks as the page may be mapped */
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
+	iop = iomap_page_create(iter->inode, folio, iter->flags, false);
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
@@ -497,6 +560,17 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
 
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
+{
+	unsigned int nr_blocks = i_blocks_per_folio(mapping->host, folio);
+	struct iomap_page *iop = iomap_page_create(mapping->host, folio, 0, false);
+
+	iomap_set_range_dirty(folio, iop, offset_in_folio(folio, folio_pos(folio)),
+			      nr_blocks << mapping->host->i_blkbits);
+	return filemap_dirty_folio(mapping, folio);
+}
+EXPORT_SYMBOL_GPL(iomap_dirty_folio);
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -528,7 +602,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	struct iomap_page *iop = iomap_page_create(iter->inode, folio,
-						   iter->flags);
+						   iter->flags, false);
 	loff_t block_size = i_blocksize(iter->inode);
 	loff_t block_start = round_down(pos, block_size);
 	loff_t block_end = round_up(pos + len, block_size);
@@ -686,6 +760,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
 	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
+	iomap_set_range_dirty(folio, iop, offset_in_folio(folio, pos), len);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
@@ -1231,6 +1306,13 @@ static loff_t iomap_folio_mkwrite_iter(struct iomap_iter *iter,
 		block_commit_write(&folio->page, 0, length);
 	} else {
 		WARN_ON_ONCE(!folio_test_uptodate(folio));
+		/*
+		 * TODO: We need not set range of dirty bits in iop here.
+		 * This will be taken care by iomap_dirty_folio callback
+		 * function which gets called from folio_mark_dirty().
+		 */
+		iomap_set_range_dirty(folio, to_iomap_page(folio),
+				offset_in_folio(folio, iter->pos), length);
 		folio_mark_dirty(folio);
 	}
 
@@ -1590,7 +1672,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iomap_page_create(inode, folio, 0);
+	struct iomap_page *iop = iomap_page_create(inode, folio, 0, true);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1606,7 +1688,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !test_bit(i, iop->state))
+		if (iop && !test_bit(i + nblocks, iop->state))
 			continue;
 
 		error = wpc->ops->map_blocks(wpc, inode, pos);
@@ -1650,6 +1732,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		}
 	}
 
+	iomap_clear_range_dirty(folio, iop,
+				offset_in_folio(folio, folio_pos(folio)),
+				end_pos - folio_pos(folio));
 	folio_start_writeback(folio);
 	folio_unlock(folio);
 
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 41734202796f..7e6c54955b4f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -571,7 +571,7 @@ const struct address_space_operations xfs_address_space_operations = {
 	.read_folio		= xfs_vm_read_folio,
 	.readahead		= xfs_vm_readahead,
 	.writepages		= xfs_vm_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.bmap			= xfs_vm_bmap,
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index a9c5c3f720ad..4cefc2af87f3 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -267,7 +267,7 @@ static const struct address_space_operations zonefs_file_aops = {
 	.read_folio		= zonefs_read_folio,
 	.readahead		= zonefs_readahead,
 	.writepages		= zonefs_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.migrate_folio		= filemap_migrate_folio,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0983dfc9a203..b60562a0b893 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -262,6 +262,7 @@ void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
-- 
2.39.1



* Re: [RFCv2 1/3] iomap: Move creation of iomap_page early in __iomap_write_begin
  2023-01-30 16:14 ` [RFCv2 1/3] iomap: Move creation of iomap_page early in __iomap_write_begin Ritesh Harjani (IBM)
@ 2023-01-30 17:02   ` Christoph Hellwig
  2023-01-30 20:21     ` Ritesh Harjani (IBM)
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2023-01-30 17:02 UTC (permalink / raw)
  To: Ritesh Harjani (IBM); +Cc: linux-xfs, linux-fsdevel, Aravinda Herle

On Mon, Jan 30, 2023 at 09:44:11PM +0530, Ritesh Harjani (IBM) wrote:
> The problem is that commit[1] moved iop creation later, i.e. after checking
> whether the folio is uptodate. If the folio is uptodate, it simply
> returns and doesn't allocate an iop.
> Now what can happen is that during __iomap_write_begin() for bs < ps,
> there can be a folio which is marked uptodate but does not have an
> iomap_page structure allocated.
> (I think one of the reasons this can happen is memory pressure, under
> which we can end up freeing the folio->private resource.)
> 
> Thus the iop structure only gets allocated at writeback time in
> iomap_writepage_map(). I think this was not a problem until now, since
> we only track uptodate status in the iop (no support for tracking dirty
> bitmap status, which later patches will add), and we also end up setting
> all the bits in iomap_page_create() if the folio is uptodate.

delayed iop allocation is a feature and not a bug.  We might have to
refine the criteria for sub-page dirty tracking, but in general having
the iop allocated is a memory and performance overhead and should be
avoided as much as possible.  In fact I still have some unfinished
work to allocate it even more lazily.


* Re: [RFCv2 3/3] iomap: Support subpage size dirty tracking to improve write performance
  2023-01-30 16:14 ` [RFCv2 3/3] iomap: Support subpage size dirty tracking to improve write performance Ritesh Harjani (IBM)
@ 2023-01-30 17:16   ` Christoph Hellwig
  2023-01-30 18:01     ` Matthew Wilcox
  2023-01-30 20:27     ` Ritesh Harjani (IBM)
  2023-01-30 17:54   ` Matthew Wilcox
  1 sibling, 2 replies; 23+ messages in thread
From: Christoph Hellwig @ 2023-01-30 17:16 UTC (permalink / raw)
  To: Ritesh Harjani (IBM); +Cc: linux-xfs, linux-fsdevel, Aravinda Herle

On Mon, Jan 30, 2023 at 09:44:13PM +0530, Ritesh Harjani (IBM) wrote:
> +iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags,
> +		  bool from_writeback)
>  {
>  	struct iomap_page *iop = to_iomap_page(folio);
>  	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> @@ -58,12 +59,32 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
>  	else
>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>  
> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
>  		      gfp);
>  	if (iop) {

Please just return early here for the allocation failure case instead of 
adding a lot of code with extra indentation.

>  		spin_lock_init(&iop->state_lock);
> -		if (folio_test_uptodate(folio))
> -			bitmap_fill(iop->state, nr_blocks);
> +		/*
> +		 * iomap_page_create can get called from writeback after
> +		 * a truncate_inode_partial_folio operation on a large folio.
> +		 * For large folio the iop structure is freed in
> +		 * iomap_invalidate_folio() to ensure we can split the folio.
> +		 * That means we will have to let go of the optimization of
> +		 * tracking dirty bits here and set all bits as dirty if
> +		 * the folio is marked uptodate.
> +		 */
> +		if (from_writeback && folio_test_uptodate(folio))
> +			bitmap_fill(iop->state, 2 * nr_blocks);
> +		else if (folio_test_uptodate(folio)) {

This code is very confusing.  First please only check
folio_test_uptodate once, and then check the from_writeback flag
inside the branch.  And as mentioned last time I think you really
need some symbolic constants for dealing with dirty vs uptodate
state and not just do a single fill for them.

> +			unsigned start = offset_in_folio(folio,
> +					folio_pos(folio)) >> inode->i_blkbits;
> +			bitmap_set(iop->state, start, nr_blocks);

Also this code leaves my head scratching.  Unless I'm missing something
important

	 offset_in_folio(folio, folio_pos(folio))

must always return 0.

Also the from_writeback logic is weird.  I'd rather have a
"bool is_dirty" argument and then pass true for writeback beause
we know the folio is dirty, false where we know it can't be
dirty and do the folio_test_dirty in the caller where we don't
know the state.

> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> +{
> +	unsigned int nr_blocks = i_blocks_per_folio(mapping->host, folio);
> +	struct iomap_page *iop = iomap_page_create(mapping->host, folio, 0, false);

Please avoid the overly long line.  In fact with such long function
calls I'd generally prefer if the initialization was moved out of the
declaration.

> +
> +	iomap_set_range_dirty(folio, iop, offset_in_folio(folio, folio_pos(folio)),

Another overly long line here.


* Re: [RFCv2 3/3] iomap: Support subpage size dirty tracking to improve write performance
  2023-01-30 16:14 ` [RFCv2 3/3] iomap: Support subpage size dirty tracking to improve write performance Ritesh Harjani (IBM)
  2023-01-30 17:16   ` Christoph Hellwig
@ 2023-01-30 17:54   ` Matthew Wilcox
  2023-01-30 20:34     ` Ritesh Harjani (IBM)
  1 sibling, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2023-01-30 17:54 UTC (permalink / raw)
  To: Ritesh Harjani (IBM); +Cc: linux-xfs, linux-fsdevel, Aravinda Herle

On Mon, Jan 30, 2023 at 09:44:13PM +0530, Ritesh Harjani (IBM) wrote:
> On 64k pagesize platforms (especially Power and/or aarch64) with 4k
> filesystem blocksize, this patch should improve write performance by
> writing back only the dirty subpage data.
> 
> This should also reduce the write amplification since we can now track
> subpage dirty status within state bitmaps. Earlier we had to
> write the entire 64k page even if only a part of it (e.g. 4k) was
> updated.
> 
> Performance testing of the below fio workload reveals a ~16x performance
> improvement on NVMe with XFS (4k blocksize) on Power (64k pagesize).
> FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.
> 
> <test_randwrite.fio>
> [global]
> 	ioengine=psync
> 	rw=randwrite
> 	overwrite=1
> 	pre_read=1
> 	direct=0
> 	bs=4k
> 	size=1G
> 	dir=./
> 	numjobs=8
> 	fdatasync=1
> 	runtime=60
> 	iodepth=64
> 	group_reporting=1
> 
> [fio-run]

You really need to include this sentence from the cover letter in this
patch:

> 2. Also, our internal performance team reported that this patch improves their
>    database workload performance by around ~83% (with XFS on Power).

because that's far more meaningful than "Look, I cooked up an artificial
workload where this makes a difference".


* Re: [RFCv2 3/3] iomap: Support subpage size dirty tracking to improve write performance
  2023-01-30 17:16   ` Christoph Hellwig
@ 2023-01-30 18:01     ` Matthew Wilcox
  2023-01-30 20:44       ` Ritesh Harjani (IBM)
  2023-01-30 20:27     ` Ritesh Harjani (IBM)
  1 sibling, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2023-01-30 18:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ritesh Harjani (IBM), linux-xfs, linux-fsdevel, Aravinda Herle

On Mon, Jan 30, 2023 at 09:16:33AM -0800, Christoph Hellwig wrote:
> > +		if (from_writeback && folio_test_uptodate(folio))
> > +			bitmap_fill(iop->state, 2 * nr_blocks);
> > +		else if (folio_test_uptodate(folio)) {
> 
> This code is very confusing.  First please only check
> folio_test_uptodate once, and then check the from_writeback flag
> inside the branch.  And as mentioned last time I think you really
> need some symbolic constants for dealing with dirty vs uptodate
> state and not just do a single fill for them.

And I don't think this 'from_writeback' argument is well-named.
Presumably it's needed because folio_test_dirty() will be false
at this point in the writeback path because it got cleared by the VFS?
But in any case, it should be called 'dirty' or something, not tell me
where the function was called from.  I think what this should really
do is:

		if (dirty)
			iop_set_dirty(iop, 0, nr_blocks);
		if (folio_test_uptodate(folio))
			iop_set_uptodate(iop, 0, nr_blocks);

> > +			unsigned start = offset_in_folio(folio,
> > +					folio_pos(folio)) >> inode->i_blkbits;
> > +			bitmap_set(iop->state, start, nr_blocks);
> 
> Also this code leaves my head scratching.  Unless I'm missing something
> important
> 
> 	 offset_in_folio(folio, folio_pos(folio))
> 
> must always return 0.

You are not missing anything.  I don't understand the mental process
that gets someone to write that.  It should logically be 0.

> Also the from_writeback logic is weird.  I'd rather have a
> "bool is_dirty" argument and then pass true for writeback beause
> we know the folio is dirty, false where we know it can't be
> dirty and do the folio_test_dirty in the caller where we don't
> know the state.

hahaha, you think the same.  ok, i'm leaving my above comment though ;-)



* Re: [RFCv2 0/3] iomap: Add support for subpage dirty state tracking to improve write performance
  2023-01-30 16:14 [RFCv2 0/3] iomap: Add support for subpage dirty state tracking to improve write performance Ritesh Harjani (IBM)
                   ` (2 preceding siblings ...)
  2023-01-30 16:14 ` [RFCv2 3/3] iomap: Support subpage size dirty tracking to improve write performance Ritesh Harjani (IBM)
@ 2023-01-30 18:10 ` Matthew Wilcox
  2023-01-30 21:01   ` Ritesh Harjani (IBM)
  2023-02-02  4:45 ` Ritesh Harjani (IBM)
  4 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2023-01-30 18:10 UTC (permalink / raw)
  To: Ritesh Harjani (IBM); +Cc: linux-xfs, linux-fsdevel, Aravinda Herle

On Mon, Jan 30, 2023 at 09:44:10PM +0530, Ritesh Harjani (IBM) wrote:
> TODOs
> ======
> 1. I still need to work on macros which we could declare and use for easy
>    reference to uptodate/dirty bits in iop->state[] bitmap (based on previous
>    review comments).

I'm not sure it was worth posting this series without doing this, tbh.

> 5. Address one of the other review comments: what happens with a large folio?
>    Can we limit the size of the bitmaps if the folio is too large, e.g. > 2MB?
> 
>    [RH] - I can start looking into this area too, if we think these patches
>    are looking good. My preference would be to work on TODOs 1-4 as part of
>    this patch series and take up the bitmap optimization as follow-up work.
>    Please do let me know your thoughts and suggestions on this.

I was hoping to push you towards investigating a better data structure
than a bitmap.  I know a bitmap solves your immediate problem since
there are only 16 4kB blocks in a 64kB page, but in a linear-read
scenario, XFS is going to create large folios on POWER machines, all
the way up to 16MB IIUC.  Whatever your PMD page size is.  So you're
going to be exposed to this in some scenarios, even if you're not seeing
them in your current testing.
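
(For scale: a 16MB folio of 4k blocks is 4096 blocks, so at two bits per
block that is already a 1kB bitmap attached to each such folio.)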


* Re: [RFCv2 1/3] iomap: Move creation of iomap_page early in __iomap_write_begin
  2023-01-30 17:02   ` Christoph Hellwig
@ 2023-01-30 20:21     ` Ritesh Harjani (IBM)
  2023-01-30 21:00       ` Matthew Wilcox
  0 siblings, 1 reply; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-01-30 20:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, Aravinda Herle

On 23/01/30 09:02AM, Christoph Hellwig wrote:
> On Mon, Jan 30, 2023 at 09:44:11PM +0530, Ritesh Harjani (IBM) wrote:
> > The problem is that commit[1] moved iop creation later, i.e. after checking
> > whether the folio is uptodate. If the folio is uptodate, it simply
> > returns and doesn't allocate an iop.
> > Now what can happen is that during __iomap_write_begin() for bs < ps,
> > there can be a folio which is marked uptodate but does not have an
> > iomap_page structure allocated.
> > (I think one of the reasons this can happen is memory pressure, under
> > which we can end up freeing the folio->private resource.)
> >
> > Thus the iop structure only gets allocated at writeback time in
> > iomap_writepage_map(). I think this was not a problem until now, since
> > we only track uptodate status in the iop (no support for tracking dirty
> > bitmap status, which later patches will add), and we also end up setting
> > all the bits in iomap_page_create() if the folio is uptodate.
>
> delayed iop allocation is a feature and not a bug.  We might have to
> refine the criteria for sub-page dirty tracking, but in general having
> the iop allocated is a memory and performance overhead and should be
> avoided as much as possible.  In fact I still have some unfinished
> work to allocate it even more lazily.

So, what I meant here was that commit[1] changed the behavior/functionality
without intending to. I agree it's not a bug.
But when I added dirty bitmap tracking support, I couldn't understand for
some time why we were allocating the iop only at writeback time.
And it was due to a small line change which somehow slipped into this commit [1].
Hence I made this a separate patch so that it doesn't slip through again
without getting noticed/reviewed.

Thanks for the info on the lazy allocation work. Yes, though it is not a bug,
with subpage dirty tracking in iop->state[], if we end up allocating the iop
only at writeback time, that might cause some performance degradation
compared to allocating the iop at ->write_begin() and marking the required
dirty bit ranges in ->write_end(), like we do in this patch series.
(Of course this is true only for the bs < ps case.)

[1]: https://lore.kernel.org/all/20220623175157.1715274-5-shr@fb.com/


Thanks again for your quick review!!
-ritesh



* Re: [RFCv2 3/3] iomap: Support subpage size dirty tracking to improve write performance
  2023-01-30 17:16   ` Christoph Hellwig
  2023-01-30 18:01     ` Matthew Wilcox
@ 2023-01-30 20:27     ` Ritesh Harjani (IBM)
  1 sibling, 0 replies; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-01-30 20:27 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel, Aravinda Herle

On 23/01/30 09:16AM, Christoph Hellwig wrote:
> On Mon, Jan 30, 2023 at 09:44:13PM +0530, Ritesh Harjani (IBM) wrote:
> > +iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags,
> > +		  bool from_writeback)
> >  {
> >  	struct iomap_page *iop = to_iomap_page(folio);
> >  	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> > @@ -58,12 +59,32 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
> >  	else
> >  		gfp = GFP_NOFS | __GFP_NOFAIL;
> >
> > -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> > +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
> >  		      gfp);
> >  	if (iop) {
>
> Please just return early here for the allocation failure case instead of
> adding a lot of code with extra indentation.

Sure. Will do that.


>
> >  		spin_lock_init(&iop->state_lock);
> > -		if (folio_test_uptodate(folio))
> > -			bitmap_fill(iop->state, nr_blocks);
> > +		/*
> > +		 * iomap_page_create can get called from writeback after
> > +		 * a truncate_inode_partial_folio operation on a large folio.
> > +		 * For large folio the iop structure is freed in
> > +		 * iomap_invalidate_folio() to ensure we can split the folio.
> > +		 * That means we will have to let go of the optimization of
> > +		 * tracking dirty bits here and set all bits as dirty if
> > +		 * the folio is marked uptodate.
> > +		 */
> > +		if (from_writeback && folio_test_uptodate(folio))
> > +			bitmap_fill(iop->state, 2 * nr_blocks);
> > +		else if (folio_test_uptodate(folio)) {
>
> This code is very confusing.  First please only check
> folio_test_uptodate once, and then check the from_writeback flag
> inside the branch.  And as mentioned last time I think you really

Ok, sure. I will try and simplify it.


> need some symbolic constants for dealing with dirty vs uptodate
> state and not just do a single fill for them.

Yes, I agree. Sorry my bad. I will add that change in the next rev.


>
> > +			unsigned start = offset_in_folio(folio,
> > +					folio_pos(folio)) >> inode->i_blkbits;
> > +			bitmap_set(iop->state, start, nr_blocks);
>
> Also this code leaves my head scratching.  Unless I'm missing something
> important
>
> 	 offset_in_folio(folio, folio_pos(folio))
>
> must always return 0.

That is always true, yes. In the next rev, I can make it explicitly 0
(maybe with just a small comment if required).

>
> Also the from_writeback logic is weird.  I'd rather have a
> "bool is_dirty" argument and then pass true for writeback beause
> we know the folio is dirty, false where we know it can't be
> dirty and do the folio_test_dirty in the caller where we don't
> know the state.

Agreed. Thanks.


>
> > +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> > +{
> > +	unsigned int nr_blocks = i_blocks_per_folio(mapping->host, folio);
> > +	struct iomap_page *iop = iomap_page_create(mapping->host, folio, 0, false);
>
> Please avoid the overly long line.  In fact with such long function
> calls I'd generally prefer if the initialization was moved out of the
> declaration.

Sure, will do the same.

>
> > +
> > +	iomap_set_range_dirty(folio, iop, offset_in_folio(folio, folio_pos(folio)),
>
> Another overly long line here.

sure. Will do the same.

-ritesh


* Re: [RFCv2 3/3] iomap: Support subpage size dirty tracking to improve write performance
  2023-01-30 17:54   ` Matthew Wilcox
@ 2023-01-30 20:34     ` Ritesh Harjani (IBM)
  0 siblings, 0 replies; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-01-30 20:34 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs, linux-fsdevel, Aravinda Herle

On 23/01/30 05:54PM, Matthew Wilcox wrote:
> On Mon, Jan 30, 2023 at 09:44:13PM +0530, Ritesh Harjani (IBM) wrote:
> > On 64k pagesize platforms (especially Power and/or aarch64) with 4k
> > filesystem blocksize, this patch should improve write performance by
> > writing back only the dirty subpage data.
> >
> > This should also reduce the write amplification since we can now track
> > subpage dirty status within state bitmaps. Earlier we had to
> > write the entire 64k page even if only a part of it (e.g. 4k) was
> > updated.
> >
> > Performance testing of the below fio workload reveals a ~16x performance
> > improvement on NVMe with XFS (4k blocksize) on Power (64k pagesize).
> > FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.
> >
> > <test_randwrite.fio>
> > [global]
> > 	ioengine=psync
> > 	rw=randwrite
> > 	overwrite=1
> > 	pre_read=1
> > 	direct=0
> > 	bs=4k
> > 	size=1G
> > 	dir=./
> > 	numjobs=8
> > 	fdatasync=1
> > 	runtime=60
> > 	iodepth=64
> > 	group_reporting=1
> >
> > [fio-run]
>
> You really need to include this sentence from the cover letter in this
> patch:
>
> 2. Also, our internal performance team reported that this patch improves their
>    database workload performance by around ~83% (with XFS on Power).
>
> because that's far more meaningful than "Look, I cooked up an artificial
> workload where this makes a difference".

Agreed. I will add those other lines to the commit message too.

The intention behind adding the fio workload is for others to have a test
case to verify against and/or to provide more info when someone later refers
to the commit message.
One of the interesting observations with this synthetic fio workload was that
one can/should easily observe the theoretical performance gain of around ~16x
(i.e. 64k(ps) / 4k(bs)).

Thanks again for the review!
-ritesh


* Re: [RFCv2 3/3] iomap: Support subpage size dirty tracking to improve write performance
  2023-01-30 18:01     ` Matthew Wilcox
@ 2023-01-30 20:44       ` Ritesh Harjani (IBM)
  0 siblings, 0 replies; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-01-30 20:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Aravinda Herle

On 23/01/30 06:01PM, Matthew Wilcox wrote:
> On Mon, Jan 30, 2023 at 09:16:33AM -0800, Christoph Hellwig wrote:
> > > +		if (from_writeback && folio_test_uptodate(folio))
> > > +			bitmap_fill(iop->state, 2 * nr_blocks);
> > > +		else if (folio_test_uptodate(folio)) {
> >
> > This code is very confusing.  First please only check
> > folio_test_uptodate once, and then check the from_writeback flag
> > inside the branch.  And as mentioned last time I think you really
> > need some symbolic constants for dealing with dirty vs uptodate
> > state and not just do a single fill for them.
>
> And I don't think this 'from_writeback' argument is well-named.
> Presumably it's needed because folio_test_dirty() will be false
> at this point in the writeback path because it got cleared by the VFS?

Yes, folio_test_dirty() is false. We clear it in write_cache_pages() by calling
clear_page_dirty_for_io() before calling iomap_do_writepage().

> But in any case, it should be called 'dirty' or something, not tell me
> where the function was called from.  I think what this should really
> do is:
>
> 		if (dirty)
> 			iop_set_dirty(iop, 0, nr_blocks);
> 		if (folio_test_uptodate(folio))
> 			iop_set_uptodate(iop, 0, nr_blocks);

Sure, I got the idea. I will use "bool is_dirty".

>
> > > +			unsigned start = offset_in_folio(folio,
> > > +					folio_pos(folio)) >> inode->i_blkbits;
> > > +			bitmap_set(iop->state, start, nr_blocks);
> >
> > Also this code leaves my head scratching.  Unless I'm missing something
> > important
> >
> > 	 offset_in_folio(folio, folio_pos(folio))
> >
> > must always return 0.
>
> You are not missing anything.  I don't understand the mental process
> that gets someone to write that.  It should logically be 0.

Sorry about the confusion. Yes, that is correct. I ended up using the above
at some places and 0 at others, and for the final cleanup I ended up using
the above call everywhere.

I will correct it in the next rev.

>
> > Also the from_writeback logic is weird.  I'd rather have a
> > "bool is_dirty" argument and then pass true for writeback beause
> > we know the folio is dirty, false where we know it can't be
> > dirty and do the folio_test_dirty in the caller where we don't
> > know the state.
>
> hahaha, you think the same.  ok, i'm leaving my above comment though ;-)
>
No problem ;)

Thanks for your review!!
-ritesh



* Re: [RFCv2 1/3] iomap: Move creation of iomap_page early in __iomap_write_begin
  2023-01-30 20:21     ` Ritesh Harjani (IBM)
@ 2023-01-30 21:00       ` Matthew Wilcox
  2023-01-31 18:37         ` Ritesh Harjani (IBM)
  0 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2023-01-30 21:00 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Aravinda Herle

On Tue, Jan 31, 2023 at 01:51:50AM +0530, Ritesh Harjani (IBM) wrote:
> > > Thus the iop structure only gets allocated at writeback time in
> > > iomap_writepage_map(). I think this was not a problem until now, since
> > > we only track uptodate status in the iop (no support for tracking dirty
> > > bitmap status, which later patches will add), and we also end up setting
> > > all the bits in iomap_page_create() if the folio is uptodate.
> >
> > delayed iop allocation is a feature and not a bug.  We might have to
> > refine the criteria for sub-page dirty tracking, but in general having
> > the iop allocated is a memory and performance overhead and should be
> > avoided as much as possible.  In fact I still have some unfinished
> > work to allocate it even more lazily.
> 
> So, what I meant here was that commit[1] changed the behavior/functionality
> without intending to. I agree it's not a bug.

It didn't change the behaviour or functionality.  It broke your patches,
but it certainly doesn't deserve its own commit reverting it -- because
it's not wrong.

> But when I added dirty bitmap tracking support, I couldn't understand for
> some time why we were allocating the iop only at writeback time.
> And it was due to a small line change which somehow slipped into this commit [1].
> Hence I made this a separate patch so that it doesn't slip through again
> without getting noticed/reviewed.

It didn't "slip through".  It was intended.

> Thanks for the info on the lazy allocation work. Yes, though it is not a bug,
> with subpage dirty tracking in iop->state[], if we end up allocating the iop
> only at writeback time, that might cause some performance degradation
> compared to allocating the iop at ->write_begin() and marking the required
> dirty bit ranges in ->write_end(), like we do in this patch series.
> (Of course this is true only for the bs < ps case.)
> 
> [1]: https://lore.kernel.org/all/20220623175157.1715274-5-shr@fb.com/

You absolutely can allocate it in iomap_write_begin, but you can avoid
allocating it until writeback time if (pos, len) entirely overlap the
folio.  ie:

	if (pos > folio_pos(folio) ||
	    pos + len < folio_pos(folio) + folio_size(folio))
		iop = iomap_page_create(iter->inode, folio, iter->flags, false);


* Re: [RFCv2 0/3] iomap: Add support for subpage dirty state tracking to improve write performance
  2023-01-30 18:10 ` [RFCv2 0/3] iomap: Add support for subpage dirty state " Matthew Wilcox
@ 2023-01-30 21:01   ` Ritesh Harjani (IBM)
  0 siblings, 0 replies; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-01-30 21:01 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs, linux-fsdevel, Aravinda Herle

On 23/01/30 06:10PM, Matthew Wilcox wrote:
> On Mon, Jan 30, 2023 at 09:44:10PM +0530, Ritesh Harjani (IBM) wrote:
> > TODOs
> > ======
> > 1. I still need to work on macros which we could declare and use for easy
> >    reference to uptodate/dirty bits in iop->state[] bitmap (based on previous
> >    review comments).
>
> I'm not sure it was worth posting this series without doing this, tbh.

Really sorry about that. These patches contained functionality changes which
were missing from the last series, i.e. marking the bits dirty when the folio
is marked dirty (which you had pointed out), along with one other change which
I mentioned in the cover letter. So I thought of pushing these changes out to
get some early review.

Sure, I will definitely work on it and will push out the next rev with these
changes included.

>
> > 5. Address one of the other review comments: what happens with a large folio?
> >    Can we limit the size of the bitmaps if the folio is too large, e.g. > 2MB?
> >
> >    [RH] - I can start looking into this area too, if we think these patches
> >    are looking good. My preference would be to work on TODOs 1-4 as part of
> >    this patch series and take up the bitmap optimization as follow-up work.
> >    Please do let me know your thoughts and suggestions on this.
>
> I was hoping to push you towards investigating a better data structure
> than a bitmap. I know a bitmap solves your immediate problem since
> there are only 16 4kB blocks in a 64kB page, but in a linear-read
> scenario, XFS is going to create large folios on POWER machines, all
> the way up to 16MB IIUC.  Whatever your PMD page size is.  So you're
> going to be exposed to this in some scenarios, even if you're not seeing
> them in your current testing.

Got it!! Let me come back on this after giving it some more thought.

-ritesh


* Re: [RFCv2 2/3] iomap: Change uptodate variable name to state
  2023-01-30 16:14 ` [RFCv2 2/3] iomap: Change uptodate variable name to state Ritesh Harjani (IBM)
@ 2023-01-30 21:56   ` Dave Chinner
  2023-01-30 22:24     ` Matthew Wilcox
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2023-01-30 21:56 UTC (permalink / raw)
  To: Ritesh Harjani (IBM); +Cc: linux-xfs, linux-fsdevel, Aravinda Herle

On Mon, Jan 30, 2023 at 09:44:12PM +0530, Ritesh Harjani (IBM) wrote:
> This patch just changes the struct iomap_page uptodate & uptodate_lock
> member names to state and state_lock to better reflect their purpose for
> the upcoming patch.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/iomap/buffered-io.c | 30 +++++++++++++++---------------
>  1 file changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e9c85fcf7a1f..faee2852db8f 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -25,13 +25,13 @@
>  
>  /*
>   * Structure allocated for each folio when block size < folio size
> - * to track sub-folio uptodate status and I/O completions.
> + * to track sub-folio uptodate state and I/O completions.
>   */
>  struct iomap_page {
>  	atomic_t		read_bytes_pending;
>  	atomic_t		write_bytes_pending;
> -	spinlock_t		uptodate_lock;
> -	unsigned long		uptodate[];
> +	spinlock_t		state_lock;
> +	unsigned long		state[];

I don't really like this change, nor the followup in the next patch
that puts two different state bits somewhere inside a single bitmap.

>  };
>  
>  static inline struct iomap_page *to_iomap_page(struct folio *folio)
> @@ -58,12 +58,12 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
>  	else
>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>  
> -	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
>  		      gfp);
>  	if (iop) {
> -		spin_lock_init(&iop->uptodate_lock);
> +		spin_lock_init(&iop->state_lock);
>  		if (folio_test_uptodate(folio))
> -			bitmap_fill(iop->uptodate, nr_blocks);
> +			bitmap_fill(iop->state, nr_blocks);

This is the reason I don't like it - we lose the self-documenting
aspect of the code. bitmap_fill(iop->uptodate, nr_blocks) is
obviously correct, the new version isn't because "state" has no
obvious meaning, and it only gets worse in the next patch where
state is changed to have a magic "2 * nr_blocks" length and multiple
state bits per block.

Having an obvious setup where there are two bitmaps, one for dirty
and one for uptodate, and the same bit in each bitmap corresponds to
the state for that sub-block region, it is easy to see that the code
is operating on the correct bit, to look at the bitmap and see what
bits are set, to compare uptodate and dirty bitmaps side by side,
etc. It's a much easier setup to read, code correctly, analyse and
debug than putting multiple state bits in the same bitmap array at
different indexes.

If you are trying to keep this down to a single allocation by using
a single bitmap of undefined length, then change the declaration and
the structure size calculation away from using array notation and
instead just use pointers to the individual bitmap regions within
the allocated region.
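
Something like this completely untested sketch (field names purely
illustrative):

struct iomap_page {
	atomic_t		read_bytes_pending;
	atomic_t		write_bytes_pending;
	spinlock_t		state_lock;
	unsigned long		*uptodate;	/* points into bitmap[] */
	unsigned long		*dirty;		/* points into bitmap[] */
	unsigned long		bitmap[];	/* one allocation, two regions */
};

	/* at allocation time: */
	iop = kzalloc(struct_size(iop, bitmap,
				  2 * BITS_TO_LONGS(nr_blocks)), gfp);
	iop->uptodate = iop->bitmap;
	iop->dirty = iop->bitmap + BITS_TO_LONGS(nr_blocks);

Then bitmap_fill(iop->uptodate, nr_blocks) and bitmap_set(iop->dirty, ...)
read exactly the way the uptodate-only code does today.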

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [RFCv2 2/3] iomap: Change uptodate variable name to state
  2023-01-30 21:56   ` Dave Chinner
@ 2023-01-30 22:24     ` Matthew Wilcox
  2023-01-31 15:07       ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2023-01-30 22:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ritesh Harjani (IBM), linux-xfs, linux-fsdevel, Aravinda Herle

On Tue, Jan 31, 2023 at 08:56:23AM +1100, Dave Chinner wrote:
> > +	spinlock_t		state_lock;
> > +	unsigned long		state[];
> 
> I don't really like this change, nor the followup in the next patch
> that puts two different state bits somewhere inside a single bitmap.

I think that's due to the open-coding of accesses to those bits,
which is why I questioned the wisdom of sending out this patchset
without having introduced the accessors.

> This is the reason I don't like it - we lose the self-documenting
> aspect of the code. bitmap_fill(iop->uptodate, nr_blocks) is
> obviously correct, the new version isn't because "state" has no
> obvious meaning, and it only gets worse in the next patch where
> state is changed to have a magic "2 * nr_blocks" length and multiple
> state bits per block.

Completely agreed.

> With an obvious setup where there are two bitmaps, one for dirty
> and one for uptodate, and the same bit in each bitmap corresponding
> to the state of that sub-block region, it is easy to see that the code
> is operating on the correct bit, to look at the bitmap and see what
> bits are set, to compare uptodate and dirty bitmaps side by side,
> etc. It's a much easier setup to read, code correctly, analyse and
> debug than putting multiple state bits in the same bitmap array at
> different indexes.
> 
> If you are trying to keep this down to a single allocation by using
> a single bitmap of undefined length, then change the declaration and
> the structure size calculation away from using array notation and
> instead just use pointers to the individual bitmap regions within
> the allocated region.

Hard to stomach that solution when the bitmap is usually 2 bytes long
(in Ritesh's case).  Let's see a version of this patchset with
accessors before rendering judgement.

Although to my mind, we still want a solution that scales beyond
a bitmap.  But a proper set of accessors will abstract that away.
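
For example (purely illustrative), if callers only ever write

	if (iop_test_dirty(iop, block, nrblocks))
		...

then whether that tests a bit in one array, a bit in a second array, or
consults something else entirely stays private to the accessor.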

* Re: [RFCv2 2/3] iomap: Change uptodate variable name to state
  2023-01-30 22:24     ` Matthew Wilcox
@ 2023-01-31 15:07       ` Christoph Hellwig
  2023-01-31 18:05         ` Ritesh Harjani (IBM)
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2023-01-31 15:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Ritesh Harjani (IBM),
	linux-xfs, linux-fsdevel, Aravinda Herle

On Mon, Jan 30, 2023 at 10:24:59PM +0000, Matthew Wilcox wrote:
> > a single bitmap of undefined length, then change the declaration and
> > the structure size calculation away from using array notation and
> > instead just use pointers to the individual bitmap regions within
> > the allocated region.
> 
> Hard to stomach that solution when the bitmap is usually 2 bytes long
> (in Ritesh's case).  Let's see a version of this patchset with
> accessors before rendering judgement.

Yes.  I think what we need is proper helpers that are self-documenting
for every bitmap update, as I already suggested last round.  That keeps
the efficient allocation, and keeps all the users self-documenting.
It just adds a bit of boilerplate for all these helpers, but that
should be worth it for the clarity and performance benefits.

* Re: [RFCv2 2/3] iomap: Change uptodate variable name to state
  2023-01-31 15:07       ` Christoph Hellwig
@ 2023-01-31 18:05         ` Ritesh Harjani (IBM)
  0 siblings, 0 replies; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-01-31 18:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Dave Chinner, linux-xfs, linux-fsdevel, Aravinda Herle

On 23/01/31 07:07AM, Christoph Hellwig wrote:
> On Mon, Jan 30, 2023 at 10:24:59PM +0000, Matthew Wilcox wrote:
> > > a single bitmap of undefined length, then change the declaration and
> > > the structure size calculation away from using array notation and
> > > instead just use pointers to the individual bitmap regions within
> > > the allocated region.
> >
> > Hard to stomach that solution when the bitmap is usually 2 bytes long
> > (in Ritesh's case).  Let's see a version of this patchset with
> > accessors before rendering judgement.
>
> Yes.  I think what we need is proper helpers that are self-documenting
> for every bitmap update, as I already suggested last round.  That keeps
> the efficient allocation, and keeps all the users self-documenting.
> It just adds a bit of boilerplate for all these helpers, but that
> should be worth it for the clarity and performance benefits.

Do these accessor APIs look good to become part of fs/iomap/buffered-io.c?
The rest of the changes will then just use them to set/clear/test the
uptodate/dirty bits of the iop->state bitmap.
I think that, with these APIs in place, there shouldn't be any place left
where we directly manipulate the iop->state bitmap, and they are all that
the current changeset requires.

Please let me know your thoughts/suggestions on these.
If they look good, I can fold them into the individual patches which
implement the uptodate/dirty functionality, and send RFCv3 along with the
other review comments addressed.

/*
 * Accessor functions for setting/clearing/checking uptodate and
 * dirty bits in the iop->state bitmap. Bits [0, nrblocks) track
 * per-block uptodate state and bits [nrblocks, 2 * nrblocks) track
 * per-block dirty state. nrblocks is i_blocks_per_folio(), which is
 * passed to every function as the last argument for API consistency.
 */
static inline void iop_set_range_uptodate(struct iomap_page *iop,
                                unsigned int start, unsigned int len,
                                unsigned int nrblocks)
{
        bitmap_set(iop->state, start, len);
}

static inline void iop_clear_range_uptodate(struct iomap_page *iop,
                                unsigned int start, unsigned int len,
                                unsigned int nrblocks)
{
        bitmap_clear(iop->state, start, len);
}

static inline bool iop_test_uptodate(struct iomap_page *iop, unsigned int pos,
                                unsigned int nrblocks)
{
        return test_bit(pos, iop->state);
}

static inline bool iop_full_uptodate(struct iomap_page *iop,
                                unsigned int nrblocks)
{
        return bitmap_full(iop->state, nrblocks);
}

static inline void iop_set_range_dirty(struct iomap_page *iop,
                                unsigned int start, unsigned int len,
                                unsigned int nrblocks)
{
        bitmap_set(iop->state, start + nrblocks, len);
}

static inline void iop_clear_range_dirty(struct iomap_page *iop,
                                unsigned int start, unsigned int len,
                                unsigned int nrblocks)
{
        bitmap_clear(iop->state, start + nrblocks, len);
}

static inline bool iop_test_dirty(struct iomap_page *iop, unsigned int pos,
                             unsigned int nrblocks)
{
        return test_bit(pos + nrblocks, iop->state);
}
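
A caller would then look roughly like this (an illustrative sketch only;
the helper name and the exact range math are made up here, not part of
the series):

/* Sketch: mark the blocks covering (off, len) dirty under state_lock. */
static void iomap_folio_set_range_dirty(struct folio *folio,
                                size_t off, size_t len)
{
        struct inode *inode = folio->mapping->host;
        struct iomap_page *iop = to_iomap_page(folio);
        unsigned int nrblocks = i_blocks_per_folio(inode, folio);
        unsigned int first_blk = off >> inode->i_blkbits;
        unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
        unsigned long flags;

        if (!iop)
                return;
        spin_lock_irqsave(&iop->state_lock, flags);
        iop_set_range_dirty(iop, first_blk, last_blk - first_blk + 1,
                            nrblocks);
        spin_unlock_irqrestore(&iop->state_lock, flags);
}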

-ritesh

* Re: [RFCv2 1/3] iomap: Move creation of iomap_page early in __iomap_write_begin
  2023-01-30 21:00       ` Matthew Wilcox
@ 2023-01-31 18:37         ` Ritesh Harjani (IBM)
  2023-01-31 18:48           ` Matthew Wilcox
  0 siblings, 1 reply; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-01-31 18:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Aravinda Herle

On 23/01/30 09:00PM, Matthew Wilcox wrote:
> On Tue, Jan 31, 2023 at 01:51:50AM +0530, Ritesh Harjani (IBM) wrote:
> > > > Thus the iop structure only gets allocated at the time of writeback
> > > > in iomap_writepage_map(). This, I think, was not a problem till now since
> > > > we anyway only track uptodate status in iop (no support of tracking
> > > > dirty bitmap status which later patches will add), and we also end up
> > > > setting all the bits in iomap_page_create(), if the page is uptodate.
> > >
> > > delayed iop allocation is a feature and not a bug.  We might have to
> > > refine the criteria for sub-page dirty tracking, but in general having
> > > the iop allocated is a memory and performance overhead and should be
> > > avoided as much as possible.  In fact I still have some unfinished
> > > work to allocate it even more lazily.
> >
> > So, what I meant here was that the commit[1] changed the behaviour/functionality
> > without intending to. I agree it's not a bug.
>
> It didn't change the behaviour or functionality.  It broke your patches,
> but it certainly doesn't deserve its own commit reverting it -- because
> it's not wrong.
>
> > But when I added dirty bitmap tracking support, I couldn't understand for
> > some time why we were allocating the iop only at the time of writeback.
> > And it was due to a small line change which somehow slipped into this commit [1].
> > Hence I made this a separate patch so that it doesn't slip through again w/o
> > getting noticed/reviewed.
>
> It didn't "slip through".  It was intended.
>
> > Thanks for the info on the lazy allocation work. Yes, though it is not a bug,
> > with subpage dirty tracking in iop->state[], allocating the iop only at the
> > time of writeback might cause some performance degradation compared to
> > allocating the iop at ->write_begin() and marking the required dirty bit
> > ranges in ->write_end(), like how we do it in this patch series.
> > (Of course this is true only for the bs < ps case.)
> >
> > [1]: https://lore.kernel.org/all/20220623175157.1715274-5-shr@fb.com/
>
> You absolutely can allocate it in iomap_write_begin, but you can avoid
> allocating it until writeback time if (pos, len) entirely overlap the
> folio.  ie:
>
> 	if (pos > folio_pos(folio) ||
> 	    pos + len < folio_pos(folio) + folio_size(folio))
> 		iop = iomap_page_create(iter->inode, folio, iter->flags, false);

Thanks for the suggestion. However, do you think it would be better if this
were introduced along with the lazy allocation changes which Christoph was
mentioning?
My thinking is that with the above approach we delay the allocation of the
iop until writeback for the entire-folio overlap case. But then later, in
__iomap_write_begin(), we require an iop if the folio is not uptodate.
Hence we would again have to check whether the iop is still unallocated
and, if so, allocate it there (which is exactly the entire-folio overlap case).
That looked like overkill for very little gain in the context of this
patch series. Kindly let me know your thoughts on this.
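
(Roughly, what I mean is that __iomap_write_begin() would then need a
second allocation site, something like this untested sketch:

	if (!iop)
		iop = iomap_page_create(iter->inode, folio, iter->flags);

i.e. the same allocation duplicated just to cover the entire-folio
overlap case that the earlier check skipped.)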

-ritesh

* Re: [RFCv2 1/3] iomap: Move creation of iomap_page early in __iomap_write_begin
  2023-01-31 18:37         ` Ritesh Harjani (IBM)
@ 2023-01-31 18:48           ` Matthew Wilcox
  2023-01-31 20:00             ` Ritesh Harjani (IBM)
  0 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2023-01-31 18:48 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Aravinda Herle

On Wed, Feb 01, 2023 at 12:07:25AM +0530, Ritesh Harjani (IBM) wrote:
> On 23/01/30 09:00PM, Matthew Wilcox wrote:
> > On Tue, Jan 31, 2023 at 01:51:50AM +0530, Ritesh Harjani (IBM) wrote:
> > > > > Thus the iop structure only gets allocated at the time of writeback
> > > > > in iomap_writepage_map(). This, I think, was not a problem till now since
> > > > > we anyway only track uptodate status in iop (no support of tracking
> > > > > dirty bitmap status which later patches will add), and we also end up
> > > > > setting all the bits in iomap_page_create(), if the page is uptodate.
> > > >
> > > > delayed iop allocation is a feature and not a bug.  We might have to
> > > > refine the criteria for sub-page dirty tracking, but in general having
> > > > the iop allocated is a memory and performance overhead and should be
> > > > avoided as much as possible.  In fact I still have some unfinished
> > > > work to allocate it even more lazily.
> > >
> > > So, what I meant here was that the commit[1] changed the behaviour/functionality
> > > without intending to. I agree it's not a bug.
> >
> > It didn't change the behaviour or functionality.  It broke your patches,
> > but it certainly doesn't deserve its own commit reverting it -- because
> > it's not wrong.
> >
> > > But when I added dirty bitmap tracking support, I couldn't understand for
> > > some time why we were allocating the iop only at the time of writeback.
> > > And it was due to a small line change which somehow slipped into this commit [1].
> > > Hence I made this a separate patch so that it doesn't slip through again w/o
> > > getting noticed/reviewed.
> >
> > It didn't "slip through".  It was intended.
> >
> > > Thanks for the info on the lazy allocation work. Yes, though it is not a bug,
> > > with subpage dirty tracking in iop->state[], allocating the iop only at the
> > > time of writeback might cause some performance degradation compared to
> > > allocating the iop at ->write_begin() and marking the required dirty bit
> > > ranges in ->write_end(), like how we do it in this patch series.
> > > (Of course this is true only for the bs < ps case.)
> > >
> > > [1]: https://lore.kernel.org/all/20220623175157.1715274-5-shr@fb.com/
> >
> > You absolutely can allocate it in iomap_write_begin, but you can avoid
> > allocating it until writeback time if (pos, len) entirely overlap the
> > folio.  ie:
> >
> > 	if (pos > folio_pos(folio) ||
> > 	    pos + len < folio_pos(folio) + folio_size(folio))
> > 		iop = iomap_page_create(iter->inode, folio, iter->flags, false);
> 
> Thanks for the suggestion. However, do you think it would be better if this
> were introduced along with the lazy allocation changes which Christoph was
> mentioning?
> My thinking is that with the above approach we delay the allocation of the
> iop until writeback for the entire-folio overlap case. But then later, in
> __iomap_write_begin(), we require an iop if the folio is not uptodate.
> Hence we would again have to check whether the iop is still unallocated
> and, if so, allocate it there (which is exactly the entire-folio overlap case).
> That looked like overkill for very little gain in the context of this
> patch series. Kindly let me know your thoughts on this.

Look at *why* __iomap_write_begin() allocates an iop.  It's to read in the
blocks which are going to be partially-overwritten by the write.  If the
write overlaps the entire folio, there are no parts which need to be read
in, and we can simply return.  Maybe we should make that more obvious:

	if (folio_test_uptodate(folio))
		return 0;
	if (pos <= folio_pos(folio) &&
	    pos + len >= folio_pos(folio) + folio_size(folio))
		return 0;
	folio_clear_error(folio);

(I think pos must always be >= folio_pos(), so that <= could be ==, but
it doesn't hurt anything to use <=)

* Re: [RFCv2 1/3] iomap: Move creation of iomap_page early in __iomap_write_begin
  2023-01-31 18:48           ` Matthew Wilcox
@ 2023-01-31 20:00             ` Ritesh Harjani (IBM)
  0 siblings, 0 replies; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-01-31 20:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Aravinda Herle

On 23/01/31 06:48PM, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 12:07:25AM +0530, Ritesh Harjani (IBM) wrote:
> > On 23/01/30 09:00PM, Matthew Wilcox wrote:
> > > On Tue, Jan 31, 2023 at 01:51:50AM +0530, Ritesh Harjani (IBM) wrote:
> > > > > > Thus the iop structure only gets allocated at the time of writeback
> > > > > > in iomap_writepage_map(). This, I think, was not a problem till now since
> > > > > > we anyway only track uptodate status in iop (no support of tracking
> > > > > > dirty bitmap status which later patches will add), and we also end up
> > > > > > setting all the bits in iomap_page_create(), if the page is uptodate.
> > > > >
> > > > > delayed iop allocation is a feature and not a bug.  We might have to
> > > > > refine the criteria for sub-page dirty tracking, but in general having
> > > > > the iop allocated is a memory and performance overhead and should be
> > > > > avoided as much as possible.  In fact I still have some unfinished
> > > > > work to allocate it even more lazily.
> > > >
> > > > So, what I meant here was that the commit[1] changed the behaviour/functionality
> > > > without intending to. I agree it's not a bug.
> > >
> > > It didn't change the behaviour or functionality.  It broke your patches,
> > > but it certainly doesn't deserve its own commit reverting it -- because
> > > it's not wrong.
> > >
> > > > But when I added dirty bitmap tracking support, I couldn't understand for
> > > > some time why we were allocating the iop only at the time of writeback.
> > > > And it was due to a small line change which somehow slipped into this commit [1].
> > > > Hence I made this a separate patch so that it doesn't slip through again w/o
> > > > getting noticed/reviewed.
> > >
> > > It didn't "slip through".  It was intended.
> > >
> > > > Thanks for the info on the lazy allocation work. Yes, though it is not a bug,
> > > > with subpage dirty tracking in iop->state[], allocating the iop only at the
> > > > time of writeback might cause some performance degradation compared to
> > > > allocating the iop at ->write_begin() and marking the required dirty bit
> > > > ranges in ->write_end(), like how we do it in this patch series.
> > > > (Of course this is true only for the bs < ps case.)
> > > >
> > > > [1]: https://lore.kernel.org/all/20220623175157.1715274-5-shr@fb.com/
> > >
> > > You absolutely can allocate it in iomap_write_begin, but you can avoid
> > > allocating it until writeback time if (pos, len) entirely overlap the
> > > folio.  ie:
> > >
> > > 	if (pos > folio_pos(folio) ||
> > > 	    pos + len < folio_pos(folio) + folio_size(folio))
> > > 		iop = iomap_page_create(iter->inode, folio, iter->flags, false);
> >
> > Thanks for the suggestion. However, do you think it would be better if this
> > were introduced along with the lazy allocation changes which Christoph was
> > mentioning?
> > My thinking is that with the above approach we delay the allocation of the
> > iop until writeback for the entire-folio overlap case. But then later, in
> > __iomap_write_begin(), we require an iop if the folio is not uptodate.
> > Hence we would again have to check whether the iop is still unallocated
> > and, if so, allocate it there (which is exactly the entire-folio overlap case).
> > That looked like overkill for very little gain in the context of this
> > patch series. Kindly let me know your thoughts on this.
>
> Look at *why* __iomap_write_begin() allocates an iop.  It's to read in the
> blocks which are going to be partially-overwritten by the write.  If the
> write overlaps the entire folio, there are no parts which need to be read
> in, and we can simply return.

Yes, that makes sense.

> Maybe we should make that more obvious:

Yes, I think this may be required, because otherwise we might end up using
an uninitialized iop; generic/156 (funshare) can easily trigger that.
I will spend some time on the unshare path of iomap.

>
> 	if (folio_test_uptodate(folio))
> 		return 0;
> 	if (pos <= folio_pos(folio) &&
> 	    pos + len >= folio_pos(folio) + folio_size(folio))
> 		return 0;
> 	folio_clear_error(folio);
>
> (I think pos must always be >= folio_pos(), so that <= could be ==, but
> it doesn't hurt anything to use <=)

Thanks for sharing this.

-ritesh

* Re: [RFCv2 0/3] iomap: Add support for subpage dirty state tracking to improve write performance
  2023-01-30 16:14 [RFCv2 0/3] iomap: Add support for subpage dirty state tracking to improve write performance Ritesh Harjani (IBM)
                   ` (3 preceding siblings ...)
  2023-01-30 18:10 ` [RFCv2 0/3] iomap: Add support for subpage dirty state " Matthew Wilcox
@ 2023-02-02  4:45 ` Ritesh Harjani (IBM)
  4 siblings, 0 replies; 23+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-02-02  4:45 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel; +Cc: Aravinda Herle

Hello All,

Thanks for the review and feedback on the patch series.
I am travelling starting today. If I get some time in between, I will try to
work on the rest of the review comments. Once I am back next week, I will
prepare and send out v3 with the review comments addressed.

Thanks
-ritesh

end of thread

Thread overview: 23+ messages
2023-01-30 16:14 [RFCv2 0/3] iomap: Add support for subpage dirty state tracking to improve write performance Ritesh Harjani (IBM)
2023-01-30 16:14 ` [RFCv2 1/3] iomap: Move creation of iomap_page early in __iomap_write_begin Ritesh Harjani (IBM)
2023-01-30 17:02   ` Christoph Hellwig
2023-01-30 20:21     ` Ritesh Harjani (IBM)
2023-01-30 21:00       ` Matthew Wilcox
2023-01-31 18:37         ` Ritesh Harjani (IBM)
2023-01-31 18:48           ` Matthew Wilcox
2023-01-31 20:00             ` Ritesh Harjani (IBM)
2023-01-30 16:14 ` [RFCv2 2/3] iomap: Change uptodate variable name to state Ritesh Harjani (IBM)
2023-01-30 21:56   ` Dave Chinner
2023-01-30 22:24     ` Matthew Wilcox
2023-01-31 15:07       ` Christoph Hellwig
2023-01-31 18:05         ` Ritesh Harjani (IBM)
2023-01-30 16:14 ` [RFCv2 3/3] iomap: Support subpage size dirty tracking to improve write performance Ritesh Harjani (IBM)
2023-01-30 17:16   ` Christoph Hellwig
2023-01-30 18:01     ` Matthew Wilcox
2023-01-30 20:44       ` Ritesh Harjani (IBM)
2023-01-30 20:27     ` Ritesh Harjani (IBM)
2023-01-30 17:54   ` Matthew Wilcox
2023-01-30 20:34     ` Ritesh Harjani (IBM)
2023-01-30 18:10 ` [RFCv2 0/3] iomap: Add support for subpage dirty state " Matthew Wilcox
2023-01-30 21:01   ` Ritesh Harjani (IBM)
2023-02-02  4:45 ` Ritesh Harjani (IBM)
