linux-fsdevel.vger.kernel.org archive mirror
* [RFCv5 0/5] iomap: Add support for per-block dirty state to improve write performance
@ 2023-05-07 19:27 Ritesh Harjani (IBM)
  2023-05-07 19:27 ` [RFCv5 1/5] iomap: Rename iomap_page_create/release() to iop_alloc/free() Ritesh Harjani (IBM)
                   ` (4 more replies)
  0 siblings, 5 replies; 44+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-05-07 19:27 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM)

Hello,

Please find version-5 of this series. v5 mainly addresses Matthew's review
comments by providing better iop->state bitmap handling functions.
Please let me know if you have any comments/feedback.

Testing
=========
I am continuing to run more tests, but so far I haven't observed any surprises
in my 1k and 4k blocksize runs with default options & -g "auto".

<Pasting RFCv3 => RFCv4 relevant changelog for v5 and dropping everything else>
========================================================================
This addresses a problem reported by Brian for a short write case with delalloc
   release. It is addressed in patch-5, in function iomap_write_delalloc_scan().
   I suppose this is a major change from the previous rfcv3.
   I ran a unit test which Brian provided, using xfs_io's -f option.
   Before these changes, the kernel hit a BUG_ON during unmount when running
   the unit test. This is fixed by the changes added in v4.

	i.e. After v4
	=================
	root@ubuntu# ./xfs_io -fc "truncate 4k" -c "mmap 0 4k" -c "mread 0 4k" -c "pwrite 0 1" -c "pwrite -f 2k 1" -c fsync /mnt1/tmnt/testfile
	wrote 1/1 bytes at offset 0
	1.000000 bytes, 1 ops; 0.0001 sec (7.077 KiB/sec and 7246.3768 ops/sec)
	pwrite: Bad address
	root@ubuntu# ./xfs_io -c "fiemap -v" /mnt1/tmnt/testfile
	/mnt1/tmnt/testfile:
	 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
	   0: [0..1]:          22..23               2   0x1
	   1: [2..7]:          hole                 6
	root@ubuntu# filefrag -v /mnt1/tmnt/testfile
	Filesystem type is: 58465342
	File size of /mnt1/tmnt/testfile is 4096 (4 blocks of 1024 bytes)
	 ext:     logical_offset:        physical_offset: length:   expected: flags:
	   0:        0..       0:         11..        11:      1:             last
	/mnt1/tmnt/testfile: 1 extent found
	root@ubuntu# umount /mnt1/tmnt
	root@ubuntu#

	Before v4
	===========
	root@ubuntu# mount /dev/loop8 /mnt1/test
	root@ubuntu# ./xfs_io -fc "truncate 4k" -c "mmap 0 4k" -c "mread 0 4k" -c "pwrite 0 1" -c "pwrite -f 2k 1" -c fsync /mnt1/test/testfile
	wrote 1/1 bytes at offset 0
	1.000000 bytes, 1 ops; 0.0000 sec (10.280 KiB/sec and 10526.3158 ops/sec)
	pwrite: Bad address
	root@ubuntu# ./xfs_io -c "fiemap -v" /mnt1/test/testfile
	/mnt1/test/testfile:
	 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
	   0: [0..1]:          22..23               2   0x0
	   1: [2..3]:          hole                 2
	   2: [4..5]:          0..1                 2   0x7
	   3: [6..7]:          hole                 2
	root@ubuntu# filefrag -v /mnt1/test/testfile
	Filesystem type is: 58465342
	File size of /mnt1/test/testfile is 4096 (4 blocks of 1024 bytes)
	 ext:     logical_offset:        physical_offset: length:   expected: flags:
	   0:        0..       0:         11..        11:      1:
	   1:        2..       2:          0..         0:      0:             last,unknown_loc,delalloc
	/mnt1/test/testfile: 2 extents found
	root@ubuntu# umount /mnt1/test
	<dmesg snip>
	[  156.581188] XFS (loop8): Unmounting Filesystem 7889507d-fc7f-4a1c-94d5-06797f2cc790
	[  156.584455] XFS (loop8): ino 43 data fork has delalloc extent at [0x2:0x1]
	[  156.587847] XFS: Assertion failed: 0, file: fs/xfs/xfs_icache.c, line: 1816
	[  156.591675] ------------[ cut here ]------------
	[  156.594133] kernel BUG at fs/xfs/xfs_message.c:102!
	[  156.596669] invalid opcode: 0000 [#1] PREEMPT SMP PTI
	[  156.599277] CPU: 7 PID: 435 Comm: kworker/7:5 Not tainted 6.3.0-rc6-xfstests-00003-g99a844f4e411-dirty #105
	[  156.603721] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
	[  156.608701] Workqueue: xfs-inodegc/loop8 xfs_inodegc_worker
	[  156.611426] RIP: 0010:assfail+0x38/0x40
	[  156.646981] Call Trace:
	[  156.647951]  <TASK>
	[  156.648904]  xfs_inodegc_set_reclaimable+0x15b/0x160
	[  156.651270]  xfs_inodegc_worker+0x95/0x1d0
	[  156.653202]  process_one_work+0x274/0x550
	[  156.655305]  worker_thread+0x4f/0x300
	[  156.657081]  ? __pfx_worker_thread+0x10/0x10
	[  156.658977]  kthread+0xf6/0x120
	[  156.660657]  ? __pfx_kthread+0x10/0x10
	[  156.662565]  ret_from_fork+0x2c/0x50
	[  156.664421]  </TASK>

Previous changelogs & TODOs at [1]

[1]: https://lore.kernel.org/linux-xfs/cover.1683208091.git.ritesh.list@gmail.com/

Ritesh Harjani (IBM) (5):
  iomap: Rename iomap_page_create/release() to iop_alloc/free()
  iomap: Refactor iop_set_range_uptodate() function
  iomap: Add iop's uptodate state handling functions
  iomap: Allocate iop in ->write_begin() early
  iomap: Add per-block dirty state tracking to improve performance

 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 264 ++++++++++++++++++++++++++++++++---------
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/file.c       |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 210 insertions(+), 61 deletions(-)

--
2.39.2


* [RFCv5 1/5] iomap: Rename iomap_page_create/release() to iop_alloc/free()
  2023-05-07 19:27 [RFCv5 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
@ 2023-05-07 19:27 ` Ritesh Harjani (IBM)
  2023-05-18  6:13   ` Christoph Hellwig
  2023-05-07 19:27 ` [RFCv5 2/5] iomap: Refactor iop_set_range_uptodate() function Ritesh Harjani (IBM)
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 44+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-05-07 19:27 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM)

This patch renames the iomap_page_create/release() functions to
iop_alloc/free(). Later patches add more functions for handling the
iop structure with iop_** naming conventions, hence iop_alloc/free()
makes more sense.

Note, this patch also moves folio_detach_private() to happen later,
after checking for bitmap_full(). This is another small refactor:
in later patches we will convert the bitmap_** helpers into iop_**
helpers which only take a folio, hence we should move
folio_detach_private() to the end, just before calling kfree(iop).

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6f4c97a6d7e9..cbd945d96584 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -43,8 +43,8 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 
 static struct bio_set iomap_ioend_bioset;
 
-static struct iomap_page *
-iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
+static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
+				    unsigned int flags)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -69,9 +69,9 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 	return iop;
 }
 
-static void iomap_page_release(struct folio *folio)
+static void iop_free(struct folio *folio)
 {
-	struct iomap_page *iop = folio_detach_private(folio);
+	struct iomap_page *iop = to_iomap_page(folio);
 	struct inode *inode = folio->mapping->host;
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
@@ -81,6 +81,7 @@ static void iomap_page_release(struct folio *folio)
 	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
 	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
 			folio_test_uptodate(folio));
+	folio_detach_private(folio);
 	kfree(iop);
 }
 
@@ -231,7 +232,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iomap_page_create(iter->inode, folio, iter->flags);
+		iop = iop_alloc(iter->inode, folio, iter->flags);
 	else
 		iop = to_iomap_page(folio);
 
@@ -269,7 +270,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		return iomap_read_inline_data(iter, folio);
 
 	/* zero post-eof blocks as the page may be mapped */
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
+	iop = iop_alloc(iter->inode, folio, iter->flags);
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
@@ -497,7 +498,7 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
 	 */
 	if (folio_test_dirty(folio) || folio_test_writeback(folio))
 		return false;
-	iomap_page_release(folio);
+	iop_free(folio);
 	return true;
 }
 EXPORT_SYMBOL_GPL(iomap_release_folio);
@@ -514,12 +515,12 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 	if (offset == 0 && len == folio_size(folio)) {
 		WARN_ON_ONCE(folio_test_writeback(folio));
 		folio_cancel_dirty(folio);
-		iomap_page_release(folio);
+		iop_free(folio);
 	} else if (folio_test_large(folio)) {
 		/* Must release the iop so the page can be split */
 		WARN_ON_ONCE(!folio_test_uptodate(folio) &&
 			     folio_test_dirty(folio));
-		iomap_page_release(folio);
+		iop_free(folio);
 	}
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
@@ -566,7 +567,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 		return 0;
 	folio_clear_error(folio);
 
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
+	iop = iop_alloc(iter->inode, folio, iter->flags);
+
 	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
 		return -EAGAIN;
 
@@ -1619,7 +1621,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iomap_page_create(inode, folio, 0);
+	struct iomap_page *iop = iop_alloc(inode, folio, 0);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
-- 
2.39.2


* [RFCv5 2/5] iomap: Refactor iop_set_range_uptodate() function
  2023-05-07 19:27 [RFCv5 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
  2023-05-07 19:27 ` [RFCv5 1/5] iomap: Rename iomap_page_create/release() to iop_alloc/free() Ritesh Harjani (IBM)
@ 2023-05-07 19:27 ` Ritesh Harjani (IBM)
  2023-05-15 15:09   ` Brian Foster
  2023-05-18  6:16   ` Christoph Hellwig
  2023-05-07 19:27 ` [RFCv5 3/5] iomap: Add iop's uptodate state handling functions Ritesh Harjani (IBM)
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 44+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-05-07 19:27 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM)

This patch moves up and combines the definitions of two functions
(iomap_iop_set_range_uptodate() & iomap_set_range_uptodate()) into
iop_set_range_uptodate() & refactors its arguments a bit.

No functionality change in this patch.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 57 ++++++++++++++++++++----------------------
 1 file changed, 27 insertions(+), 30 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index cbd945d96584..e732581dc2d4 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -43,6 +43,27 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)

 static struct bio_set iomap_ioend_bioset;

+static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
+				   size_t off, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	unsigned int first_blk = off >> inode->i_blkbits;
+	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	if (iop) {
+		spin_lock_irqsave(&iop->uptodate_lock, flags);
+		bitmap_set(iop->uptodate, first_blk, nr_blks);
+		if (bitmap_full(iop->uptodate,
+				i_blocks_per_folio(inode, folio)))
+			folio_mark_uptodate(folio);
+		spin_unlock_irqrestore(&iop->uptodate_lock, flags);
+	} else {
+		folio_mark_uptodate(folio);
+	}
+}
+
 static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
 				    unsigned int flags)
 {
@@ -145,30 +166,6 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	*lenp = plen;
 }

-static void iomap_iop_set_range_uptodate(struct folio *folio,
-		struct iomap_page *iop, size_t off, size_t len)
-{
-	struct inode *inode = folio->mapping->host;
-	unsigned first = off >> inode->i_blkbits;
-	unsigned last = (off + len - 1) >> inode->i_blkbits;
-	unsigned long flags;
-
-	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	bitmap_set(iop->uptodate, first, last - first + 1);
-	if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
-		folio_mark_uptodate(folio);
-	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
-}
-
-static void iomap_set_range_uptodate(struct folio *folio,
-		struct iomap_page *iop, size_t off, size_t len)
-{
-	if (iop)
-		iomap_iop_set_range_uptodate(folio, iop, off, len);
-	else
-		folio_mark_uptodate(folio);
-}
-
 static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 		size_t len, int error)
 {
@@ -178,7 +175,8 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 		folio_clear_uptodate(folio);
 		folio_set_error(folio);
 	} else {
-		iomap_set_range_uptodate(folio, iop, offset, len);
+		iop_set_range_uptodate(folio->mapping->host, folio, offset,
+				       len);
 	}

 	if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
@@ -240,7 +238,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	memcpy(addr, iomap->inline_data, size);
 	memset(addr + size, 0, PAGE_SIZE - poff - size);
 	kunmap_local(addr);
-	iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
+	iop_set_range_uptodate(iter->inode, folio, offset, PAGE_SIZE - poff);
 	return 0;
 }

@@ -277,7 +275,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,

 	if (iomap_block_needs_zeroing(iter, pos)) {
 		folio_zero_range(folio, poff, plen);
-		iomap_set_range_uptodate(folio, iop, poff, plen);
+		iop_set_range_uptodate(iter->inode, folio, poff, plen);
 		goto done;
 	}

@@ -598,7 +596,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 			if (status)
 				return status;
 		}
-		iomap_set_range_uptodate(folio, iop, poff, plen);
+		iop_set_range_uptodate(iter->inode, folio, poff, plen);
 	} while ((block_start += plen) < block_end);

 	return 0;
@@ -705,7 +703,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
-	struct iomap_page *iop = to_iomap_page(folio);
 	flush_dcache_folio(folio);

 	/*
@@ -721,7 +718,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	 */
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
-	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
+	iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos), len);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
--
2.39.2


* [RFCv5 3/5] iomap: Add iop's uptodate state handling functions
  2023-05-07 19:27 [RFCv5 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
  2023-05-07 19:27 ` [RFCv5 1/5] iomap: Rename iomap_page_create/release() to iop_alloc/free() Ritesh Harjani (IBM)
  2023-05-07 19:27 ` [RFCv5 2/5] iomap: Refactor iop_set_range_uptodate() function Ritesh Harjani (IBM)
@ 2023-05-07 19:27 ` Ritesh Harjani (IBM)
  2023-05-15 15:10   ` Brian Foster
  2023-05-18  6:18   ` Christoph Hellwig
  2023-05-07 19:27 ` [RFCv5 4/5] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
  2023-05-07 19:28 ` [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
  4 siblings, 2 replies; 44+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-05-07 19:27 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM)

Firstly this patch renames the iop->uptodate bitmap to iop->state.
This is because we will add dirty state to the iop->state bitmap in
later patches, so it makes sense for the bitmap to carry the more
general name iop->state.

Secondly this patch adds helpers for handling the uptodate state
bits within iop->state.

No functionality change in this patch.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 78 +++++++++++++++++++++++++++++++-----------
 1 file changed, 58 insertions(+), 20 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e732581dc2d4..5103b644e115 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -24,14 +24,14 @@
 #define IOEND_BATCH_SIZE	4096
 
 /*
- * Structure allocated for each folio when block size < folio size
- * to track sub-folio uptodate status and I/O completions.
+ * Structure allocated for each folio to track per-block uptodate state
+ * and I/O completions.
  */
 struct iomap_page {
 	atomic_t		read_bytes_pending;
 	atomic_t		write_bytes_pending;
-	spinlock_t		uptodate_lock;
-	unsigned long		uptodate[];
+	spinlock_t		state_lock;
+	unsigned long		state[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct folio *folio)
@@ -43,6 +43,47 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 
 static struct bio_set iomap_ioend_bioset;
 
+/*
+ * inline helpers for bitmap operations on iop->state
+ */
+static inline void iop_set_range(struct iomap_page *iop, unsigned int start_blk,
+				 unsigned int nr_blks)
+{
+	bitmap_set(iop->state, start_blk, nr_blks);
+}
+
+static inline bool iop_test_block(struct iomap_page *iop, unsigned int block)
+{
+	return test_bit(block, iop->state);
+}
+
+static inline bool iop_bitmap_full(struct iomap_page *iop,
+				   unsigned int blks_per_folio)
+{
+	return bitmap_full(iop->state, blks_per_folio);
+}
+
+/*
+ * iop related helpers for checking uptodate/dirty state of per-block
+ * or range of blocks within a folio
+ */
+static bool iop_test_full_uptodate(struct folio *folio)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	struct inode *inode = folio->mapping->host;
+
+	WARN_ON(!iop);
+	return iop_bitmap_full(iop, i_blocks_per_folio(inode, folio));
+}
+
+static bool iop_test_block_uptodate(struct folio *folio, unsigned int block)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+
+	WARN_ON(!iop);
+	return iop_test_block(iop, block);
+}
+
 static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
 				   size_t off, size_t len)
 {
@@ -53,12 +94,11 @@ static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
 	unsigned long flags;
 
 	if (iop) {
-		spin_lock_irqsave(&iop->uptodate_lock, flags);
-		bitmap_set(iop->uptodate, first_blk, nr_blks);
-		if (bitmap_full(iop->uptodate,
-				i_blocks_per_folio(inode, folio)))
+		spin_lock_irqsave(&iop->state_lock, flags);
+		iop_set_range(iop, first_blk, nr_blks);
+		if (iop_test_full_uptodate(folio))
 			folio_mark_uptodate(folio);
-		spin_unlock_irqrestore(&iop->uptodate_lock, flags);
+		spin_unlock_irqrestore(&iop->state_lock, flags);
 	} else {
 		folio_mark_uptodate(folio);
 	}
@@ -79,12 +119,12 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;
 
-	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
 		      gfp);
 	if (iop) {
-		spin_lock_init(&iop->uptodate_lock);
+		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
-			bitmap_fill(iop->uptodate, nr_blocks);
+			iop_set_range(iop, 0, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -93,15 +133,13 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
 static void iop_free(struct folio *folio)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
-	struct inode *inode = folio->mapping->host;
-	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
 	if (!iop)
 		return;
 	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
 	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
-	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
-			folio_test_uptodate(folio));
+	WARN_ON_ONCE(iop_test_full_uptodate(folio) !=
+		     folio_test_uptodate(folio));
 	folio_detach_private(folio);
 	kfree(iop);
 }
@@ -132,7 +170,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* move forward for each leading block marked uptodate */
 		for (i = first; i <= last; i++) {
-			if (!test_bit(i, iop->uptodate))
+			if (!iop_test_block_uptodate(folio, i))
 				break;
 			*pos += block_size;
 			poff += block_size;
@@ -142,7 +180,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* truncate len if we find any trailing uptodate block(s) */
 		for ( ; i <= last; i++) {
-			if (test_bit(i, iop->uptodate)) {
+			if (iop_test_block_uptodate(folio, i)) {
 				plen -= (last - i + 1) * block_size;
 				last = i - 1;
 				break;
@@ -450,7 +488,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 	last = (from + count - 1) >> inode->i_blkbits;
 
 	for (i = first; i <= last; i++)
-		if (!test_bit(i, iop->uptodate))
+		if (!iop_test_block_uptodate(folio, i))
 			return false;
 	return true;
 }
@@ -1634,7 +1672,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !test_bit(i, iop->uptodate))
+		if (iop && !iop_test_block_uptodate(folio, i))
 			continue;
 
 		error = wpc->ops->map_blocks(wpc, inode, pos);
-- 
2.39.2


* [RFCv5 4/5] iomap: Allocate iop in ->write_begin() early
  2023-05-07 19:27 [RFCv5 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
                   ` (2 preceding siblings ...)
  2023-05-07 19:27 ` [RFCv5 3/5] iomap: Add iop's uptodate state handling functions Ritesh Harjani (IBM)
@ 2023-05-07 19:27 ` Ritesh Harjani (IBM)
  2023-05-18  6:21   ` Christoph Hellwig
  2023-05-07 19:28 ` [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
  4 siblings, 1 reply; 44+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-05-07 19:27 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM)

Earlier, when the folio was uptodate, we only allocated the iop at
writeback time (in iomap_writepage_map()). This was fine until now, but
once we add support for a per-block dirty state bitmap in the iop, it
could cause performance degradation. The reason is that if we don't
allocate the iop during ->write_begin(), then we can never mark the
necessary dirty bits in the ->write_end() call, and we would have to
mark all the bits as dirty at writeback time, which would cause the same
write amplification and performance problems we have today.

However, for writes with a (pos, len) that completely overlaps the
given folio, there is no need to allocate an iop during
->write_begin(), so skip those cases.
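
As an illustration (hypothetical numbers, 4k blocks in a 64k folio
starting at pos 0): a write of (pos=0, len=64k) covers every block, so
every dirty bit would be set anyway and allocating the iop can be
deferred to writeback; a write of (pos=0, len=4k) only partially covers
the folio, so the iop is allocated here to track which blocks actually
become dirty.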

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 5103b644e115..25f20f269214 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -599,15 +599,25 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	size_t from = offset_in_folio(folio, pos), to = from + len;
 	size_t poff, plen;
 
-	if (folio_test_uptodate(folio))
+	/*
+	 * If the write completely overlaps the current folio, then
+	 * entire folio will be dirtied so there is no need for
+	 * per-block state tracking structures to be attached to this folio.
+	 */
+	if (pos <= folio_pos(folio) &&
+	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;
-	folio_clear_error(folio);
 
 	iop = iop_alloc(iter->inode, folio, iter->flags);
 
 	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
 		return -EAGAIN;
 
+	if (folio_test_uptodate(folio))
+		return 0;
+	folio_clear_error(folio);
+
+
 	do {
 		iomap_adjust_read_range(iter->inode, folio, &block_start,
 				block_end - block_start, &poff, &plen);
-- 
2.39.2


* [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-07 19:27 [RFCv5 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
                   ` (3 preceding siblings ...)
  2023-05-07 19:27 ` [RFCv5 4/5] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
@ 2023-05-07 19:28 ` Ritesh Harjani (IBM)
       [not found]   ` <CGME20230515081618eucas1p1c852fec3ba7a42ee7094248c30ff5978@eucas1p1.samsung.com>
                     ` (2 more replies)
  4 siblings, 3 replies; 44+ messages in thread
From: Ritesh Harjani (IBM) @ 2023-05-07 19:28 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM),
	Aravinda Herle

When the filesystem blocksize is less than the folio size (either with
mapping_large_folio_support() or with blocksize < pagesize) and the
folio is uptodate in the pagecache, then even a one-byte write can cause
an entire folio to be written to disk during writeback. This happens
because we currently don't have a mechanism to track per-block dirty
state within struct iomap_page; we only track uptodate state.
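
As a concrete example, take the 4k blocksize on 64K pagesize
configuration measured below: a 1-byte overwrite of an uptodate 64K
folio currently makes writeback submit all 16 blocks of the folio
instead of just the single 4k block that actually changed.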

This patch implements support for tracking per-block dirty state in the
iomap_page->state bitmap. This should help improve filesystem write
performance and reduce write amplification.
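
With this patch, iop->state holds two bitmaps back to back; for
nr_blocks blocks per folio the layout is (see the iop_alloc() comment
in the diff below):

	bits [0 .. nr_blocks - 1]             per-block uptodate state
	bits [nr_blocks .. 2 * nr_blocks - 1] per-block dirty state

so the dirty bit of block i lives at bit (i + blks_per_folio), which is
why iop_test_block_dirty() and iop_set_range_dirty() offset into the
second half of the bitmap.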

Performance testing of the below fio workload reveals a ~16x performance
improvement using NVMe with XFS (4k blocksize) on Power (64K pagesize).
FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.

1. <test_randwrite.fio>
[global]
	ioengine=psync
	rw=randwrite
	overwrite=1
	pre_read=1
	direct=0
	bs=4k
	size=1G
	dir=./
	numjobs=8
	fdatasync=1
	runtime=60
	iodepth=64
	group_reporting=1

[fio-run]
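
(Each of the 8 jobs does 4k random overwrites of a pre-read 1G file,
with an fdatasync after every write. pre_read=1 ensures the folios are
already uptodate in the pagecache, which is exactly the case where
per-block dirty tracking avoids writing back whole folios.)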

2. Also our internal performance team reported that this patch improves
   their database workload performance by around ~83% (with XFS on Power)

Reported-by: Aravinda Herle <araherle@in.ibm.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/file.c       |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 112 insertions(+), 10 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index a5f4be6b9213..75efec3c3b71 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -746,7 +746,7 @@ static const struct address_space_operations gfs2_aops = {
 	.writepages = gfs2_writepages,
 	.read_folio = gfs2_read_folio,
 	.readahead = gfs2_readahead,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.bmap = gfs2_bmap,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 25f20f269214..c7f41b26280a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -52,6 +52,12 @@ static inline void iop_set_range(struct iomap_page *iop, unsigned int start_blk,
 	bitmap_set(iop->state, start_blk, nr_blks);
 }

+static inline void iop_clear_range(struct iomap_page *iop,
+				   unsigned int start_blk, unsigned int nr_blks)
+{
+	bitmap_clear(iop->state, start_blk, nr_blks);
+}
+
 static inline bool iop_test_block(struct iomap_page *iop, unsigned int block)
 {
 	return test_bit(block, iop->state);
@@ -84,6 +90,16 @@ static bool iop_test_block_uptodate(struct folio *folio, unsigned int block)
 	return iop_test_block(iop, block);
 }

+static bool iop_test_block_dirty(struct folio *folio, int block)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	struct inode *inode = folio->mapping->host;
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+
+	WARN_ON(!iop);
+	return iop_test_block(iop, block + blks_per_folio);
+}
+
 static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
 				   size_t off, size_t len)
 {
@@ -104,8 +120,42 @@ static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
 	}
 }

+static void iop_set_range_dirty(struct inode *inode, struct folio *folio,
+				size_t off, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = (off >> inode->i_blkbits);
+	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	if (!iop)
+		return;
+	spin_lock_irqsave(&iop->state_lock, flags);
+	iop_set_range(iop, first_blk + blks_per_folio, nr_blks);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	struct inode *inode = folio->mapping->host;
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = (off >> inode->i_blkbits);
+	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	if (!iop)
+		return;
+	spin_lock_irqsave(&iop->state_lock, flags);
+	iop_clear_range(iop, first_blk + blks_per_folio, nr_blks);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
 static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
-				    unsigned int flags)
+				    unsigned int flags, bool is_dirty)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;

-	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
+	/*
+	 * iop->state tracks two sets of state flags when the
+	 * filesystem block size is smaller than the folio size.
+	 * The first state tracks per-block uptodate and the
+	 * second tracks per-block dirty state.
+	 */
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
 		      gfp);
 	if (iop) {
 		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
 			iop_set_range(iop, 0, nr_blocks);
+		if (is_dirty)
+			iop_set_range(iop, nr_blocks, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -268,7 +326,8 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iop_alloc(iter->inode, folio, iter->flags);
+		iop = iop_alloc(iter->inode, folio, iter->flags,
+				folio_test_dirty(folio));
 	else
 		iop = to_iomap_page(folio);

@@ -306,7 +365,8 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		return iomap_read_inline_data(iter, folio);

 	/* zero post-eof blocks as the page may be mapped */
-	iop = iop_alloc(iter->inode, folio, iter->flags);
+	iop = iop_alloc(iter->inode, folio, iter->flags,
+			folio_test_dirty(folio));
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
@@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);

+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
+{
+	struct iomap_page *iop;
+	struct inode *inode = mapping->host;
+	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;
+
+	iop = iop_alloc(inode, folio, 0, false);
+	iop_set_range_dirty(inode, folio, 0, len);
+	return filemap_dirty_folio(mapping, folio);
+}
+EXPORT_SYMBOL_GPL(iomap_dirty_folio);
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -608,7 +680,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;

-	iop = iop_alloc(iter->inode, folio, iter->flags);
+	iop = iop_alloc(iter->inode, folio, iter->flags,
+			folio_test_dirty(folio));

 	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
 		return -EAGAIN;
@@ -767,6 +840,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
 	iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos), len);
+	iop_set_range_dirty(inode, folio, offset_in_folio(folio, pos), copied);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
@@ -954,6 +1028,10 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 {
 	while (start_byte < end_byte) {
 		struct folio	*folio;
+		struct iomap_page *iop;
+		unsigned int first_blk, last_blk, blks_per_folio, i;
+		loff_t last_byte;
+		u8 blkbits = inode->i_blkbits;

 		/* grab locked page */
 		folio = filemap_lock_folio(inode->i_mapping,
@@ -978,6 +1056,28 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 				}
 			}

+			/*
+			 * When we have per-block dirty tracking, there can be
+			 * blocks within a folio which are marked uptodate
+			 * but not dirty. In that case it is necessary to punch
+			 * out such blocks to avoid leaking any delalloc blocks.
+			 */
+			iop = to_iomap_page(folio);
+			if (!iop)
+				goto skip_iop_punch;
+			last_byte = min_t(loff_t, end_byte - 1,
+				(folio_next_index(folio) << PAGE_SHIFT) - 1);
+			first_blk = offset_in_folio(folio, start_byte) >>
+						    blkbits;
+			last_blk = offset_in_folio(folio, last_byte) >>
+						   blkbits;
+			blks_per_folio = i_blocks_per_folio(inode, folio);
+			for (i = first_blk; i <= last_blk; i++) {
+				if (!iop_test_block_dirty(folio, i))
+					punch(inode, i << blkbits,
+						     1 << blkbits);
+			}
+skip_iop_punch:
 			/*
 			 * Make sure the next punch start is correctly bound to
 			 * the end of this data range, not the end of the folio.
@@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iop_alloc(inode, folio, 0);
+	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !iop_test_block_uptodate(folio, i))
+		if (iop && !iop_test_block_dirty(folio, i))
 			continue;

 		error = wpc->ops->map_blocks(wpc, inode, pos);
@@ -1726,6 +1826,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		}
 	}

+	iop_clear_range_dirty(folio, 0, end_pos - folio_pos(folio));
 	folio_start_writeback(folio);
 	folio_unlock(folio);

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2ef78aa1d3f6..77c7332ae197 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
 	.read_folio		= xfs_vm_read_folio,
 	.readahead		= xfs_vm_readahead,
 	.writepages		= xfs_vm_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.bmap			= xfs_vm_bmap,
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 132f01d3461f..e508c8e97372 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
 	.read_folio		= zonefs_read_folio,
 	.readahead		= zonefs_readahead,
 	.writepages		= zonefs_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.migrate_folio		= filemap_migrate_folio,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0f8123504e5e..0c2bee80565c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
--
2.39.2


* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
       [not found]   ` <CGME20230515081618eucas1p1c852fec3ba7a42ee7094248c30ff5978@eucas1p1.samsung.com>
@ 2023-05-15  8:16     ` Pankaj Raghav
  2023-05-15  8:31       ` Ritesh Harjani
  0 siblings, 1 reply; 44+ messages in thread
From: Pankaj Raghav @ 2023-05-15  8:16 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel, Aravinda Herle, mcgrof

> @@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode,
>  		struct folio *folio, u64 end_pos)
>  {
> -	struct iomap_page *iop = iop_alloc(inode, folio, 0);
> +	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
>  	struct iomap_ioend *ioend, *next;
>  	unsigned len = i_blocksize(inode);
>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * invalid, grab a new one.
>  	 */
>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -		if (iop && !iop_test_block_uptodate(folio, i))
> +		if (iop && !iop_test_block_dirty(folio, i))

Shouldn't this be if (iop && iop_test_block_dirty(folio, i))?

Before, we were skipping blocks that were not uptodate, but now we are
skipping blocks that are not dirty (which means they are uptodate)?

I am new to iomap but let me know if I am missing something here.

>  			continue;
> 
>  		error = wpc->ops->map_blocks(wpc, inode, pos);


* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-15  8:16     ` Pankaj Raghav
@ 2023-05-15  8:31       ` Ritesh Harjani
  2023-05-15 13:23         ` Pankaj Raghav
  0 siblings, 1 reply; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-15  8:31 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel, Aravinda Herle, mcgrof

Pankaj Raghav <p.raghav@samsung.com> writes:

>> @@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  		struct writeback_control *wbc, struct inode *inode,
>>  		struct folio *folio, u64 end_pos)
>>  {
>> -	struct iomap_page *iop = iop_alloc(inode, folio, 0);
>> +	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
>>  	struct iomap_ioend *ioend, *next;
>>  	unsigned len = i_blocksize(inode);
>>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
>> @@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  	 * invalid, grab a new one.
>>  	 */
>>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
>> -		if (iop && !iop_test_block_uptodate(folio, i))
>> +		if (iop && !iop_test_block_dirty(folio, i))
>
> Shouldn't this be if (iop && iop_test_block_dirty(folio, i))?
>
> Before, we were skipping blocks that were not uptodate, but now we are
> skipping blocks that are not dirty (which means they are uptodate)?
>
> I am new to iomap but let me know if I am missing something here.
>

We set the per-block dirty status in ->write_begin(). The check above
happens in the writeback path, when we are about to write the dirty
data to disk. What the check does is: if the block's state is not
dirty, skip it, because that means the block was not written to in
->write_begin(). Also, a block with dirty status is always uptodate
anyway.
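
i.e. the per-block check in iomap_writepage_map()'s loop (same code as
in the diff above, with explanatory comments added) reads:

	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
		/* clean block: it was never written to, skip it */
		if (iop && !iop_test_block_dirty(folio, i))
			continue;
		/* dirty (and hence uptodate) block: map it for writeback */
		error = wpc->ops->map_blocks(wpc, inode, pos);
		...
	}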

Hope it helps!

-ritesh

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-15  8:31       ` Ritesh Harjani
@ 2023-05-15 13:23         ` Pankaj Raghav
  0 siblings, 0 replies; 44+ messages in thread
From: Pankaj Raghav @ 2023-05-15 13:23 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel, Aravinda Herle, mcgrof

> 
> We set the per-block dirty status in ->write_begin(). The check above
> happens in the writeback path, when we are about to write the dirty
> data to disk. What the check does is: if the block's state is not
> dirty, skip it, because that means the block was not written to in
> ->write_begin(). Also, a block with dirty status is always uptodate
> anyway.
> 
> Hope it helps!
> 

Definitely. I also checked your 1st RFC version to get more context. Thanks for
the explanation.

* Re: [RFCv5 2/5] iomap: Refactor iop_set_range_uptodate() function
  2023-05-07 19:27 ` [RFCv5 2/5] iomap: Refactor iop_set_range_uptodate() function Ritesh Harjani (IBM)
@ 2023-05-15 15:09   ` Brian Foster
  2023-05-16 10:12     ` Ritesh Harjani
  2023-05-18  6:16   ` Christoph Hellwig
  1 sibling, 1 reply; 44+ messages in thread
From: Brian Foster @ 2023-05-15 15:09 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Ojaswin Mujoo, Disha Goel

On Mon, May 08, 2023 at 12:57:57AM +0530, Ritesh Harjani (IBM) wrote:
> This patch moves up and combines the definitions of two functions
> (iomap_iop_set_range_uptodate() & iomap_set_range_uptodate()) into
> iop_set_range_uptodate() & refactors its arguments a bit.
> 
> No functionality change in this patch.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---

Hi Ritesh,

I just have a few random and nitty comments/questions on the series.

>  fs/iomap/buffered-io.c | 57 ++++++++++++++++++++----------------------
>  1 file changed, 27 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index cbd945d96584..e732581dc2d4 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -43,6 +43,27 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
> 
>  static struct bio_set iomap_ioend_bioset;
> 
> +static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
> +				   size_t off, size_t len)
> +{

Any particular reason this now takes the inode as a param instead of
continuing to use the folio?
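
E.g., presumably the inode could still be derived locally, as the old
iomap_iop_set_range_uptodate() below does:

	struct inode *inode = folio->mapping->host;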

Brian

> +	struct iomap_page *iop = to_iomap_page(folio);
> +	unsigned int first_blk = off >> inode->i_blkbits;
> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
> +	unsigned int nr_blks = last_blk - first_blk + 1;
> +	unsigned long flags;
> +
> +	if (iop) {
> +		spin_lock_irqsave(&iop->uptodate_lock, flags);
> +		bitmap_set(iop->uptodate, first_blk, nr_blks);
> +		if (bitmap_full(iop->uptodate,
> +				i_blocks_per_folio(inode, folio)))
> +			folio_mark_uptodate(folio);
> +		spin_unlock_irqrestore(&iop->uptodate_lock, flags);
> +	} else {
> +		folio_mark_uptodate(folio);
> +	}
> +}
> +
>  static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>  				    unsigned int flags)
>  {
> @@ -145,30 +166,6 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>  	*lenp = plen;
>  }
> 
> -static void iomap_iop_set_range_uptodate(struct folio *folio,
> -		struct iomap_page *iop, size_t off, size_t len)
> -{
> -	struct inode *inode = folio->mapping->host;
> -	unsigned first = off >> inode->i_blkbits;
> -	unsigned last = (off + len - 1) >> inode->i_blkbits;
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&iop->uptodate_lock, flags);
> -	bitmap_set(iop->uptodate, first, last - first + 1);
> -	if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
> -		folio_mark_uptodate(folio);
> -	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
> -}
> -
> -static void iomap_set_range_uptodate(struct folio *folio,
> -		struct iomap_page *iop, size_t off, size_t len)
> -{
> -	if (iop)
> -		iomap_iop_set_range_uptodate(folio, iop, off, len);
> -	else
> -		folio_mark_uptodate(folio);
> -}
> -
>  static void iomap_finish_folio_read(struct folio *folio, size_t offset,
>  		size_t len, int error)
>  {
> @@ -178,7 +175,8 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
>  		folio_clear_uptodate(folio);
>  		folio_set_error(folio);
>  	} else {
> -		iomap_set_range_uptodate(folio, iop, offset, len);
> +		iop_set_range_uptodate(folio->mapping->host, folio, offset,
> +				       len);
>  	}
> 
>  	if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
> @@ -240,7 +238,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
>  	memcpy(addr, iomap->inline_data, size);
>  	memset(addr + size, 0, PAGE_SIZE - poff - size);
>  	kunmap_local(addr);
> -	iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
> +	iop_set_range_uptodate(iter->inode, folio, offset, PAGE_SIZE - poff);
>  	return 0;
>  }
> 
> @@ -277,7 +275,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
> 
>  	if (iomap_block_needs_zeroing(iter, pos)) {
>  		folio_zero_range(folio, poff, plen);
> -		iomap_set_range_uptodate(folio, iop, poff, plen);
> +		iop_set_range_uptodate(iter->inode, folio, poff, plen);
>  		goto done;
>  	}
> 
> @@ -598,7 +596,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  			if (status)
>  				return status;
>  		}
> -		iomap_set_range_uptodate(folio, iop, poff, plen);
> +		iop_set_range_uptodate(iter->inode, folio, poff, plen);
>  	} while ((block_start += plen) < block_end);
> 
>  	return 0;
> @@ -705,7 +703,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
>  static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>  		size_t copied, struct folio *folio)
>  {
> -	struct iomap_page *iop = to_iomap_page(folio);
>  	flush_dcache_folio(folio);
> 
>  	/*
> @@ -721,7 +718,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>  	 */
>  	if (unlikely(copied < len && !folio_test_uptodate(folio)))
>  		return 0;
> -	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
> +	iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos), len);
>  	filemap_dirty_folio(inode->i_mapping, folio);
>  	return copied;
>  }
> --
> 2.39.2
> 


* Re: [RFCv5 3/5] iomap: Add iop's uptodate state handling functions
  2023-05-07 19:27 ` [RFCv5 3/5] iomap: Add iop's uptodate state handling functions Ritesh Harjani (IBM)
@ 2023-05-15 15:10   ` Brian Foster
  2023-05-16 10:14     ` Ritesh Harjani
  2023-05-18  6:18   ` Christoph Hellwig
  1 sibling, 1 reply; 44+ messages in thread
From: Brian Foster @ 2023-05-15 15:10 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Ojaswin Mujoo, Disha Goel

On Mon, May 08, 2023 at 12:57:58AM +0530, Ritesh Harjani (IBM) wrote:
> Firstly this patch renames the iop->uptodate bitmap to iop->state.
> This is because we will add dirty state to the iop->state bitmap in
> later patches, so it makes sense for the bitmap to carry the more
> general name iop->state.
> 
> Secondly this patch adds helpers for handling the uptodate state
> bits within iop->state.
> 
> No functionality change in this patch.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/iomap/buffered-io.c | 78 +++++++++++++++++++++++++++++++-----------
>  1 file changed, 58 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e732581dc2d4..5103b644e115 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
...
> @@ -43,6 +43,47 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
...
> +/*
> + * iop related helpers for checking uptodate/dirty state of per-block
> + * or range of blocks within a folio
> + */
> +static bool iop_test_full_uptodate(struct folio *folio)
> +{
> +	struct iomap_page *iop = to_iomap_page(folio);
> +	struct inode *inode = folio->mapping->host;
> +
> +	WARN_ON(!iop);

It looks like an oops or something is imminent here if iop is NULL. Why
the warn (here and in a couple other places)?

Brian

> +	return iop_bitmap_full(iop, i_blocks_per_folio(inode, folio));
> +}
> +
> +static bool iop_test_block_uptodate(struct folio *folio, unsigned int block)
> +{
> +	struct iomap_page *iop = to_iomap_page(folio);
> +
> +	WARN_ON(!iop);
> +	return iop_test_block(iop, block);
> +}
> +
>  static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
>  				   size_t off, size_t len)
>  {
> @@ -53,12 +94,11 @@ static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
>  	unsigned long flags;
>  
>  	if (iop) {
> -		spin_lock_irqsave(&iop->uptodate_lock, flags);
> -		bitmap_set(iop->uptodate, first_blk, nr_blks);
> -		if (bitmap_full(iop->uptodate,
> -				i_blocks_per_folio(inode, folio)))
> +		spin_lock_irqsave(&iop->state_lock, flags);
> +		iop_set_range(iop, first_blk, nr_blks);
> +		if (iop_test_full_uptodate(folio))
>  			folio_mark_uptodate(folio);
> -		spin_unlock_irqrestore(&iop->uptodate_lock, flags);
> +		spin_unlock_irqrestore(&iop->state_lock, flags);
>  	} else {
>  		folio_mark_uptodate(folio);
>  	}
> @@ -79,12 +119,12 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>  	else
>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>  
> -	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
>  		      gfp);
>  	if (iop) {
> -		spin_lock_init(&iop->uptodate_lock);
> +		spin_lock_init(&iop->state_lock);
>  		if (folio_test_uptodate(folio))
> -			bitmap_fill(iop->uptodate, nr_blocks);
> +			iop_set_range(iop, 0, nr_blocks);
>  		folio_attach_private(folio, iop);
>  	}
>  	return iop;
> @@ -93,15 +133,13 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>  static void iop_free(struct folio *folio)
>  {
>  	struct iomap_page *iop = to_iomap_page(folio);
> -	struct inode *inode = folio->mapping->host;
> -	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>  
>  	if (!iop)
>  		return;
>  	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
>  	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
> -	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> -			folio_test_uptodate(folio));
> +	WARN_ON_ONCE(iop_test_full_uptodate(folio) !=
> +		     folio_test_uptodate(folio));
>  	folio_detach_private(folio);
>  	kfree(iop);
>  }
> @@ -132,7 +170,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>  
>  		/* move forward for each leading block marked uptodate */
>  		for (i = first; i <= last; i++) {
> -			if (!test_bit(i, iop->uptodate))
> +			if (!iop_test_block_uptodate(folio, i))
>  				break;
>  			*pos += block_size;
>  			poff += block_size;
> @@ -142,7 +180,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>  
>  		/* truncate len if we find any trailing uptodate block(s) */
>  		for ( ; i <= last; i++) {
> -			if (test_bit(i, iop->uptodate)) {
> +			if (iop_test_block_uptodate(folio, i)) {
>  				plen -= (last - i + 1) * block_size;
>  				last = i - 1;
>  				break;
> @@ -450,7 +488,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
>  	last = (from + count - 1) >> inode->i_blkbits;
>  
>  	for (i = first; i <= last; i++)
> -		if (!test_bit(i, iop->uptodate))
> +		if (!iop_test_block_uptodate(folio, i))
>  			return false;
>  	return true;
>  }
> @@ -1634,7 +1672,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * invalid, grab a new one.
>  	 */
>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -		if (iop && !test_bit(i, iop->uptodate))
> +		if (iop && !iop_test_block_uptodate(folio, i))
>  			continue;
>  
>  		error = wpc->ops->map_blocks(wpc, inode, pos);
> -- 
> 2.39.2
> 


* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-07 19:28 ` [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
       [not found]   ` <CGME20230515081618eucas1p1c852fec3ba7a42ee7094248c30ff5978@eucas1p1.samsung.com>
@ 2023-05-15 15:15   ` Brian Foster
  2023-05-16 14:49     ` Ritesh Harjani
  2023-05-18 13:27   ` Christoph Hellwig
  2 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2023-05-15 15:15 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
> When the filesystem blocksize is less than the folio size (either with
> mapping_large_folio_support() or with blocksize < pagesize) and the
> folio is uptodate in the pagecache, then even a one-byte write can cause
> an entire folio to be written to disk during writeback. This happens
> because we currently don't have a mechanism to track per-block dirty
> state within struct iomap_page; we only track uptodate state.
> 
> This patch implements support for tracking per-block dirty state in the
> iomap_page->state bitmap. This should help improve filesystem write
> performance and reduce write amplification.
> 
> Performance testing of the below fio workload reveals a ~16x performance
> improvement using NVMe with XFS (4k blocksize) on Power (64K pagesize).
> FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.
> 
> 1. <test_randwrite.fio>
> [global]
> 	ioengine=psync
> 	rw=randwrite
> 	overwrite=1
> 	pre_read=1
> 	direct=0
> 	bs=4k
> 	size=1G
> 	dir=./
> 	numjobs=8
> 	fdatasync=1
> 	runtime=60
> 	iodepth=64
> 	group_reporting=1
> 
> [fio-run]
> 
> 2. Also our internal performance team reported that this patch improves
>    their database workload performance by around ~83% (with XFS on Power)
> 
> Reported-by: Aravinda Herle <araherle@in.ibm.com>
> Reported-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/gfs2/aops.c         |   2 +-
>  fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
>  fs/xfs/xfs_aops.c      |   2 +-
>  fs/zonefs/file.c       |   2 +-
>  include/linux/iomap.h  |   1 +
>  5 files changed, 112 insertions(+), 10 deletions(-)
> 
...
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 25f20f269214..c7f41b26280a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
...
> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>  	else
>  		gfp = GFP_NOFS | __GFP_NOFAIL;
> 
> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> +	/*
> +	 * iop->state tracks two sets of state flags when the
> +	 * filesystem block size is smaller than the folio size.
> +	 * The first state tracks per-block uptodate and the
> +	 * second tracks per-block dirty state.
> +	 */
> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
>  		      gfp);
>  	if (iop) {
>  		spin_lock_init(&iop->state_lock);
>  		if (folio_test_uptodate(folio))
>  			iop_set_range(iop, 0, nr_blocks);
> +		if (is_dirty)
> +			iop_set_range(iop, nr_blocks, nr_blocks);

I find the is_dirty logic here a bit confusing. AFAICT, this is
primarily to handle the case where a folio is completely overwritten, so
no iop is allocated at write time, and so then writeback needs to
allocate the iop as fully dirty to reflect that. I.e., all iop_alloc()
callers other than iomap_writepage_map() either pass the result of
folio_test_dirty() or explicitly dirty the entire range of the folio
anyways.  iomap_dirty_folio() essentially does the latter because it
needs to dirty the entire folio regardless of whether the iop already
exists or not, right?

If so and if I'm following all of that correctly, could this complexity
be isolated to iomap_writepage_map() by simply checking for the !iop
case first, then call iop_alloc() immediately followed by
set_range_dirty() of the entire folio? Then presumably iop_alloc() could
always just dirty based on folio state with the writeback path exception
case handled explicitly. Hm?
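
I.e., something like this in iomap_writepage_map() (completely
untested, just to illustrate the idea):

	struct iomap_page *iop = to_iomap_page(folio);
	...
	if (!iop) {
		iop = iop_alloc(inode, folio, 0);
		iop_set_range_dirty(inode, folio, 0, folio_size(folio));
	}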

>  		folio_attach_private(folio, iop);
>  	}
>  	return iop;
...
> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
>  }
>  EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
> 
> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> +{
> +	struct iomap_page *iop;
> +	struct inode *inode = mapping->host;
> +	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;

folio_size()?

> +
> +	iop = iop_alloc(inode, folio, 0, false);
> +	iop_set_range_dirty(inode, folio, 0, len);
> +	return filemap_dirty_folio(mapping, folio);
> +}
> +EXPORT_SYMBOL_GPL(iomap_dirty_folio);
> +
>  static void
>  iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>  {
...
> @@ -978,6 +1056,28 @@ static int iomap_write_delalloc_scan(struct inode *inode,
>  				}
>  			}
> 
> +			/*
> +			 * When we have per-block dirty tracking, there can be
> +			 * blocks within a folio which are marked uptodate
> +			 * but not dirty. In that case it is necessary to punch
> +			 * out such blocks to avoid leaking any delalloc blocks.
> +			 */
> +			iop = to_iomap_page(folio);
> +			if (!iop)
> +				goto skip_iop_punch;
> +			last_byte = min_t(loff_t, end_byte - 1,
> +				(folio_next_index(folio) << PAGE_SHIFT) - 1);
> +			first_blk = offset_in_folio(folio, start_byte) >>
> +						    blkbits;
> +			last_blk = offset_in_folio(folio, last_byte) >>
> +						   blkbits;
> +			blks_per_folio = i_blocks_per_folio(inode, folio);

Unused?

> +			for (i = first_blk; i <= last_blk; i++) {
> +				if (!iop_test_block_dirty(folio, i))
> +					punch(inode, i << blkbits,
> +						     1 << blkbits);
> +			}
> +skip_iop_punch:

Looks reasonable at first glance, though the deep indentation here kind
of makes my eyes cross. Could we stuff this loop into a
iomap_write_delalloc_scan_folio() helper or some such, and then maybe
update the comment at the top of the whole branch to explain both
punches?

WRT to the !iop case.. I _think_ this is handled correctly here because
that means we'd handle the folio as completely dirty at writeback time.
Is that the case? If so, it might be nice to document that assumption
somewhere (here or perhaps in the writeback path).
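
If that's the case, even a short comment above the punch loop would do,
e.g. (wording is just a strawman):

	/*
	 * With no iop attached, writeback treats the folio as
	 * completely dirty, so there are no sub-folio blocks to punch
	 * out here.
	 */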

Brian

>  			/*
>  			 * Make sure the next punch start is correctly bound to
>  			 * the end of this data range, not the end of the folio.
> @@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode,
>  		struct folio *folio, u64 end_pos)
>  {
> -	struct iomap_page *iop = iop_alloc(inode, folio, 0);
> +	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
>  	struct iomap_ioend *ioend, *next;
>  	unsigned len = i_blocksize(inode);
>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * invalid, grab a new one.
>  	 */
>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -		if (iop && !iop_test_block_uptodate(folio, i))
> +		if (iop && !iop_test_block_dirty(folio, i))
>  			continue;
> 
>  		error = wpc->ops->map_blocks(wpc, inode, pos);
> @@ -1726,6 +1826,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		}
>  	}
> 
> +	iop_clear_range_dirty(folio, 0, end_pos - folio_pos(folio));
>  	folio_start_writeback(folio);
>  	folio_unlock(folio);
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 2ef78aa1d3f6..77c7332ae197 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
>  	.read_folio		= xfs_vm_read_folio,
>  	.readahead		= xfs_vm_readahead,
>  	.writepages		= xfs_vm_writepages,
> -	.dirty_folio		= filemap_dirty_folio,
> +	.dirty_folio		= iomap_dirty_folio,
>  	.release_folio		= iomap_release_folio,
>  	.invalidate_folio	= iomap_invalidate_folio,
>  	.bmap			= xfs_vm_bmap,
> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
> index 132f01d3461f..e508c8e97372 100644
> --- a/fs/zonefs/file.c
> +++ b/fs/zonefs/file.c
> @@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
>  	.read_folio		= zonefs_read_folio,
>  	.readahead		= zonefs_readahead,
>  	.writepages		= zonefs_writepages,
> -	.dirty_folio		= filemap_dirty_folio,
> +	.dirty_folio		= iomap_dirty_folio,
>  	.release_folio		= iomap_release_folio,
>  	.invalidate_folio	= iomap_invalidate_folio,
>  	.migrate_folio		= filemap_migrate_folio,
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 0f8123504e5e..0c2bee80565c 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
>  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>  		const struct iomap_ops *ops);
>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> --
> 2.39.2
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 2/5] iomap: Refactor iop_set_range_uptodate() function
  2023-05-15 15:09   ` Brian Foster
@ 2023-05-16 10:12     ` Ritesh Harjani
  0 siblings, 0 replies; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-16 10:12 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Ojaswin Mujoo, Disha Goel

Brian Foster <bfoster@redhat.com> writes:

> On Mon, May 08, 2023 at 12:57:57AM +0530, Ritesh Harjani (IBM) wrote:
>> This patch moves up and combine the definitions of two functions
>> (iomap_iop_set_range_uptodate() & iomap_set_range_uptodate()) into
>> iop_set_range_uptodate() & refactors it's arguments a bit.
>>
>> No functionality change in this patch.
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>
> Hi Ritesh,
>
> I just have a few random and nitty comments/questions on the series..
>

Thanks a lot for reviewing.


>>  fs/iomap/buffered-io.c | 57 ++++++++++++++++++++----------------------
>>  1 file changed, 27 insertions(+), 30 deletions(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index cbd945d96584..e732581dc2d4 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -43,6 +43,27 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
>>
>>  static struct bio_set iomap_ioend_bioset;
>>
>> +static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
>> +				   size_t off, size_t len)
>> +{
>
> Any particular reason this now takes the inode as a param now instead of
> continuing to use the folio?
>

Mainly for blkbits (or blocksize) and blocks_per_folio.
Earlier we were passing this info explicitly, but after the recent
review comments I felt the API makes more sense taking the inode and
folio (given that we want to work with the blocksize, blocks_per_folio
and the folio itself).

-ritesh

> Brian
>
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +	unsigned int first_blk = off >> inode->i_blkbits;
>> +	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
>> +	unsigned int nr_blks = last_blk - first_blk + 1;
>> +	unsigned long flags;
>> +
>> +	if (iop) {
>> +		spin_lock_irqsave(&iop->uptodate_lock, flags);
>> +		bitmap_set(iop->uptodate, first_blk, nr_blks);
>> +		if (bitmap_full(iop->uptodate,
>> +				i_blocks_per_folio(inode, folio)))
>> +			folio_mark_uptodate(folio);
>> +		spin_unlock_irqrestore(&iop->uptodate_lock, flags);
>> +	} else {
>> +		folio_mark_uptodate(folio);
>> +	}
>> +}
>> +
>>  static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>>  				    unsigned int flags)
>>  {
>> @@ -145,30 +166,6 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>>  	*lenp = plen;
>>  }
>>
>> -static void iomap_iop_set_range_uptodate(struct folio *folio,
>> -		struct iomap_page *iop, size_t off, size_t len)
>> -{
>> -	struct inode *inode = folio->mapping->host;
>> -	unsigned first = off >> inode->i_blkbits;
>> -	unsigned last = (off + len - 1) >> inode->i_blkbits;
>> -	unsigned long flags;
>> -
>> -	spin_lock_irqsave(&iop->uptodate_lock, flags);
>> -	bitmap_set(iop->uptodate, first, last - first + 1);
>> -	if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
>> -		folio_mark_uptodate(folio);
>> -	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
>> -}
>> -
>> -static void iomap_set_range_uptodate(struct folio *folio,
>> -		struct iomap_page *iop, size_t off, size_t len)
>> -{
>> -	if (iop)
>> -		iomap_iop_set_range_uptodate(folio, iop, off, len);
>> -	else
>> -		folio_mark_uptodate(folio);
>> -}
>> -
>>  static void iomap_finish_folio_read(struct folio *folio, size_t offset,
>>  		size_t len, int error)
>>  {
>> @@ -178,7 +175,8 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
>>  		folio_clear_uptodate(folio);
>>  		folio_set_error(folio);
>>  	} else {
>> -		iomap_set_range_uptodate(folio, iop, offset, len);
>> +		iop_set_range_uptodate(folio->mapping->host, folio, offset,
>> +				       len);
>>  	}
>>
>>  	if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
>> @@ -240,7 +238,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
>>  	memcpy(addr, iomap->inline_data, size);
>>  	memset(addr + size, 0, PAGE_SIZE - poff - size);
>>  	kunmap_local(addr);
>> -	iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
>> +	iop_set_range_uptodate(iter->inode, folio, offset, PAGE_SIZE - poff);
>>  	return 0;
>>  }
>>
>> @@ -277,7 +275,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>>
>>  	if (iomap_block_needs_zeroing(iter, pos)) {
>>  		folio_zero_range(folio, poff, plen);
>> -		iomap_set_range_uptodate(folio, iop, poff, plen);
>> +		iop_set_range_uptodate(iter->inode, folio, poff, plen);
>>  		goto done;
>>  	}
>>
>> @@ -598,7 +596,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>>  			if (status)
>>  				return status;
>>  		}
>> -		iomap_set_range_uptodate(folio, iop, poff, plen);
>> +		iop_set_range_uptodate(iter->inode, folio, poff, plen);
>>  	} while ((block_start += plen) < block_end);
>>
>>  	return 0;
>> @@ -705,7 +703,6 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
>>  static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>>  		size_t copied, struct folio *folio)
>>  {
>> -	struct iomap_page *iop = to_iomap_page(folio);
>>  	flush_dcache_folio(folio);
>>
>>  	/*
>> @@ -721,7 +718,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
>>  	 */
>>  	if (unlikely(copied < len && !folio_test_uptodate(folio)))
>>  		return 0;
>> -	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
>> +	iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos), len);
>>  	filemap_dirty_folio(inode->i_mapping, folio);
>>  	return copied;
>>  }
>> --
>> 2.39.2
>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 3/5] iomap: Add iop's uptodate state handling functions
  2023-05-15 15:10   ` Brian Foster
@ 2023-05-16 10:14     ` Ritesh Harjani
  0 siblings, 0 replies; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-16 10:14 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Ojaswin Mujoo, Disha Goel

Brian Foster <bfoster@redhat.com> writes:

> On Mon, May 08, 2023 at 12:57:58AM +0530, Ritesh Harjani (IBM) wrote:
>> Firstly this patch renames iop->uptodate to iop->state bitmap.
>> This is because we will add dirty state to iop->state bitmap in later
>> patches. So it makes sense to rename the iop->uptodate bitmap to
>> iop->state.
>>
>> Secondly this patch also adds other helpers for uptodate state bitmap
>> handling of iop->state.
>>
>> No functionality change in this patch.
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/iomap/buffered-io.c | 78 +++++++++++++++++++++++++++++++-----------
>>  1 file changed, 58 insertions(+), 20 deletions(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index e732581dc2d4..5103b644e115 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
> ...
>> @@ -43,6 +43,47 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
> ...
>> +/*
>> + * iop related helpers for checking uptodate/dirty state of per-block
>> + * or range of blocks within a folio
>> + */
>> +static bool iop_test_full_uptodate(struct folio *folio)
>> +{
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +	struct inode *inode = folio->mapping->host;
>> +
>> +	WARN_ON(!iop);
>
> It looks like an oops or something is imminent here if iop is NULL. Why
> the warn (here and in a couple other places)?
>

I agree. I will remove it in the next revision.

-ritesh

> Brian
>
>> +	return iop_bitmap_full(iop, i_blocks_per_folio(inode, folio));
>> +}
>> +
>> +static bool iop_test_block_uptodate(struct folio *folio, unsigned int block)
>> +{
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +
>> +	WARN_ON(!iop);
>> +	return iop_test_block(iop, block);
>> +}
>> +
>>  static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
>>  				   size_t off, size_t len)
>>  {
>> @@ -53,12 +94,11 @@ static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
>>  	unsigned long flags;
>>
>>  	if (iop) {
>> -		spin_lock_irqsave(&iop->uptodate_lock, flags);
>> -		bitmap_set(iop->uptodate, first_blk, nr_blks);
>> -		if (bitmap_full(iop->uptodate,
>> -				i_blocks_per_folio(inode, folio)))
>> +		spin_lock_irqsave(&iop->state_lock, flags);
>> +		iop_set_range(iop, first_blk, nr_blks);
>> +		if (iop_test_full_uptodate(folio))
>>  			folio_mark_uptodate(folio);
>> -		spin_unlock_irqrestore(&iop->uptodate_lock, flags);
>> +		spin_unlock_irqrestore(&iop->state_lock, flags);
>>  	} else {
>>  		folio_mark_uptodate(folio);
>>  	}
>> @@ -79,12 +119,12 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>>  	else
>>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>>
>> -	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
>> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
>>  		      gfp);
>>  	if (iop) {
>> -		spin_lock_init(&iop->uptodate_lock);
>> +		spin_lock_init(&iop->state_lock);
>>  		if (folio_test_uptodate(folio))
>> -			bitmap_fill(iop->uptodate, nr_blocks);
>> +			iop_set_range(iop, 0, nr_blocks);
>>  		folio_attach_private(folio, iop);
>>  	}
>>  	return iop;
>> @@ -93,15 +133,13 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>>  static void iop_free(struct folio *folio)
>>  {
>>  	struct iomap_page *iop = to_iomap_page(folio);
>> -	struct inode *inode = folio->mapping->host;
>> -	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>>
>>  	if (!iop)
>>  		return;
>>  	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
>>  	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
>> -	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
>> -			folio_test_uptodate(folio));
>> +	WARN_ON_ONCE(iop_test_full_uptodate(folio) !=
>> +		     folio_test_uptodate(folio));
>>  	folio_detach_private(folio);
>>  	kfree(iop);
>>  }
>> @@ -132,7 +170,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>>
>>  		/* move forward for each leading block marked uptodate */
>>  		for (i = first; i <= last; i++) {
>> -			if (!test_bit(i, iop->uptodate))
>> +			if (!iop_test_block_uptodate(folio, i))
>>  				break;
>>  			*pos += block_size;
>>  			poff += block_size;
>> @@ -142,7 +180,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
>>
>>  		/* truncate len if we find any trailing uptodate block(s) */
>>  		for ( ; i <= last; i++) {
>> -			if (test_bit(i, iop->uptodate)) {
>> +			if (iop_test_block_uptodate(folio, i)) {
>>  				plen -= (last - i + 1) * block_size;
>>  				last = i - 1;
>>  				break;
>> @@ -450,7 +488,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
>>  	last = (from + count - 1) >> inode->i_blkbits;
>>
>>  	for (i = first; i <= last; i++)
>> -		if (!test_bit(i, iop->uptodate))
>> +		if (!iop_test_block_uptodate(folio, i))
>>  			return false;
>>  	return true;
>>  }
>> @@ -1634,7 +1672,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  	 * invalid, grab a new one.
>>  	 */
>>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
>> -		if (iop && !test_bit(i, iop->uptodate))
>> +		if (iop && !iop_test_block_uptodate(folio, i))
>>  			continue;
>>
>>  		error = wpc->ops->map_blocks(wpc, inode, pos);
>> --
>> 2.39.2
>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-15 15:15   ` Brian Foster
@ 2023-05-16 14:49     ` Ritesh Harjani
  2023-05-16 19:29       ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-16 14:49 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

Brian Foster <bfoster@redhat.com> writes:

> On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
>> When filesystem blocksize is less than folio size (either with
>> mapping_large_folio_support() or with blocksize < pagesize) and when the
>> folio is uptodate in pagecache, then even a byte write can cause
>> an entire folio to be written to disk during writeback. This happens
>> because we currently don't have a mechanism to track per-block dirty
>> state within struct iomap_page. We currently only track uptodate state.
>>
>> This patch implements support for tracking per-block dirty state in
>> iomap_page->state bitmap. This should help improve the filesystem write
>> performance and help reduce write amplification.
>>
>> Performance testing of below fio workload reveals ~16x performance
>> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
>> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
>>
>> 1. <test_randwrite.fio>
>> [global]
>> 	ioengine=psync
>> 	rw=randwrite
>> 	overwrite=1
>> 	pre_read=1
>> 	direct=0
>> 	bs=4k
>> 	size=1G
>> 	dir=./
>> 	numjobs=8
>> 	fdatasync=1
>> 	runtime=60
>> 	iodepth=64
>> 	group_reporting=1
>>
>> [fio-run]
>>
>> 2. Also our internal performance team reported that this patch improves
>>    their database workload performance by around ~83% (with XFS on Power)
>>
>> Reported-by: Aravinda Herle <araherle@in.ibm.com>
>> Reported-by: Brian Foster <bfoster@redhat.com>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/gfs2/aops.c         |   2 +-
>>  fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
>>  fs/xfs/xfs_aops.c      |   2 +-
>>  fs/zonefs/file.c       |   2 +-
>>  include/linux/iomap.h  |   1 +
>>  5 files changed, 112 insertions(+), 10 deletions(-)
>>
> ...
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 25f20f269214..c7f41b26280a 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
> ...
>> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>>  	else
>>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>>
>> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
>> +	/*
>> +	 * iop->state tracks two sets of state flags when the
>> +	 * filesystem block size is smaller than the folio size.
>> +	 * The first state tracks per-block uptodate and the
>> +	 * second tracks per-block dirty state.
>> +	 */
>> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
>>  		      gfp);
>>  	if (iop) {
>>  		spin_lock_init(&iop->state_lock);
>>  		if (folio_test_uptodate(folio))
>>  			iop_set_range(iop, 0, nr_blocks);
>> +		if (is_dirty)
>> +			iop_set_range(iop, nr_blocks, nr_blocks);
>
> I find the is_dirty logic here a bit confusing. AFAICT, this is
> primarily to handle the case where a folio is completely overwritten, so
> no iop is allocated at write time, and so then writeback needs to
> allocate the iop as fully dirty to reflect that. I.e., all iop_alloc()
> callers other than iomap_writepage_map() either pass the result of
> folio_test_dirty() or explicitly dirty the entire range of the folio
> anyways.  iomap_dirty_folio() essentially does the latter because it
> needs to dirty the entire folio regardless of whether the iop already
> exists or not, right?

Yes.

>
> If so and if I'm following all of that correctly, could this complexity
> be isolated to iomap_writepage_map() by simply checking for the !iop
> case first, then call iop_alloc() immediately followed by
> set_range_dirty() of the entire folio? Then presumably iop_alloc() could
> always just dirty based on folio state with the writeback path exception
> case handled explicitly. Hm?
>

Hi Brian,

It was discussed here [1] to pass the is_dirty flag at iop allocation
time. We can do what you are suggesting, but it means extra work: we
would again do the calculations for blocks_per_folio, start and end,
and more importantly take and release the iop->state_lock spinlock.
With the above approach we get all of this done at iop allocation time
itself.

Besides, isn't it easier this way? As you also stated, we dirty all the
blocks based on the is_dirty flag, which is folio_test_dirty()
everywhere except at writeback time.
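
For reference, the dirty-range helper is shaped like the uptodate one
from the earlier patch, so the explicit path would roughly do this
(untested sketch, not the exact patch code):

	static void iop_set_range_dirty(struct inode *inode, struct folio *folio,
					size_t off, size_t len)
	{
		struct iomap_page *iop = to_iomap_page(folio);
		unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
		unsigned int first_blk = off >> inode->i_blkbits;
		unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
		unsigned int nr_blks = last_blk - first_blk + 1;
		unsigned long flags;

		if (!iop)
			return;

		spin_lock_irqsave(&iop->state_lock, flags);
		/* dirty bits live in the second half of iop->state */
		iop_set_range(iop, blks_per_folio + first_blk, nr_blks);
		spin_unlock_irqrestore(&iop->state_lock, flags);
	}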


>>  		folio_attach_private(folio, iop);
>>  	}
>>  	return iop;
> ...
>> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
>>  }
>>  EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
>>
>> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
>> +{
>> +	struct iomap_page *iop;
>> +	struct inode *inode = mapping->host;
>> +	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;
>
> folio_size()?
>

sure.

>> +
>> +	iop = iop_alloc(inode, folio, 0, false);
>> +	iop_set_range_dirty(inode, folio, 0, len);
>> +	return filemap_dirty_folio(mapping, folio);
>> +}
>> +EXPORT_SYMBOL_GPL(iomap_dirty_folio);
>> +
>>  static void
>>  iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>>  {
> ...
>> @@ -978,6 +1056,28 @@ static int iomap_write_delalloc_scan(struct inode *inode,
>>  				}
>>  			}
>>
>> +			/*
>> +			 * When we have per-block dirty tracking, there can be
>> +			 * blocks within a folio which are marked uptodate
>> +			 * but not dirty. In that case it is necessary to punch
>> +			 * out such blocks to avoid leaking any delalloc blocks.
>> +			 */
>> +			iop = to_iomap_page(folio);
>> +			if (!iop)
>> +				goto skip_iop_punch;
>> +			last_byte = min_t(loff_t, end_byte - 1,
>> +				(folio_next_index(folio) << PAGE_SHIFT) - 1);
>> +			first_blk = offset_in_folio(folio, start_byte) >>
>> +						    blkbits;
>> +			last_blk = offset_in_folio(folio, last_byte) >>
>> +						   blkbits;
>> +			blks_per_folio = i_blocks_per_folio(inode, folio);
>
> Unused?
>

Yes, I will fix it in the next rev; somehow the compiler didn't give me any warnings.

>> +			for (i = first_blk; i <= last_blk; i++) {
>> +				if (!iop_test_block_dirty(folio, i))
>> +					punch(inode, i << blkbits,
>> +						     1 << blkbits);
>> +			}
>> +skip_iop_punch:
>
> Looks reasonable at first glance, though the deep indentation here kind
> of makes my eyes cross. Could we stuff this loop into a
> iomap_write_delalloc_scan_folio() helper or some such, and then maybe
> update the comment at the top of the whole branch to explain both
> punches?

I felt that too. Sure, I will give it a try in the next spin.
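
Maybe something like this (completely untested sketch, just to get the
shape; naming is open):

	static void iomap_write_delalloc_punch(struct inode *inode,
			struct folio *folio, loff_t start_byte, loff_t end_byte,
			int (*punch)(struct inode *inode, loff_t offset,
				     loff_t length))
	{
		unsigned int blkbits = inode->i_blkbits;
		unsigned int first_blk, last_blk, i;
		loff_t last_byte;

		/* Without an iop there is no per-block state to consult. */
		if (!to_iomap_page(folio))
			return;

		/* Punch out any uptodate-but-clean blocks in this folio. */
		last_byte = min_t(loff_t, end_byte - 1,
				(folio_next_index(folio) << PAGE_SHIFT) - 1);
		first_blk = offset_in_folio(folio, start_byte) >> blkbits;
		last_blk = offset_in_folio(folio, last_byte) >> blkbits;
		for (i = first_blk; i <= last_blk; i++) {
			if (!iop_test_block_dirty(folio, i))
				/* punch() takes a file offset, hence folio_pos() */
				punch(inode, folio_pos(folio) + (i << blkbits),
						1 << blkbits);
		}
	}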

>
> WRT to the !iop case.. I _think_ this is handled correctly here because
> that means we'd handle the folio as completely dirty at writeback time.
> Is that the case? If so, it might be nice to document that assumption
> somewhere (here or perhaps in the writeback path).
>

The !iop case is simply when we don't have a large folio and blocksize ==
pagesize. In that case we don't allocate any iop and simply return from
iop_alloc(). So we just skip the loop, which is only meant for when we
have multiple blocks within a folio.


-ritesh


> Brian
>
>>  			/*
>>  			 * Make sure the next punch start is correctly bound to
>>  			 * the end of this data range, not the end of the folio.
>> @@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  		struct writeback_control *wbc, struct inode *inode,
>>  		struct folio *folio, u64 end_pos)
>>  {
>> -	struct iomap_page *iop = iop_alloc(inode, folio, 0);
>> +	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
>>  	struct iomap_ioend *ioend, *next;
>>  	unsigned len = i_blocksize(inode);
>>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
>> @@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  	 * invalid, grab a new one.
>>  	 */
>>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
>> -		if (iop && !iop_test_block_uptodate(folio, i))
>> +		if (iop && !iop_test_block_dirty(folio, i))
>>  			continue;
>>
>>  		error = wpc->ops->map_blocks(wpc, inode, pos);
>> @@ -1726,6 +1826,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  		}
>>  	}
>>
>> +	iop_clear_range_dirty(folio, 0, end_pos - folio_pos(folio));
>>  	folio_start_writeback(folio);
>>  	folio_unlock(folio);
>>
>> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
>> index 2ef78aa1d3f6..77c7332ae197 100644
>> --- a/fs/xfs/xfs_aops.c
>> +++ b/fs/xfs/xfs_aops.c
>> @@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
>>  	.read_folio		= xfs_vm_read_folio,
>>  	.readahead		= xfs_vm_readahead,
>>  	.writepages		= xfs_vm_writepages,
>> -	.dirty_folio		= filemap_dirty_folio,
>> +	.dirty_folio		= iomap_dirty_folio,
>>  	.release_folio		= iomap_release_folio,
>>  	.invalidate_folio	= iomap_invalidate_folio,
>>  	.bmap			= xfs_vm_bmap,
>> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
>> index 132f01d3461f..e508c8e97372 100644
>> --- a/fs/zonefs/file.c
>> +++ b/fs/zonefs/file.c
>> @@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
>>  	.read_folio		= zonefs_read_folio,
>>  	.readahead		= zonefs_readahead,
>>  	.writepages		= zonefs_writepages,
>> -	.dirty_folio		= filemap_dirty_folio,
>> +	.dirty_folio		= iomap_dirty_folio,
>>  	.release_folio		= iomap_release_folio,
>>  	.invalidate_folio	= iomap_invalidate_folio,
>>  	.migrate_folio		= filemap_migrate_folio,
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 0f8123504e5e..0c2bee80565c 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
>>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>>  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
>> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
>>  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>>  		const struct iomap_ops *ops);
>>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
>> --
>> 2.39.2
>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-16 14:49     ` Ritesh Harjani
@ 2023-05-16 19:29       ` Brian Foster
  2023-05-17 15:20         ` Ritesh Harjani
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2023-05-16 19:29 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

On Tue, May 16, 2023 at 08:19:31PM +0530, Ritesh Harjani wrote:
> Brian Foster <bfoster@redhat.com> writes:
> 
> > On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
> >> When filesystem blocksize is less than folio size (either with
> >> mapping_large_folio_support() or with blocksize < pagesize) and when the
> >> folio is uptodate in pagecache, then even a byte write can cause
> >> an entire folio to be written to disk during writeback. This happens
> >> because we currently don't have a mechanism to track per-block dirty
> >> state within struct iomap_page. We currently only track uptodate state.
> >>
> >> This patch implements support for tracking per-block dirty state in
> >> iomap_page->state bitmap. This should help improve the filesystem write
> >> performance and help reduce write amplification.
> >>
> >> Performance testing of below fio workload reveals ~16x performance
> >> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
> >> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
> >>
> >> 1. <test_randwrite.fio>
> >> [global]
> >> 	ioengine=psync
> >> 	rw=randwrite
> >> 	overwrite=1
> >> 	pre_read=1
> >> 	direct=0
> >> 	bs=4k
> >> 	size=1G
> >> 	dir=./
> >> 	numjobs=8
> >> 	fdatasync=1
> >> 	runtime=60
> >> 	iodepth=64
> >> 	group_reporting=1
> >>
> >> [fio-run]
> >>
> >> 2. Also our internal performance team reported that this patch improves
> >>    their database workload performance by around ~83% (with XFS on Power)
> >>
> >> Reported-by: Aravinda Herle <araherle@in.ibm.com>
> >> Reported-by: Brian Foster <bfoster@redhat.com>
> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> >> ---
> >>  fs/gfs2/aops.c         |   2 +-
> >>  fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
> >>  fs/xfs/xfs_aops.c      |   2 +-
> >>  fs/zonefs/file.c       |   2 +-
> >>  include/linux/iomap.h  |   1 +
> >>  5 files changed, 112 insertions(+), 10 deletions(-)
> >>
> > ...
> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >> index 25f20f269214..c7f41b26280a 100644
> >> --- a/fs/iomap/buffered-io.c
> >> +++ b/fs/iomap/buffered-io.c
> > ...
> >> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
> >>  	else
> >>  		gfp = GFP_NOFS | __GFP_NOFAIL;
> >>
> >> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> >> +	/*
> >> +	 * iop->state tracks two sets of state flags when the
> >> +	 * filesystem block size is smaller than the folio size.
> >> +	 * The first state tracks per-block uptodate and the
> >> +	 * second tracks per-block dirty state.
> >> +	 */
> >> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
> >>  		      gfp);
> >>  	if (iop) {
> >>  		spin_lock_init(&iop->state_lock);
> >>  		if (folio_test_uptodate(folio))
> >>  			iop_set_range(iop, 0, nr_blocks);
> >> +		if (is_dirty)
> >> +			iop_set_range(iop, nr_blocks, nr_blocks);
> >
> > I find the is_dirty logic here a bit confusing. AFAICT, this is
> > primarily to handle the case where a folio is completely overwritten, so
> > no iop is allocated at write time, and so then writeback needs to
> > allocate the iop as fully dirty to reflect that. I.e., all iop_alloc()
> > callers other than iomap_writepage_map() either pass the result of
> > folio_test_dirty() or explicitly dirty the entire range of the folio
> > anyways.  iomap_dirty_folio() essentially does the latter because it
> > needs to dirty the entire folio regardless of whether the iop already
> > exists or not, right?
> 
> Yes.
> 
> >
> > If so and if I'm following all of that correctly, could this complexity
> > be isolated to iomap_writepage_map() by simply checking for the !iop
> > case first, then call iop_alloc() immediately followed by
> > set_range_dirty() of the entire folio? Then presumably iop_alloc() could
> > always just dirty based on folio state with the writeback path exception
> > case handled explicitly. Hm?
> >
> 
> Hi Brian,
> 
> It was discussed here [1] to pass is_dirty flag at the time of iop
> allocation. We can do what you are essentially suggesting, but it's just
> extra work i.e. we will again do some calculations of blocks_per_folio,
> start, end and more importantly take and release iop->state_lock
> spinlock. Whereas with above approach we could get away with this at the
> time of iop allocation itself.
> 

Hi Ritesh,

Isn't that extra work already occurring in iomap_dirty_folio()? I was
just thinking that maybe moving it to where it's apparently needed (i.e.
writeback) might eliminate the need for the param.

I suppose iomap_dirty_folio() would need to call filemap_dirty_folio()
first to make sure iop_alloc() sees the dirty state, but maybe that
would also allow skipping the iop alloc if the folio was already dirty
(i.e. if the folio was previously dirtied by a full buffered overwrite
for example)?

I've appended a quick diff below (compile tested only) just to explain
what I mean. When doing that it also occurred to me that if we really
care about the separate call, we could keep the is_dirty param but do
the __iop_alloc() wrapper thing where iop_alloc() always passes
folio_test_dirty().

BTW, I think you left off your [1] discussion reference..

> Besides, isn't it easier this way? As you also stated, we dirty all the
> blocks based on the is_dirty flag, which is folio_test_dirty()
> everywhere except at writeback time.
> 

My thinking was just that I kind of had to read through all of the
iop_alloc() callsites to grok the purpose of the parameter, which made
it seem unnecessarily confusing. But ultimately it made sense, so I
don't insist on changing it or anything if this approach is intentional
and/or preferred by others. That's just my .02 and I'll defer to your
preference. :)

> 
> >>  		folio_attach_private(folio, iop);
> >>  	}
> >>  	return iop;
> > ...
> >> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
...
> >
> > WRT to the !iop case.. I _think_ this is handled correctly here because
> > that means we'd handle the folio as completely dirty at writeback time.
> > Is that the case? If so, it might be nice to document that assumption
> > somewhere (here or perhaps in the writeback path).
> >
> 
> !iop case is simply when we don't have a large folio and blocksize ==
>  pagesize. In that case we don't allocate any iop and simply returns
>  from iop_alloc().
> So then we just skip the loop which is only meant when we have blocks
> within a folio.
> 

Isn't it also the case that iop might be NULL at this point if the fs
has sub-folio blocks, but the folio was dirtied by a full overwrite of
the folio? Or did I misunderstand patch 4?

Brian

--- 8< ---

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 92e1e1061225..89b3053e3f2d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -155,7 +155,7 @@ static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
 }
 
 static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
-				    unsigned int flags, bool is_dirty)
+				    unsigned int flags)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -181,7 +181,7 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
 		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
 			iop_set_range(iop, 0, nr_blocks);
-		if (is_dirty)
+		if (folio_test_dirty(folio))
 			iop_set_range(iop, nr_blocks, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
@@ -326,8 +326,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iop_alloc(iter->inode, folio, iter->flags,
-				folio_test_dirty(folio));
+		iop = iop_alloc(iter->inode, folio, iter->flags);
 	else
 		iop = to_iomap_page(folio);
 
@@ -365,8 +364,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		return iomap_read_inline_data(iter, folio);
 
 	/* zero post-eof blocks as the page may be mapped */
-	iop = iop_alloc(iter->inode, folio, iter->flags,
-			folio_test_dirty(folio));
+	iop = iop_alloc(iter->inode, folio, iter->flags);
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
@@ -616,13 +614,10 @@ EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
 
 bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
 {
-	struct iomap_page *iop;
-	struct inode *inode = mapping->host;
-	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;
-
-	iop = iop_alloc(inode, folio, 0, false);
-	iop_set_range_dirty(inode, folio, 0, len);
-	return filemap_dirty_folio(mapping, folio);
+	bool dirtied = filemap_dirty_folio(mapping, folio);
+	if (dirtied)
+		iop_alloc(mapping->host, folio, 0);
+	return dirtied;
 }
 EXPORT_SYMBOL_GPL(iomap_dirty_folio);
 
@@ -673,8 +668,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;
 
-	iop = iop_alloc(iter->inode, folio, iter->flags,
-			folio_test_dirty(folio));
+	iop = iop_alloc(iter->inode, folio, iter->flags);
 
 	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
 		return -EAGAIN;
@@ -1759,7 +1753,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
+	struct iomap_page *iop = to_iomap_page(folio);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1767,6 +1761,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	int error = 0, count = 0, i;
 	LIST_HEAD(submit_list);
 
+	if (!iop) {
+		iop = iop_alloc(inode, folio, 0);
+		iop_set_range_dirty(inode, folio, 0, folio_size(folio));
+	}
+
 	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
 
 	/*


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-16 19:29       ` Brian Foster
@ 2023-05-17 15:20         ` Ritesh Harjani
  2023-05-17 18:48           ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-17 15:20 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

Brian Foster <bfoster@redhat.com> writes:

> On Tue, May 16, 2023 at 08:19:31PM +0530, Ritesh Harjani wrote:
>> Brian Foster <bfoster@redhat.com> writes:
>>
>> > On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
>> >> When filesystem blocksize is less than folio size (either with
>> >> mapping_large_folio_support() or with blocksize < pagesize) and when the
>> >> folio is uptodate in pagecache, then even a byte write can cause
>> >> an entire folio to be written to disk during writeback. This happens
>> >> because we currently don't have a mechanism to track per-block dirty
>> >> state within struct iomap_page. We currently only track uptodate state.
>> >>
>> >> This patch implements support for tracking per-block dirty state in
>> >> iomap_page->state bitmap. This should help improve the filesystem write
>> >> performance and help reduce write amplification.
>> >>
>> >> Performance testing of below fio workload reveals ~16x performance
>> >> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
>> >> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
>> >>
>> >> 1. <test_randwrite.fio>
>> >> [global]
>> >> 	ioengine=psync
>> >> 	rw=randwrite
>> >> 	overwrite=1
>> >> 	pre_read=1
>> >> 	direct=0
>> >> 	bs=4k
>> >> 	size=1G
>> >> 	dir=./
>> >> 	numjobs=8
>> >> 	fdatasync=1
>> >> 	runtime=60
>> >> 	iodepth=64
>> >> 	group_reporting=1
>> >>
>> >> [fio-run]
>> >>
>> >> 2. Also our internal performance team reported that this patch improves
>> >>    their database workload performance by around ~83% (with XFS on Power)
>> >>
>> >> Reported-by: Aravinda Herle <araherle@in.ibm.com>
>> >> Reported-by: Brian Foster <bfoster@redhat.com>
>> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> >> ---
>> >>  fs/gfs2/aops.c         |   2 +-
>> >>  fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
>> >>  fs/xfs/xfs_aops.c      |   2 +-
>> >>  fs/zonefs/file.c       |   2 +-
>> >>  include/linux/iomap.h  |   1 +
>> >>  5 files changed, 112 insertions(+), 10 deletions(-)
>> >>
>> > ...
>> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> >> index 25f20f269214..c7f41b26280a 100644
>> >> --- a/fs/iomap/buffered-io.c
>> >> +++ b/fs/iomap/buffered-io.c
>> > ...
>> >> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>> >>  	else
>> >>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>> >>
>> >> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
>> >> +	/*
>> >> +	 * iop->state tracks two sets of state flags when the
>> >> +	 * filesystem block size is smaller than the folio size.
>> >> +	 * The first state tracks per-block uptodate and the
>> >> +	 * second tracks per-block dirty state.
>> >> +	 */
>> >> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
>> >>  		      gfp);
>> >>  	if (iop) {
>> >>  		spin_lock_init(&iop->state_lock);
>> >>  		if (folio_test_uptodate(folio))
>> >>  			iop_set_range(iop, 0, nr_blocks);
>> >> +		if (is_dirty)
>> >> +			iop_set_range(iop, nr_blocks, nr_blocks);
>> >
>> > I find the is_dirty logic here a bit confusing. AFAICT, this is
>> > primarily to handle the case where a folio is completely overwritten, so
>> > no iop is allocated at write time, and so then writeback needs to
>> > allocate the iop as fully dirty to reflect that. I.e., all iop_alloc()
>> > callers other than iomap_writepage_map() either pass the result of
>> > folio_test_dirty() or explicitly dirty the entire range of the folio
>> > anyways.  iomap_dirty_folio() essentially does the latter because it
>> > needs to dirty the entire folio regardless of whether the iop already
>> > exists or not, right?
>>
>> Yes.
>>
>> >
>> > If so and if I'm following all of that correctly, could this complexity
>> > be isolated to iomap_writepage_map() by simply checking for the !iop
>> > case first, then call iop_alloc() immediately followed by
>> > set_range_dirty() of the entire folio? Then presumably iop_alloc() could
>> > always just dirty based on folio state with the writeback path exception
>> > case handled explicitly. Hm?
>> >
>>
>> Hi Brian,
>>
>> It was discussed here [1] to pass the is_dirty flag at iop allocation
>> time. We can do what you are suggesting, but it means extra work: we
>> would again do the calculations for blocks_per_folio, start and end,
>> and more importantly take and release the iop->state_lock spinlock.
>> With the above approach we get all of this done at iop allocation time
>> itself.
>>
>
> Hi Ritesh,
>
> Isn't that extra work already occurring in iomap_dirty_folio()? I was
> just thinking that maybe moving it to where it's apparently needed (i.e.
> writeback) might eliminate the need for the param.
>
> I suppose iomap_dirty_folio() would need to call filemap_dirty_folio()
> first to make sure iop_alloc() sees the dirty state, but maybe that
> would also allow skipping the iop alloc if the folio was already dirty
> (i.e. if the folio was previously dirtied by a full buffered overwrite
> for example)?
>
> I've appended a quick diff below (compile tested only) just to explain
> what I mean. When doing that it also occurred to me that if we really
> care about the separate call, we could keep the is_dirty param but do
> the __iop_alloc() wrapper thing where iop_alloc() always passes
> folio_test_dirty().

Sure, Brian. I guess when we got the comment it was still early in the
work and I was anyway passing a from_writeback flag. Thanks for the diff,
it does make sense to me. I will try to add that change in the next revision.

>
> BTW, I think you left off your [1] discussion reference..
>

Sorry, my bad.

[1]: https://lore.kernel.org/linux-xfs/Y9f7cZxnXbL7x0p+@infradead.org/

>> Besides, isn't it easier this way? As you also stated, we dirty all the
>> blocks based on the is_dirty flag, which is folio_test_dirty()
>> everywhere except at writeback time.
>>
>
> My thinking was just that I kind of had to read through all of the
> iop_alloc() callsites to grok the purpose of the parameter, which made
> it seem unnecessarily confusing. But ultimately it made sense, so I
> don't insist on changing it or anything if this approach is intentional
> and/or preferred by others. That's just my .02 and I'll defer to your
> preference. :)
>

Sure Thanks!

>>
>> >>  		folio_attach_private(folio, iop);
>> >>  	}
>> >>  	return iop;
>> > ...
>> >> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
> ...
>> >
>> > WRT to the !iop case.. I _think_ this is handled correctly here because
>> > that means we'd handle the folio as completely dirty at writeback time.
>> > Is that the case? If so, it might be nice to document that assumption
>> > somewhere (here or perhaps in the writeback path).
>> >
>>
>> The !iop case is simply when we don't have a large folio and blocksize ==
>> pagesize. In that case we don't allocate any iop and simply return from
>> iop_alloc(). So we just skip the loop, which is only meant for when we
>> have multiple blocks within a folio.
>>
>
> Isn't it also the case that iop might be NULL at this point if the fs
> has sub-folio blocks, but the folio was dirtied by a full overwrite of
> the folio? Or did I misunderstand patch 4?
>

Yes, both of those cases. We can either have bs == ps, or we can have a
completely overwritten folio, in which case no iop has been allocated by
the time we reach the iomap_writepage_map() function.

Sure. I will document that when we factor out that change into a separate function.

> Brian
>
> --- 8< ---
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 92e1e1061225..89b3053e3f2d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -155,7 +155,7 @@ static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
>  }
>
>  static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
> -				    unsigned int flags, bool is_dirty)
> +				    unsigned int flags)
>  {
>  	struct iomap_page *iop = to_iomap_page(folio);
>  	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> @@ -181,7 +181,7 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
>  		spin_lock_init(&iop->state_lock);
>  		if (folio_test_uptodate(folio))
>  			iop_set_range(iop, 0, nr_blocks);
> -		if (is_dirty)
> +		if (folio_test_dirty(folio))
>  			iop_set_range(iop, nr_blocks, nr_blocks);
>  		folio_attach_private(folio, iop);
>  	}
> @@ -326,8 +326,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
>  	if (WARN_ON_ONCE(size > iomap->length))
>  		return -EIO;
>  	if (offset > 0)
> -		iop = iop_alloc(iter->inode, folio, iter->flags,
> -				folio_test_dirty(folio));
> +		iop = iop_alloc(iter->inode, folio, iter->flags);
>  	else
>  		iop = to_iomap_page(folio);
>
> @@ -365,8 +364,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  		return iomap_read_inline_data(iter, folio);
>
>  	/* zero post-eof blocks as the page may be mapped */
> -	iop = iop_alloc(iter->inode, folio, iter->flags,
> -			folio_test_dirty(folio));
> +	iop = iop_alloc(iter->inode, folio, iter->flags);
>  	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
>  	if (plen == 0)
>  		goto done;
> @@ -616,13 +614,10 @@ EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
>
>  bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
>  {
> -	struct iomap_page *iop;
> -	struct inode *inode = mapping->host;
> -	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;
> -
> -	iop = iop_alloc(inode, folio, 0, false);
> -	iop_set_range_dirty(inode, folio, 0, len);
> -	return filemap_dirty_folio(mapping, folio);
> +	bool dirtied = filemap_dirty_folio(mapping, folio);
> +	if (dirtied)
> +		iop_alloc(mapping->host, folio, 0);
> +	return dirtied;
>  }
>  EXPORT_SYMBOL_GPL(iomap_dirty_folio);
>
> @@ -673,8 +668,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  	    pos + len >= folio_pos(folio) + folio_size(folio))
>  		return 0;
>
> -	iop = iop_alloc(iter->inode, folio, iter->flags,
> -			folio_test_dirty(folio));
> +	iop = iop_alloc(iter->inode, folio, iter->flags);
>
>  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
>  		return -EAGAIN;
> @@ -1759,7 +1753,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode,
>  		struct folio *folio, u64 end_pos)
>  {
> -	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
> +	struct iomap_page *iop = to_iomap_page(folio);
>  	struct iomap_ioend *ioend, *next;
>  	unsigned len = i_blocksize(inode);
>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1767,6 +1761,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	int error = 0, count = 0, i;
>  	LIST_HEAD(submit_list);
>
> +	if (!iop) {
> +		iop = iop_alloc(inode, folio, 0);
> +		iop_set_range_dirty(inode, folio, 0, folio_size(folio));
> +	}
> +
>  	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
>
>  	/*

Thanks for the review!

-ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-17 15:20         ` Ritesh Harjani
@ 2023-05-17 18:48           ` Brian Foster
  2023-05-18 13:23             ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2023-05-17 18:48 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

On Wed, May 17, 2023 at 08:50:39PM +0530, Ritesh Harjani wrote:
> Brian Foster <bfoster@redhat.com> writes:
> 
> > On Tue, May 16, 2023 at 08:19:31PM +0530, Ritesh Harjani wrote:
> >> Brian Foster <bfoster@redhat.com> writes:
> >>
> >> > On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
> >> >> When filesystem blocksize is less than folio size (either with
> >> >> mapping_large_folio_support() or with blocksize < pagesize) and when the
> >> >> folio is uptodate in pagecache, then even a byte write can cause
> >> >> an entire folio to be written to disk during writeback. This happens
> >> >> because we currently don't have a mechanism to track per-block dirty
> >> >> state within struct iomap_page. We currently only track uptodate state.
> >> >>
> >> >> This patch implements support for tracking per-block dirty state in
> >> >> iomap_page->state bitmap. This should help improve the filesystem write
> >> >> performance and help reduce write amplification.
> >> >>
> >> >> Performance testing of below fio workload reveals ~16x performance
> >> >> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
> >> >> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
> >> >>
> >> >> 1. <test_randwrite.fio>
> >> >> [global]
> >> >> 	ioengine=psync
> >> >> 	rw=randwrite
> >> >> 	overwrite=1
> >> >> 	pre_read=1
> >> >> 	direct=0
> >> >> 	bs=4k
> >> >> 	size=1G
> >> >> 	dir=./
> >> >> 	numjobs=8
> >> >> 	fdatasync=1
> >> >> 	runtime=60
> >> >> 	iodepth=64
> >> >> 	group_reporting=1
> >> >>
> >> >> [fio-run]
> >> >>
> >> >> 2. Also our internal performance team reported that this patch improves
> >> >>    their database workload performance by around ~83% (with XFS on Power)
> >> >>
> >> >> Reported-by: Aravinda Herle <araherle@in.ibm.com>
> >> >> Reported-by: Brian Foster <bfoster@redhat.com>
> >> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> >> >> ---
> >> >>  fs/gfs2/aops.c         |   2 +-
> >> >>  fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
> >> >>  fs/xfs/xfs_aops.c      |   2 +-
> >> >>  fs/zonefs/file.c       |   2 +-
> >> >>  include/linux/iomap.h  |   1 +
> >> >>  5 files changed, 112 insertions(+), 10 deletions(-)
> >> >>
> >> > ...
> >> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >> >> index 25f20f269214..c7f41b26280a 100644
> >> >> --- a/fs/iomap/buffered-io.c
> >> >> +++ b/fs/iomap/buffered-io.c
> >> > ...
> >> >> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
> >> >>  	else
> >> >>  		gfp = GFP_NOFS | __GFP_NOFAIL;
> >> >>
> >> >> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> >> >> +	/*
> >> >> +	 * iop->state tracks two sets of state flags when the
> >> >> +	 * filesystem block size is smaller than the folio size.
> >> >> +	 * The first state tracks per-block uptodate and the
> >> >> +	 * second tracks per-block dirty state.
> >> >> +	 */
> >> >> +	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
> >> >>  		      gfp);
> >> >>  	if (iop) {
> >> >>  		spin_lock_init(&iop->state_lock);
> >> >>  		if (folio_test_uptodate(folio))
> >> >>  			iop_set_range(iop, 0, nr_blocks);
> >> >> +		if (is_dirty)
> >> >> +			iop_set_range(iop, nr_blocks, nr_blocks);
> >> >
> >> > I find the is_dirty logic here a bit confusing. AFAICT, this is
> >> > primarily to handle the case where a folio is completely overwritten, so
> >> > no iop is allocated at write time, and so then writeback needs to
> >> > allocate the iop as fully dirty to reflect that. I.e., all iop_alloc()
> >> > callers other than iomap_writepage_map() either pass the result of
> >> > folio_test_dirty() or explicitly dirty the entire range of the folio
> >> > anyways.  iomap_dirty_folio() essentially does the latter because it
> >> > needs to dirty the entire folio regardless of whether the iop already
> >> > exists or not, right?
> >>
> >> Yes.
> >>
> >> >
> >> > If so and if I'm following all of that correctly, could this complexity
> >> > be isolated to iomap_writepage_map() by simply checking for the !iop
> >> > case first, then call iop_alloc() immediately followed by
> >> > set_range_dirty() of the entire folio? Then presumably iop_alloc() could
> >> > always just dirty based on folio state with the writeback path exception
> >> > case handled explicitly. Hm?
> >> >
> >>
> >> Hi Brian,
> >>
> >> It was discussed here [1] to pass the is_dirty flag at iop allocation
> >> time. We can do what you are suggesting, but it means extra work: we
> >> would again do the calculations for blocks_per_folio, start and end,
> >> and more importantly take and release the iop->state_lock spinlock.
> >> With the above approach we get all of this done at iop allocation time
> >> itself.
> >>
> >
> > Hi Ritesh,
> >
> > Isn't that extra work already occurring in iomap_dirty_folio()? I was
> > just thinking that maybe moving it to where it's apparently needed (i.e.
> > writeback) might eliminate the need for the param.
> >
> > I suppose iomap_dirty_folio() would need to call filemap_dirty_folio()
> > first to make sure iop_alloc() sees the dirty state, but maybe that
> > would also allow skipping the iop alloc if the folio was already dirty
> > (i.e. if the folio was previously dirtied by a full buffered overwrite
> > for example)?
> >
> > I've appended a quick diff below (compile tested only) just to explain
> > what I mean. When doing that it also occurred to me that if we really
> > care about the separate call, we could keep the is_dirty param but do
> > the __iop_alloc() wrapper thing where iop_alloc() always passes
> > folio_test_dirty().
> 
> Sure, Brian. I guess when we got the comment, it was still in the initial
> work & I was anyway passing a from_writeback flag. Thanks for the diff,
> it does make sense to me. I will try to add that change in the next revision.
> 

Ok, though I think what I did to iomap_dirty_folio() might be wrong...
If we have a folio/iop that is already partially dirty with sub-folio
blocks, and then that folio is mapped and write faulted, we still need
to explicitly dirty the rest of the iop during the fault, right? If so,
then I guess we'd still need to keep the iop_set_range_dirty() call
there at least for that case.

Hmm.. so I suppose I could see doing that a couple different ways. One
is just to just keep it as you already have it and just drop the dirty
param. I.e.:

bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
{
        iop_alloc(mapping->host, folio, 0);
        iop_set_range_dirty(mapping->host, folio, 0, folio_size(folio));
        return filemap_dirty_folio(mapping, folio);
}

But I also wonder.. if we can skip the iop alloc on full folio buffered
overwrites, isn't that also true of mapped writes to folios that don't
already have an iop? I.e., I think what I was trying to do in the
previous diff was actually something like this:

bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
{
        iop_set_range_dirty(mapping->host, folio, 0, folio_size(folio));
        return filemap_dirty_folio(mapping, folio);
}

... which would only dirty the full iop if it already exists. Thoughts?

Brian

> >
> > BTW, I think you left off your [1] discussion reference..
> >
> 
> Sorry, my bad.
> 
> [1]: https://lore.kernel.org/linux-xfs/Y9f7cZxnXbL7x0p+@infradead.org/
> 
> >> Besides, isn't it easier this way? As you also stated, we will dirty
> >> all the blocks based on the is_dirty flag, which is folio_test_dirty()
> >> except at writeback time.
> >>
> >
> > My thinking was just that I kind of had to read through all of the
> > iop_alloc() callsites to grok the purpose of the parameter, which made
> > it seem unnecessarily confusing. But ultimately it made sense, so I
> > don't insist on changing it or anything if this approach is intentional
> > and/or preferred by others. That's just my .02 and I'll defer to your
> > preference. :)
> >
> 
> Sure Thanks!
> 
> >>
> >> >>  		folio_attach_private(folio, iop);
> >> >>  	}
> >> >>  	return iop;
> >> > ...
> >> >> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
> > ...
> >> >
> >> > WRT the !iop case.. I _think_ this is handled correctly here because
> >> > that means we'd handle the folio as completely dirty at writeback time.
> >> > Is that the case? If so, it might be nice to document that assumption
> >> > somewhere (here or perhaps in the writeback path).
> >> >
> >>
> >> The !iop case is simply when we don't have a large folio and blocksize ==
> >> pagesize. In that case we don't allocate any iop and simply return
> >> from iop_alloc().
> >> So then we just skip the loop, which is only needed when we have multiple
> >> blocks within a folio.
> >>
> >
> > Isn't it also the case that iop might be NULL at this point if the fs
> > has sub-folio blocks, but the folio was dirtied by a full overwrite of
> > the folio? Or did I misunderstand patch 4?
> >
> 
> Yes. Both of the cases. We can either have bs == ps, or we can have a
> completely overwritten folio, in which case we don't allocate any iop in
> the iomap_writepage_map() function.
> 
> Sure. I will document that when we factor out that change into a separate function.
> 
> > Brian
> >
> > --- 8< ---
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 92e1e1061225..89b3053e3f2d 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -155,7 +155,7 @@ static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
> >  }
> >
> >  static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
> > -				    unsigned int flags, bool is_dirty)
> > +				    unsigned int flags)
> >  {
> >  	struct iomap_page *iop = to_iomap_page(folio);
> >  	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> > @@ -181,7 +181,7 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
> >  		spin_lock_init(&iop->state_lock);
> >  		if (folio_test_uptodate(folio))
> >  			iop_set_range(iop, 0, nr_blocks);
> > -		if (is_dirty)
> > +		if (folio_test_dirty(folio))
> >  			iop_set_range(iop, nr_blocks, nr_blocks);
> >  		folio_attach_private(folio, iop);
> >  	}
> > @@ -326,8 +326,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
> >  	if (WARN_ON_ONCE(size > iomap->length))
> >  		return -EIO;
> >  	if (offset > 0)
> > -		iop = iop_alloc(iter->inode, folio, iter->flags,
> > -				folio_test_dirty(folio));
> > +		iop = iop_alloc(iter->inode, folio, iter->flags);
> >  	else
> >  		iop = to_iomap_page(folio);
> >
> > @@ -365,8 +364,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
> >  		return iomap_read_inline_data(iter, folio);
> >
> >  	/* zero post-eof blocks as the page may be mapped */
> > -	iop = iop_alloc(iter->inode, folio, iter->flags,
> > -			folio_test_dirty(folio));
> > +	iop = iop_alloc(iter->inode, folio, iter->flags);
> >  	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
> >  	if (plen == 0)
> >  		goto done;
> > @@ -616,13 +614,10 @@ EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
> >
> >  bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> >  {
> > -	struct iomap_page *iop;
> > -	struct inode *inode = mapping->host;
> > -	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;
> > -
> > -	iop = iop_alloc(inode, folio, 0, false);
> > -	iop_set_range_dirty(inode, folio, 0, len);
> > -	return filemap_dirty_folio(mapping, folio);
> > +	bool dirtied = filemap_dirty_folio(mapping, folio);
> > +	if (dirtied)
> > +		iop_alloc(mapping->host, folio, 0);
> > +	return dirtied;
> >  }
> >  EXPORT_SYMBOL_GPL(iomap_dirty_folio);
> >
> > @@ -673,8 +668,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> >  	    pos + len >= folio_pos(folio) + folio_size(folio))
> >  		return 0;
> >
> > -	iop = iop_alloc(iter->inode, folio, iter->flags,
> > -			folio_test_dirty(folio));
> > +	iop = iop_alloc(iter->inode, folio, iter->flags);
> >
> >  	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
> >  		return -EAGAIN;
> > @@ -1759,7 +1753,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> >  		struct writeback_control *wbc, struct inode *inode,
> >  		struct folio *folio, u64 end_pos)
> >  {
> > -	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
> > +	struct iomap_page *iop = to_iomap_page(folio);
> >  	struct iomap_ioend *ioend, *next;
> >  	unsigned len = i_blocksize(inode);
> >  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> > @@ -1767,6 +1761,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> >  	int error = 0, count = 0, i;
> >  	LIST_HEAD(submit_list);
> >
> > +	if (!iop) {
> > +		iop = iop_alloc(inode, folio, 0);
> > +		iop_set_range_dirty(inode, folio, 0, folio_size(folio));
> > +	}
> > +
> >  	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
> >
> >  	/*
> 
> Thanks for the review!
> 
> -ritesh
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 1/5] iomap: Rename iomap_page_create/release() to iop_alloc/free()
  2023-05-07 19:27 ` [RFCv5 1/5] iomap: Rename iomap_page_create/release() to iop_alloc/free() Ritesh Harjani (IBM)
@ 2023-05-18  6:13   ` Christoph Hellwig
  2023-05-19 15:01     ` Ritesh Harjani
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2023-05-18  6:13 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel

On Mon, May 08, 2023 at 12:57:56AM +0530, Ritesh Harjani (IBM) wrote:
> This patch renames the iomap_page_create/release() functions to
> iop_alloc/free() calls. Later patches add more functions for
> handling the iop structure with iop_** naming conventions.
> Hence iop_alloc/free() makes more sense.

I can't say I like the iop_* naming all that much, especially as it
is very generic and we use an iomap_ prefix everywhere else.

> Note, this patch also moves folio_detach_private() to happen later,
> after checking for bitmap_full(). This is just another small refactor,
> because in later patches we will move the bitmap_** helpers to iop_**
> helpers which will only take a folio, and hence we should move
> folio_detach_private() to the end, before calling kfree(iop).

Please don't mix renames and code movements.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 2/5] iomap: Refactor iop_set_range_uptodate() function
  2023-05-07 19:27 ` [RFCv5 2/5] iomap: Refactor iop_set_range_uptodate() function Ritesh Harjani (IBM)
  2023-05-15 15:09   ` Brian Foster
@ 2023-05-18  6:16   ` Christoph Hellwig
  2023-05-19 15:03     ` Ritesh Harjani
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2023-05-18  6:16 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel

> +	if (iop) {
> +		spin_lock_irqsave(&iop->uptodate_lock, flags);
> +		bitmap_set(iop->uptodate, first_blk, nr_blks);
> +		if (bitmap_full(iop->uptodate,
> +				i_blocks_per_folio(inode, folio)))
> +			folio_mark_uptodate(folio);
> +		spin_unlock_irqrestore(&iop->uptodate_lock, flags);
> +	} else {
> +		folio_mark_uptodate(folio);
> +	}

If we did a:

	if (!iop) {
		folio_mark_uptodate(folio);
		return;
	}

we can remove a level of indentation and keep things a bit simpler.
But I can live with either style.
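
For illustration, applying that early return to the whole helper would
look roughly like this (a sketch only; the bitmap and lock names follow
the quoted hunk, the exact signature is assumed):

static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
                                   size_t off, size_t len)
{
        struct iomap_page *iop = to_iomap_page(folio);
        unsigned int first_blk = off >> inode->i_blkbits;
        unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
        unsigned int nr_blks = last_blk - first_blk + 1;
        unsigned long flags;

        /* without an iop there is only one "block": the folio itself */
        if (!iop) {
                folio_mark_uptodate(folio);
                return;
        }

        spin_lock_irqsave(&iop->uptodate_lock, flags);
        bitmap_set(iop->uptodate, first_blk, nr_blks);
        if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
                folio_mark_uptodate(folio);
        spin_unlock_irqrestore(&iop->uptodate_lock, flags);
}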


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 3/5] iomap: Add iop's uptodate state handling functions
  2023-05-07 19:27 ` [RFCv5 3/5] iomap: Add iop's uptodate state handling functions Ritesh Harjani (IBM)
  2023-05-15 15:10   ` Brian Foster
@ 2023-05-18  6:18   ` Christoph Hellwig
  2023-05-19 15:07     ` Ritesh Harjani
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2023-05-18  6:18 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel

> + * inline helpers for bitmap operations on iop->state
> + */
> +static inline void iop_set_range(struct iomap_page *iop, unsigned int start_blk,
> +				 unsigned int nr_blks)
> +{
> +	bitmap_set(iop->state, start_blk, nr_blks);
> +}
> +
> +static inline bool iop_test_block(struct iomap_page *iop, unsigned int block)
> +{
> +	return test_bit(block, iop->state);
> +}
> +
> +static inline bool iop_bitmap_full(struct iomap_page *iop,
> +				   unsigned int blks_per_folio)
> +{
> +	return bitmap_full(iop->state, blks_per_folio);
> +}

I don't really see much point in these helpers, any particular reason
for adding them?

> +/*
> + * iop related helpers for checking uptodate/dirty state of per-block
> + * or range of blocks within a folio
> + */

I'm also not sure this comment adds a whole lot of value.

The rest looks good modulo the WARN_ONs already mentioned by Brian.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 4/5] iomap: Allocate iop in ->write_begin() early
  2023-05-07 19:27 ` [RFCv5 4/5] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
@ 2023-05-18  6:21   ` Christoph Hellwig
  2023-05-19 15:18     ` Ritesh Harjani
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2023-05-18  6:21 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel

On Mon, May 08, 2023 at 12:57:59AM +0530, Ritesh Harjani (IBM) wrote:
> Earlier when the folio is uptodate, we only allocate iop at writeback

s/Earlier/Currently/ ?

> time (in iomap_writepage_map()). This is ok until now, but when we are
> going to add support for per-block dirty state bitmap in iop, this
> could cause some performance degradation. The reason is that if we don't
> allocate iop during ->write_begin(), then we will never mark the
> necessary dirty bits in ->write_end() call. And we will have to mark all
> the bits as dirty at the writeback time, that could cause the same write
> amplification and performance problems as it is now.
> 
> However, for all the writes with (pos, len) which completely overlaps
> the given folio, there is no need to allocate an iop during
> ->write_begin(). So skip those cases.

This reads a bit backwards; I'd suggest mentioning that early
allocation only happens for sub-page writes before going into the
details.

The changes themselves look good to me.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-17 18:48           ` Brian Foster
@ 2023-05-18 13:23             ` Christoph Hellwig
  2023-05-18 16:15               ` Matthew Wilcox
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2023-05-18 13:23 UTC (permalink / raw)
  To: Brian Foster
  Cc: Ritesh Harjani, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Dave Chinner, Ojaswin Mujoo, Disha Goel, Aravinda Herle

On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> But I also wonder.. if we can skip the iop alloc on full folio buffered
> overwrites, isn't that also true of mapped writes to folios that don't
> already have an iop?

Yes.

> I.e., I think what I was trying to do in the
> previous diff was actually something like this:
> 
> bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> {
>         iop_set_range_dirty(mapping->host, folio, 0, folio_size(folio));
>         return filemap_dirty_folio(mapping, folio);
> }
> 
> ... which would only dirty the full iop if it already exists. Thoughts?

That does sound pretty good to me, and I can't think of any obvious
pitfall.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-07 19:28 ` [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
       [not found]   ` <CGME20230515081618eucas1p1c852fec3ba7a42ee7094248c30ff5978@eucas1p1.samsung.com>
  2023-05-15 15:15   ` Brian Foster
@ 2023-05-18 13:27   ` Christoph Hellwig
  2023-05-19 16:08     ` Ritesh Harjani
  2 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2023-05-18 13:27 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel, Aravinda Herle

On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
> +static inline void iop_clear_range(struct iomap_page *iop,
> +				   unsigned int start_blk, unsigned int nr_blks)
> +{
> +	bitmap_clear(iop->state, start_blk, nr_blks);
> +}

Similar to the other trivial bitmap wrappers added earlier in the
series I don't think this one is very useful.

> +	struct iomap_page *iop = to_iomap_page(folio);
> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> +	unsigned int first_blk = (off >> inode->i_blkbits);
> +	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
> +	unsigned int nr_blks = last_blk - first_blk + 1;
> +	unsigned long flags;
> +
> +	if (!iop)
> +		return;
> +	spin_lock_irqsave(&iop->state_lock, flags);
> +	iop_set_range(iop, first_blk + blks_per_folio, nr_blks);
> +	spin_unlock_irqrestore(&iop->state_lock, flags);

All the variable initializations except for iop should really move
into a branch here.  Or we just use separate helpers for the case
where we actually have an iop and isolate all that, which would
be my preference (but goes counter to the direction of changes
earlier in the series to the existing functions).

> +static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
> +{
> +	struct iomap_page *iop = to_iomap_page(folio);
> +	struct inode *inode = folio->mapping->host;
> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> +	unsigned int first_blk = (off >> inode->i_blkbits);
> +	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
> +	unsigned int nr_blks = last_blk - first_blk + 1;
> +	unsigned long flags;
> +
> +	if (!iop)
> +		return;
> +	spin_lock_irqsave(&iop->state_lock, flags);
> +	iop_clear_range(iop, first_blk + blks_per_folio, nr_blks);
> +	spin_unlock_irqrestore(&iop->state_lock, flags);
> +}

Same here.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-18 13:23             ` Christoph Hellwig
@ 2023-05-18 16:15               ` Matthew Wilcox
  2023-05-22  4:33                 ` Ritesh Harjani
  0 siblings, 1 reply; 44+ messages in thread
From: Matthew Wilcox @ 2023-05-18 16:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Brian Foster, Ritesh Harjani, linux-xfs, linux-fsdevel,
	Dave Chinner, Ojaswin Mujoo, Disha Goel, Aravinda Herle

On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> > overwrites, isn't that also true of mapped writes to folios that don't
> > already have an iop?
> 
> Yes.

Hm, well, maybe?  If somebody stores to a page, we obviously set the
dirty flag on the folio, but depending on the architecture, we may
or may not have independent dirty bits on the PTEs (eg if it's a PMD,
we have one dirty bit for the entire folio; similarly if ARM uses the
contiguous PTE bit).  If we do have independent dirty bits, we could
dirty only the blocks corresponding to a single page at a time.

This has potential for causing some nasty bugs, so I'm inclined to
rule that if a folio is mmaped, then it's all dirty from any writable
page fault.  The fact is that applications generally do not perform
writes through mmap because the error handling story is so poor.

There may be a different answer for anonymous memory, but that doesn't
feel like my problem and shouldn't feel like any FS developer's problem.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 1/5] iomap: Rename iomap_page_create/release() to iop_alloc/free()
  2023-05-18  6:13   ` Christoph Hellwig
@ 2023-05-19 15:01     ` Ritesh Harjani
  0 siblings, 0 replies; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-19 15:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel

Christoph Hellwig <hch@infradead.org> writes:

> On Mon, May 08, 2023 at 12:57:56AM +0530, Ritesh Harjani (IBM) wrote:
>> This patch renames the iomap_page_create/release() functions to
>> iop_alloc/free() calls. Later patches add more functions for
>> handling the iop structure with iop_** naming conventions.
>> Hence iop_alloc/free() makes more sense.
>
> I can't say I like the iop_* naming all that much, especially as it
> is very generic and we use an iomap_ prefix everywhere else.
>

I can prefix iomap_ so it will then become iomap_iop_alloc()/iomap_iop_free().
All other helpers will be like...
- iomap_iop_set_range_uptodate()
- iomap_iop_set_range_dirty()
- ...

The 'iomap_iop' prefix just helps distinguish all the APIs which
work on the iomap_page (iop) structure.
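
For instance, the declarations would then look something like this (a
sketch; the signatures are assumed from how the helpers are used in
this series):

static struct iomap_page *iomap_iop_alloc(struct inode *inode,
                struct folio *folio, unsigned int flags);
static void iomap_iop_free(struct folio *folio);
static void iomap_iop_set_range_uptodate(struct inode *inode,
                struct folio *folio, size_t off, size_t len);
static void iomap_iop_set_range_dirty(struct inode *inode,
                struct folio *folio, size_t off, size_t len);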

>> Note, this patch also moves folio_detach_private() to happen later,
>> after checking for bitmap_full(). This is just another small refactor,
>> because in later patches we will move the bitmap_** helpers to iop_**
>> helpers which will only take a folio, and hence we should move
>> folio_detach_private() to the end, before calling kfree(iop).
>
> Please don't mix renames and code movements.

Sure. Will take care of it in the next rev.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 2/5] iomap: Refactor iop_set_range_uptodate() function
  2023-05-18  6:16   ` Christoph Hellwig
@ 2023-05-19 15:03     ` Ritesh Harjani
  0 siblings, 0 replies; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-19 15:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel

Christoph Hellwig <hch@infradead.org> writes:

>> +	if (iop) {
>> +		spin_lock_irqsave(&iop->uptodate_lock, flags);
>> +		bitmap_set(iop->uptodate, first_blk, nr_blks);
>> +		if (bitmap_full(iop->uptodate,
>> +				i_blocks_per_folio(inode, folio)))
>> +			folio_mark_uptodate(folio);
>> +		spin_unlock_irqrestore(&iop->uptodate_lock, flags);
>> +	} else {
>> +		folio_mark_uptodate(folio);
>> +	}
>
> If we did a:
>
> 	if (!iop) {
> 		folio_mark_uptodate(folio);
> 		return;
> 	}
>
> we can remove a level of indentation and keep things a bit simpler.
> But I can live with either style.

Yup, it pleases my eyes too!! I will change it.

-ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 3/5] iomap: Add iop's uptodate state handling functions
  2023-05-18  6:18   ` Christoph Hellwig
@ 2023-05-19 15:07     ` Ritesh Harjani
  2023-05-23  6:00       ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-19 15:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel

Christoph Hellwig <hch@infradead.org> writes:

>> + * inline helpers for bitmap operations on iop->state
>> + */
>> +static inline void iop_set_range(struct iomap_page *iop, unsigned int start_blk,
>> +				 unsigned int nr_blks)
>> +{
>> +	bitmap_set(iop->state, start_blk, nr_blks);
>> +}
>> +
>> +static inline bool iop_test_block(struct iomap_page *iop, unsigned int block)
>> +{
>> +	return test_bit(block, iop->state);
>> +}
>> +
>> +static inline bool iop_bitmap_full(struct iomap_page *iop,
>> +				   unsigned int blks_per_folio)
>> +{
>> +	return bitmap_full(iop->state, blks_per_folio);
>> +}
>
> I don't really see much point in these helpers, any particular reason
> for adding them?
>

We ended up modifying the APIs in v5. The idea in v4 was to keep the
iop_set_range() function, which would be the same for both uptodate and
dirty. The caller could pass start_blk depending upon whether dirty or
uptodate needed to be marked.
But I guess with the API changes, we don't need these low-level helpers
anymore. So if no one has any objection, I can kill these one-liners.
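
For reference, the v4 idea in a minimal sketch, assuming the two-halves
state bitmap from patch 5 (uptodate bits live in [0, nr_blocks), dirty
bits in [nr_blocks, 2 * nr_blocks)):

        /* mark a block range uptodate: bits sit in the first half */
        iop_set_range(iop, first_blk, nr_blks);

        /* mark the same range dirty: offset into the second half */
        iop_set_range(iop, blks_per_folio + first_blk, nr_blks);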

>> +/*
>> + * iop related helpers for checking uptodate/dirty state of per-block
>> + * or range of blocks within a folio
>> + */
>
> I'm also not sure this comment adds a whole lot of value.
>
> The rest looks good modulo the WARN_ONs already mentioned by Brian.

Sure. Thanks for the review!

-ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 4/5] iomap: Allocate iop in ->write_begin() early
  2023-05-18  6:21   ` Christoph Hellwig
@ 2023-05-19 15:18     ` Ritesh Harjani
  2023-05-19 15:53       ` Matthew Wilcox
  0 siblings, 1 reply; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-19 15:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel

Christoph Hellwig <hch@infradead.org> writes:

> On Mon, May 08, 2023 at 12:57:59AM +0530, Ritesh Harjani (IBM) wrote:
>> Earlier when the folio is uptodate, we only allocate iop at writeback
>
> s/Earlier/Currently/ ?
>
>> time (in iomap_writepage_map()). This is ok until now, but when we are
>> going to add support for per-block dirty state bitmap in iop, this
>> could cause some performance degradation. The reason is that if we don't
>> allocate iop during ->write_begin(), then we will never mark the
>> necessary dirty bits in ->write_end() call. And we will have to mark all
>> the bits as dirty at the writeback time, that could cause the same write
>> amplification and performance problems as it is now.
>>
>> However, for all the writes with (pos, len) which completely overlaps
>> the given folio, there is no need to allocate an iop during
>> ->write_begin(). So skip those cases.
>
> This reads a bit backwards; I'd suggest mentioning that early
> allocation only happens for sub-page writes before going into the
> details.
>

sub-page is a bit confusing here, because we can have a large folio too,
with multiple blocks within that folio. So we decided to go with per-block
terminology [1].

[1]: https://lore.kernel.org/linux-xfs/ZFR%2FGuVca5nFlLYF@casper.infradead.org/

I am guessing you would like me to re-write the above para. Is this better?

"We don't need to allocate an iop in ->write_begin() for writes where the
position and length completely overlap with the given folio.
Therefore, such cases are skipped."

> The changes themselves look good to me.

Sure. Thanks!

-ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 4/5] iomap: Allocate iop in ->write_begin() early
  2023-05-19 15:18     ` Ritesh Harjani
@ 2023-05-19 15:53       ` Matthew Wilcox
  2023-05-22  4:05         ` Ritesh Harjani
  0 siblings, 1 reply; 44+ messages in thread
From: Matthew Wilcox @ 2023-05-19 15:53 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel

On Fri, May 19, 2023 at 08:48:37PM +0530, Ritesh Harjani wrote:
> Christoph Hellwig <hch@infradead.org> writes:
> 
> > On Mon, May 08, 2023 at 12:57:59AM +0530, Ritesh Harjani (IBM) wrote:
> >> Earlier when the folio is uptodate, we only allocate iop at writeback
> >
> > s/Earlier/Currently/ ?
> >
> >> time (in iomap_writepage_map()). This is ok until now, but when we are
> >> going to add support for per-block dirty state bitmap in iop, this
> >> could cause some performance degradation. The reason is that if we don't
> >> allocate iop during ->write_begin(), then we will never mark the
> >> necessary dirty bits in ->write_end() call. And we will have to mark all
> >> the bits as dirty at the writeback time, that could cause the same write
> >> amplification and performance problems as it is now.
> >>
> >> However, for all the writes with (pos, len) which completely overlaps
> >> the given folio, there is no need to allocate an iop during
> >> ->write_begin(). So skip those cases.
> >
> > This reads a bit backwards; I'd suggest mentioning that early
> > allocation only happens for sub-page writes before going into the
> > details.
> >
> 
> sub-page is a bit confusing here, because we can have a large folio too,
> with multiple blocks within that folio. So we decided to go with per-block
> terminology [1].
> 
> [1]: https://lore.kernel.org/linux-xfs/ZFR%2FGuVca5nFlLYF@casper.infradead.org/
> 
> I am guessing you would like me to re-write the above para. Is this better?
> 
> "We don't need to allocate an iop in ->write_begin() for writes where the
> position and length completely overlap with the given folio.
> Therefore, such cases are skipped."

... and reorder that paragraph to be first.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-18 13:27   ` Christoph Hellwig
@ 2023-05-19 16:08     ` Ritesh Harjani
  0 siblings, 0 replies; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-19 16:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Matthew Wilcox, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel, Aravinda Herle

Christoph Hellwig <hch@infradead.org> writes:

> On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote:
>> +static inline void iop_clear_range(struct iomap_page *iop,
>> +				   unsigned int start_blk, unsigned int nr_blks)
>> +{
>> +	bitmap_clear(iop->state, start_blk, nr_blks);
>> +}
>
> Similar to the other trivial bitmap wrappers added earlier in the
> series I don't think this one is very useful.
>
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
>> +	unsigned int first_blk = (off >> inode->i_blkbits);
>> +	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
>> +	unsigned int nr_blks = last_blk - first_blk + 1;
>> +	unsigned long flags;
>> +
>> +	if (!iop)
>> +		return;
>> +	spin_lock_irqsave(&iop->state_lock, flags);
>> +	iop_set_range(iop, first_blk + blks_per_folio, nr_blks);
>> +	spin_unlock_irqrestore(&iop->state_lock, flags);
>
> All the variable initializations except for iop should really move
> into a branch here.

A branch won't be needed, I guess; I will just move the initializations
after the early return.
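
I.e., something like this (a minimal sketch reusing the names from the
quoted hunks; the exact signature is assumed):

static void iop_set_range_dirty(struct inode *inode, struct folio *folio,
                                size_t off, size_t len)
{
        struct iomap_page *iop = to_iomap_page(folio);
        unsigned int blks_per_folio, first_blk, last_blk, nr_blks;
        unsigned long flags;

        if (!iop)
                return;

        /* compute the block range only once we know an iop exists */
        blks_per_folio = i_blocks_per_folio(inode, folio);
        first_blk = off >> inode->i_blkbits;
        last_blk = (off + len - 1) >> inode->i_blkbits;
        nr_blks = last_blk - first_blk + 1;

        spin_lock_irqsave(&iop->state_lock, flags);
        iop_set_range(iop, first_blk + blks_per_folio, nr_blks);
        spin_unlock_irqrestore(&iop->state_lock, flags);
}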

> Or we just use separate helpers for the case
> where we actually have an iop and isolate all that, which would
> be my preference (but goes counter to the direction of changes
> earlier in the series to the existing functions).
>
>> +static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
>> +{
>> +	struct iomap_page *iop = to_iomap_page(folio);
>> +	struct inode *inode = folio->mapping->host;
>> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
>> +	unsigned int first_blk = (off >> inode->i_blkbits);
>> +	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
>> +	unsigned int nr_blks = last_blk - first_blk + 1;
>> +	unsigned long flags;
>> +
>> +	if (!iop)
>> +		return;
>> +	spin_lock_irqsave(&iop->state_lock, flags);
>> +	iop_clear_range(iop, first_blk + blks_per_folio, nr_blks);
>> +	spin_unlock_irqrestore(&iop->state_lock, flags);
>> +}
>
> Same here.

Sure.

-ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 4/5] iomap: Allocate iop in ->write_begin() early
  2023-05-19 15:53       ` Matthew Wilcox
@ 2023-05-22  4:05         ` Ritesh Harjani
  0 siblings, 0 replies; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-22  4:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Dave Chinner,
	Brian Foster, Ojaswin Mujoo, Disha Goel

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, May 19, 2023 at 08:48:37PM +0530, Ritesh Harjani wrote:
>> Christoph Hellwig <hch@infradead.org> writes:
>>
>> > On Mon, May 08, 2023 at 12:57:59AM +0530, Ritesh Harjani (IBM) wrote:
>> >> Earlier when the folio is uptodate, we only allocate iop at writeback
>> >
>> > s/Earlier/Currently/ ?
>> >
>> >> time (in iomap_writepage_map()). This is ok until now, but when we are
>> >> going to add support for per-block dirty state bitmap in iop, this
>> >> could cause some performance degradation. The reason is that if we don't
>> >> allocate iop during ->write_begin(), then we will never mark the
>> >> necessary dirty bits in ->write_end() call. And we will have to mark all
>> >> the bits as dirty at the writeback time, that could cause the same write
>> >> amplification and performance problems as it is now.
>> >>
>> >> However, for all the writes with (pos, len) which completely overlaps
>> >> the given folio, there is no need to allocate an iop during
>> >> ->write_begin(). So skip those cases.
>> >
>> > This reads a bit backwards; I'd suggest mentioning that early
>> > allocation only happens for sub-page writes before going into the
>> > details.
>> >
>>
>> sub-page is a bit confusing here, because we can have a large folio too,
>> with multiple blocks within that folio. So we decided to go with per-block
>> terminology [1].
>>
>> [1]: https://lore.kernel.org/linux-xfs/ZFR%2FGuVca5nFlLYF@casper.infradead.org/
>>
>> I am guessing you would like me to re-write the above para. Is this better?
>>
>> "We don't need to allocate an iop in ->write_begin() for writes where the
>> position and length completely overlap with the given folio.
>> Therefore, such cases are skipped."
>
> ... and reorder that paragraph to be first.

Sure.

-ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-18 16:15               ` Matthew Wilcox
@ 2023-05-22  4:33                 ` Ritesh Harjani
  2023-05-22  4:48                   ` Matthew Wilcox
  2023-05-22 11:18                   ` Brian Foster
  0 siblings, 2 replies; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-22  4:33 UTC (permalink / raw)
  To: Matthew Wilcox, Christoph Hellwig
  Cc: Brian Foster, linux-xfs, linux-fsdevel, Dave Chinner,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

Matthew Wilcox <willy@infradead.org> writes:

> On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
>> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
>> > But I also wonder.. if we can skip the iop alloc on full folio buffered
>> > overwrites, isn't that also true of mapped writes to folios that don't
>> > already have an iop?
>>
>> Yes.
>
> Hm, well, maybe?  If somebody stores to a page, we obviously set the
> dirty flag on the folio, but depending on the architecture, we may
> or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> we have one dirty bit for the entire folio; similarly if ARM uses the
> contiguous PTE bit).  If we do have independent dirty bits, we could
> dirty only the blocks corresponding to a single page at a time.
>
> This has potential for causing some nasty bugs, so I'm inclined to
> rule that if a folio is mmaped, then it's all dirty from any writable
> page fault.  The fact is that applications generally do not perform
> writes through mmap because the error handling story is so poor.
>
> There may be a different answer for anonymous memory, but that doesn't
> feel like my problem and shouldn't feel like any FS developer's problem.

Although I am skeptical too about doing the changes which Brian is
suggesting here, i.e. not making all the blocks of the folio dirty when
we are going to call ->dirty_folio -> filemap_dirty_folio() (mmapped
writes).

However, I am sorry but I couldn't completely follow your reasoning
above. I think what Brian is suggesting here is that
filemap_dirty_folio() should be similar to the complete buffered
overwrite case, where we do not allocate the iop at ->write_begin()
time. Then at writeback time we allocate an iop and mark all blocks
dirty.

In a way it is also a similar case to mmapped writes, but my only worry
is the way mmapped writes work, and it makes more sense to keep the
dirty state of the folio and the per-block state in the iop in sync.
For that matter, we can even just make sure we always allocate an iop
in the complete overwrite case as well. I didn't change that code
because it was kept that way for the uptodate state as well, and based
on one of your inputs for the complete overwrite case.

Though I agree that we should ideally be allocating & marking all
blocks in the iop as dirty in the call to ->dirty_folio(), I just
wanted to understand your reasoning better.

Thanks!
-ritesh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-22  4:33                 ` Ritesh Harjani
@ 2023-05-22  4:48                   ` Matthew Wilcox
  2023-05-22 11:18                   ` Brian Foster
  1 sibling, 0 replies; 44+ messages in thread
From: Matthew Wilcox @ 2023-05-22  4:48 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Christoph Hellwig, Brian Foster, linux-xfs, linux-fsdevel,
	Dave Chinner, Ojaswin Mujoo, Disha Goel, Aravinda Herle

On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> >> > overwrites, isn't that also true of mapped writes to folios that don't
> >> > already have an iop?
> >>
> >> Yes.
> >
> > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> > dirty flag on the folio, but depending on the architecture, we may
> > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> > we have one dirty bit for the entire folio; similarly if ARM uses the
> > contiguous PTE bit).  If we do have independent dirty bits, we could
> > dirty only the blocks corresponding to a single page at a time.
> >
> > This has potential for causing some nasty bugs, so I'm inclined to
> > rule that if a folio is mmaped, then it's all dirty from any writable
> > page fault.  The fact is that applications generally do not perform
> > writes through mmap because the error handling story is so poor.
> >
> > There may be a different answer for anonymous memory, but that doesn't
> > feel like my problem and shouldn't feel like any FS developer's problem.
> 
> Although I am skeptical too about doing the changes which Brian is
> suggesting here, i.e. not making all the blocks of the folio dirty when
> we are going to call ->dirty_folio -> filemap_dirty_folio() (mmapped
> writes).
> 
> However, I am sorry but I couldn't completely follow your reasoning
> above. I think what Brian is suggesting here is that
> filemap_dirty_folio() should be similar to the complete buffered
> overwrite case, where we do not allocate the iop at ->write_begin()
> time. Then at writeback time we allocate an iop and mark all blocks
> dirty.
> 
> In a way it is also a similar case to mmapped writes, but my only worry
> is the way mmapped writes work, and it makes more sense to keep the
> dirty state of the folio and the per-block state in the iop in sync.
> For that matter, we can even just make sure we always allocate an iop
> in the complete overwrite case as well. I didn't change that code
> because it was kept that way for the uptodate state as well, and based
> on one of your inputs for the complete overwrite case.
> 
> Though I agree that we should ideally be allocating & marking all
> blocks in the iop as dirty in the call to ->dirty_folio(), I just
> wanted to understand your reasoning better.

Think about it at a higher level than the implementation ("do we allocate
an iop or not").  If userspace dirties one page in a folio, should it
dirty all pages in the folio, or just the page that was actually dirtied?
I appreciate you're thinking about this from the point of view of 64kB
pages on PPC and using single-page folios, but pretend we've allocated a
1MB folio, mmaped it (or a part of it?)  and now userspace stores to it.
How much of it do we want to write back?

My argument is that this is RARE.  Userspace generally does not
mmap(MAP_SHARED), store to it and call msync() to do writes.  Writes are
almost always done using the write() syscall.  Userspace gets a lot more
control about when the writeback happens, and they actually get errors
back from the write() syscall.

If we attempt to track which pages have actually been dirtied, I worry
the fs and the mm will lose track of what the other needs to know.
eg the mm will make every page in the folio writable and then not notify
the fs when subsequent pages in the folio are stored to.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-22  4:33                 ` Ritesh Harjani
  2023-05-22  4:48                   ` Matthew Wilcox
@ 2023-05-22 11:18                   ` Brian Foster
  2023-05-23  0:56                     ` Darrick J. Wong
  1 sibling, 1 reply; 44+ messages in thread
From: Brian Foster @ 2023-05-22 11:18 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Matthew Wilcox, Christoph Hellwig, linux-xfs, linux-fsdevel,
	Dave Chinner, Ojaswin Mujoo, Disha Goel, Aravinda Herle

On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> >> > overwrites, isn't that also true of mapped writes to folios that don't
> >> > already have an iop?
> >>
> >> Yes.
> >
> > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> > dirty flag on the folio, but depending on the architecture, we may
> > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> > we have one dirty bit for the entire folio; similarly if ARM uses the
> > contiguous PTE bit).  If we do have independent dirty bits, we could
> > dirty only the blocks corresponding to a single page at a time.
> >
> > This has potential for causing some nasty bugs, so I'm inclined to
> > rule that if a folio is mmaped, then it's all dirty from any writable
> > page fault.  The fact is that applications generally do not perform
> > writes through mmap because the error handling story is so poor.
> >
> > There may be a different answer for anonymous memory, but that doesn't
> > feel like my problem and shouldn't feel like any FS developer's problem.
> 
> Although I am skeptical too about doing the changes which Brian is
> suggesting here, i.e. not making all the blocks of the folio dirty when
> we are going to call ->dirty_folio -> filemap_dirty_folio() (mmapped
> writes).
> 
> However, I am sorry but I couldn't completely follow your reasoning
> above. I think what Brian is suggesting here is that
> filemap_dirty_folio() should be similar to the complete buffered
> overwrite case, where we do not allocate the iop at ->write_begin()
> time. Then at writeback time we allocate an iop and mark all blocks
> dirty.
> 

Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
granularity of intra-folio faults) makes sense, but I'm also not sure
what it has to do with the idea of being consistent with how full folio
overwrites are implemented (between buffered or mapped writes). We're
not changing historical dirtying granularity either way. I think this is
just a bigger picture thought for future consideration as opposed to
direct feedback on this patch..

> In a way it is also a similar case to mmapped writes, but my only worry
> is the way mmapped writes work, and it makes more sense to keep the
> dirty state of the folio and the per-block state in the iop in sync.
> For that matter, we can even just make sure we always allocate an iop
> in the complete overwrite case as well. I didn't change that code
> because it was kept that way for the uptodate state as well, and based
> on one of your inputs for the complete overwrite case.
> 

Can you elaborate on your concerns, out of curiosity?

Either way, IMO it also seems reasonable to drop this behavior for the
basic implementation of dirty tracking (so always allocate the iop for
sub-folio tracking as you suggest above) and then potentially restore it
as a separate optimization patch at the end of the series.
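
(For illustration, a minimal sketch of the always-allocate variant in
__iomap_write_begin(), assuming the helper names used earlier in this
thread; the full-folio-overwrite skip simply goes away:

        /* always allocate for sub-folio blocks; no overwrite special case */
        iop = iop_alloc(iter->inode, folio, iter->flags);
        if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
                return -EAGAIN;
)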

That said, I'm not totally clear why it exists in the first place, so
that might warrant some investigation. Is it primarily to defer
allocations out of task write/fault contexts? To optimize the case where
pagecache is dirtied but truncated or something and thus never written
back? Is there any room for further improvement where the alloc could be
avoided completely for folio overwrites instead of just deferred? Was
that actually the case at some point and then something later decided
the iop was needed at writeback time, leading to current behavior?

Brian

> Though I agree that we should ideally be allocating & marking all
> blocks in the iop as dirty in the call to ->dirty_folio(), I just
> wanted to understand your reasoning better.
> 
> Thanks!
> -ritesh
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-22 11:18                   ` Brian Foster
@ 2023-05-23  0:56                     ` Darrick J. Wong
  2023-05-23 12:15                       ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2023-05-23  0:56 UTC (permalink / raw)
  To: Brian Foster
  Cc: Ritesh Harjani, Matthew Wilcox, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Dave Chinner, Ojaswin Mujoo, Disha Goel,
	Aravinda Herle

On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
> On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> > Matthew Wilcox <willy@infradead.org> writes:
> > 
> > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> > >> > overwrites, isn't that also true of mapped writes to folios that don't
> > >> > already have an iop?
> > >>
> > >> Yes.
> > >
> > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> > > dirty flag on the folio, but depending on the architecture, we may
> > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> > > we have one dirty bit for the entire folio; similarly if ARM uses the
> > > contiguous PTE bit).  If we do have independent dirty bits, we could
> > > dirty only the blocks corresponding to a single page at a time.
> > >
> > > This has potential for causing some nasty bugs, so I'm inclined to
> > > rule that if a folio is mmaped, then it's all dirty from any writable
> > > page fault.  The fact is that applications generally do not perform
> > > writes through mmap because the error handling story is so poor.
> > >
> > > There may be a different answer for anonymous memory, but that doesn't
> > > feel like my problem and shouldn't feel like any FS developer's problem.
> > 
> > Although I am skeptical too about doing the changes which Brian is
> > suggesting here, i.e. not making all the blocks of the folio dirty when
> > we are going to call ->dirty_folio -> filemap_dirty_folio() (mmapped
> > writes).
> > 
> > However, I am sorry but I couldn't completely follow your reasoning
> > above. I think what Brian is suggesting here is that
> > filemap_dirty_folio() should be similar to the complete buffered
> > overwrite case, where we do not allocate the iop at ->write_begin()
> > time. Then at writeback time we allocate an iop and mark all blocks
> > dirty.
> > 
> 
> Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
> granularity of intra-folio faults) makes sense, but I'm also not sure
> what it has to do with the idea of being consistent with how full folio
> overwrites are implemented (between buffered or mapped writes). We're
> not changing historical dirtying granularity either way. I think this is
> just a bigger picture thought for future consideration as opposed to
> direct feedback on this patch..

<nod>

> > In a way it is also a similar case to mmapped writes, but my only worry
> > is the way mmapped writes work, and it makes more sense to keep the
> > dirty state of the folio and the per-block state in the iop in sync.
> > For that matter, we can even just make sure we always allocate an iop
> > in the complete overwrite case as well. I didn't change that code
> > because it was kept that way for the uptodate state as well, and based
> > on one of your inputs for the complete overwrite case.
> > 
> 
> Can you elaborate on your concerns, out of curiosity?
> 
> Either way, IMO it also seems reasonable to drop this behavior for the
> basic implementation of dirty tracking (so always allocate the iop for
> sub-folio tracking as you suggest above) and then potentially restore it
> as a separate optimization patch at the end of the series.

Agree.

> That said, I'm not totally clear why it exists in the first place, so
> that might warrant some investigation. Is it primarily to defer
> allocations out of task write/fault contexts?

(Assuming by 'it' you mean the behavior where we don't unconditionally
allocate iops for blocksize < foliosize...)

IIRC, reducing memory usage by eliding iop allocations unless absolutely
necessary for correctness was /my/ understanding of why we don't always
allocate the iop...

> To optimize the case where pagecache is dirtied but truncated or
> something and thus never written back?

...because this might very well happen.  Write a temporary .o file to
the filesystem, then delete the whole thing before writeback ever gets
its hands on the file.

> Is there any room for further improvement where the alloc could be
> avoided completely for folio overwrites instead of just deferred?

Once writeback starts, though, we need the iop so that we can know when
all the writeback for that folio is actually complete, no matter how
many IOs might be in flight for that folio.  I don't know how you'd get
around this problem.
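
(Roughly, that accounting is the iop's write_bytes_pending byte counter;
a sketch of the completion side, modeled on iomap_finish_folio_write()
in fs/iomap/buffered-io.c:

static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
                                     size_t len, int error)
{
        struct iomap_page *iop = to_iomap_page(folio);

        if (error) {
                folio_set_error(folio);
                mapping_set_error(inode->i_mapping, error);
        }

        /* each sub-folio IO added its byte count at submission time */
        WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !iop);
        WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);

        /* only the last completing IO may end writeback on the folio */
        if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
                folio_end_writeback(folio);
}
)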

> Was that actually the case at some point and then something later
> decided the iop was needed at writeback time, leading to current
> behavior?

It's been in iomap since the beginning when we lifted it from xfs.

--D (who is now weeks behind on reviewing things and stressed out)

> Brian
> 
> > Though I agree that we should ideally be allocating & marking all
> > blocks in the iop as dirty in the call to ->dirty_folio(), I just
> > wanted to understand your reasoning better.
> > 
> > Thanks!
> > -ritesh
> > 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 3/5] iomap: Add iop's uptodate state handling functions
  2023-05-19 15:07     ` Ritesh Harjani
@ 2023-05-23  6:00       ` Christoph Hellwig
  0 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2023-05-23  6:00 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Dave Chinner, Brian Foster, Ojaswin Mujoo, Disha Goel

On Fri, May 19, 2023 at 08:37:01PM +0530, Ritesh Harjani wrote:
> We ended up modifying the APIs in v5. The idea in v4 was to keep the
> iop_set_range() function, which would be the same for both uptodate and
> dirty. The caller could pass start_blk depending upon whether dirty or
> uptodate needed to be marked.
> But I guess with the API changes, we don't need these low-level helpers
> anymore. So if no one has any objection, I can kill these one-liners.

Yes, the actual uptodate/dirty helpers isolate the caller from the
API.  We don't really need another tiny wrapper here.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-23  0:56                     ` Darrick J. Wong
@ 2023-05-23 12:15                       ` Brian Foster
  2023-05-23 13:43                         ` Ritesh Harjani
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2023-05-23 12:15 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ritesh Harjani, Matthew Wilcox, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Dave Chinner, Ojaswin Mujoo, Disha Goel,
	Aravinda Herle

On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> > > Matthew Wilcox <willy@infradead.org> writes:
> > > 
> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
> > > >> > already have an iop?
> > > >>
> > > >> Yes.
> > > >
> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> > > > dirty flag on the folio, but depending on the architecture, we may
> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
> > > > dirty only the blocks corresponding to a single page at a time.
> > > >
> > > > This has potential for causing some nasty bugs, so I'm inclined to
> > > > rule that if a folio is mmaped, then it's all dirty from any writable
> > > > page fault.  The fact is that applications generally do not perform
> > > > writes through mmap because the error handling story is so poor.
> > > >
> > > > There may be a different answer for anonymous memory, but that doesn't
> > > > feel like my problem and shouldn't feel like any FS developer's problem.
> > > 
> > > Although I am skeptical too about doing the changes which Brian is
> > > suggesting here, i.e. not making all the blocks of the folio dirty when
> > > we are going to call ->dirty_folio -> filemap_dirty_folio() (mmapped
> > > writes).
> > > 
> > > However, I am sorry but I couldn't completely follow your reasoning
> > > above. I think what Brian is suggesting here is that
> > > filemap_dirty_folio() should be similar to the complete buffered
> > > overwrite case, where we do not allocate the iop at ->write_begin()
> > > time. Then at writeback time we allocate an iop and mark all blocks
> > > dirty.
> > > 
> > 
> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
> > granularity of intra-folio faults) makes sense, but I'm also not sure
> > what it has to do with the idea of being consistent with how full folio
> > overwrites are implemented (between buffered or mapped writes). We're
> > not changing historical dirtying granularity either way. I think this is
> > just a bigger picture thought for future consideration as opposed to
> > direct feedback on this patch..
> 
> <nod>
> 
> > > In a way it is also a similar case to mmapped writes, but my only worry
> > > is the way mmapped writes work, and it makes more sense to keep the
> > > dirty state of the folio and the per-block state in the iop in sync.
> > > For that matter, we can even just make sure we always allocate an iop
> > > in the complete overwrite case as well. I didn't change that code
> > > because it was kept that way for the uptodate state as well, and based
> > > on one of your inputs for the complete overwrite case.
> > > 
> > 
> > Can you elaborate on your concerns, out of curiosity?
> > 
> > Either way, IMO it also seems reasonable to drop this behavior for the
> > basic implementation of dirty tracking (so always allocate the iop for
> > sub-folio tracking as you suggest above) and then potentially restore it
> > as a separate optimization patch at the end of the series.
> 
> Agree.
> 
> > That said, I'm not totally clear why it exists in the first place, so
> > that might warrant some investigation. Is it primarily to defer
> > allocations out of task write/fault contexts?
> 
> (Assuming by 'it' you mean the behavior where we don't unconditionally
> allocate iops for blocksize < foliosize...)
> 
> IIRC the reason is to reduce memory usage by eliding iop allocations
> unless it's absolutely necessary for correctness was /my/ understanding
> of why we don't always allocate the iop...
> 
> > To optimize the case where pagecache is dirtied but truncated or
> > something and thus never written back?
> 
> ...because this might very well happen.  Write a temporary .o file to
> the filesystem, then delete the whole thing before writeback ever gets
> its hands on the file.
> 

I don't think a simple temp write will trigger this scenario currently
because the folios would have to be uptodate at the time of the write to
bypass the iop alloc. I guess you'd have to read folios (even if backed
by holes) first to start seeing the !iop case at writeback time (for bs
!= ps).

That could change with these patches, but I was trying to reason about
the intent of the existing code and whether there was some known reason
to continue to try and defer the iop allocation as the need/complexity
for deferring it grows with the addition of more (i.e. dirty) tracking.

> > Is there any room for further improvement where the alloc could be
> > avoided completely for folio overwrites instead of just deferred?
> 
> Once writeback starts, though, we need the iop so that we can know when
> all the writeback for that folio is actually complete, no matter how
> many IOs might be in flight for that folio.  I don't know how you'd get
> around this problem.
> 

Ok. I noticed some kind of counter or something being updated last time
I looked through that code, so it sounds like that's the reason the iop
eventually needs to exist. Thanks.
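
Something like the following sketch, which only models the concept (the
field name is illustrative, not necessarily the real one): each
submitted bio adds to a pending count, and only the completion that
drops it to zero may end folio writeback:

	#include <stdatomic.h>
	#include <stdio.h>

	/* Toy model of why the iop must exist once writeback starts: several
	 * bios may be in flight for one folio, and only the last completion
	 * may end folio writeback. */
	struct iop_model {
		atomic_int write_bytes_pending;	/* illustrative name */
	};

	static void submit_range(struct iop_model *iop, int bytes)
	{
		atomic_fetch_add(&iop->write_bytes_pending, bytes);
	}

	static void bio_completed(struct iop_model *iop, int bytes)
	{
		/* fetch_sub returns the old value; old - bytes == 0 means last IO */
		if (atomic_fetch_sub(&iop->write_bytes_pending, bytes) - bytes == 0)
			printf("last bio done: end folio writeback\n");
		else
			printf("bio done, IO still in flight for this folio\n");
	}

	int main(void)
	{
		struct iop_model iop = { 0 };

		/* Two discontiguous dirty ranges in one folio -> two bios. */
		submit_range(&iop, 1024);
		submit_range(&iop, 1024);
		bio_completed(&iop, 1024);
		bio_completed(&iop, 1024);
		return 0;
	}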

> > Was that actually the case at some point and then something later
> > decided the iop was needed at writeback time, leading to current
> > behavior?
> 
> It's been in iomap since the beginning when we lifted it from xfs.
> 

Not sure exactly what you're referring to here. iomap_writepage_map()
would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
("iomap: Permit pages without an iop to enter writeback"), so I don't
see how iop allocs were deferred (other than when bs == ps, obviously)
prior to that.

Heh, but I'm starting to get my wires crossed just trying to piece
things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
case is primarily a subtle side effect of the current writeback behavior
being driven by uptodate status. I think it's probably wise to drop it
at least initially, always alloc and dirty the appropriate iop ranges
for sub-folio blocks, and then if you or others think there is value in
the overwrite optimization to defer iop allocs, tack that on as a
separate patch and try to be consistent between buffered and mapped
writes.
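
As a concrete (made up, minimal) sketch of what that per-block dirty
tracking amounts to, one bit per block, set at write/fault time and
consulted at writeback; the helper names here are invented for the
sketch, not taken from the series:

	#include <stdio.h>

	#define NR_BLOCKS 4	/* e.g. 1k blocks in a 4k folio */

	static unsigned long dirty_bitmap;	/* one bit per block in the folio */

	/* Mark blocks [first, first + nr) dirty, as a buffered write or page
	 * fault over that byte range would. */
	static void iop_set_range_dirty(unsigned int first, unsigned int nr)
	{
		for (unsigned int i = first; i < first + nr; i++)
			dirty_bitmap |= 1UL << i;
	}

	static int iop_block_dirty(unsigned int blk)
	{
		return !!(dirty_bitmap & (1UL << blk));
	}

	int main(void)
	{
		/* A sub-folio buffered write touching only block 2. */
		iop_set_range_dirty(2, 1);

		/* Writeback would then skip the clean blocks 0, 1 and 3. */
		for (unsigned int i = 0; i < NR_BLOCKS; i++)
			printf("block %u dirty=%d\n", i, iop_block_dirty(i));
		return 0;
	}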

Darrick noted above that he also agrees with that separate patch
approach. For me, I think it would also be useful to show that there is
some measurable performance benefit on at least one reasonable workload
to help justify it.

Brian

> --D (who is now weeks behind on reviewing things and stressed out)
> 
> > Brian
> > 
> > > Though I agree that we should ideally be allocating & marking all
> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
> > > understand your reasoning better.
> > > 
> > > Thanks!
> > > -ritesh
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-23 12:15                       ` Brian Foster
@ 2023-05-23 13:43                         ` Ritesh Harjani
  2023-05-23 14:44                           ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-23 13:43 UTC (permalink / raw)
  To: Brian Foster, Darrick J. Wong
  Cc: Matthew Wilcox, Christoph Hellwig, linux-xfs, linux-fsdevel,
	Dave Chinner, Ojaswin Mujoo, Disha Goel, Aravinda Herle

Brian Foster <bfoster@redhat.com> writes:

> On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
>> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
>> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
>> > > Matthew Wilcox <willy@infradead.org> writes:
>> > >
>> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
>> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
>> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
>> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
>> > > >> > already have an iop?
>> > > >>
>> > > >> Yes.
>> > > >
>> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
>> > > > dirty flag on the folio, but depending on the architecture, we may
>> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
>> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
>> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
>> > > > dirty only the blocks corresponding to a single page at a time.
>> > > >
>> > > > This has potential for causing some nasty bugs, so I'm inclined to
>> > > > rule that if a folio is mmaped, then it's all dirty from any writable
>> > > > page fault.  The fact is that applications generally do not perform
>> > > > writes through mmap because the error handling story is so poor.
>> > > >
>> > > > There may be a different answer for anonymous memory, but that doesn't
>> > > > feel like my problem and shouldn't feel like any FS developer's problem.
>> > >
>> > > Although I am skeptical too about making the changes which Brian is
>> > > suggesting here, i.e. not making all the blocks of the folio dirty when we
>> > > are going to call ->dirty_folio -> filemap_dirty_folio() (mmapped writes).
>> > >
>> > > However, I am sorry but I couldn't completely follow your reasoning
>> > > above. I think what Brian is suggesting here is that
>> > > filemap_dirty_folio() should be similar to complete buffered overwrite
>> > > case where we do not allocate the iop at the ->write_begin() time.
>> > > Then at the writeback time we allocate an iop and mark all blocks dirty.
>> > >
>> >
>> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
>> > granularity of intra-folio faults) makes sense, but I'm also not sure
>> > what it has to do with the idea of being consistent with how full folio
>> > overwrites are implemented (between buffered or mapped writes). We're
>> > not changing historical dirtying granularity either way. I think this is
>> > just a bigger picture thought for future consideration as opposed to
>> > direct feedback on this patch..
>>
>> <nod>
>>
>> > > In a way it is a similar case for mmapped writes too, but my only
>> > > worry is the way mmapped writes work, and it makes more sense to keep
>> > > the dirty state of the folio and the per-block state within the iop in sync.
>> > > For that matter, we can even just make sure we always allocate an iop in
>> > > the complete overwrites case as well. I didn't change that code because
>> > > it was kept that way for uptodate state as well and based on one of your
>> > > inputs for complete overwrite case.
>> > >
>> >
>> > Can you elaborate on your concerns, out of curiosity?
>> >
>> > Either way, IMO it also seems reasonable to drop this behavior for the
>> > basic implementation of dirty tracking (so always allocate the iop for
>> > sub-folio tracking as you suggest above) and then potentially restore it
>> > as a separate optimization patch at the end of the series.
>>
>> Agree.
>>
>> > That said, I'm not totally clear why it exists in the first place, so
>> > that might warrant some investigation. Is it primarily to defer
>> > allocations out of task write/fault contexts?
>>
>> (Assuming by 'it' you mean the behavior where we don't unconditionally
>> allocate iops for blocksize < foliosize...)
>>
>> IIRC, reducing memory usage by eliding iop allocations unless
>> absolutely necessary for correctness was /my/ understanding
>> of why we don't always allocate the iop...
>>
>> > To optimize the case where pagecache is dirtied but truncated or
>> > something and thus never written back?
>>
>> ...because this might very well happen.  Write a temporary .o file to
>> the filesystem, then delete the whole thing before writeback ever gets
>> its hands on the file.
>>
>
> I don't think a simple temp write will trigger this scenario currently
> because the folios would have to be uptodate at the time of the write to
> bypass the iop alloc. I guess you'd have to read folios (even if backed
> by holes) first to start seeing the !iop case at writeback time (for bs
> != ps).
>
> That could change with these patches, but I was trying to reason about
> the intent of the existing code and whether there was some known reason
> to continue to try and defer the iop allocation as the need/complexity
> for deferring it grows with the addition of more (i.e. dirty) tracking.
>

Here is the 1st discussion/reasoning where the deferred iop allocation
in the readpage path got discussed [1].
And here is the discussion where I first pointed out the deferred
allocation in the writepage path [2]. IMO, it slipped in via the
mailing list discussions, with nothing in the commit messages or
comments.

[1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/
[2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/

>> > Is there any room for further improvement where the alloc could be
>> > avoided completely for folio overwrites instead of just deferred?
>>
>> Once writeback starts, though, we need the iop so that we can know when
>> all the writeback for that folio is actually complete, no matter how
>> many IOs might be in flight for that folio.  I don't know how you'd get
>> around this problem.
>>
>
> Ok. I noticed some kind of counter or something being updated last time
> I looked through that code, so it sounds like that's the reason the iop
> eventually needs to exist. Thanks.
>
>> > Was that actually the case at some point and then something later
>> > decided the iop was needed at writeback time, leading to current
>> > behavior?
>>
>> It's been in iomap since the beginning when we lifted it from xfs.
>>
>
> Not sure exactly what you're referring to here. iomap_writepage_map()
> would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
> ("iomap: Permit pages without an iop to enter writeback"), so I don't
> see how iop allocs were deferred (other than when bs == ps, obviously)
> prior to that.
>
> Heh, but I'm starting to get my wires crossed just trying to piece
> things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
> case is primarily a subtle side effect of the current writeback behavior
> being driven by uptodate status. I think it's probably wise to drop it
> at least initially, always alloc and dirty the appropriate iop ranges
> for sub-folio blocks, and then if you or others think there is value in
> the overwrite optimization to defer iop allocs, tack that on as a
> separate patch and try to be consistent between buffered and mapped
> writes.

Based on the discussion so far, I would like to think of this as follows:
We already have some sort of lazy iop allocation in the buffered I/O
path (discussed above). This patch series does not change that
behavior. For now I would like to keep the page mkwrite path as is
without any lazy iop allocation optimization.
I am ok to pick this optimization work as a separate series
because, IIUC, Christoph has some ideas on deferring iop allocations
even further [2] (from the link shared above).

Does that sound good?

>
> Darrick noted above that he also agrees with that separate patch
> approach. For me, I think it would also be useful to show that there is
> some measurable performance benefit on at least one reasonable workload
> to help justify it.

Agree that when we work on such optimizations as a separate series, it
will be worth measuring the performance benefits of that.


-ritesh

>
> Brian
>
>> --D (who is now weeks behind on reviewing things and stressed out)
>>
>> > Brian
>> >
>> > > Though I agree that we should ideally be allocating & marking all
>> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
>> > > understand your reasoning better.
>> > >
>> > > Thanks!
>> > > -ritesh
>> > >
>> >
>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-23 13:43                         ` Ritesh Harjani
@ 2023-05-23 14:44                           ` Brian Foster
  2023-05-23 15:02                             ` Ritesh Harjani
  0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2023-05-23 14:44 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Darrick J. Wong, Matthew Wilcox, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Dave Chinner, Ojaswin Mujoo, Disha Goel,
	Aravinda Herle

On Tue, May 23, 2023 at 07:13:04PM +0530, Ritesh Harjani wrote:
> Brian Foster <bfoster@redhat.com> writes:
> 
> > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
> >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
> >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> >> > > Matthew Wilcox <willy@infradead.org> writes:
> >> > >
> >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
> >> > > >> > already have an iop?
> >> > > >>
> >> > > >> Yes.
> >> > > >
> >> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> >> > > > dirty flag on the folio, but depending on the architecture, we may
> >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
> >> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
> >> > > > dirty only the blocks corresponding to a single page at a time.
> >> > > >
> >> > > > This has potential for causing some nasty bugs, so I'm inclined to
> >> > > > rule that if a folio is mmaped, then it's all dirty from any writable
> >> > > > page fault.  The fact is that applications generally do not perform
> >> > > > writes through mmap because the error handling story is so poor.
> >> > > >
> >> > > > There may be a different answer for anonymous memory, but that doesn't
> >> > > > feel like my problem and shouldn't feel like any FS developer's problem.
> >> > >
> >> > > Although I am skeptical too about making the changes which Brian is
> >> > > suggesting here, i.e. not making all the blocks of the folio dirty when we
> >> > > are going to call ->dirty_folio -> filemap_dirty_folio() (mmapped writes).
> >> > >
> >> > > However, I am sorry but I couldn't completely follow your reasoning
> >> > > above. I think what Brian is suggesting here is that
> >> > > filemap_dirty_folio() should be similar to complete buffered overwrite
> >> > > case where we do not allocate the iop at the ->write_begin() time.
> >> > > Then at the writeback time we allocate an iop and mark all blocks dirty.
> >> > >
> >> >
> >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
> >> > granularity of intra-folio faults) makes sense, but I'm also not sure
> >> > what it has to do with the idea of being consistent with how full folio
> >> > overwrites are implemented (between buffered or mapped writes). We're
> >> > not changing historical dirtying granularity either way. I think this is
> >> > just a bigger picture thought for future consideration as opposed to
> >> > direct feedback on this patch..
> >>
> >> <nod>
> >>
> >> > > In a way it is a similar case for mmapped writes too, but my only
> >> > > worry is the way mmapped writes work, and it makes more sense to keep
> >> > > the dirty state of the folio and the per-block state within the iop in sync.
> >> > > For that matter, we can even just make sure we always allocate an iop in
> >> > > the complete overwrites case as well. I didn't change that code because
> >> > > it was kept that way for uptodate state as well and based on one of your
> >> > > inputs for complete overwrite case.
> >> > >
> >> >
> >> > Can you elaborate on your concerns, out of curiosity?
> >> >
> >> > Either way, IMO it also seems reasonable to drop this behavior for the
> >> > basic implementation of dirty tracking (so always allocate the iop for
> >> > sub-folio tracking as you suggest above) and then potentially restore it
> >> > as a separate optimization patch at the end of the series.
> >>
> >> Agree.
> >>
> >> > That said, I'm not totally clear why it exists in the first place, so
> >> > that might warrant some investigation. Is it primarily to defer
> >> > allocations out of task write/fault contexts?
> >>
> >> (Assuming by 'it' you mean the behavior where we don't unconditionally
> >> allocate iops for blocksize < foliosize...)
> >>
> >> IIRC, reducing memory usage by eliding iop allocations unless
> >> absolutely necessary for correctness was /my/ understanding
> >> of why we don't always allocate the iop...
> >>
> >> > To optimize the case where pagecache is dirtied but truncated or
> >> > something and thus never written back?
> >>
> >> ...because this might very well happen.  Write a temporary .o file to
> >> the filesystem, then delete the whole thing before writeback ever gets
> >> its hands on the file.
> >>
> >
> > I don't think a simple temp write will trigger this scenario currently
> > because the folios would have to be uptodate at the time of the write to
> > bypass the iop alloc. I guess you'd have to read folios (even if backed
> > by holes) first to start seeing the !iop case at writeback time (for bs
> > != ps).
> >
> > That could change with these patches, but I was trying to reason about
> > the intent of the existing code and whether there was some known reason
> > to continue to try and defer the iop allocation as the need/complexity
> > for deferring it grows with the addition of more (i.e. dirty) tracking.
> >
> 
> Here is the 1st discussion/reasoning where the deferred iop allocation
> in the readpage path got discussed [1].
> And here is the discussion where I first pointed out the deferred
> allocation in the writepage path [2]. IMO, it slipped in via the
> mailing list discussions, with nothing in the commit messages or
> comments.
> 
> [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/
> [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/
> 
> >> > Is there any room for further improvement where the alloc could be
> >> > avoided completely for folio overwrites instead of just deferred?
> >>
> >> Once writeback starts, though, we need the iop so that we can know when
> >> all the writeback for that folio is actually complete, no matter how
> >> many IOs might be in flight for that folio.  I don't know how you'd get
> >> around this problem.
> >>
> >
> > Ok. I noticed some kind of counter or something being updated last time
> > I looked through that code, so it sounds like that's the reason the iop
> > eventually needs to exist. Thanks.
> >
> >> > Was that actually the case at some point and then something later
> >> > decided the iop was needed at writeback time, leading to current
> >> > behavior?
> >>
> >> It's been in iomap since the beginning when we lifted it from xfs.
> >>
> >
> > Not sure exactly what you're referring to here. iomap_writepage_map()
> > would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
> > ("iomap: Permit pages without an iop to enter writeback"), so I don't
> > see how iop allocs were deferred (other than when bs == ps, obviously)
> > prior to that.
> >
> > Heh, but I'm starting to get my wires crossed just trying to piece
> > things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
> > case is primarily a subtle side effect of the current writeback behavior
> > being driven by uptodate status. I think it's probably wise to drop it
> > at least initially, always alloc and dirty the appropriate iop ranges
> > for sub-folio blocks, and then if you or others think there is value in
> > the overwrite optimization to defer iop allocs, tack that on as a
> > separate patch and try to be consistent between buffered and mapped
> > writes.
> 
> Based on the discussion so far, I would like to think of this as follows:
> We already have some sort of lazy iop allocation in the buffered I/O
> path (discussed above). This patch series does not change that
> behavior. For now I would like to keep the page mkwrite path as is
> without any lazy iop allocation optimization.
> I am ok to pick this optimization work as a separate series
> because, IIUC, Christoph has some ideas on deferring iop allocations
> even further [2] (from the link shared above).
> 
> Does that sound good?
> 

Could you do that in two steps where the buffered I/O path variant is
replaced by explicit dirty tracking in the initial patch, and then is
restored by a subsequent patch in the same series? That would allow
keeping it around and documenting it explicitly in the commit log for
the separate patch, but IMO makes this a bit easier to review (and
potentially debug/bisect if needed down the road).

But I don't insist if that's too troublesome for some reason...

Brian

> >
> > Darrick noted above that he also agrees with that separate patch
> > approach. For me, I think it would also be useful to show that there is
> > some measurable performance benefit on at least one reasonable workload
> > to help justify it.
> 
> Agree that when we work on such optimizations as a separate series, it
> will be worth measuring the performance benefits of that.
> 
> 
> -ritesh
> 
> >
> > Brian
> >
> >> --D (who is now weeks behind on reviewing things and stressed out)
> >>
> >> > Brian
> >> >
> >> > > Though I agree that we should ideally be allocating & marking all
> >> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
> >> > > understand your reasoning better.
> >> > >
> >> > > Thanks!
> >> > > -ritesh
> >> > >
> >> >
> >>
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-23 14:44                           ` Brian Foster
@ 2023-05-23 15:02                             ` Ritesh Harjani
  2023-05-23 15:22                               ` Brian Foster
  0 siblings, 1 reply; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-23 15:02 UTC (permalink / raw)
  To: Brian Foster
  Cc: Darrick J. Wong, Matthew Wilcox, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Dave Chinner, Ojaswin Mujoo, Disha Goel,
	Aravinda Herle

Brian Foster <bfoster@redhat.com> writes:

> On Tue, May 23, 2023 at 07:13:04PM +0530, Ritesh Harjani wrote:
>> Brian Foster <bfoster@redhat.com> writes:
>> 
>> > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
>> >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
>> >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
>> >> > > Matthew Wilcox <willy@infradead.org> writes:
>> >> > >
>> >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
>> >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
>> >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
>> >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
>> >> > > >> > already have an iop?
>> >> > > >>
>> >> > > >> Yes.
>> >> > > >
>> >> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
>> >> > > > dirty flag on the folio, but depending on the architecture, we may
>> >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
>> >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
>> >> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
>> >> > > > dirty only the blocks corresponding to a single page at a time.
>> >> > > >
>> >> > > > This has potential for causing some nasty bugs, so I'm inclined to
>> >> > > > rule that if a folio is mmaped, then it's all dirty from any writable
>> >> > > > page fault.  The fact is that applications generally do not perform
>> >> > > > writes through mmap because the error handling story is so poor.
>> >> > > >
>> >> > > > There may be a different answer for anonymous memory, but that doesn't
>> >> > > > feel like my problem and shouldn't feel like any FS developer's problem.
>> >> > >
>> >> > > Although I am skeptical too about making the changes which Brian is
>> >> > > suggesting here, i.e. not making all the blocks of the folio dirty when we
>> >> > > are going to call ->dirty_folio -> filemap_dirty_folio() (mmapped writes).
>> >> > >
>> >> > > However, I am sorry but I couldn't completely follow your reasoning
>> >> > > above. I think what Brian is suggesting here is that
>> >> > > filemap_dirty_folio() should be similar to complete buffered overwrite
>> >> > > case where we do not allocate the iop at the ->write_begin() time.
>> >> > > Then at the writeback time we allocate an iop and mark all blocks dirty.
>> >> > >
>> >> >
>> >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
>> >> > granularity of intra-folio faults) makes sense, but I'm also not sure
>> >> > what it has to do with the idea of being consistent with how full folio
>> >> > overwrites are implemented (between buffered or mapped writes). We're
>> >> > not changing historical dirtying granularity either way. I think this is
>> >> > just a bigger picture thought for future consideration as opposed to
>> >> > direct feedback on this patch..
>> >>
>> >> <nod>
>> >>
>> >> > > In a way it is a similar case for mmapped writes too, but my only
>> >> > > worry is the way mmapped writes work, and it makes more sense to keep
>> >> > > the dirty state of the folio and the per-block state within the iop in sync.
>> >> > > For that matter, we can even just make sure we always allocate an iop in
>> >> > > the complete overwrites case as well. I didn't change that code because
>> >> > > it was kept that way for uptodate state as well and based on one of your
>> >> > > inputs for complete overwrite case.
>> >> > >
>> >> >
>> >> > Can you elaborate on your concerns, out of curiosity?
>> >> >
>> >> > Either way, IMO it also seems reasonable to drop this behavior for the
>> >> > basic implementation of dirty tracking (so always allocate the iop for
>> >> > sub-folio tracking as you suggest above) and then potentially restore it
>> >> > as a separate optimization patch at the end of the series.
>> >>
>> >> Agree.
>> >>
>> >> > That said, I'm not totally clear why it exists in the first place, so
>> >> > that might warrant some investigation. Is it primarily to defer
>> >> > allocations out of task write/fault contexts?
>> >>
>> >> (Assuming by 'it' you mean the behavior where we don't unconditionally
>> >> allocate iops for blocksize < foliosize...)
>> >>
>> >> IIRC, reducing memory usage by eliding iop allocations unless
>> >> absolutely necessary for correctness was /my/ understanding
>> >> of why we don't always allocate the iop...
>> >>
>> >> > To optimize the case where pagecache is dirtied but truncated or
>> >> > something and thus never written back?
>> >>
>> >> ...because this might very well happen.  Write a temporary .o file to
>> >> the filesystem, then delete the whole thing before writeback ever gets
>> >> its hands on the file.
>> >>
>> >
>> > I don't think a simple temp write will trigger this scenario currently
>> > because the folios would have to be uptodate at the time of the write to
>> > bypass the iop alloc. I guess you'd have to read folios (even if backed
>> > by holes) first to start seeing the !iop case at writeback time (for bs
>> > != ps).
>> >
>> > That could change with these patches, but I was trying to reason about
>> > the intent of the existing code and whether there was some known reason
>> > to continue to try and defer the iop allocation as the need/complexity
>> > for deferring it grows with the addition of more (i.e. dirty) tracking.
>> >
>> 
>> Here is the 1st discussion/reasoning where the deferred iop allocation
>> in the readpage path got discussed [1].
>> And here is the discussion where I first pointed out the deferred
>> allocation in the writepage path [2]. IMO, it slipped in via the
>> mailing list discussions, with nothing in the commit messages or
>> comments.
>> 
>> [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/
>> [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/
>> 
>> >> > Is there any room for further improvement where the alloc could be
>> >> > avoided completely for folio overwrites instead of just deferred?
>> >>
>> >> Once writeback starts, though, we need the iop so that we can know when
>> >> all the writeback for that folio is actually complete, no matter how
>> >> many IOs might be in flight for that folio.  I don't know how you'd get
>> >> around this problem.
>> >>
>> >
>> > Ok. I noticed some kind of counter or something being updated last time
>> > I looked through that code, so it sounds like that's the reason the iop
>> > eventually needs to exist. Thanks.
>> >
>> >> > Was that actually the case at some point and then something later
>> >> > decided the iop was needed at writeback time, leading to current
>> >> > behavior?
>> >>
>> >> It's been in iomap since the beginning when we lifted it from xfs.
>> >>
>> >
>> > Not sure exactly what you're referring to here. iomap_writepage_map()
>> > would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
>> > ("iomap: Permit pages without an iop to enter writeback"), so I don't
>> > see how iop allocs were deferred (other than when bs == ps, obviously)
>> > prior to that.
>> >
>> > Heh, but I'm starting to get my wires crossed just trying to piece
>> > things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
>> > case is primarily a subtle side effect of the current writeback behavior
>> > being driven by uptodate status. I think it's probably wise to drop it
>> > at least initially, always alloc and dirty the appropriate iop ranges
>> > for sub-folio blocks, and then if you or others think there is value in
>> > the overwrite optimization to defer iop allocs, tack that on as a
>> > separate patch and try to be consistent between buffered and mapped
>> > writes.
>> 
>> Based on the discussion so far, I would like to think of this as follows:
>> We already have some sort of lazy iop allocation in the buffered I/O
>> path (discussed above). This patch series does not change that
>> behavior. For now I would like to keep the page mkwrite path as is
>> without any lazy iop allocation optimization.
>> I am ok to pick this optimization work as a separate series
>> because, IIUC, Christoph has some ideas on deferring iop allocations
>> even further [2] (from the link shared above).
>> 
>> Does that sound good?
>> 
>
> Could you do that in two steps where the buffered I/O path variant is
> replaced by explicit dirty tracking in the initial patch, and then is
> restored by a subsequent patch in the same series? That would allow

Sorry, I couldn't follow it. Can you please elaborate?

So, what I was suggesting was - for the buffered I/O path we should
continue to have the lazy iop allocation scheme wherever possible.
The rest of the optimizations for further deferring the iop allocation,
including in the ->mkwrite path, should be dealt with separately in a
later patch series.

Also, we already have a separate patch in this series which defers the
iop allocation if the write completely overwrites the folio [1].
Earlier the behavior was to skip it entirely if the folio was
uptodate, but since we require it for per-block dirty tracking, we now
defer the iop allocation only if the write is a complete overwrite of the folio.

[1]: https://lore.kernel.org/linux-xfs/ZGzRX9YVkAYJGLqV@bfoster/T/#m048d0a097f7abb09da1c12c9a02afbcc3b9d39ee
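
As a rough illustration of that overwrite check (names and the exact
condition are simplified here, so treat it as a sketch of the idea
rather than the patch itself): the iop alloc can be deferred exactly
when the write covers every block of the folio, since all of them
become dirty anyway:

	#include <stdbool.h>
	#include <stdio.h>

	/* Illustrative model of the "complete overwrite" test: defer the iop
	 * allocation only when the write spans the whole folio. */
	static bool overwrites_whole_folio(long long pos, long long len,
					   long long folio_pos, long long folio_size)
	{
		return pos <= folio_pos && pos + len >= folio_pos + folio_size;
	}

	int main(void)
	{
		/* 4k folio at file offset 0: a full overwrite can defer the iop... */
		printf("full:    defer=%d\n",
		       overwrites_whole_folio(0, 4096, 0, 4096));
		/* ...while a sub-folio write must allocate it up front so the
		 * dirty blocks can be tracked. */
		printf("partial: defer=%d\n",
		       overwrites_whole_folio(0, 1024, 0, 4096));
		return 0;
	}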


-ritesh

> keeping it around and documenting it explicitly in the commit log for
> the separate patch, but IMO makes this a bit easier to review (and
> potentially debug/bisect if needed down the road).
>
> But I don't insist if that's too troublesome for some reason...
>
> Brian
>
>> >
>> > Darrick noted above that he also agrees with that separate patch
>> > approach. For me, I think it would also be useful to show that there is
>> > some measurable performance benefit on at least one reasonable workload
>> > to help justify it.
>> 
>> Agree that when we work on such optimizations as a separate series, it
>> will be worth measuring the performance benefits of that.
>> 
>> 
>> -ritesh
>> 
>> >
>> > Brian
>> >
>> >> --D (who is now weeks behind on reviewing things and stressed out)
>> >>
>> >> > Brian
>> >> >
>> >> > > Though I agree that we should ideally be allocating & marking all
>> >> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
>> >> > > understand your reasoning better.
>> >> > >
>> >> > > Thanks!
>> >> > > -ritesh
>> >> > >
>> >> >
>> >>
>> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-23 15:02                             ` Ritesh Harjani
@ 2023-05-23 15:22                               ` Brian Foster
  2023-05-23 15:38                                 ` Ritesh Harjani
  2023-05-23 15:59                                 ` Matthew Wilcox
  0 siblings, 2 replies; 44+ messages in thread
From: Brian Foster @ 2023-05-23 15:22 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Darrick J. Wong, Matthew Wilcox, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Dave Chinner, Ojaswin Mujoo, Disha Goel,
	Aravinda Herle

On Tue, May 23, 2023 at 08:32:42PM +0530, Ritesh Harjani wrote:
> Brian Foster <bfoster@redhat.com> writes:
> 
> > On Tue, May 23, 2023 at 07:13:04PM +0530, Ritesh Harjani wrote:
> >> Brian Foster <bfoster@redhat.com> writes:
> >> 
> >> > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
> >> >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
> >> >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> >> >> > > Matthew Wilcox <willy@infradead.org> writes:
> >> >> > >
> >> >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> >> >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> >> >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> >> >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
> >> >> > > >> > already have an iop?
> >> >> > > >>
> >> >> > > >> Yes.
> >> >> > > >
> >> >> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> >> >> > > > dirty flag on the folio, but depending on the architecture, we may
> >> >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> >> >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
> >> >> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
> >> >> > > > dirty only the blocks corresponding to a single page at a time.
> >> >> > > >
> >> >> > > > This has potential for causing some nasty bugs, so I'm inclined to
> >> >> > > > rule that if a folio is mmaped, then it's all dirty from any writable
> >> >> > > > page fault.  The fact is that applications generally do not perform
> >> >> > > > writes through mmap because the error handling story is so poor.
> >> >> > > >
> >> >> > > > There may be a different answer for anonymous memory, but that doesn't
> >> >> > > > feel like my problem and shouldn't feel like any FS developer's problem.
> >> >> > >
> >> >> > > Although I am skeptical too about making the changes which Brian is
> >> >> > > suggesting here, i.e. not making all the blocks of the folio dirty when we
> >> >> > > are going to call ->dirty_folio -> filemap_dirty_folio() (mmapped writes).
> >> >> > >
> >> >> > > However, I am sorry but I couldn't completely follow your reasoning
> >> >> > > above. I think what Brian is suggesting here is that
> >> >> > > filemap_dirty_folio() should be similar to complete buffered overwrite
> >> >> > > case where we do not allocate the iop at the ->write_begin() time.
> >> >> > > Then at the writeback time we allocate an iop and mark all blocks dirty.
> >> >> > >
> >> >> >
> >> >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
> >> >> > granularity of intra-folio faults) makes sense, but I'm also not sure
> >> >> > what it has to do with the idea of being consistent with how full folio
> >> >> > overwrites are implemented (between buffered or mapped writes). We're
> >> >> > not changing historical dirtying granularity either way. I think this is
> >> >> > just a bigger picture thought for future consideration as opposed to
> >> >> > direct feedback on this patch..
> >> >>
> >> >> <nod>
> >> >>
> >> >> > > In a way it is a similar case for mmapped writes too, but my only
> >> >> > > worry is the way mmapped writes work, and it makes more sense to keep
> >> >> > > the dirty state of the folio and the per-block state within the iop in sync.
> >> >> > > For that matter, we can even just make sure we always allocate an iop in
> >> >> > > the complete overwrites case as well. I didn't change that code because
> >> >> > > it was kept that way for uptodate state as well and based on one of your
> >> >> > > inputs for complete overwrite case.
> >> >> > >
> >> >> >
> >> >> > Can you elaborate on your concerns, out of curiosity?
> >> >> >
> >> >> > Either way, IMO it also seems reasonable to drop this behavior for the
> >> >> > basic implementation of dirty tracking (so always allocate the iop for
> >> >> > sub-folio tracking as you suggest above) and then potentially restore it
> >> >> > as a separate optimization patch at the end of the series.
> >> >>
> >> >> Agree.
> >> >>
> >> >> > That said, I'm not totally clear why it exists in the first place, so
> >> >> > that might warrant some investigation. Is it primarily to defer
> >> >> > allocations out of task write/fault contexts?
> >> >>
> >> >> (Assuming by 'it' you mean the behavior where we don't unconditionally
> >> >> allocate iops for blocksize < foliosize...)
> >> >>
> >> >> IIRC, reducing memory usage by eliding iop allocations unless
> >> >> absolutely necessary for correctness was /my/ understanding
> >> >> of why we don't always allocate the iop...
> >> >>
> >> >> > To optimize the case where pagecache is dirtied but truncated or
> >> >> > something and thus never written back?
> >> >>
> >> >> ...because this might very well happen.  Write a temporary .o file to
> >> >> the filesystem, then delete the whole thing before writeback ever gets
> >> >> its hands on the file.
> >> >>
> >> >
> >> > I don't think a simple temp write will trigger this scenario currently
> >> > because the folios would have to be uptodate at the time of the write to
> >> > bypass the iop alloc. I guess you'd have to read folios (even if backed
> >> > by holes) first to start seeing the !iop case at writeback time (for bs
> >> > != ps).
> >> >
> >> > That could change with these patches, but I was trying to reason about
> >> > the intent of the existing code and whether there was some known reason
> >> > to continue to try and defer the iop allocation as the need/complexity
> >> > for deferring it grows with the addition of more (i.e. dirty) tracking.
> >> >
> >> 
> >> Here is the 1st discussion/reasoning where the deferred iop allocation
> >> in the readpage path got discussed [1].
> >> And here is the discussion where I first pointed out the deferred
> >> allocation in the writepage path [2]. IMO, it slipped in via the
> >> mailing list discussions, with nothing in the commit messages or
> >> comments.
> >> 
> >> [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/
> >> [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/
> >> 
> >> >> > Is there any room for further improvement where the alloc could be
> >> >> > avoided completely for folio overwrites instead of just deferred?
> >> >>
> >> >> Once writeback starts, though, we need the iop so that we can know when
> >> >> all the writeback for that folio is actually complete, no matter how
> >> >> many IOs might be in flight for that folio.  I don't know how you'd get
> >> >> around this problem.
> >> >>
> >> >
> >> > Ok. I noticed some kind of counter or something being updated last time
> >> > I looked through that code, so it sounds like that's the reason the iop
> >> > eventually needs to exist. Thanks.
> >> >
> >> >> > Was that actually the case at some point and then something later
> >> >> > decided the iop was needed at writeback time, leading to current
> >> >> > behavior?
> >> >>
> >> >> It's been in iomap since the beginning when we lifted it from xfs.
> >> >>
> >> >
> >> > Not sure exactly what you're referring to here. iomap_writepage_map()
> >> > would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
> >> > ("iomap: Permit pages without an iop to enter writeback"), so I don't
> >> > see how iop allocs were deferred (other than when bs == ps, obviously)
> >> > prior to that.
> >> >
> >> > Heh, but I'm starting to get my wires crossed just trying to piece
> >> > things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
> >> > case is primarily a subtle side effect of the current writeback behavior
> >> > being driven by uptodate status. I think it's probably wise to drop it
> >> > at least initially, always alloc and dirty the appropriate iop ranges
> >> > for sub-folio blocks, and then if you or others think there is value in
> >> > the overwrite optimization to defer iop allocs, tack that on as a
> >> > separate patch and try to be consistent between buffered and mapped
> >> > writes.
> >> 
> >> Based on the discussion so far, I would like to think of this as follows:
> >> We already have some sort of lazy iop allocation in the buffered I/O
> >> path (discussed above). This patch series does not change that
> >> behavior. For now I would like to keep the page mkwrite path as is
> >> without any lazy iop allocation optimization.
> >> I am ok to pick this optimization work as a separate series
> >> because, IIUC, Christoph has some ideas on deferring iop allocations
> >> even further [2] (from the link shared above).
> >> 
> >> Does that sound good?
> >> 
> >
> > Could you do that in two steps where the buffered I/O path variant is
> > replaced by explicit dirty tracking in the initial patch, and then is
> > restored by a subsequent patch in the same series? That would allow
> 
> Sorry, I couldn't follow it. Can you please elaborate?
> 

Sorry for the confusion...

> So, what I was suggesting was - for the buffered I/O path we should
> continue to have the lazy iop allocation scheme wherever possible.
> The rest of the optimizations for further deferring the iop allocation,
> including in the ->mkwrite path, should be dealt with separately in a
> later patch series.
> 

Yup, agree.

> Also, we already have a separate patch in this series which defers the
> iop allocation if the write completely overwrites the folio [1].
> Earlier the behavior was to skip it entirely if the folio was
> uptodate, but since we require it for per-block dirty tracking, we now
> defer the iop allocation only if the write is a complete overwrite of the folio.
> 

That is a preparatory patch before iop dirty tracking is enabled in patch
5, right? I was mainly just suggesting to make the overwrite checking
part of this patch come after dirty tracking is enabled (as a small
optimization patch) rather than before.

I don't want to harp on it if that's difficult or still doesn't make
sense for some reason. I'll take a closer look the next go around when I
have a bit more time and just send a diff if it seems it can be done
cleanly..

Brian

> [1]: https://lore.kernel.org/linux-xfs/ZGzRX9YVkAYJGLqV@bfoster/T/#m048d0a097f7abb09da1c12c9a02afbcc3b9d39ee
> 
> 
> -ritesh
> 
> > keeping it around and documenting it explicitly in the commit log for
> > the separate patch, but IMO makes this a bit easier to review (and
> > potentially debug/bisect if needed down the road).
> >
> > But I don't insist if that's too troublesome for some reason...
> >
> > Brian
> >
> >> >
> >> > Darrick noted above that he also agrees with that separate patch
> >> > approach. For me, I think it would also be useful to show that there is
> >> > some measurable performance benefit on at least one reasonable workload
> >> > to help justify it.
> >> 
> >> Agree that when we work on such optimizations as a separate series, it
> >> will be worth measuring the performance benefits of that.
> >> 
> >> 
> >> -ritesh
> >> 
> >> >
> >> > Brian
> >> >
> >> >> --D (who is now weeks behind on reviewing things and stressed out)
> >> >>
> >> >> > Brian
> >> >> >
> >> >> > > Though I agree that we should ideally be allocating & marking all
> >> >> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
> >> >> > > understand your reasoning better.
> >> >> > >
> >> >> > > Thanks!
> >> >> > > -ritesh
> >> >> > >
> >> >> >
> >> >>
> >> 
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-23 15:22                               ` Brian Foster
@ 2023-05-23 15:38                                 ` Ritesh Harjani
  2023-05-23 15:59                                 ` Matthew Wilcox
  1 sibling, 0 replies; 44+ messages in thread
From: Ritesh Harjani @ 2023-05-23 15:38 UTC (permalink / raw)
  To: Brian Foster
  Cc: Darrick J. Wong, Matthew Wilcox, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Dave Chinner, Ojaswin Mujoo, Disha Goel,
	Aravinda Herle

Brian Foster <bfoster@redhat.com> writes:

> On Tue, May 23, 2023 at 08:32:42PM +0530, Ritesh Harjani wrote:
>> Brian Foster <bfoster@redhat.com> writes:
>>
>> > On Tue, May 23, 2023 at 07:13:04PM +0530, Ritesh Harjani wrote:
>> >> Brian Foster <bfoster@redhat.com> writes:
>> >>
>> >> > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
>> >> >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
>> >> >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
>> >> >> > > Matthew Wilcox <willy@infradead.org> writes:
>> >> >> > >
>> >> >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
>> >> >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
>> >> >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
>> >> >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
>> >> >> > > >> > already have an iop?
>> >> >> > > >>
>> >> >> > > >> Yes.
>> >> >> > > >
>> >> >> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
>> >> >> > > > dirty flag on the folio, but depending on the architecture, we may
>> >> >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
>> >> >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
>> >> >> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
>> >> >> > > > dirty only the blocks corresponding to a single page at a time.
>> >> >> > > >
>> >> >> > > > This has potential for causing some nasty bugs, so I'm inclined to
>> >> >> > > > rule that if a folio is mmaped, then it's all dirty from any writable
>> >> >> > > > page fault.  The fact is that applications generally do not perform
>> >> >> > > > writes through mmap because the error handling story is so poor.
>> >> >> > > >
>> >> >> > > > There may be a different answer for anonymous memory, but that doesn't
>> >> >> > > > feel like my problem and shouldn't feel like any FS developer's problem.
>> >> >> > >
>> >> >> > > Although I am skeptical too about making the changes which Brian is
>> >> >> > > suggesting here, i.e. not making all the blocks of the folio dirty when we
>> >> >> > > are going to call ->dirty_folio -> filemap_dirty_folio() (mmapped writes).
>> >> >> > >
>> >> >> > > However, I am sorry but I couldn't completely follow your reasoning
>> >> >> > > above. I think what Brian is suggesting here is that
>> >> >> > > filemap_dirty_folio() should be similar to complete buffered overwrite
>> >> >> > > case where we do not allocate the iop at the ->write_begin() time.
>> >> >> > > Then at the writeback time we allocate an iop and mark all blocks dirty.
>> >> >> > >
>> >> >> >
>> >> >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
>> >> >> > granularity of intra-folio faults) makes sense, but I'm also not sure
>> >> >> > what it has to do with the idea of being consistent with how full folio
>> >> >> > overwrites are implemented (between buffered or mapped writes). We're
>> >> >> > not changing historical dirtying granularity either way. I think this is
>> >> >> > just a bigger picture thought for future consideration as opposed to
>> >> >> > direct feedback on this patch..
>> >> >>
>> >> >> <nod>
>> >> >>
>> >> >> > > In a way it is a similar case for mmapped writes too, but my only
>> >> >> > > worry is the way mmapped writes work, and it makes more sense to keep
>> >> >> > > the dirty state of the folio and the per-block state within the iop in sync.
>> >> >> > > For that matter, we can even just make sure we always allocate an iop in
>> >> >> > > the complete overwrites case as well. I didn't change that code because
>> >> >> > > it was kept that way for uptodate state as well and based on one of your
>> >> >> > > inputs for complete overwrite case.
>> >> >> > >
>> >> >> >
>> >> >> > Can you elaborate on your concerns, out of curiosity?
>> >> >> >
>> >> >> > Either way, IMO it also seems reasonable to drop this behavior for the
>> >> >> > basic implementation of dirty tracking (so always allocate the iop for
>> >> >> > sub-folio tracking as you suggest above) and then potentially restore it
>> >> >> > as a separate optimization patch at the end of the series.
>> >> >>
>> >> >> Agree.
>> >> >>
>> >> >> > That said, I'm not totally clear why it exists in the first place, so
>> >> >> > that might warrant some investigation. Is it primarily to defer
>> >> >> > allocations out of task write/fault contexts?
>> >> >>
>> >> >> (Assuming by 'it' you mean the behavior where we don't unconditionally
>> >> >> allocate iops for blocksize < foliosize...)
>> >> >>
>> >> >> IIRC, reducing memory usage by eliding iop allocations unless
>> >> >> absolutely necessary for correctness was /my/ understanding
>> >> >> of why we don't always allocate the iop...
>> >> >>
>> >> >> > To optimize the case where pagecache is dirtied but truncated or
>> >> >> > something and thus never written back?
>> >> >>
>> >> >> ...because this might very well happen.  Write a temporary .o file to
>> >> >> the filesystem, then delete the whole thing before writeback ever gets
>> >> >> its hands on the file.
>> >> >>
>> >> >
>> >> > I don't think a simple temp write will trigger this scenario currently
>> >> > because the folios would have to be uptodate at the time of the write to
>> >> > bypass the iop alloc. I guess you'd have to read folios (even if backed
>> >> > by holes) first to start seeing the !iop case at writeback time (for bs
>> >> > != ps).
>> >> >
>> >> > That could change with these patches, but I was trying to reason about
>> >> > the intent of the existing code and whether there was some known reason
>> >> > to continue to try and defer the iop allocation as the need/complexity
>> >> > for deferring it grows with the addition of more (i.e. dirty) tracking.
>> >> >
>> >>
>> >> Here is the 1st discussion/reasoning where the deferred iop allocation
>> >> in the readpage path got discussed [1].
>> >> And here is the discussion where I first pointed out the deferred
>> >> allocation in the writepage path [2]. IMO, it slipped in via the
>> >> mailing list discussions, with nothing in the commit messages or
>> >> comments.
>> >>
>> >> [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/
>> >> [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/
>> >>
>> >> >> > Is there any room for further improvement where the alloc could be
>> >> >> > avoided completely for folio overwrites instead of just deferred?
>> >> >>
>> >> >> Once writeback starts, though, we need the iop so that we can know when
>> >> >> all the writeback for that folio is actually complete, no matter how
>> >> >> many IOs might be in flight for that folio.  I don't know how you'd get
>> >> >> around this problem.
>> >> >>
>> >> >
>> >> > Ok. I noticed some kind of counter or something being updated last time
>> >> > I looked through that code, so it sounds like that's the reason the iop
>> >> > eventually needs to exist. Thanks.
>> >> >
>> >> >> > Was that actually the case at some point and then something later
>> >> >> > decided the iop was needed at writeback time, leading to current
>> >> >> > behavior?
>> >> >>
>> >> >> It's been in iomap since the beginning when we lifted it from xfs.
>> >> >>
>> >> >
>> >> > Not sure exactly what you're referring to here. iomap_writepage_map()
>> >> > would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
>> >> > ("iomap: Permit pages without an iop to enter writeback"), so I don't
>> >> > see how iop allocs were deferred (other than when bs == ps, obviously)
>> >> > prior to that.
>> >> >
>> >> > Heh, but I'm starting to get my wires crossed just trying to piece
>> >> > things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
>> >> > case is primarily a subtle side effect of the current writeback behavior
>> >> > being driven by uptodate status. I think it's probably wise to drop it
>> >> > at least initially, always alloc and dirty the appropriate iop ranges
>> >> > for sub-folio blocks, and then if you or others think there is value in
>> >> > the overwrite optimization to defer iop allocs, tack that on as a
>> >> > separate patch and try to be consistent between buffered and mapped
>> >> > writes.
>> >>
>> >> Based on the discussion so far, I would like to think of this as follows:
>> >> We already have some sort of lazy iop allocation in the buffered I/O
>> >> path (discussed above). This patch series does not change that
>> >> behavior. For now I would like to keep the page mkwrite path as is
>> >> without any lazy iop allocation optimization.
>> >> I am ok to pick this optimization work as a separate series
>> >> because, IIUC, Christoph has some ideas on deferring iop allocations
>> >> even further [2] (from the link shared above).
>> >>
>> >> Does that sound good?
>> >>
>> >
>> > Could you do that in two steps where the buffered I/O path variant is
>> > replaced by explicit dirty tracking in the initial patch, and then is
>> > restored by a subsequent patch in the same series? That would allow
>>
>> Sorry, I couldn't follow it. Can you please elaborate?
>>
>
> Sorry for the confusion...
>
>> So, what I was suggesting was - for the buffered I/O path we should
>> continue to have the lazy iop allocation scheme wherever possible.
>> The rest of the optimizations for further deferring the iop allocation,
>> including in the ->mkwrite path, should be dealt with separately in a
>> later patch series.
>>
>
> Yup, agree.
>
>> Also, we already have a separate patch in this series which defers the
>> iop allocation if the write completely overwrites the folio [1].
>> Earlier the behavior was to skip it entirely if the folio was
>> uptodate, but since we require it for per-block dirty tracking, we now
>> defer the iop allocation only if the write is a complete overwrite of the folio.
>>
>
> That is a preparatory patch before iop dirty tracking is enabled in patch
> 5, right?

Yes, right.


> I was mainly just suggesting to make the overwrite checking
> part of this patch come after dirty tracking is enabled (as a small
> optimization patch) rather than before.
>
> I don't want to harp on it if that's difficult or still doesn't make
> sense for some reason. I'll take a closer look the next go around when I
> have a bit more time and just send a diff if it seems it can be done
> cleanly..

It should not be difficult. Will make the change in next rev.

Thanks for the help. Seems a lot of things have been sorted out now :)

I will work on the review comments, finish off a few of the remaining
todos, get more testing done, and spin off the next revision.


-ritesh

>
> Brian
>
>> [1]: https://lore.kernel.org/linux-xfs/ZGzRX9YVkAYJGLqV@bfoster/T/#m048d0a097f7abb09da1c12c9a02afbcc3b9d39ee
>>
>>
>> -ritesh
>>
>> > keeping it around and documenting it explicitly in the commit log for
>> > the separate patch, but IMO makes this a bit easier to review (and
>> > potentially debug/bisect if needed down the road).
>> >
>> > But I don't insist if that's too troublesome for some reason...
>> >
>> > Brian
>> >
>> >> >
>> >> > Darrick noted above that he also agrees with that separate patch
>> >> > approach. For me, I think it would also be useful to show that there is
>> >> > some measurable performance benefit on at least one reasonable workload
>> >> > to help justify it.
>> >>
>> >> Agree that when we work on such optimizations as a separate series, it
>> >> will be worth measuring the performance benefits of that.
>> >>
>> >>
>> >> -ritesh
>> >>
>> >> >
>> >> > Brian
>> >> >
>> >> >> --D (who is now weeks behind on reviewing things and stressed out)
>> >> >>
>> >> >> > Brian
>> >> >> >
>> >> >> > > Though I agree that we should ideally be allocating & marking all
>> >> >> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
>> >> >> > > understand your reasoning better.
>> >> >> > >
>> >> >> > > Thanks!
>> >> >> > > -ritesh
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>>
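
The ->dirty_folio() approach referred to in the quote above would look
roughly like the sketch below, assuming helpers named along the lines of
this series' iop_alloc() and iop_set_range_dirty(). This is an
illustration, not the actual patch:

	bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
	{
		struct inode *inode = mapping->host;
		size_t len = folio_size(folio);

		/*
		 * Attach per-block state as soon as the folio is dirtied
		 * and mark all of its blocks dirty up front, so writeback
		 * never sees a dirty folio without an iop.
		 */
		iop_alloc(inode, folio, 0);
		iop_set_range_dirty(inode, folio, 0, len);
		return filemap_dirty_folio(mapping, folio);
	}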


* Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
  2023-05-23 15:22                               ` Brian Foster
  2023-05-23 15:38                                 ` Ritesh Harjani
@ 2023-05-23 15:59                                 ` Matthew Wilcox
  1 sibling, 0 replies; 44+ messages in thread
From: Matthew Wilcox @ 2023-05-23 15:59 UTC (permalink / raw)
  To: Brian Foster
  Cc: Ritesh Harjani, Darrick J. Wong, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Dave Chinner, Ojaswin Mujoo, Disha Goel,
	Aravinda Herle

On Tue, May 23, 2023 at 11:22:33AM -0400, Brian Foster wrote:
> That is a preparatory patch before iop dirty tracking is enabled in patch
> 5, right? I was mainly just suggesting to make the overwrite checking
> part of this patch come after dirty tracking is enabled (as a small
> optimization patch) rather than before.
> 
> I don't want to harp on it if that's difficult or still doesn't make
> sense for some reason. I'll take a closer look on the next go-around when
> I have a bit more time and just send a diff if it seems it can be done
> cleanly...

Honestly, I think the way the patchset is structured now is fine.
There are always multiple ways to structure a set of patches, and I don't
think we need to be too concerned about the order things are done in,
as long as they get done.


end of thread

Thread overview: 44+ messages
2023-05-07 19:27 [RFCv5 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
2023-05-07 19:27 ` [RFCv5 1/5] iomap: Rename iomap_page_create/release() to iop_alloc/free() Ritesh Harjani (IBM)
2023-05-18  6:13   ` Christoph Hellwig
2023-05-19 15:01     ` Ritesh Harjani
2023-05-07 19:27 ` [RFCv5 2/5] iomap: Refactor iop_set_range_uptodate() function Ritesh Harjani (IBM)
2023-05-15 15:09   ` Brian Foster
2023-05-16 10:12     ` Ritesh Harjani
2023-05-18  6:16   ` Christoph Hellwig
2023-05-19 15:03     ` Ritesh Harjani
2023-05-07 19:27 ` [RFCv5 3/5] iomap: Add iop's uptodate state handling functions Ritesh Harjani (IBM)
2023-05-15 15:10   ` Brian Foster
2023-05-16 10:14     ` Ritesh Harjani
2023-05-18  6:18   ` Christoph Hellwig
2023-05-19 15:07     ` Ritesh Harjani
2023-05-23  6:00       ` Christoph Hellwig
2023-05-07 19:27 ` [RFCv5 4/5] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
2023-05-18  6:21   ` Christoph Hellwig
2023-05-19 15:18     ` Ritesh Harjani
2023-05-19 15:53       ` Matthew Wilcox
2023-05-22  4:05         ` Ritesh Harjani
2023-05-07 19:28 ` [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
     [not found]   ` <CGME20230515081618eucas1p1c852fec3ba7a42ee7094248c30ff5978@eucas1p1.samsung.com>
2023-05-15  8:16     ` Pankaj Raghav
2023-05-15  8:31       ` Ritesh Harjani
2023-05-15 13:23         ` Pankaj Raghav
2023-05-15 15:15   ` Brian Foster
2023-05-16 14:49     ` Ritesh Harjani
2023-05-16 19:29       ` Brian Foster
2023-05-17 15:20         ` Ritesh Harjani
2023-05-17 18:48           ` Brian Foster
2023-05-18 13:23             ` Christoph Hellwig
2023-05-18 16:15               ` Matthew Wilcox
2023-05-22  4:33                 ` Ritesh Harjani
2023-05-22  4:48                   ` Matthew Wilcox
2023-05-22 11:18                   ` Brian Foster
2023-05-23  0:56                     ` Darrick J. Wong
2023-05-23 12:15                       ` Brian Foster
2023-05-23 13:43                         ` Ritesh Harjani
2023-05-23 14:44                           ` Brian Foster
2023-05-23 15:02                             ` Ritesh Harjani
2023-05-23 15:22                               ` Brian Foster
2023-05-23 15:38                                 ` Ritesh Harjani
2023-05-23 15:59                                 ` Matthew Wilcox
2023-05-18 13:27   ` Christoph Hellwig
2023-05-19 16:08     ` Ritesh Harjani
