* [RFCv4 0/3] iomap: Support subpage size dirty tracking to improve write performance
@ 2023-05-04 14:51 Ritesh Harjani (IBM)
From: Ritesh Harjani (IBM) @ 2023-05-04 14:51 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM)

Hello All,

Please find RFCv4 of the series adding subpage size dirty tracking support to
iomap buffered-io.

RFCv3 -> RFCv4
===============
1. Addressed most of the review comments from Dave and Matthew on rfcv3.
2. Note: I have added the Reviewed-by from Dave on Patch-2, which he provided
   for the mechanical changes.
3. We now have two functions, iomap_iop_set_range() and iomap_iop_clear_range(),
   after addressing Dave's review comment to add a common wrapper for the
   uptodate and dirty bits rather than a different function for each.
4. Addressed a problem reported by Brian for a short write case with delalloc
   release. This is handled in patch-3, in iomap_write_delalloc_scan(), and is
   the major functional change from rfcv3 (a sketch of the new punch-out loop
   follows). I tested it with the xfs_io -f reproducer which Brian provided.
   Before these changes, the kernel hit a BUG_ON during unmount with that
   reproducer; this is fixed with the changes added in v4.
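
   The gist of that fix (sketch only; the full diff is in patch-3 below):
   blocks which are uptodate but not dirty must have their delalloc extents
   punched out, otherwise the delalloc blocks leak past unmount.

	/* for each fs block of the folio inside [start_byte, end_byte) */
	for (i = first; i <= last; i++) {
		if (!iop_test_dirty(iop, i, nr_blocks))
			punch(inode, i << blkbits, 1 << blkbits);
	}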

	i.e. After v4
	=================
	root@ubuntu# ./xfs_io -fc "truncate 4k" -c "mmap 0 4k" -c "mread 0 4k" -c "pwrite 0 1" -c "pwrite -f 2k 1" -c fsync /mnt1/tmnt/testfile
	wrote 1/1 bytes at offset 0
	1.000000 bytes, 1 ops; 0.0001 sec (7.077 KiB/sec and 7246.3768 ops/sec)
	pwrite: Bad address
	root@ubuntu# ./xfs_io -c "fiemap -v" /mnt1/tmnt/testfile
	/mnt1/tmnt/testfile:
	 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
	   0: [0..1]:          22..23               2   0x1
	   1: [2..7]:          hole                 6
	root@ubuntu# filefrag -v /mnt1/tmnt/testfile
	Filesystem type is: 58465342
	File size of /mnt1/tmnt/testfile is 4096 (4 blocks of 1024 bytes)
	 ext:     logical_offset:        physical_offset: length:   expected: flags:
	   0:        0..       0:         11..        11:      1:             last
	/mnt1/tmnt/testfile: 1 extent found
	root@ubuntu# umount /mnt1/tmnt
	root@ubuntu#

	Before v4
	===========
	root@ubuntu# mount /dev/loop8 /mnt1/test
	root@ubuntu# ./xfs_io -fc "truncate 4k" -c "mmap 0 4k" -c "mread 0 4k" -c "pwrite 0 1" -c "pwrite -f 2k 1" -c fsync /mnt1/test/testfile
	wrote 1/1 bytes at offset 0
	1.000000 bytes, 1 ops; 0.0000 sec (10.280 KiB/sec and 10526.3158 ops/sec)
	pwrite: Bad address
	root@ubuntu# ./xfs_io -c "fiemap -v" /mnt1/test/testfile
	/mnt1/test/testfile:
	 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
	   0: [0..1]:          22..23               2   0x0
	   1: [2..3]:          hole                 2
	   2: [4..5]:          0..1                 2   0x7
	   3: [6..7]:          hole                 2
	root@ubuntu# filefrag -v /mnt1/test/testfile
	Filesystem type is: 58465342
	File size of /mnt1/test/testfile is 4096 (4 blocks of 1024 bytes)
	 ext:     logical_offset:        physical_offset: length:   expected: flags:
	   0:        0..       0:         11..        11:      1:
	   1:        2..       2:          0..         0:      0:             last,unknown_loc,delalloc
	/mnt1/test/testfile: 2 extents found
	root@ubuntu# umount /mnt1/test
	<dmesg snip>
	[  156.581188] XFS (loop8): Unmounting Filesystem 7889507d-fc7f-4a1c-94d5-06797f2cc790
	[  156.584455] XFS (loop8): ino 43 data fork has delalloc extent at [0x2:0x1]
	[  156.587847] XFS: Assertion failed: 0, file: fs/xfs/xfs_icache.c, line: 1816
	[  156.591675] ------------[ cut here ]------------
	[  156.594133] kernel BUG at fs/xfs/xfs_message.c:102!
	[  156.596669] invalid opcode: 0000 [#1] PREEMPT SMP PTI
	[  156.599277] CPU: 7 PID: 435 Comm: kworker/7:5 Not tainted 6.3.0-rc6-xfstests-00003-g99a844f4e411-dirty #105
	[  156.603721] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
	[  156.608701] Workqueue: xfs-inodegc/loop8 xfs_inodegc_worker
	[  156.611426] RIP: 0010:assfail+0x38/0x40
	[  156.613329] Code: c9 48 c7 c2 b8 87 a5 82 48 89 f1 48 89 fe 48 c7 c7 1f 81 96 82 e8 68 fd ff ff 80 3d f1 03 98 01 00 75 07 0f 0b c3 cc cc0
	[  156.621642] RSP: 0018:ffffc90001a33e08 EFLAGS: 00010202
	[  156.624012] RAX: 0000000000000000 RBX: ffff888320bd0000 RCX: 0000000000000000
	[  156.627238] RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffff8296811f
	[  156.630430] RBP: ffff888320bd0000 R08: 0000000000000000 R09: 000000000000000a
	[  156.633573] R10: 000000000000000a R11: f000000000000000 R12: fffffffffffffeb0
	[  156.636732] R13: 0000000000000000 R14: ffff8881d49d7000 R15: ffffe8ffffdf6d05
	[  156.639042] FS:  0000000000000000(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000
	[  156.641957] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[  156.644666] CR2: 00007ffff7daf1f3 CR3: 00000001057f6006 CR4: 0000000000170ee0
	[  156.646981] Call Trace:
	[  156.647951]  <TASK>
	[  156.648904]  xfs_inodegc_set_reclaimable+0x15b/0x160
	[  156.651270]  xfs_inodegc_worker+0x95/0x1d0
	[  156.653202]  process_one_work+0x274/0x550
	[  156.655305]  worker_thread+0x4f/0x300
	[  156.657081]  ? __pfx_worker_thread+0x10/0x10
	[  156.658977]  kthread+0xf6/0x120
	[  156.660657]  ? __pfx_kthread+0x10/0x10
	[  156.662565]  ret_from_fork+0x2c/0x50
	[  156.664421]  </TASK>

Testing
==========
1. I have tested this with 1k and 4k block sizes on an x86 box with no new
   failures.
2. Tested this on Power with 4k and 64k. No failures reported so far; testing
   is still ongoing.
3. The fio benchmark below gives a ~16x performance boost on Power with 4k
   blocksize.
4. Will soon report pgbench scores on Power as well. I am assuming we won't
   see any surprises there with v4.

TODOs:
========
1. Test zonefs / gfs2 using the xfstests suite and make sure there are no
   regressions. From what I saw, gfs2 uses the IOMAP_F_BUFFER_HEAD path of
   the code, so I am not sure whether these changes can impact gfs2 at all.
   I recently noticed that for IOMAP_F_BUFFER_HEAD we take the
   __block_write_begin_int() path and not the __iomap_write_begin() path
   (see the dispatch sketched after this list). That means we never create
   iop structures for such folios in the write_begin path, yet we do create
   and use an iop in the iomap_writepage_map() path during writeback.
   Note: this is existing behavior and not due to this patch. However, since
   I only noticed it recently, I want to keep a note of it so that we can
   look at it in the next revision.
2. I still don't have an answer to todo-5 from rfcv2: a better data
   structure than a bitmap for the iop state.
3. Look into the other optimization Dave suggested on v2, i.e. whether the
   find_first_zero_bit() family of helpers can be used and whether they
   would help (a rough sketch follows this list). Sorry that I couldn't get
   this into v4. We don't do this even today for the uptodate bit searches
   with large folios, so maybe I can work on this optimization in the next
   round along with other iomap cleanups, once this patch series looks good.
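
For context on todo-1, the write_begin dispatch looks roughly like this
(paraphrased from the existing iomap_write_begin() code; not part of this
series):

	if (srcmap->type == IOMAP_INLINE)
		status = iomap_write_begin_inline(iter, folio);
	else if (srcmap->flags & IOMAP_F_BUFFER_HEAD)
		status = __block_write_begin_int(folio, pos, len, NULL, srcmap);
	else
		status = __iomap_write_begin(iter, pos, len, folio);

And purely as an illustration of todo-3 (a hypothetical helper, not part of
this series), the per-block uptodate scans could use the bitmap search
helpers instead of testing one bit at a time:

	/*
	 * Hypothetical: return the first block in [first, last] whose
	 * uptodate bit is clear, or last + 1 if all of them are uptodate.
	 * find_next_zero_bit() scans a word at a time rather than doing
	 * one test_bit() call per block.
	 */
	static unsigned iop_find_first_non_uptodate(struct iomap_page *iop,
			unsigned first, unsigned last)
	{
		return find_next_zero_bit(iop->state, last + 1, first);
	}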

<Everything below this point is copied over from previous versions>
RFCv2 -> RFCv3
===============
1. Addressed review comments on adding accessor APIs for both the uptodate
   and dirty iop bitmaps (todo-1 of rfcv2).
   Also addressed a few other review comments from Christoph & Matthew.
2. Performance testing of these patches reveals the same improvement as
   before, i.e. the given fio workload shows a 16x perf improvement on an
   nvme drive (completes todo-3 of rfcv2).
3. Addressed todo-4 of rfcv2.

Few TODOs
===========
1. Test gfs2 and zonefs with these changes (todo-2 of rfcv2)
2. Look into todo-5 of rfcv2

xfstests with default options and 1k blocksize on x86 reveals no new issues.
I also didn't observe any surprises on Power with 4k blocksize.
(Please do suggest if there are any specific xfstests config options for xfs
which would be good to test against this patch series.)


Copy-Paste Cover letter of RFCv2
================================

RFC -> RFCv2
=============
1. One of the key fixes in v2 is that earlier, when the folio got marked
   dirty, we never marked the corresponding bits dirty in the iop bitmap.
   This patch adds iomap_dirty_folio() as a new ->dirty_folio() aops
   callback, which sets the dirty bitmap in the iop and then calls
   filemap_dirty_folio(). This was one of the review comments discussed in
   the RFC.

2. The other key fix identified in testing was that the iop structure may
   only get allocated at writeback time if the folio is uptodate (since it
   can get freed under memory pressure, or in
   truncate_inode_partial_folio() in case of a large folio). This could
   then cause nothing to get written back if we have not marked the
   necessary bits as dirty in iop->state[]. Patch-1 & Patch-3 take care of
   that (both fixes are sketched just below).
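
The shape of both fixes, for reference (sketches only; the full diffs are in
patch 1 and patch 3 below):

	/*
	 * Patch-1: __iomap_write_begin() skips the iop allocation only
	 * when the write spans the whole folio, since in that case the
	 * entire folio will be dirtied anyway.
	 */
	if (pos <= folio_pos(folio) &&
	    pos + len >= folio_pos(folio) + folio_size(folio))
		return 0;

	/*
	 * Patch-3: the new ->dirty_folio() callback marks every block of
	 * the folio dirty in the iop state bitmap before the usual
	 * filemap_dirty_folio() call.
	 */
	bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
	{
		unsigned int nr_blocks = i_blocks_per_folio(mapping->host, folio);
		struct iomap_page *iop;

		iop = iomap_page_create(mapping->host, folio, 0, false);
		iomap_iop_set_range(folio, iop, 0,
				nr_blocks << mapping->host->i_blkbits,
				IOP_STATE_DIRTY);
		return filemap_dirty_folio(mapping, folio);
	}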

TODOs
======
1. I still need to work on macros which we could declare and use for easy
   reference to the uptodate/dirty bits in the iop->state[] bitmap (based on
   previous review comments).

2. Run xfstests on the other filesystems which use the iomap buffered write
   path (gfs2, zonefs).

3. Latest performance testing with this patch series (I am not expecting any
   surprises here; the perf improvements should be more or less the same as
   for the rfc).

4. Address one of the TODOs in Patch-3. I missed it and only noticed it now,
   just before sending, but it should be easy to address. I can take care of
   it in the next revision along with the others.

5. Address one of the other review comments: what happens with a large
   folio? Can we limit the size of the bitmaps if the folio is too large,
   e.g. > 2MB?

   [RH] - I can start looking into this area too, if we think these patches
   are looking good. My preference would be to work on todos 1-4 as part of
   this patch series and take up the bitmap optimization as follow-up work.
   Please do let me know your thoughts and suggestions on this.

Note: I have done a 4k blocksize test with the auto group on Power with 64k
pagesize and found no surprises. I am also running a full pass of all tests
with x86 and 1k blocksize, but it hasn't completed yet. I can update the
results once it completes.

Also, as we discussed, all the dirty and uptodate bitmap tracking code for
iomap_page's state[] bitmap is still contained within iomap's buffered-io.c
file.

I would appreciate any review comments/feedback and help on this work, i.e.
adding subpage size dirty tracking to reduce the write amplification problem
and improve buffered write performance. Kindly note that w/o these patches,
the below type of workload gets severely impacted.


Performance Results from RFC [1]:
=================================
1. Performance testing of the below fio workload reveals a ~16x performance
improvement on nvme with XFS (4k blocksize) on Power (64K pagesize).
FIO-reported write bandwidth improved from ~28 MBps to ~452 MBps.

<test_randwrite.fio>
[global]
	ioengine=psync
	rw=randwrite
	overwrite=1
	pre_read=1
	direct=0
	bs=4k
	size=1G
	dir=./
	numjobs=8
	fdatasync=1
	runtime=60
	iodepth=64
	group_reporting=1

[fio-run]

2. Also, our internal performance team reported that this patch improves
   their database workload performance by around ~83% (with XFS on Power).
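
To reproduce, assuming the job file above is saved as test_randwrite.fio in a
directory on the filesystem under test:

	root@ubuntu# fio ./test_randwrite.fio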

[1]: https://lore.kernel.org/linux-xfs/cover.1666928993.git.ritesh.list@gmail.com/

Ritesh Harjani (IBM) (3):
  iomap: Allocate iop in ->write_begin() early
  iomap: Change uptodate variable name to state
  iomap: Support subpage size dirty tracking to improve write performance

 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 243 ++++++++++++++++++++++++++++++++++-------
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/file.c       |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 209 insertions(+), 41 deletions(-)

--
2.39.2



* [RFCv4 1/3] iomap: Allocate iop in ->write_begin() early
From: Ritesh Harjani (IBM) @ 2023-05-04 14:51 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM)

Earlier, when the folio is uptodate, we only allocate the iop at writeback
time (in iomap_writepage_map()). That is fine today, but once we add support
for subpage size dirty bitmap tracking in the iop, it could cause a
performance degradation: if we don't allocate the iop during ->write_begin(),
then we never mark the necessary dirty bits in the ->write_end() call, and we
would have to mark all the bits dirty at writeback time. That could cause the
same write amplification and performance problems as exist now (w/o subpage
dirty bitmap tracking in the iop).

However, for any write whose (pos, len) completely overlaps the given folio,
there is no need to allocate an iop during ->write_begin(), so skip those
cases.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6f4c97a6d7e9..e43821bd1ff5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -562,14 +562,31 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	size_t from = offset_in_folio(folio, pos), to = from + len;
 	size_t poff, plen;
 
-	if (folio_test_uptodate(folio))
+	/*
+	 * If the write completely overlaps the current folio, then the
+	 * entire folio will be dirtied, so there is no need for
+	 * sub-folio state tracking structures to be attached to it.
+	 */
+
+	if (pos <= folio_pos(folio) &&
+	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;
-	folio_clear_error(folio);
 
 	iop = iomap_page_create(iter->inode, folio, iter->flags);
+
+	/*
+	 * If we don't have an iop and nr_blocks > 1, return -EAGAIN here
+	 * even though the folio may be uptodate, so that sub-folio state
+	 * tracking structures get attached to this folio.
+	 */
 	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
 		return -EAGAIN;
 
+	if (folio_test_uptodate(folio))
+		return 0;
+	folio_clear_error(folio);
+
+
 	do {
 		iomap_adjust_read_range(iter->inode, folio, &block_start,
 				block_end - block_start, &poff, &plen);
-- 
2.39.2



* [RFCv4 2/3] iomap: Change uptodate variable name to state
From: Ritesh Harjani (IBM) @ 2023-05-04 14:51 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM),
	Dave Chinner

This patch renames the struct iomap_page uptodate & uptodate_lock members
to state and state_lock, to better reflect their purpose for the upcoming
patch. It also introduces accessor functions for updating the uptodate
state bits in the iop->state bitmap. This makes it easier to see which
bitmap type is being referenced in the different code paths.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/buffered-io.c | 65 ++++++++++++++++++++++++++++++++----------
 1 file changed, 50 insertions(+), 15 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e43821bd1ff5..b8b23c859ecf 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -25,13 +25,13 @@
 
 /*
  * Structure allocated for each folio when block size < folio size
- * to track sub-folio uptodate status and I/O completions.
+ * to track sub-folio uptodate state and I/O completions.
  */
 struct iomap_page {
 	atomic_t		read_bytes_pending;
 	atomic_t		write_bytes_pending;
-	spinlock_t		uptodate_lock;
-	unsigned long		uptodate[];
+	spinlock_t		state_lock;
+	unsigned long		state[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct folio *folio)
@@ -43,6 +43,38 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 
 static struct bio_set iomap_ioend_bioset;
 
+/*
+ * Accessor functions for setting/clearing/checking uptodate bits in
+ * iop->state bitmap.
+ * nrblocks is i_blocks_per_folio() which is passed in every
+ * function as the last argument for API consistency.
+ */
+static inline void iop_set_range_uptodate(struct iomap_page *iop,
+				unsigned int start, unsigned int len,
+				unsigned int nrblocks)
+{
+	bitmap_set(iop->state, start, len);
+}
+
+static inline void iop_clear_range_uptodate(struct iomap_page *iop,
+				unsigned int start, unsigned int len,
+				unsigned int nrblocks)
+{
+	bitmap_clear(iop->state, start, len);
+}
+
+static inline bool iop_test_uptodate(struct iomap_page *iop, unsigned int block,
+				unsigned int nrblocks)
+{
+	return test_bit(block, iop->state);
+}
+
+static inline bool iop_uptodate_full(struct iomap_page *iop,
+				unsigned int nrblocks)
+{
+	return bitmap_full(iop->state, nrblocks);
+}
+
 static struct iomap_page *
 iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 {
@@ -58,12 +90,12 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;
 
-	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
 		      gfp);
 	if (iop) {
-		spin_lock_init(&iop->uptodate_lock);
+		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
-			bitmap_fill(iop->uptodate, nr_blocks);
+			iop_set_range_uptodate(iop, 0, nr_blocks, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -79,7 +111,7 @@ static void iomap_page_release(struct folio *folio)
 		return;
 	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
 	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
-	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
+	WARN_ON_ONCE(iop_uptodate_full(iop, nr_blocks) !=
 			folio_test_uptodate(folio));
 	kfree(iop);
 }
@@ -99,6 +131,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	size_t plen = min_t(loff_t, folio_size(folio) - poff, length);
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
 	/*
 	 * If the block size is smaller than the page size, we need to check the
@@ -110,7 +143,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* move forward for each leading block marked uptodate */
 		for (i = first; i <= last; i++) {
-			if (!test_bit(i, iop->uptodate))
+			if (!iop_test_uptodate(iop, i, nr_blocks))
 				break;
 			*pos += block_size;
 			poff += block_size;
@@ -120,7 +153,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* truncate len if we find any trailing uptodate block(s) */
 		for ( ; i <= last; i++) {
-			if (test_bit(i, iop->uptodate)) {
+			if (iop_test_uptodate(iop, i, nr_blocks)) {
 				plen -= (last - i + 1) * block_size;
 				last = i - 1;
 				break;
@@ -151,12 +184,13 @@ static void iomap_iop_set_range_uptodate(struct folio *folio,
 	unsigned first = off >> inode->i_blkbits;
 	unsigned last = (off + len - 1) >> inode->i_blkbits;
 	unsigned long flags;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
-	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	bitmap_set(iop->uptodate, first, last - first + 1);
-	if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
+	spin_lock_irqsave(&iop->state_lock, flags);
+	iop_set_range_uptodate(iop, first, last - first + 1, nr_blocks);
+	if (iop_uptodate_full(iop, nr_blocks))
 		folio_mark_uptodate(folio);
-	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
 }
 
 static void iomap_set_range_uptodate(struct folio *folio,
@@ -439,6 +473,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 	struct iomap_page *iop = to_iomap_page(folio);
 	struct inode *inode = folio->mapping->host;
 	unsigned first, last, i;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
 	if (!iop)
 		return false;
@@ -451,7 +486,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 	last = (from + count - 1) >> inode->i_blkbits;
 
 	for (i = first; i <= last; i++)
-		if (!test_bit(i, iop->uptodate))
+		if (!iop_test_uptodate(iop, i, nr_blocks))
 			return false;
 	return true;
 }
@@ -1652,7 +1687,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !test_bit(i, iop->uptodate))
+		if (iop && !iop_test_uptodate(iop, i, nblocks))
 			continue;
 
 		error = wpc->ops->map_blocks(wpc, inode, pos);
-- 
2.39.2



* [RFCv4 3/3] iomap: Support subpage size dirty tracking to improve write performance
From: Ritesh Harjani (IBM) @ 2023-05-04 14:51 UTC (permalink / raw)
  To: linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Dave Chinner, Brian Foster,
	Ojaswin Mujoo, Disha Goel, Ritesh Harjani (IBM),
	Aravinda Herle

On 64k pagesize platforms (especially Power and aarch64) with a 4k
filesystem blocksize, this patch should improve performance by writing
back only the dirty subpage ranges.

This should also reduce write amplification, since we can now track
subpage dirty status within the state bitmaps. Earlier we had to write
the entire 64k page even if only a part of it (e.g. 4k) was updated.

Performance testing of the below fio workload reveals a ~16x performance
improvement on nvme with XFS (4k blocksize) on Power (64K pagesize).
FIO-reported write bandwidth improved from around ~28 MBps to ~452 MBps.

1. <test_randwrite.fio>
[global]
	ioengine=psync
	rw=randwrite
	overwrite=1
	pre_read=1
	direct=0
	bs=4k
	size=1G
	dir=./
	numjobs=8
	fdatasync=1
	runtime=60
	iodepth=64
	group_reporting=1

[fio-run]

2. Also, our internal performance team reported that this patch improves
   their database workload performance by around ~83% (with XFS on Power).

Reported-by: Aravinda Herle <araherle@in.ibm.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 175 ++++++++++++++++++++++++++++++++++-------
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/file.c       |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 149 insertions(+), 33 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index a5f4be6b9213..75efec3c3b71 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -746,7 +746,7 @@ static const struct address_space_operations gfs2_aops = {
 	.writepages = gfs2_writepages,
 	.read_folio = gfs2_read_folio,
 	.readahead = gfs2_readahead,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.bmap = gfs2_bmap,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b8b23c859ecf..52c9703ff262 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -34,6 +34,11 @@ struct iomap_page {
 	unsigned long		state[];
 };

+enum iop_state {
+	IOP_STATE_UPDATE = 0,
+	IOP_STATE_DIRTY = 1
+};
+
 static inline struct iomap_page *to_iomap_page(struct folio *folio)
 {
 	if (folio_test_private(folio))
@@ -44,8 +49,8 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 static struct bio_set iomap_ioend_bioset;

 /*
- * Accessor functions for setting/clearing/checking uptodate bits in
- * iop->state bitmap.
+ * Accessor functions for setting/clearing/checking uptodate and
+ * dirty bits in iop->state bitmap.
  * nrblocks is i_blocks_per_folio() which is passed in every
  * function as the last argument for API consistency.
  */
@@ -75,8 +80,29 @@ static inline bool iop_uptodate_full(struct iomap_page *iop,
 	return bitmap_full(iop->state, nrblocks);
 }

+static inline void iop_set_range_dirty(struct iomap_page *iop,
+				unsigned int start, unsigned int len,
+				unsigned int nrblocks)
+{
+	bitmap_set(iop->state, start + nrblocks, len);
+}
+
+static inline void iop_clear_range_dirty(struct iomap_page *iop,
+				unsigned int start, unsigned int len,
+				unsigned int nrblocks)
+{
+	bitmap_clear(iop->state, start + nrblocks, len);
+}
+
+static inline bool iop_test_dirty(struct iomap_page *iop, unsigned int block,
+			     unsigned int nrblocks)
+{
+	return test_bit(block + nrblocks, iop->state);
+}
+
 static struct iomap_page *
-iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
+iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags,
+		  bool is_dirty)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -90,12 +116,21 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;

-	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
+	/*
+	 * iop->state tracks two sets of state flags when the
+	 * filesystem block size is smaller than the folio size.
+	 * The first state tracks per-filesystem block uptodate
+	 * and the second tracks per-filesystem block dirty
+	 * state.
+	 */
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
 		      gfp);
 	if (iop) {
 		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
 			iop_set_range_uptodate(iop, 0, nr_blocks, nr_blocks);
+		if (is_dirty)
+			iop_set_range_dirty(iop, 0, nr_blocks, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -177,29 +212,62 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	*lenp = plen;
 }

-static void iomap_iop_set_range_uptodate(struct folio *folio,
-		struct iomap_page *iop, size_t off, size_t len)
+static void iomap_iop_set_range(struct folio *folio, struct iomap_page *iop,
+		size_t off, size_t len, enum iop_state state)
 {
 	struct inode *inode = folio->mapping->host;
-	unsigned first = off >> inode->i_blkbits;
-	unsigned last = (off + len - 1) >> inode->i_blkbits;
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = (off >> inode->i_blkbits);
+	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
+	unsigned int nr_blks = last_blk - first_blk + 1;
 	unsigned long flags;
-	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);

-	spin_lock_irqsave(&iop->state_lock, flags);
-	iop_set_range_uptodate(iop, first, last - first + 1, nr_blocks);
-	if (iop_uptodate_full(iop, nr_blocks))
-		folio_mark_uptodate(folio);
-	spin_unlock_irqrestore(&iop->state_lock, flags);
+	switch (state) {
+	case IOP_STATE_UPDATE:
+		if (!iop) {
+			folio_mark_uptodate(folio);
+			return;
+		}
+		spin_lock_irqsave(&iop->state_lock, flags);
+		iop_set_range_uptodate(iop, first_blk, nr_blks, blks_per_folio);
+		if (iop_uptodate_full(iop, blks_per_folio))
+			folio_mark_uptodate(folio);
+		spin_unlock_irqrestore(&iop->state_lock, flags);
+		break;
+	case IOP_STATE_DIRTY:
+		if (!iop)
+			return;
+		spin_lock_irqsave(&iop->state_lock, flags);
+		iop_set_range_dirty(iop, first_blk, nr_blks, blks_per_folio);
+		spin_unlock_irqrestore(&iop->state_lock, flags);
+		break;
+	}
 }

-static void iomap_set_range_uptodate(struct folio *folio,
-		struct iomap_page *iop, size_t off, size_t len)
+static void iomap_iop_clear_range(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len,
+		enum iop_state state)
 {
-	if (iop)
-		iomap_iop_set_range_uptodate(folio, iop, off, len);
-	else
-		folio_mark_uptodate(folio);
+	struct inode *inode = folio->mapping->host;
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = (off >> inode->i_blkbits);
+	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	switch (state) {
+	case IOP_STATE_UPDATE:
+		// Never gets called so not implemented
+		WARN_ON(1);
+		break;
+	case IOP_STATE_DIRTY:
+		if (!iop)
+			return;
+		spin_lock_irqsave(&iop->state_lock, flags);
+		iop_clear_range_dirty(iop, first_blk, nr_blks, blks_per_folio);
+		spin_unlock_irqrestore(&iop->state_lock, flags);
+		break;
+	}
 }

 static void iomap_finish_folio_read(struct folio *folio, size_t offset,
@@ -211,7 +279,7 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 		folio_clear_uptodate(folio);
 		folio_set_error(folio);
 	} else {
-		iomap_set_range_uptodate(folio, iop, offset, len);
+		iomap_iop_set_range(folio, iop, offset, len, IOP_STATE_UPDATE);
 	}

 	if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
@@ -265,7 +333,8 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iomap_page_create(iter->inode, folio, iter->flags);
+		iop = iomap_page_create(iter->inode, folio, iter->flags,
+					folio_test_dirty(folio));
 	else
 		iop = to_iomap_page(folio);

@@ -273,7 +342,8 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	memcpy(addr, iomap->inline_data, size);
 	memset(addr + size, 0, PAGE_SIZE - poff - size);
 	kunmap_local(addr);
-	iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
+	iomap_iop_set_range(folio, iop, offset, PAGE_SIZE - poff,
+			    IOP_STATE_UPDATE);
 	return 0;
 }

@@ -303,14 +373,15 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		return iomap_read_inline_data(iter, folio);

 	/* zero post-eof blocks as the page may be mapped */
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
+	iop = iomap_page_create(iter->inode, folio, iter->flags,
+				folio_test_dirty(folio));
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;

 	if (iomap_block_needs_zeroing(iter, pos)) {
 		folio_zero_range(folio, poff, plen);
-		iomap_set_range_uptodate(folio, iop, poff, plen);
+		iomap_iop_set_range(folio, iop, poff, plen, IOP_STATE_UPDATE);
 		goto done;
 	}

@@ -559,6 +630,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);

+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
+{
+	unsigned int nr_blocks = i_blocks_per_folio(mapping->host, folio);
+	struct iomap_page *iop;
+
+	iop = iomap_page_create(mapping->host, folio, 0, false);
+	iomap_iop_set_range(folio, iop, 0,
+			nr_blocks << mapping->host->i_blkbits, IOP_STATE_DIRTY);
+	return filemap_dirty_folio(mapping, folio);
+}
+EXPORT_SYMBOL_GPL(iomap_dirty_folio);
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -607,7 +690,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;

-	iop = iomap_page_create(iter->inode, folio, iter->flags);
+	iop = iomap_page_create(iter->inode, folio, iter->flags,
+				folio_test_dirty(folio));

 	/*
 	 * If we don't have an iop and nr_blocks > 1 then return -EAGAIN here
@@ -648,7 +732,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 			if (status)
 				return status;
 		}
-		iomap_set_range_uptodate(folio, iop, poff, plen);
+		iomap_iop_set_range(folio, iop, poff, plen, IOP_STATE_UPDATE);
 	} while ((block_start += plen) < block_end);

 	return 0;
@@ -771,7 +855,10 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	 */
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
-	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
+	iomap_iop_set_range(folio, iop, offset_in_folio(folio, pos), copied,
+			    IOP_STATE_UPDATE);
+	iomap_iop_set_range(folio, iop, offset_in_folio(folio, pos), copied,
+			    IOP_STATE_DIRTY);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
@@ -959,6 +1046,12 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 {
 	while (start_byte < end_byte) {
 		struct folio	*folio;
+		size_t first, last;
+		loff_t end;
+		unsigned int i;
+		struct iomap_page *iop;
+		u8 blkbits = inode->i_blkbits;
+		unsigned int nr_blocks;

 		/* grab locked page */
 		folio = filemap_lock_folio(inode->i_mapping,
@@ -983,6 +1076,26 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 				}
 			}

+			/*
+			 * When we have subfolio dirty tracking, there can be
+			 * subblocks within a folio which are marked uptodate
+			 * but not dirty. In that case it is necessary to punch
+			 * out such blocks to avoid leaking delalloc blocks.
+			 */
+			iop = to_iomap_page(folio);
+			if (!iop)
+				goto skip_iop_punch;
+			end = min_t(loff_t, end_byte - 1,
+				(folio_next_index(folio) << PAGE_SHIFT) - 1);
+			first = offset_in_folio(folio, start_byte) >> blkbits;
+			last = offset_in_folio(folio, end) >> blkbits;
+			nr_blocks = i_blocks_per_folio(inode, folio);
+			for (i = first; i <= last; i++) {
+				if (!iop_test_dirty(iop, i, nr_blocks))
+					punch(inode, i << blkbits,
+							1 << blkbits);
+			}
+skip_iop_punch:
 			/*
 			 * Make sure the next punch start is correctly bound to
 			 * the end of this data range, not the end of the folio.
@@ -1671,7 +1784,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iomap_page_create(inode, folio, 0);
+	struct iomap_page *iop = iomap_page_create(inode, folio, 0, true);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1687,7 +1800,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !iop_test_uptodate(iop, i, nblocks))
+		if (iop && !iop_test_dirty(iop, i, nblocks))
 			continue;

 		error = wpc->ops->map_blocks(wpc, inode, pos);
@@ -1731,6 +1844,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		}
 	}

+	iomap_iop_clear_range(folio, iop, 0, end_pos - folio_pos(folio),
+			      IOP_STATE_DIRTY);
 	folio_start_writeback(folio);
 	folio_unlock(folio);

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2ef78aa1d3f6..77c7332ae197 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
 	.read_folio		= xfs_vm_read_folio,
 	.readahead		= xfs_vm_readahead,
 	.writepages		= xfs_vm_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.bmap			= xfs_vm_bmap,
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 132f01d3461f..e508c8e97372 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
 	.read_folio		= zonefs_read_folio,
 	.readahead		= zonefs_readahead,
 	.writepages		= zonefs_writepages,
-	.dirty_folio		= filemap_dirty_folio,
+	.dirty_folio		= iomap_dirty_folio,
 	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.migrate_folio		= filemap_migrate_folio,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0f8123504e5e..0c2bee80565c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
--
2.39.2



* Re: [RFCv4 3/3] iomap: Support subpage size dirty tracking to improve write performance
From: Matthew Wilcox @ 2023-05-04 15:02 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-xfs, linux-fsdevel, Dave Chinner, Brian Foster,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

On Thu, May 04, 2023 at 08:21:09PM +0530, Ritesh Harjani (IBM) wrote:
> @@ -90,12 +116,21 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
>  	else
>  		gfp = GFP_NOFS | __GFP_NOFAIL;
> 
> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
> +	/*
> +	 * iop->state tracks two sets of state flags when the
> +	 * filesystem block size is smaller than the folio size.
> +	 * The first state tracks per-filesystem block uptodate
> +	 * and the second tracks per-filesystem block dirty
> +	 * state.
> +	 */

"per-filesystem"?  I think you mean "per-block (uptodate|block) state".
Using "per-block" naming throughout this patchset might help readability.
It's currently an awkward mix of "subpage", "sub-page" and "sub-folio".
It also feels like you're adding a comment to every non-mechanical change
you make, which isn't necessarily helpful.  Changelog, sure, but
sometimes your comments simply re-state what your change is doing.

> -static void iomap_iop_set_range_uptodate(struct folio *folio,
> -		struct iomap_page *iop, size_t off, size_t len)
> +static void iomap_iop_set_range(struct folio *folio, struct iomap_page *iop,
> +		size_t off, size_t len, enum iop_state state)
>  {
>  	struct inode *inode = folio->mapping->host;
> -	unsigned first = off >> inode->i_blkbits;
> -	unsigned last = (off + len - 1) >> inode->i_blkbits;
> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
> +	unsigned int first_blk = (off >> inode->i_blkbits);
> +	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
> +	unsigned int nr_blks = last_blk - first_blk + 1;
>  	unsigned long flags;
> -	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> 
> -	spin_lock_irqsave(&iop->state_lock, flags);
> -	iop_set_range_uptodate(iop, first, last - first + 1, nr_blocks);
> -	if (iop_uptodate_full(iop, nr_blocks))
> -		folio_mark_uptodate(folio);
> -	spin_unlock_irqrestore(&iop->state_lock, flags);
> +	switch (state) {
> +	case IOP_STATE_UPDATE:
> +		if (!iop) {
> +			folio_mark_uptodate(folio);
> +			return;
> +		}
> +		spin_lock_irqsave(&iop->state_lock, flags);
> +		iop_set_range_uptodate(iop, first_blk, nr_blks, blks_per_folio);
> +		if (iop_uptodate_full(iop, blks_per_folio))
> +			folio_mark_uptodate(folio);
> +		spin_unlock_irqrestore(&iop->state_lock, flags);
> +		break;
> +	case IOP_STATE_DIRTY:
> +		if (!iop)
> +			return;
> +		spin_lock_irqsave(&iop->state_lock, flags);
> +		iop_set_range_dirty(iop, first_blk, nr_blks, blks_per_folio);
> +		spin_unlock_irqrestore(&iop->state_lock, flags);
> +		break;
> +	}
>  }

I can't believe this is what Dave wanted you to do.  iomap_iop_set_range()
should be the low-level helper called by iop_set_range_uptodate() and
iop_set_range_dirty(), not the other way around.



* Re: [RFCv4 3/3] iomap: Support subpage size dirty tracking to improve write performance
From: Ritesh Harjani @ 2023-05-05  3:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, Dave Chinner, Brian Foster,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

Matthew Wilcox <willy@infradead.org> writes:

> On Thu, May 04, 2023 at 08:21:09PM +0530, Ritesh Harjani (IBM) wrote:
>> @@ -90,12 +116,21 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
>>  	else
>>  		gfp = GFP_NOFS | __GFP_NOFAIL;
>>
>> -	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
>> +	/*
>> +	 * iop->state tracks two sets of state flags when the
>> +	 * filesystem block size is smaller than the folio size.
>> +	 * The first state tracks per-filesystem block uptodate
>> +	 * and the second tracks per-filesystem block dirty
>> +	 * state.
>> +	 */
>
> "per-filesystem"?  I think you mean "per-block (uptodate|block) state".
> Using "per-block" naming throughout this patchset might help readability.
> It's currently an awkward mix of "subpage", "sub-page" and "sub-folio".
And "subfolio", to add to the list.

Yes, I agree it got all mixed up in the comments.
Let me stick to sub-folio (which was what we were using earlier [1])

[1]: https://lore.kernel.org/all/20211216210715.3801857-5-willy@infradead.org/

> It also feels like you're adding a comment to every non-mechanical change
> you make, which isn't necessarily helpful.  Changelog, sure, but
> sometimes your comments simply re-state what your change is doing.
>

Sure, I will keep that in mind for next rev to remove unwanted comments.

>> -static void iomap_iop_set_range_uptodate(struct folio *folio,
>> -		struct iomap_page *iop, size_t off, size_t len)
>> +static void iomap_iop_set_range(struct folio *folio, struct iomap_page *iop,
>> +		size_t off, size_t len, enum iop_state state)
>>  {
>>  	struct inode *inode = folio->mapping->host;
>> -	unsigned first = off >> inode->i_blkbits;
>> -	unsigned last = (off + len - 1) >> inode->i_blkbits;
>> +	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
>> +	unsigned int first_blk = (off >> inode->i_blkbits);
>> +	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
>> +	unsigned int nr_blks = last_blk - first_blk + 1;
>>  	unsigned long flags;
>> -	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>>
>> -	spin_lock_irqsave(&iop->state_lock, flags);
>> -	iop_set_range_uptodate(iop, first, last - first + 1, nr_blocks);
>> -	if (iop_uptodate_full(iop, nr_blocks))
>> -		folio_mark_uptodate(folio);
>> -	spin_unlock_irqrestore(&iop->state_lock, flags);
>> +	switch (state) {
>> +	case IOP_STATE_UPDATE:
>> +		if (!iop) {
>> +			folio_mark_uptodate(folio);
>> +			return;
>> +		}
>> +		spin_lock_irqsave(&iop->state_lock, flags);
>> +		iop_set_range_uptodate(iop, first_blk, nr_blks, blks_per_folio);
>> +		if (iop_uptodate_full(iop, blks_per_folio))
>> +			folio_mark_uptodate(folio);
>> +		spin_unlock_irqrestore(&iop->state_lock, flags);
>> +		break;
>> +	case IOP_STATE_DIRTY:
>> +		if (!iop)
>> +			return;
>> +		spin_lock_irqsave(&iop->state_lock, flags);
>> +		iop_set_range_dirty(iop, first_blk, nr_blks, blks_per_folio);
>> +		spin_unlock_irqrestore(&iop->state_lock, flags);
>> +		break;
>> +	}
>>  }
>
> I can't believe this is what Dave wanted you to do.  iomap_iop_set_range()
> should be the low-level helper called by iop_set_range_uptodate() and
> iop_set_range_dirty(), not the other way around.

Ok, I see the confusion. I think if we rename iomap_iop_set_range() to
iomap_set_range(), then that should be good. Then it becomes
iomap_set_range() calling iop_set_range_uptodate() & iop_set_range_dirty()
as the lower-level helper routines.

Based on the existing code, I believe this ^^^^ is how the hierarchy
should look. Does it look good then? If yes, I will simply drop the
"_iop" part in the next rev.

<Relevant function list>
    iop_set_range_uptodate
    iop_clear_range_uptodate
    iop_test_uptodate
    iop_uptodate_full
    iop_set_range_dirty
    iop_clear_range_dirty
    iop_test_dirty
    iomap_page_create
    iomap_page_release
    iomap_iop_set_range -> iomap_set_range()
    iomap_iop_clear_range  -> iomap_clear_range()

-ritesh


* Re: [RFCv4 3/3] iomap: Support subpage size dirty tracking to improve write performance
From: Matthew Wilcox @ 2023-05-05  3:59 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-xfs, linux-fsdevel, Dave Chinner, Brian Foster,
	Ojaswin Mujoo, Disha Goel, Aravinda Herle

On Fri, May 05, 2023 at 08:57:53AM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> > "per-filesystem"?  I think you mean "per-block (uptodate|block) state".

(s/|block/|dirty/, sorry)

> > Using "per-block" naming throughout this patchset might help readability.
> > It's currently an awkward mix of "subpage", "sub-page" and "sub-folio".
> and subfolio to add.
> 
> Yes, I agree it got all mixed up in the comments.
> Let me stick to sub-folio (which was what we were using earlier [1])

I think per-block is better?  sub-folio might be at almost any
granularity, but per-block is specific.

> >> +static void iomap_iop_set_range(struct folio *folio, struct iomap_page *iop,
> >> +		size_t off, size_t len, enum iop_state state)
> >
> > I can't believe this is what Dave wanted you to do.  iomap_iop_set_range()
> > should be the low-level helper called by iop_set_range_uptodate() and
> > iop_set_range_dirty(), not the other way around.
> 
> Ok, I see the confusion, I think if we make
> iomap_iop_set_range() to iomap_set_range(), then that should be good.
> Then it becomes iomap_set_range() calling
> iop_set_range_update() & iop_set_range_dirty() as the lower level helper routines.
> 
> Based on the the existing code, I believe this ^^^^ is how the heirarchy
> should look like. Does it look good then? If yes, I will simply drop the
> "_iop" part in the next rev.

I don't think that's what I'm saying.  The next version
should not have enum iop_state in it.

ie it looks something like:

iomap_set_range()
{
	bitmap_set(...);
}

iomap_set_range_uptodate()
{
	iomap_set_range(...);
}

iomap_set_range_dirty()
{
	iomap_set_range(...);
}

> <Relevant function list>
>     iop_set_range_uptodate
>     iop_clear_range_uptodate
>     iop_test_uptodate
>     iop_uptodate_full
>     iop_set_range_dirty
>     iop_clear_range_dirty
>     iop_test_dirty
>     iomap_page_create
>     iomap_page_release
>     iomap_iop_set_range -> iomap_set_range()
>     iomap_iop_clear_range  -> iomap_clear_range()
> 
> -ritesh

