linux-fsdevel.vger.kernel.org archive mirror
* [RFCv3 0/7] ext2 iomap changes and iomap improvements
From: Ritesh Harjani (IBM) @ 2024-04-25 13:28 UTC (permalink / raw)
  To: linux-ext4, linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Darrick J . Wong, Ojaswin Mujoo,
	Ritesh Harjani, Jan Kara

Hello all,

Here is RFCv3, which implements the ext2 iomap changes. I have also included
some iomap improvements, along with a fix for the BH_Boundary problem within
iomap.

I am posting this mainly as an update (before the conference/call) on the
current work, since it has survived a few runs of xfstests.

Patches 1-4 convert ext2 regular file buffered I/O to use the iomap APIs.
Patches 5 & 6 are iomap improvements which Ojaswin and I noticed during code
reviews.
Patch 7 optimizes the data access patterns for filesystems with indirect block
mappings. Thanks to Matthew Wilcox for pointing out the problem and providing
a rough solution to the BH_Boundary problem within iomap (which this patch is
based on).

Please note that I would still like to work on the following aspects before
considering any of these for merging -

1. Work out how directory handling for ext2 should be done. Since ext2 uses
   the page cache for that, can we directly use iomap? Do we need any other
   iomap ops to implement directory handling? Basically, either a PoC or some
   theoretical understanding of how we should handle directories for ext2.

2. Integrate patch 4 into patch 2. It is kept separate here for ease of review.

3. Test patches 5 & 6 separately before considering them for merging. The
   changes look ok to me and I would like them to be reviewed, but I hope to
   get more testing done on those patches individually, since they are not
   dependent on this series.

4. Patch 7 is an early RFC to get an idea of whether it is taking the right
   direction. If it looks ok, then I can polish the series, carefully review
   for any missed corner cases (hopefully I have covered them all),
   work on the other points in this todo list, and do more testing before
   posting another version.

5. Write a few fstests to further exercise the paths of the overall series.

Ritesh Harjani (IBM) (7):
  ext2: Remove comment related to journal handle
  ext2: Convert ext2 regular file buffered I/O to use iomap
  ext2: Enable large folio support
  ext2: Implement seq counter for validating cached iomap
  iomap: Fix iomap_adjust_read_range for plen calculation
  iomap: Optimize iomap_read_folio
  iomap: Optimize data access patterns for filesystems with indirect mappings

 fs/ext2/balloc.c       |   1 +
 fs/ext2/ext2.h         |   6 ++
 fs/ext2/file.c         |  20 +++++-
 fs/ext2/inode.c        | 126 +++++++++++++++++++++++++++++++++++---
 fs/ext2/super.c        |   2 +-
 fs/iomap/buffered-io.c | 135 ++++++++++++++++++++++++++++++++---------
 6 files changed, 248 insertions(+), 42 deletions(-)

--
2.44.0



* [RFCv3 1/7] ext2: Remove comment related to journal handle
From: Ritesh Harjani (IBM) @ 2024-04-25 13:28 UTC (permalink / raw)
  To: linux-ext4, linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Darrick J . Wong, Ojaswin Mujoo,
	Ritesh Harjani, Jan Kara

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext2/inode.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f3d570a9302b..c4de3a94c4b2 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -615,8 +615,6 @@ static void ext2_splice_branch(struct inode *inode,
  * allocations is needed - we simply release blocks and do not touch anything
  * reachable from inode.
  *
- * `handle' can be NULL if create == 0.
- *
  * return > 0, # of blocks mapped or allocated.
  * return = 0, if plain lookup failed.
  * return < 0, error case.
-- 
2.44.0



* [RFCv3 2/7] ext2: Convert ext2 regular file buffered I/O to use iomap
From: Ritesh Harjani (IBM) @ 2024-04-25 13:28 UTC (permalink / raw)
  To: linux-ext4, linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Darrick J . Wong, Ojaswin Mujoo,
	Ritesh Harjani, Jan Kara

This patch converts ext2 regular file's buffered-io path to use iomap.
- buffered-io path using iomap_file_buffered_write
- DIO fallback to buffered-io now uses iomap_file_buffered_write
- writeback path now uses a new aops - ext2_file_aops
- truncate now uses iomap_truncate_page
- mmap path of ext2 continues to use generic_file_vm_ops

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext2/file.c  | 20 ++++++++++++--
 fs/ext2/inode.c | 69 ++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 4ddc36f4dbd4..ee5cd4a2f24f 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -252,7 +252,7 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 		iocb->ki_flags &= ~IOCB_DIRECT;
 		pos = iocb->ki_pos;
-		status = generic_perform_write(iocb, from);
+		status = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
 		if (unlikely(status < 0)) {
 			ret = status;
 			goto out_unlock;
@@ -278,6 +278,22 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	return ret;
 }
 
+static ssize_t ext2_buffered_write_iter(struct kiocb *iocb,
+					struct iov_iter *from)
+{
+	ssize_t ret = 0;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	inode_lock(inode);
+	ret = generic_write_checks(iocb, from);
+	if (ret > 0)
+		ret = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
+	inode_unlock(inode);
+	if (ret > 0)
+		ret = generic_write_sync(iocb, ret);
+	return ret;
+}
+
 static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 #ifdef CONFIG_FS_DAX
@@ -299,7 +315,7 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_flags & IOCB_DIRECT)
 		return ext2_dio_write_iter(iocb, from);
 
-	return generic_file_write_iter(iocb, from);
+	return ext2_buffered_write_iter(iocb, from);
 }
 
 const struct file_operations ext2_file_operations = {
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c4de3a94c4b2..f90d280025d9 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -877,10 +877,14 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 	if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE) && written == 0)
 		return -ENOTBLK;
 
-	if (iomap->type == IOMAP_MAPPED &&
-	    written < length &&
-	    (flags & IOMAP_WRITE))
+	if (iomap->type == IOMAP_MAPPED && written < length &&
+	   (flags & IOMAP_WRITE)) {
 		ext2_write_failed(inode->i_mapping, offset + length);
+		return 0;
+	}
+
+	if (iomap->flags & IOMAP_F_SIZE_CHANGED)
+		mark_inode_dirty(inode);
 	return 0;
 }
 
@@ -912,6 +916,16 @@ static void ext2_readahead(struct readahead_control *rac)
 	mpage_readahead(rac, ext2_get_block);
 }
 
+static int ext2_file_read_folio(struct file *file, struct folio *folio)
+{
+	return iomap_read_folio(folio, &ext2_iomap_ops);
+}
+
+static void ext2_file_readahead(struct readahead_control *rac)
+{
+	iomap_readahead(rac, &ext2_iomap_ops);
+}
+
 static int
 ext2_write_begin(struct file *file, struct address_space *mapping,
 		loff_t pos, unsigned len, struct page **pagep, void **fsdata)
@@ -941,12 +955,41 @@ static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
 	return generic_block_bmap(mapping,block,ext2_get_block);
 }
 
+static sector_t ext2_file_bmap(struct address_space *mapping, sector_t block)
+{
+	return iomap_bmap(mapping, block, &ext2_iomap_ops);
+}
+
 static int
 ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	return mpage_writepages(mapping, wbc, ext2_get_block);
 }
 
+static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
+				 struct inode *inode, loff_t offset,
+				 unsigned len)
+{
+	if (offset >= wpc->iomap.offset &&
+	    offset < wpc->iomap.offset + wpc->iomap.length)
+		return 0;
+
+	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
+				IOMAP_WRITE, &wpc->iomap, NULL);
+}
+
+static const struct iomap_writeback_ops ext2_writeback_ops = {
+	.map_blocks		= ext2_write_map_blocks,
+};
+
+static int ext2_file_writepages(struct address_space *mapping,
+				struct writeback_control *wbc)
+{
+	struct iomap_writepage_ctx wpc = { };
+
+	return iomap_writepages(mapping, wbc, &wpc, &ext2_writeback_ops);
+}
+
 static int
 ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
@@ -955,6 +998,20 @@ ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc
 	return dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
 }
 
+const struct address_space_operations ext2_file_aops = {
+	.dirty_folio		= iomap_dirty_folio,
+	.release_folio 		= iomap_release_folio,
+	.invalidate_folio	= iomap_invalidate_folio,
+	.read_folio		= ext2_file_read_folio,
+	.readahead		= ext2_file_readahead,
+	.bmap			= ext2_file_bmap,
+	.direct_IO		= noop_direct_IO,
+	.writepages		= ext2_file_writepages,
+	.migrate_folio		= filemap_migrate_folio,
+	.is_partially_uptodate	= iomap_is_partially_uptodate,
+	.error_remove_folio	= generic_error_remove_folio,
+};
+
 const struct address_space_operations ext2_aops = {
 	.dirty_folio		= block_dirty_folio,
 	.invalidate_folio	= block_invalidate_folio,
@@ -1279,8 +1336,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 		error = dax_truncate_page(inode, newsize, NULL,
 					  &ext2_iomap_ops);
 	else
-		error = block_truncate_page(inode->i_mapping,
-				newsize, ext2_get_block);
+		error = iomap_truncate_page(inode, newsize, NULL,
+					    &ext2_iomap_ops);
 	if (error)
 		return error;
 
@@ -1370,7 +1427,7 @@ void ext2_set_file_ops(struct inode *inode)
 	if (IS_DAX(inode))
 		inode->i_mapping->a_ops = &ext2_dax_aops;
 	else
-		inode->i_mapping->a_ops = &ext2_aops;
+		inode->i_mapping->a_ops = &ext2_file_aops;
 }
 
 struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
-- 
2.44.0



* [RFCv3 3/7] ext2: Enable large folio support
From: Ritesh Harjani (IBM) @ 2024-04-25 13:28 UTC (permalink / raw)
  To: linux-ext4, linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Darrick J . Wong, Ojaswin Mujoo,
	Ritesh Harjani, Jan Kara

Now that ext2 regular file buffered-io path is converted to use iomap,
we can also enable large folio support for ext2.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext2/inode.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f90d280025d9..2b62786130b5 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1424,10 +1424,12 @@ void ext2_set_file_ops(struct inode *inode)
 {
 	inode->i_op = &ext2_file_inode_operations;
 	inode->i_fop = &ext2_file_operations;
-	if (IS_DAX(inode))
+	if (IS_DAX(inode)) {
 		inode->i_mapping->a_ops = &ext2_dax_aops;
-	else
+	} else {
 		inode->i_mapping->a_ops = &ext2_file_aops;
+		mapping_set_large_folios(inode->i_mapping);
+	}
 }
 
 struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
-- 
2.44.0



* [RFCv3 4/7] ext2: Implement seq counter for validating cached iomap
From: Ritesh Harjani (IBM) @ 2024-04-25 13:28 UTC (permalink / raw)
  To: linux-ext4, linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Darrick J . Wong, Ojaswin Mujoo,
	Ritesh Harjani, Jan Kara

There is a possibility of the following race with iomap during
writeback -

write_cache_pages()
  cache extent covering 0..1MB range
  write page at offset 0k
					truncate(file, 4k)
					  drops all relevant pages
					  frees fs blocks
					pwrite(file, 4k, 4k)
					  creates dirty page in the page cache
  writes page at offset 4k to a stale block

This race can happen because iomap_writepages() keeps a cached extent mapping
(which can cover a large range) within struct iomap. While write_cache_pages()
is iterating over each folio, if a truncate happens in parallel on the next
folio, followed by a buffered write to the same offset within the file, the
logical to physical offset of the cached iomap mapping can change. That means
the cached iomap has become stale.

This patch implements the seq counter approach for revalidation of stale
iomap mappings. i_blkseq gets incremented on every block
allocation/free. Here is what we do -

For ext2 buffered-writes, the block allocation happens at
->write_iter time itself. So at writeback time,
1. We first cache the i_blkseq.
2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
   already allocated.
3. Call ext2_get_blocks() a second time with the length equal to
   the no. of blocks we know are already allocated.
4. Up to this point the cached i_blkseq remains valid, as no block
   allocation has happened yet.
This means that on the next call to ->map_blocks(), we can verify
whether i_blkseq has raced with truncate or not. If not, then the
cached mapping remains valid.

In case of a hole (which can happen with mmaped writes), we only allocate
1 block at a time anyway. So even if the i_blkseq value changes right
after, we still need to allocate the next block in the subsequent
->map_blocks() call.
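
With the counter, the race above gets caught as follows (illustrative
values assumed): writeback caches validity_cookie = i_blkseq (say 42)
when it maps the 0..1MB range. The parallel truncate frees blocks and
bumps i_blkseq to 43. When ->map_blocks() is later called for the page
at offset 4k, the cached cookie (42) no longer matches i_blkseq (43),
so the cached mapping is treated as stale and looked up afresh instead
of being used to write to a stale block.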

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext2/balloc.c |  1 +
 fs/ext2/ext2.h   |  6 +++++
 fs/ext2/inode.c  | 57 ++++++++++++++++++++++++++++++++++++++++++++----
 fs/ext2/super.c  |  2 +-
 4 files changed, 61 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
index 1bfd6ab11038..047a8f41a6f5 100644
--- a/fs/ext2/balloc.c
+++ b/fs/ext2/balloc.c
@@ -495,6 +495,7 @@ void ext2_free_blocks(struct inode * inode, ext2_fsblk_t block,
 	}
 
 	ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1);
+	ext2_inc_i_blkseq(EXT2_I(inode));
 
 do_more:
 	overflow = 0;
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index f38bdd46e4f7..67b1acb08eb2 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -663,6 +663,7 @@ struct ext2_inode_info {
 	struct rw_semaphore xattr_sem;
 #endif
 	rwlock_t i_meta_lock;
+	unsigned int i_blkseq;
 
 	/*
 	 * truncate_mutex is for serialising ext2_truncate() against
@@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
 	return container_of(inode, struct ext2_inode_info, vfs_inode);
 }
 
+static inline void ext2_inc_i_blkseq(struct ext2_inode_info *ei)
+{
+	WRITE_ONCE(ei->i_blkseq, READ_ONCE(ei->i_blkseq) + 1);
+}
+
 /* balloc.c */
 extern int ext2_bg_has_super(struct super_block *sb, int group);
 extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 2b62786130b5..946a614ddfc0 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -406,6 +406,8 @@ static int ext2_alloc_blocks(struct inode *inode,
 	ext2_fsblk_t current_block = 0;
 	int ret = 0;
 
+	ext2_inc_i_blkseq(EXT2_I(inode));
+
 	/*
 	 * Here we try to allocate the requested multiple blocks at once,
 	 * on a best-effort basis.
@@ -966,15 +968,62 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 	return mpage_writepages(mapping, wbc, ext2_get_block);
 }
 
+static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc, struct inode *inode,
+			    loff_t offset)
+{
+	if (offset < wpc->iomap.offset ||
+	    offset >= wpc->iomap.offset + wpc->iomap.length)
+		return false;
+
+	if (wpc->iomap.validity_cookie != READ_ONCE(EXT2_I(inode)->i_blkseq))
+		return false;
+
+	return true;
+}
+
 static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
 				 struct inode *inode, loff_t offset,
 				 unsigned len)
 {
-	if (offset >= wpc->iomap.offset &&
-	    offset < wpc->iomap.offset + wpc->iomap.length)
+	loff_t maxblocks = (loff_t)INT_MAX;
+	u8 blkbits = inode->i_blkbits;
+	u32 bno;
+	bool new, boundary;
+	int ret;
+
+	if (ext2_imap_valid(wpc, inode, offset))
 		return 0;
 
-	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
+	/*
+	 * For ext2 buffered-writes, the block allocation happens at
+	 * ->write_iter time itself. So at writeback time -
+	 * 1. We first cache the i_blkseq.
+	 * 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
+	 *    already allocated.
+	 * 3. Call ext2_get_blocks() a second time with the length equal to
+	 *    the no. of blocks we know are already allocated.
+	 * 4. Up to this point the cached i_blkseq remains valid, as no
+	 *    block allocation has happened yet.
+	 * This means that on the next call to ->map_blocks(), we can verify
+	 * whether i_blkseq has raced with truncate or not. If not, then the
+	 * cached mapping remains valid.
+	 *
+	 * In case of a hole (which can happen with mmaped writes), we only
+	 * allocate 1 block at a time anyway. So even if the i_blkseq value
+	 * changes, we still need to allocate the next block in the
+	 * subsequent ->map_blocks() call.
+	 */
+	wpc->iomap.validity_cookie = READ_ONCE(EXT2_I(inode)->i_blkseq);
+
+	ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
+			      &bno, &new, &boundary, 0);
+	if (ret < 0)
+		return ret;
+	/*
+	 * ret can be 0 in case of a hole which is possible for mmaped writes.
+	 */
+	ret = ret ? ret : 1;
+	return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
 				IOMAP_WRITE, &wpc->iomap, NULL);
 }
 
@@ -1000,7 +1049,7 @@ ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc
 
 const struct address_space_operations ext2_file_aops = {
 	.dirty_folio		= iomap_dirty_folio,
-	.release_folio 		= iomap_release_folio,
+	.release_folio		= iomap_release_folio,
 	.invalidate_folio	= iomap_invalidate_folio,
 	.read_folio		= ext2_file_read_folio,
 	.readahead		= ext2_file_readahead,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 37f7ce56adce..32f5386284d6 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -188,7 +188,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
 #ifdef CONFIG_QUOTA
 	memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
 #endif
-
+	WRITE_ONCE(ei->i_blkseq, 0);
 	return &ei->vfs_inode;
 }
 
-- 
2.44.0



* [RFCv3 5/7] iomap: Fix iomap_adjust_read_range for plen calculation
From: Ritesh Harjani (IBM) @ 2024-04-25 13:28 UTC (permalink / raw)
  To: linux-ext4, linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Darrick J . Wong, Ojaswin Mujoo,
	Ritesh Harjani, Jan Kara

If the extent spans the block that contains the i_size, we need to
handle both halves separately but only when the i_size is within the
current folio under processing.
"orig_pos + length > isize" can be true for all folios if the mapped
extent length is greater than the folio size. That causes plen to be
truncated for every folio instead of only the last folio.

So use orig_plen for checking if "orig_pos + orig_plen > isize".
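
As an illustration (numbers assumed): with 4k blocks, a 16k folio at
pos 0, a mapped extent of length 64k and isize = 20k, the check
"orig_pos + length > isize" (0 + 64k > 20k) is true even though i_size
sits in the next folio. offset_in_folio(folio, isize - 1) then yields
block 0 as "end", and plen is wrongly shrunk from 16k to 4k. With
orig_plen = 16k, the check (0 + 16k > 20k) is false and the whole folio
is processed in one go.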

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/iomap/buffered-io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e8e41c8b3c0..9f79c82d1f73 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -241,6 +241,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	unsigned block_size = (1 << block_bits);
 	size_t poff = offset_in_folio(folio, *pos);
 	size_t plen = min_t(loff_t, folio_size(folio) - poff, length);
+	size_t orig_plen = plen;
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
 
@@ -277,7 +278,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	 * handle both halves separately so that we properly zero data in the
 	 * page cache for blocks that are entirely outside of i_size.
 	 */
-	if (orig_pos <= isize && orig_pos + length > isize) {
+	if (orig_pos <= isize && orig_pos + orig_plen > isize) {
 		unsigned end = offset_in_folio(folio, isize - 1) >> block_bits;
 
 		if (first <= end && last > end)
-- 
2.44.0



* [RFCv3 6/7] iomap: Optimize iomap_read_folio
From: Ritesh Harjani (IBM) @ 2024-04-25 13:28 UTC (permalink / raw)
  To: linux-ext4, linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Darrick J . Wong, Ojaswin Mujoo,
	Ritesh Harjani, Jan Kara

iomap_readpage_iter() handles "uptodate blocks" and "not uptodate blocks"
within a folio separately. This makes iomap_read_folio() call into
->iomap_begin() to request an extent mapping even though it might already
have an extent which is not fully processed.

This happens when we either have a large folio or bs < ps. In these
cases we can have sub-blocks which are already uptodate (e.g. due to
previous writes). With iomap_read_folio_iter(), this is handled more
efficiently by not calling ->iomap_begin() until all the sub-blocks
within the current folio are processed.
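
For example (numbers assumed): with a 64k folio, 4k blocks, and blocks
4-7 already uptodate from a previous write, iomap_readpage_iter() stops
at the uptodate range, so iomap_read_folio() previously had to return
to iomap_iter() and call ->iomap_begin() again just to continue past
it. With iomap_read_folio_iter() we instead keep calling
iomap_readpage_iter() on the current mapping until the whole folio is
processed.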

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/iomap/buffered-io.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9f79c82d1f73..0a4269095ae2 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -444,6 +444,24 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 	return pos - orig_pos + plen;
 }
 
+static loff_t iomap_read_folio_iter(const struct iomap_iter *iter,
+		struct iomap_readpage_ctx *ctx)
+{
+	struct folio *folio = ctx->cur_folio;
+	size_t pos = offset_in_folio(folio, iter->pos);
+	loff_t length = min_t(loff_t, folio_size(folio) - pos,
+			      iomap_length(iter));
+	loff_t done, ret;
+
+	for (done = 0; done < length; done += ret) {
+		ret = iomap_readpage_iter(iter, ctx, done);
+		if (ret <= 0)
+			return ret;
+	}
+
+	return done;
+}
+
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
 {
 	struct iomap_iter iter = {
@@ -459,7 +477,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
 	trace_iomap_readpage(iter.inode, 1);
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_readpage_iter(&iter, &ctx, 0);
+		iter.processed = iomap_read_folio_iter(&iter, &ctx);
 
 	if (ret < 0)
 		folio_set_error(folio);
-- 
2.44.0



* [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings
From: Ritesh Harjani (IBM) @ 2024-04-25 13:28 UTC (permalink / raw)
  To: linux-ext4, linux-xfs
  Cc: linux-fsdevel, Matthew Wilcox, Darrick J . Wong, Ojaswin Mujoo,
	Ritesh Harjani, Jan Kara

This patch optimizes the data access patterns for filesystems with
indirect block mapping by implementing BH_Boundary handling within
iomap.

Currently, the bios for reads within iomap are only submitted in
2 places -
1. If we cannot merge the new request with the previous bio, only then
   do we submit the previous bio.
2. At the end of the entire read processing.

This means that for filesystems with indirect block mapping, we call into
->iomap_begin() again without submitting the previous bios. That causes
suboptimal data access patterns for blocks of BH_Boundary type.

For example, consider the file mapping
logical block(4k) 		physical block(4k)
0-11 				1000-1011
12-15 				1013-1016

In the above, physical block 1012 is an indirect metadata block which holds
the mapping information for the next set of blocks (1013-1016).
With iomap buffered reads of the first 16 logical blocks of the file
(0-15), we get the below I/O pattern:
	- submit a bio for 1012
	- complete the bio for 1012
	- submit a bio for 1000-1011
	- submit a bio for 1013-1016
	- complete the bios for 1000-1011
	- complete the bios for 1013-1016

As we can see, the above is a non-optimal I/O access pattern, and we also
get 3 bio completions instead of 2.

This patch changes this behavior by calling submit_bio() on any already
prepared bio before calling ->iomap_begin() again. That means any blocks
which are already processed get submitted for I/O earlier, and then within
->iomap_begin(), if we get a request for reading an indirect metadata
block, the block layer can merge that bio with the already submitted read
request to reduce the no. of bio completions.

Now, for bs < ps or for large folios, this patch requires proper handling
of "ifs->read_bytes_pending". For that we first set ifs->read_bytes_pending
to the folio size, and then subtract from ifs->read_bytes_pending in all
the cases where that is needed: on the submission side (if we don't need
to submit any I/O, e.g. for uptodate sub-blocks), on an I/O error, or at
the completion of an I/O.
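
As a worked example (numbers assumed): for a 16k folio whose first 8k
is already uptodate, ifs->read_bytes_pending is initialized to 16k on
the first iomap_readpage_iter() call; the leading 8k uptodate range is
subtracted on the submission side since no I/O is needed for it, and
the remaining 8k is subtracted at I/O completion, at which point
read_bytes_pending drops to 0 and the folio can be unlocked.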

Here is the ftrace output of iomap and block layer with ext2 iomap
conversion patches -

root# filefrag -b512 -v /mnt1/test/f1
Filesystem type is: ef53
Filesystem cylinder groups approximately 32
File size of /mnt1/test/f1 is 65536 (128 blocks of 512 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      95:      98304..     98399:     96:             merged
   1:       96..     127:      98408..     98439:     32:      98400: last,merged,eof
/mnt1/test/f1: 2 extents found

root# #This reads 4 blocks starting from lblk 10, 11, 12, 13
root# xfs_io -c "pread -b$((4*4096)) $((10*4096)) $((4*4096))" /mnt1/test/f1

w/o this patch (the indirect block is submitted separately and does not get merged, resulting in 3 bio completions):
      xfs_io-907     [002] .....   185.608791: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
      xfs_io-907     [002] .....   185.608819: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
      xfs_io-907     [002] .....   185.608823: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
      xfs_io-907     [002] .....   185.608831: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
      xfs_io-907     [002] .....   185.608859: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
      xfs_io-907     [002] .....   185.608865: block_getrq: 8,16 R 98400 + 8 [xfs_io]
      xfs_io-907     [002] .....   185.608867: block_io_start: 8,16 R 4096 () 98400 + 8 [xfs_io]
      xfs_io-907     [002] .....   185.608869: block_plug: [xfs_io]
      xfs_io-907     [002] .....   185.608872: block_unplug: [xfs_io] 1
      xfs_io-907     [002] .....   185.608874: block_rq_insert: 8,16 R 4096 () 98400 + 8 [xfs_io]
kworker/2:1H-198     [002] .....   185.608908: block_rq_issue: 8,16 R 4096 () 98400 + 8 [kworker/2:1H]
      <idle>-0       [002] d.h2.   185.609579: block_rq_complete: 8,16 R () 98400 + 8 [0]
      <idle>-0       [002] dNh2.   185.609631: block_io_done: 8,16 R 0 () 98400 + 0 [swapper/2]
      xfs_io-907     [002] .....   185.609694: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
      xfs_io-907     [002] .....   185.609704: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
      xfs_io-907     [002] .....   185.609718: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
      xfs_io-907     [002] .....   185.609721: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
      xfs_io-907     [002] .....   185.609726: block_plug: [xfs_io]
      xfs_io-907     [002] .....   185.609735: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
      xfs_io-907     [002] .....   185.609736: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
      xfs_io-907     [002] .....   185.609740: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
      xfs_io-907     [002] .....   185.609741: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
      xfs_io-907     [002] .....   185.609756: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
      xfs_io-907     [002] .....   185.609769: block_rq_issue: 8,16 RA 8192 () 98384 + 16 [xfs_io]
      <idle>-0       [002] d.H2.   185.610280: block_rq_complete: 8,16 RA () 98408 + 16 [0]
      <idle>-0       [002] d.H2.   185.610289: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/2]
      <idle>-0       [002] d.H2.   185.610292: block_rq_complete: 8,16 RA () 98384 + 16 [0]
      <idle>-0       [002] dNH2.   185.610301: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/2]

v/s with the patch (optimized I/O access pattern; the bios get merged, resulting in only 2 bio completions):
      xfs_io-944     [005] .....    99.926187: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
      xfs_io-944     [005] .....    99.926208: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
      xfs_io-944     [005] .....    99.926211: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
      xfs_io-944     [005] .....    99.926222: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
      xfs_io-944     [005] .....    99.926232: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
      xfs_io-944     [005] .....    99.926233: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
      xfs_io-944     [005] .....    99.926234: block_plug: [xfs_io]
      xfs_io-944     [005] .....    99.926235: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
      xfs_io-944     [005] .....    99.926261: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
      xfs_io-944     [005] .....    99.926266: block_bio_backmerge: 8,16 R 98400 + 8 [xfs_io]
      xfs_io-944     [005] .....    99.926271: block_unplug: [xfs_io] 1
      xfs_io-944     [005] .....    99.926272: block_rq_insert: 8,16 RA 12288 () 98384 + 24 [xfs_io]
kworker/5:1H-234     [005] .....    99.926314: block_rq_issue: 8,16 RA 12288 () 98384 + 24 [kworker/5:1H]
      <idle>-0       [005] d.h2.    99.926905: block_rq_complete: 8,16 RA () 98384 + 24 [0]
      <idle>-0       [005] dNh2.    99.926931: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/5]
      xfs_io-944     [005] .....    99.926971: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
      xfs_io-944     [005] .....    99.926981: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
      xfs_io-944     [005] .....    99.926989: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
      xfs_io-944     [005] .....    99.926989: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
      xfs_io-944     [005] .....    99.926991: block_plug: [xfs_io]
      xfs_io-944     [005] .....    99.926993: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
      xfs_io-944     [005] .....    99.927001: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
      <idle>-0       [005] d.h2.    99.927397: block_rq_complete: 8,16 RA () 98408 + 16 [0]
      <idle>-0       [005] dNh2.    99.927414: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/5]

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/iomap/buffered-io.c | 112 +++++++++++++++++++++++++++++++----------
 1 file changed, 85 insertions(+), 27 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0a4269095ae2..a1d50086a3f5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -30,7 +30,7 @@ typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length);
  */
 struct iomap_folio_state {
 	spinlock_t		state_lock;
-	unsigned int		read_bytes_pending;
+	size_t			read_bytes_pending;
 	atomic_t		write_bytes_pending;

 	/*
@@ -380,6 +380,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 	loff_t orig_pos = pos;
 	size_t poff, plen;
 	sector_t sector;
+	bool rbp_finished = false;

 	if (iomap->type == IOMAP_INLINE)
 		return iomap_read_inline_data(iter, folio);
@@ -387,21 +388,39 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 	/* zero post-eof blocks as the page may be mapped */
 	ifs = ifs_alloc(iter->inode, folio, iter->flags);
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
+
+	if (ifs) {
+		loff_t to_read = min_t(loff_t, iter->len - offset,
+			folio_size(folio) - offset_in_folio(folio, orig_pos));
+		size_t padjust;
+
+		spin_lock_irq(&ifs->state_lock);
+		if (!ifs->read_bytes_pending)
+			ifs->read_bytes_pending = to_read;
+		padjust = pos - orig_pos;
+		ifs->read_bytes_pending -= padjust;
+		if (!ifs->read_bytes_pending)
+			rbp_finished = true;
+		spin_unlock_irq(&ifs->state_lock);
+	}
+
 	if (plen == 0)
 		goto done;

 	if (iomap_block_needs_zeroing(iter, pos)) {
+		if (ifs) {
+			spin_lock_irq(&ifs->state_lock);
+			ifs->read_bytes_pending -= plen;
+			if (!ifs->read_bytes_pending)
+				rbp_finished = true;
+			spin_unlock_irq(&ifs->state_lock);
+		}
 		folio_zero_range(folio, poff, plen);
 		iomap_set_range_uptodate(folio, poff, plen);
 		goto done;
 	}

 	ctx->cur_folio_in_bio = true;
-	if (ifs) {
-		spin_lock_irq(&ifs->state_lock);
-		ifs->read_bytes_pending += plen;
-		spin_unlock_irq(&ifs->state_lock);
-	}

 	sector = iomap_sector(iomap, pos);
 	if (!ctx->bio ||
@@ -435,6 +454,14 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 	}

 done:
+	/*
+	 * If there is no bio prepared and if rbp is finished and
+	 * this was the last offset within this folio then mark
+	 * cur_folio_in_bio to false.
+	 */
+	if (!ctx->bio && rbp_finished &&
+			offset_in_folio(folio, pos + plen) == 0)
+		ctx->cur_folio_in_bio = false;
 	/*
 	 * Move the caller beyond our range so that it keeps making progress.
 	 * For that, we have to include any leading non-uptodate ranges, but
@@ -459,9 +486,43 @@ static loff_t iomap_read_folio_iter(const struct iomap_iter *iter,
 			return ret;
 	}

+	if (ctx->bio) {
+		submit_bio(ctx->bio);
+		WARN_ON_ONCE(!ctx->cur_folio_in_bio);
+		ctx->bio = NULL;
+	}
+	if (offset_in_folio(folio, iter->pos + done) == 0 &&
+			!ctx->cur_folio_in_bio) {
+		folio_unlock(ctx->cur_folio);
+	}
+
 	return done;
 }

+static void iomap_handle_read_error(struct iomap_readpage_ctx *ctx,
+		struct iomap_iter *iter)
+{
+	struct folio *folio = ctx->cur_folio;
+	struct iomap_folio_state *ifs;
+	unsigned long flags;
+	bool rbp_finished = false;
+	size_t rbp_adjust = folio_size(folio) - offset_in_folio(folio,
+				iter->pos);
+	ifs = folio->private;
+	if (!ifs || !ifs->read_bytes_pending)
+		goto unlock;
+
+	spin_lock_irqsave(&ifs->state_lock, flags);
+	ifs->read_bytes_pending -= rbp_adjust;
+	if (!ifs->read_bytes_pending)
+		rbp_finished = true;
+	spin_unlock_irqrestore(&ifs->state_lock, flags);
+
+unlock:
+	if (rbp_finished || !ctx->cur_folio_in_bio)
+		folio_unlock(folio);
+}
+
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
 {
 	struct iomap_iter iter = {
@@ -479,15 +540,9 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
 	while ((ret = iomap_iter(&iter, ops)) > 0)
 		iter.processed = iomap_read_folio_iter(&iter, &ctx);

-	if (ret < 0)
+	if (ret < 0) {
 		folio_set_error(folio);
-
-	if (ctx.bio) {
-		submit_bio(ctx.bio);
-		WARN_ON_ONCE(!ctx.cur_folio_in_bio);
-	} else {
-		WARN_ON_ONCE(ctx.cur_folio_in_bio);
-		folio_unlock(folio);
+		iomap_handle_read_error(&ctx, &iter);
 	}

 	/*
@@ -506,12 +561,6 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
 	loff_t done, ret;

 	for (done = 0; done < length; done += ret) {
-		if (ctx->cur_folio &&
-		    offset_in_folio(ctx->cur_folio, iter->pos + done) == 0) {
-			if (!ctx->cur_folio_in_bio)
-				folio_unlock(ctx->cur_folio);
-			ctx->cur_folio = NULL;
-		}
 		if (!ctx->cur_folio) {
 			ctx->cur_folio = readahead_folio(ctx->rac);
 			ctx->cur_folio_in_bio = false;
@@ -519,6 +568,17 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
 		ret = iomap_readpage_iter(iter, ctx, done);
 		if (ret <= 0)
 			return ret;
+		if (ctx->cur_folio && offset_in_folio(ctx->cur_folio,
+					iter->pos + done + ret) == 0) {
+			if (!ctx->cur_folio_in_bio)
+				folio_unlock(ctx->cur_folio);
+			ctx->cur_folio = NULL;
+		}
+	}
+
+	if (ctx->bio) {
+		submit_bio(ctx->bio);
+		ctx->bio = NULL;
 	}

 	return done;
@@ -549,18 +609,16 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
 	struct iomap_readpage_ctx ctx = {
 		.rac	= rac,
 	};
+	int ret = 0;

 	trace_iomap_readahead(rac->mapping->host, readahead_count(rac));

-	while (iomap_iter(&iter, ops) > 0)
+	while ((ret = iomap_iter(&iter, ops)) > 0)
 		iter.processed = iomap_readahead_iter(&iter, &ctx);

-	if (ctx.bio)
-		submit_bio(ctx.bio);
-	if (ctx.cur_folio) {
-		if (!ctx.cur_folio_in_bio)
-			folio_unlock(ctx.cur_folio);
-	}
+	if (ret < 0 && ctx.cur_folio)
+		iomap_handle_read_error(&ctx, &iter);
+
 }
 EXPORT_SYMBOL_GPL(iomap_readahead);

--
2.44.0



* Re: [RFCv3 6/7] iomap: Optimize iomap_read_folio
From: Matthew Wilcox @ 2024-04-25 13:53 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, linux-xfs, linux-fsdevel, Darrick J . Wong,
	Ojaswin Mujoo, Jan Kara

On Thu, Apr 25, 2024 at 06:58:50PM +0530, Ritesh Harjani (IBM) wrote:
> +static loff_t iomap_read_folio_iter(const struct iomap_iter *iter,
> +		struct iomap_readpage_ctx *ctx)
> +{
> +	struct folio *folio = ctx->cur_folio;
> +	size_t pos = offset_in_folio(folio, iter->pos);

"pos" is position in file.  You should call this 'offset'.

> +	loff_t length = min_t(loff_t, folio_size(folio) - pos,
> +			      iomap_length(iter));
> +	loff_t done, ret;
> +
> +	for (done = 0; done < length; done += ret) {
> +		ret = iomap_readpage_iter(iter, ctx, done);
> +		if (ret <= 0)
> +			return ret;
> +	}


* Re: [RFCv3 5/7] iomap: Fix iomap_adjust_read_range for plen calculation
From: Christoph Hellwig @ 2024-04-26  6:49 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Darrick J . Wong, Ojaswin Mujoo, Jan Kara

On Thu, Apr 25, 2024 at 06:58:49PM +0530, Ritesh Harjani (IBM) wrote:
> If the extent spans the block that contains the i_size, we need to

s/the i_size/i_size/.

> handle both halves separately

.. so that we properly zero data in the page cache for blocks that are
entirely outside of i_size.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [RFCv3 6/7] iomap: Optimize iomap_read_folio
From: Christoph Hellwig @ 2024-04-26  6:53 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Darrick J . Wong, Ojaswin Mujoo, Jan Kara

On Thu, Apr 25, 2024 at 06:58:50PM +0530, Ritesh Harjani (IBM) wrote:
> iomap_readpage_iter() handles "uptodate blocks" and "not uptodate blocks"
> within a folio separately. This makes iomap_read_folio() call into
> ->iomap_begin() to request an extent mapping even though it might already
> have an extent which is not fully processed.
>
> This happens when we either have a large folio or bs < ps. In these
> cases we can have sub-blocks which are already uptodate (e.g. due to
> previous writes). With iomap_read_folio_iter(), this is handled more
> efficiently by not calling ->iomap_begin() until all the sub-blocks
> within the current folio are processed.

Maybe throw in a sentence here that this copies what
iomap_readahead_iter already does?

Otherwise this looks good to me modulo the offset comment from willy.


* Re: [RFCv3 6/7] iomap: Optimize iomap_read_folio
From: Ritesh Harjani @ 2024-04-26  8:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-ext4, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Darrick J . Wong, Ojaswin Mujoo, Jan Kara

Christoph Hellwig <hch@infradead.org> writes:

> On Thu, Apr 25, 2024 at 06:58:50PM +0530, Ritesh Harjani (IBM) wrote:
>> iomap_readpage_iter() handles "uptodate blocks" and "not uptodate blocks"
>> within a folio separately. This makes iomap_read_folio() call into
>> ->iomap_begin() to request an extent mapping even though it might already
>> have an extent which is not fully processed.
>>
>> This happens when we either have a large folio or bs < ps. In these
>> cases we can have sub-blocks which are already uptodate (e.g. due to
>> previous writes). With iomap_read_folio_iter(), this is handled more
>> efficiently by not calling ->iomap_begin() until all the sub-blocks
>> within the current folio are processed.
>
> Maybe throw in a sentence here that this copies what
> iomap_readahead_iter already does?

Does this sound any better?

iomap_read_folio_iter() handles multiple sub-blocks within a given
folio, but its implementation logic is similar to how
iomap_readahead_iter() handles multiple folios within a single mapped
extent. Both iterate over a given range of a folio/mapped extent and
call iomap_readpage_iter() for reading.


>
> Otherwise this looks good to me modulo the offset comment from willy.

Yes, I will address willy's comment too. 
Thanks for the review!

-ritesh


* Re: [RFCv3 5/7] iomap: Fix iomap_adjust_read_range for plen calculation
From: Ritesh Harjani @ 2024-04-26  8:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-ext4, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Darrick J . Wong, Ojaswin Mujoo, Jan Kara

Christoph Hellwig <hch@infradead.org> writes:

> On Thu, Apr 25, 2024 at 06:58:49PM +0530, Ritesh Harjani (IBM) wrote:
>> If the extent spans the block that contains the i_size, we need to
>
> s/the i_size/i_size/.
>
>> handle both halves separately
>
> .. so that we properly zero data in the page cache for blocks that are
> entirely outside of i_size.

Sure. 

>
> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks for the review.

-ritesh


* Re: [RFCv3 1/7] ext2: Remove comment related to journal handle
From: Darrick J. Wong @ 2024-04-26 15:21 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Ojaswin Mujoo, Jan Kara

On Thu, Apr 25, 2024 at 06:58:45PM +0530, Ritesh Harjani (IBM) wrote:
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

What handle? ;)
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/ext2/inode.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index f3d570a9302b..c4de3a94c4b2 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -615,8 +615,6 @@ static void ext2_splice_branch(struct inode *inode,
>   * allocations is needed - we simply release blocks and do not touch anything
>   * reachable from inode.
>   *
> - * `handle' can be NULL if create == 0.
> - *
>   * return > 0, # of blocks mapped or allocated.
>   * return = 0, if plain lookup failed.
>   * return < 0, error case.
> -- 
> 2.44.0
> 
> 


* Re: [RFCv3 2/7] ext2: Convert ext2 regular file buffered I/O to use iomap
From: Darrick J. Wong @ 2024-04-26 15:29 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Ojaswin Mujoo, Jan Kara

On Thu, Apr 25, 2024 at 06:58:46PM +0530, Ritesh Harjani (IBM) wrote:
> This patch converts ext2 regular file's buffered-io path to use iomap.
> - buffered-io path using iomap_file_buffered_write
> - DIO fallback to buffered-io now uses iomap_file_buffered_write
> - writeback path now uses a new aops - ext2_file_aops
> - truncate now uses iomap_truncate_page
> - mmap path of ext2 continues to use generic_file_vm_ops
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/ext2/file.c  | 20 ++++++++++++--
>  fs/ext2/inode.c | 69 ++++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 81 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index 4ddc36f4dbd4..ee5cd4a2f24f 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -252,7 +252,7 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  
>  		iocb->ki_flags &= ~IOCB_DIRECT;
>  		pos = iocb->ki_pos;
> -		status = generic_perform_write(iocb, from);
> +		status = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
>  		if (unlikely(status < 0)) {
>  			ret = status;
>  			goto out_unlock;
> @@ -278,6 +278,22 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	return ret;
>  }
>  
> +static ssize_t ext2_buffered_write_iter(struct kiocb *iocb,
> +					struct iov_iter *from)
> +{
> +	ssize_t ret = 0;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	inode_lock(inode);
> +	ret = generic_write_checks(iocb, from);
> +	if (ret > 0)
> +		ret = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
> +	inode_unlock(inode);
> +	if (ret > 0)
> +		ret = generic_write_sync(iocb, ret);
> +	return ret;
> +}
> +
>  static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  {
>  #ifdef CONFIG_FS_DAX
> @@ -299,7 +315,7 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	if (iocb->ki_flags & IOCB_DIRECT)
>  		return ext2_dio_write_iter(iocb, from);
>  
> -	return generic_file_write_iter(iocb, from);
> +	return ext2_buffered_write_iter(iocb, from);
>  }
>  
>  const struct file_operations ext2_file_operations = {
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index c4de3a94c4b2..f90d280025d9 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -877,10 +877,14 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>  	if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE) && written == 0)
>  		return -ENOTBLK;
>  
> -	if (iomap->type == IOMAP_MAPPED &&
> -	    written < length &&
> -	    (flags & IOMAP_WRITE))
> +	if (iomap->type == IOMAP_MAPPED && written < length &&
> +	   (flags & IOMAP_WRITE)) {
>  		ext2_write_failed(inode->i_mapping, offset + length);
> +		return 0;
> +	}
> +
> +	if (iomap->flags & IOMAP_F_SIZE_CHANGED)
> +		mark_inode_dirty(inode);
>  	return 0;
>  }
>  
> @@ -912,6 +916,16 @@ static void ext2_readahead(struct readahead_control *rac)
>  	mpage_readahead(rac, ext2_get_block);
>  }
>  
> +static int ext2_file_read_folio(struct file *file, struct folio *folio)
> +{
> +	return iomap_read_folio(folio, &ext2_iomap_ops);
> +}
> +
> +static void ext2_file_readahead(struct readahead_control *rac)
> +{
> +	iomap_readahead(rac, &ext2_iomap_ops);
> +}
> +
>  static int
>  ext2_write_begin(struct file *file, struct address_space *mapping,
>  		loff_t pos, unsigned len, struct page **pagep, void **fsdata)
> @@ -941,12 +955,41 @@ static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
>  	return generic_block_bmap(mapping,block,ext2_get_block);
>  }
>  
> +static sector_t ext2_file_bmap(struct address_space *mapping, sector_t block)
> +{
> +	return iomap_bmap(mapping, block, &ext2_iomap_ops);
> +}
> +
>  static int
>  ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  {
>  	return mpage_writepages(mapping, wbc, ext2_get_block);
>  }
>  
> +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
> +				 struct inode *inode, loff_t offset,
> +				 unsigned len)
> +{
> +	if (offset >= wpc->iomap.offset &&
> +	    offset < wpc->iomap.offset + wpc->iomap.length)
> +		return 0;
> +
> +	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> +				IOMAP_WRITE, &wpc->iomap, NULL);
> +}

Soooo... this is almost a directio write of the pagecache? ;)

> +
> +static const struct iomap_writeback_ops ext2_writeback_ops = {
> +	.map_blocks		= ext2_write_map_blocks,
> +};
> +
> +static int ext2_file_writepages(struct address_space *mapping,
> +				struct writeback_control *wbc)
> +{
> +	struct iomap_writepage_ctx wpc = { };
> +
> +	return iomap_writepages(mapping, wbc, &wpc, &ext2_writeback_ops);
> +}
> +
>  static int
>  ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  {
> @@ -955,6 +998,20 @@ ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc
>  	return dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
>  }
>  
> +const struct address_space_operations ext2_file_aops = {
> +	.dirty_folio		= iomap_dirty_folio,
> +	.release_folio 		= iomap_release_folio,

trailing space here   ^

> +	.invalidate_folio	= iomap_invalidate_folio,
> +	.read_folio		= ext2_file_read_folio,
> +	.readahead		= ext2_file_readahead,
> +	.bmap			= ext2_file_bmap,
> +	.direct_IO		= noop_direct_IO,

Nowadays, it is preferred to set FMODE_CAN_ODIRECT and skip setting
->direct_IO.  But I see that ext2 hasn't been converted, so this is a
minor point.
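
A minimal sketch of what that conversion might look like (assuming
ext2 keeps dquot_file_open() as its ->open; untested, illustration
only):

	static int ext2_file_open(struct inode *inode, struct file *filp)
	{
		/* advertise O_DIRECT support instead of a ->direct_IO stub */
		filp->f_mode |= FMODE_CAN_ODIRECT;
		return dquot_file_open(inode, filp);
	}

with .open = ext2_file_open wired into ext2_file_operations, after
which the ->direct_IO line above could be dropped.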

> +	.writepages		= ext2_file_writepages,
> +	.migrate_folio		= filemap_migrate_folio,
> +	.is_partially_uptodate	= iomap_is_partially_uptodate,
> +	.error_remove_folio	= generic_error_remove_folio,
> +};
> +
>  const struct address_space_operations ext2_aops = {

I wonder, could directories and symlinks get converted to iomap at some
point?  (It's ok if that is not in scope for this series.)

Looks good to me,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  	.dirty_folio		= block_dirty_folio,
>  	.invalidate_folio	= block_invalidate_folio,
> @@ -1279,8 +1336,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
>  		error = dax_truncate_page(inode, newsize, NULL,
>  					  &ext2_iomap_ops);
>  	else
> -		error = block_truncate_page(inode->i_mapping,
> -				newsize, ext2_get_block);
> +		error = iomap_truncate_page(inode, newsize, NULL,
> +					    &ext2_iomap_ops);
>  	if (error)
>  		return error;
>  
> @@ -1370,7 +1427,7 @@ void ext2_set_file_ops(struct inode *inode)
>  	if (IS_DAX(inode))
>  		inode->i_mapping->a_ops = &ext2_dax_aops;
>  	else
> -		inode->i_mapping->a_ops = &ext2_aops;
> +		inode->i_mapping->a_ops = &ext2_file_aops;
>  }
>  
>  struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
> -- 
> 2.44.0
> 
> 


* Re: [RFCv3 3/7] ext2: Enable large folio support
From: Darrick J. Wong @ 2024-04-26 15:30 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Ojaswin Mujoo, Jan Kara

On Thu, Apr 25, 2024 at 06:58:47PM +0530, Ritesh Harjani (IBM) wrote:
> Now that ext2 regular file buffered-io path is converted to use iomap,
> we can also enable large folio support for ext2.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

This filesystem is looking more and more modern!
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/ext2/inode.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index f90d280025d9..2b62786130b5 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -1424,10 +1424,12 @@ void ext2_set_file_ops(struct inode *inode)
>  {
>  	inode->i_op = &ext2_file_inode_operations;
>  	inode->i_fop = &ext2_file_operations;
> -	if (IS_DAX(inode))
> +	if (IS_DAX(inode)) {
>  		inode->i_mapping->a_ops = &ext2_dax_aops;
> -	else
> +	} else {
>  		inode->i_mapping->a_ops = &ext2_file_aops;
> +		mapping_set_large_folios(inode->i_mapping);
> +	}
>  }
>  
>  struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
> -- 
> 2.44.0
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFCv3 4/7] ext2: Implement seq counter for validating cached iomap
  2024-04-25 13:28 ` [RFCv3 4/7] ext2: Implement seq counter for validating cached iomap Ritesh Harjani (IBM)
@ 2024-04-26 15:41   ` Darrick J. Wong
  0 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2024-04-26 15:41 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Ojaswin Mujoo, Jan Kara

On Thu, Apr 25, 2024 at 06:58:48PM +0530, Ritesh Harjani (IBM) wrote:
> There is a possibility of the following race with iomap during
> writeback -
> 
> write_cache_pages()
>   cache extent covering 0..1MB range
>   write page at offset 0k
> 					truncate(file, 4k)
> 					  drops all relevant pages
> 					  frees fs blocks
> 					pwrite(file, 4k, 4k)
> 					  creates dirty page in the page cache
>   writes page at offset 4k to a stale block
> 
> This race can happen because iomap_writepages() keeps a cached extent
> mapping within struct iomap (which can cover a large range). While
> write_cache_pages() is iterating over each folio, if a truncate happens
> in parallel on the next folio, followed by a buffered write to the same
> offset within the file, the logical to physical offset of the cached
> iomap mapping can change. That means the cached iomap has become stale.
> 
> This patch implements the seq counter approach for revalidation of stale
> iomap mappings. i_blkseq will get incremented for every block
> allocation/free. Here is what we do -
> 
> For ext2 buffered-writes, the block allocation happens at the
> ->write_iter time itself. So at writeback time,
> 1. We first cache the i_blkseq.
> 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
>    already allocated.
> 3. Call ext2_get_blocks() a second time, with the length set to the
>    no. of blocks we know were already allocated.
> 4. Up to this point the cached i_blkseq remains valid, as no block
>    allocation has happened yet.
> This means that on the next call to ->map_blocks(), we can verify
> whether i_blkseq has raced with truncate or not. If not, then the
> cached mapping is still usable.
> 
> In case of a hole (which can happen with mmapped writes), we only
> allocate 1 block at a time anyway. So even if the i_blkseq value
> changes right after, we still need to allocate the next block in the
> subsequent ->map_blocks() call.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/ext2/balloc.c |  1 +
>  fs/ext2/ext2.h   |  6 +++++
>  fs/ext2/inode.c  | 57 ++++++++++++++++++++++++++++++++++++++++++++----
>  fs/ext2/super.c  |  2 +-
>  4 files changed, 61 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> index 1bfd6ab11038..047a8f41a6f5 100644
> --- a/fs/ext2/balloc.c
> +++ b/fs/ext2/balloc.c
> @@ -495,6 +495,7 @@ void ext2_free_blocks(struct inode * inode, ext2_fsblk_t block,
>  	}
>  
>  	ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1);
> +	ext2_inc_i_blkseq(EXT2_I(inode));
>  
>  do_more:
>  	overflow = 0;
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index f38bdd46e4f7..67b1acb08eb2 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -663,6 +663,7 @@ struct ext2_inode_info {
>  	struct rw_semaphore xattr_sem;
>  #endif
>  	rwlock_t i_meta_lock;
> +	unsigned int i_blkseq;
>  
>  	/*
>  	 * truncate_mutex is for serialising ext2_truncate() against
> @@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
>  	return container_of(inode, struct ext2_inode_info, vfs_inode);
>  }
>  
> +static inline void ext2_inc_i_blkseq(struct ext2_inode_info *ei)
> +{
> +	WRITE_ONCE(ei->i_blkseq, READ_ONCE(ei->i_blkseq) + 1);
> +}
> +
>  /* balloc.c */
>  extern int ext2_bg_has_super(struct super_block *sb, int group);
>  extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group);
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 2b62786130b5..946a614ddfc0 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -406,6 +406,8 @@ static int ext2_alloc_blocks(struct inode *inode,
>  	ext2_fsblk_t current_block = 0;
>  	int ret = 0;
>  
> +	ext2_inc_i_blkseq(EXT2_I(inode));
> +
>  	/*
>  	 * Here we try to allocate the requested multiple blocks at once,
>  	 * on a best-effort basis.
> @@ -966,15 +968,62 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  	return mpage_writepages(mapping, wbc, ext2_get_block);
>  }
>  
> +static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc, struct inode *inode,
> +			    loff_t offset)

ext2_iomap_valid, to stay consistent with the ext4 conversion series?

> +{
> +	if (offset < wpc->iomap.offset ||
> +	    offset >= wpc->iomap.offset + wpc->iomap.length)
> +		return false;
> +
> +	if (wpc->iomap.validity_cookie != READ_ONCE(EXT2_I(inode)->i_blkseq))
> +		return false;
> +
> +	return true;
> +}
> +
>  static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
>  				 struct inode *inode, loff_t offset,
>  				 unsigned len)
>  {
> -	if (offset >= wpc->iomap.offset &&
> -	    offset < wpc->iomap.offset + wpc->iomap.length)
> +	loff_t maxblocks = (loff_t)INT_MAX;
> +	u8 blkbits = inode->i_blkbits;
> +	u32 bno;
> +	bool new, boundary;
> +	int ret;
> +
> +	if (ext2_imap_valid(wpc, inode, offset))
>  		return 0;
>  
> -	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> +	/*
> +	 * For ext2 buffered-writes, the block allocation happens at the
> +	 * ->write_iter time itself. So at writeback time -
> +	 * 1. We first cache the i_blkseq.
> +	 * 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
> +	 *    already allocated.
> +	 * 3. Call ext2_get_blocks() a second time, with the length set to
> +	 *    the no. of blocks we know were already allocated.
> +	 * 4. Up to this point the cached i_blkseq remains valid, as no
> +	 *    block allocation has happened yet.
> +	 * This means that on the next call to ->map_blocks(), we can verify
> +	 * whether i_blkseq has raced with truncate or not. If not, then the
> +	 * cached mapping is still usable.
> +	 *
> +	 * In case of a hole (which can happen with mmapped writes), we only
> +	 * allocate 1 block at a time anyway. So even if the i_blkseq value
> +	 * changes, we still need to allocate the next block in the
> +	 * subsequent ->map_blocks() call.

You might want to leave a comment here that ext2 doesn't support
unwritten extents, so the validation cookie is needed only for
writeback, and not for pagecache writes themselves.  I don't expect
anyone to port extents (and hence unwritten blocks) to ext2, so this is
a minor point.
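
Something like this, purely as a suggestion for the wording:

	/*
	 * ext2 has no unwritten extents, so blocks are fully allocated
	 * and mapped at buffered-write time. The validity cookie is
	 * therefore only needed to guard writeback against a racing
	 * truncate + reallocation, not the pagecache writes themselves.
	 */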

> +	 */
> +	wpc->iomap.validity_cookie = READ_ONCE(EXT2_I(inode)->i_blkseq);
> +
> +	ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
> +			      &bno, &new, &boundary, 0);
> +	if (ret < 0)
> +		return ret;
> +	/*
> +	 * ret can be 0 in case of a hole, which is possible for mmapped writes.
> +	 */
> +	ret = ret ? ret : 1;
> +	return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
>  				IOMAP_WRITE, &wpc->iomap, NULL);
>  }
>  
> @@ -1000,7 +1049,7 @@ ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc
>  
>  const struct address_space_operations ext2_file_aops = {
>  	.dirty_folio		= iomap_dirty_folio,
> -	.release_folio 		= iomap_release_folio,
> +	.release_folio		= iomap_release_folio,

This fix should be in patch 2.

--D

>  	.invalidate_folio	= iomap_invalidate_folio,
>  	.read_folio		= ext2_file_read_folio,
>  	.readahead		= ext2_file_readahead,
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 37f7ce56adce..32f5386284d6 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -188,7 +188,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
>  #ifdef CONFIG_QUOTA
>  	memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
>  #endif
> -
> +	WRITE_ONCE(ei->i_blkseq, 0);
>  	return &ei->vfs_inode;
>  }
>  
> -- 
> 2.44.0
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFCv3 5/7] iomap: Fix iomap_adjust_read_range for plen calculation
  2024-04-26  8:52     ` Ritesh Harjani
@ 2024-04-26 15:43       ` Darrick J. Wong
  0 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2024-04-26 15:43 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Christoph Hellwig, linux-ext4, linux-xfs, linux-fsdevel,
	Matthew Wilcox, Ojaswin Mujoo, Jan Kara

On Fri, Apr 26, 2024 at 02:22:25PM +0530, Ritesh Harjani wrote:
> Christoph Hellwig <hch@infradead.org> writes:
> 
> > On Thu, Apr 25, 2024 at 06:58:49PM +0530, Ritesh Harjani (IBM) wrote:
> >> If the extent spans the block that contains the i_size, we need to
> >
> > s/the i_size/i_size/.
> >
> >> handle both halves separately
> >
> > .. so that we properly zero data in the page cache for blocks that are
> > entirely outside of i_size.
> 
> Sure. 
> 
> >
> > Otherwise looks good:
> >
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Thanks for the review.

With Christoph's comments addressed, you can also add
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
> -ritesh
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings
  2024-04-25 13:28 ` [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings Ritesh Harjani (IBM)
@ 2024-04-26 16:24   ` Darrick J. Wong
  2024-04-26 17:17     ` Ritesh Harjani
  2024-04-27  4:47   ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2024-04-26 16:24 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Ojaswin Mujoo, Jan Kara

On Thu, Apr 25, 2024 at 06:58:51PM +0530, Ritesh Harjani (IBM) wrote:
> This patch optimizes the data access patterns for filesystems with
> indirect block mapping by implementing BH_Boundary handling within
> iomap.
> 
> Currently the bios for reads within iomap are only submitted at
> 2 places -
> 1. If we cannot merge the new req. with previous bio, only then we
>    submit the previous bio.
> 2. Submit the bio at the end of the entire read processing.
> 
> This means for filesystems with indirect block mapping, we call into
> ->iomap_begin() again w/o submitting the previous bios. That causes
> unoptimized data access patterns for blocks which are of BH_Boundary type.
> 
> For e.g. consider the file mapping
> logical block(4k) 		physical block(4k)
> 0-11 				1000-1011
> 12-15 				1013-1016
> 
> In above physical block 1012 is an indirect metadata block which has the
> mapping information for next set of indirect blocks (1013-1016).
> With iomap buffered reads for reading the 1st 16 logical blocks of a
> file (0-15), we get the below I/O pattern -
> 	- submit a bio for 1012
> 	- complete the bio for 1012
> 	- submit a bio for 1000-1011
> 	- submit a bio for 1013-1016
> 	- complete the bio for 1000-1011
> 	- complete the bio for 1013-1016
> 
> So as we can see, the above is a non-optimal I/O access pattern, and we
> also get 3 bio completions instead of 2.
> 
> This patch changes this behavior by doing submit_bio() on any already
> processed bios before calling ->iomap_begin() again.
> That means any blocks which are already processed get submitted for I/O
> earlier, and then within ->iomap_begin(), if we get a request for
> reading an indirect metadata block, the block layer can merge that bio
> with the already submitted read request to reduce the no. of bio
> completions.
> 
> Now, for bs < ps or for large folios, this patch requires proper
> handling of "ifs->read_bytes_pending". For that, we first set
> ifs->read_bytes_pending to the folio size. Then we handle all the cases
> where we need to subtract from ifs->read_bytes_pending: on the
> submission side (if we don't need to submit any I/O, e.g. for uptodate
> sub blocks), on an I/O error, or at the completion of an I/O.
> 
> Here is the ftrace output of iomap and block layer with ext2 iomap
> conversion patches -
> 
> root# filefrag -b512 -v /mnt1/test/f1
> Filesystem type is: ef53
> Filesystem cylinder groups approximately 32
> File size of /mnt1/test/f1 is 65536 (128 blocks of 512 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      95:      98304..     98399:     96:             merged
>    1:       96..     127:      98408..     98439:     32:      98400: last,merged,eof
> /mnt1/test/f1: 2 extents found
> 
> root# #This reads 4 blocks starting from lblk 10, 11, 12, 13
> root# xfs_io -c "pread -b$((4*4096)) $((10*4096)) $((4*4096))" /mnt1/test/f1
> 
> w/o this patch - (the indirect block is submitted first and does not get merged, resulting in 3 bio completions)
>       xfs_io-907     [002] .....   185.608791: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
>       xfs_io-907     [002] .....   185.608819: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
>       xfs_io-907     [002] .....   185.608823: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
>       xfs_io-907     [002] .....   185.608831: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
>       xfs_io-907     [002] .....   185.608859: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
>       xfs_io-907     [002] .....   185.608865: block_getrq: 8,16 R 98400 + 8 [xfs_io]
>       xfs_io-907     [002] .....   185.608867: block_io_start: 8,16 R 4096 () 98400 + 8 [xfs_io]
>       xfs_io-907     [002] .....   185.608869: block_plug: [xfs_io]
>       xfs_io-907     [002] .....   185.608872: block_unplug: [xfs_io] 1
>       xfs_io-907     [002] .....   185.608874: block_rq_insert: 8,16 R 4096 () 98400 + 8 [xfs_io]
> kworker/2:1H-198     [002] .....   185.608908: block_rq_issue: 8,16 R 4096 () 98400 + 8 [kworker/2:1H]
>       <idle>-0       [002] d.h2.   185.609579: block_rq_complete: 8,16 R () 98400 + 8 [0]
>       <idle>-0       [002] dNh2.   185.609631: block_io_done: 8,16 R 0 () 98400 + 0 [swapper/2]
>       xfs_io-907     [002] .....   185.609694: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
>       xfs_io-907     [002] .....   185.609704: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
>       xfs_io-907     [002] .....   185.609718: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
>       xfs_io-907     [002] .....   185.609721: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>       xfs_io-907     [002] .....   185.609726: block_plug: [xfs_io]
>       xfs_io-907     [002] .....   185.609735: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
>       xfs_io-907     [002] .....   185.609736: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
>       xfs_io-907     [002] .....   185.609740: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
>       xfs_io-907     [002] .....   185.609741: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>       xfs_io-907     [002] .....   185.609756: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>       xfs_io-907     [002] .....   185.609769: block_rq_issue: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>       <idle>-0       [002] d.H2.   185.610280: block_rq_complete: 8,16 RA () 98408 + 16 [0]
>       <idle>-0       [002] d.H2.   185.610289: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/2]
>       <idle>-0       [002] d.H2.   185.610292: block_rq_complete: 8,16 RA () 98384 + 16 [0]
>       <idle>-0       [002] dNH2.   185.610301: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/2]

Could this be shortened to ... the iomap calls and
block_bio_queue/backmerge?  It's a bit difficult to see the point you're
getting at with all the other noise.

I think you're trying to say that the access pattern here is 98400 ->
98408 -> 98384, which is not sequential?

> v/s with the patch - (optimized I/O access pattern; the bio gets merged, resulting in only 2 bio completions)
>       xfs_io-944     [005] .....    99.926187: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
>       xfs_io-944     [005] .....    99.926208: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
>       xfs_io-944     [005] .....    99.926211: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
>       xfs_io-944     [005] .....    99.926222: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
>       xfs_io-944     [005] .....    99.926232: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
>       xfs_io-944     [005] .....    99.926233: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>       xfs_io-944     [005] .....    99.926234: block_plug: [xfs_io]
>       xfs_io-944     [005] .....    99.926235: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
>       xfs_io-944     [005] .....    99.926261: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
>       xfs_io-944     [005] .....    99.926266: block_bio_backmerge: 8,16 R 98400 + 8 [xfs_io]
>       xfs_io-944     [005] .....    99.926271: block_unplug: [xfs_io] 1
>       xfs_io-944     [005] .....    99.926272: block_rq_insert: 8,16 RA 12288 () 98384 + 24 [xfs_io]
> kworker/5:1H-234     [005] .....    99.926314: block_rq_issue: 8,16 RA 12288 () 98384 + 24 [kworker/5:1H]
>       <idle>-0       [005] d.h2.    99.926905: block_rq_complete: 8,16 RA () 98384 + 24 [0]
>       <idle>-0       [005] dNh2.    99.926931: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/5]
>       xfs_io-944     [005] .....    99.926971: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
>       xfs_io-944     [005] .....    99.926981: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
>       xfs_io-944     [005] .....    99.926989: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
>       xfs_io-944     [005] .....    99.926989: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>       xfs_io-944     [005] .....    99.926991: block_plug: [xfs_io]
>       xfs_io-944     [005] .....    99.926993: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
>       xfs_io-944     [005] .....    99.927001: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>       <idle>-0       [005] d.h2.    99.927397: block_rq_complete: 8,16 RA () 98408 + 16 [0]
>       <idle>-0       [005] dNh2.    99.927414: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/5]
> 
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
>  fs/iomap/buffered-io.c | 112 +++++++++++++++++++++++++++++++----------
>  1 file changed, 85 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 0a4269095ae2..a1d50086a3f5 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -30,7 +30,7 @@ typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length);
>   */
>  struct iomap_folio_state {
>  	spinlock_t		state_lock;
> -	unsigned int		read_bytes_pending;
> +	size_t			read_bytes_pending;
>  	atomic_t		write_bytes_pending;
> 
>  	/*
> @@ -380,6 +380,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  	loff_t orig_pos = pos;
>  	size_t poff, plen;
>  	sector_t sector;
> +	bool rbp_finished = false;

What is "rbp"?  My assembly programmer brain says x64 frame pointer, but
that's clearly wrong here.  Maybe I'm confused...

>  	if (iomap->type == IOMAP_INLINE)
>  		return iomap_read_inline_data(iter, folio);
> @@ -387,21 +388,39 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  	/* zero post-eof blocks as the page may be mapped */
>  	ifs = ifs_alloc(iter->inode, folio, iter->flags);
>  	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
> +
> +	if (ifs) {
> +		loff_t to_read = min_t(loff_t, iter->len - offset,
> +			folio_size(folio) - offset_in_folio(folio, orig_pos));
> +		size_t padjust;
> +
> +		spin_lock_irq(&ifs->state_lock);
> +		if (!ifs->read_bytes_pending)
> +			ifs->read_bytes_pending = to_read;
> +		padjust = pos - orig_pos;
> +		ifs->read_bytes_pending -= padjust;
> +		if (!ifs->read_bytes_pending)
> +			rbp_finished = true;
> +		spin_unlock_irq(&ifs->state_lock);
> +	}
> +
>  	if (plen == 0)
>  		goto done;
> 
>  	if (iomap_block_needs_zeroing(iter, pos)) {
> +		if (ifs) {
> +			spin_lock_irq(&ifs->state_lock);
> +			ifs->read_bytes_pending -= plen;
> +			if (!ifs->read_bytes_pending)
> +				rbp_finished = true;
> +			spin_unlock_irq(&ifs->state_lock);
> +		}
>  		folio_zero_range(folio, poff, plen);
>  		iomap_set_range_uptodate(folio, poff, plen);
>  		goto done;
>  	}
> 
>  	ctx->cur_folio_in_bio = true;
> -	if (ifs) {
> -		spin_lock_irq(&ifs->state_lock);
> -		ifs->read_bytes_pending += plen;
> -		spin_unlock_irq(&ifs->state_lock);
> -	}
> 
>  	sector = iomap_sector(iomap, pos);
>  	if (!ctx->bio ||
> @@ -435,6 +454,14 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  	}
> 
>  done:
> +	/*
> +	 * If there is no bio prepared and if rbp is finished and
> +	 * this was the last offset within this folio then mark
> +	 * cur_folio_in_bio to false.
> +	 */
> +	if (!ctx->bio && rbp_finished &&
> +			offset_in_folio(folio, pos + plen) == 0)
> +		ctx->cur_folio_in_bio = false;

...yes, I think I am confused.  When would ctx->bio be NULL but
cur_folio_in_bio is true?

I /think/ what you're doing here is using read_bytes_pending to figure
out if you've processed the folio up to the end of the mapping?  But
then you submit the bio unconditionally below for each readpage_iter
call?

Why not add an IOMAP_BOUNDARY flag that means "I will have to do some IO
if you call ->iomap_begin again"?  Then if we get to this point in
readpage_iter with a ctx->bio, we can submit the bio, clear
cur_folio_in_bio, and return?  And then you don't need this machinery?
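
Roughly like this; the flag name, value, and plumbing are hypothetical,
just to sketch the idea:

/* hypothetical: set by ->iomap_begin() when the next extent lookup
 * would itself have to read a metadata (e.g. indirect) block */
#define IOMAP_F_BOUNDARY	(1U << 6)

	/* in iomap_readpage_iter(), once the current range is queued: */
	if ((iomap->flags & IOMAP_F_BOUNDARY) && ctx->bio) {
		submit_bio(ctx->bio);
		ctx->bio = NULL;
	}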

--D

>  	/*
>  	 * Move the caller beyond our range so that it keeps making progress.
>  	 * For that, we have to include any leading non-uptodate ranges, but
> @@ -459,9 +486,43 @@ static loff_t iomap_read_folio_iter(const struct iomap_iter *iter,
>  			return ret;
>  	}
> 
> +	if (ctx->bio) {
> +		submit_bio(ctx->bio);
> +		WARN_ON_ONCE(!ctx->cur_folio_in_bio);
> +		ctx->bio = NULL;
> +	}
> +	if (offset_in_folio(folio, iter->pos + done) == 0 &&
> +			!ctx->cur_folio_in_bio) {
> +		folio_unlock(ctx->cur_folio);
> +	}
> +
>  	return done;
>  }
> 
> +static void iomap_handle_read_error(struct iomap_readpage_ctx *ctx,
> +		struct iomap_iter *iter)
> +{
> +	struct folio *folio = ctx->cur_folio;
> +	struct iomap_folio_state *ifs;
> +	unsigned long flags;
> +	bool rbp_finished = false;
> +	size_t rbp_adjust = folio_size(folio) - offset_in_folio(folio,
> +				iter->pos);
> +	ifs = folio->private;
> +	if (!ifs || !ifs->read_bytes_pending)
> +		goto unlock;
> +
> +	spin_lock_irqsave(&ifs->state_lock, flags);
> +	ifs->read_bytes_pending -= rbp_adjust;
> +	if (!ifs->read_bytes_pending)
> +		rbp_finished = true;
> +	spin_unlock_irqrestore(&ifs->state_lock, flags);
> +
> +unlock:
> +	if (rbp_finished || !ctx->cur_folio_in_bio)
> +		folio_unlock(folio);
> +}
> +
>  int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
>  {
>  	struct iomap_iter iter = {
> @@ -479,15 +540,9 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
>  	while ((ret = iomap_iter(&iter, ops)) > 0)
>  		iter.processed = iomap_read_folio_iter(&iter, &ctx);
> 
> -	if (ret < 0)
> +	if (ret < 0) {
>  		folio_set_error(folio);
> -
> -	if (ctx.bio) {
> -		submit_bio(ctx.bio);
> -		WARN_ON_ONCE(!ctx.cur_folio_in_bio);
> -	} else {
> -		WARN_ON_ONCE(ctx.cur_folio_in_bio);
> -		folio_unlock(folio);
> +		iomap_handle_read_error(&ctx, &iter);
>  	}
> 
>  	/*
> @@ -506,12 +561,6 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
>  	loff_t done, ret;
> 
>  	for (done = 0; done < length; done += ret) {
> -		if (ctx->cur_folio &&
> -		    offset_in_folio(ctx->cur_folio, iter->pos + done) == 0) {
> -			if (!ctx->cur_folio_in_bio)
> -				folio_unlock(ctx->cur_folio);
> -			ctx->cur_folio = NULL;
> -		}
>  		if (!ctx->cur_folio) {
>  			ctx->cur_folio = readahead_folio(ctx->rac);
>  			ctx->cur_folio_in_bio = false;
> @@ -519,6 +568,17 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
>  		ret = iomap_readpage_iter(iter, ctx, done);
>  		if (ret <= 0)
>  			return ret;
> +		if (ctx->cur_folio && offset_in_folio(ctx->cur_folio,
> +					iter->pos + done + ret) == 0) {
> +			if (!ctx->cur_folio_in_bio)
> +				folio_unlock(ctx->cur_folio);
> +			ctx->cur_folio = NULL;
> +		}
> +	}
> +
> +	if (ctx->bio) {
> +		submit_bio(ctx->bio);
> +		ctx->bio = NULL;
>  	}
> 
>  	return done;
> @@ -549,18 +609,16 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
>  	struct iomap_readpage_ctx ctx = {
>  		.rac	= rac,
>  	};
> +	int ret = 0;
> 
>  	trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
> 
> -	while (iomap_iter(&iter, ops) > 0)
> +	while ((ret = iomap_iter(&iter, ops)) > 0)
>  		iter.processed = iomap_readahead_iter(&iter, &ctx);
> 
> -	if (ctx.bio)
> -		submit_bio(ctx.bio);
> -	if (ctx.cur_folio) {
> -		if (!ctx.cur_folio_in_bio)
> -			folio_unlock(ctx.cur_folio);
> -	}
> +	if (ret < 0 && ctx.cur_folio)
> +		iomap_handle_read_error(&ctx, &iter);
> +
>  }
>  EXPORT_SYMBOL_GPL(iomap_readahead);
> 
> --
> 2.44.0
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings
  2024-04-26 16:24   ` Darrick J. Wong
@ 2024-04-26 17:17     ` Ritesh Harjani
  2024-04-26 17:25       ` Ritesh Harjani
  0 siblings, 1 reply; 30+ messages in thread
From: Ritesh Harjani @ 2024-04-26 17:17 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-ext4, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Ojaswin Mujoo, Jan Kara

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Thu, Apr 25, 2024 at 06:58:51PM +0530, Ritesh Harjani (IBM) wrote:
>> [... commit message and ftrace logs snipped ...]
>
> Could this be shortened to ... the iomap calls and
> block_bio_queue/backmerge?  It's a bit difficult to see the point you're
> getting at with all the other noise.

I will remove this log, move it to the cover letter, and just extend
the simple example I considered earlier in this commit message to show
the difference with and without the patch.

>
> I think you're trying to say that the access pattern here is 98400 ->
> 98408 -> 98384, which is not sequential?
>

it's (98400,8 ==> metadata block) -> (98384,16 ==> lblk 10 & 11) -> (98408,16 ==> lblk 12 & 13)
... w/o the patch

>> [... with-patch ftrace log and diffstat snipped ...]
>>
>> @@ -380,6 +380,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>>  	loff_t orig_pos = pos;
>>  	size_t poff, plen;
>>  	sector_t sector;
>> +	bool rbp_finished = false;
>
> What is "rbp"?  My assembly programmer brain says x64 frame pointer, but
> that's clearly wrong here.  Maybe I'm confused...
>

rbp == read_bytes_pending ;)

>> [... diff hunks snipped ...]
>>
>> @@ -435,6 +454,14 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>>  	}
>> 
>>  done:
>> +	/*
>> +	 * If there is no bio prepared and if rbp is finished and
>> +	 * this was the last offset within this folio then mark
>> +	 * cur_folio_in_bio to false.
>> +	 */
>> +	if (!ctx->bio && rbp_finished &&
>> +			offset_in_folio(folio, pos + plen) == 0)
>> +		ctx->cur_folio_in_bio = false;
>
> ...yes, I think I am confused.  When would ctx->bio be NULL but
> cur_folio_in_bio is true?

Previously we had submitted the bio and so we set it to NULL, but we
still have ctx->cur_folio set and ctx->cur_folio_in_bio true, since we
haven't completely processed the folio.

>
> I /think/ what you're doing here is using read_bytes_pending to figure
> out if you've processed the folio up to the end of the mapping?  But
> then you submit the bio unconditionally below for each readpage_iter
> call?
>

yes, that's right.

> Why not add an IOMAP_BOUNDARY flag that means "I will have to do some IO
> if you call ->iomap_begin again"?  Then if we get to this point in
> readpage_iter with a ctx->bio, we can submit the bio, clear
> cur_folio_in_bio, and return?  And then you don't need this machinery?

TBH, I initially didn't think the approach taken in the patch would
require such careful handling of r_b_p. That is because of all of these
corner cases: when we don't need to read blocks that are already
uptodate, when on an I/O error we need to ensure we reduce r_b_p
carefully so that we can still unlock the folio, and when the extent
spans beyond i_size.

So it's all about how we know when we can unlock the folio, i.e. that
all of its corresponding blocks/mappings have been processed or
submitted for I/O.

Assume we have a folio which spans multiple extents. In such a case,
-> we process a bio for the 1st extent,
-> then we go back to iomap_iter() to get the new extent mapping,
-> we now increment r_b_p with the new plen to be processed,
-> we then submit the previous bio, since the new mapping couldn't be
merged due to discontiguous extents.
So by incrementing r_b_p before doing submit_bio(), we don't unlock the
folio at bio completion.

Maybe it would be helpful if we had an easy mechanism to keep some
state from the time of submit_bio() until bio completion, to know that
the corresponding folio is still being processed and shouldn't be
unlocked.
 -> This is currently what we are doing by initializing r_b_p to
 folio_size() and then carefully reducing r_b_p in all the cases I
 mentioned above.

Let me think about whether adding an IOMAP_BH_BOUNDARY flag could be
helpful or not. Say we have a page size of 64k; that means the first 16
blocks all belong to the same page. So even with an IOMAP_BH_BOUNDARY
flag, the problem that still remains is: even if we submit the bio at
block 11 (the bh_boundary block), how will the bio completion side know
that the folio is not completely processed and so shouldn't be
unlocked?


-ritesh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings
  2024-04-26 17:17     ` Ritesh Harjani
@ 2024-04-26 17:25       ` Ritesh Harjani
  2024-04-26 17:37         ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Ritesh Harjani @ 2024-04-26 17:25 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-ext4, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Ojaswin Mujoo, Jan Kara

Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:

> "Darrick J. Wong" <djwong@kernel.org> writes:
>
>> On Thu, Apr 25, 2024 at 06:58:51PM +0530, Ritesh Harjani (IBM) wrote:
>>> This patch optimizes the data access patterns for filesystems with
>>> indirect block mapping by implementing BH_Boundary handling within
>>> iomap.
>>> 
>>> Currently the bios for reads within iomap are only submitted at
>>> 2 places -
>>> 1. If we cannot merge the new req. with previous bio, only then we
>>>    submit the previous bio.
>>> 2. Submit the bio at the end of the entire read processing.
>>> 
>>> This means for filesystems with indirect block mapping, we call into
>>> ->iomap_begin() again w/o submitting the previous bios. That causes
>>> unoptimized data access patterns for blocks which are of BH_Boundary type.
>>> 
>>> For e.g. consider the file mapping
>>> logical block(4k) 		physical block(4k)
>>> 0-11 				1000-1011
>>> 12-15 				1013-1016
>>> 
>>> In above physical block 1012 is an indirect metadata block which has the
>>> mapping information for next set of indirect blocks (1013-1016).
>>> With iomap buffered reads for reading 1st 16 logical blocks of a file
>>> (0-15), we get below I/O pattern
>>> 	- submit a bio for 1012
>>> 	- complete the bio for 1012
>>> 	- submit a bio for 1000-1011
>>> 	- submit a bio for 1013-1016
>>> 	- complete the bios for 1000-1011
>>> 	- complete the bios for 1013-1016
>>> 
>>> So as we can see, above is an non-optimal I/O access pattern and also we
>>> get 3 bio completions instead of 2.
>>> 
>>> This patch changes this behavior by doing submit_bio() if there are any
>>> bios already processed, before calling ->iomap_begin() again.
>>> That means if there are any blocks which are already processed, gets
>>> submitted for I/O earlier and then within ->iomap_begin(), if we get a
>>> request for reading an indirect metadata block, then block layer can merge
>>> those bios with the already submitted read request to reduce the no. of bio
>>> completions.
>>> 
>>> Now, for bs < ps or for large folios, this patch requires proper handling
>>> of "ifs->read_bytes_pending". In that we first set ifs->read_bytes_pending
>>> to folio_size. Then handle all the cases where we need to subtract
>>> ifs->read_bytes_pending either during the submission side
>>> (if we don't need to submit any I/O - for e.g. for uptodate sub blocks),
>>> or during an I/O error, or at the completion of an I/O.
>>> 
>>> Here is the ftrace output of iomap and block layer with ext2 iomap
>>> conversion patches -
>>> 
>>> root# filefrag -b512 -v /mnt1/test/f1
>>> Filesystem type is: ef53
>>> Filesystem cylinder groups approximately 32
>>> File size of /mnt1/test/f1 is 65536 (128 blocks of 512 bytes)
>>>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>>>    0:        0..      95:      98304..     98399:     96:             merged
>>>    1:       96..     127:      98408..     98439:     32:      98400: last,merged,eof
>>> /mnt1/test/f1: 2 extents found
>>> 
>>> root# #This reads 4 blocks starting from lblk 10, 11, 12, 13
>>> root# xfs_io -c "pread -b$((4*4096)) $((10*4096)) $((4*4096))" /mnt1/test/f1
>>> 
>>> w/o this patch - (indirect block is submitted before and does not get merged, resulting in 3 bios completion)
>>>       xfs_io-907     [002] .....   185.608791: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
>>>       xfs_io-907     [002] .....   185.608819: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
>>>       xfs_io-907     [002] .....   185.608823: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
>>>       xfs_io-907     [002] .....   185.608831: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
>>>       xfs_io-907     [002] .....   185.608859: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
>>>       xfs_io-907     [002] .....   185.608865: block_getrq: 8,16 R 98400 + 8 [xfs_io]
>>>       xfs_io-907     [002] .....   185.608867: block_io_start: 8,16 R 4096 () 98400 + 8 [xfs_io]
>>>       xfs_io-907     [002] .....   185.608869: block_plug: [xfs_io]
>>>       xfs_io-907     [002] .....   185.608872: block_unplug: [xfs_io] 1
>>>       xfs_io-907     [002] .....   185.608874: block_rq_insert: 8,16 R 4096 () 98400 + 8 [xfs_io]
>>> kworker/2:1H-198     [002] .....   185.608908: block_rq_issue: 8,16 R 4096 () 98400 + 8 [kworker/2:1H]
>>>       <idle>-0       [002] d.h2.   185.609579: block_rq_complete: 8,16 R () 98400 + 8 [0]
>>>       <idle>-0       [002] dNh2.   185.609631: block_io_done: 8,16 R 0 () 98400 + 0 [swapper/2]
>>>       xfs_io-907     [002] .....   185.609694: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
>>>       xfs_io-907     [002] .....   185.609704: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
>>>       xfs_io-907     [002] .....   185.609718: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
>>>       xfs_io-907     [002] .....   185.609721: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>>>       xfs_io-907     [002] .....   185.609726: block_plug: [xfs_io]
>>>       xfs_io-907     [002] .....   185.609735: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
>>>       xfs_io-907     [002] .....   185.609736: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
>>>       xfs_io-907     [002] .....   185.609740: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
>>>       xfs_io-907     [002] .....   185.609741: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>>>       xfs_io-907     [002] .....   185.609756: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>>>       xfs_io-907     [002] .....   185.609769: block_rq_issue: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>>>       <idle>-0       [002] d.H2.   185.610280: block_rq_complete: 8,16 RA () 98408 + 16 [0]
>>>       <idle>-0       [002] d.H2.   185.610289: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/2]
>>>       <idle>-0       [002] d.H2.   185.610292: block_rq_complete: 8,16 RA () 98384 + 16 [0]
>>>       <idle>-0       [002] dNH2.   185.610301: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/2]
>>
>> Could this be shortened to ... the iomap calls and
>> block_bio_queue/backmerge?  It's a bit difficult to see the point you're
>> getting at with all the other noise.
>
> I will remove this log and move it to cover letter and will just extend
> the simple example I considered before in this commit message,
> to show the difference with and w/o patch.
>
>>
>> I think you're trying to say that the access pattern here is 98400 ->
>> 98408 -> 98384, which is not sequential?
>>
>
> it's (98400,8 ==> metadata block) -> (98384,16 == lblk 10 & 11) -> (98408,16 ==> lblk 12 & 13)
> ... w/o the patch
>
>>> v/s with the patch - (optimzed I/O access pattern and bio gets merged resulting in only 2 bios completion)
>>>       xfs_io-944     [005] .....    99.926187: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
>>>       xfs_io-944     [005] .....    99.926208: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
>>>       xfs_io-944     [005] .....    99.926211: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
>>>       xfs_io-944     [005] .....    99.926222: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
>>>       xfs_io-944     [005] .....    99.926232: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
>>>       xfs_io-944     [005] .....    99.926233: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>>>       xfs_io-944     [005] .....    99.926234: block_plug: [xfs_io]
>>>       xfs_io-944     [005] .....    99.926235: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
>>>       xfs_io-944     [005] .....    99.926261: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
>>>       xfs_io-944     [005] .....    99.926266: block_bio_backmerge: 8,16 R 98400 + 8 [xfs_io]
>>>       xfs_io-944     [005] .....    99.926271: block_unplug: [xfs_io] 1
>>>       xfs_io-944     [005] .....    99.926272: block_rq_insert: 8,16 RA 12288 () 98384 + 24 [xfs_io]
>>> kworker/5:1H-234     [005] .....    99.926314: block_rq_issue: 8,16 RA 12288 () 98384 + 24 [kworker/5:1H]
>>>       <idle>-0       [005] d.h2.    99.926905: block_rq_complete: 8,16 RA () 98384 + 24 [0]
>>>       <idle>-0       [005] dNh2.    99.926931: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/5]
>>>       xfs_io-944     [005] .....    99.926971: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
>>>       xfs_io-944     [005] .....    99.926981: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
>>>       xfs_io-944     [005] .....    99.926989: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
>>>       xfs_io-944     [005] .....    99.926989: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>>>       xfs_io-944     [005] .....    99.926991: block_plug: [xfs_io]
>>>       xfs_io-944     [005] .....    99.926993: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags  (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
>>>       xfs_io-944     [005] .....    99.927001: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>>>       <idle>-0       [005] d.h2.    99.927397: block_rq_complete: 8,16 RA () 98408 + 16 [0]
>>>       <idle>-0       [005] dNh2.    99.927414: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/5]
>>> 
>>> Suggested-by: Matthew Wilcox <willy@infradead.org>
>>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>> cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>>> ---
>>>  fs/iomap/buffered-io.c | 112 +++++++++++++++++++++++++++++++----------
>>>  1 file changed, 85 insertions(+), 27 deletions(-)
>>> 
>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>> index 0a4269095ae2..a1d50086a3f5 100644
>>> --- a/fs/iomap/buffered-io.c
>>> +++ b/fs/iomap/buffered-io.c
>>> @@ -30,7 +30,7 @@ typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length);
>>>   */
>>>  struct iomap_folio_state {
>>>  	spinlock_t		state_lock;
>>> -	unsigned int		read_bytes_pending;
>>> +	size_t			read_bytes_pending;
>>>  	atomic_t		write_bytes_pending;
>>> 
>>>  	/*
>>> @@ -380,6 +380,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>>>  	loff_t orig_pos = pos;
>>>  	size_t poff, plen;
>>>  	sector_t sector;
>>> +	bool rbp_finished = false;
>>
>> What is "rbp"?  My assembly programmer brain says x64 frame pointer, but
>> that's clearly wrong here.  Maybe I'm confused...
>>
>
> rbp == read_bytes_pending ;)
>
>>>  	if (iomap->type == IOMAP_INLINE)
>>>  		return iomap_read_inline_data(iter, folio);
>>> @@ -387,21 +388,39 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>>>  	/* zero post-eof blocks as the page may be mapped */
>>>  	ifs = ifs_alloc(iter->inode, folio, iter->flags);
>>>  	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
>>> +
>>> +	if (ifs) {
>>> +		loff_t to_read = min_t(loff_t, iter->len - offset,
>>> +			folio_size(folio) - offset_in_folio(folio, orig_pos));
>>> +		size_t padjust;
>>> +
>>> +		spin_lock_irq(&ifs->state_lock);
>>> +		if (!ifs->read_bytes_pending)
>>> +			ifs->read_bytes_pending = to_read;
>>> +		padjust = pos - orig_pos;
>>> +		ifs->read_bytes_pending -= padjust;
>>> +		if (!ifs->read_bytes_pending)
>>> +			rbp_finished = true;
>>> +		spin_unlock_irq(&ifs->state_lock);
>>> +	}
>>> +
>>>  	if (plen == 0)
>>>  		goto done;
>>> 
>>>  	if (iomap_block_needs_zeroing(iter, pos)) {
>>> +		if (ifs) {
>>> +			spin_lock_irq(&ifs->state_lock);
>>> +			ifs->read_bytes_pending -= plen;
>>> +			if (!ifs->read_bytes_pending)
>>> +				rbp_finished = true;
>>> +			spin_unlock_irq(&ifs->state_lock);
>>> +		}
>>>  		folio_zero_range(folio, poff, plen);
>>>  		iomap_set_range_uptodate(folio, poff, plen);
>>>  		goto done;
>>>  	}
>>> 
>>>  	ctx->cur_folio_in_bio = true;
>>> -	if (ifs) {
>>> -		spin_lock_irq(&ifs->state_lock);
>>> -		ifs->read_bytes_pending += plen;
>>> -		spin_unlock_irq(&ifs->state_lock);
>>> -	}
>>> 
>>>  	sector = iomap_sector(iomap, pos);
>>>  	if (!ctx->bio ||
>>> @@ -435,6 +454,14 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>>>  	}
>>> 
>>>  done:
>>> +	/*
>>> +	 * If there is no bio prepared and if rbp is finished and
>>> +	 * this was the last offset within this folio then mark
>>> +	 * cur_folio_in_bio to false.
>>> +	 */
>>> +	if (!ctx->bio && rbp_finished &&
>>> +			offset_in_folio(folio, pos + plen) == 0)
>>> +		ctx->cur_folio_in_bio = false;
>>
>> ...yes, I think I am confused.  When would ctx->bio be NULL but
>> cur_folio_in_bio is true?
>
> Previously we had the bio submitted and so we make it null, but we still
> have ctx->cur_folio & ctx->cur_folio_in_bio to true, since we haven't
> completely processed the folio.
>
>>
>> I /think/ what you're doing here is using read_bytes_pending to figure
>> out if you've processed the folio up to the end of the mapping?  But
>> then you submit the bio unconditionally below for each readpage_iter
>> call?
>>
>
> yes, that's right.
>
>> Why not add an IOMAP_BOUNDARY flag that means "I will have to do some IO
>> if you call ->iomap_begin again"?  Then if we get to this point in
>> readpage_iter with a ctx->bio, we can submit the bio, clear
>> cur_folio_in_bio, and return?  And then you don't need this machinery?
>
> TBH, I initially didn't think the approach taken in the patch would
> require such careful handling of r_b_p. It was because of all of this
> corner cases when we don't need to read the update blocks and/or in case
> of an error we need to ensure we reduce r_b_p carefully so that we could
> unlock the folio and when extent spans beyond i_size.
>
> So it's all about how do we know if we could unlock the folio and that it's
> corresponding blocks/mapping has been all processed or submitted for
> I/O. 
>
> Assume we have a folio which spans multiple extents. In such a case,
> -> we process a bio for the 1st extent,
> -> then we go back to iomap_iter() to get the new extent mapping,
> -> we now increment r_b_p by the new plen to be processed,
> -> we then submit the previous bio, since the new mapping couldn't be
> merged with it due to discontinuous extents.
> So by incrementing r_b_p before doing submit_bio(), we don't unlock
> the folio at bio completion.
>
> Maybe it would be helpful to have an easy mechanism to keep some state
> from the time of submit_bio() until bio completion, to know that the
> corresponding folio is still being processed and shouldn't be
> unlocked.
>  -> This is currently what we do by setting r_b_p to the value of
>  folio_size() and then carefully reducing r_b_p for all the cases I
>  mentioned above.
>
> Let me think about whether adding an IOMAP_BH_BOUNDARY flag could be
> helpful. Say we have a pagesize of 64k; that means the first 16 blocks
> all belong to the same page. So even with an IOMAP_BH_BOUNDARY flag the
> problem that remains is: even if we submit the bio at block 11 (the
> bh_boundary block), how will the bio completion side know that the
> folio is not completely processed and so shouldn't be unlocked?

Maybe one way could be to add another state flag to ifs for the
BH_BOUNDARY block and read it at bio completion.
That would also let the completion side know whether it should unlock
the folio or whether the submission side still needs to process it.
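
Roughly, as a very loose sketch (the flags field, IFS_F_BOUNDARY_PENDING
and ifs_may_end_read() below are made-up names, not existing iomap code):

#define IFS_F_BOUNDARY_PENDING	0	/* hypothetical flag bit */

struct iomap_folio_state {
	spinlock_t	state_lock;
	size_t		read_bytes_pending;
	atomic_t	write_bytes_pending;
	/* hypothetical: set when a bio was submitted early at a
	 * boundary block while this folio still had unprocessed
	 * blocks */
	unsigned long	flags;
	unsigned long	state[];
};

/* hypothetical check on the read completion side, called with
 * ifs->state_lock held, before unlocking the folio */
static bool ifs_may_end_read(struct iomap_folio_state *ifs)
{
	return !ifs->read_bytes_pending &&
	       !test_bit(IFS_F_BOUNDARY_PENDING, &ifs->flags);
}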

I guess that might be an easier approach than this one. Thoughts?

-ritesh


* Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings
  2024-04-26 17:25       ` Ritesh Harjani
@ 2024-04-26 17:37         ` Matthew Wilcox
  2024-04-26 17:55           ` Ritesh Harjani
  0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2024-04-26 17:37 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Darrick J. Wong, linux-ext4, linux-xfs, linux-fsdevel,
	Ojaswin Mujoo, Jan Kara

On Fri, Apr 26, 2024 at 10:55:23PM +0530, Ritesh Harjani wrote:
> [ ... full patch, ftrace logs and earlier review discussion snipped; quoted upthread ... ]
> 
> Maybe one way could be to add another state flag to ifs for the
> BH_BOUNDARY block and read it at bio completion.
> That would also let the completion side know whether it should unlock
> the folio or whether the submission side still needs to process it.

The approach I suggested was to initialise read_bytes_pending to
folio_size() at the start.  Then subtract off blocksize for each
uptodate block, whether you find it already uptodate, or as the
completion handler runs.

Is there a reason that doesn't work?


* Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings
  2024-04-26 17:37         ` Matthew Wilcox
@ 2024-04-26 17:55           ` Ritesh Harjani
  2024-04-26 18:13             ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Ritesh Harjani @ 2024-04-26 17:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, linux-ext4, linux-xfs, linux-fsdevel,
	Ojaswin Mujoo, Jan Kara

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Apr 26, 2024 at 10:55:23PM +0530, Ritesh Harjani wrote:
>> [ ... quoted thread snipped; see upthread ... ]
>
> The approach I suggested was to initialise read_bytes_pending to
> folio_size() at the start.  Then subtract off blocksize for each
> uptodate block, whether you find it already uptodate, or as the
> completion handler runs.
>
> Is there a reason that doesn't work?

That is what this patch series does, right? The current patch does work
as far as my testing goes.

For example, this is what initializes r_b_p for the first time, when
ifs->r_b_p is 0.

+		loff_t to_read = min_t(loff_t, iter->len - offset,
+			folio_size(folio) - offset_in_folio(folio, orig_pos));
<..>
+		if (!ifs->read_bytes_pending)
+			ifs->read_bytes_pending = to_read;


Then this is where we subtract r_b_p for blocks which are uptodate.
+		padjust = pos - orig_pos;
+		ifs->read_bytes_pending -= padjust;


And this is where we adjust r_b_p if we directly zero the folio.
 	if (iomap_block_needs_zeroing(iter, pos)) {
+		if (ifs) {
+			spin_lock_irq(&ifs->state_lock);
+			ifs->read_bytes_pending -= plen;
+			if (!ifs->read_bytes_pending)
+				rbp_finished = true;
+			spin_unlock_irq(&ifs->state_lock);
+		}

But as you see, this requires surgery throughout the read paths. What if
we add a state flag to ifs only for BH_BOUNDARY? Maybe that could
result in a more simplified approach, because all we require is to know
whether the folio should be unlocked or not at the time of completion.

Do you think we should try that instead, or do you think the current
approach looks ok?

-ritesh


* Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings
  2024-04-26 17:55           ` Ritesh Harjani
@ 2024-04-26 18:13             ` Matthew Wilcox
  2024-04-26 18:57               ` Ritesh Harjani
  0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2024-04-26 18:13 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Darrick J. Wong, linux-ext4, linux-xfs, linux-fsdevel,
	Ojaswin Mujoo, Jan Kara

On Fri, Apr 26, 2024 at 11:25:25PM +0530, Ritesh Harjani wrote:
> [ ... quoted discussion snipped; see upthread ... ]
> But as you see, this requires surgery throughout the read paths. What if
> we add a state flag to ifs only for BH_BOUNDARY? Maybe that could
> result in a more simplified approach, because all we require is to know
> whether the folio should be unlocked or not at the time of completion.
> 
> Do you think we should try that instead, or do you think the current
> approach looks ok?

You've really made life hard for yourself.  I had something more like
this in mind.  I may have missed a few places that need to be changed,
but this should update read_bytes_pending everywhere that we set bits
in the uptodate bitmap, so it should be right?

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 41c8f0c68ef5..f87ca8ee4d19 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -79,6 +79,7 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
 	if (ifs) {
 		spin_lock_irqsave(&ifs->state_lock, flags);
 		uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
+		ifs->read_bytes_pending -= len;
 		spin_unlock_irqrestore(&ifs->state_lock, flags);
 	}
 
@@ -208,6 +209,8 @@ static struct iomap_folio_state *ifs_alloc(struct inode *inode,
 	spin_lock_init(&ifs->state_lock);
 	if (folio_test_uptodate(folio))
 		bitmap_set(ifs->state, 0, nr_blocks);
+	else
+		ifs->read_bytes_pending = folio_size(folio);
 	if (folio_test_dirty(folio))
 		bitmap_set(ifs->state, nr_blocks, nr_blocks);
 	folio_attach_private(folio, ifs);
@@ -396,12 +399,6 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 	}
 
 	ctx->cur_folio_in_bio = true;
-	if (ifs) {
-		spin_lock_irq(&ifs->state_lock);
-		ifs->read_bytes_pending += plen;
-		spin_unlock_irq(&ifs->state_lock);
-	}
-
 	sector = iomap_sector(iomap, pos);
 	if (!ctx->bio ||
 	    bio_end_sector(ctx->bio) != sector ||


* Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings
  2024-04-26 18:13             ` Matthew Wilcox
@ 2024-04-26 18:57               ` Ritesh Harjani
  2024-04-26 19:19                 ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Ritesh Harjani @ 2024-04-26 18:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, linux-ext4, linux-xfs, linux-fsdevel,
	Ojaswin Mujoo, Jan Kara

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Apr 26, 2024 at 11:25:25PM +0530, Ritesh Harjani wrote:
>> [ ... quoted discussion snipped; see upthread ... ]
>
> You've really made life hard for yourself.  I had something more like
> this in mind.  I may have missed a few places that need to be changed,
> but this should update read_bytes_pending everywhere that we set bits
> in the uptodate bitmap, so it should be right?

Please correct me if I am wrong.

>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 41c8f0c68ef5..f87ca8ee4d19 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -79,6 +79,7 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
>  	if (ifs) {
>  		spin_lock_irqsave(&ifs->state_lock, flags);
>  		uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
> +		ifs->read_bytes_pending -= len;
>  		spin_unlock_irqrestore(&ifs->state_lock, flags);
>  	}

iomap_set_range_uptodate() gets called from ->write_begin() and
->write_end() too. So what we are saying is that we are updating
the state of read_bytes_pending even though we are not in a
->read_folio() or ->readahead() call?

>  
> @@ -208,6 +209,8 @@ static struct iomap_folio_state *ifs_alloc(struct inode *inode,
>  	spin_lock_init(&ifs->state_lock);
>  	if (folio_test_uptodate(folio))
>  		bitmap_set(ifs->state, 0, nr_blocks);
> +	else
> +		ifs->read_bytes_pending = folio_size(folio);

We might not get here during ->read_folio -> ifs_alloc(), since we
might have a cached ifs which was allocated during a write to this folio.

But unless you are saying that during writes we would have set
ifs->r_b_p to folio_size(), and when the read call happens we use
the same value from the cached ifs.
Ok, I see. I was mostly focusing on updating the ifs->r_b_p value only
when read bytes are actually pending during ->read_folio() or
->readahead(), and not updating r_b_p during writes.

...One small problem which I see with this approach is - we might have
some non-zero value in ifs->r_b_p when ifs_free() gets called and it
might give a warning of non-zero ifs->r_b_p, because we updated
ifs->r_b_p during writes to a non-zero value, but the reads
never happened. Then during a call to ->release_folio, it will complain
of a non-zero ifs->r_b_p.


>  	if (folio_test_dirty(folio))
>  		bitmap_set(ifs->state, nr_blocks, nr_blocks);
>  	folio_attach_private(folio, ifs);
> @@ -396,12 +399,6 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  	}
>  
>  	ctx->cur_folio_in_bio = true;
> -	if (ifs) {
> -		spin_lock_irq(&ifs->state_lock);
> -		ifs->read_bytes_pending += plen;
> -		spin_unlock_irq(&ifs->state_lock);
> -	}
> -
>  	sector = iomap_sector(iomap, pos);
>  	if (!ctx->bio ||
>  	    bio_end_sector(ctx->bio) != sector ||

-ritesh


* Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings
  2024-04-26 18:57               ` Ritesh Harjani
@ 2024-04-26 19:19                 ` Matthew Wilcox
  2024-04-27  4:54                   ` Christoph Hellwig
  2024-04-27  6:03                   ` Ritesh Harjani
  0 siblings, 2 replies; 30+ messages in thread
From: Matthew Wilcox @ 2024-04-26 19:19 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Darrick J. Wong, linux-ext4, linux-xfs, linux-fsdevel,
	Ojaswin Mujoo, Jan Kara

On Sat, Apr 27, 2024 at 12:27:52AM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> > @@ -79,6 +79,7 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
> >  	if (ifs) {
> >  		spin_lock_irqsave(&ifs->state_lock, flags);
> >  		uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
> > +		ifs->read_bytes_pending -= len;
> >  		spin_unlock_irqrestore(&ifs->state_lock, flags);
> >  	}
> 
> iomap_set_range_uptodate() gets called from ->write_begin() and
> ->write_end() too. So what we are saying is that we are updating
> the state of read_bytes_pending even though we are not in a
> ->read_folio() or ->readahead() call?

Exactly.

> >  
> > @@ -208,6 +209,8 @@ static struct iomap_folio_state *ifs_alloc(struct inode *inode,
> >  	spin_lock_init(&ifs->state_lock);
> >  	if (folio_test_uptodate(folio))
> >  		bitmap_set(ifs->state, 0, nr_blocks);
> > +	else
> > +		ifs->read_bytes_pending = folio_size(folio);
> 
> We might not get here during ->read_folio -> ifs_alloc(), since we
> might have a cached ifs which was allocated during a write to this folio.
> 
> But unless you are saying that during writes we would have set
> ifs->r_b_p to folio_size(), and when the read call happens we use
> the same value from the cached ifs.
> Ok, I see. I was mostly focusing on updating the ifs->r_b_p value only
> when read bytes are actually pending during ->read_folio() or
> ->readahead(), and not updating r_b_p during writes.

I see why you might want to think that way ... but this way is much less
complex, don't you think?  ;-)

> ...One small problem which I see with this approach is - we might have
> some non-zero value in ifs->r_b_p when ifs_free() gets called and it
> might give a warning of non-zero ifs->r_b_p, because we updated
> ifs->r_b_p during writes to a non-zero value, but the reads
> never happened. Then during a call to ->release_folio, it will complain
> of a non-zero ifs->r_b_p.

Yes, we'll have to remove that assertion.  I don't think that's a
problem, do you?
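
To make the accounting concrete, a purely illustrative walkthrough
(made-up numbers; 4k blocks in a 16k folio) of the scheme as discussed
above:

/*
 * write_begin(): ifs_alloc() sees a !uptodate folio:
 *                              read_bytes_pending = 16384
 * write_end() marks blocks 0-1 uptodate:
 *     iomap_set_range_uptodate(folio, 0, 8192)
 *                              read_bytes_pending -= 8192 -> 8192
 * a later read_folio() issues a bio for blocks 2-3; its completion:
 *     iomap_set_range_uptodate(folio, 8192, 8192)
 *                              read_bytes_pending -= 8192 -> 0
 *     -> the folio is fully uptodate and the read side can unlock it.
 *
 * If instead the folio gets released before blocks 2-3 are ever read,
 * read_bytes_pending is still 8192 at ifs_free() time -- which is why
 * the assertion there has to go (or be refined).
 */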



* Re: [RFCv3 6/7] iomap: Optimize iomap_read_folio
  2024-04-26  8:50     ` Ritesh Harjani
@ 2024-04-27  4:44       ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-04-27  4:44 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Christoph Hellwig, linux-ext4, linux-xfs, linux-fsdevel,
	Matthew Wilcox, Darrick J . Wong, Ojaswin Mujoo, Jan Kara

On Fri, Apr 26, 2024 at 02:20:05PM +0530, Ritesh Harjani wrote:
> iomap_read_folio_iter() handles multiple sub blocks within a given
> folio but its implementation logic is similar to how
> iomap_readahead_iter() handles multiple folios within a single mapped
> extent. Both of them iterate over a given range of folio/mapped extent
> and call iomap_readpage_iter() for reading.

Sounds good.
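
For reference, both loops reduce to roughly this shape (paraphrasing,
not the actual patch):

	/* walk the range in sub-folio / per-folio chunks and let the
	 * common helper process each one */
	for (done = 0; done < length; done += ret) {
		ret = iomap_readpage_iter(iter, ctx, done);
		if (ret <= 0)
			return ret;
	}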


* Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings
  2024-04-25 13:28 ` [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings Ritesh Harjani (IBM)
  2024-04-26 16:24   ` Darrick J. Wong
@ 2024-04-27  4:47   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-04-27  4:47 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-ext4, linux-xfs, linux-fsdevel, Matthew Wilcox,
	Darrick J . Wong, Ojaswin Mujoo, Jan Kara

On Thu, Apr 25, 2024 at 06:58:51PM +0530, Ritesh Harjani (IBM) wrote:
> Currently the bios for reads within iomap are only submitted at
> 2 places -
> 1. If we cannot merge the new req. with previous bio, only then we
>    submit the previous bio.
> 2. Submit the bio at the end of the entire read processing.
> 
> This means for filesystems with indirect block mapping, we call into
> ->iomap_begin() again w/o submitting the previous bios. That causes
> unoptimized data access patterns for blocks which are of BH_Boundary type.

The same is true for extent mappings.  And it's not ideal there either,
although it only increases the bio submission latency.



* Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings
  2024-04-26 19:19                 ` Matthew Wilcox
@ 2024-04-27  4:54                   ` Christoph Hellwig
  2024-04-27  6:03                   ` Ritesh Harjani
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-04-27  4:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ritesh Harjani, Darrick J. Wong, linux-ext4, linux-xfs,
	linux-fsdevel, Ojaswin Mujoo, Jan Kara

On Fri, Apr 26, 2024 at 08:19:47PM +0100, Matthew Wilcox wrote:
> > ...One small problem which I see with this approach is - we might have
> > some non-zero value in ifs->r_b_p when ifs_free() gets called and it
> > might give a warning of non-zero ifs->r_b_p, because we updated
> > ifs->r_b_p during writes to a non-zero value, but the reads
> > never happened. Then during a call to ->release_folio, it will complain
> > of a non-zero ifs->r_b_p.
> 
> Yes, we'll have to remove that assertion.  I don't think that's a
> problem, do you?

Or refine it, as the parts not read should not be uptodate either?
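
Something like this, as a sketch on top of the existing ifs_free()
(not actual code):

static void ifs_free(struct folio *folio)
{
	struct iomap_folio_state *ifs = folio_detach_private(folio);

	if (!ifs)
		return;
	WARN_ON_ONCE(atomic_read(&ifs->write_bytes_pending));
	WARN_ON_ONCE(ifs_is_fully_uptodate(folio, ifs) !=
			folio_test_uptodate(folio));
	/*
	 * A leftover read_bytes_pending is fine if the folio was
	 * written to but never read -- but then the unread parts
	 * must not be uptodate.
	 */
	WARN_ON_ONCE(ifs->read_bytes_pending &&
			ifs_is_fully_uptodate(folio, ifs));
	kfree(ifs);
}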

Either way, I had another idea to simplify things a bit, but this might
end up being even simpler, so I'll stop hacking on my version
for now.


* Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings
  2024-04-26 19:19                 ` Matthew Wilcox
  2024-04-27  4:54                   ` Christoph Hellwig
@ 2024-04-27  6:03                   ` Ritesh Harjani
  1 sibling, 0 replies; 30+ messages in thread
From: Ritesh Harjani @ 2024-04-27  6:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, linux-ext4, linux-xfs, linux-fsdevel,
	Ojaswin Mujoo, Jan Kara

Matthew Wilcox <willy@infradead.org> writes:

> On Sat, Apr 27, 2024 at 12:27:52AM +0530, Ritesh Harjani wrote:
>> [ ... quoted discussion snipped; see upthread ... ]
>> ...One small problem which I see with this approach is - we might have
>> some non-zero value in ifs->r_b_p when ifs_free() gets called and it
>> might give a warning of non-zero ifs->r_b_p, because we updated
>> ifs->r_b_p during writes to a non-zero value, but the reads
>> never happened. Then during a call to ->release_folio, it will complain
>> of a non-zero ifs->r_b_p.
>
> Yes, we'll have to remove that assertion.  I don't think that's a
> problem, do you?

Sure, I will give it a spin.

-ritesh


end of thread, other threads:[~2024-04-27  6:03 UTC | newest]

Thread overview: 30+ messages
2024-04-25 13:28 [RFCv3 0/7] ext2 iomap changes and iomap improvements Ritesh Harjani (IBM)
2024-04-25 13:28 ` [RFCv3 1/7] ext2: Remove comment related to journal handle Ritesh Harjani (IBM)
2024-04-26 15:21   ` Darrick J. Wong
2024-04-25 13:28 ` [RFCv3 2/7] ext2: Convert ext2 regular file buffered I/O to use iomap Ritesh Harjani (IBM)
2024-04-26 15:29   ` Darrick J. Wong
2024-04-25 13:28 ` [RFCv3 3/7] ext2: Enable large folio support Ritesh Harjani (IBM)
2024-04-26 15:30   ` Darrick J. Wong
2024-04-25 13:28 ` [RFCv3 4/7] ext2: Implement seq counter for validating cached iomap Ritesh Harjani (IBM)
2024-04-26 15:41   ` Darrick J. Wong
2024-04-25 13:28 ` [RFCv3 5/7] iomap: Fix iomap_adjust_read_range for plen calculation Ritesh Harjani (IBM)
2024-04-26  6:49   ` Christoph Hellwig
2024-04-26  8:52     ` Ritesh Harjani
2024-04-26 15:43       ` Darrick J. Wong
2024-04-25 13:28 ` [RFCv3 6/7] iomap: Optimize iomap_read_folio Ritesh Harjani (IBM)
2024-04-25 13:53   ` Matthew Wilcox
2024-04-26  6:53   ` Christoph Hellwig
2024-04-26  8:50     ` Ritesh Harjani
2024-04-27  4:44       ` Christoph Hellwig
2024-04-25 13:28 ` [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings Ritesh Harjani (IBM)
2024-04-26 16:24   ` Darrick J. Wong
2024-04-26 17:17     ` Ritesh Harjani
2024-04-26 17:25       ` Ritesh Harjani
2024-04-26 17:37         ` Matthew Wilcox
2024-04-26 17:55           ` Ritesh Harjani
2024-04-26 18:13             ` Matthew Wilcox
2024-04-26 18:57               ` Ritesh Harjani
2024-04-26 19:19                 ` Matthew Wilcox
2024-04-27  4:54                   ` Christoph Hellwig
2024-04-27  6:03                   ` Ritesh Harjani
2024-04-27  4:47   ` Christoph Hellwig
