* [RFC PATCH 00/10] Make O_SYNC writethrough
@ 2022-05-03  6:39 Matthew Wilcox (Oracle)
  2022-05-03  6:39 ` [RFC PATCH 01/10] iomap: Pass struct iomap to iomap_alloc_ioend() Matthew Wilcox (Oracle)
                   ` (11 more replies)
  0 siblings, 12 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-05-03  6:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Damien Le Moal, Christoph Hellwig, Darrick J . Wong

This is very much in development and basically untested, but Damian
started describing to me something that he wanted, and I told him he
was asking for the wrong thing, and I already had this patch series
in progress.  If someone wants to pick it up and make it mergable,
that'd be grand.

The idea is that an O_SYNC write is always going to want to write, and
we know that at the time we're storing into the page cache.  So for an
otherwise clean folio, we can skip the part where we dirty the folio,
find the dirty folios and wait for their writeback.  We can just mark the
folio as writeback-in-progress and start the IO there and then (where we
know exactly which blocks need to be written, so possibly a smaller I/O
than writing the entire page).  The existing "find dirty pages, start
I/O and wait on them" code will end up waiting on this pre-started I/O
to complete, even though it didn't start any of its own I/O.
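
In pseudocode, roughly (a sketch only; the real code is in patch 9 and
differs in detail):

	/* today's O_SYNC buffered write */
	copy into folio;
	filemap_dirty_folio(mapping, folio);
	...
	filemap_write_and_wait_range();		/* find dirty folios, start I/O, wait */

	/* writethrough O_SYNC buffered write */
	copy into folio;
	folio_start_writeback(folio);
	ioend = iomap_add_to_ioend(inode, pos, len, folio, ..., /* wbc */ NULL, ...);
	...
	iomap_submit_ioend(NULL, ioend, 0);
	filemap_write_and_wait_range();		/* finds nothing dirty, only waits */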

The important part is patch 9.  Everything before it is boring prep work.
I'm in two minds about whether to keep the 'write_through' bool, or
remove it.  So feel free to read patches 9+10 squashed together, or as if
patch 10 doesn't exist.  Whichever feels better.

The biggest problem with all this is that iomap doesn't have the necessary
information to cause extent allocation, so if you do an O_SYNC write
to an extent which is HOLE or DELALLOC, we can't do this optimisation.
Maybe that doesn't really matter for interesting applications.  I suspect
it doesn't matter for ZoneFS.

Matthew Wilcox (Oracle) (10):
  iomap: Pass struct iomap to iomap_alloc_ioend()
  iomap: Remove iomap_writepage_ctx from iomap_can_add_to_ioend()
  iomap: Do not pass iomap_writepage_ctx to iomap_add_to_ioend()
  iomap: Accept a NULL iomap_writepage_ctx in iomap_submit_ioend()
  iomap: Allow a NULL writeback_control argument to iomap_alloc_ioend()
  iomap: Pass a length to iomap_add_to_ioend()
  iomap: Reorder functions
  iomap: Reorder functions
  iomap: Add writethrough for O_SYNC
  remove write_through bool

 fs/iomap/buffered-io.c | 492 +++++++++++++++++++++++------------------
 1 file changed, 273 insertions(+), 219 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 01/10] iomap: Pass struct iomap to iomap_alloc_ioend()
  2022-05-03  6:39 [RFC PATCH 00/10] Make O_SYNC writethrough Matthew Wilcox (Oracle)
@ 2022-05-03  6:39 ` Matthew Wilcox (Oracle)
  2022-05-03  6:40 ` [RFC PATCH 02/10] iomap: Remove iomap_writepage_ctx from iomap_can_add_to_ioend() Matthew Wilcox (Oracle)
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-05-03  6:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Damien Le Moal, Christoph Hellwig, Darrick J . Wong

iomap_alloc_ioend() does not need the rest of iomap_writepage_ctx;
only the iomap contained in it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6c51a75d0be6..03c7c97bc871 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1222,15 +1222,15 @@ iomap_submit_ioend(struct iomap_writepage_ctx *wpc, struct iomap_ioend *ioend,
 	return 0;
 }
 
-static struct iomap_ioend *
-iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc,
-		loff_t offset, sector_t sector, struct writeback_control *wbc)
+static struct iomap_ioend *iomap_alloc_ioend(struct inode *inode,
+		struct iomap *iomap, loff_t offset, sector_t sector,
+		struct writeback_control *wbc)
 {
 	struct iomap_ioend *ioend;
 	struct bio *bio;
 
 	bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_VECS, &iomap_ioend_bioset);
-	bio_set_dev(bio, wpc->iomap.bdev);
+	bio_set_dev(bio, iomap->bdev);
 	bio->bi_iter.bi_sector = sector;
 	bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
 	bio->bi_write_hint = inode->i_write_hint;
@@ -1238,8 +1238,8 @@ iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc,
 
 	ioend = container_of(bio, struct iomap_ioend, io_inline_bio);
 	INIT_LIST_HEAD(&ioend->io_list);
-	ioend->io_type = wpc->iomap.type;
-	ioend->io_flags = wpc->iomap.flags;
+	ioend->io_type = iomap->type;
+	ioend->io_flags = iomap->flags;
 	ioend->io_inode = inode;
 	ioend->io_size = 0;
 	ioend->io_folios = 0;
@@ -1305,14 +1305,15 @@ iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
 		struct iomap_page *iop, struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct list_head *iolist)
 {
-	sector_t sector = iomap_sector(&wpc->iomap, pos);
+	struct iomap *iomap = &wpc->iomap;
+	sector_t sector = iomap_sector(iomap, pos);
 	unsigned len = i_blocksize(inode);
 	size_t poff = offset_in_folio(folio, pos);
 
 	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos, sector)) {
 		if (wpc->ioend)
 			list_add(&wpc->ioend->io_list, iolist);
-		wpc->ioend = iomap_alloc_ioend(inode, wpc, pos, sector, wbc);
+		wpc->ioend = iomap_alloc_ioend(inode, iomap, pos, sector, wbc);
 	}
 
 	if (!bio_add_folio(wpc->ioend->io_bio, folio, len, poff)) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 02/10] iomap: Remove iomap_writepage_ctx from iomap_can_add_to_ioend()
  2022-05-03  6:39 [RFC PATCH 00/10] Make O_SYNC writethrough Matthew Wilcox (Oracle)
  2022-05-03  6:39 ` [RFC PATCH 01/10] iomap: Pass struct iomap to iomap_alloc_ioend() Matthew Wilcox (Oracle)
@ 2022-05-03  6:40 ` Matthew Wilcox (Oracle)
  2022-05-03  6:40 ` [RFC PATCH 03/10] iomap: Do not pass iomap_writepage_ctx to iomap_add_to_ioend() Matthew Wilcox (Oracle)
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-05-03  6:40 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Damien Le Moal, Christoph Hellwig, Darrick J . Wong

In preparation for using this function without an iomap_writepage_ctx,
pass in the iomap and ioend.  Also simplify iomap_add_to_ioend() by
using the iomap & ioend directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 35 ++++++++++++++++++-----------------
 1 file changed, 18 insertions(+), 17 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 03c7c97bc871..c91259530ac1 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1273,25 +1273,24 @@ iomap_chain_bio(struct bio *prev)
 	return new;
 }
 
-static bool
-iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
-		sector_t sector)
+static bool iomap_can_add_to_ioend(struct iomap *iomap,
+		struct iomap_ioend *ioend, loff_t offset, sector_t sector)
 {
-	if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
-	    (wpc->ioend->io_flags & IOMAP_F_SHARED))
+	if ((iomap->flags & IOMAP_F_SHARED) !=
+	    (ioend->io_flags & IOMAP_F_SHARED))
 		return false;
-	if (wpc->iomap.type != wpc->ioend->io_type)
+	if (iomap->type != ioend->io_type)
 		return false;
-	if (offset != wpc->ioend->io_offset + wpc->ioend->io_size)
+	if (offset != ioend->io_offset + ioend->io_size)
 		return false;
-	if (sector != bio_end_sector(wpc->ioend->io_bio))
+	if (sector != bio_end_sector(ioend->io_bio))
 		return false;
 	/*
 	 * Limit ioend bio chain lengths to minimise IO completion latency. This
 	 * also prevents long tight loops ending page writeback on all the
 	 * folios in the ioend.
 	 */
-	if (wpc->ioend->io_folios >= IOEND_BATCH_SIZE)
+	if (ioend->io_folios >= IOEND_BATCH_SIZE)
 		return false;
 	return true;
 }
@@ -1306,24 +1305,26 @@ iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
 		struct writeback_control *wbc, struct list_head *iolist)
 {
 	struct iomap *iomap = &wpc->iomap;
+	struct iomap_ioend *ioend = wpc->ioend;
 	sector_t sector = iomap_sector(iomap, pos);
 	unsigned len = i_blocksize(inode);
 	size_t poff = offset_in_folio(folio, pos);
 
-	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos, sector)) {
-		if (wpc->ioend)
-			list_add(&wpc->ioend->io_list, iolist);
-		wpc->ioend = iomap_alloc_ioend(inode, iomap, pos, sector, wbc);
+	if (!ioend || !iomap_can_add_to_ioend(iomap, ioend, pos, sector)) {
+		if (ioend)
+			list_add(&ioend->io_list, iolist);
+		ioend = iomap_alloc_ioend(inode, iomap, pos, sector, wbc);
+		wpc->ioend = ioend;
 	}
 
-	if (!bio_add_folio(wpc->ioend->io_bio, folio, len, poff)) {
-		wpc->ioend->io_bio = iomap_chain_bio(wpc->ioend->io_bio);
-		bio_add_folio(wpc->ioend->io_bio, folio, len, poff);
+	if (!bio_add_folio(ioend->io_bio, folio, len, poff)) {
+		ioend->io_bio = iomap_chain_bio(ioend->io_bio);
+		bio_add_folio(ioend->io_bio, folio, len, poff);
 	}
 
 	if (iop)
 		atomic_add(len, &iop->write_bytes_pending);
-	wpc->ioend->io_size += len;
+	ioend->io_size += len;
 	wbc_account_cgroup_owner(wbc, &folio->page, len);
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 03/10] iomap: Do not pass iomap_writepage_ctx to iomap_add_to_ioend()
  2022-05-03  6:39 [RFC PATCH 00/10] Make O_SYNC writethrough Matthew Wilcox (Oracle)
  2022-05-03  6:39 ` [RFC PATCH 01/10] iomap: Pass struct iomap to iomap_alloc_ioend() Matthew Wilcox (Oracle)
  2022-05-03  6:40 ` [RFC PATCH 02/10] iomap: Remove iomap_writepage_ctx from iomap_can_add_to_ioend() Matthew Wilcox (Oracle)
@ 2022-05-03  6:40 ` Matthew Wilcox (Oracle)
  2022-05-03  6:40 ` [RFC PATCH 04/10] iomap: Accept a NULL iomap_writepage_ctx in iomap_submit_ioend() Matthew Wilcox (Oracle)
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-05-03  6:40 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Damien Le Moal, Christoph Hellwig, Darrick J . Wong

In preparation for calling iomap_add_to_ioend() without a writepage_ctx
available, pass in the iomap and the (current) ioend, and return the
current ioend.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index c91259530ac1..1bf361446267 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1299,13 +1299,11 @@ static bool iomap_can_add_to_ioend(struct iomap *iomap,
  * Test to see if we have an existing ioend structure that we could append to
  * first; otherwise finish off the current ioend and start another.
  */
-static void
-iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
-		struct iomap_page *iop, struct iomap_writepage_ctx *wpc,
+static struct iomap_ioend *iomap_add_to_ioend(struct inode *inode,
+		loff_t pos, struct folio *folio, struct iomap_page *iop,
+		struct iomap *iomap, struct iomap_ioend *ioend,
 		struct writeback_control *wbc, struct list_head *iolist)
 {
-	struct iomap *iomap = &wpc->iomap;
-	struct iomap_ioend *ioend = wpc->ioend;
 	sector_t sector = iomap_sector(iomap, pos);
 	unsigned len = i_blocksize(inode);
 	size_t poff = offset_in_folio(folio, pos);
@@ -1314,7 +1312,6 @@ iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
 		if (ioend)
 			list_add(&ioend->io_list, iolist);
 		ioend = iomap_alloc_ioend(inode, iomap, pos, sector, wbc);
-		wpc->ioend = ioend;
 	}
 
 	if (!bio_add_folio(ioend->io_bio, folio, len, poff)) {
@@ -1326,6 +1323,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
 		atomic_add(len, &iop->write_bytes_pending);
 	ioend->io_size += len;
 	wbc_account_cgroup_owner(wbc, &folio->page, len);
+	return ioend;
 }
 
 /*
@@ -1375,8 +1373,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 			continue;
 		if (wpc->iomap.type == IOMAP_HOLE)
 			continue;
-		iomap_add_to_ioend(inode, pos, folio, iop, wpc, wbc,
-				 &submit_list);
+		wpc->ioend = iomap_add_to_ioend(inode, pos, folio, iop,
+				&wpc->iomap, wpc->ioend, wbc, &submit_list);
 		count++;
 	}
 	if (count)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 04/10] iomap: Accept a NULL iomap_writepage_ctx in iomap_submit_ioend()
  2022-05-03  6:39 [RFC PATCH 00/10] Make O_SYNC writethrough Matthew Wilcox (Oracle)
                   ` (2 preceding siblings ...)
  2022-05-03  6:40 ` [RFC PATCH 03/10] iomap: Do not pass iomap_writepage_ctx to iomap_add_to_ioend() Matthew Wilcox (Oracle)
@ 2022-05-03  6:40 ` Matthew Wilcox (Oracle)
  2022-05-03  6:40 ` [RFC PATCH 05/10] iomap: Allow a NULL writeback_control argument to iomap_alloc_ioend() Matthew Wilcox (Oracle)
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-05-03  6:40 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Damien Le Moal, Christoph Hellwig, Darrick J . Wong

Prepare for I/O without an iomap_writepage_ctx() by accepting a NULL
wpc.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1bf361446267..85bcdb0dc66c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1204,7 +1204,7 @@ iomap_submit_ioend(struct iomap_writepage_ctx *wpc, struct iomap_ioend *ioend,
 	ioend->io_bio->bi_private = ioend;
 	ioend->io_bio->bi_end_io = iomap_writepage_end_bio;
 
-	if (wpc->ops->prepare_ioend)
+	if (wpc && wpc->ops->prepare_ioend)
 		error = wpc->ops->prepare_ioend(ioend, error);
 	if (error) {
 		/*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 05/10] iomap: Allow a NULL writeback_control argument to iomap_alloc_ioend()
  2022-05-03  6:39 [RFC PATCH 00/10] Make O_SYNC writethrough Matthew Wilcox (Oracle)
                   ` (3 preceding siblings ...)
  2022-05-03  6:40 ` [RFC PATCH 04/10] iomap: Accept a NULL iomap_writepage_ctx in iomap_submit_ioend() Matthew Wilcox (Oracle)
@ 2022-05-03  6:40 ` Matthew Wilcox (Oracle)
  2022-05-03  6:40 ` [RFC PATCH 06/10] iomap: Pass a length to iomap_add_to_ioend() Matthew Wilcox (Oracle)
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-05-03  6:40 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Damien Le Moal, Christoph Hellwig, Darrick J . Wong

When we're doing writethrough, we don't have a writeback_control to
pass in.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 85bcdb0dc66c..024e16fb95a8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1232,9 +1232,13 @@ static struct iomap_ioend *iomap_alloc_ioend(struct inode *inode,
 	bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_VECS, &iomap_ioend_bioset);
 	bio_set_dev(bio, iomap->bdev);
 	bio->bi_iter.bi_sector = sector;
-	bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
+	bio->bi_opf = REQ_OP_WRITE;
 	bio->bi_write_hint = inode->i_write_hint;
-	wbc_init_bio(wbc, bio);
+
+	if (wbc) {
+		bio->bi_opf |= wbc_to_write_flags(wbc);
+		wbc_init_bio(wbc, bio);
+	}
 
 	ioend = container_of(bio, struct iomap_ioend, io_inline_bio);
 	INIT_LIST_HEAD(&ioend->io_list);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 06/10] iomap: Pass a length to iomap_add_to_ioend()
  2022-05-03  6:39 [RFC PATCH 00/10] Make O_SYNC writethrough Matthew Wilcox (Oracle)
                   ` (4 preceding siblings ...)
  2022-05-03  6:40 ` [RFC PATCH 05/10] iomap: Allow a NULL writeback_control argument to iomap_alloc_ioend() Matthew Wilcox (Oracle)
@ 2022-05-03  6:40 ` Matthew Wilcox (Oracle)
  2022-05-03  6:40 ` [RFC PATCH 07/10] iomap: Reorder functions Matthew Wilcox (Oracle)
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-05-03  6:40 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Damien Le Moal, Christoph Hellwig, Darrick J . Wong

Allow the caller to specify how much of the page to add to the ioend
instead of assuming a single sector.  Somebody should probably enhance
iomap_writepage_map() to make one call per extent instead of one per
block.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 024e16fb95a8..5b69cea71f71 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1304,12 +1304,12 @@ static bool iomap_can_add_to_ioend(struct iomap *iomap,
  * first; otherwise finish off the current ioend and start another.
  */
 static struct iomap_ioend *iomap_add_to_ioend(struct inode *inode,
-		loff_t pos, struct folio *folio, struct iomap_page *iop,
-		struct iomap *iomap, struct iomap_ioend *ioend,
-		struct writeback_control *wbc, struct list_head *iolist)
+		loff_t pos, size_t len, struct folio *folio,
+		struct iomap_page *iop, struct iomap *iomap,
+		struct iomap_ioend *ioend, struct writeback_control *wbc,
+		struct list_head *iolist)
 {
 	sector_t sector = iomap_sector(iomap, pos);
-	unsigned len = i_blocksize(inode);
 	size_t poff = offset_in_folio(folio, pos);
 
 	if (!ioend || !iomap_can_add_to_ioend(iomap, ioend, pos, sector)) {
@@ -1377,7 +1377,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 			continue;
 		if (wpc->iomap.type == IOMAP_HOLE)
 			continue;
-		wpc->ioend = iomap_add_to_ioend(inode, pos, folio, iop,
+		wpc->ioend = iomap_add_to_ioend(inode, pos, len, folio, iop,
 				&wpc->iomap, wpc->ioend, wbc, &submit_list);
 		count++;
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 07/10] iomap: Reorder functions
  2022-05-03  6:39 [RFC PATCH 00/10] Make O_SYNC writethrough Matthew Wilcox (Oracle)
                   ` (5 preceding siblings ...)
  2022-05-03  6:40 ` [RFC PATCH 06/10] iomap: Pass a length to iomap_add_to_ioend() Matthew Wilcox (Oracle)
@ 2022-05-03  6:40 ` Matthew Wilcox (Oracle)
  2022-05-03  6:40 ` [RFC PATCH 08/10] " Matthew Wilcox (Oracle)
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-05-03  6:40 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Damien Le Moal, Christoph Hellwig, Darrick J . Wong

Move the ioend creation functions earlier in the file so write_end can
create ioends without requiring forward declarations.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 215 ++++++++++++++++++++---------------------
 1 file changed, 107 insertions(+), 108 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 5b69cea71f71..4aa2209fb003 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -558,6 +558,113 @@ static int iomap_read_folio_sync(loff_t block_start, struct folio *folio,
 	return submit_bio_wait(&bio);
 }
 
+static bool iomap_can_add_to_ioend(struct iomap *iomap,
+		struct iomap_ioend *ioend, loff_t offset, sector_t sector)
+{
+	if ((iomap->flags & IOMAP_F_SHARED) !=
+	    (ioend->io_flags & IOMAP_F_SHARED))
+		return false;
+	if (iomap->type != ioend->io_type)
+		return false;
+	if (offset != ioend->io_offset + ioend->io_size)
+		return false;
+	if (sector != bio_end_sector(ioend->io_bio))
+		return false;
+	/*
+	 * Limit ioend bio chain lengths to minimise IO completion latency. This
+	 * also prevents long tight loops ending page writeback on all the
+	 * folios in the ioend.
+	 */
+	if (ioend->io_folios >= IOEND_BATCH_SIZE)
+		return false;
+	return true;
+}
+
+static struct iomap_ioend *iomap_alloc_ioend(struct inode *inode,
+		struct iomap *iomap, loff_t offset, sector_t sector,
+		struct writeback_control *wbc)
+{
+	struct iomap_ioend *ioend;
+	struct bio *bio;
+
+	bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_VECS, &iomap_ioend_bioset);
+	bio_set_dev(bio, iomap->bdev);
+	bio->bi_iter.bi_sector = sector;
+	bio->bi_opf = REQ_OP_WRITE;
+	bio->bi_write_hint = inode->i_write_hint;
+
+	if (wbc) {
+		bio->bi_opf |= wbc_to_write_flags(wbc);
+		wbc_init_bio(wbc, bio);
+	}
+
+	ioend = container_of(bio, struct iomap_ioend, io_inline_bio);
+	INIT_LIST_HEAD(&ioend->io_list);
+	ioend->io_type = iomap->type;
+	ioend->io_flags = iomap->flags;
+	ioend->io_inode = inode;
+	ioend->io_size = 0;
+	ioend->io_folios = 0;
+	ioend->io_offset = offset;
+	ioend->io_bio = bio;
+	ioend->io_sector = sector;
+	return ioend;
+}
+
+/*
+ * Allocate a new bio, and chain the old bio to the new one.
+ *
+ * Note that we have to perform the chaining in this unintuitive order
+ * so that the bi_private linkage is set up in the right direction for the
+ * traversal in iomap_finish_ioend().
+ */
+static struct bio *iomap_chain_bio(struct bio *prev)
+{
+	struct bio *new;
+
+	new = bio_alloc(GFP_NOFS, BIO_MAX_VECS);
+	bio_copy_dev(new, prev);/* also copies over blkcg information */
+	new->bi_iter.bi_sector = bio_end_sector(prev);
+	new->bi_opf = prev->bi_opf;
+	new->bi_write_hint = prev->bi_write_hint;
+
+	bio_chain(prev, new);
+	bio_get(prev);		/* for iomap_finish_ioend */
+	submit_bio(prev);
+	return new;
+}
+
+/*
+ * Test to see if we have an existing ioend structure that we could append to
+ * first; otherwise finish off the current ioend and start another.
+ */
+static struct iomap_ioend *iomap_add_to_ioend(struct inode *inode,
+		loff_t pos, size_t len, struct folio *folio,
+		struct iomap_page *iop, struct iomap *iomap,
+		struct iomap_ioend *ioend, struct writeback_control *wbc,
+		struct list_head *iolist)
+{
+	sector_t sector = iomap_sector(iomap, pos);
+	size_t poff = offset_in_folio(folio, pos);
+
+	if (!ioend || !iomap_can_add_to_ioend(iomap, ioend, pos, sector)) {
+		if (ioend)
+			list_add(&ioend->io_list, iolist);
+		ioend = iomap_alloc_ioend(inode, iomap, pos, sector, wbc);
+	}
+
+	if (!bio_add_folio(ioend->io_bio, folio, len, poff)) {
+		ioend->io_bio = iomap_chain_bio(ioend->io_bio);
+		bio_add_folio(ioend->io_bio, folio, len, poff);
+	}
+
+	if (iop)
+		atomic_add(len, &iop->write_bytes_pending);
+	ioend->io_size += len;
+	wbc_account_cgroup_owner(wbc, &folio->page, len);
+	return ioend;
+}
+
 static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 		size_t len, struct folio *folio)
 {
@@ -1222,114 +1329,6 @@ iomap_submit_ioend(struct iomap_writepage_ctx *wpc, struct iomap_ioend *ioend,
 	return 0;
 }
 
-static struct iomap_ioend *iomap_alloc_ioend(struct inode *inode,
-		struct iomap *iomap, loff_t offset, sector_t sector,
-		struct writeback_control *wbc)
-{
-	struct iomap_ioend *ioend;
-	struct bio *bio;
-
-	bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_VECS, &iomap_ioend_bioset);
-	bio_set_dev(bio, iomap->bdev);
-	bio->bi_iter.bi_sector = sector;
-	bio->bi_opf = REQ_OP_WRITE;
-	bio->bi_write_hint = inode->i_write_hint;
-
-	if (wbc) {
-		bio->bi_opf |= wbc_to_write_flags(wbc);
-		wbc_init_bio(wbc, bio);
-	}
-
-	ioend = container_of(bio, struct iomap_ioend, io_inline_bio);
-	INIT_LIST_HEAD(&ioend->io_list);
-	ioend->io_type = iomap->type;
-	ioend->io_flags = iomap->flags;
-	ioend->io_inode = inode;
-	ioend->io_size = 0;
-	ioend->io_folios = 0;
-	ioend->io_offset = offset;
-	ioend->io_bio = bio;
-	ioend->io_sector = sector;
-	return ioend;
-}
-
-/*
- * Allocate a new bio, and chain the old bio to the new one.
- *
- * Note that we have to perform the chaining in this unintuitive order
- * so that the bi_private linkage is set up in the right direction for the
- * traversal in iomap_finish_ioend().
- */
-static struct bio *
-iomap_chain_bio(struct bio *prev)
-{
-	struct bio *new;
-
-	new = bio_alloc(GFP_NOFS, BIO_MAX_VECS);
-	bio_copy_dev(new, prev);/* also copies over blkcg information */
-	new->bi_iter.bi_sector = bio_end_sector(prev);
-	new->bi_opf = prev->bi_opf;
-	new->bi_write_hint = prev->bi_write_hint;
-
-	bio_chain(prev, new);
-	bio_get(prev);		/* for iomap_finish_ioend */
-	submit_bio(prev);
-	return new;
-}
-
-static bool iomap_can_add_to_ioend(struct iomap *iomap,
-		struct iomap_ioend *ioend, loff_t offset, sector_t sector)
-{
-	if ((iomap->flags & IOMAP_F_SHARED) !=
-	    (ioend->io_flags & IOMAP_F_SHARED))
-		return false;
-	if (iomap->type != ioend->io_type)
-		return false;
-	if (offset != ioend->io_offset + ioend->io_size)
-		return false;
-	if (sector != bio_end_sector(ioend->io_bio))
-		return false;
-	/*
-	 * Limit ioend bio chain lengths to minimise IO completion latency. This
-	 * also prevents long tight loops ending page writeback on all the
-	 * folios in the ioend.
-	 */
-	if (ioend->io_folios >= IOEND_BATCH_SIZE)
-		return false;
-	return true;
-}
-
-/*
- * Test to see if we have an existing ioend structure that we could append to
- * first; otherwise finish off the current ioend and start another.
- */
-static struct iomap_ioend *iomap_add_to_ioend(struct inode *inode,
-		loff_t pos, size_t len, struct folio *folio,
-		struct iomap_page *iop, struct iomap *iomap,
-		struct iomap_ioend *ioend, struct writeback_control *wbc,
-		struct list_head *iolist)
-{
-	sector_t sector = iomap_sector(iomap, pos);
-	size_t poff = offset_in_folio(folio, pos);
-
-	if (!ioend || !iomap_can_add_to_ioend(iomap, ioend, pos, sector)) {
-		if (ioend)
-			list_add(&ioend->io_list, iolist);
-		ioend = iomap_alloc_ioend(inode, iomap, pos, sector, wbc);
-	}
-
-	if (!bio_add_folio(ioend->io_bio, folio, len, poff)) {
-		ioend->io_bio = iomap_chain_bio(ioend->io_bio);
-		bio_add_folio(ioend->io_bio, folio, len, poff);
-	}
-
-	if (iop)
-		atomic_add(len, &iop->write_bytes_pending);
-	ioend->io_size += len;
-	wbc_account_cgroup_owner(wbc, &folio->page, len);
-	return ioend;
-}
-
 /*
  * We implement an immediate ioend submission policy here to avoid needing to
  * chain multiple ioends and hence nest mempool allocations which can violate
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 08/10] iomap: Reorder functions
  2022-05-03  6:39 [RFC PATCH 00/10] Make O_SYNC writethrough Matthew Wilcox (Oracle)
                   ` (6 preceding siblings ...)
  2022-05-03  6:40 ` [RFC PATCH 07/10] iomap: Reorder functions Matthew Wilcox (Oracle)
@ 2022-05-03  6:40 ` Matthew Wilcox (Oracle)
  2022-05-03  6:40 ` [RFC PATCH 09/10] iomap: Add writethrough for O_SYNC Matthew Wilcox (Oracle)
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-05-03  6:40 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Damien Le Moal, Christoph Hellwig, Darrick J . Wong

Move the ioend submission functions earlier in the file so write_iter
can submit ioends without requiring forward declarations.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 204 ++++++++++++++++++++---------------------
 1 file changed, 101 insertions(+), 103 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4aa2209fb003..6c540390eec3 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -665,6 +665,107 @@ static struct iomap_ioend *iomap_add_to_ioend(struct inode *inode,
 	return ioend;
 }
 
+static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
+		size_t len, int error)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+
+	if (error) {
+		folio_set_error(folio);
+		mapping_set_error(inode->i_mapping, error);
+	}
+
+	WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !iop);
+	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);
+
+	if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
+		folio_end_writeback(folio);
+}
+
+/*
+ * We're now finished for good with this ioend structure.  Update the page
+ * state, release holds on bios, and finally free up memory.  Do not use the
+ * ioend after this.
+ */
+static u32 iomap_finish_ioend(struct iomap_ioend *ioend, int error)
+{
+	struct inode *inode = ioend->io_inode;
+	struct bio *bio = &ioend->io_inline_bio;
+	struct bio *last = ioend->io_bio, *next;
+	u64 start = bio->bi_iter.bi_sector;
+	loff_t offset = ioend->io_offset;
+	bool quiet = bio_flagged(bio, BIO_QUIET);
+	u32 folio_count = 0;
+
+	for (bio = &ioend->io_inline_bio; bio; bio = next) {
+		struct folio_iter fi;
+
+		/*
+		 * For the last bio, bi_private points to the ioend, so we
+		 * need to explicitly end the iteration here.
+		 */
+		if (bio == last)
+			next = NULL;
+		else
+			next = bio->bi_private;
+
+		/* walk all folios in bio, ending page IO on them */
+		bio_for_each_folio_all(fi, bio) {
+			iomap_finish_folio_write(inode, fi.folio, fi.length,
+					error);
+			folio_count++;
+		}
+		bio_put(bio);
+	}
+	/* The ioend has been freed by bio_put() */
+
+	if (unlikely(error && !quiet)) {
+		printk_ratelimited(KERN_ERR
+"%s: writeback error on inode %lu, offset %lld, sector %llu",
+			inode->i_sb->s_id, inode->i_ino, offset, start);
+	}
+	return folio_count;
+}
+
+static void iomap_writepage_end_bio(struct bio *bio)
+{
+	struct iomap_ioend *ioend = bio->bi_private;
+
+	iomap_finish_ioend(ioend, blk_status_to_errno(bio->bi_status));
+}
+
+/*
+ * Submit the final bio for an ioend.
+ *
+ * If @error is non-zero, it means that we have a situation where some part of
+ * the submission process has failed after we've marked pages for writeback
+ * and unlocked them.  In this situation, we need to fail the bio instead of
+ * submitting it.  This typically only happens on a filesystem shutdown.
+ */
+static int iomap_submit_ioend(struct iomap_writepage_ctx *wpc,
+		struct iomap_ioend *ioend, int error)
+{
+	ioend->io_bio->bi_private = ioend;
+	ioend->io_bio->bi_end_io = iomap_writepage_end_bio;
+
+	if (wpc && wpc->ops->prepare_ioend)
+		error = wpc->ops->prepare_ioend(ioend, error);
+	if (error) {
+		/*
+		 * If we're failing the IO now, just mark the ioend with an
+		 * error and finish it.  This will run IO completion immediately
+		 * as there is only one reference to the ioend at this point in
+		 * time.
+		 */
+		ioend->io_bio->bi_status = errno_to_blk_status(error);
+		bio_endio(ioend->io_bio);
+		return error;
+	}
+
+	submit_bio(ioend->io_bio);
+	return 0;
+}
+
 static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 		size_t len, struct folio *folio)
 {
@@ -1126,69 +1227,6 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
 }
 EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
 
-static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
-		size_t len, int error)
-{
-	struct iomap_page *iop = to_iomap_page(folio);
-
-	if (error) {
-		folio_set_error(folio);
-		mapping_set_error(inode->i_mapping, error);
-	}
-
-	WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !iop);
-	WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);
-
-	if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
-		folio_end_writeback(folio);
-}
-
-/*
- * We're now finished for good with this ioend structure.  Update the page
- * state, release holds on bios, and finally free up memory.  Do not use the
- * ioend after this.
- */
-static u32
-iomap_finish_ioend(struct iomap_ioend *ioend, int error)
-{
-	struct inode *inode = ioend->io_inode;
-	struct bio *bio = &ioend->io_inline_bio;
-	struct bio *last = ioend->io_bio, *next;
-	u64 start = bio->bi_iter.bi_sector;
-	loff_t offset = ioend->io_offset;
-	bool quiet = bio_flagged(bio, BIO_QUIET);
-	u32 folio_count = 0;
-
-	for (bio = &ioend->io_inline_bio; bio; bio = next) {
-		struct folio_iter fi;
-
-		/*
-		 * For the last bio, bi_private points to the ioend, so we
-		 * need to explicitly end the iteration here.
-		 */
-		if (bio == last)
-			next = NULL;
-		else
-			next = bio->bi_private;
-
-		/* walk all folios in bio, ending page IO on them */
-		bio_for_each_folio_all(fi, bio) {
-			iomap_finish_folio_write(inode, fi.folio, fi.length,
-					error);
-			folio_count++;
-		}
-		bio_put(bio);
-	}
-	/* The ioend has been freed by bio_put() */
-
-	if (unlikely(error && !quiet)) {
-		printk_ratelimited(KERN_ERR
-"%s: writeback error on inode %lu, offset %lld, sector %llu",
-			inode->i_sb->s_id, inode->i_ino, offset, start);
-	}
-	return folio_count;
-}
-
 /*
  * Ioend completion routine for merged bios. This can only be called from task
  * contexts as merged ioends can be of unbound length. Hence we have to break up
@@ -1289,46 +1327,6 @@ iomap_sort_ioends(struct list_head *ioend_list)
 }
 EXPORT_SYMBOL_GPL(iomap_sort_ioends);
 
-static void iomap_writepage_end_bio(struct bio *bio)
-{
-	struct iomap_ioend *ioend = bio->bi_private;
-
-	iomap_finish_ioend(ioend, blk_status_to_errno(bio->bi_status));
-}
-
-/*
- * Submit the final bio for an ioend.
- *
- * If @error is non-zero, it means that we have a situation where some part of
- * the submission process has failed after we've marked pages for writeback
- * and unlocked them.  In this situation, we need to fail the bio instead of
- * submitting it.  This typically only happens on a filesystem shutdown.
- */
-static int
-iomap_submit_ioend(struct iomap_writepage_ctx *wpc, struct iomap_ioend *ioend,
-		int error)
-{
-	ioend->io_bio->bi_private = ioend;
-	ioend->io_bio->bi_end_io = iomap_writepage_end_bio;
-
-	if (wpc && wpc->ops->prepare_ioend)
-		error = wpc->ops->prepare_ioend(ioend, error);
-	if (error) {
-		/*
-		 * If we're failing the IO now, just mark the ioend with an
-		 * error and finish it.  This will run IO completion immediately
-		 * as there is only one reference to the ioend at this point in
-		 * time.
-		 */
-		ioend->io_bio->bi_status = errno_to_blk_status(error);
-		bio_endio(ioend->io_bio);
-		return error;
-	}
-
-	submit_bio(ioend->io_bio);
-	return 0;
-}
-
 /*
  * We implement an immediate ioend submission policy here to avoid needing to
  * chain multiple ioends and hence nest mempool allocations which can violate
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 09/10] iomap: Add writethrough for O_SYNC
  2022-05-03  6:39 [RFC PATCH 00/10] Make O_SYNC writethrough Matthew Wilcox (Oracle)
                   ` (7 preceding siblings ...)
  2022-05-03  6:40 ` [RFC PATCH 08/10] " Matthew Wilcox (Oracle)
@ 2022-05-03  6:40 ` Matthew Wilcox (Oracle)
  2022-05-03  6:40 ` [RFC PATCH 10/10] remove write_through bool Matthew Wilcox (Oracle)
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-05-03  6:40 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Damien Le Moal, Christoph Hellwig, Darrick J . Wong

For O_SYNC writes, if the filesystem has already allocated blocks for
the range, we can avoid marking the page as dirty and skip straight to
marking the page as writeback.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 74 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 64 insertions(+), 10 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6c540390eec3..5050adbd4bc8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -531,6 +531,12 @@ iomap_migrate_page(struct address_space *mapping, struct page *newpage,
 EXPORT_SYMBOL_GPL(iomap_migrate_page);
 #endif /* CONFIG_MIGRATION */
 
+struct iomap_write_ctx {
+	struct iomap_ioend *ioend;
+	struct list_head iolist;
+	bool write_through;
+};
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -875,8 +881,38 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	return status;
 }
 
-static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
-		size_t copied, struct folio *folio)
+/* Returns true if we can skip dirtying the page */
+static bool iomap_write_through(struct iomap_write_ctx *iwc,
+		struct iomap *iomap, struct inode *inode, struct folio *folio,
+		loff_t pos, size_t len)
+{
+	unsigned int blksize = i_blocksize(inode);
+
+	if (!iwc || !iwc->write_through)
+		return false;
+	if (folio_test_dirty(folio))
+		return true;
+	if (folio_test_writeback(folio))
+		return false;
+
+	/* Can't allocate blocks here because we don't have ->prepare_ioend */
+	if ((iomap->type != IOMAP_MAPPED && iomap->type != IOMAP_UNWRITTEN) ||
+	    iomap->flags & IOMAP_F_SHARED)
+		return false;
+
+	len = round_up(pos + len, blksize);
+	pos = round_down(pos, blksize);
+	len -= pos;
+	iwc->ioend = iomap_add_to_ioend(inode, pos, len, folio,
+			iomap_page_create(inode, folio), iomap, iwc->ioend,
+			NULL, &iwc->iolist);
+	folio_start_writeback(folio);
+	return true;
+}
+
+static size_t __iomap_write_end(struct iomap_write_ctx *iwc,
+		struct iomap *iomap, struct inode *inode, loff_t pos,
+		size_t len, size_t copied, struct folio *folio)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
 	flush_dcache_folio(folio);
@@ -895,7 +931,8 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
 	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
-	filemap_dirty_folio(inode->i_mapping, folio);
+	if (!iomap_write_through(iwc, iomap, inode, folio, pos, len))
+		filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
 
@@ -918,7 +955,8 @@ static size_t iomap_write_end_inline(const struct iomap_iter *iter,
 }
 
 /* Returns the number of bytes copied.  May be 0.  Cannot be an errno. */
-static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
+static size_t iomap_write_end(struct iomap_write_ctx *iwc,
+		struct iomap_iter *iter, loff_t pos, size_t len,
 		size_t copied, struct folio *folio)
 {
 	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
@@ -932,7 +970,8 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 		ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
 				copied, &folio->page, NULL);
 	} else {
-		ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
+		ret = __iomap_write_end(iwc, &iter->iomap, iter->inode, pos,
+				len, copied, folio);
 	}
 
 	/*
@@ -957,7 +996,8 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 	return ret;
 }
 
-static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
+static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
+		struct iomap_write_ctx *iwc)
 {
 	loff_t length = iomap_length(iter);
 	loff_t pos = iter->pos;
@@ -999,7 +1039,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 
 		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
 
-		status = iomap_write_end(iter, pos, bytes, copied, folio);
+		status = iomap_write_end(iwc, iter, pos, bytes, copied, folio);
 
 		if (unlikely(copied != status))
 			iov_iter_revert(i, copied - status);
@@ -1036,10 +1076,24 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 		.len		= iov_iter_count(i),
 		.flags		= IOMAP_WRITE,
 	};
+	struct iomap_write_ctx iwc = {
+		.iolist = LIST_HEAD_INIT(iwc.iolist),
+		.write_through = iocb->ki_flags & IOCB_SYNC,
+	};
+	struct iomap_ioend *ioend, *next;
 	int ret;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_write_iter(&iter, i);
+		iter.processed = iomap_write_iter(&iter, i, &iwc);
+
+	list_for_each_entry_safe(ioend, next, &iwc.iolist, io_list) {
+		list_del_init(&ioend->io_list);
+		ret = iomap_submit_ioend(NULL, ioend, ret);
+	}
+
+	if (iwc.ioend)
+		ret = iomap_submit_ioend(NULL, iwc.ioend, ret);
+
 	if (iter.pos == iocb->ki_pos)
 		return ret;
 	return iter.pos - iocb->ki_pos;
@@ -1071,7 +1125,7 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
 		if (unlikely(status))
 			return status;
 
-		status = iomap_write_end(iter, pos, bytes, bytes, folio);
+		status = iomap_write_end(NULL, iter, pos, bytes, bytes, folio);
 		if (WARN_ON_ONCE(status == 0))
 			return -EIO;
 
@@ -1133,7 +1187,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
-		bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
+		bytes = iomap_write_end(NULL, iter, pos, bytes, bytes, folio);
 		if (WARN_ON_ONCE(bytes == 0))
 			return -EIO;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 10/10] remove write_through bool
  2022-05-03  6:39 [RFC PATCH 00/10] Make O_SYNC writethrough Matthew Wilcox (Oracle)
                   ` (8 preceding siblings ...)
  2022-05-03  6:40 ` [RFC PATCH 09/10] iomap: Add writethrough for O_SYNC Matthew Wilcox (Oracle)
@ 2022-05-03  6:40 ` Matthew Wilcox (Oracle)
  2022-05-03 12:57 ` [RFC PATCH 00/10] Make O_SYNC writethrough Damien Le Moal
  2022-05-05  4:58 ` Dave Chinner
  11 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2022-05-03  6:40 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle),
	Damien Le Moal, Christoph Hellwig, Darrick J . Wong

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 5050adbd4bc8..ec0189dc6747 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -534,7 +534,6 @@ EXPORT_SYMBOL_GPL(iomap_migrate_page);
 struct iomap_write_ctx {
 	struct iomap_ioend *ioend;
 	struct list_head iolist;
-	bool write_through;
 };
 
 static void
@@ -587,7 +586,7 @@ static bool iomap_can_add_to_ioend(struct iomap *iomap,
 }
 
 static struct iomap_ioend *iomap_alloc_ioend(struct inode *inode,
-		struct iomap *iomap, loff_t offset, sector_t sector,
+		const struct iomap *iomap, loff_t offset, sector_t sector,
 		struct writeback_control *wbc)
 {
 	struct iomap_ioend *ioend;
@@ -888,7 +887,7 @@ static bool iomap_write_through(struct iomap_write_ctx *iwc,
 {
 	unsigned int blksize = i_blocksize(inode);
 
-	if (!iwc || !iwc->write_through)
+	if (!iwc)
 		return false;
 	if (folio_test_dirty(folio))
 		return true;
@@ -1078,13 +1077,13 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 	};
 	struct iomap_write_ctx iwc = {
 		.iolist = LIST_HEAD_INIT(iwc.iolist),
-		.write_through = iocb->ki_flags & IOCB_SYNC,
 	};
+	struct iomap_write_ctx *iwcp = iocb->ki_flags & IOCB_SYNC ? &iwc : NULL;
 	struct iomap_ioend *ioend, *next;
 	int ret;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_write_iter(&iter, i, &iwc);
+		iter.processed = iomap_write_iter(&iter, i, iwcp);
 
 	list_for_each_entry_safe(ioend, next, &iwc.iolist, io_list) {
 		list_del_init(&ioend->io_list);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/10] Make O_SYNC writethrough
  2022-05-03  6:39 [RFC PATCH 00/10] Make O_SYNC writethrough Matthew Wilcox (Oracle)
                   ` (9 preceding siblings ...)
  2022-05-03  6:40 ` [RFC PATCH 10/10] remove write_through bool Matthew Wilcox (Oracle)
@ 2022-05-03 12:57 ` Damien Le Moal
  2022-05-05  4:58 ` Dave Chinner
  11 siblings, 0 replies; 17+ messages in thread
From: Damien Le Moal @ 2022-05-03 12:57 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), linux-fsdevel
  Cc: Christoph Hellwig, Darrick J . Wong

On 2022/05/03 15:39, Matthew Wilcox (Oracle) wrote:
> This is very much in development and basically untested, but Damian

s/Damian/Damien :)

Thank you for posting this. I am definitely going to play with this with zonefs.

The goal is to allow replacing the mandatory O_DIRECT writing of sequential zone
files with sector-aligned O_SYNC writes which "preload" the page cache for
subsequent buffered reads, thus reducing device accesses. That will also avoid
an annoying overhead with zonefs, which is that applications need two file
descriptors per zone file: one without O_DIRECT for buffered reads and another
O_DIRECT one for writes.
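
As a purely illustrative userspace sketch (the zone file path below is made up,
and the single-fd variant assumes the writethrough behaviour proposed here):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		/* today: one fd for O_DIRECT writes, another for buffered reads */
		int wfd = open("/mnt/zonefs/seq/0", O_WRONLY | O_DIRECT);
		int rfd = open("/mnt/zonefs/seq/0", O_RDONLY);

		/* with writethrough: one fd; each sector-aligned write also
		 * leaves the data in the page cache for later reads */
		int fd = open("/mnt/zonefs/seq/0", O_RDWR | O_SYNC);

		close(fd);
		close(rfd);
		close(wfd);
		return 0;
	}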

In the case of zonefs, since all sequential files are always fully mapped and
allocated, cannot be used for mmap writing, *and* a write is never an overwrite,
these conditions:

+	if (folio_test_dirty(folio))
+		return true;

+	/* Can't allocate blocks here because we don't have ->prepare_ioend */
+	if ((iomap->type != IOMAP_MAPPED && iomap->type != IOMAP_UNWRITTEN) ||
+	    iomap->flags & IOMAP_F_SHARED)
+		return false;

never trigger and the writethrough is always started with
folio_start_writeback(), essentially becoming a "direct" write from the issuer
context (under the inode lock) on the entire folio. And that should guarantee
that writes stay sequential as they must.

> started describing to me something that he wanted, and I told him he
> was asking for the wrong thing, and I already had this patch series
> in progress.  If someone wants to pick it up and make it mergable,
> that'd be grand.
> 
> The idea is that an O_SYNC write is always going to want to write, and
> we know that at the time we're storing into the page cache.  So for an
> otherwise clean folio, we can skip the part where we dirty the folio,
> find the dirty folios and wait for their writeback.  We can just mark the
> folio as writeback-in-progress and start the IO there and then (where we
> know exactly which blocks need to be written, so possibly a smaller I/O
> than writing the entire page).  The existing "find dirty pages, start
> I/O and wait on them" code will end up waiting on this pre-started I/O
> to complete, even though it didn't start any of its own I/O.
> 
> The important part is patch 9.  Everything before it is boring prep work.
> I'm in two minds about whether to keep the 'write_through' bool, or
> remove it.  So feel to read patches 9+10 squashed together, or as if
> patch 10 doesn't exist.  Whichever feels better.
> 
> The biggest problem with all this is that iomap doesn't have the necessary
> information to cause extent allocation, so if you do an O_SYNC write
> to an extent which is HOLE or DELALLOC, we can't do this optimisation.
> Maybe that doesn't really matter for interesting applications.  I suspect
> it doesn't matter for ZoneFS.
> 
> Matthew Wilcox (Oracle) (10):
>   iomap: Pass struct iomap to iomap_alloc_ioend()
>   iomap: Remove iomap_writepage_ctx from iomap_can_add_to_ioend()
>   iomap: Do not pass iomap_writepage_ctx to iomap_add_to_ioend()
>   iomap: Accept a NULL iomap_writepage_ctx in iomap_submit_ioend()
>   iomap: Allow a NULL writeback_control argument to iomap_alloc_ioend()
>   iomap: Pass a length to iomap_add_to_ioend()
>   iomap: Reorder functions
>   iomap: Reorder functions
>   iomap: Add writethrough for O_SYNC
>   remove write_through bool
> 
>  fs/iomap/buffered-io.c | 492 +++++++++++++++++++++++------------------
>  1 file changed, 273 insertions(+), 219 deletions(-)
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/10] Make O_SYNC writethrough
  2022-05-03  6:39 [RFC PATCH 00/10] Make O_SYNC writethrough Matthew Wilcox (Oracle)
                   ` (10 preceding siblings ...)
  2022-05-03 12:57 ` [RFC PATCH 00/10] Make O_SYNC writethrough Damien Le Moal
@ 2022-05-05  4:58 ` Dave Chinner
  2022-05-05  5:07   ` Matthew Wilcox
  11 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2022-05-05  4:58 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, Damien Le Moal, Christoph Hellwig, Darrick J . Wong

On Tue, May 03, 2022 at 07:39:58AM +0100, Matthew Wilcox (Oracle) wrote:
> This is very much in development and basically untested, but Damian
> started describing to me something that he wanted, and I told him he
> was asking for the wrong thing, and I already had this patch series
> in progress.  If someone wants to pick it up and make it mergable,
> that'd be grand.

That's very non-descriptive. Saying "someone wanted something, I said it's
wrong, so here's a patch series about something else" doesn't tell me anything
about the problem that Damien was trying to solve.

> The idea is that an O_SYNC write is always going to want to write, and
> we know that at the time we're storing into the page cache.  So for an
> otherwise clean folio, we can skip the part where we dirty the folio,
> find the dirty folios and wait for their writeback.

What exactly is this shortcut trying to optimise away? A bit of CPU
time?

O_SYNC is already a write-through operation - we just call
filemap_write_and_wait_range() once we've copied the data into the
page cache and dirtied the page. What does skipping the dirty page
step gain us?

> We can just mark the
> folio as writeback-in-progress and start the IO there and then (where we
> know exactly which blocks need to be written, so possibly a smaller I/O
> than writing the entire page).  The existing "find dirty pages, start
> I/O and wait on them" code will end up waiting on this pre-started I/O
> to complete, even though it didn't start any of its own I/O.
> 
> The important part is patch 9.  Everything before it is boring prep work.
> I'm in two minds about whether to keep the 'write_through' bool, or
> remove it.  So feel to read patches 9+10 squashed together, or as if
> patch 10 doesn't exist.  Whichever feels better.
> 
> The biggest problem with all this is that iomap doesn't have the necessary
> information to cause extent allocation, so if you do an O_SYNC write
> to an extent which is HOLE or DELALLOC, we can't do this optimisation.
> Maybe that doesn't really matter for interesting applications.  I suspect
> it doesn't matter for ZoneFS.

This seems like a lot of complexity for only partial support. It
introduces races with page dirtying and cleaning, it likely has
interesting issues with all the VM dirty/writeback accounting
(because this series is using a completion path that expects the
submission path has done its side of the accounting) and it only
works if certain preconditions are met.

And I still don't know what problem this code actually trying to
solve....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/10] Make O_SYNC writethrough
  2022-05-05  4:58 ` Dave Chinner
@ 2022-05-05  5:07   ` Matthew Wilcox
  2022-05-05  7:05     ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2022-05-05  5:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, Damien Le Moal, Christoph Hellwig, Darrick J . Wong

On Thu, May 05, 2022 at 02:58:21PM +1000, Dave Chinner wrote:
> On Tue, May 03, 2022 at 07:39:58AM +0100, Matthew Wilcox (Oracle) wrote:
> > This is very much in development and basically untested, but Damian
> > started describing to me something that he wanted, and I told him he
> > was asking for the wrong thing, and I already had this patch series
> > in progress.  If someone wants to pick it up and make it mergable,
> > that'd be grand.
> 
> That've very non-descriptive. Saying "someone wanted something, I said it's
> wrong, so here's a patch series about something else" doesn't tell me anything
> about the problem that Damien was trying to solve.

Sorry about that.  I was a bit jet-lagged when I wrote it.

> > The idea is that an O_SYNC write is always going to want to write, and
> > we know that at the time we're storing into the page cache.  So for an
> > otherwise clean folio, we can skip the part where we dirty the folio,
> > find the dirty folios and wait for their writeback.
> 
> What exactly is this shortcut trying to optimise away? A bit of CPU
> time?
> 
> O_SYNC is already a write-through operation - we just call
> filemap_write_and_wait_range() once we've copied the data into the
> page cache and dirtied the page. What does skipping the dirty page
> step gain us?

Two things; the original reason I was doing this, and Damien's reason.

My reason: a small write to a large folio will cause the entire folio to
be dirtied and written.  This is unnecessary with O_SYNC; we're about
to force the write anyway; we may as well do the write of the part of
the folio which is modified, and skip the whole dirtying step.

Damien's reason: It's racy.  Somebody else (... even vmscan) could cause
folios to be written out of order.  This matters for ZoneFS because
writing a file out of order is Not Allowed.  He was looking at relaxing
O_DIRECT, but I think what he really wants is a writethrough page cache.

> > We can just mark the
> > folio as writeback-in-progress and start the IO there and then (where we
> > know exactly which blocks need to be written, so possibly a smaller I/O
> > than writing the entire page).  The existing "find dirty pages, start
> > I/O and wait on them" code will end up waiting on this pre-started I/O
> > to complete, even though it didn't start any of its own I/O.
> > 
> > The important part is patch 9.  Everything before it is boring prep work.
> > I'm in two minds about whether to keep the 'write_through' bool, or
> > remove it.  So feel to read patches 9+10 squashed together, or as if
> > patch 10 doesn't exist.  Whichever feels better.
> > 
> > The biggest problem with all this is that iomap doesn't have the necessary
> > information to cause extent allocation, so if you do an O_SYNC write
> > to an extent which is HOLE or DELALLOC, we can't do this optimisation.
> > Maybe that doesn't really matter for interesting applications.  I suspect
> > it doesn't matter for ZoneFS.
> 
> This seems like a lot of complexity for only partial support. It
> introduces races with page dirtying and cleaning, it likely has
> interesting issues with all the VM dirty/writeback accounting
> (because this series is using a completion path that expects the
> submission path has done it's side of the accounting) and it only
> works in certain preconditions are met.

If we want to have better O_SYNC support, I think we can improve those
conditions.  For example, XFS could preallocate the blocks before calling
into iomap.  Since it's an O_SYNC write, everything is already terrible.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/10] Make O_SYNC writethrough
  2022-05-05  5:07   ` Matthew Wilcox
@ 2022-05-05  7:05     ` Dave Chinner
  2022-05-06 12:03       ` Damien Le Moal
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2022-05-05  7:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, Damien Le Moal, Christoph Hellwig, Darrick J . Wong

On Thu, May 05, 2022 at 06:07:11AM +0100, Matthew Wilcox wrote:
> On Thu, May 05, 2022 at 02:58:21PM +1000, Dave Chinner wrote:
> > On Tue, May 03, 2022 at 07:39:58AM +0100, Matthew Wilcox (Oracle) wrote:
> > > This is very much in development and basically untested, but Damian
> > > started describing to me something that he wanted, and I told him he
> > > was asking for the wrong thing, and I already had this patch series
> > > in progress.  If someone wants to pick it up and make it mergable,
> > > that'd be grand.
> > 
> > That've very non-descriptive. Saying "someone wanted something, I said it's
> > wrong, so here's a patch series about something else" doesn't tell me anything
> > about the problem that Damien was trying to solve.
> 
> Sorry about that.  I was a bit jet-lagged when I wrote it.
> 
> > > The idea is that an O_SYNC write is always going to want to write, and
> > > we know that at the time we're storing into the page cache.  So for an
> > > otherwise clean folio, we can skip the part where we dirty the folio,
> > > find the dirty folios and wait for their writeback.
> > 
> > What exactly is this shortcut trying to optimise away? A bit of CPU
> > time?
> > 
> > O_SYNC is already a write-through operation - we just call
> > filemap_write_and_wait_range() once we've copied the data into the
> > page cache and dirtied the page. What does skipping the dirty page
> > step gain us?
> 
> Two things; the original reason I was doing this, and Damien's reason.
> 
> My reason: a small write to a large folio will cause the entire folio to
> be dirtied and written.

If that's a problem, then shouldn't we track sub-folio dirty
regions? Because normal non-O_SYNC buffered writes will still cause
this to happen...

> This is unnecessary with O_SYNC; we're about
> to force the write anyway; we may as well do the write of the part of
> the folio which is modified, and skip the whole dirtying step.

What happens when another part of the folio is concurrently dirtied?

What happens if the folio already has other parts of it under
writeback? How do we avoid and/or resolve concurrent "partial folio
writeback" race conditions?

> Damien's reason: It's racy.  Somebody else (... even vmscan) could cause
> folios to be written out of order.  This matters for ZoneFS because
> writing a file out of order is Not Allowed.  He was looking at relaxing
> O_DIRECT, but I think what he really wants is a writethrough page cache.

Zonefs has other mechanisms to solve this. It already has the
inode_lock() to serialise all dio writes to a zone because they must
be append IOs. i.e. new writes must be located at the write pointer,
and the write pointer does not get incremented until the IO
has been submitted (for DIO+AIO) or completed (for non-AIO).

Hence for buffered writes, we have the same situation: once we have
sampled the zone write pointer to get the offset, we cannot start
another write until the current IO has been submitted.

Further, for zonefs, we cannot get another write to that page cache
page *ever*; we can only get reads from it. Hence page state really
doesn't matter at all - once there is data in the page cache page,
all that can happen is it can be invalidated but it cannot change
(ah, the beauties of write-once media!). Hence the dirty state is
completely meaningless from a coherency and integrity POV, as is the
writeback state.

IOWs, for zonefs we can already ignore the page dirtying and
writeback mechanisms fairly safely. Hence we could do something like
this in the zonefs buffered write path:

- lock the inode
- sample the write pointer to get the file offset
- instantiate a page cache folio at the given offset
- copy the data into the folio, mark it up to date.
- mark it as under writeback or lock the folio to keep reclaim away
- add the page cache folio to an iov_iter
- pass the iov_iter to the direct IO write path to submit the IO and
  wait for completion.
- clear the folio writeback state.
- move the write pointer
- unlock the inode

and that gets us writethrough O_SYNC buffered writes. In fact, I
think it may even work with async writes, too, just like the DIO
write path seems to work with AIO.
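
In code, that sequence might look something like the completely
untested sketch below. zonefs_write_pointer(), zonefs_dio_write() and
zonefs_advance_write_pointer() are made-up placeholders for the
zonefs-side plumbing, and it assumes the whole write fits in a single
folio; only the folio/iov_iter calls are existing primitives:

static ssize_t zonefs_buffered_write_through(struct kiocb *iocb,
		struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	struct address_space *mapping = inode->i_mapping;
	struct folio *folio;
	struct bio_vec bv;
	struct iov_iter bvec_iter;
	size_t len = iov_iter_count(from);
	loff_t pos;
	ssize_t ret;

	inode_lock(inode);
	pos = zonefs_write_pointer(inode);	/* hypothetical helper */

	folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT,
			FGP_CREAT | FGP_LOCK, mapping_gfp_mask(mapping));
	if (!folio) {
		ret = -ENOMEM;
		goto out_inode;
	}

	/* Copy the user data into the folio and mark it up to date. */
	if (copy_page_from_iter(folio_page(folio, 0),
			offset_in_folio(folio, pos), len, from) != len) {
		folio_unlock(folio);
		folio_put(folio);
		ret = -EFAULT;
		goto out_inode;
	}
	folio_mark_uptodate(folio);

	/* Keep reclaim away while the IO is in flight. */
	folio_start_writeback(folio);
	folio_unlock(folio);

	/* Wrap the folio in a bvec iov_iter and feed it to the DIO path. */
	bv.bv_page = folio_page(folio, 0);
	bv.bv_offset = offset_in_folio(folio, pos);
	bv.bv_len = len;
	iov_iter_bvec(&bvec_iter, WRITE, &bv, 1, len);
	ret = zonefs_dio_write(iocb, &bvec_iter); /* hypothetical: submit + wait */

	folio_end_writeback(folio);
	folio_put(folio);
	if (ret > 0)
		zonefs_advance_write_pointer(inode, ret); /* hypothetical */
out_inode:
	inode_unlock(inode);
	return ret;
}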

The best part about the above mechanism is that there is
almost no new iomap, page cache or direct IO functionality required
to do this. All the magic is all in the zonefs sequential zone write
path. Hence I don't see needing to substantially modify the iomap
buffered write path to do zonefs write-through....

> > > The biggest problem with all this is that iomap doesn't have the necessary
> > > information to cause extent allocation, so if you do an O_SYNC write
> > > to an extent which is HOLE or DELALLOC, we can't do this optimisation.
> > > Maybe that doesn't really matter for interesting applications.  I suspect
> > > it doesn't matter for ZoneFS.
> > 
> > This seems like a lot of complexity for only partial support. It
> > introduces races with page dirtying and cleaning, it likely has
> > interesting issues with all the VM dirty/writeback accounting
> > (because this series is using a completion path that expects the
> > submission path has done its side of the accounting) and it only
> > works when certain preconditions are met.
> 
> If we want to have better O_SYNC support, I think we can improve those
> conditions.  For example, XFS could preallocate the blocks before calling
> into iomap.  Since it's an O_SYNC write, everything is already terrible.

Ugh, that's even worse.

Quite frankly, designing pure O_SYNC writethrough is a classic case
of not seeing the forest for the trees.  What we actually need is
*async* page cache write-through.

Ever wondered why you can only get 60-70k write IOPS out of buffered
writes? e.g untarring really large tarballs of small files always
end up at 60-70k write IOPS regardless of filesystem, how many
threads you break the writes up into, etc? io_uring buffered writes
won't save us here, either, because it's not the data ingest side
that limits performance. Yeah, it's the writeback side that limits
us.

There's a simple reason for that: the flusher thread becomes CPU
bound doing the writeback of hundreds of thousands of dirty inodes.

Writeback caching is a major bottleneck on high performance storage;
when your storage can do 6.5GB/s and buffered writes can only copy
into the page cache and flush to disk at 2GB/s (typically lower than
this!), writeback caching is robbing us of major amounts of
performance.

It's even worse with small files - the flusher thread becomes CPU
bound at 60-80k IOPS on XFS, ext4 and btrfs because block allocation
is an expensive operation. On a device with a couple of million IOPS
available, having the kernel top out at under 5% of its capacity is
pretty bad.

However, if I do a hacky "writethrough" of small writes by calling
filemap_flush() in ->release() (i.e. when close is called after the
write), then multithreaded small file write workloads can push
*several hundred thousand* write IOPS to disk before I run out of
CPU.
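
The hack is essentially nothing more than this in the filesystem's
->release() method (a sketch of the experiment described above, with
a made-up foo_release() name, not actual merged code):

static int foo_release(struct inode *inode, struct file *file)
{
	/*
	 * Kick off async writeback as soon as the writer closes the
	 * file instead of waiting for the flusher thread to walk the
	 * dirty inode list; filemap_flush() does not wait for the IO.
	 */
	if (file->f_mode & FMODE_WRITE)
		filemap_flush(inode->i_mapping);
	return 0;
}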

Write-through enables submission concurrency for small IOs. It
avoids lots of page state management overhead for high data
throughput IO. That's where all the performance wins with high end
storage are - keeping the pipes full. Buffered writes stopped being
able to do that years ago, and modern PCIe4 SSDs have only made that
gulf wider again.

IOWs, what we actually need is a clean page cache write-through
model that doesn't have any nasty quirks or side effects. IOWs, I
think you are on the right conceptual path, just the wrong
architectural path.

My preference would be for the page cache write-through mode to be a
thin shim over the DIO write path. The DIO write path is a highly
concurrent async IO engine - it's designed to handle everything
AIO and io_uring can throw at it. Forget about "direct IO", just
treat it as a high concurrency, high throughput async IO engine.

Hence for page cache write-through, all we do is instantiate the
page cache page, lock it, copy the data into it and then pass it to
the direct IO write implementation to submit it and then unlock it
on completion.  There's nothing else we really need to do - the DIO
path already handles everything else.

And if we use page/folio locking for concurrency synchronisation of
write-through mode instead of an exclusive inode lock, the model
allows for concurrent, non-overlapping buffered writes to a single
inode, just like we have for direct IO. It also allows us to avoid
all dirty and writeback page cache and VM state/accounting
manipulations. And by using the page/folio lock we avoid racing
state transitions until the write-through op is complete.

Sure, if there is an existing dirty folio in the page cache, then
punt it down the existing buffered IO path - something else is
already using write-back caching for this folio (e.g. mmap), so we
don't want to deal with trying to change modes.
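
A rough sketch of that mode selection for an iomap based filesystem
might look like this; iomap_file_buffered_write() is real, while
foo_write_through() and foo_iomap_ops are placeholders for the
filesystem-specific pieces:

static ssize_t foo_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	struct folio *folio;
	bool dirty = false;

	/*
	 * If the target folio is already dirty, something else (e.g.
	 * mmap) is using write-back caching for it, so stay on the
	 * normal buffered write path.
	 */
	folio = filemap_get_folio(mapping, iocb->ki_pos >> PAGE_SHIFT);
	if (folio) {
		dirty = folio_test_dirty(folio);
		folio_put(folio);
	}
	if (dirty)
		return iomap_file_buffered_write(iocb, from, &foo_iomap_ops);

	/*
	 * Otherwise copy into a locked/writeback folio and push it
	 * straight down the DIO submission path (hypothetical helper).
	 */
	return foo_write_through(iocb, from);
}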

But otherwise, we don't want to go near the normal buffered write
paths - they are all optimised for *write back* caching.  From an IO
and filesystem allocation optimisation perspective, page-cache
write-through IO is exactly the same as direct IO writes.  Hence we
really want page cache write-through to use the same allocator
paths and optimisations as the direct IO path, not the existing
buffered write path.

This sort of setup will get write-through buffered writes close to
the throughput of what direct IO is capable of on modern storage. It
won't quite match it, because DIO is zero copy and buffered IO is
copy-once, but it'll get a *lot* closer than it does now....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/10] Make O_SYNC writethrough
  2022-05-05  7:05     ` Dave Chinner
@ 2022-05-06 12:03       ` Damien Le Moal
  2022-05-10  1:26         ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2022-05-06 12:03 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox
  Cc: linux-fsdevel, Christoph Hellwig, Darrick J . Wong

On 2022/05/05 16:05, Dave Chinner wrote:
> On Thu, May 05, 2022 at 06:07:11AM +0100, Matthew Wilcox wrote:
>> On Thu, May 05, 2022 at 02:58:21PM +1000, Dave Chinner wrote:
>>> On Tue, May 03, 2022 at 07:39:58AM +0100, Matthew Wilcox (Oracle) wrote:
>>>> This is very much in development and basically untested, but Damien
>>>> started describing to me something that he wanted, and I told him he
>>>> was asking for the wrong thing, and I already had this patch series
>>>> in progress.  If someone wants to pick it up and make it mergable,
>>>> that'd be grand.
>>>
>>> That's very non-descriptive. Saying "someone wanted something, I said it's
>>> wrong, so here's a patch series about something else" doesn't tell me anything
>>> about the problem that Damien was trying to solve.
>>
>> Sorry about that.  I was a bit jet-lagged when I wrote it.
>>
>>>> The idea is that an O_SYNC write is always going to want to write, and
>>>> we know that at the time we're storing into the page cache.  So for an
>>>> otherwise clean folio, we can skip the part where we dirty the folio,
>>>> find the dirty folios and wait for their writeback.
>>>
>>> What exactly is this shortcut trying to optimise away? A bit of CPU
>>> time?
>>>
>>> O_SYNC is already a write-through operation - we just call
>>> filemap_write_and_wait_range() once we've copied the data into the
>>> page cache and dirtied the page. What does skipping the dirty page
>>> step gain us?
>>
>> Two things; the original reason I was doing this, and Damien's reason.
>>
>> My reason: a small write to a large folio will cause the entire folio to
>> be dirtied and written.
> 
> If that's a problem, then shouldn't we track sub-folio dirty
> regions? Because normal non-O_SYNC buffered writes will still cause
> this to happen...
> 
>> This is unnecessary with O_SYNC; we're about
>> to force the write anyway; we may as well do the write of the part of
>> the folio which is modified, and skip the whole dirtying step.
> 
> What happens when another part of the folio is concurrently dirtied?
> 
> What happens if the folio already has other parts of it under
> writeback? How do we avoid and/or resolve concurrent "partial folio
> writeback" race conditions?
> 
>> Damien's reason: It's racy.  Somebody else (... even vmscan) could cause
>> folios to be written out of order.  This matters for ZoneFS because
>> writing a file out of order is Not Allowed.  He was looking at relaxing
>> O_DIRECT, but I think what he really wants is a writethrough page cache.
> 
> Zonefs has other mechanisms to solve this. It already has the
> inode_lock() to serialise all dio writes to a zone because they must
> be append IOs. i.e. new writes must be located at the write pointer,
> and the write pointer does not get incremented until the IO
> has been submitted (for DIO+AIO) or completed (for non-AIO).
> 
> Hence for buffered writes, we have the same situation: once we have
> sampled the zone write pointer to get the offset, we cannot start
> another write until the current IO has been submitted.
> 
> Further, for zonefs, we cannot get another write to that page cache
> page *ever*; we can only get reads from it. Hence page state really
> doesn't matter at all - once there is data in the page cache page,
> all that can happen is it can be invalidated but it cannot change
> (ah, the beauties of write-once media!). Hence the dirty state is
> completely meaningless from a coherency and integrity POV, as is the
> writeback state.
> 
> IOWs, for zonefs we can already ignore the page dirtying and
> writeback mechanisms fairly safely. Hence we could do something like
> this in the zonefs buffered write path:
> 
> - lock the inode
> - sample the write pointer to get the file offset
> - instantiate a page cache folio at the given offset
> - copy the data into the folio, mark it up to date.
> - mark it as under writeback or lock the folio to keep reclaim away
> - add the page cache folio to an iov_iter
> - pass the iov_iter to the direct IO write path to submit the IO and
>   wait for completion.
> - clear the folio writeback state.
> - move the write pointer
> - unlock the inode

That was my initial idea. When I talked about it with Matthew, he mentioned his
write-through work and posted it. For my use case, I do like what he has done
since that would avoid the need to add most of the above machinery to zonefs.
But if there are no benefits anywhere else, adding this as a zonefs only thing
is fine with me.

> and that gets us writethrough O_SYNC buffered writes. In fact, I
> think it may even work with async writes, too, just like the DIO
> write path seems to work with AIO.

Yes, I think this all works for AIOs too since we use the "soft" write pointer
position updated on BIO submit, not completion.

> The best part about the above mechanism is that there is
> almost no new iomap, page cache or direct IO functionality required
> to do this. All the magic is all in the zonefs sequential zone write
> path. Hence I don't see needing to substantially modify the iomap
> buffered write path to do zonefs write-through....

Indeed. The only additional constraint is that zonefs must still ensure that
writes are physical block aligned to avoid iomap attempting to do a
read-modify-rewrite of the last written sector of a zone. Just need to think
about potential corner cases when the page size is larger than the device block
size. Could the partially filled last page of a file ever end up needing a
read-modify-write? I do not think so, but need to check.

>>>> The biggest problem with all this is that iomap doesn't have the necessary
>>>> information to cause extent allocation, so if you do an O_SYNC write
>>>> to an extent which is HOLE or DELALLOC, we can't do this optimisation.
>>>> Maybe that doesn't really matter for interesting applications.  I suspect
>>>> it doesn't matter for ZoneFS.
>>>
>>> This seems like a lot of complexity for only partial support. It
>>> introduces races with page dirtying and cleaning, it likely has
>>> interesting issues with all the VM dirty/writeback accounting
>>> (because this series is using a completion path that expects the
>>> submission path has done its side of the accounting) and it only
>>> works when certain preconditions are met.
>>
>> If we want to have better O_SYNC support, I think we can improve those
>> conditions.  For example, XFS could preallocate the blocks before calling
>> into iomap.  Since it's an O_SYNC write, everything is already terrible.
> 
> Ugh, that's even worse.
> 
> Quite frankly, designing pure O_SYNC writethrough is a classic case
> of not seeing the forest for the trees.  What we actually need is
> *async* page cache write-through.
> 
> Ever wondered why you can only get 60-70k write IOPS out of buffered
> writes? e.g. untarring really large tarballs of small files always
> ends up at 60-70k write IOPS regardless of filesystem, how many
> threads you break the writes up into, etc? io_uring buffered writes
> won't save us here, either, because it's not the data ingest side
> that limits performance. Yeah, it's the writeback side that limits
> us.
> 
> There's a simple reason for that: the flusher thread becomes CPU
> bound doing the writeback of hundreds of thousands of dirty inodes.
> 
> Writeback caching is a major bottleneck on high performance storage;
> when your storage can do 6.5GB/s and buffered writes can only copy
> into the page cache and flush to disk at 2GB/s (typically lower than
> this!), writeback caching is robbing us of major amounts of
> performance.
> 
> It's even worse with small files - the flusher thread becomes CPU
> bound at 60-80k IOPS on XFS, ext4 and btrfs because block allocation
> is an expensive operation. On a device with a couple of million IOPS
> available, having the kernel top out at under 5% of its capacity is
> pretty bad.
> 
> However, if I do a hacky "writethrough" of small writes by calling
> filemap_flush() in ->release() (i.e. when close is called after the
> write), then multithreaded small file write workloads can push
> *several hundred thousand* write IOPS to disk before I run out of
> CPU.
> 
> Write-through enables submission concurrency for small IOs. It
> avoids lots of page state management overhead for high data
> throughput IO. That's where all the performance wins with high end
> storage are - keeping the pipes full. Buffered writes stopped being
> able to do that years ago, and modern PCIe4 SSDs have only made that
> gulf wider again.
> 
> IOWs, what we actually need is a clean page cache write-through
> model that doesn't have any nasty quirks or side effects. IOWs, I
> think you are on the right conceptual path, just the wrong
> architectural path.
> 
> My preference would be for the page cache write-through mode to be a
> thin shim over the DIO write path. The DIO write path is a highly
> concurrent async IO engine - it's designed to handle everything
> AIO and io_uring can throw at it. Forget about "direct IO", just
> treat it as a high concurrency, high throughput async IO engine.
> 
> Hence for page cache write-through, all we do is instantiate the
> page cache page, lock it, copy the data into it and then pass it to
> the direct IO write implementation to submit it and then unlock it
> on completion.  There's nothing else we really need to do - the DIO
> path already handles everything else.

Yes ! And the special case for zonefs would actually implement almost exactly
this, modulo the additional requirement of the write alignment that is purely
due to zonefs/device constraint.

> 
> And if we use page/folio locking for concurrency synchronisation of
> write-through mode instead of an exclusive inode lock, the model
> allows for concurrent, non-overlapping buffered writes to a single
> inode, just like we have for direct IO. It also allows us to avoid
> all dirty and writeback page cache and VM state/accounting
> manipulations. And by using the page/folio lock we avoid racing
> state transitions until the write-through op is complete.
> 
> Sure, if there is an existing dirty folio in the page cache, then
> punt it down the existing buffered IO path - something else is
> already using write-back caching for this folio (e.g. mmap), so we
> don't want to deal with trying to change modes.
> 
> But otherwise, we don't want to go near the normal buffered write
> paths - they are all optimised for *write back* caching.  From an IO
> and filesystem allocation optimisation perspective, page-cache
> write-through IO is exactly the same as direct IO writes.  Hence we
> really want page cache write-through to use the same allocator
> paths and optimisations as the direct IO path, not the existing
> buffered write path.
> 
> This sort of setup will get write-through buffered writes close to
> the throughput of what direct IO is capable of on modern storage. It
> won't quite match it, because DIO is zero copy and buffered IO is
> copy-once, but it'll get a *lot* closer than it does now....
> 
> Cheers,
> 
> Dave.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/10] Make O_SYNC writethrough
  2022-05-06 12:03       ` Damien Le Moal
@ 2022-05-10  1:26         ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2022-05-10  1:26 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Matthew Wilcox, linux-fsdevel, Christoph Hellwig, Darrick J . Wong

On Fri, May 06, 2022 at 09:03:40PM +0900, Damien Le Moal wrote:
> On 2022/05/05 16:05, Dave Chinner wrote:
> > On Thu, May 05, 2022 at 06:07:11AM +0100, Matthew Wilcox wrote:
> >> On Thu, May 05, 2022 at 02:58:21PM +1000, Dave Chinner wrote:
> >>> On Tue, May 03, 2022 at 07:39:58AM +0100, Matthew Wilcox (Oracle) wrote:
> >>>> This is very much in development and basically untested, but Damien
> >>>> started describing to me something that he wanted, and I told him he
> >>>> was asking for the wrong thing, and I already had this patch series
> >>>> in progress.  If someone wants to pick it up and make it mergable,
> >>>> that'd be grand.
> >>>
> >>> That's very non-descriptive. Saying "someone wanted something, I said it's
> >>> wrong, so here's a patch series about something else" doesn't tell me anything
> >>> about the problem that Damien was trying to solve.
> >>
> >> Sorry about that.  I was a bit jet-lagged when I wrote it.
> >>
> >>>> The idea is that an O_SYNC write is always going to want to write, and
> >>>> we know that at the time we're storing into the page cache.  So for an
> >>>> otherwise clean folio, we can skip the part where we dirty the folio,
> >>>> find the dirty folios and wait for their writeback.
> >>>
> >>> What exactly is this shortcut trying to optimise away? A bit of CPU
> >>> time?
> >>>
> >>> O_SYNC is already a write-through operation - we just call
> >>> filemap_write_and_wait_range() once we've copied the data into the
> >>> page cache and dirtied the page. What does skipping the dirty page
> >>> step gain us?
> >>
> >> Two things; the original reason I was doing this, and Damien's reason.
> >>
> >> My reason: a small write to a large folio will cause the entire folio to
> >> be dirtied and written.
> > 
> > If that's a problem, then shouldn't we track sub-folio dirty
> > regions? Because normal non-O_SYNC buffered writes will still cause
> > this to happen...
> > 
> >> This is unnecessary with O_SYNC; we're about
> >> to force the write anyway; we may as well do the write of the part of
> >> the folio which is modified, and skip the whole dirtying step.
> > 
> > What happens when another part of the folio is concurrently dirtied?
> > 
> > What happens if the folio already has other parts of it under
> > writeback? How do we avoid and/or resolve concurrent "partial folio
> > writeback" race conditions?
> > 
> >> Damien's reason: It's racy.  Somebody else (... even vmscan) could cause
> >> folios to be written out of order.  This matters for ZoneFS because
> >> writing a file out of order is Not Allowed.  He was looking at relaxing
> >> O_DIRECT, but I think what he really wants is a writethrough page cache.
> > 
> > Zonefs has other mechanisms to solve this. It already has the
> > inode_lock() to serialise all dio writes to a zone because they must
> > be append IOs. i.e. new writes must be located at the write pointer,
> > and the write pointer does not get incremented until the IO
> > has been submitted (for DIO+AIO) or completed (for non-AIO).
> > 
> > Hence for buffered writes, we have the same situation: once we have
> > sampled the zone write pointer to get the offset, we cannot start
> > another write until the current IO has been submitted.
> > 
> > Further, for zonefs, we cannot get another write to that page cache
> > page *ever*; we can only get reads from it. Hence page state really
> > doesn't matter at all - once there is data in the page cache page,
> > all that can happen is it can be invalidated but it cannot change
> > (ah, the beauties of write-once media!). Hence the dirty state is
> > completely meaningless from a coherency and integrity POV, as is the
> > writeback state.
> > 
> > IOWs, for zonefs we can already ignore the page dirtying and
> > writeback mechanisms fairly safely. Hence we could do something like
> > this in the zonefs buffered write path:
> > 
> > - lock the inode
> > - sample the write pointer to get the file offset
> > - instantiate a page cache folio at the given offset
> > - copy the data into the folio, mark it up to date.
> > - mark it as under writeback or lock the folio to keep reclaim away
> > - add the page cache folio to an iov_iter
> > - pass the iov_iter to the direct IO write path to submit the IO and
> >   wait for completion.
> > - clear the folio writeback state.
> > - move the write pointer
> > - unlock the inode
> 
> That was my initial idea. When I talked about it with Matthew, he mentioned his
> write-through work and posted it. For my use case, I do like what he has done
> since that would avoid the need to add most of the above machinery to zonefs.
> But if there are no benefits anywhere else, adding this as a zonefs only thing
> is fine with me.

I figured as much, but I wanted to get it documented and to set the
context of being able to use the existing DIO path for the
write-through functionality. Because...

> 
> > and that gets us writethrough O_SYNC buffered writes. In fact, I
> > think it may even work with async writes, too, just like the DIO
> > write path seems to work with AIO.
> 
> Yes, I think this all works for AIOs too since we use the "soft" write pointer
> position updated on BIO submit, not completion.
> 
> > The best part about the above mechanism is that there is
> > almost no new iomap, page cache or direct IO functionality required
> > to do this. All the magic is all in the zonefs sequential zone write
> > path. Hence I don't see needing to substantially modify the iomap
> > buffered write path to do zonefs write-through....
> 
> Indeed. The only additional constraint is that zonefs must still ensure that
> writes are physical block aligned to avoid iomap attempting to do a
> read-modify-rewrite of the last written sector of a zone.

Well, we get that with page cache IO - the IO still has to be
filesystem block sized and shaped, and the page cache has to handle
instantiation of data within the folio/page in the case that buffered
write IO does not align to filesystem blocks.

> Just need to think
> about potential corner cases when the page size is larger than the device block
> size. Could the partially filled last page of a file ever end up needing a
> read-modify-write? I do not think so, but need to check.

Yes, if the page is getting marked up to date as a whole during the
write. In that case, the portion of the page that has current data
over it needs to be read in before the page can be marked up to date.
If you are doing repeated partial page extension writes, the page
should already have the data in it and be up to date before the new
extending write adds more data to the page. Hence it's only really a
problem for the initial cold cache write....
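
A minimal sketch of that check, assuming a hypothetical
foo_read_folio_range() helper to bring the existing data in (the
predicate itself only uses existing folio helpers):

static int foo_prepare_partial_folio(struct inode *inode,
		struct folio *folio, loff_t pos)
{
	/* End of valid data: i_size, i.e. the write pointer for zonefs. */
	loff_t valid_end = i_size_read(inode);
	size_t head = offset_in_folio(folio, pos);
	size_t to_read;

	/* Repeated extension writes: the data is already cached. */
	if (folio_test_uptodate(folio))
		return 0;
	/* Nothing valid on disk before the new data: nothing to read. */
	if (head == 0 || folio_pos(folio) >= valid_end)
		return 0;

	/* Cold cache, partial tail page: read in the head of the folio. */
	to_read = min_t(loff_t, head, valid_end - folio_pos(folio));
	return foo_read_folio_range(inode, folio, 0, to_read);
}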

> > IOWs, what we actually need is a clean page cache write-through
> > model that doesn't have any nasty quirks or side effects. IOWs, I
> > think you are on the right conceptual path, just the wrong
> > architectural path.
> > 
> > My preference would be for the page cache write-through mode to be a
> > thin shim over the DIO write path. The DIO write path is a highly
> > concurrent async IO engine - it's designed to handle everything
> > AIO and io_uring can throw at it. Forget about "direct IO", just
> > treat it as a high concurrency, high throughput async IO engine.
> > 
> > Hence for page cache write-through, all we do is instantiate the
> > page cache page, lock it, copy the data into it and then pass it to
> > the direct IO write implementation to submit it and then unlock it
> > on completion.  There's nothing else we really need to do - the DIO
> > path already handles everything else.
> 
> Yes ! And the special case for zonefs would actually implement almost exactly
> this, modulo the additional requirement of the write alignment that is purely
> due to zonefs/device constraint.

Right - if we have a model like this, then zonefs can likely just
use the generic implementation as we may well need the
cold-cache-partial-tail-read operations in the generic write-through
case, too. That was why I initially outlined the zonefs-specific
implementation above - it shows how a generic implementation can
support even constrained filesystems like zonefs or sub-page partial
IO on block size < page size filesystems....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

Thread overview: 17+ messages
2022-05-03  6:39 [RFC PATCH 00/10] Make O_SYNC writethrough Matthew Wilcox (Oracle)
2022-05-03  6:39 ` [RFC PATCH 01/10] iomap: Pass struct iomap to iomap_alloc_ioend() Matthew Wilcox (Oracle)
2022-05-03  6:40 ` [RFC PATCH 02/10] iomap: Remove iomap_writepage_ctx from iomap_can_add_to_ioend() Matthew Wilcox (Oracle)
2022-05-03  6:40 ` [RFC PATCH 03/10] iomap: Do not pass iomap_writepage_ctx to iomap_add_to_ioend() Matthew Wilcox (Oracle)
2022-05-03  6:40 ` [RFC PATCH 04/10] iomap: Accept a NULL iomap_writepage_ctx in iomap_submit_ioend() Matthew Wilcox (Oracle)
2022-05-03  6:40 ` [RFC PATCH 05/10] iomap: Allow a NULL writeback_control argument to iomap_alloc_ioend() Matthew Wilcox (Oracle)
2022-05-03  6:40 ` [RFC PATCH 06/10] iomap: Pass a length to iomap_add_to_ioend() Matthew Wilcox (Oracle)
2022-05-03  6:40 ` [RFC PATCH 07/10] iomap: Reorder functions Matthew Wilcox (Oracle)
2022-05-03  6:40 ` [RFC PATCH 08/10] " Matthew Wilcox (Oracle)
2022-05-03  6:40 ` [RFC PATCH 09/10] iomap: Add writethrough for O_SYNC Matthew Wilcox (Oracle)
2022-05-03  6:40 ` [RFC PATCH 10/10] remove write_through bool Matthew Wilcox (Oracle)
2022-05-03 12:57 ` [RFC PATCH 00/10] Make O_SYNC writethrough Damien Le Moal
2022-05-05  4:58 ` Dave Chinner
2022-05-05  5:07   ` Matthew Wilcox
2022-05-05  7:05     ` Dave Chinner
2022-05-06 12:03       ` Damien Le Moal
2022-05-10  1:26         ` Dave Chinner
