linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v3 0/3] iomap: avoid soft lockup warnings on large ioends
@ 2021-05-17 17:17 Brian Foster
  2021-05-17 17:17 ` [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context Brian Foster
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Brian Foster @ 2021-05-17 17:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Hi all,

There's been a bit more feedback on v2 of this series so here's a v3
with some small changes. Patch 1 is unchanged and just allows iomap
ioend completion to reschedule when invoked from non-atomic context.
Matthew Wilcox argued that ioends larger than the 256kB-1MB or so range
should probably be processed outside of bio completion context, so patch
2 updates XFS to queue completion of ioends >=1MB (based on 4k pages) to
the same workqueue used for processing ioends that require post I/O
metadata changes. Finally, there's been some debate around whether we
should continue to construct somewhat arbitrarily large ioends from a
latency perspective, independent of the soft lockup warning that's been
reproduced when processing ioends with tens of GBs worth of pages. Dave
Chinner had proposed an ioend size limit of ~16MB, so patch 3 is an RFC
for that change (and includes an explanatory comment written by Dave).

This series survives fstests and I've run some basic fio buffered write
(overwrites only) performance testing to measure the potential latency
hit at the size threshold. fio shows an average latency increase of
~20us with 1MB random writes at a reduction of ~10 IOPS, while 4k random
writes show basically no change (as expected). I'm happy to run further
tests on request. Thoughts, reviews, flames appreciated.

Brian

v3:
- Rebase.
- Change wq completion size threshold to 1MB.
- Append RFC for 16MB ioend size limit.
v2: https://lore.kernel.org/linux-xfs/20201005152102.15797-1-bfoster@redhat.com/
- Fix type in macro.
v1: https://lore.kernel.org/linux-xfs/20201002153357.56409-1-bfoster@redhat.com/
rfc: https://lore.kernel.org/linux-fsdevel/20200825144917.GA321765@bfoster/

Brian Foster (3):
  iomap: resched ioend completion when in non-atomic context
  xfs: kick large ioends to completion workqueue
  iomap: bound ioend size to 4096 pages

 fs/iomap/buffered-io.c | 21 +++++++++++++--------
 fs/xfs/xfs_aops.c      | 18 +++++++++++++++---
 include/linux/iomap.h  | 28 +++++++++++++++++++++++++++-
 3 files changed, 55 insertions(+), 12 deletions(-)

-- 
2.26.3


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context
  2021-05-17 17:17 [PATCH v3 0/3] iomap: avoid soft lockup warnings on large ioends Brian Foster
@ 2021-05-17 17:17 ` Brian Foster
  2021-05-17 17:54   ` Matthew Wilcox
  2021-05-22  7:45   ` Ming Lei
  2021-05-17 17:17 ` [PATCH v3 2/3] xfs: kick large ioends to completion workqueue Brian Foster
  2021-05-17 17:17 ` [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages Brian Foster
  2 siblings, 2 replies; 24+ messages in thread
From: Brian Foster @ 2021-05-17 17:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

The iomap ioend mechanism has the ability to construct very large,
contiguous bios and/or bio chains. This has been reported to lead to
soft lockup warnings in bio completion due to the amount of page
processing that occurs. Update the ioend completion path with a
parameter to indicate atomic context and insert a cond_resched()
call to avoid soft lockups in either scenario.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 15 +++++++++------
 fs/xfs/xfs_aops.c      |  2 +-
 include/linux/iomap.h  |  2 +-
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 414769a6ad11..642422775e4e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1061,7 +1061,7 @@ iomap_finish_page_writeback(struct inode *inode, struct page *page,
  * ioend after this.
  */
 static void
-iomap_finish_ioend(struct iomap_ioend *ioend, int error)
+iomap_finish_ioend(struct iomap_ioend *ioend, int error, bool atomic)
 {
 	struct inode *inode = ioend->io_inode;
 	struct bio *bio = &ioend->io_inline_bio;
@@ -1084,9 +1084,12 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 			next = bio->bi_private;
 
 		/* walk each page on bio, ending page IO on them */
-		bio_for_each_segment_all(bv, bio, iter_all)
+		bio_for_each_segment_all(bv, bio, iter_all) {
 			iomap_finish_page_writeback(inode, bv->bv_page, error,
 					bv->bv_len);
+			if (!atomic)
+				cond_resched();
+		}
 		bio_put(bio);
 	}
 	/* The ioend has been freed by bio_put() */
@@ -1099,17 +1102,17 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 }
 
 void
-iomap_finish_ioends(struct iomap_ioend *ioend, int error)
+iomap_finish_ioends(struct iomap_ioend *ioend, int error, bool atomic)
 {
 	struct list_head tmp;
 
 	list_replace_init(&ioend->io_list, &tmp);
-	iomap_finish_ioend(ioend, error);
+	iomap_finish_ioend(ioend, error, atomic);
 
 	while (!list_empty(&tmp)) {
 		ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
 		list_del_init(&ioend->io_list);
-		iomap_finish_ioend(ioend, error);
+		iomap_finish_ioend(ioend, error, atomic);
 	}
 }
 EXPORT_SYMBOL_GPL(iomap_finish_ioends);
@@ -1178,7 +1181,7 @@ static void iomap_writepage_end_bio(struct bio *bio)
 {
 	struct iomap_ioend *ioend = bio->bi_private;
 
-	iomap_finish_ioend(ioend, blk_status_to_errno(bio->bi_status));
+	iomap_finish_ioend(ioend, blk_status_to_errno(bio->bi_status), true);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9b08db45ce85..84cd6cf46b12 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -123,7 +123,7 @@ xfs_end_ioend(
 	if (!error && xfs_ioend_is_append(ioend))
 		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
 done:
-	iomap_finish_ioends(ioend, error);
+	iomap_finish_ioends(ioend, error, false);
 	memalloc_nofs_restore(nofs_flag);
 }
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index d202fd2d0f91..07f3f4e69084 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -232,7 +232,7 @@ struct iomap_writepage_ctx {
 	const struct iomap_writeback_ops *ops;
 };
 
-void iomap_finish_ioends(struct iomap_ioend *ioend, int error);
+void iomap_finish_ioends(struct iomap_ioend *ioend, int error, bool atomic);
 void iomap_ioend_try_merge(struct iomap_ioend *ioend,
 		struct list_head *more_ioends,
 		void (*merge_private)(struct iomap_ioend *ioend,
-- 
2.26.3



* [PATCH v3 2/3] xfs: kick large ioends to completion workqueue
  2021-05-17 17:17 [PATCH v3 0/3] iomap: avoid soft lockup warnings on large ioends Brian Foster
  2021-05-17 17:17 ` [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context Brian Foster
@ 2021-05-17 17:17 ` Brian Foster
  2021-05-26  1:20   ` Darrick J. Wong
  2021-05-17 17:17 ` [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages Brian Foster
  2 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2021-05-17 17:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

We've had reports of soft lockup warnings in the iomap ioend
completion path due to very large bios and/or bio chains. This
occurs because ioend completion touches every page associated with
the ioend. It generally requires exceedingly large (i.e. multi-GB)
bios or bio chains to reproduce a soft lockup warning, but even with
smaller ioends there's really no good reason to incur the cost of
potential cacheline misses in bio completion context. Divert ioends
larger than 1MB to the workqueue so completion occurs in non-atomic
context and can reschedule to avoid soft lockup warnings.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_aops.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 84cd6cf46b12..05b1bb146f17 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
 	return container_of(ctx, struct xfs_writepage_ctx, ctx);
 }
 
+/*
+ * Completion touches every page associated with the ioend. Send anything
+ * larger than 1MB (based on 4k pages) or so to the completion workqueue to
+ * avoid this work in bio completion context.
+ */
+#define XFS_LARGE_IOEND	(256ULL << PAGE_SHIFT)
+
 /*
  * Fast and loose check if this write could update the on-disk inode size.
  */
@@ -409,9 +416,14 @@ xfs_prepare_ioend(
 
 	memalloc_nofs_restore(nofs_flag);
 
-	/* send ioends that might require a transaction to the completion wq */
+	/*
+	 * Send ioends that might require a transaction or are large enough that
+	 * we don't want to do page processing in bio completion context to the
+	 * wq.
+	 */
 	if (xfs_ioend_is_append(ioend) || ioend->io_type == IOMAP_UNWRITTEN ||
-	    (ioend->io_flags & IOMAP_F_SHARED))
+	    (ioend->io_flags & IOMAP_F_SHARED) ||
+	    ioend->io_size >= XFS_LARGE_IOEND)
 		ioend->io_bio->bi_end_io = xfs_end_bio;
 	return status;
 }
-- 
2.26.3



* [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages
  2021-05-17 17:17 [PATCH v3 0/3] iomap: avoid soft lockup warnings on large ioends Brian Foster
  2021-05-17 17:17 ` [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context Brian Foster
  2021-05-17 17:17 ` [PATCH v3 2/3] xfs: kick large ioends to completion workqueue Brian Foster
@ 2021-05-17 17:17 ` Brian Foster
  2021-05-19 13:28   ` Christoph Hellwig
  2021-05-20 23:27   ` Darrick J. Wong
  2 siblings, 2 replies; 24+ messages in thread
From: Brian Foster @ 2021-05-17 17:17 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

The iomap writeback infrastructure is currently able to construct
extremely large bio chains (tens of GBs) associated with a single
ioend. This consolidation provides no significant value as bio
chains increase beyond a reasonable minimum size. On the other hand,
this does hold significant numbers of pages in the writeback
state across an unnecessarily large number of bios because the ioend
is not processed for completion until the final bio in the chain
completes. Cap an individual ioend to a reasonable size of 4096
pages (16MB with 4k pages) to avoid this condition.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c |  6 ++++--
 include/linux/iomap.h  | 26 ++++++++++++++++++++++++++
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 642422775e4e..f2890ee434d0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1269,7 +1269,7 @@ iomap_chain_bio(struct bio *prev)
 
 static bool
 iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
-		sector_t sector)
+		unsigned len, sector_t sector)
 {
 	if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
 	    (wpc->ioend->io_flags & IOMAP_F_SHARED))
@@ -1280,6 +1280,8 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
 		return false;
 	if (sector != bio_end_sector(wpc->ioend->io_bio))
 		return false;
+	if (wpc->ioend->io_size + len > IOEND_MAX_IOSIZE)
+		return false;
 	return true;
 }
 
@@ -1297,7 +1299,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
 	unsigned poff = offset & (PAGE_SIZE - 1);
 	bool merged, same_page = false;
 
-	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
+	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, len, sector)) {
 		if (wpc->ioend)
 			list_add(&wpc->ioend->io_list, iolist);
 		wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 07f3f4e69084..89b15cc236d5 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -203,6 +203,32 @@ struct iomap_ioend {
 	struct bio		io_inline_bio;	/* MUST BE LAST! */
 };
 
+/*
+ * Maximum ioend IO size is used to prevent ioends from becoming unbound in
+ * size. bios can reach 4GB in size if pages are contiguous, and bio chains are
+ * effectively unbound in length. Hence the only limits on the size of the bio
+ * chain is the contiguity of the extent on disk and the length of the run of
+ * sequential dirty pages in the page cache. This can be tens of GBs of physical
+ * extents and if memory is large enough, tens of millions of dirty pages.
+ * Locking them all under writeback until the final bio in the chain is
+ * submitted and completed locks all those pages for the legnth of time it takes
+ * to write those many, many GBs of data to storage.
+ *
+ * Background writeback caps any single writepages call to half the device
+ * bandwidth to ensure fairness and prevent any one dirty inode causing
+ * writeback starvation. fsync() and other WB_SYNC_ALL writebacks have no such
+ * cap on wbc->nr_pages, and that's where the above massive bio chain lengths
+ * come from. We want large IOs to reach the storage, but we need to limit
+ * completion latencies, hence we need to control the maximum IO size we
+ * dispatch to the storage stack.
+ *
+ * We don't really have to care about the extra IO completion overhead here
+ * because iomap has contiguous IO completion merging. If the device can sustain
+ * high throughput and large bios, the ioends are merged on completion and
+ * processed in large, efficient chunks with no additional IO latency.
+ */
+#define IOEND_MAX_IOSIZE	(4096ULL << PAGE_SHIFT)
+
 struct iomap_writeback_ops {
 	/*
 	 * Required, maps the blocks so that writeback can be performed on
-- 
2.26.3



* Re: [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context
  2021-05-17 17:17 ` [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context Brian Foster
@ 2021-05-17 17:54   ` Matthew Wilcox
  2021-05-18 11:38     ` Brian Foster
  2021-05-22  7:45   ` Ming Lei
  1 sibling, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2021-05-17 17:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel

On Mon, May 17, 2021 at 01:17:20PM -0400, Brian Foster wrote:
> @@ -1084,9 +1084,12 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
>  			next = bio->bi_private;
>  
>  		/* walk each page on bio, ending page IO on them */
> -		bio_for_each_segment_all(bv, bio, iter_all)
> +		bio_for_each_segment_all(bv, bio, iter_all) {
>  			iomap_finish_page_writeback(inode, bv->bv_page, error,
>  					bv->bv_len);
> +			if (!atomic)
> +				cond_resched();
> +		}

I don't know that it makes sense to check after _every_ page.  I might
go for every segment.  Some users check after every thousand pages.



* Re: [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context
  2021-05-17 17:54   ` Matthew Wilcox
@ 2021-05-18 11:38     ` Brian Foster
  2021-05-20 21:58       ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2021-05-18 11:38 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs, linux-fsdevel

On Mon, May 17, 2021 at 06:54:34PM +0100, Matthew Wilcox wrote:
> On Mon, May 17, 2021 at 01:17:20PM -0400, Brian Foster wrote:
> > @@ -1084,9 +1084,12 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> >  			next = bio->bi_private;
> >  
> >  		/* walk each page on bio, ending page IO on them */
> > -		bio_for_each_segment_all(bv, bio, iter_all)
> > +		bio_for_each_segment_all(bv, bio, iter_all) {
> >  			iomap_finish_page_writeback(inode, bv->bv_page, error,
> >  					bv->bv_len);
> > +			if (!atomic)
> > +				cond_resched();
> > +		}
> 
> I don't know that it makes sense to check after _every_ page.  I might
> go for every segment.  Some users check after every thousand pages.
> 

The handful of examples I come across on a brief scan (including the
other iomap usage) have a similar pattern as used here. I don't doubt
there are others, but I think I'd prefer to have more reasoning behind
adding more code than might be necessary (i.e. do we expect additional
overhead to be measurable here?). As it is, the intent isn't so much to
check on every page as much as this just happens to be the common point
of the function to cover both long bio chains and single vector bios
with large numbers of pages.

Brian



* Re: [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages
  2021-05-17 17:17 ` [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages Brian Foster
@ 2021-05-19 13:28   ` Christoph Hellwig
  2021-05-19 14:52     ` Brian Foster
  2021-05-20 23:27   ` Darrick J. Wong
  1 sibling, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2021-05-19 13:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel

On Mon, May 17, 2021 at 01:17:22PM -0400, Brian Foster wrote:
> The iomap writeback infrastructure is currently able to construct
> extremely large bio chains (tens of GBs) associated with a single
> ioend. This consolidation provides no significant value as bio
> chains increase beyond a reasonable minimum size. On the other hand,
> this does hold significant numbers of pages in the writeback
> state across an unnecessarily large number of bios because the ioend
> is not processed for completion until the final bio in the chain
> completes. Cap an individual ioend to a reasonable size of 4096
> pages (16MB with 4k pages) to avoid this condition.

Note that once we get huge page/folio support in the page cache this
sucks as we can trivially handle much larger sizes with very little
iteration.

I wonder if both this limit and the previous one should be based on the
number of pages added instead.  And in fact maybe if we only want the
limit at add to ioend time and skip the defer to workqueue part entirely.


* Re: [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages
  2021-05-19 13:28   ` Christoph Hellwig
@ 2021-05-19 14:52     ` Brian Foster
  0 siblings, 0 replies; 24+ messages in thread
From: Brian Foster @ 2021-05-19 14:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Wed, May 19, 2021 at 02:28:39PM +0100, Christoph Hellwig wrote:
> On Mon, May 17, 2021 at 01:17:22PM -0400, Brian Foster wrote:
> > The iomap writeback infrastructure is currently able to construct
> > extremely large bio chains (tens of GBs) associated with a single
> > ioend. This consolidation provides no significant value as bio
> > chains increase beyond a reasonable minimum size. On the other hand,
> > this does hold significant numbers of pages in the writeback
> > state across an unnecessarily large number of bios because the ioend
> > is not processed for completion until the final bio in the chain
> > completes. Cap an individual ioend to a reasonable size of 4096
> > pages (16MB with 4k pages) to avoid this condition.
> 
> Note that once we get huge page/folio support in the page cache this
> sucks as we can trivially handle much larger sizes with very little
> iteration.
> 
> I wonder if both this limit and the previous one should be based on the
> number of pages added instead.  And in fact maybe if we only want the
> limit at add to ioend time and skip the defer to workqueue part entirely.
> 

Both limits are already based on pages. I imagine they could change to
folios when appropriate.

The defer to workqueue part was based on your suggestion[1]. The primary
purpose of this series is to address the completion processing soft
lockup warning, so I don't have a strong preference on whether we do
that by capping ioend size, processing (and yielding) from non-atomic
context, or both.

Brian

[1] https://lore.kernel.org/linux-fsdevel/20200917080455.GY26262@infradead.org/



* Re: [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context
  2021-05-18 11:38     ` Brian Foster
@ 2021-05-20 21:58       ` Darrick J. Wong
  2021-05-24 11:57         ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2021-05-20 21:58 UTC (permalink / raw)
  To: Brian Foster; +Cc: Matthew Wilcox, linux-xfs, linux-fsdevel

On Tue, May 18, 2021 at 07:38:01AM -0400, Brian Foster wrote:
> On Mon, May 17, 2021 at 06:54:34PM +0100, Matthew Wilcox wrote:
> > On Mon, May 17, 2021 at 01:17:20PM -0400, Brian Foster wrote:
> > > @@ -1084,9 +1084,12 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> > >  			next = bio->bi_private;
> > >  
> > >  		/* walk each page on bio, ending page IO on them */
> > > -		bio_for_each_segment_all(bv, bio, iter_all)
> > > +		bio_for_each_segment_all(bv, bio, iter_all) {
> > >  			iomap_finish_page_writeback(inode, bv->bv_page, error,
> > >  					bv->bv_len);
> > > +			if (!atomic)
> > > +				cond_resched();
> > > +		}
> > 
> > I don't know that it makes sense to check after _every_ page.  I might
> > go for every segment.  Some users check after every thousand pages.
> > 
> 
> The handful of examples I come across on a brief scan (including the
> other iomap usage) have a similar pattern as used here. I don't doubt
> there are others, but I think I'd prefer to have more reasoning behind
> adding more code than might be necessary (i.e. do we expect additional
> overhead to be measurable here?). As it is, the intent isn't so much to
> check on every page as much as this just happens to be the common point
> of the function to cover both long bio chains and single vector bios
> with large numbers of pages.

It's been a while since I waded through the macro hell to find out what
cond_resched actually does, but iirc it can do some fairly heavyweight
things (disable preemption, call the scheduler, rcu stuff) which is why
we're supposed to be a little judicious about amortizing each call over
a few thousand pages.

--D

> Brian
> 


* Re: [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages
  2021-05-17 17:17 ` [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages Brian Foster
  2021-05-19 13:28   ` Christoph Hellwig
@ 2021-05-20 23:27   ` Darrick J. Wong
  2021-05-24 12:02     ` Brian Foster
  1 sibling, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2021-05-20 23:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel

On Mon, May 17, 2021 at 01:17:22PM -0400, Brian Foster wrote:
> The iomap writeback infrastructure is currently able to construct
> extremely large bio chains (tens of GBs) associated with a single
> ioend. This consolidation provides no significant value as bio
> chains increase beyond a reasonable minimum size. On the other hand,
> this does hold significant numbers of pages in the writeback
> state across an unnecessarily large number of bios because the ioend
> is not processed for completion until the final bio in the chain
> completes. Cap an individual ioend to a reasonable size of 4096
> pages (16MB with 4k pages) to avoid this condition.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/iomap/buffered-io.c |  6 ++++--
>  include/linux/iomap.h  | 26 ++++++++++++++++++++++++++
>  2 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 642422775e4e..f2890ee434d0 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1269,7 +1269,7 @@ iomap_chain_bio(struct bio *prev)
>  
>  static bool
>  iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
> -		sector_t sector)
> +		unsigned len, sector_t sector)
>  {
>  	if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
>  	    (wpc->ioend->io_flags & IOMAP_F_SHARED))
> @@ -1280,6 +1280,8 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
>  		return false;
>  	if (sector != bio_end_sector(wpc->ioend->io_bio))
>  		return false;
> +	if (wpc->ioend->io_size + len > IOEND_MAX_IOSIZE)
> +		return false;
>  	return true;
>  }
>  
> @@ -1297,7 +1299,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
>  	unsigned poff = offset & (PAGE_SIZE - 1);
>  	bool merged, same_page = false;
>  
> -	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
> +	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, len, sector)) {
>  		if (wpc->ioend)
>  			list_add(&wpc->ioend->io_list, iolist);
>  		wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 07f3f4e69084..89b15cc236d5 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -203,6 +203,32 @@ struct iomap_ioend {
>  	struct bio		io_inline_bio;	/* MUST BE LAST! */
>  };
>  
> +/*
> + * Maximum ioend IO size is used to prevent ioends from becoming unbound in
> + * size. bios can reach 4GB in size if pages are contiguous, and bio chains are
> + * effectively unbound in length. Hence the only limits on the size of the bio
> + * chain is the contiguity of the extent on disk and the length of the run of
> + * sequential dirty pages in the page cache. This can be tens of GBs of physical
> + * extents and if memory is large enough, tens of millions of dirty pages.
> + * Locking them all under writeback until the final bio in the chain is
> + * submitted and completed locks all those pages for the legnth of time it takes

s/legnth/length/

> + * to write those many, many GBs of data to storage.
> + *
> + * Background writeback caps any single writepages call to half the device
> + * bandwidth to ensure fairness and prevent any one dirty inode causing
> + * writeback starvation. fsync() and other WB_SYNC_ALL writebacks have no such
> + * cap on wbc->nr_pages, and that's where the above massive bio chain lengths
> + * come from. We want large IOs to reach the storage, but we need to limit
> + * completion latencies, hence we need to control the maximum IO size we
> + * dispatch to the storage stack.
> + *
> + * We don't really have to care about the extra IO completion overhead here
> + * because iomap has contiguous IO completion merging. If the device can sustain

Assuming you're referring to iomap_finish_ioends, only XFS employs the
ioend completion merging, and only for ioends where it decides to
override the default bi_end_io.  iomap on its own never calls
iomap_ioend_try_merge.

This patch establishes a maximum ioend size of 4096 pages so that we
don't trip the lockup watchdog while clearing pagewriteback and also so
that we don't pin a large number of pages while constructing a big chain
of bios.  On gfs2 and zonefs, each ioend completion will now have to
clear up to 4096 pages from whatever context bio_endio is called.

For XFS it's a more complicated -- XFS already overrode the bio handler
for ioends that required further metadata updates (e.g. unwritten
conversion, eof extension, or cow) so that it could combine ioends when
possible.  XFS wants to combine ioends to amortize the cost of getting
the ILOCK and running transactions over a larger number of pages.

So I guess I see how the two changes dovetail nicely for XFS -- iomap
issues smaller write bios, and the xfs ioend worker can recombine
however many bios complete before the worker runs.  As a bonus, we don't
have to worry about situations like the device driver completing so many
bios from a single invocation of a bottom half handler that we run afoul
of the soft lockup timer.

Is that a correct understanding of how the two changes intersect with
each other?  TBH I was expecting the two thresholds to be closer in
value.

The other two users of iomap for buffered io (gfs2 and zonefs) don't
have a means to defer and combine ioends like xfs does.  Do you think
they should?  I think it's still possible to trip the softlockup there.

--D

> + * high throughput and large bios, the ioends are merged on completion and
> + * processed in large, efficient chunks with no additional IO latency.
> + */
> +#define IOEND_MAX_IOSIZE	(4096ULL << PAGE_SHIFT)
> +
>  struct iomap_writeback_ops {
>  	/*
>  	 * Required, maps the blocks so that writeback can be performed on
> -- 
> 2.26.3
> 


* Re: [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context
  2021-05-17 17:17 ` [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context Brian Foster
  2021-05-17 17:54   ` Matthew Wilcox
@ 2021-05-22  7:45   ` Ming Lei
  2021-05-24 11:57     ` Brian Foster
  1 sibling, 1 reply; 24+ messages in thread
From: Ming Lei @ 2021-05-22  7:45 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel

On Mon, May 17, 2021 at 01:17:20PM -0400, Brian Foster wrote:
> The iomap ioend mechanism has the ability to construct very large,
> contiguous bios and/or bio chains. This has been reported to lead to

BTW, it is actually wrong to complete a large bio chain in
iomap_finish_ioend(), because it risks a bio allocation deadlock:
bio_alloc_bioset() relies on bio submission to make forward progress.
That guarantee no longer holds when all chained bios are freed only
after the whole ioend is done, since all chained bios (except for the
one embedded in the ioend) are allocated from the same bioset
(fs_bio_set).


Thanks,
Ming



* Re: [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context
  2021-05-20 21:58       ` Darrick J. Wong
@ 2021-05-24 11:57         ` Brian Foster
  2021-05-24 16:53           ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2021-05-24 11:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Matthew Wilcox, linux-xfs, linux-fsdevel

On Thu, May 20, 2021 at 02:58:58PM -0700, Darrick J. Wong wrote:
> On Tue, May 18, 2021 at 07:38:01AM -0400, Brian Foster wrote:
> > On Mon, May 17, 2021 at 06:54:34PM +0100, Matthew Wilcox wrote:
> > > On Mon, May 17, 2021 at 01:17:20PM -0400, Brian Foster wrote:
> > > > @@ -1084,9 +1084,12 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> > > >  			next = bio->bi_private;
> > > >  
> > > >  		/* walk each page on bio, ending page IO on them */
> > > > -		bio_for_each_segment_all(bv, bio, iter_all)
> > > > +		bio_for_each_segment_all(bv, bio, iter_all) {
> > > >  			iomap_finish_page_writeback(inode, bv->bv_page, error,
> > > >  					bv->bv_len);
> > > > +			if (!atomic)
> > > > +				cond_resched();
> > > > +		}
> > > 
> > > I don't know that it makes sense to check after _every_ page.  I might
> > > go for every segment.  Some users check after every thousand pages.
> > > 
> > 
> > The handful of examples I come across on a brief scan (including the
> > other iomap usage) have a similar pattern as used here. I don't doubt
> > there are others, but I think I'd prefer to have more reasoning behind
> > adding more code than might be necessary (i.e. do we expect additional
> > overhead to be measurable here?). As it is, the intent isn't so much to
> > check on every page as much as this just happens to be the common point
> > of the function to cover both long bio chains and single vector bios
> > with large numbers of pages.
> 
> It's been a while since I waded through the macro hell to find out what
> cond_resched actually does, but iirc it can do some fairly heavyweight
> things (disable preemption, call the scheduler, rcu stuff) which is why
> we're supposed to be a little judicious about amortizing each call over
> a few thousand pages.
> 

It looks to me like it just checks some state bit and only does any
work if actually necessary. I suppose doing that less often is cheaper
than doing it more often, but it's not clear to me the difference is
enough that it really matters and/or warrants more code to filter out
calls.

What exactly did you have in mind for logic? I suppose we could always
stuff a 'if (!(count++ % 1024)) cond_resched();' or some such in the
inner loop, but that might have less of an effect on larger chains
constructed of bios with fewer pages (depending on whether that might
still be possible).

Brian

> --D
> 
> > Brian
> > 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context
  2021-05-22  7:45   ` Ming Lei
@ 2021-05-24 11:57     ` Brian Foster
  2021-05-24 14:11       ` Ming Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2021-05-24 11:57 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-xfs, linux-fsdevel

On Sat, May 22, 2021 at 03:45:11PM +0800, Ming Lei wrote:
> On Mon, May 17, 2021 at 01:17:20PM -0400, Brian Foster wrote:
> > The iomap ioend mechanism has the ability to construct very large,
> > contiguous bios and/or bio chains. This has been reported to lead to
> 
> BTW, it is actually wrong to complete a large bio chain in
> iomap_finish_ioend(), which risks a bio allocation deadlock, because
> bio_alloc_bioset() relies on bio submission to make forward progress. That
> no longer holds when all chained bios are freed only after the whole
> ioend is done, since all chained bios (except for the one embedded in the
> ioend) are allocated from the same bioset (fs_bio_set).
> 

Interesting. Do you have a reproducer (or error report) for this? Is it
addressed by the next patch, or are further changes required?

Brian

> 
> Thanks,
> Ming
> 



* Re: [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages
  2021-05-20 23:27   ` Darrick J. Wong
@ 2021-05-24 12:02     ` Brian Foster
  2021-05-25  4:20       ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2021-05-24 12:02 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel

On Thu, May 20, 2021 at 04:27:37PM -0700, Darrick J. Wong wrote:
> On Mon, May 17, 2021 at 01:17:22PM -0400, Brian Foster wrote:
> > The iomap writeback infrastructure is currently able to construct
> > extremely large bio chains (tens of GBs) associated with a single
> > ioend. This consolidation provides no significant value as bio
> > chains increase beyond a reasonable minimum size. On the other hand,
> > this does hold significant numbers of pages in the writeback
> > state across an unnecessarily large number of bios because the ioend
> > is not processed for completion until the final bio in the chain
> > completes. Cap an individual ioend to a reasonable size of 4096
> > pages (16MB with 4k pages) to avoid this condition.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/iomap/buffered-io.c |  6 ++++--
> >  include/linux/iomap.h  | 26 ++++++++++++++++++++++++++
> >  2 files changed, 30 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 642422775e4e..f2890ee434d0 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1269,7 +1269,7 @@ iomap_chain_bio(struct bio *prev)
> >  
> >  static bool
> >  iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
> > -		sector_t sector)
> > +		unsigned len, sector_t sector)
> >  {
> >  	if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
> >  	    (wpc->ioend->io_flags & IOMAP_F_SHARED))
> > @@ -1280,6 +1280,8 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
> >  		return false;
> >  	if (sector != bio_end_sector(wpc->ioend->io_bio))
> >  		return false;
> > +	if (wpc->ioend->io_size + len > IOEND_MAX_IOSIZE)
> > +		return false;
> >  	return true;
> >  }
> >  
> > @@ -1297,7 +1299,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
> >  	unsigned poff = offset & (PAGE_SIZE - 1);
> >  	bool merged, same_page = false;
> >  
> > -	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
> > +	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, len, sector)) {
> >  		if (wpc->ioend)
> >  			list_add(&wpc->ioend->io_list, iolist);
> >  		wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc);
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 07f3f4e69084..89b15cc236d5 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -203,6 +203,32 @@ struct iomap_ioend {
> >  	struct bio		io_inline_bio;	/* MUST BE LAST! */
> >  };
> >  
> > +/*
> > + * Maximum ioend IO size is used to prevent ioends from becoming unbound in
> > + * size. bios can reach 4GB in size if pages are contiguous, and bio chains are
> > + * effectively unbound in length. Hence the only limits on the size of the bio
> > + * chain is the contiguity of the extent on disk and the length of the run of
> > + * sequential dirty pages in the page cache. This can be tens of GBs of physical
> > + * extents and if memory is large enough, tens of millions of dirty pages.
> > + * Locking them all under writeback until the final bio in the chain is
> > + * submitted and completed locks all those pages for the legnth of time it takes
> 
> s/legnth/length/
>

Fixed.
 
> > + * to write those many, many GBs of data to storage.
> > + *
> > + * Background writeback caps any single writepages call to half the device
> > + * bandwidth to ensure fairness and prevent any one dirty inode causing
> > + * writeback starvation. fsync() and other WB_SYNC_ALL writebacks have no such
> > + * cap on wbc->nr_pages, and that's where the above massive bio chain lengths
> > + * come from. We want large IOs to reach the storage, but we need to limit
> > + * completion latencies, hence we need to control the maximum IO size we
> > + * dispatch to the storage stack.
> > + *
> > + * We don't really have to care about the extra IO completion overhead here
> > + * because iomap has contiguous IO completion merging. If the device can sustain
> 
> Assuming you're referring to iomap_finish_ioends, only XFS employs the
> ioend completion merging, and only for ioends where it decides to
> override the default bi_end_io.  iomap on its own never calls
> iomap_ioend_try_merge.
> 
> This patch establishes a maximum ioend size of 4096 pages so that we
> don't trip the lockup watchdog while clearing pagewriteback and also so
> that we don't pin a large number of pages while constructing a big chain
> of bios.  On gfs2 and zonefs, each ioend completion will now have to
> clear up to 4096 pages from whatever context bio_endio is called.
> 
> For XFS it's more complicated -- XFS already overrode the bio handler
> for ioends that required further metadata updates (e.g. unwritten
> conversion, eof extension, or cow) so that it could combine ioends when
> possible.  XFS wants to combine ioends to amortize the cost of getting
> the ILOCK and running transactions over a larger number of pages.
> 
> So I guess I see how the two changes dovetail nicely for XFS -- iomap
> issues smaller write bios, and the xfs ioend worker can recombine
> however many bios complete before the worker runs.  As a bonus, we don't
> have to worry about situations like the device driver completing so many
> bios from a single invocation of a bottom half handler that we run afoul
> of the soft lockup timer.
> 
> Is that a correct understanding of how the two changes intersect with
> each other?  TBH I was expecting the two thresholds to be closer in
> value.
> 

I think so. That's interesting because my inclination was to make them
farther apart (or more specifically, increase the threshold in this
patch and leave the previous). The primary goal of this series was to
address the soft lockup warning problem, hence the thresholds on earlier
versions started at rather conservative values. I think both values have
been reasonably justified in being reduced, though this patch has a more
broad impact than the previous one in that it changes behavior for all
iomap-based filesystems. Of course that's something that could also be
addressed with
a more dynamic tunable..

> The other two users of iomap for buffered io (gfs2 and zonefs) don't
> have a means to defer and combine ioends like xfs does.  Do you think
> they should?  I think it's still possible to trip the softlockup there.
> 

I'm not sure. We'd probably want some feedback from developers of
filesystems other than XFS before incorporating a change like this. The
first patch in the series more just provides some infrastructure for
other filesystems to avoid the problem as they see fit.

Brian

> --D
> 
> > + * high throughput and large bios, the ioends are merged on completion and
> > + * processed in large, efficient chunks with no additional IO latency.
> > + */
> > +#define IOEND_MAX_IOSIZE	(4096ULL << PAGE_SHIFT)
> > +
> >  struct iomap_writeback_ops {
> >  	/*
> >  	 * Required, maps the blocks so that writeback can be performed on
> > -- 
> > 2.26.3
> > 
> 



* Re: [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context
  2021-05-24 11:57     ` Brian Foster
@ 2021-05-24 14:11       ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2021-05-24 14:11 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel

On Mon, May 24, 2021 at 07:57:48AM -0400, Brian Foster wrote:
> On Sat, May 22, 2021 at 03:45:11PM +0800, Ming Lei wrote:
> > On Mon, May 17, 2021 at 01:17:20PM -0400, Brian Foster wrote:
> > > The iomap ioend mechanism has the ability to construct very large,
> > > contiguous bios and/or bio chains. This has been reported to lead to
> > 
> > BTW, it is actually wrong to complete a large bio chain in
> > iomap_finish_ioend(), which risks a bio allocation deadlock, because
> > bio_alloc_bioset() relies on bio submission to make forward progress. That
> > no longer holds when all chained bios are freed only after the whole
> > ioend is done, since all chained bios (except for the one embedded in the
> > ioend) are allocated from the same bioset (fs_bio_set).
> > 
> 
> Interesting. Do you have a reproducer (or error report) for this? Is it

No, but the theory has been applied for a long time.

> addressed by the next patch, or are further changes required?

Your patchset can't address the issue.


Thanks,
Ming



* Re: [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context
  2021-05-24 11:57         ` Brian Foster
@ 2021-05-24 16:53           ` Darrick J. Wong
  2021-05-26  1:19             ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2021-05-24 16:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: Matthew Wilcox, linux-xfs, linux-fsdevel

On Mon, May 24, 2021 at 07:57:31AM -0400, Brian Foster wrote:
> On Thu, May 20, 2021 at 02:58:58PM -0700, Darrick J. Wong wrote:
> > On Tue, May 18, 2021 at 07:38:01AM -0400, Brian Foster wrote:
> > > On Mon, May 17, 2021 at 06:54:34PM +0100, Matthew Wilcox wrote:
> > > > On Mon, May 17, 2021 at 01:17:20PM -0400, Brian Foster wrote:
> > > > > @@ -1084,9 +1084,12 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> > > > >  			next = bio->bi_private;
> > > > >  
> > > > >  		/* walk each page on bio, ending page IO on them */
> > > > > -		bio_for_each_segment_all(bv, bio, iter_all)
> > > > > +		bio_for_each_segment_all(bv, bio, iter_all) {
> > > > >  			iomap_finish_page_writeback(inode, bv->bv_page, error,
> > > > >  					bv->bv_len);
> > > > > +			if (!atomic)
> > > > > +				cond_resched();
> > > > > +		}
> > > > 
> > > > I don't know that it makes sense to check after _every_ page.  I might
> > > > go for every segment.  Some users check after every thousand pages.
> > > > 
> > > 
> > > The handful of examples I come across on a brief scan (including the
> > > other iomap usage) have a similar pattern as used here. I don't doubt
> > > there are others, but I think I'd prefer to have more reasoning behind
> > > adding more code than might be necessary (i.e. do we expect additional
> > > overhead to be measurable here?). As it is, the intent isn't so much to
> > > check on every page as much as this just happens to be the common point
> > > of the function to cover both long bio chains and single vector bios
> > > with large numbers of pages.
> > 
> > It's been a while since I waded through the macro hell to find out what
> > cond_resched actually does, but iirc it can do some fairly heavyweight
> > things (disable preemption, call the scheduler, rcu stuff) which is why
> > we're supposed to be a little judicious about amortizing each call over
> > a few thousand pages.
> > 
> 
> It looks to me like it just checks some state bit and only does any work
> if actually necessary. I suppose doing that less often is cheaper than
> doing it more, but it's not clear to me the difference is enough that it
> really matters and/or warrants more code to filter out calls..
> 
> What exactly did you have in mind for logic? I suppose we could always
> stuff a 'if (!(count++ % 1024)) cond_resched();' or some such in the
> inner loop, but that might have less of an effect on larger chains
> constructed of bios with fewer pages (depending on whether that might
> still be possible).

I /was/ thinking about a function level page counter until I noticed
that iomap_{write,unshare}_actor call cond_resched for every page it
touches.  I withdraw the comment. :)

--D

> 
> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > 
> 


* Re: [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages
  2021-05-24 12:02     ` Brian Foster
@ 2021-05-25  4:20       ` Darrick J. Wong
  2021-05-25  4:29         ` Damien Le Moal
                           ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Darrick J. Wong @ 2021-05-25  4:20 UTC (permalink / raw)
  To: Brian Foster, Damien Le Moal, Andreas Gruenbacher
  Cc: linux-xfs, linux-fsdevel

On Mon, May 24, 2021 at 08:02:18AM -0400, Brian Foster wrote:
> On Thu, May 20, 2021 at 04:27:37PM -0700, Darrick J. Wong wrote:
> > On Mon, May 17, 2021 at 01:17:22PM -0400, Brian Foster wrote:
> > > The iomap writeback infrastructure is currently able to construct
> > > extremely large bio chains (tens of GBs) associated with a single
> > > ioend. This consolidation provides no significant value as bio
> > > chains increase beyond a reasonable minimum size. On the other hand,
> > > this does hold significant numbers of pages in the writeback
> > > state across an unnecessarily large number of bios because the ioend
> > > is not processed for completion until the final bio in the chain
> > > completes. Cap an individual ioend to a reasonable size of 4096
> > > pages (16MB with 4k pages) to avoid this condition.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/iomap/buffered-io.c |  6 ++++--
> > >  include/linux/iomap.h  | 26 ++++++++++++++++++++++++++
> > >  2 files changed, 30 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 642422775e4e..f2890ee434d0 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -1269,7 +1269,7 @@ iomap_chain_bio(struct bio *prev)
> > >  
> > >  static bool
> > >  iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
> > > -		sector_t sector)
> > > +		unsigned len, sector_t sector)
> > >  {
> > >  	if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
> > >  	    (wpc->ioend->io_flags & IOMAP_F_SHARED))
> > > @@ -1280,6 +1280,8 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
> > >  		return false;
> > >  	if (sector != bio_end_sector(wpc->ioend->io_bio))
> > >  		return false;
> > > +	if (wpc->ioend->io_size + len > IOEND_MAX_IOSIZE)
> > > +		return false;
> > >  	return true;
> > >  }
> > >  
> > > @@ -1297,7 +1299,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
> > >  	unsigned poff = offset & (PAGE_SIZE - 1);
> > >  	bool merged, same_page = false;
> > >  
> > > -	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
> > > +	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, len, sector)) {
> > >  		if (wpc->ioend)
> > >  			list_add(&wpc->ioend->io_list, iolist);
> > >  		wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc);
> > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > index 07f3f4e69084..89b15cc236d5 100644
> > > --- a/include/linux/iomap.h
> > > +++ b/include/linux/iomap.h
> > > @@ -203,6 +203,32 @@ struct iomap_ioend {
> > >  	struct bio		io_inline_bio;	/* MUST BE LAST! */
> > >  };
> > >  
> > > +/*
> > > + * Maximum ioend IO size is used to prevent ioends from becoming unbound in
> > > + * size. bios can reach 4GB in size if pages are contiguous, and bio chains are
> > > + * effectively unbound in length. Hence the only limits on the size of the bio
> > > + * chain is the contiguity of the extent on disk and the length of the run of
> > > + * sequential dirty pages in the page cache. This can be tens of GBs of physical
> > > + * extents and if memory is large enough, tens of millions of dirty pages.
> > > + * Locking them all under writeback until the final bio in the chain is
> > > + * submitted and completed locks all those pages for the legnth of time it takes
> > 
> > s/legnth/length/
> >
> 
> Fixed.
>  
> > > + * to write those many, many GBs of data to storage.
> > > + *
> > > + * Background writeback caps any single writepages call to half the device
> > > + * bandwidth to ensure fairness and prevent any one dirty inode causing
> > > + * writeback starvation. fsync() and other WB_SYNC_ALL writebacks have no such
> > > + * cap on wbc->nr_pages, and that's where the above massive bio chain lengths
> > > + * come from. We want large IOs to reach the storage, but we need to limit
> > > + * completion latencies, hence we need to control the maximum IO size we
> > > + * dispatch to the storage stack.
> > > + *
> > > + * We don't really have to care about the extra IO completion overhead here
> > > + * because iomap has contiguous IO completion merging. If the device can sustain
> > 
> > Assuming you're referring to iomap_finish_ioends, only XFS employs the
> > ioend completion merging, and only for ioends where it decides to
> > override the default bi_end_io.  iomap on its own never calls
> > iomap_ioend_try_merge.
> > 
> > This patch establishes a maximum ioend size of 4096 pages so that we
> > don't trip the lockup watchdog while clearing pagewriteback and also so
> > that we don't pin a large number of pages while constructing a big chain
> > of bios.  On gfs2 and zonefs, each ioend completion will now have to
> > clear up to 4096 pages from whatever context bio_endio is called.
> > 
> > For XFS it's more complicated -- XFS already overrode the bio handler
> > for ioends that required further metadata updates (e.g. unwritten
> > conversion, eof extension, or cow) so that it could combine ioends when
> > possible.  XFS wants to combine ioends to amortize the cost of getting
> > the ILOCK and running transactions over a larger number of pages.
> > 
> > So I guess I see how the two changes dovetail nicely for XFS -- iomap
> > issues smaller write bios, and the xfs ioend worker can recombine
> > however many bios complete before the worker runs.  As a bonus, we don't
> > have to worry about situations like the device driver completing so many
> > bios from a single invocation of a bottom half handler that we run afoul
> > of the soft lockup timer.
> > 
> > Is that a correct understanding of how the two changes intersect with
> > each other?  TBH I was expecting the two thresholds to be closer in
> > value.
> > 
> 
> I think so. That's interesting because my inclination was to make them
> farther apart (or more specifically, increase the threshold in this
> patch and leave the previous). The primary goal of this series was to
> address the soft lockup warning problem, hence the thresholds on earlier
> versions started at rather conservative values. I think both values have
> been reasonably justified in being reduced, though this patch has a more
> broad impact than the previous in that it changes behavior for all iomap
> based fs'. Of course that's something that could also be addressed with
> a more dynamic tunable..

<shrug> I think I'm comfortable starting with 256 for xfs to bump an
ioend to a workqueue, and 4096 pages as the limit for an iomap ioend.
If people demonstrate a need to smart-tune or manual-tune we can always
add one later.

Though I guess I did kind of wonder if maybe a better limit for iomap
would be max_hw_sectors?  Since that's the maximum size of an IO that
the kernel will issue for that device?

(Hm, maybe not; my computers all have it set to 1280k, which is a
pathetic 20 pages on a 64k-page system.)

> > The other two users of iomap for buffered io (gfs2 and zonefs) don't
> > have a means to defer and combine ioends like xfs does.  Do you think
> > they should?  I think it's still possible to trip the softlockup there.
> > 
> 
> I'm not sure. We'd probably want some feedback from developers of
> filesystems other than XFS before incorporating a change like this. The
> first patch in the series more just provides some infrastructure for
> other filesystems to avoid the problem as they see fit.

Hmm.  Any input from the two other users of iomap buffered IO?  Who are
now directly in the to: list? :D

Catch-up TLDR: we're evaluating a proposal to limit the length of an
iomap writeback ioend to 4096 pages so that we don't trip the hangcheck
warning while clearing pagewriteback if the ioend completion happens to
run in softirq context (e.g. nvme completion).

--D

> Brian
> 
> > --D
> > 
> > > + * high throughput and large bios, the ioends are merged on completion and
> > > + * processed in large, efficient chunks with no additional IO latency.
> > > + */
> > > +#define IOEND_MAX_IOSIZE	(4096ULL << PAGE_SHIFT)
> > > +
> > >  struct iomap_writeback_ops {
> > >  	/*
> > >  	 * Required, maps the blocks so that writeback can be performed on
> > > -- 
> > > 2.26.3
> > > 
> > 
> 


* Re: [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages
  2021-05-25  4:20       ` Darrick J. Wong
@ 2021-05-25  4:29         ` Damien Le Moal
  2021-05-25  7:13         ` Dave Chinner
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Damien Le Moal @ 2021-05-25  4:29 UTC (permalink / raw)
  To: Darrick J. Wong, Brian Foster, Andreas Gruenbacher
  Cc: linux-xfs, linux-fsdevel

On 2021/05/25 13:20, Darrick J. Wong wrote:
> On Mon, May 24, 2021 at 08:02:18AM -0400, Brian Foster wrote:
>> On Thu, May 20, 2021 at 04:27:37PM -0700, Darrick J. Wong wrote:
>>> On Mon, May 17, 2021 at 01:17:22PM -0400, Brian Foster wrote:
>>>> The iomap writeback infrastructure is currently able to construct
>>>> extremely large bio chains (tens of GBs) associated with a single
>>>> ioend. This consolidation provides no significant value as bio
>>>> chains increase beyond a reasonable minimum size. On the other hand,
>>>> this does hold significant numbers of pages in the writeback
>>>> state across an unnecessarily large number of bios because the ioend
>>>> is not processed for completion until the final bio in the chain
>>>> completes. Cap an individual ioend to a reasonable size of 4096
>>>> pages (16MB with 4k pages) to avoid this condition.
>>>>
>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>>> ---
>>>>  fs/iomap/buffered-io.c |  6 ++++--
>>>>  include/linux/iomap.h  | 26 ++++++++++++++++++++++++++
>>>>  2 files changed, 30 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>>> index 642422775e4e..f2890ee434d0 100644
>>>> --- a/fs/iomap/buffered-io.c
>>>> +++ b/fs/iomap/buffered-io.c
>>>> @@ -1269,7 +1269,7 @@ iomap_chain_bio(struct bio *prev)
>>>>  
>>>>  static bool
>>>>  iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
>>>> -		sector_t sector)
>>>> +		unsigned len, sector_t sector)
>>>>  {
>>>>  	if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
>>>>  	    (wpc->ioend->io_flags & IOMAP_F_SHARED))
>>>> @@ -1280,6 +1280,8 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
>>>>  		return false;
>>>>  	if (sector != bio_end_sector(wpc->ioend->io_bio))
>>>>  		return false;
>>>> +	if (wpc->ioend->io_size + len > IOEND_MAX_IOSIZE)
>>>> +		return false;
>>>>  	return true;
>>>>  }
>>>>  
>>>> @@ -1297,7 +1299,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
>>>>  	unsigned poff = offset & (PAGE_SIZE - 1);
>>>>  	bool merged, same_page = false;
>>>>  
>>>> -	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
>>>> +	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, len, sector)) {
>>>>  		if (wpc->ioend)
>>>>  			list_add(&wpc->ioend->io_list, iolist);
>>>>  		wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc);
>>>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>>>> index 07f3f4e69084..89b15cc236d5 100644
>>>> --- a/include/linux/iomap.h
>>>> +++ b/include/linux/iomap.h
>>>> @@ -203,6 +203,32 @@ struct iomap_ioend {
>>>>  	struct bio		io_inline_bio;	/* MUST BE LAST! */
>>>>  };
>>>>  
>>>> +/*
>>>> + * Maximum ioend IO size is used to prevent ioends from becoming unbound in
>>>> + * size. bios can reach 4GB in size if pages are contiguous, and bio chains are
>>>> + * effectively unbound in length. Hence the only limits on the size of the bio
>>>> + * chain is the contiguity of the extent on disk and the length of the run of
>>>> + * sequential dirty pages in the page cache. This can be tens of GBs of physical
>>>> + * extents and if memory is large enough, tens of millions of dirty pages.
>>>> + * Locking them all under writeback until the final bio in the chain is
>>>> + * submitted and completed locks all those pages for the legnth of time it takes
>>>
>>> s/legnth/length/
>>>
>>
>> Fixed.
>>  
>>>> + * to write those many, many GBs of data to storage.
>>>> + *
>>>> + * Background writeback caps any single writepages call to half the device
>>>> + * bandwidth to ensure fairness and prevent any one dirty inode causing
>>>> + * writeback starvation. fsync() and other WB_SYNC_ALL writebacks have no such
>>>> + * cap on wbc->nr_pages, and that's where the above massive bio chain lengths
>>>> + * come from. We want large IOs to reach the storage, but we need to limit
>>>> + * completion latencies, hence we need to control the maximum IO size we
>>>> + * dispatch to the storage stack.
>>>> + *
>>>> + * We don't really have to care about the extra IO completion overhead here
>>>> + * because iomap has contiguous IO completion merging. If the device can sustain
>>>
>>> Assuming you're referring to iomap_finish_ioends, only XFS employs the
>>> ioend completion merging, and only for ioends where it decides to
>>> override the default bi_end_io.  iomap on its own never calls
>>> iomap_ioend_try_merge.
>>>
>>> This patch establishes a maximum ioend size of 4096 pages so that we
>>> don't trip the lockup watchdog while clearing pagewriteback and also so
>>> that we don't pin a large number of pages while constructing a big chain
>>> of bios.  On gfs2 and zonefs, each ioend completion will now have to
>>> clear up to 4096 pages from whatever context bio_endio is called.
>>>
>>> For XFS it's more complicated -- XFS already overrode the bio handler
>>> for ioends that required further metadata updates (e.g. unwritten
>>> conversion, eof extension, or cow) so that it could combine ioends when
>>> possible.  XFS wants to combine ioends to amortize the cost of getting
>>> the ILOCK and running transactions over a larger number of pages.
>>>
>>> So I guess I see how the two changes dovetail nicely for XFS -- iomap
>>> issues smaller write bios, and the xfs ioend worker can recombine
>>> however many bios complete before the worker runs.  As a bonus, we don't
>>> have to worry about situations like the device driver completing so many
>>> bios from a single invocation of a bottom half handler that we run afoul
>>> of the soft lockup timer.
>>>
>>> Is that a correct understanding of how the two changes intersect with
>>> each other?  TBH I was expecting the two thresholds to be closer in
>>> value.
>>>
>>
>> I think so. That's interesting because my inclination was to make them
>> farther apart (or more specifically, increase the threshold in this
>> patch and leave the previous). The primary goal of this series was to
>> address the soft lockup warning problem, hence the thresholds on earlier
>> versions started at rather conservative values. I think both values have
>> been reasonably justified in being reduced, though this patch has a more
>> broad impact than the previous in that it changes behavior for all iomap
>> based fs'. Of course that's something that could also be addressed with
>> a more dynamic tunable..
> 
> <shrug> I think I'm comfortable starting with 256 for xfs to bump an
> ioend to a workqueue, and 4096 pages as the limit for an iomap ioend.
> If people demonstrate a need to smart-tune or manual-tune we can always
> add one later.
> 
> Though I guess I did kind of wonder if maybe a better limit for iomap
> would be max_hw_sectors?  Since that's the maximum size of an IO that
> the kernel will issue for that device?
> 
> (Hm, maybe not; my computers all have it set to 1280k, which is a
> pathetic 20 pages on a 64k-page system.)

Are you sure you are not looking at max_sectors (not max_hw_sectors)?
For an average SATA HDD, generally, you get:

# cat max_hw_sectors_kb
32764
# cat max_sectors_kb
1280

So 32MB max command size per hardware limitations. That one cannot be changed.
The block IO layer uses the 1280KB soft limit defined by max_sectors
(max_sectors_kb in sysfs) but the user can tune this from 1 sector up to
max_hw_sectors_kb.

> 
>>> The other two users of iomap for buffered io (gfs2 and zonefs) don't
>>> have a means to defer and combine ioends like xfs does.  Do you think
>>> they should?  I think it's still possible to trip the softlockup there.
>>>
>>
>> I'm not sure. We'd probably want some feedback from developers of
>> filesystems other than XFS before incorporating a change like this. The
>> first patch in the series more just provides some infrastructure for
>> other filesystems to avoid the problem as they see fit.
> 
> Hmm.  Any input from the two other users of iomap buffered IO?  Who are
> now directly in the to: list? :D
> 
> Catch-up TLDR: we're evaluating a proposal to limit the length of an
> iomap writeback ioend to 4096 pages so that we don't trip the hangcheck
> warning while clearing pagewriteback if the ioend completion happens to
> run in softirq context (e.g. nvme completion).
> 
> --D
> 
>> Brian
>>
>>> --D
>>>
>>>> + * high throughput and large bios, the ioends are merged on completion and
>>>> + * processed in large, efficient chunks with no additional IO latency.
>>>> + */
>>>> +#define IOEND_MAX_IOSIZE	(4096ULL << PAGE_SHIFT)
>>>> +
>>>>  struct iomap_writeback_ops {
>>>>  	/*
>>>>  	 * Required, maps the blocks so that writeback can be performed on
>>>> -- 
>>>> 2.26.3
>>>>
>>>
>>
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages
  2021-05-25  4:20       ` Darrick J. Wong
  2021-05-25  4:29         ` Damien Le Moal
@ 2021-05-25  7:13         ` Dave Chinner
  2021-05-25  9:07         ` Andreas Gruenbacher
  2021-05-26  2:12         ` Matthew Wilcox
  3 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2021-05-25  7:13 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Brian Foster, Damien Le Moal, Andreas Gruenbacher, linux-xfs,
	linux-fsdevel

On Mon, May 24, 2021 at 09:20:35PM -0700, Darrick J. Wong wrote:
> <shrug> I think I'm comfortable starting with 256 for xfs to bump an
> ioend to a workqueue, and 4096 pages as the limit for an iomap ioend.
> If people demonstrate a need to smart-tune or manual-tune we can always
> add one later.
> 
> Though I guess I did kind of wonder if maybe a better limit for iomap
> would be max_hw_sectors?  Since that's the maximum size of an IO that
> the kernel will issue for that device?
>
> (Hm, maybe not; my computers all have it set to 1280k, which is a
> pathetic 20 pages on a 64k-page system.)

I've got samsung nvme devices here that set max_hw_sectors_kb to
128kB....

But per-device limits like this ignore the fact that RAID devices
advertise an optimal IO size for submissions:

$ cat /sys/block/dm-0/queue/optimal_io_size
1048576
$ cat /sys/block/dm-0/queue/max_sectors_kb
128

IOWs, we might be trying to feed lots of devices through the one
submission, so using a "device" limit isn't really something we
should be doing. Ideally I think we should be looking at some
multiple of the maximum optimal IO size that the underlying device
advertises, if one is set, and otherwise at a byte limit of some kind....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages
  2021-05-25  4:20       ` Darrick J. Wong
  2021-05-25  4:29         ` Damien Le Moal
  2021-05-25  7:13         ` Dave Chinner
@ 2021-05-25  9:07         ` Andreas Gruenbacher
  2021-05-26  2:12         ` Matthew Wilcox
  3 siblings, 0 replies; 24+ messages in thread
From: Andreas Gruenbacher @ 2021-05-25  9:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, Damien Le Moal, linux-xfs, linux-fsdevel

On Tue, May 25, 2021 at 6:20 AM Darrick J. Wong <djwong@kernel.org> wrote:
> On Mon, May 24, 2021 at 08:02:18AM -0400, Brian Foster wrote:
> > On Thu, May 20, 2021 at 04:27:37PM -0700, Darrick J. Wong wrote:
> > > On Mon, May 17, 2021 at 01:17:22PM -0400, Brian Foster wrote:
> > > > The iomap writeback infrastructure is currently able to construct
> > > > extremely large bio chains (tens of GBs) associated with a single
> > > > ioend. This consolidation provides no significant value as bio
> > > > chains increase beyond a reasonable minimum size. On the other hand,
> > > > this does hold significant numbers of pages in the writeback
> > > > state across an unnecessarily large number of bios because the ioend
> > > > is not processed for completion until the final bio in the chain
> > > > completes. Cap an individual ioend to a reasonable size of 4096
> > > > pages (16MB with 4k pages) to avoid this condition.
> > > >
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/iomap/buffered-io.c |  6 ++++--
> > > >  include/linux/iomap.h  | 26 ++++++++++++++++++++++++++
> > > >  2 files changed, 30 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 642422775e4e..f2890ee434d0 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -1269,7 +1269,7 @@ iomap_chain_bio(struct bio *prev)
> > > >
> > > >  static bool
> > > >  iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
> > > > -         sector_t sector)
> > > > +         unsigned len, sector_t sector)
> > > >  {
> > > >   if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
> > > >       (wpc->ioend->io_flags & IOMAP_F_SHARED))
> > > > @@ -1280,6 +1280,8 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
> > > >           return false;
> > > >   if (sector != bio_end_sector(wpc->ioend->io_bio))
> > > >           return false;
> > > > + if (wpc->ioend->io_size + len > IOEND_MAX_IOSIZE)
> > > > +         return false;
> > > >   return true;
> > > >  }
> > > >
> > > > @@ -1297,7 +1299,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
> > > >   unsigned poff = offset & (PAGE_SIZE - 1);
> > > >   bool merged, same_page = false;
> > > >
> > > > - if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
> > > > + if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, len, sector)) {
> > > >           if (wpc->ioend)
> > > >                   list_add(&wpc->ioend->io_list, iolist);
> > > >           wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc);
> > > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > > index 07f3f4e69084..89b15cc236d5 100644
> > > > --- a/include/linux/iomap.h
> > > > +++ b/include/linux/iomap.h
> > > > @@ -203,6 +203,32 @@ struct iomap_ioend {
> > > >   struct bio              io_inline_bio;  /* MUST BE LAST! */
> > > >  };
> > > >
> > > > +/*
> > > > + * Maximum ioend IO size is used to prevent ioends from becoming unbound in
> > > > + * size. bios can reach 4GB in size if pages are contiguous, and bio chains are
> > > > + * effectively unbound in length. Hence the only limits on the size of the bio
> > > > + * chain is the contiguity of the extent on disk and the length of the run of
> > > > + * sequential dirty pages in the page cache. This can be tens of GBs of physical
> > > > + * extents and if memory is large enough, tens of millions of dirty pages.
> > > > + * Locking them all under writeback until the final bio in the chain is
> > > > + * submitted and completed locks all those pages for the legnth of time it takes
> > >
> > > s/legnth/length/
> > >
> >
> > Fixed.
> >
> > > > + * to write those many, many GBs of data to storage.
> > > > + *
> > > > + * Background writeback caps any single writepages call to half the device
> > > > + * bandwidth to ensure fairness and prevent any one dirty inode causing
> > > > + * writeback starvation. fsync() and other WB_SYNC_ALL writebacks have no such
> > > > + * cap on wbc->nr_pages, and that's where the above massive bio chain lengths
> > > > + * come from. We want large IOs to reach the storage, but we need to limit
> > > > + * completion latencies, hence we need to control the maximum IO size we
> > > > + * dispatch to the storage stack.
> > > > + *
> > > > + * We don't really have to care about the extra IO completion overhead here
> > > > + * because iomap has contiguous IO completion merging. If the device can sustain
> > >
> > > Assuming you're referring to iomap_finish_ioends, only XFS employs the
> > > ioend completion merging, and only for ioends where it decides to
> > > override the default bi_end_io.  iomap on its own never calls
> > > iomap_ioend_try_merge.
> > >
> > > This patch establishes a maximum ioend size of 4096 pages so that we
> > > don't trip the lockup watchdog while clearing pagewriteback and also so
> > > that we don't pin a large number of pages while constructing a big chain
> > > of bios.  On gfs2 and zonefs, each ioend completion will now have to
> > > clear up to 4096 pages from whatever context bio_endio is called.
> > >
> > > For XFS it's a bit more complicated -- XFS already overrode the bio handler
> > > for ioends that required further metadata updates (e.g. unwritten
> > > conversion, eof extension, or cow) so that it could combine ioends when
> > > possible.  XFS wants to combine ioends to amortize the cost of getting
> > > the ILOCK and running transactions over a larger number of pages.
> > >
> > > So I guess I see how the two changes dovetail nicely for XFS -- iomap
> > > issues smaller write bios, and the xfs ioend worker can recombine
> > > however many bios complete before the worker runs.  As a bonus, we don't
> > > have to worry about situations like the device driver completing so many
> > > bios from a single invocation of a bottom half handler that we run afoul
> > > of the soft lockup timer.
> > >
> > > Is that a correct understanding of how the two changes intersect with
> > > each other?  TBH I was expecting the two thresholds to be closer in
> > > value.
> > >
> >
> > I think so. That's interesting because my inclination was to make them
> > farther apart (or more specifically, increase the threshold in this
> > patch and leave the previous). The primary goal of this series was to
> > address the soft lockup warning problem, hence the thresholds on earlier
> > versions started at rather conservative values. I think both values have
> > been reasonably justified in being reduced, though this patch has a more
> > broad impact than the previous in that it changes behavior for all iomap
> > based fs'. Of course that's something that could also be addressed with
> > a more dynamic tunable..
>
> <shrug> I think I'm comfortable starting with 256 for xfs to bump an
> ioend to a workqueue, and 4096 pages as the limit for an iomap ioend.
> If people demonstrate a need to smart-tune or manual-tune we can always
> add one later.
>
> Though I guess I did kind of wonder if maybe a better limit for iomap
> would be max_hw_sectors?  Since that's the maximum size of an IO that
> the kernel will issue for that device?
>
> (Hm, maybe not; my computers all have it set to 1280k, which is a
> pathetic 20 pages on a 64k-page system.)
>
> > > The other two users of iomap for buffered io (gfs2 and zonefs) don't
> > > have a means to defer and combine ioends like xfs does.  Do you think
> > > they should?  I think it's still possible to trip the softlockup there.
> > >
> >
> > I'm not sure. We'd probably want some feedback from developers of
> > filesystems other than XFS before incorporating a change like this. The
> > first patch in the series more just provides some infrastructure for
> > other filesystems to avoid the problem as they see fit.
>
> Hmm.  Any input from the two other users of iomap buffered IO?  Who are
> now directly in the to: list? :D
>
> Catch-up TLDR: we're evaluating a proposal to limit the length of an
> iomap writeback ioend to 4096 pages so that we don't trip the hangcheck
> warning while clearing pagewriteback if the ioend completion happens to
> run in softirq context (e.g. nvme completion).

That's fine for gfs2. Due to the way our allocator works, our extents
typically are at most 509 blocks long (< 2 MB), which already limits
the maximum size. We have plans for fixing that, but even then, any
somewhat sane limit should do.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context
  2021-05-24 16:53           ` Darrick J. Wong
@ 2021-05-26  1:19             ` Darrick J. Wong
  0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2021-05-26  1:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: Matthew Wilcox, linux-xfs, linux-fsdevel

On Mon, May 24, 2021 at 09:53:05AM -0700, Darrick J. Wong wrote:
> On Mon, May 24, 2021 at 07:57:31AM -0400, Brian Foster wrote:
> > On Thu, May 20, 2021 at 02:58:58PM -0700, Darrick J. Wong wrote:
> > > On Tue, May 18, 2021 at 07:38:01AM -0400, Brian Foster wrote:
> > > > On Mon, May 17, 2021 at 06:54:34PM +0100, Matthew Wilcox wrote:
> > > > > On Mon, May 17, 2021 at 01:17:20PM -0400, Brian Foster wrote:
> > > > > > @@ -1084,9 +1084,12 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> > > > > >  			next = bio->bi_private;
> > > > > >  
> > > > > >  		/* walk each page on bio, ending page IO on them */
> > > > > > -		bio_for_each_segment_all(bv, bio, iter_all)
> > > > > > +		bio_for_each_segment_all(bv, bio, iter_all) {
> > > > > >  			iomap_finish_page_writeback(inode, bv->bv_page, error,
> > > > > >  					bv->bv_len);
> > > > > > +			if (!atomic)
> > > > > > +				cond_resched();
> > > > > > +		}
> > > > > 
> > > > > I don't know that it makes sense to check after _every_ page.  I might
> > > > > go for every segment.  Some users check after every thousand pages.
> > > > > 
> > > > 
> > > > The handful of examples I come across on a brief scan (including the
> > > > other iomap usage) have a similar pattern as used here. I don't doubt
> > > > there are others, but I think I'd prefer to have more reasoning behind
> > > > adding more code than might be necessary (i.e. do we expect additional
> > > > overhead to be measurable here?). As it is, the intent isn't so much to
> > > > check on every page as much as this just happens to be the common point
> > > > of the function to cover both long bio chains and single vector bios
> > > > with large numbers of pages.
> > > 
> > > It's been a while since I waded through the macro hell to find out what
> > > cond_resched actually does, but iirc it can do some fairly heavyweight
> > > things (disable preemption, call the scheduler, rcu stuff) which is why
> > > we're supposed to be a little judicious about amortizing each call over
> > > a few thousand pages.
> > > 
> > 
> > It looks to me like it just checks some state bit and only does any
> > work when actually necessary. I suppose calling it less often is
> > cheaper than calling it more often, but it's not clear to me the
> > difference is big enough to really matter and/or warrant more code to
> > filter out calls..
> > 
> > What exactly did you have in mind for logic? I suppose we could always
> > stuff an 'if (!(count++ % 1024)) cond_resched();' or some such in the
> > inner loop, but that might have less of an effect on larger chains
> > constructed of bios with fewer pages (depending on whether that might
> > still be possible).
> 
> I /was/ thinking about a function level page counter until I noticed
> that iomap_{write,unshare}_actor call cond_resched for every page it
> touches.  I withdraw the comment. :)

Oh, also:
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
> --D
> 
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > 
> > 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/3] xfs: kick large ioends to completion workqueue
  2021-05-17 17:17 ` [PATCH v3 2/3] xfs: kick large ioends to completion workqueue Brian Foster
@ 2021-05-26  1:20   ` Darrick J. Wong
  0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2021-05-26  1:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel

On Mon, May 17, 2021 at 01:17:21PM -0400, Brian Foster wrote:
> We've had reports of soft lockup warnings in the iomap ioend
> completion path due to very large bios and/or bio chains. This
> occurs because ioend completion touches every page associated with
> the ioend. It generally requires exceedingly large (i.e. multi-GB)
> bios or bio chains to reproduce a soft lockup warning, but even with
> smaller ioends there's really no good reason to incur the cost of
> potential cacheline misses in bio completion context. Divert ioends
> larger than 1MB to the workqueue so completion occurs in non-atomic
> context and can reschedule to avoid soft lockup warnings.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Will give this a spin on the test farm overnight but at least in
principle this seems fine to me:

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_aops.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 84cd6cf46b12..05b1bb146f17 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
>  	return container_of(ctx, struct xfs_writepage_ctx, ctx);
>  }
>  
> +/*
> + * Completion touches every page associated with the ioend. Send anything
> + * larger than 1MB (based on 4k pages) or so to the completion workqueue to
> + * avoid this work in bio completion context.
> + */
> +#define XFS_LARGE_IOEND	(256ULL << PAGE_SHIFT)
> +
>  /*
>   * Fast and loose check if this write could update the on-disk inode size.
>   */
> @@ -409,9 +416,14 @@ xfs_prepare_ioend(
>  
>  	memalloc_nofs_restore(nofs_flag);
>  
> -	/* send ioends that might require a transaction to the completion wq */
> +	/*
> +	 * Send ioends that might require a transaction or are large enough that
> +	 * we don't want to do page processing in bio completion context to the
> +	 * wq.
> +	 */
>  	if (xfs_ioend_is_append(ioend) || ioend->io_type == IOMAP_UNWRITTEN ||
> -	    (ioend->io_flags & IOMAP_F_SHARED))
> +	    (ioend->io_flags & IOMAP_F_SHARED) ||
> +	    ioend->io_size >= XFS_LARGE_IOEND)
>  		ioend->io_bio->bi_end_io = xfs_end_bio;
>  	return status;
>  }
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages
  2021-05-25  4:20       ` Darrick J. Wong
                           ` (2 preceding siblings ...)
  2021-05-25  9:07         ` Andreas Gruenbacher
@ 2021-05-26  2:12         ` Matthew Wilcox
  2021-05-26  3:32           ` Darrick J. Wong
  3 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2021-05-26  2:12 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Brian Foster, Damien Le Moal, Andreas Gruenbacher, linux-xfs,
	linux-fsdevel

On Mon, May 24, 2021 at 09:20:35PM -0700, Darrick J. Wong wrote:
> > > This patch establishes a maximum ioend size of 4096 pages so that we
> > > don't trip the lockup watchdog while clearing pagewriteback and also so
> > > that we don't pin a large number of pages while constructing a big chain
> > > of bios.  On gfs2 and zonefs, each ioend completion will now have to
> > > clear up to 4096 pages from whatever context bio_endio is called.
> > > 
> > > For XFS it's a bit more complicated -- XFS already overrode the bio handler
> > > for ioends that required further metadata updates (e.g. unwritten
> > > conversion, eof extension, or cow) so that it could combine ioends when
> > > possible.  XFS wants to combine ioends to amortize the cost of getting
> > > the ILOCK and running transactions over a larger number of pages.
> > > 
> > > So I guess I see how the two changes dovetail nicely for XFS -- iomap
> > > issues smaller write bios, and the xfs ioend worker can recombine
> > > however many bios complete before the worker runs.  As a bonus, we don't
> > > have to worry about situations like the device driver completing so many
> > > bios from a single invocation of a bottom half handler that we run afoul
> > > of the soft lockup timer.
> > > 
> > > Is that a correct understanding of how the two changes intersect with
> > > each other?  TBH I was expecting the two thresholds to be closer in
> > > value.
> > > 
> > 
> > I think so. That's interesting because my inclination was to make them
> > farther apart (or more specifically, increase the threshold in this
> > patch and leave the previous). The primary goal of this series was to
> > address the soft lockup warning problem, hence the thresholds on earlier
> > versions started at rather conservative values. I think both values have
> > been reasonably justified in being reduced, though this patch has a more
> > broad impact than the previous in that it changes behavior for all iomap
> > based fs'. Of course that's something that could also be addressed with
> > a more dynamic tunable..
> 
> <shrug> I think I'm comfortable starting with 256 for xfs to bump an
> ioend to a workqueue, and 4096 pages as the limit for an iomap ioend.
> If people demonstrate a need to smart-tune or manual-tune we can always
> add one later.
> 
> Though I guess I did kind of wonder if maybe a better limit for iomap
> would be max_hw_sectors?  Since that's the maximum size of an IO that
> the kernel will issue for that device?

I think you're looking at this wrong.  The question is whether the
system can tolerate the additional latency of bumping to a workqueue vs
servicing directly.

If the I/O is large, then clearly it can.  It already waited for all
those DMAs to happen which took a certain amount of time on the I/O bus.
If the I/O is small, then maybe it can and maybe it can't.  So we should
be conservative and complete it in interrupt context.

This is why I think "number of pages" is really a red herring.  Sure,
that's the amount of work to be done, but really the question is "can
this I/O tolerate the extra delay".  Short of passing that information
in from the caller, number of bytes really is our best way of knowing.
And that doesn't scale with anything to do with the device or the
system bus.  

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages
  2021-05-26  2:12         ` Matthew Wilcox
@ 2021-05-26  3:32           ` Darrick J. Wong
  0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2021-05-26  3:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Brian Foster, Damien Le Moal, Andreas Gruenbacher, linux-xfs,
	linux-fsdevel

On Wed, May 26, 2021 at 03:12:50AM +0100, Matthew Wilcox wrote:
> On Mon, May 24, 2021 at 09:20:35PM -0700, Darrick J. Wong wrote:
> > > > This patch establishes a maximum ioend size of 4096 pages so that we
> > > > don't trip the lockup watchdog while clearing pagewriteback and also so
> > > > that we don't pin a large number of pages while constructing a big chain
> > > > of bios.  On gfs2 and zonefs, each ioend completion will now have to
> > > > clear up to 4096 pages from whatever context bio_endio is called.
> > > > 
> > > > For XFS it's a bit more complicated -- XFS already overrode the bio handler
> > > > for ioends that required further metadata updates (e.g. unwritten
> > > > conversion, eof extension, or cow) so that it could combine ioends when
> > > > possible.  XFS wants to combine ioends to amortize the cost of getting
> > > > the ILOCK and running transactions over a larger number of pages.
> > > > 
> > > > So I guess I see how the two changes dovetail nicely for XFS -- iomap
> > > > issues smaller write bios, and the xfs ioend worker can recombine
> > > > however many bios complete before the worker runs.  As a bonus, we don't
> > > > have to worry about situations like the device driver completing so many
> > > > bios from a single invocation of a bottom half handler that we run afoul
> > > > of the soft lockup timer.
> > > > 
> > > > Is that a correct understanding of how the two changes intersect with
> > > > each other?  TBH I was expecting the two thresholds to be closer in
> > > > value.
> > > > 
> > > 
> > > I think so. That's interesting because my inclination was to make them
> > > farther apart (or more specifically, increase the threshold in this
> > > patch and leave the previous). The primary goal of this series was to
> > > address the soft lockup warning problem, hence the thresholds on earlier
> > > versions started at rather conservative values. I think both values have
> > > been reasonably justified in being reduced, though this patch has a more
> > > broad impact than the previous in that it changes behavior for all iomap
> > > based fs'. Of course that's something that could also be addressed with
> > > a more dynamic tunable..
> > 
> > <shrug> I think I'm comfortable starting with 256 for xfs to bump an
> > ioend to a workqueue, and 4096 pages as the limit for an iomap ioend.
> > If people demonstrate a need to smart-tune or manual-tune we can always
> > add one later.
> > 
> > Though I guess I did kind of wonder if maybe a better limit for iomap
> > would be max_hw_sectors?  Since that's the maximum size of an IO that
> > the kernel will issue for that device?
> 
> I think you're looking at this wrong.  The question is whether the
> system can tolerate the additional latency of bumping to a workqueue vs
> servicing directly.
> 
> If the I/O is large, then clearly it can.  It already waited for all
> those DMAs to happen which took a certain amount of time on the I/O bus.
> If the I/O is small, then maybe it can and maybe it can't.  So we should
> be conservative and complete it in interrupt context.
> 
> This is why I think "number of pages" is really a red herring.  Sure,
> that's the amount of work to be done, but really the question is "can
> this I/O tolerate the extra delay".  Short of passing that information
> in from the caller, number of bytes really is our best way of knowing.
> And that doesn't scale with anything to do with the device or the
> system bus.  

It doesn't matter whether the process(es) that triggered writeback will
tolerate the extra latency of a workqueue.  The hangcheck timer trips,
which means we've been doing things in softirq context too long.

The next thing that happens is that the kind of people who treat **ANY**
stack trace in dmesg as grounds to file a bug and escalate it will file
a bug and escalate it, and now I'm working 10 hour days trying to stomp
down all 6 escalations, run a QA botnet, review patches, and make any
incremental progress on long term goals when I can squeeze out five
minutes of free time.

Yeah, it'd be nice to rebuild writeback with some sort of QOS system so
that it could pick different strategies based on the amount of work to
do and the impatience levels of the processes waiting for it.  But that
is a project of its own.  This is a starter fix to take the heat off.

The reason I've been running at 110% burnout for the last 9 months is
exactly this -- someone submits a patchset to fix or improve something,
but then the reviewers pile on with "No no no, you should consider
building this far more elaborate solution", withhold review tags, but
then seem to be too busy to participate in building the elaborate thing.

At least in this case I can do something about it.  We're nearly to rc4
so barring anything weird showing up in QA runs overnight I plan to
stuff this in for 5.14.

--D

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-05-26  3:32 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-17 17:17 [PATCH v3 0/3] iomap: avoid soft lockup warnings on large ioends Brian Foster
2021-05-17 17:17 ` [PATCH v3 1/3] iomap: resched ioend completion when in non-atomic context Brian Foster
2021-05-17 17:54   ` Matthew Wilcox
2021-05-18 11:38     ` Brian Foster
2021-05-20 21:58       ` Darrick J. Wong
2021-05-24 11:57         ` Brian Foster
2021-05-24 16:53           ` Darrick J. Wong
2021-05-26  1:19             ` Darrick J. Wong
2021-05-22  7:45   ` Ming Lei
2021-05-24 11:57     ` Brian Foster
2021-05-24 14:11       ` Ming Lei
2021-05-17 17:17 ` [PATCH v3 2/3] xfs: kick large ioends to completion workqueue Brian Foster
2021-05-26  1:20   ` Darrick J. Wong
2021-05-17 17:17 ` [PATCH RFC v3 3/3] iomap: bound ioend size to 4096 pages Brian Foster
2021-05-19 13:28   ` Christoph Hellwig
2021-05-19 14:52     ` Brian Foster
2021-05-20 23:27   ` Darrick J. Wong
2021-05-24 12:02     ` Brian Foster
2021-05-25  4:20       ` Darrick J. Wong
2021-05-25  4:29         ` Damien Le Moal
2021-05-25  7:13         ` Dave Chinner
2021-05-25  9:07         ` Andreas Gruenbacher
2021-05-26  2:12         ` Matthew Wilcox
2021-05-26  3:32           ` Darrick J. Wong
