* [PATCH 0/2] iomap: avoid soft lockup warnings on large ioends
@ 2020-10-02 15:33 Brian Foster
  2020-10-02 15:33 ` [PATCH 1/2] iomap: resched ioend completion when in non-atomic context Brian Foster
  2020-10-02 15:33 ` [PATCH 2/2] xfs: kick extra large ioends to completion workqueue Brian Foster
  0 siblings, 2 replies; 16+ messages in thread
From: Brian Foster @ 2020-10-02 15:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs

Hi all,

My understanding is that there's still no real agreement on the proper
approach to address this problem. The RFC I floated [1] intended to cap
the size of ioends to avoid any latency issues with holding so many
pages in writeback for effectively a single completion instance of a GB+
sized I/O. Instead, Christoph preferred to dump those large bios onto
the completion workqueue and use cond_resched() rather than cap the
ioend size. This series implements the latter (for XFS) since it seems
like incremental progress and should at least address the warning.
Thoughts, reviews, flames appreciated.

Brian

[1] https://lore.kernel.org/linux-fsdevel/20200825144917.GA321765@bfoster/

Brian Foster (2):
  iomap: resched ioend completion when in non-atomic context
  xfs: kick extra large ioends to completion workqueue

 fs/iomap/buffered-io.c | 15 +++++++++------
 fs/xfs/xfs_aops.c      | 12 ++++++++++--
 include/linux/iomap.h  |  2 +-
 3 files changed, 20 insertions(+), 9 deletions(-)

-- 
2.25.4



* [PATCH 1/2] iomap: resched ioend completion when in non-atomic context
  2020-10-02 15:33 [PATCH 0/2] iomap: avoid soft lockup warnings on large ioends Brian Foster
@ 2020-10-02 15:33 ` Brian Foster
  2020-10-02 15:33 ` [PATCH 2/2] xfs: kick extra large ioends to completion workqueue Brian Foster
  1 sibling, 0 replies; 16+ messages in thread
From: Brian Foster @ 2020-10-02 15:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs

The iomap ioend mechanism has the ability to construct very large,
contiguous bios and/or bio chains. This has been reported to lead to
soft lockup warnings in bio completion due to the amount of page
processing that occurs. Update the ioend completion path with a
parameter to indicate atomic context and insert a cond_resched()
call to avoid soft lockups in either scenario.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 15 +++++++++------
 fs/xfs/xfs_aops.c      |  2 +-
 include/linux/iomap.h  |  2 +-
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..5dfdb77a05b2 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1092,7 +1092,7 @@ iomap_finish_page_writeback(struct inode *inode, struct page *page,
  * ioend after this.
  */
 static void
-iomap_finish_ioend(struct iomap_ioend *ioend, int error)
+iomap_finish_ioend(struct iomap_ioend *ioend, int error, bool atomic)
 {
 	struct inode *inode = ioend->io_inode;
 	struct bio *bio = &ioend->io_inline_bio;
@@ -1115,8 +1115,11 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 			next = bio->bi_private;
 
 		/* walk each page on bio, ending page IO on them */
-		bio_for_each_segment_all(bv, bio, iter_all)
+		bio_for_each_segment_all(bv, bio, iter_all) {
 			iomap_finish_page_writeback(inode, bv->bv_page, error);
+			if (!atomic)
+				cond_resched();
+		}
 		bio_put(bio);
 	}
 	/* The ioend has been freed by bio_put() */
@@ -1129,17 +1132,17 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 }
 
 void
-iomap_finish_ioends(struct iomap_ioend *ioend, int error)
+iomap_finish_ioends(struct iomap_ioend *ioend, int error, bool atomic)
 {
 	struct list_head tmp;
 
 	list_replace_init(&ioend->io_list, &tmp);
-	iomap_finish_ioend(ioend, error);
+	iomap_finish_ioend(ioend, error, atomic);
 
 	while (!list_empty(&tmp)) {
 		ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
 		list_del_init(&ioend->io_list);
-		iomap_finish_ioend(ioend, error);
+		iomap_finish_ioend(ioend, error, atomic);
 	}
 }
 EXPORT_SYMBOL_GPL(iomap_finish_ioends);
@@ -1208,7 +1211,7 @@ static void iomap_writepage_end_bio(struct bio *bio)
 {
 	struct iomap_ioend *ioend = bio->bi_private;
 
-	iomap_finish_ioend(ioend, blk_status_to_errno(bio->bi_status));
+	iomap_finish_ioend(ioend, blk_status_to_errno(bio->bi_status), true);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b35611882ff9..3e061ea99922 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -188,7 +188,7 @@ xfs_end_ioend(
 done:
 	if (ioend->io_private)
 		error = xfs_setfilesize_ioend(ioend, error);
-	iomap_finish_ioends(ioend, error);
+	iomap_finish_ioends(ioend, error, false);
 	memalloc_nofs_restore(nofs_flag);
 }
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9..4d3778dc4318 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -229,7 +229,7 @@ struct iomap_writepage_ctx {
 	const struct iomap_writeback_ops *ops;
 };
 
-void iomap_finish_ioends(struct iomap_ioend *ioend, int error);
+void iomap_finish_ioends(struct iomap_ioend *ioend, int error, bool atomic);
 void iomap_ioend_try_merge(struct iomap_ioend *ioend,
 		struct list_head *more_ioends,
 		void (*merge_private)(struct iomap_ioend *ioend,
-- 
2.25.4



* [PATCH 2/2] xfs: kick extra large ioends to completion workqueue
  2020-10-02 15:33 [PATCH 0/2] iomap: avoid soft lockup warnings on large ioends Brian Foster
  2020-10-02 15:33 ` [PATCH 1/2] iomap: resched ioend completion when in non-atomic context Brian Foster
@ 2020-10-02 15:33 ` Brian Foster
  2020-10-02 16:19   ` Darrick J. Wong
                     ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Brian Foster @ 2020-10-02 15:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs

We've had reports of soft lockup warnings in the iomap ioend
completion path due to very large bios and/or bio chains. Divert any
ioends with 256k or more pages to process to the workqueue so
completion occurs in non-atomic context and can reschedule to avoid
soft lockup warnings.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_aops.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3e061ea99922..84ee917014f1 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
 	return container_of(ctx, struct xfs_writepage_ctx, ctx);
 }
 
+/*
+ * Kick extra large ioends off to the workqueue. Completion will process a lot
+ * of pages for a large bio or bio chain and a non-atomic context is required to
+ * reschedule and avoid soft lockup warnings.
+ */
+#define XFS_LARGE_IOEND	(262144 << PAGE_SHIFT)
+
 /*
  * Fast and loose check if this write could update the on-disk inode size.
  */
@@ -239,7 +246,8 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
 {
 	return ioend->io_private ||
 		ioend->io_type == IOMAP_UNWRITTEN ||
-		(ioend->io_flags & IOMAP_F_SHARED);
+		(ioend->io_flags & IOMAP_F_SHARED) ||
+		(ioend->io_size >= XFS_LARGE_IOEND);
 }
 
 STATIC void
-- 
2.25.4



* Re: [PATCH 2/2] xfs: kick extra large ioends to completion workqueue
  2020-10-02 15:33 ` [PATCH 2/2] xfs: kick extra large ioends to completion workqueue Brian Foster
@ 2020-10-02 16:19   ` Darrick J. Wong
  2020-10-02 16:38     ` Brian Foster
  2020-10-03  0:26   ` kernel test robot
  2020-10-05 15:21   ` [PATCH v2 " Brian Foster
  2 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2020-10-02 16:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

On Fri, Oct 02, 2020 at 11:33:57AM -0400, Brian Foster wrote:
> We've had reports of soft lockup warnings in the iomap ioend
> completion path due to very large bios and/or bio chains. Divert any
> ioends with 256k or more pages to process to the workqueue so
> completion occurs in non-atomic context and can reschedule to avoid
> soft lockup warnings.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3e061ea99922..84ee917014f1 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
>  	return container_of(ctx, struct xfs_writepage_ctx, ctx);
>  }
>  
> +/*
> + * Kick extra large ioends off to the workqueue. Completion will process a lot
> + * of pages for a large bio or bio chain and a non-atomic context is required to
> + * reschedule and avoid soft lockup warnings.
> + */
> +#define XFS_LARGE_IOEND	(262144 << PAGE_SHIFT)

Hm, shouldn't that 262144 have to be annotated with a 'ULL' so that a
dumb compiler won't turn that into a u32 and shift that all the way to
zero?
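
For example (just an illustration, assuming a 16k page size so that
PAGE_SHIFT == 14):

	(262144 << 14)		/* 2^18 << 14 needs 34 bits, overflows a 32-bit int */
	(262144ULL << 14)	/* evaluated as unsigned long long == 4GB byte threshold */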

I still kind of wonder about letting the limit hit 16G on power with
64k pages, but I guess the number of pages we have to whack is ... not
that high?

I dunno, if you fire up a 64k-page system with fantastical IO
capabilities, attach a realtime volume, fallocate a 32G file and then
try to write to that, will it actually turn that into one gigantic IO?

> +
>  /*
>   * Fast and loose check if this write could update the on-disk inode size.
>   */
> @@ -239,7 +246,8 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
>  {
>  	return ioend->io_private ||
>  		ioend->io_type == IOMAP_UNWRITTEN ||
> -		(ioend->io_flags & IOMAP_F_SHARED);
> +		(ioend->io_flags & IOMAP_F_SHARED) ||
> +		(ioend->io_size >= XFS_LARGE_IOEND);
>  }
>  
>  STATIC void
> -- 
> 2.25.4
> 


* Re: [PATCH 2/2] xfs: kick extra large ioends to completion workqueue
  2020-10-02 16:19   ` Darrick J. Wong
@ 2020-10-02 16:38     ` Brian Foster
  0 siblings, 0 replies; 16+ messages in thread
From: Brian Foster @ 2020-10-02 16:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs

On Fri, Oct 02, 2020 at 09:19:23AM -0700, Darrick J. Wong wrote:
> On Fri, Oct 02, 2020 at 11:33:57AM -0400, Brian Foster wrote:
> > We've had reports of soft lockup warnings in the iomap ioend
> > completion path due to very large bios and/or bio chains. Divert any
> > ioends with 256k or more pages to process to the workqueue so
> > completion occurs in non-atomic context and can reschedule to avoid
> > soft lockup warnings.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 3e061ea99922..84ee917014f1 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
> >  	return container_of(ctx, struct xfs_writepage_ctx, ctx);
> >  }
> >  
> > +/*
> > + * Kick extra large ioends off to the workqueue. Completion will process a lot
> > + * of pages for a large bio or bio chain and a non-atomic context is required to
> > + * reschedule and avoid soft lockup warnings.
> > + */
> > +#define XFS_LARGE_IOEND	(262144 << PAGE_SHIFT)
> 
> Hm, shouldn't that 262144 have to be annotated with a 'ULL' so that a
> dumb compiler won't turn that into a u32 and shift that all the way to
> zero?
> 

Probably.. will fix.

> I still kind of wonder about letting the limit hit 16G on power with
> 64k pages, but I guess the number of pages we have to whack is ... not
> that high?
> 

TBH, the limit is kind of picked out of a hat since we don't have any
real data on the point where the page count becomes generally too high.
I originally was capping the size of the ioend, so for that I figured
1GB on 4k pages was conservative enough to still allow fairly large
ioends without doing too much page processing. This patch doesn't cap
the I/O size, so I suppose it might be more reasonable to reduce the
threshold if we wanted to. I don't really have a strong preference
either way. Hm?

> I dunno, if you fire up a 64k-page system with fantastical IO
> capabilities, attach a realtime volume, fallocate a 32G file and then
> try to write to that, will it actually turn that into one gigantic IO?
> 

Not sure, but one report we had was an x86_64 box pushing a 10GB+ bio
chain... :P

Brian

> > +
> >  /*
> >   * Fast and loose check if this write could update the on-disk inode size.
> >   */
> > @@ -239,7 +246,8 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
> >  {
> >  	return ioend->io_private ||
> >  		ioend->io_type == IOMAP_UNWRITTEN ||
> > -		(ioend->io_flags & IOMAP_F_SHARED);
> > +		(ioend->io_flags & IOMAP_F_SHARED) ||
> > +		(ioend->io_size >= XFS_LARGE_IOEND);
> >  }
> >  
> >  STATIC void
> > -- 
> > 2.25.4
> > 
> 



* Re: [PATCH 2/2] xfs: kick extra large ioends to completion workqueue
  2020-10-02 15:33 ` [PATCH 2/2] xfs: kick extra large ioends to completion workqueue Brian Foster
  2020-10-02 16:19   ` Darrick J. Wong
@ 2020-10-03  0:26   ` kernel test robot
  2020-10-05 15:21   ` [PATCH v2 " Brian Foster
  2 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2020-10-03  0:26 UTC (permalink / raw)
  To: Brian Foster, linux-fsdevel; +Cc: kbuild-all, clang-built-linux, linux-xfs

[-- Attachment #1: Type: text/plain, Size: 2546 bytes --]

Hi Brian,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on xfs-linux/for-next]
[also build test WARNING on v5.9-rc7]
[cannot apply to next-20201002]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Brian-Foster/iomap-avoid-soft-lockup-warnings-on-large-ioends/20201002-233417
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
config: arm64-randconfig-r014-20201002 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project bcd05599d0e53977a963799d6ee4f6e0bc21331b)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm64 cross compiling tool for clang build
        # apt-get install binutils-aarch64-linux-gnu
        # https://github.com/0day-ci/linux/commit/bfa33e76564c1e273d89f17ec39c1c5fd305c763
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Brian-Foster/iomap-avoid-soft-lockup-warnings-on-large-ioends/20201002-233417
        git checkout bfa33e76564c1e273d89f17ec39c1c5fd305c763
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/xfs/xfs_aops.c:250:22: warning: signed shift result (0x100000000) requires 34 bits to represent, but 'int' only has 32 bits [-Wshift-overflow]
                   (ioend->io_size >= XFS_LARGE_IOEND);
                                      ^~~~~~~~~~~~~~~
   fs/xfs/xfs_aops.c:38:33: note: expanded from macro 'XFS_LARGE_IOEND'
   #define XFS_LARGE_IOEND (262144 << PAGE_SHIFT)
                            ~~~~~~ ^  ~~~~~~~~~~
   1 warning generated.

vim +/int +250 fs/xfs/xfs_aops.c

   244	
   245	static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
   246	{
   247		return ioend->io_private ||
   248			ioend->io_type == IOMAP_UNWRITTEN ||
   249			(ioend->io_flags & IOMAP_F_SHARED) ||
 > 250			(ioend->io_size >= XFS_LARGE_IOEND);
   251	}
   252	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 40887 bytes --]


* [PATCH v2 2/2] xfs: kick extra large ioends to completion workqueue
  2020-10-02 15:33 ` [PATCH 2/2] xfs: kick extra large ioends to completion workqueue Brian Foster
  2020-10-02 16:19   ` Darrick J. Wong
  2020-10-03  0:26   ` kernel test robot
@ 2020-10-05 15:21   ` Brian Foster
  2020-10-06  3:55     ` Darrick J. Wong
  2 siblings, 1 reply; 16+ messages in thread
From: Brian Foster @ 2020-10-05 15:21 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs

We've had reports of soft lockup warnings in the iomap ioend
completion path due to very large bios and/or bio chains. Divert any
ioends with 256k or more pages to process to the workqueue so
completion occurs in non-atomic context and can reschedule to avoid
soft lockup warnings.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

v2:
- Fix type in macro.

 fs/xfs/xfs_aops.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3e061ea99922..c00cc0624986 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
 	return container_of(ctx, struct xfs_writepage_ctx, ctx);
 }
 
+/*
+ * Kick extra large ioends off to the workqueue. Completion will process a lot
+ * of pages for a large bio or bio chain and a non-atomic context is required to
+ * reschedule and avoid soft lockup warnings.
+ */
+#define XFS_LARGE_IOEND	(262144ULL << PAGE_SHIFT)
+
 /*
  * Fast and loose check if this write could update the on-disk inode size.
  */
@@ -239,7 +246,8 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
 {
 	return ioend->io_private ||
 		ioend->io_type == IOMAP_UNWRITTEN ||
-		(ioend->io_flags & IOMAP_F_SHARED);
+		(ioend->io_flags & IOMAP_F_SHARED) ||
+		(ioend->io_size >= XFS_LARGE_IOEND);
 }
 
 STATIC void
-- 
2.25.4



* Re: [PATCH v2 2/2] xfs: kick extra large ioends to completion workqueue
  2020-10-05 15:21   ` [PATCH v2 " Brian Foster
@ 2020-10-06  3:55     ` Darrick J. Wong
  2020-10-06 12:44       ` Brian Foster
  2020-10-06 14:07       ` Matthew Wilcox
  0 siblings, 2 replies; 16+ messages in thread
From: Darrick J. Wong @ 2020-10-06  3:55 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs

On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> We've had reports of soft lockup warnings in the iomap ioend
> completion path due to very large bios and/or bio chains. Divert any
> ioends with 256k or more pages to process to the workqueue so
> completion occurs in non-atomic context and can reschedule to avoid
> soft lockup warnings.

Hmmmm... is there any way we can just make end_page_writeback faster?

TBH it still strikes me as odd that we'd cap ioends this way just to
cover for the fact that we have to poke each and every page.

(Also, those 'bool atomic' in the other patch make me kind of nervous --
how do we make sure (from a QA perspective) that nobody gets that wrong?)

--D

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> v2:
> - Fix type in macro.
> 
>  fs/xfs/xfs_aops.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3e061ea99922..c00cc0624986 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
>  	return container_of(ctx, struct xfs_writepage_ctx, ctx);
>  }
>  
> +/*
> + * Kick extra large ioends off to the workqueue. Completion will process a lot
> + * of pages for a large bio or bio chain and a non-atomic context is required to
> + * reschedule and avoid soft lockup warnings.
> + */
> +#define XFS_LARGE_IOEND	(262144ULL << PAGE_SHIFT)
> +
>  /*
>   * Fast and loose check if this write could update the on-disk inode size.
>   */
> @@ -239,7 +246,8 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
>  {
>  	return ioend->io_private ||
>  		ioend->io_type == IOMAP_UNWRITTEN ||
> -		(ioend->io_flags & IOMAP_F_SHARED);
> +		(ioend->io_flags & IOMAP_F_SHARED) ||
> +		(ioend->io_size >= XFS_LARGE_IOEND);
>  }
>  
>  STATIC void
> -- 
> 2.25.4
> 


* Re: [PATCH v2 2/2] xfs: kick extra large ioends to completion workqueue
  2020-10-06  3:55     ` Darrick J. Wong
@ 2020-10-06 12:44       ` Brian Foster
  2021-05-06 19:31         ` Darrick J. Wong
  2020-10-06 14:07       ` Matthew Wilcox
  1 sibling, 1 reply; 16+ messages in thread
From: Brian Foster @ 2020-10-06 12:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs

On Mon, Oct 05, 2020 at 08:55:37PM -0700, Darrick J. Wong wrote:
> On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> > We've had reports of soft lockup warnings in the iomap ioend
> > completion path due to very large bios and/or bio chains. Divert any
> > ioends with 256k or more pages to process to the workqueue so
> > completion occurs in non-atomic context and can reschedule to avoid
> > soft lockup warnings.
> 
> Hmmmm... is there any way we can just make end_page_writeback faster?
> 

I'm not sure that would help us. It's not doing much work as it is. The
issue is just that we effectively queue so many of them to a single bio
completion due to either bio chaining or the physical page merging
implemented by multipage bvecs.

> TBH it still strikes me as odd that we'd cap ioends this way just to
> cover for the fact that we have to poke each and every page.
> 

I suppose, but it's not like we don't already account for constructing
bios that must be handed off to a workqueue for completion processing.
Also FWIW this doesn't cap ioend size like my original patch does. It
just updates XFS to send them to the completion workqueue.

> (Also, those 'bool atomic' in the other patch make me kind of nervous --
> how do we make sure (from a QA perspective) that nobody gets that wrong?)
> 

Yeah, that's a bit ugly. If somebody has a better idea on the factoring
I'm interested in hearing about it. My understanding is that in_atomic()
is not reliable and/or generally frowned upon, hence the explicit
context parameter.

Also, I don't have the error handy but my development kernel complains
quite clearly if we make a call that can potentially sleep in atomic
context. I believe this is the purpose of the __might_sleep()
(CONFIG_DEBUG_ATOMIC_SLEEP) annotation.
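
FWIW, one way to make the assumption explicit (just a sketch, not part
of the posted patches) would be to annotate the non-atomic path
directly:

static void
iomap_finish_ioend(struct iomap_ioend *ioend, int error, bool atomic)
{
	/*
	 * Hypothetical debug annotation: if a caller passes atomic == false
	 * from interrupt/softirq context, CONFIG_DEBUG_ATOMIC_SLEEP
	 * complains right here.
	 */
	if (!atomic)
		might_sleep();
	...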

Brian

> --D
> 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > v2:
> > - Fix type in macro.
> > 
> >  fs/xfs/xfs_aops.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 3e061ea99922..c00cc0624986 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
> >  	return container_of(ctx, struct xfs_writepage_ctx, ctx);
> >  }
> >  
> > +/*
> > + * Kick extra large ioends off to the workqueue. Completion will process a lot
> > + * of pages for a large bio or bio chain and a non-atomic context is required to
> > + * reschedule and avoid soft lockup warnings.
> > + */
> > +#define XFS_LARGE_IOEND	(262144ULL << PAGE_SHIFT)
> > +
> >  /*
> >   * Fast and loose check if this write could update the on-disk inode size.
> >   */
> > @@ -239,7 +246,8 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
> >  {
> >  	return ioend->io_private ||
> >  		ioend->io_type == IOMAP_UNWRITTEN ||
> > -		(ioend->io_flags & IOMAP_F_SHARED);
> > +		(ioend->io_flags & IOMAP_F_SHARED) ||
> > +		(ioend->io_size >= XFS_LARGE_IOEND);
> >  }
> >  
> >  STATIC void
> > -- 
> > 2.25.4
> > 
> 



* Re: [PATCH v2 2/2] xfs: kick extra large ioends to completion workqueue
  2020-10-06  3:55     ` Darrick J. Wong
  2020-10-06 12:44       ` Brian Foster
@ 2020-10-06 14:07       ` Matthew Wilcox
  2021-05-06 19:34         ` Darrick J. Wong
  1 sibling, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2020-10-06 14:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-fsdevel, linux-xfs

On Mon, Oct 05, 2020 at 08:55:37PM -0700, Darrick J. Wong wrote:
> On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> > We've had reports of soft lockup warnings in the iomap ioend
> > completion path due to very large bios and/or bio chains. Divert any
> > ioends with 256k or more pages to process to the workqueue so
> > completion occurs in non-atomic context and can reschedule to avoid
> > soft lockup warnings.
> 
> Hmmmm... is there any way we can just make end_page_writeback faster?

There are ways to make it faster.  I don't know if they're a "just"
solution ...

1. We can use THPs.  That will reduce the number of pages being operated
on.  I hear somebody might have a patch set for that.  Incidentally,
this patch set will clash with the THP patchset, so one of us is going to
have to rebase on the other's work.  Not a complaint, just acknowledging
that some coordination will be needed for the 5.11 merge window.

2. We could create end_writeback_pages(struct pagevec *pvec) which
calls a new test_clear_writeback_pages(pvec).  That could amortise
taking the memcg lock and finding the lruvec and taking the mapping
lock -- assuming these pages are sufficiently virtually contiguous.
It can definitely amortise all the statistics updates.

3. We can make wake_up_page(page, PG_writeback); more efficient.  If
you can produce this situation on demand, I had a patch for that which
languished due to lack of interest.

https://lore.kernel.org/linux-fsdevel/20200416220130.13343-1-willy@infradead.org/
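
Roughly the interface shape I have in mind for (2), with the names made
up and the locking/statistics amortisation handwaved into the new
helper:

/* hypothetical batched counterpart to end_page_writeback() */
void end_writeback_pages(struct pagevec *pvec)
{
	unsigned int i;

	/*
	 * Clear PG_writeback on every page in the pagevec, taking the
	 * memcg lock / lruvec / mapping lock once and batching the
	 * statistics updates.
	 */
	test_clear_writeback_pages(pvec);

	for (i = 0; i < pagevec_count(pvec); i++)
		wake_up_page(pvec->pages[i], PG_writeback);
}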



* Re: [PATCH v2 2/2] xfs: kick extra large ioends to completion workqueue
  2020-10-06 12:44       ` Brian Foster
@ 2021-05-06 19:31         ` Darrick J. Wong
  2021-05-07 14:06           ` Brian Foster
  0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2021-05-06 19:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-fsdevel, linux-xfs

On Tue, Oct 06, 2020 at 08:44:40AM -0400, Brian Foster wrote:
> On Mon, Oct 05, 2020 at 08:55:37PM -0700, Darrick J. Wong wrote:
> > On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> > > We've had reports of soft lockup warnings in the iomap ioend
> > > completion path due to very large bios and/or bio chains. Divert any
> > > ioends with 256k or more pages to process to the workqueue so

Heh, lol, so now we're hitting this internally too.  Certain customers
are sending 580,000-page bios, which then trip the hangcheck timer when
we stall the interrupt handler while clearing all the page bits.

> > > completion occurs in non-atomic context and can reschedule to avoid
> > > soft lockup warnings.
> > 
> > Hmmmm... is there any way we can just make end_page_writeback faster?
> > 
> 
> I'm not sure that would help us. It's not doing much work as it is. The
> issue is just that we effectively queue so many of them to a single bio
> completion due to either bio chaining or the physical page merging
> implemented by multipage bvecs.
> 
> > TBH it still strikes me as odd that we'd cap ioends this way just to
> > cover for the fact that we have to poke each and every page.
> > 
> 
> I suppose, but it's not like we don't already account for constructing
> bios that must be handed off to a workqueue for completion processing.
> Also FWIW this doesn't cap ioend size like my original patch does. It
> just updates XFS to send them to the completion workqueue.

<nod> So I guess I'm saying that my resistance to /this/ part of the
changes is melting away.  For a 2GB+ write IO, I guess the extra overhead
of poking a workqueue can be amortized over the sheer number of pages.

> > (Also, those 'bool atomic' in the other patch make me kind of nervous --
> > how do we make sure (from a QA perspective) that nobody gets that wrong?)
> > 
> 
> Yeah, that's a bit ugly. If somebody has a better idea on the factoring
> I'm interested in hearing about it. My understanding is that in_atomic()
> is not reliable and/or generally frowned upon, hence the explicit
> context parameter.
> 
> Also, I don't have the error handy but my development kernel complains
> quite clearly if we make a call that can potentially sleep in atomic
> context. I believe this is the purpose of the __might_sleep()
> (CONFIG_DEBUG_ATOMIC_SLEEP) annotation.

I wonder if it's not too late to establish a new iomap rule?

All clients whose ->prepare_ioend handler overrides the default
ioend->io_bio->bi_end_io handler must call iomap_finish_ioends from
process context, because the "only" reason why a filesystem would need
to do that is because some post-write metadata update is necessary, and
those really shouldn't be running from interrupt context.

With such a rule (no idea how we'd enforce that) we could at least
constrain that in_atomic variable to buffered-io.c, since the only time
it would be unsafe to call cond_resched() is if iomap_writepage_end_bio
is in use, and it decides to call iomap_finish_ioend directly.
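
Maybe something as simple as this in the exported helper (rough sketch,
not even compile tested, and it assumes the bool parameter from patch 1
goes away):

void
iomap_finish_ioends(struct iomap_ioend *ioend, int error)
{
	/*
	 * Hypothetical enforcement of the rule above: the exported
	 * completion helper may only be called from process context.
	 */
	WARN_ON_ONCE(in_interrupt());
	might_sleep();
	...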

Right now XFS is the only filesystem that overrides the bio endio
handler, and the only time it does that is for writes that need extra
metadata updates (unwritten conversion, setfilesize, cow).

--D

> Brian
> 
> > --D
> > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > > 
> > > v2:
> > > - Fix type in macro.
> > > 
> > >  fs/xfs/xfs_aops.c | 10 +++++++++-
> > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index 3e061ea99922..c00cc0624986 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
> > >  	return container_of(ctx, struct xfs_writepage_ctx, ctx);
> > >  }
> > >  
> > > +/*
> > > + * Kick extra large ioends off to the workqueue. Completion will process a lot
> > > + * of pages for a large bio or bio chain and a non-atomic context is required to
> > > + * reschedule and avoid soft lockup warnings.
> > > + */
> > > +#define XFS_LARGE_IOEND	(262144ULL << PAGE_SHIFT)
> > > +
> > >  /*
> > >   * Fast and loose check if this write could update the on-disk inode size.
> > >   */
> > > @@ -239,7 +246,8 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
> > >  {
> > >  	return ioend->io_private ||
> > >  		ioend->io_type == IOMAP_UNWRITTEN ||
> > > -		(ioend->io_flags & IOMAP_F_SHARED);
> > > +		(ioend->io_flags & IOMAP_F_SHARED) ||
> > > +		(ioend->io_size >= XFS_LARGE_IOEND);
> > >  }
> > >  
> > >  STATIC void
> > > -- 
> > > 2.25.4
> > > 
> > 
> 


* Re: [PATCH v2 2/2] xfs: kick extra large ioends to completion workqueue
  2020-10-06 14:07       ` Matthew Wilcox
@ 2021-05-06 19:34         ` Darrick J. Wong
  2021-05-06 19:45           ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2021-05-06 19:34 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Darrick J. Wong, Brian Foster, linux-fsdevel, linux-xfs

On Tue, Oct 06, 2020 at 03:07:20PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 05, 2020 at 08:55:37PM -0700, Darrick J. Wong wrote:
> > On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> > > We've had reports of soft lockup warnings in the iomap ioend
> > > completion path due to very large bios and/or bio chains. Divert any
> > > ioends with 256k or more pages to process to the workqueue so
> > > completion occurs in non-atomic context and can reschedule to avoid
> > > soft lockup warnings.
> > 
> > Hmmmm... is there any way we can just make end_page_writeback faster?
> 
> There are ways to make it faster.  I don't know if they're a "just"
> solution ...
> 
> 1. We can use THPs.  That will reduce the number of pages being operated
> on.  I hear somebody might have a patch set for that.  Incidentally,
> this patch set will clash with the THP patchset, so one of us is going to
> have to rebase on the other's work.  Not a complaint, just acknowledging
> that some coordination will be needed for the 5.11 merge window.

How far off is this, anyway?  I assume it's in line behind the folio
series?

> 2. We could create end_writeback_pages(struct pagevec *pvec) which
> calls a new test_clear_writeback_pages(pvec).  That could amortise
> taking the memcg lock and finding the lruvec and taking the mapping
> lock -- assuming these pages are sufficiently virtually contiguous.
> It can definitely amortise all the statistics updates.

/me kinda wonders if THPs aren't the better solution for people who want
to run large ios.

> 3. We can make wake_up_page(page, PG_writeback); more efficient.  If
> you can produce this situation on demand, I had a patch for that which
> languished due to lack of interest.

I can (well, someone can) so I'll talk to you internally about their
seeekret reproducer.

> https://lore.kernel.org/linux-fsdevel/20200416220130.13343-1-willy@infradead.org/

--D


* Re: [PATCH v2 2/2] xfs: kick extra large ioends to completion workqueue
  2021-05-06 19:34         ` Darrick J. Wong
@ 2021-05-06 19:45           ` Matthew Wilcox
  0 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2021-05-06 19:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Darrick J. Wong, Brian Foster, linux-fsdevel, linux-xfs

On Thu, May 06, 2021 at 12:34:06PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 06, 2020 at 03:07:20PM +0100, Matthew Wilcox wrote:
> > On Mon, Oct 05, 2020 at 08:55:37PM -0700, Darrick J. Wong wrote:
> > > On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> > > > We've had reports of soft lockup warnings in the iomap ioend
> > > > completion path due to very large bios and/or bio chains. Divert any
> > > > ioends with 256k or more pages to process to the workqueue so
> > > > completion occurs in non-atomic context and can reschedule to avoid
> > > > soft lockup warnings.
> > > 
> > > Hmmmm... is there any way we can just make end_page_writeback faster?
> > 
> > There are ways to make it faster.  I don't know if they're a "just"
> > solution ...
> > 
> > 1. We can use THPs.  That will reduce the number of pages being operated
> > on.  I hear somebody might have a patch set for that.  Incidentally,
> > this patch set will clash with the THP patchset, so one of us is going to
> > have to rebase on the other's work.  Not a complaint, just acknowledging
> > that some coordination will be needed for the 5.11 merge window.
> 
> How far off is this, anyway?  I assume it's in line behind the folio
> series?

Right.  The folio series found all kinds of fun places where the
accounting was wrong (eg accounting for an N-page I/O as a single page),
so the THP work is all renamed folio now.  The folio patchset I posted
yesterday [1] is _most_ of what is necessary from an XFS point of view.
There's probably another three dozen mm patches to actually enable
multi-page folios after that, and a lot more patches to optimise the
mm/vfs for multi-page folios, but your side of things is almost all
published and reviewable.

[1] https://lore.kernel.org/linux-fsdevel/20210505150628.111735-1-willy@infradead.org/

> > 2. We could create end_writeback_pages(struct pagevec *pvec) which
> > calls a new test_clear_writeback_pages(pvec).  That could amortise
> > taking the memcg lock and finding the lruvec and taking the mapping
> > lock -- assuming these pages are sufficiently virtually contiguous.
> > It can definitely amortise all the statistics updates.
> 
> /me kinda wonders if THPs aren't the better solution for people who want
> to run large ios.

Yes, definitely.  It does rather depend on their usage patterns,
but if they're working on a file-contiguous chunk of memory, this
can help a lot.

> > 3. We can make wake_up_page(page, PG_writeback); more efficient.  If
> > you can produce this situation on demand, I had a patch for that which
> > languished due to lack of interest.
> 
> I can (well, someone can) so I'll talk to you internally about their
> seeekret reproducer.

Fantastic.  If it's something that needs to get backported to a
stable-ABI kernel ... this isn't going to be a viable solution.


* Re: [PATCH v2 2/2] xfs: kick extra large ioends to completion workqueue
  2021-05-06 19:31         ` Darrick J. Wong
@ 2021-05-07 14:06           ` Brian Foster
  2021-05-07 14:40             ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Brian Foster @ 2021-05-07 14:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Darrick J. Wong, linux-fsdevel, linux-xfs

On Thu, May 06, 2021 at 12:31:58PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 06, 2020 at 08:44:40AM -0400, Brian Foster wrote:
> > On Mon, Oct 05, 2020 at 08:55:37PM -0700, Darrick J. Wong wrote:
> > > On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> > > > We've had reports of soft lockup warnings in the iomap ioend
> > > > completion path due to very large bios and/or bio chains. Divert any
> > > > ioends with 256k or more pages to process to the workqueue so
> 
> Heh, lol, so now we're hitting this internally too.  Certain customers
> are sending 580,000-page bios, which then trip the hangcheck timer when
> we stall the interrupt handler while clearing all the page bits.
> 

Yep, sounds about right. :P

> > > > completion occurs in non-atomic context and can reschedule to avoid
> > > > soft lockup warnings.
> > > 
> > > Hmmmm... is there any way we can just make end_page_writeback faster?
> > > 
> > 
> > I'm not sure that would help us. It's not doing much work as it is. The
> > issue is just that we effectively queue so many of them to a single bio
> > completion due to either bio chaining or the physical page merging
> > implemented by multipage bvecs.
> > 
> > > TBH it still strikes me as odd that we'd cap ioends this way just to
> > > cover for the fact that we have to poke each and every page.
> > > 
> > 
> > I suppose, but it's not like we don't already account for constructing
> > bios that must be handed off to a workqueue for completion processing.
> > Also FWIW this doesn't cap ioend size like my original patch does. It
> > just updates XFS to send them to the completion workqueue.
> 
> <nod> So I guess I'm saying that my resistance to /this/ part of the
> changes is melting away.  For a 2GB+ write IO, I guess the extra overhead
> of poking a workqueue can be amortized over the sheer number of pages.
> 

I think the main question is what is a suitable size threshold to kick
an ioend over to the workqueue? Looking back, I think this patch just
picked 256k randomly to propose the idea. ISTM there could be a
potentially large window between the point where I/O latency starts to
dominate (over the extra context switch for wq processing) and the point
where the softlockup warning will eventually trigger due to having too
many pages. I think that means we could probably use a more conservative
value, I'm just not sure what that value should be (10MB, 100MB, 1GB?). If
you have a reproducer it might be interesting to experiment with that.
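
(For scale, assuming 4k pages: the current 262144-page threshold works
out to 1GB of I/O, while 10MB, 100MB and 1GB byte thresholds correspond
to roughly 2560, 25600 and 262144 pages respectively.)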

> > > (Also, those 'bool atomic' in the other patch make me kind of nervous --
> > > how do we make sure (from a QA perspective) that nobody gets that wrong?)
> > > 
> > 
> > Yeah, that's a bit ugly. If somebody has a better idea on the factoring
> > I'm interested in hearing about it. My understanding is that in_atomic()
> > is not reliable and/or generally frowned upon, hence the explicit
> > context parameter.
> > 
> > Also, I don't have the error handy but my development kernel complains
> > quite clearly if we make a call that can potentially sleep in atomic
> > context. I believe this is the purpose of the __might_sleep()
> > (CONFIG_DEBUG_ATOMIC_SLEEP) annotation.
> 
> I wonder if it's not too late to establish a new iomap rule?
> 
> All clients whose ->prepare_ioend handler overrides the default
> ioend->io_bio->bi_end_io handler must call iomap_finish_ioends from
> process context, because the "only" reason why a filesystem would need
> to do that is because some post-write metadata update is necessary, and
> those really shouldn't be running from interrupt context.
> 
> With such a rule (no idea how we'd enforce that) we could at least
> constrain that in_atomic variable to buffered-io.c, since the only time
> it would be unsafe to call cond_resched() is if iomap_writepage_end_bio
> is in use, and it decides to call iomap_finish_ioend directly.
> 

I'm not following whether you mean to suggest changing what patch 1 does
somehow (it seems similar to what you're describing here) or something
else...?

Brian

> Right now XFS is the only filesystem that overrides the bio endio
> handler, and the only time it does that is for writes that need extra
> metadata updates (unwritten conversion, setfilesize, cow).
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > > 
> > > > v2:
> > > > - Fix type in macro.
> > > > 
> > > >  fs/xfs/xfs_aops.c | 10 +++++++++-
> > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > index 3e061ea99922..c00cc0624986 100644
> > > > --- a/fs/xfs/xfs_aops.c
> > > > +++ b/fs/xfs/xfs_aops.c
> > > > @@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
> > > >  	return container_of(ctx, struct xfs_writepage_ctx, ctx);
> > > >  }
> > > >  
> > > > +/*
> > > > + * Kick extra large ioends off to the workqueue. Completion will process a lot
> > > > + * of pages for a large bio or bio chain and a non-atomic context is required to
> > > > + * reschedule and avoid soft lockup warnings.
> > > > + */
> > > > +#define XFS_LARGE_IOEND	(262144ULL << PAGE_SHIFT)
> > > > +
> > > >  /*
> > > >   * Fast and loose check if this write could update the on-disk inode size.
> > > >   */
> > > > @@ -239,7 +246,8 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
> > > >  {
> > > >  	return ioend->io_private ||
> > > >  		ioend->io_type == IOMAP_UNWRITTEN ||
> > > > -		(ioend->io_flags & IOMAP_F_SHARED);
> > > > +		(ioend->io_flags & IOMAP_F_SHARED) ||
> > > > +		(ioend->io_size >= XFS_LARGE_IOEND);
> > > >  }
> > > >  
> > > >  STATIC void
> > > > -- 
> > > > 2.25.4
> > > > 
> > > 
> > 
> 



* Re: [PATCH v2 2/2] xfs: kick extra large ioends to completion workqueue
  2021-05-07 14:06           ` Brian Foster
@ 2021-05-07 14:40             ` Matthew Wilcox
  2021-05-10  2:45               ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2021-05-07 14:40 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, Darrick J. Wong, linux-fsdevel, linux-xfs

On Fri, May 07, 2021 at 10:06:31AM -0400, Brian Foster wrote:
> > <nod> So I guess I'm saying that my resistance to /this/ part of the
> > changes is melting away.  For a 2GB+ write IO, I guess the extra overhead
> > of poking a workqueue can be amortized over the sheer number of pages.
> 
> I think the main question is what is a suitable size threshold to kick
> an ioend over to the workqueue? Looking back, I think this patch just
> picked 256k randomly to propose the idea. ISTM there could be a
> potentially large window between the point where I/O latency starts to
> dominate (over the extra context switch for wq processing) and the point
> where the softlockup warning will eventually trigger due to having too
> many pages. I think that means we could probably use a more conservative
> value, I'm just not sure what that value should be (10MB, 100MB, 1GB?). If
> you have a reproducer it might be interesting to experiment with that.

To my mind, there are four main types of I/Os.

1. Small, dependent reads -- maybe reading a B-tree block so we can get
the next pointer.  Those need latency above all.

2. Readahead.  Tend to be large I/Os and latency is not a concern

3. Page writeback which tend to be large and can afford the extra latency.

4. Log writes.  These tend to be small, and I'm not sure what increasing
their latency would do.  Probably bad.

I like 256kB as a threshold.  I think I could get behind anything from
128kB to 1MB.  I don't think playing with it is going to be really
interesting because most IOs are going to be far below or far above
that threshold.



* Re: [PATCH v2 2/2] xfs: kick extra large ioends to completion workqueue
  2021-05-07 14:40             ` Matthew Wilcox
@ 2021-05-10  2:45               ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2021-05-10  2:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Brian Foster, Darrick J. Wong, Darrick J. Wong, linux-fsdevel, linux-xfs

On Fri, May 07, 2021 at 03:40:39PM +0100, Matthew Wilcox wrote:
> On Fri, May 07, 2021 at 10:06:31AM -0400, Brian Foster wrote:
> > > <nod> So I guess I'm saying that my resistance to /this/ part of the
> > > changes is melting away.  For a 2GB+ write IO, I guess the extra overhead
> > > of poking a workqueue can be amortized over the sheer number of pages.
> > 
> > I think the main question is what is a suitable size threshold to kick
> > an ioend over to the workqueue? Looking back, I think this patch just
> > picked 256k randomly to propose the idea. ISTM there could be a
> > potentially large window between the point where I/O latency starts to
> > dominate (over the extra context switch for wq processing) and the point
> > where the softlockup warning will eventually trigger due to having too
> > many pages. I think that means we could probably use a more conservative
> > value, I'm just not sure what that value should be (10MB, 100MB, 1GB?). If
> > you have a reproducer it might be interesting to experiment with that.
> 
> To my mind, there are four main types of I/Os.
> 
> 1. Small, dependent reads -- maybe reading a B-tree block so we can get
> the next pointer.  Those need latency above all.
> 
> 2. Readahead.  Tend to be large I/Os and latency is not a concern
> 
> 3. Page writeback which tend to be large and can afford the extra latency.
> 
> 4. Log writes.  These tend to be small, and I'm not sure what increasing
> their latency would do.  Probably bad.
> 
> I like 256kB as a threshold.  I think I could get behind anything from
> 128kB to 1MB.  I don't think playing with it is going to be really
> interesting because most IOs are going to be far below or far above
> that threshold.

256kB is waaaaay too small for writeback IO. Brian actually proposed
256k *pages*, not bytes. 256kB will take us back to the bad old days
where we are dependent on bio merging to get decent IO patterns down
to the hardware, and that's a bad place to be from both an IO and
CPU efficiency POV.

IOWs, the IO that is built by the filesystem needs to be large
w.r.t. the underlying hardware so that the block layer can slice and
dice it efficiently so we don't end up doing lots of small partial
stripe writes to RAID devices.

ISTR proposing - in response to Brian's patch - a limit of 16MB or
so - large enough that for most storage stacks we'll keep their
queues full with well-formed IO, but also small enough that we don't
end up with long completion latencies due to walking hundreds of
thousands of pages completing them in a tight loop...

Yup, here's the relevent chunk of that patch:

+/*
+ * Maximum ioend IO size is used to prevent ioends from becoming unbounded in
+ * size. Bios can reach 4GB in size if pages are contiguous, and bio chains are
+ * effectively unbounded in length. Hence the only limits on the size of the bio
+ * chain are the contiguity of the extent on disk and the length of the run of
+ * sequential dirty pages in the page cache. This can be tens of GBs of physical
+ * extents and, if memory is large enough, tens of millions of dirty pages.
+ * Holding them all under writeback until the final bio in the chain is
+ * submitted and completed locks all those pages for the length of time it takes
+ * to write those many, many GBs of data to storage.
+ *
+ * Background writeback caps any single writepages call to half the device
+ * bandwidth to ensure fairness and prevent any one dirty inode causing
+ * writeback starvation.  fsync() and other WB_SYNC_ALL writebacks have no such
+ * cap on wbc->nr_pages, and that's where the above massive bio chain lengths
+ * come from. We want large IOs to reach the storage, but we need to limit
+ * completion latencies, hence we need to control the maximum IO size we
+ * dispatch to the storage stack.
+ *
+ * We don't really have to care about the extra IO completion overhead here -
+ * iomap has contiguous IO completion merging, so if the device can sustain a
+ * very high throughput and large bios, the ioends will be merged on completion
+ * and processed in large, efficient chunks with no additional IO latency. IOWs,
+ * we really don't need the huge bio chains to be built in the first place for
+ * sequential writeback...
+ *
+ * Note that page size has an effect here - 64k pages really need lower
+ * page count thresholds because they have the same IO capabilities. We can do
+ * larger IOs because of the lower completion overhead, so cap it at 1024
+ * pages. For everything else, use 16MB as the ioend size.
+ */
+#if (PAGE_SIZE == 65536)
+#define IOMAP_MAX_IOEND_PAGES  1024
+#else
+#define IOMAP_MAX_IOEND_PAGES  4096
+#endif
+#define IOMAP_MAX_IOEND_SIZE   (IOMAP_MAX_IOEND_PAGES * PAGE_SIZE)


Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

