From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
	linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH v2 2/2] xfs: kick extra large ioends to completion workqueue
Date: Fri, 7 May 2021 10:06:31 -0400	[thread overview]
Message-ID: <YJVJZzld5ucxnlAH@bfoster>
In-Reply-To: <20210506193158.GD8582@magnolia>

On Thu, May 06, 2021 at 12:31:58PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 06, 2020 at 08:44:40AM -0400, Brian Foster wrote:
> > On Mon, Oct 05, 2020 at 08:55:37PM -0700, Darrick J. Wong wrote:
> > > On Mon, Oct 05, 2020 at 11:21:02AM -0400, Brian Foster wrote:
> > > > We've had reports of soft lockup warnings in the iomap ioend
> > > > completion path due to very large bios and/or bio chains. Divert any
> > > > ioends with 256k or more pages to process to the workqueue so
> 
> Heh, lol, so now we're hitting this internally too.  Certain customers
> are sending 580,000-page bios, which then trip the hangcheck timer when
> we stall the interrupt handler while clearing all the page bits.
> 

Yep, sounds about right. :P
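
For reference, the completion side boils down to walking every page of
every bio in the ioend's chain, something like the sketch below
(simplified and from memory, so not the exact iomap code). With a
580,000-page ioend that's 580,000 iterations with no chance to
reschedule when it runs from the bio's interrupt completion:

#include <linux/bio.h>
#include <linux/iomap.h>
#include <linux/pagemap.h>

/*
 * Simplified sketch of ioend completion: end writeback on every bvec's
 * page of every bio in the chain, one at a time. From irq context there
 * is no opportunity to cond_resched() anywhere in this loop.
 */
static void ioend_finish_pages_sketch(struct iomap_ioend *ioend)
{
	struct bio *bio = &ioend->io_inline_bio;
	struct bio *last = ioend->io_bio;
	struct bio *next;

	for (; bio; bio = next) {
		struct bio_vec *bv;
		struct bvec_iter_all iter_all;

		/* the last bio's bi_private points back at the ioend */
		next = (bio == last) ? NULL : bio->bi_private;

		bio_for_each_segment_all(bv, bio, iter_all)
			end_page_writeback(bv->bv_page);
		bio_put(bio);
	}
}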

> > > > completion occurs in non-atomic context and can reschedule to avoid
> > > > soft lockup warnings.
> > > 
> > > Hmmmm... is there any way we can just make end_page_writeback faster?
> > > 
> > 
> > I'm not sure that would help us. It's not doing much work as it is. The
> > issue is just that we effectively queue so many of them to a single bio
> > completion due to either bio chaining or the physical page merging
> > implemented by multipage bvecs.
> > 
> > > TBH it still strikes me as odd that we'd cap ioends this way just to
> > > cover for the fact that we have to poke each and every page.
> > > 
> > 
> > I suppose, but it's not like we don't already account for constructing
> > bios that must be handed off to a workqueue for completion processing.
> > Also FWIW this doesn't cap ioend size like my original patch does. It
> > just updates XFS to send them to the completion workqueue.
> 
> <nod> So I guess I'm saying that my resistance to /this/ part of the
> changes is melting away.  For a 2GB+ write IO, I guess the extra overhead
> of poking a workqueue can be amortized over the sheer number of pages.
> 

I think the main question is what a suitable size threshold is for
kicking an ioend over to the workqueue. Looking back, I think this patch
just picked 256k pages somewhat arbitrarily to propose the idea. ISTM
there could be a fairly large window between the point where I/O latency
starts to dominate (over the extra context switch for wq processing) and
the point where the softlockup warning eventually triggers due to having
too many pages to process. That suggests we could probably use a more
conservative value; I'm just not sure what that value should be (10MB,
100MB, 1GB?). If you have a reproducer, it might be interesting to
experiment with that.
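
Just to put numbers on it: the threshold is expressed in pages in the
commit log but compared against io_size in bytes, so with 4k pages it
works out as below (XFS_LARGE_IOEND_PAGES is just a made-up name for
illustration, restating the macro from this patch):

/*
 * io_size is in bytes, so a page-count threshold translates as
 * npages << PAGE_SHIFT. With 4k pages:
 *
 *	262144 pages (the 256k in this patch)	-> 1 GiB
 *	 65536 pages				-> 256 MiB
 *	  2560 pages				-> 10 MiB
 */
#define XFS_LARGE_IOEND_PAGES	262144ULL
#define XFS_LARGE_IOEND		(XFS_LARGE_IOEND_PAGES << PAGE_SHIFT)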

> > > (Also, those 'bool atomic' in the other patch make me kind of nervous --
> > > how do we make sure (from a QA perspective) that nobody gets that wrong?)
> > > 
> > 
> > Yeah, that's a bit ugly. If somebody has a better idea on the factoring,
> > I'm interested in hearing it. My understanding is that in_atomic() is
> > not reliable and/or generally frowned upon, hence the explicit context
> > parameter.
> > 
> > Also, I don't have the error handy but my development kernel complains
> > quite clearly if we make a call that can potentially sleep in atomic
> > context. I believe this is the purpose of the __might_sleep()
> > (CONFIG_DEBUG_ATOMIC_SLEEP) annotation.
> 
> I wonder if it's not too late to establish a new iomap rule?
> 
> All clients whose ->prepare_ioend handler overrides the default
> ioend->io_bio->bi_end_io handler must call iomap_finish_ioends from
> process context, because the "only" reason why a filesystem would need
> to do that is because some post-write metadata update is necessary, and
> those really shouldn't be running from interrupt context.
> 
> With such a rule (no idea how we'd enforce that) we could at least
> constrain that in_atomic variable to buffered-io.c, since the only time
> it would be unsafe to call cond_resched() is if iomap_writepage_end_bio
> is in use, and it decides to call iomap_finish_ioend directly.
> 
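
Enforcement might mostly fall out of a might_sleep() at the top of the
process-context-only completion path; a purely hypothetical sketch (the
wrapper name is made up, this isn't a hunk from either patch), but with
CONFIG_DEBUG_ATOMIC_SLEEP=y it would splat on any atomic-context caller:

#include <linux/kernel.h>
#include <linux/iomap.h>

/*
 * Hypothetical sketch: annotate the completion path that is only
 * supposed to run from process context so CONFIG_DEBUG_ATOMIC_SLEEP
 * warns if anybody ever gets here from interrupt/atomic context.
 */
static void xfs_end_ioend_checked(struct iomap_ioend *ioend, int error)
{
	might_sleep();		/* splats if called from atomic context */
	iomap_finish_ioends(ioend, error);
}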

I'm not following whether you're suggesting a change to what patch 1
does (it seems similar to what you're describing here) or something
else entirely?
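
FWIW, the only endio override today is XFS pointing io_bio->bi_end_io at
xfs_end_bio for ioends that need a transaction, roughly like the trimmed
sketch below (reconstructed from memory, so don't read it as the exact
xfs_aops.c code):

/*
 * Trimmed sketch of the XFS ->prepare_ioend hook: the default bio
 * end_io is only overridden when completion needs process context
 * (unwritten conversion, size updates, COW remapping).
 */
STATIC int
xfs_prepare_ioend(
	struct iomap_ioend	*ioend,
	int			status)
{
	/* ioends that might need a transaction go to the completion wq */
	if (xfs_ioend_needs_workqueue(ioend))
		ioend->io_bio->bi_end_io = xfs_end_bio;
	return status;
}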

Brian

> Right now XFS is the only filesystem that overrides the bio endio
> handler, and the only time it does that is for writes that need extra
> metadata updates (unwritten conversion, setfilesize, cow).
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > > 
> > > > v2:
> > > > - Fix type in macro.
> > > > 
> > > >  fs/xfs/xfs_aops.c | 10 +++++++++-
> > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > index 3e061ea99922..c00cc0624986 100644
> > > > --- a/fs/xfs/xfs_aops.c
> > > > +++ b/fs/xfs/xfs_aops.c
> > > > @@ -30,6 +30,13 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
> > > >  	return container_of(ctx, struct xfs_writepage_ctx, ctx);
> > > >  }
> > > >  
> > > > +/*
> > > > + * Kick extra large ioends off to the workqueue. Completion will process a lot
> > > > + * of pages for a large bio or bio chain and a non-atomic context is required to
> > > > + * reschedule and avoid soft lockup warnings.
> > > > + */
> > > > +#define XFS_LARGE_IOEND	(262144ULL << PAGE_SHIFT)
> > > > +
> > > >  /*
> > > >   * Fast and loose check if this write could update the on-disk inode size.
> > > >   */
> > > > @@ -239,7 +246,8 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
> > > >  {
> > > >  	return ioend->io_private ||
> > > >  		ioend->io_type == IOMAP_UNWRITTEN ||
> > > > -		(ioend->io_flags & IOMAP_F_SHARED);
> > > > +		(ioend->io_flags & IOMAP_F_SHARED) ||
> > > > +		(ioend->io_size >= XFS_LARGE_IOEND);
> > > >  }
> > > >  
> > > >  STATIC void
> > > > -- 
> > > > 2.25.4
> > > > 
> > > 
> > 
> 



Thread overview: 17+ messages
2020-10-02 15:33 [PATCH 0/2] iomap: avoid soft lockup warnings on large ioends Brian Foster
2020-10-02 15:33 ` [PATCH 1/2] iomap: resched ioend completion when in non-atomic context Brian Foster
2020-10-02 15:33 ` [PATCH 2/2] xfs: kick extra large ioends to completion workqueue Brian Foster
2020-10-02 16:19   ` Darrick J. Wong
2020-10-02 16:38     ` Brian Foster
2020-10-03  0:26   ` kernel test robot
2020-10-05 15:21   ` [PATCH v2 " Brian Foster
2020-10-06  3:55     ` Darrick J. Wong
2020-10-06 12:44       ` Brian Foster
2021-05-06 19:31         ` Darrick J. Wong
2021-05-07 14:06           ` Brian Foster [this message]
2021-05-07 14:40             ` Matthew Wilcox
2021-05-10  2:45               ` Dave Chinner
2020-10-06 14:07       ` Matthew Wilcox
2021-05-06 19:34         ` Darrick J. Wong
2021-05-06 19:45           ` Matthew Wilcox
