All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: fix soft lockup via spinning in filestream ag selection loop
Date: Mon, 25 Apr 2022 09:15:32 -0700	[thread overview]
Message-ID: <20220425161532.GD17025@magnolia> (raw)
In-Reply-To: <YmLWRBjTSP43r6Cs@bfoster>

On Fri, Apr 22, 2022 at 12:22:28PM -0400, Brian Foster wrote:
> On Fri, Apr 22, 2022 at 09:00:21AM -0700, Darrick J. Wong wrote:
> > On Fri, Apr 22, 2022 at 10:12:26AM -0400, Brian Foster wrote:
> > > The filestream AG selection loop uses pagf data to aid in AG
> > > selection, which depends on pagf initialization. If the in-core
> > > structure is not initialized, the caller invokes the AGF read path
> > > to do so and carries on. If another task enters the loop and finds
> > > a pagf init already in progress, the AGF read returns -EAGAIN and
> > > the task continues the loop. This does not increment the current ag
> > > index, however, which means the task spins on the current AGF buffer
> > > until unlocked.
> > > 
> > > If the AGF read I/O submitted by the initial task happens to be
> > > delayed for whatever reason, this results in soft lockup warnings
> > 
> > Is there a specific 'whatever reason' going on here?
> > 
> 
> Presumably.. given this seems to reproduce reliably or not at all in
> certain environments/configs, my suspicion was that either the timing of
> the test changes enough such that some other task involved with the test
> is able to load the bdev, or otherwise timing changes just enough to
> trigger the pagf_init race and the subsequent spinning is what
> exacerbates the delay (i.e. burning cpu and subsequent soft lockup BUG
> starve out some part(s) of the I/O submission/completion processing).
> I've no tangible evidence for either aside from the latter seems fairly
> logical when you consider that the test consistently completes in 3-4
> seconds with the fix in place, but without it we consistently hit
> multiple instances of the soft lockup detector (on ~20s intervals IIRC)
> and the system seems to melt down indefinitely. *shrug*

Ah, ok, so there wasn't any specific event that was causing AGF IO to
take a long time, it's just that a thread running the filestream
allocator could fail the trylock loop for any reason for long enough to
trip the hangcheck warning.

--D

> Brian
> 
> > > via the spinning task. This is reproduced by xfs/170. To avoid this
> > > problem, fix the AGF trylock failure path to properly iterate to the
> > > next AG. If a task iterates all AGs without making progress, the
> > > trylock behavior is dropped in favor of blocking locks and thus a
> > > soft lockup is no longer possible.
> > > 
> > > Fixes: f48e2df8a877ca1c ("xfs: make xfs_*read_agf return EAGAIN to ALLOC_FLAG_TRYLOCK callers")
> > 
> > Ooops, this was a major braino on my part.
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > 
> > --D
> > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > > 
> > > I included the Fixes: tag because this looks like a regression in said
> > > commit, but I've not explicitly verified.
> > > 
> > > Brian
> > > 
> > >  fs/xfs/xfs_filestream.c | 7 ++++---
> > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c
> > > index 6a3ce0f6dc9e..be9bcf8a1f99 100644
> > > --- a/fs/xfs/xfs_filestream.c
> > > +++ b/fs/xfs/xfs_filestream.c
> > > @@ -128,11 +128,12 @@ xfs_filestream_pick_ag(
> > >  		if (!pag->pagf_init) {
> > >  			err = xfs_alloc_pagf_init(mp, NULL, ag, trylock);
> > >  			if (err) {
> > > -				xfs_perag_put(pag);
> > > -				if (err != -EAGAIN)
> > > +				if (err != -EAGAIN) {
> > > +					xfs_perag_put(pag);
> > >  					return err;
> > > +				}
> > >  				/* Couldn't lock the AGF, skip this AG. */
> > > -				continue;
> > > +				goto next_ag;
> > >  			}
> > >  		}
> > >  
> > > -- 
> > > 2.34.1
> > > 
> > 
> 

  reply	other threads:[~2022-04-25 16:15 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-22 14:12 [PATCH] xfs: fix soft lockup via spinning in filestream ag selection loop Brian Foster
2022-04-22 16:00 ` Darrick J. Wong
2022-04-22 16:22   ` Brian Foster
2022-04-25 16:15     ` Darrick J. Wong [this message]
2022-04-24  5:23 ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220425161532.GD17025@magnolia \
    --to=djwong@kernel.org \
    --cc=bfoster@redhat.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.