From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 2/2] xfs: allow delwri requeue of wait listed buffers
Date: Sat, 9 Jun 2018 08:49:09 +1000
Message-ID: <20180608224909.GC10363@dastard>
In-Reply-To: <20180608120709.GA23628@bfoster>

On Fri, Jun 08, 2018 at 08:07:09AM -0400, Brian Foster wrote:
> On Fri, Jun 08, 2018 at 09:27:13AM +1000, Dave Chinner wrote:
> > On Thu, Jun 07, 2018 at 08:41:25AM -0400, Brian Foster wrote:
> > > If a delwri queue occurs of a buffer that was previously submitted
> > > from a delwri queue but has not yet been removed from a wait list,
> > > the queue sets _XBF_DELWRI_Q without changing the state of ->b_list.
> > > This occurs, for example, if another thread beats the submitter
> > > thread to the buffer lock after I/O completion. Once the submitter
> > > acquires the lock, it removes the buffer from the wait list and
> > > leaves a buffer with _XBF_DELWRI_Q set but not on any list.
> > > This results in a lost buffer submission and in turn can result in
> > > assert failures due to _XBF_DELWRI_Q being set on buffer reclaim or
> > > filesystem lockups if the buffer happens to cover an item in the
> > > AIL.
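
For context, the queue-side logic in question looks roughly like
this (paraphrased sketch, not verbatim source). The problem is that
the _XBF_DELWRI_Q flag and the b_list membership are updated
independently:

bool
xfs_buf_delwri_queue(
	struct xfs_buf		*bp,
	struct list_head	*list)
{
	ASSERT(xfs_buf_islocked(bp));

	/* Already on a delwri queue - nothing more to do. */
	if (bp->b_flags & _XBF_DELWRI_Q)
		return false;

	bp->b_flags |= _XBF_DELWRI_Q;

	/*
	 * If the buffer is still sitting on a submitter's wait list,
	 * b_list is not empty, so the buffer is not re-added here.
	 * Once the submitter does list_del_init(&bp->b_list), we are
	 * left with _XBF_DELWRI_Q set but the buffer on no list at
	 * all - the lost submission described above.
	 */
	if (list_empty(&bp->b_list)) {
		atomic_inc(&bp->b_hold);
		list_add_tail(&bp->b_list, list);
	}
	return true;
}
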
> > 
> > I just so happened to have this ASSERT happen overnight on
> > generic/232 testing some code I wrote yesterday. It never ceases to
> > amaze me how bugs that have been around for ages always seem to be
> > hit at the same time by different people in completely different
> > contexts....
> > 
> 
> Interesting, out of curiosity was this in a memory limited environment?

No, it had GBs of free RAM, but it's on a fakenuma setup, so a node
could have been low on free memory and hence running the quota
shrinker....

....
> > This seems all a bit broken.
> 
> Yes, the premise of this was to do something that didn't break it
> further. ;) I figured using sync I/O would also address the problem, but
> would introduce terrible submission->completion serialization...

*nod*

> So essentially take apart sync buffer I/O so we can do batched
> submission/completion. That sounds like a nice idea to me.

*nod*
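
To make that concrete, the top-level sync path could look something
like this (hand-waving sketch only; xfs_buf_iowait() here stands in
for the split-out wait helper, along the lines of the
__xfs_buf_iowait() mentioned below): submit every buffer first, then
reap the completions, so the later waits overlap with I/O still in
flight:

int
xfs_buf_delwri_submit(
	struct list_head	*buffer_list)
{
	LIST_HEAD		(wait_list);
	struct xfs_buf		*bp;
	int			error = 0, error2;

	/* Submit all buffers without waiting on any of them. */
	xfs_buf_delwri_submit_buffers(buffer_list, &wait_list);

	/* Now reap the completions in submission order. */
	while (!list_empty(&wait_list)) {
		bp = list_first_entry(&wait_list, struct xfs_buf,
				      b_list);
		list_del_init(&bp->b_list);

		error2 = xfs_buf_iowait(bp);
		xfs_buf_relse(bp);
		if (!error)
			error = error2;
	}

	return error;
}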

> Feedback on the code to follow. That aside, are you planning to turn
> this into a real patch submission or would you like me to do it?

I've got plenty of other things to do - if you want to take it and
run with it, I'm fine with that :P

> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index a9678c2f3810..40f441e96dc5 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1453,27 +1453,20 @@ _xfs_buf_ioapply(
> >  }
> >  
> >  /*
> > - * Asynchronous IO submission path. This transfers the buffer lock ownership and
> > - * the current reference to the IO. It is not safe to reference the buffer after
> > - * a call to this function unless the caller holds an additional reference
> > - * itself.
> > + * Internal I/O submission helpers for split submission and waiting actions.
> >   */
> > -void
> > -xfs_buf_submit(
> > +static int
> > +__xfs_buf_submit(
> 
> It looks like the buffer submission refactoring could be a separate
> patch from the delwri queue race fix.

Yup, it probably should be.

> > @@ -1483,12 +1476,8 @@ xfs_buf_submit(
> >  	bp->b_io_error = 0;
> >  
> >  	/*
> > -	 * The caller's reference is released during I/O completion.
> > -	 * This occurs some time after the last b_io_remaining reference is
> > -	 * released, so after we drop our Io reference we have to have some
> > -	 * other reference to ensure the buffer doesn't go away from underneath
> > -	 * us. Take a direct reference to ensure we have safe access to the
> > -	 * buffer until we are finished with it.
> > +	 * I/O needs a reference, because the caller may go away before we are
> > +	 * done with the IO. Completion will deal with it.
> >  	 */
> >  	xfs_buf_hold(bp);
> 
> I think this should be lifted into the callers. For one, it's confusing to
> follow.

Both paths take a reference here, both take it because the IO itself
needs a reference. The only difference is when it is dropped.

> Second, it looks like xfs_buf_submit() unconditionally drops a
> reference whereas __xfs_buf_submit() may not acquire one (i.e. when we
> return an error).
>
> ISTM that the buffer reference calls could be balanced in the top-level
> submit functions rather than split between the common submission path
> and unique sync completion path.

Yes, that may end up being cleaner, because the delwri path has to
take a reference across submit/complete anyway. I just quickly
chopped away at the code without thinking about this stuff too
deeply. I did say "untested", right? :P
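
E.g. something with this shape (sketch only, glossing over how
__xfs_buf_submit() interacts with ioend on submission errors): the
internal helper never touches the caller-visible reference, and each
top-level path brackets its own hold/rele pair:

/* Async path - the IO owns a reference, completion drops it. */
void
xfs_buf_submit(
	struct xfs_buf		*bp)
{
	xfs_buf_hold(bp);
	if (__xfs_buf_submit(bp))
		xfs_buf_rele(bp);	/* no IO started, drop it now */
}

/* Sync path - hold across submit and wait, drop it ourselves. */
int
xfs_buf_submit_wait(
	struct xfs_buf		*bp)
{
	int			error;

	xfs_buf_hold(bp);
	error = __xfs_buf_submit(bp);
	if (!error)
		error = xfs_buf_iowait(bp);
	xfs_buf_rele(bp);
	return error;
}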

> >  void *
> > @@ -2045,14 +2035,21 @@ xfs_buf_delwri_submit_buffers(
> >  		 * at this point so the caller can still access it.
> >  		 */
> >  		bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL);
> > -		bp->b_flags |= XBF_WRITE | XBF_ASYNC;
> > +		bp->b_flags |= XBF_WRITE;
> 
> We set XBF_ASYNC below in the specific case, but this doesn't tell us
> anything about whether it might have already been set on the buffer. Is
> it not the responsibility of this function to set/clear XBF_ASYNC
> appropriately?

XBF_ASYNC should be clear (as the buffer is not under IO) but, yes,
I agree that it would be a good idea to clear it anyway.
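
Something like this, say (untested):

		bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL | XBF_ASYNC);
		bp->b_flags |= XBF_WRITE;
		if (!wait_list)
			bp->b_flags |= XBF_ASYNC;

That sets the IO mode explicitly in both cases rather than trusting
whatever state the buffer came in with.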

> >  		if (wait_list) {
> > +			/*
> > +			 * Split synchronous IO - we wait later, so we need a
> > +			 * reference until we run completion processing and drop
> > +			 * the buffer lock ourselves
> > +			 */
> 
> Might as well merge this with the comment above, which needs fixing
> anyway since we no longer "do all IO submission async."
> 
> >  			xfs_buf_hold(bp);
> 
> I think the buffer reference counting is now broken here.

More than likely. :P

....
> >  			list_move_tail(&bp->b_list, wait_list);
> > -		} else
> > +			__xfs_buf_submit(bp);
> 
> I suspect we need to handle submission errors here, otherwise we wait on
> a buffer that was never submitted.

Yup.
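
Perhaps along these lines (untested sketch - the exact lock and
reference dance depends on what __xfs_buf_submit() does with the
buffer on failure): a failed submission never starts any IO, so
nobody will ever complete the buffer, and it has to come back off
the wait list instead of being waited on:

		if (wait_list) {
			xfs_buf_hold(bp);
			list_move_tail(&bp->b_list, wait_list);
			error = __xfs_buf_submit(bp);
			if (error) {
				/*
				 * No IO in flight and no completion
				 * coming - pull the buffer back off
				 * the wait list and drop the
				 * reference we took for the wait.
				 */
				list_del_init(&bp->b_list);
				xfs_buf_rele(bp);
			}
		}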

> One final thought.. ISTM that the nature of batched sync buffer
> submission means that once we wait on one or two of those buffers,
> there's a good chance many of the remaining buffer physical I/Os will
> have completed by the time we get to the associated iowait. That means
> that the current behavior of large (sync) delwri buffer completions
> running in the async completion workqueue most likely changes to running
> from __xfs_buf_iowait()->xfs_buf_ioend() context. I'm not sure that
> really matters, but just something to note.

Yup, that's something that crossed my mind. But we only wait for
buffers to complete in log recovery, quota check and the /quota
shrinker/ (now the generic/232 failure makes sense, doesn't it? :).

Hence I don't think there's a performance issue we need to worry
about - correctness is more important...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
