From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org, Christoph Hellwig <hch@lst.de>
Subject: Re: xfs ioend batching log reservation deadlock
Date: Fri, 26 Mar 2021 14:13:21 -0400
Message-ID: <YF4kQWXqwCgAh4vW@bfoster>
In-Reply-To: <20210326173244.GY4090233@magnolia>

On Fri, Mar 26, 2021 at 10:32:44AM -0700, Darrick J. Wong wrote:
> On Fri, Mar 26, 2021 at 11:39:38AM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > We have a report of a workload that deadlocks on log reservation via
> > iomap_ioend completion batching. To start, the fs format is somewhat
> > unique in that the log is on the smaller side (35MB) and the log stripe
> > unit is 256k, but this is actually a default mkfs for the underlying
> > storage. I don't have much more information wrt the workload or
> > anything else that contributes to the completion processing characteristics.
> > 
> > The overall scenario is that a workqueue task is executing in
> > xfs_end_io() and blocked on transaction reservation for an unwritten
> > extent conversion. Since this task began executing and pulled pending
> > items from ->i_ioend_list, the latter was repopulated with 90 ioends,
> > 67 of which have append transactions. Each of these append
> > transactions accounts for ~520k of log reservation due to the log
> > stripe unit, so together they consume roughly 34MB, i.e. nearly all of
> > the available 35MB of log space. That prevents allocation of the
> > aforementioned unwritten extent conversion transaction and thus leaves
> > the fs in a deadlocked state.
> > 
> > I can think of different ways we could probably optimize this problem
> > away. One example is to transfer the append transaction to the inode at
> > bio completion time such that we retain only one per pending batch of
> > ioends. The workqueue task would then pull this append transaction from
> > the inode along with the ioend list and transfer it back to the last
> > non-unwritten/shared ioend in the sorted list.
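
To illustrate what I mean there, roughly (completely untested sketch;
ip->i_append_trans is a made-up inode field, and error handling plus
cleanup of any surplus preallocated transactions is glossed over):

  /* bio completion: park the ioend's append transaction on the inode */
  STATIC void
  xfs_end_bio(
          struct bio              *bio)
  {
          struct iomap_ioend      *ioend = bio->bi_private;
          struct xfs_inode        *ip = XFS_I(ioend->io_inode);
          unsigned long           flags;

          spin_lock_irqsave(&ip->i_ioend_lock, flags);
          if (ioend->io_private && !ip->i_append_trans) {
                  /* keep a single append transaction per pending batch */
                  ip->i_append_trans = ioend->io_private;
                  ioend->io_private = NULL;
          }
          /* ... queue i_ioend_work if the list was empty, as today ... */
          list_add_tail(&ioend->io_list, &ip->i_ioend_list);
          spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
  }

  /* workqueue side: pull the parked transaction along with the list */
          spin_lock_irqsave(&ip->i_ioend_lock, flags);
          list_replace_init(&ip->i_ioend_list, &tmp);
          tp = ip->i_append_trans;
          ip->i_append_trans = NULL;
          spin_unlock_irqrestore(&ip->i_ioend_lock, flags);

          iomap_sort_ioends(&tmp);
          /* ... hand tp back to the last !unwritten/!shared ioend ... */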
> > 
> > That said, I'm not totally convinced this addresses the fundamental
> > problem of acquiring transaction reservation from a context that
> > essentially already owns outstanding reservation vs. just making it hard
> > to reproduce. I'm wondering if/why we need the append transaction at
> > all. AFAICT it goes back to commit 281627df3eb5 ("xfs: log file size
> > updates at I/O completion time") in v3.4, which changed the
> > completion-time on-disk size update from an unlogged update to a
> > logged one. If we continue to
> > send these potential append ioends to the workqueue for completion
> > processing, is there any reason we can't let the workqueue allocate the
> > transaction as it already does for unwritten conversion?
> 
> Frankly I've never understood what benefit we get from preallocating a
> transaction and letting it twist in the wind consuming log space while
> writeback pushes data to the disk.  It's perfectly fine to delay ioend
> processing while we wait for unwritten conversions and cow remapping to
> take effect, so what's the harm in a slight delay for this?
> 
> I guess it's an optimization to reduce wait times?  It's a pity that
> nobody left a comment justifying why it was done in that particular way,
> what with the freeze protection lockdep weirdness too.
> 

I thought it might have been to facilitate ioend completion in interrupt
(i.e. bio completion) context, but I'm really not sure. That's why I'm
asking. :) I'm hoping Christoph can provide some context since it
appears to be his original implementation.

> > If that is reasonable, I'm thinking of a couple patches:
> > 
> > 1. Optimize current append transaction processing with an inode field as
> > noted above.
> > 
> > 2. Replace the submission-side append transaction entirely with a flag
> > (or some such) on the ioend so that the transaction is allocated at
> > completion time, but otherwise preserve the batching behavior
> > instituted in patch 1.
> 
> What happens if you replace the call to xfs_setfilesize_ioend in
> xfs_end_ioend with xfs_setfilesize, and skip the transaction
> preallocation altogether?
> 

That's pretty much what I'm referring to in step 2 above. I was just
thinking it might make sense to implement some sort of batching model
first to avoid scenarios where we have a bunch of discontiguous append
ioends in the completion list and really only need to update the file
size at the end. (Hence a flag or whatever to indicate the last ioend
must call xfs_setfilesize()).
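
E.g., something along these lines in xfs_end_io() (untested sketch; how
the "last append" marker is actually communicated to xfs_end_ioend(),
and how it interacts with ioend merging, is hand-waved):

          struct iomap_ioend      *ioend, *last_append = NULL;

          iomap_sort_ioends(&tmp);
          list_for_each_entry(ioend, &tmp, io_list) {
                  if (ioend->io_type != IOMAP_UNWRITTEN &&
                      !(ioend->io_flags & IOMAP_F_SHARED) &&
                      xfs_ioend_is_append(ioend))
                          last_append = ioend;
          }
          /*
           * ... then process the list as today, passing something like
           * (ioend == last_append) down to xfs_end_ioend() so that only
           * this one allocates a transaction and calls xfs_setfilesize().
           */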

That said, we do still have contiguous ioend merging happening already.
We could certainly just rely on that, update di_size as we process each
append ioend, and worry about further optimization later. That might be
simpler, less code, and a safer first step.
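
In that model the completion side would boil down to something like the
following (compile-untested sketch of xfs_end_ioend(); shutdown checks
and writeback error handling elided), with xfs_setfilesize() allocating
its own transaction here in process context instead of us carrying one
over from submission:

  STATIC void
  xfs_end_ioend(
          struct iomap_ioend      *ioend)
  {
          struct xfs_inode        *ip = XFS_I(ioend->io_inode);
          xfs_off_t               offset = ioend->io_offset;
          size_t                  size = ioend->io_size;
          int                     error = 0;

          /* ... shutdown and writeback error handling elided ... */

          if (ioend->io_flags & IOMAP_F_SHARED)
                  error = xfs_reflink_end_cow(ip, offset, size);
          else if (ioend->io_type == IOMAP_UNWRITTEN)
                  error = xfs_iomap_write_unwritten(ip, offset, size, false);

          /*
           * No preallocated append transaction: let xfs_setfilesize()
           * reserve log space here, where it's safe to block on it.
           */
          if (!error && xfs_ioend_is_append(ioend))
                  error = xfs_setfilesize(ip, offset, size);

          iomap_finish_ioends(ioend, error);
  }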

Brian

> --D
> 
> > Thoughts?
> > 
> > Brian
> > 
> 

