From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
linux-xfs@vger.kernel.org, Christoph Hellwig <hch@lst.de>
Subject: Re: xfs ioend batching log reservation deadlock
Date: Tue, 30 Mar 2021 10:42:48 -0400
Message-ID: <YGM46ABWNVO5FKld@bfoster>
In-Reply-To: <20210329235142.GR63242@dread.disaster.area>
On Tue, Mar 30, 2021 at 10:51:42AM +1100, Dave Chinner wrote:
> On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote:
> > On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
...
>
> > @@ -182,12 +185,10 @@ xfs_end_ioend(
> > error = xfs_reflink_end_cow(ip, offset, size);
> > else if (ioend->io_type == IOMAP_UNWRITTEN)
> > error = xfs_iomap_write_unwritten(ip, offset, size, false);
> > - else
> > - ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
> >
> > done:
> > - if (ioend->io_private)
> > - error = xfs_setfilesize_ioend(ioend, error);
> > + if (ioend->io_flags & IOMAP_F_APPEND)
> > + error = xfs_setfilesize(ip, offset, size);
> > iomap_finish_ioends(ioend, error);
> > memalloc_nofs_restore(nofs_flag);
> > }
> > @@ -221,16 +222,28 @@ xfs_end_io(
> > struct iomap_ioend *ioend;
> > struct list_head tmp;
> > unsigned long flags;
> > + xfs_off_t maxendoff;
> >
> > spin_lock_irqsave(&ip->i_ioend_lock, flags);
> > list_replace_init(&ip->i_ioend_list, &tmp);
> > spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> >
> > iomap_sort_ioends(&tmp);
> > +
> > + /* XXX: track max endoff manually? */
> > + ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
> > + if (((ioend->io_flags & IOMAP_F_SHARED) ||
> > + (ioend->io_type != IOMAP_UNWRITTEN)) &&
> > + xfs_ioend_is_append(ioend)) {
> > + ioend->io_flags |= IOMAP_F_APPEND;
> > + maxendoff = ioend->io_offset + ioend->io_size;
> > + }
> > +
> > while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
> > io_list))) {
> > list_del_init(&ioend->io_list);
> > iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
> > + ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
> > xfs_end_ioend(ioend);
> > }
> > }
>
> So now when I run a workload that is untarring a large tarball full
> of small files, we have as many transaction reserve operations
> running concurrently as there are IO completions queued.
>
This patch has pretty much no effect on a typical untar workload because
it is dominated by delalloc -> unwritten extent conversions that already
require completion time transaction reservations. Indeed, a quick test
to untar a kernel source tree produces no setfilesize events at all.
I'm not sure we have many situations upstream where append transactions
are used outside of perhaps cow completions (which already have a
completion time transaction allocation for fork remaps) or intra-block
file extending writes (that thus produce an inode size change within a
mapped, already converted block). Otherwise a truncate down should
always remove post-eof blocks and speculative prealloc originates from
delalloc, so afaict those should follow the same general sequence. Eh?
As it is, I think the performance concern is overstated, but I'm happy
to run any tests to confirm or refute that if you want to make more
concrete suggestions. This patch is easy to test and pretty much
survived an overnight regression run (outside of one or two things I
have to look into..). I'm happy to adjust the approach from there, but I
also think that if it proves necessary there are fallback options (based
on the original suggestion in my first mail) to preserve the current
submission-time (serial) append transaction reservation that don't
require categorizing, splitting, or partially processing the pending
ioend list.
Brian
> Right now, the single-threaded writeback (bdi-flusher) runs
> reservations serially, so we are introducing a large amount of
> concurrency to reservations here. IOWs, instead of throttling active
> reservations before we submit the IO we end up with attempted
> reservations only being bounded by the number of kworkers the
> completion workqueue can throw at the system. Then we might have
> tens of active filesystems doing the same thing, each with their own
> set of workqueues and kworkers...
>
> Yes, we can make "lots of IO to a single inode" have less overhead,
> but we do not want to do that at the expense of bad behaviour when
> we have "single IOs to lots of inodes". That's the concern I have
> here, and that's going to take a fair amount of work to characterise
> and measure the impact, especially on large machines with slow
> disks...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
Thread overview: 10+ messages
2021-03-26 15:39 xfs ioend batching log reservation deadlock Brian Foster
2021-03-26 17:32 ` Darrick J. Wong
2021-03-26 18:13 ` Brian Foster
2021-03-29 2:28 ` Dave Chinner
2021-03-29 18:04 ` Brian Foster
2021-03-29 23:51 ` Dave Chinner
2021-03-30 14:42 ` Brian Foster [this message]
2021-03-30 23:57 ` Dave Chinner
2021-03-31 12:26 ` Brian Foster
2021-03-29 5:45 ` Christoph Hellwig