From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
	linux-xfs@vger.kernel.org, Christoph Hellwig <hch@lst.de>
Subject: Re: xfs ioend batching log reservation deadlock
Date: Tue, 30 Mar 2021 10:42:48 -0400
Message-ID: <YGM46ABWNVO5FKld@bfoster>
In-Reply-To: <20210329235142.GR63242@dread.disaster.area>

On Tue, Mar 30, 2021 at 10:51:42AM +1100, Dave Chinner wrote:
> On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote:
> > On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
...
> 
> > @@ -182,12 +185,10 @@ xfs_end_ioend(
> >  		error = xfs_reflink_end_cow(ip, offset, size);
> >  	else if (ioend->io_type == IOMAP_UNWRITTEN)
> >  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
> > -	else
> > -		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
> >  
> >  done:
> > -	if (ioend->io_private)
> > -		error = xfs_setfilesize_ioend(ioend, error);
> > +	if (ioend->io_flags & IOMAP_F_APPEND)
> > +		error = xfs_setfilesize(ip, offset, size);
> >  	iomap_finish_ioends(ioend, error);
> >  	memalloc_nofs_restore(nofs_flag);
> >  }
> > @@ -221,16 +222,28 @@ xfs_end_io(
> >  	struct iomap_ioend	*ioend;
> >  	struct list_head	tmp;
> >  	unsigned long		flags;
> > +	xfs_off_t		maxendoff;
> >  
> >  	spin_lock_irqsave(&ip->i_ioend_lock, flags);
> >  	list_replace_init(&ip->i_ioend_list, &tmp);
> >  	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> >  
> >  	iomap_sort_ioends(&tmp);
> > +
> > +	/* XXX: track max endoff manually? */
> > +	ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
> > +	if (((ioend->io_flags & IOMAP_F_SHARED) ||
> > +	     (ioend->io_type != IOMAP_UNWRITTEN)) &&
> > +	    xfs_ioend_is_append(ioend)) {
> > +		ioend->io_flags |= IOMAP_F_APPEND;
> > +		maxendoff = ioend->io_offset + ioend->io_size;
> > +	}
> > +
> >  	while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
> >  			io_list))) {
> >  		list_del_init(&ioend->io_list);
> >  		iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
> > +		ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
> >  		xfs_end_ioend(ioend);
> >  	}
> >  }
> 
> So now when I run a workload that is untarring a large tarball full
> of small files, we have as many transaction reserve operations
> runnning concurrently as there are IO completions queued.
> 

This patch has pretty much no effect on a typical untar workload because
it is dominated by delalloc -> unwritten extent conversions, which
already require completion-time transaction reservations. Indeed, a
quick test untarring a kernel source tree produces no setfilesize events
at all.

I'm not sure we have many situations upstream where append transactions
are used outside of perhaps COW completions (which already allocate a
completion-time transaction for the fork remap) or intra-block,
file-extending writes (which produce an inode size change within an
already mapped and converted block). Otherwise, a truncate down should
always remove post-EOF blocks and speculative preallocation originates
from delalloc, so AFAICT those should follow the same general sequence.
Eh?
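
(For reference, the append check in the hunk above is just the on-disk
EOF comparison, i.e. roughly:

static inline bool xfs_ioend_is_append(struct iomap_ioend *ioend)
{
	/* does this ioend extend beyond the current on-disk size? */
	return ioend->io_offset + ioend->io_size >
		XFS_I(ioend->io_inode)->i_d.di_size;
}

... so only completions that actually push out the on-disk size take
the setfilesize path in the first place.)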

As it is, I think the performance concern is overstated, but I'm happy
to run any tests to confirm or deny that if you want to make more
concrete suggestions. This patch is easy to test and pretty much
survived an overnight regression run (outside of one or two things I
have to look into..). I'm happy to adjust the approach from there, but I
also think that, if it proves necessary, there are fallback options
(based around the original suggestion in my first mail) to preserve the
current submission-time (serial) append transaction reservation that
don't require categorizing/splitting or partially processing the pending
ioend list.
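
(To be clear, by the current submission-time (serial) reservation I mean
the existing xfs_setfilesize_trans_alloc() path, which boils down to
something like the following, with the freeze protection and
PF_MEMALLOC_NOFS handoff details elided:

STATIC int
xfs_setfilesize_trans_alloc(
	struct iomap_ioend	*ioend)
{
	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
	struct xfs_trans	*tp;
	int			error;

	/* reserve log space at submission time, before the ioend is queued */
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
	if (error)
		return error;

	/* hand the transaction to the completion side via ->io_private */
	ioend->io_private = tp;
	return 0;
}

... i.e. the reservation happens serially in writeback context and is
carried to completion, rather than being taken from the completion
workqueue.)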

Brian

> Right now, the single-threaded writeback (bdi-flusher) runs
> reservations serially, so we are introducing a large amount of
> concurrency to reservations here. IOWs, instead of throttling active
> reservations before we submit the IO, we end up with attempted
> reservations bounded only by the number of kworkers the completion
> workqueue can throw at the system.  Then we might have tens of active
> filesystems doing the same thing, each with their own set of
> workqueues and kworkers...
> 
> Yes, we can make "lots of IO to a single inode" have less overhead,
> but we do not want to do that at the expense of bad behaviour when
> we have "single IOs to lots of inodes". That's the concern I have
> here, and that's going to take a fair amount of work to characterise
> and measure the impact, especially on large machines with slow
> disks...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

