From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
linux-xfs@vger.kernel.org, Christoph Hellwig <hch@lst.de>
Subject: Re: xfs ioend batching log reservation deadlock
Date: Tue, 30 Mar 2021 10:42:48 -0400
Message-ID: <YGM46ABWNVO5FKld@bfoster>
In-Reply-To: <20210329235142.GR63242@dread.disaster.area>
On Tue, Mar 30, 2021 at 10:51:42AM +1100, Dave Chinner wrote:
> On Mon, Mar 29, 2021 at 02:04:25PM -0400, Brian Foster wrote:
> > On Mon, Mar 29, 2021 at 01:28:26PM +1100, Dave Chinner wrote:
...
>
> > @@ -182,12 +185,10 @@ xfs_end_ioend(
> > error = xfs_reflink_end_cow(ip, offset, size);
> > else if (ioend->io_type == IOMAP_UNWRITTEN)
> > error = xfs_iomap_write_unwritten(ip, offset, size, false);
> > - else
> > - ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
> >
> > done:
> > - if (ioend->io_private)
> > - error = xfs_setfilesize_ioend(ioend, error);
> > + if (ioend->io_flags & IOMAP_F_APPEND)
> > + error = xfs_setfilesize(ip, offset, size);
> > iomap_finish_ioends(ioend, error);
> > memalloc_nofs_restore(nofs_flag);
> > }
> > @@ -221,16 +222,28 @@ xfs_end_io(
> > struct iomap_ioend *ioend;
> > struct list_head tmp;
> > unsigned long flags;
> > + xfs_off_t maxendoff;
> >
> > spin_lock_irqsave(&ip->i_ioend_lock, flags);
> > list_replace_init(&ip->i_ioend_list, &tmp);
> > spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> >
> > iomap_sort_ioends(&tmp);
> > +
> > + /* XXX: track max endoff manually? */
> > + ioend = list_last_entry(&tmp, struct iomap_ioend, io_list);
> > + if (((ioend->io_flags & IOMAP_F_SHARED) ||
> > + (ioend->io_type != IOMAP_UNWRITTEN)) &&
> > + xfs_ioend_is_append(ioend)) {
> > + ioend->io_flags |= IOMAP_F_APPEND;
> > + maxendoff = ioend->io_offset + ioend->io_size;
> > + }
> > +
> > while ((ioend = list_first_entry_or_null(&tmp, struct iomap_ioend,
> > io_list))) {
> > list_del_init(&ioend->io_list);
> > iomap_ioend_try_merge(ioend, &tmp, xfs_ioend_merge_private);
> > + ASSERT(ioend->io_offset + ioend->io_size <= maxendoff);
> > xfs_end_ioend(ioend);
> > }
> > }
>
> So now when I run a workload that is untarring a large tarball full
> of small files, we have as many transaction reserve operations
> running concurrently as there are IO completions queued.
>
This patch has pretty much no effect on a typical untar workload because
it is dominated by delalloc -> unwritten extent conversions that already
require completion time transaction reservations. Indeed, a quick test
to untar a kernel source tree produces no setfilesize events at all.
I'm not sure we have many situations upstream where append transactions
are used outside of perhaps cow completions (which already have a
completion time transaction allocation for fork remaps) or intra-block
file extending writes (that thus produce an inode size change within a
mapped, already converted block). Otherwise a truncate down should
always remove post-eof blocks and speculative prealloc originates from
delalloc, so afaict those should follow the same general sequence. Eh?
As it is, I think the performance concern is overstated, but I'm happy
to run any tests to confirm or refute that if you want to make more
concrete suggestions. This patch is easy to test and pretty much
survived an overnight regression run (outside of one or two things I
have to look into..). I'm happy to adjust the approach from there, but I
also think that if it proves necessary there are fallback options (based
on the original suggestion in my first mail) to preserve the current
submission-time (serial) append transaction reservation that don't
require categorizing, splitting, or partially processing the pending
ioend list.
Brian
> Right now, the single-threaded writeback (bdi-flusher) runs
> reservations serially, so we are introducing a large amount of
> concurrency to reservations here. IOWs, instead of throttling active
> reservations before we submit the IO we end up with attempted
> reservations only being bounded by the number of kworkers the
> completion workqueue can throw at the system. Then we might have
> tens of active filesystems doing the same thing, each with their own
> set of workqueues and kworkers...
>
> Yes, we can make "lots of IO to a single inode" have less overhead,
> but we do not want to do that at the expense of bad behaviour when
> we have "single IOs to lots of inodes". That's the concern I have
> here, and that's going to take a fair amount of work to characterise
> and measure the impact, especially on large machines with slow
> disks...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
Thread overview: 10+ messages
2021-03-26 15:39 xfs ioend batching log reservation deadlock Brian Foster
2021-03-26 17:32 ` Darrick J. Wong
2021-03-26 18:13 ` Brian Foster
2021-03-29 2:28 ` Dave Chinner
2021-03-29 18:04 ` Brian Foster
2021-03-29 23:51 ` Dave Chinner
2021-03-30 14:42 ` Brian Foster [this message]
2021-03-30 23:57 ` Dave Chinner
2021-03-31 12:26 ` Brian Foster
2021-03-29 5:45 ` Christoph Hellwig