From: Dave Chinner <david@fromorbit.com>
To: Chris Mason <clm@fb.com>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH] [RFC] writeback: fix range_cyclic writeback vs writepages deadlock
Date: Tue, 26 Jun 2018 08:37:34 +1000 [thread overview]
Message-ID: <20180625223734.GA19934@dastard> (raw)
In-Reply-To: <984063CF-92B5-40CF-AFA6-96398D8692BF@fb.com>
On Fri, Jun 22, 2018 at 11:14:20AM -0400, Chris Mason wrote:
> On 21 Jun 2018, at 23:09, Dave Chinner wrote:
>
> >From: Dave Chinner <dchinner@redhat.com>
> >
> >We've recently seen a workload on XFS filesystems with a repeatable
> >deadlock between background writeback and a multi-process
> >application doing concurrent writes and fsyncs to a small range of a
> >file.
> >
> >
> >The underlying cause of the problem here is that range_cyclic
> >writeback can end up processing pages in descending index order,
> >because we hold higher index pages in a private structure owned by
> >the caller of write_cache_pages(). The write_cache_pages() caller
> >needs to be able to submit these pages for IO before
> >write_cache_pages() restarts writeback at mapping index 0, otherwise
> >wcp inverts the page lock/writeback wait order.
> >
> >generic_writepages() is not susceptible to this bug as it has no
> >private context held across write_cache_pages() - filesystems using
> >this infrastructure always submit pages in ->writepage immediately
> >and so there is no problem with range_cyclic going back to mapping
> >index 0.
> >
> >However:
> > mpage_writepages() has a private bio context,
> > exofs_writepages() has page_collect
> > fuse_writepages() has fuse_fill_wb_data
> > nfs_writepages() has nfs_pageio_descriptor
> > xfs_vm_writepages() has xfs_writepage_ctx
> >
> >All of these ->writepages implementations can hold pages under
> >writeback in their private structures until write_cache_pages()
> >returns, and hence they are all susceptible to this deadlock.
> >
> >Also worth noting is that ext4 has its own bastardised version of
> >write_cache_pages() and so it /may/ have an equivalent deadlock. I
> >looked at the code long enough to understand that it has a similar
> >retry loop for range_cyclic writeback reaching the end of the file
> >and then promptly ran away before my eyes bled too much. I'll leave
> >it to the ext4 developers to determine whether their code actually
> >has this deadlock and how to fix it if it does.
> >
> >There are a few ways I can see to avoid this deadlock. There are
> >probably more, but these are the first I've thought of:
> >
> >1. get rid of range_cyclic altogether
> >
> >2. range_cyclic always stops at EOF, and we start again from
> >writeback index 0 on the next call into write_cache_pages()
> >
> >2a. wcp also returns EAGAIN to ->writepages implementations to
> >indicate range_cyclic has hit EOF. ->writepages implementations can
> >then flush the current context and call wcp again to continue. i.e.
> >lift the retry into the ->writepages implementation
>
> Btrfs has a variation of 2a in our bastardized copy of
> write_cache_pages(): it has a call to flush the bios we've built up
> before doing operations that might deadlock, including waiting for
> writeback, locking pages, etc.:
>
> if (wbc->sync_mode != WB_SYNC_NONE) {
>         if (PageWriteback(page))
>                 flush_write_bio(epd);
>         wait_on_page_writeback(page);
> }
Yeah, that's essentially what I was thinking, but it's complicated
by the fact we don't know what the private data contains in wcp.
Seems like a reasonable thing to do if you have all the ducks in a
row.
> I don't see any problems with the solution you picked instead, but
> if we run into more complex variations of this we can just pass a
> callback and a flushing cookie to generic_writepages().
>
> Open question on the perf benefits of starting IO early while we wait
> for writeback on page X or letting our bio build bigger.
It's async background writeback. IO latency doesn't matter - it's
likely to get throttled anyway - so really the optimisations should
focus around building the most well formed bios we can....
> Open question #2: at least for btrfs, we're building a single fat bio
> by adding pages to it along the way. This is a small variation on
> plugging... we could just pull the last bio off the plug stack and
> stuff the pages in. Then we'd get all the deadlock prevention
> implicit in plugging for free.
The problem is that there's a lot of context around the bio that the
filesystem has to maintain as well, and we can't put that on the
plug list. I'd prefer that we don't make a simple one-way IO
aggregation mechanism (the plug list) much more complicated by
allowing tasks to use it as a per-task "almost but not quite
submitted bio" stack without a lot more thought about it.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com