linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Chris Mason <clm@fb.com>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH] [RFC] writeback: fix range_cyclic writeback vs writepages deadlock
Date: Tue, 26 Jun 2018 08:37:34 +1000	[thread overview]
Message-ID: <20180625223734.GA19934@dastard> (raw)
In-Reply-To: <984063CF-92B5-40CF-AFA6-96398D8692BF@fb.com>

On Fri, Jun 22, 2018 at 11:14:20AM -0400, Chris Mason wrote:
> On 21 Jun 2018, at 23:09, Dave Chinner wrote:
> 
> >From: Dave Chinner <dchinner@redhat.com>
> >
> >We've recently seen a workload on XFS filesystems with a repeatable
> >deadlock between background writeback and a multi-process
> >application doing concurrent writes and fsyncs to a small range of a
> >file.
> >
> >
> >The underlying cause of the problem here is that range_cyclic
> >writeback is processing pages in descending index order as we
> >hold higher index pages in a structure controlled from above
> >write_cache_pages(). The write_cache_pages() caller needs to
> >be able to submit these pages for IO before write_cache_pages
> >restarts writeback at mapping index 0 to avoid wcp inverting the
> >page lock/writeback wait order.
> >
> >generic_writepages() is not susceptible to this bug as it has no
> >private context held across write_cache_pages() - filesystems using
> >this infrastructure always submit pages in ->writepage immediately
> >and so there is no problem with range_cyclic going back to mapping
> >index 0.
> >
> >However:
> >	mpage_writepages() has a private bio context,
> >	exofs_writepages() has page_collect
> >	fuse_writepages() has fuse_fill_wb_data
> >	nfs_writepages() has nfs_pageio_descriptor
> >	xfs_vm_writepages() has xfs_writepage_ctx
> >
> >All of these ->writepages implementations can hold pages under
> >writeback in their private structures until write_cache_pages()
> >returns, and hence they are all susceptible to this deadlock.
> >
> >Also worth noting is that ext4 has it's own bastardised version of
> >write_cache_pages() and so it /may/ have an equivalent deadlock. I
> >looked at the code long enough to understand that it has a similar
> >retry loop for range_cyclic writeback reaching the end of the file
> >and then promptly ran away before my eyes bled too much.  I'll leave
> >it for the ext4 developers to determine if their code is actually
> >has this deadlock and how to fix it if it has.
> >
> >There's a few ways I can see avoid this deadlock. There's probably
> >more, but these are the first I've though of:
> >
> >1. get rid of range_cyclic altogether
> >
> >2. range_cyclic always stops at EOF, and we start again from
> >writeback index 0 on the next call into write_cache_pages()
> >
> >2a. wcp also returns EAGAIN to ->writepages implementations to
> >indicate range cyclic has hit EOF. write-pages implementations can
> >then flush the current context and call wpc again to continue. i.e.
> >lift the retry into the ->writepages implementation
> 
> Btrfs has a variation of 2a in our bastardized copy of
> write_cache_pages(), it has a call to flush the bios we've built up
> before doing operations that might deadlock, including waiting for
> writeback, locking pages etc:
> 
>                         if (wbc->sync_mode != WB_SYNC_NONE) {
>                                 if (PageWriteback(page))
>                                         flush_write_bio(epd);
>                                 wait_on_page_writeback(page);
>                         }

Yeah, that's essentially what I was thinking, but it's complicated
by the fact we don't know what the private data contains in wcp.
Seems like a reasonable thing to do if you have all the ducks in a
row.

> I don't see any problems with the solution you picked instead, but
> if we run into more complex variations of this we can just pass a
> callback and a flushing cookie to generic_writepages().
> 
> Open question on the perf benefits of staring IO early while we wait
> for writeback on page X or letting our bio build bigger.

It's async background writeback. IO latency doesn't matter - it's
likely to get throttled anyway - so really the optimisations should
focus around building the most well formed bios we can....

> Open question #2, at least for btrfs we're building a single fat bio
> by adding pages to it along the way.  This is a small variation on
> plugging...we could just pull the last bio off the plug stack and
> stuff the pages in.  Then we'd get all the deadlock prevent implicit
> in plugging for free.

The problem is that there's a lot of context around the bio that the
filesystem has to maintain as well, and we can't put that on the
plug list. I'd prefer that we don't make a simple one-way IO
aggregation mechanism (the plug list) much more complicated by
allowing tasks to use it as a per-task "almost but not quite
submitted bio" stack without a lot more thought about it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2018-06-25 22:37 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-22  3:09 [PATCH] [RFC] writeback: fix range_cyclic writeback vs writepages deadlock Dave Chinner
2018-06-22 14:41 ` Brian Foster
2018-06-25 22:27   ` Dave Chinner
2018-06-26 11:28     ` Brian Foster
2018-06-22 15:14 ` Chris Mason
2018-06-25 22:37   ` Dave Chinner [this message]
2018-09-10 11:51 ` Jan Kara
2018-09-10 23:21   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180625223734.GA19934@dastard \
    --to=david@fromorbit.com \
    --cc=clm@fb.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).