linux-kernel.vger.kernel.org archive mirror
From: Jeff Moyer <jmoyer@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-kernel@vger.kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	Dan Williams <dan.j.williams@intel.com>,
	Ingo Molnar <mingo@redhat.com>, Jan Kara <jack@suse.com>,
	Jeff Layton <jlayton@poochiereds.net>,
	Matthew Wilcox <willy@linux.intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org,
	xfs@oss.sgi.com, Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <matthew.r.wilcox@intel.com>,
	axboe@kernel.dk
Subject: Re: [RFC 00/11] DAX fsynx/msync support
Date: Mon, 02 Nov 2015 16:02:48 -0500
Message-ID: <x49twp4p11j.fsf@segfault.boston.devel.redhat.com>
In-Reply-To: <20151102201029.GI10656@dastard> (Dave Chinner's message of "Tue, 3 Nov 2015 07:10:29 +1100")

Dave Chinner <david@fromorbit.com> writes:

> On Mon, Nov 02, 2015 at 09:22:15AM -0500, Jeff Moyer wrote:
>> Dave Chinner <david@fromorbit.com> writes:
>> 
>> > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable
>> > storage" commands. They are also IO barriers that affect scheduling
>> > of IOs in progress and in the request queues.  A REQ_FLUSH/REQ_FUA
>> > IO cannot be dispatched before all prior IO has been dispatched and
>> > drained from the request queue, and IO submitted after a queued
>> > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued
>> > REQ_FLUSH/REQ_FUA operation.
>> >
>> > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable
>> > storage, they also guarantee the order of IO dispatch and
>> > completion when concurrent IO is in progress.
>> 
>> This hasn't been the case for several years, now.  It used to work that
>> way, and that was deemed a big performance problem.  Since file systems
>> already issued and waited for all I/O before sending down a barrier, we
>> decided to get rid of the I/O ordering pieces of barriers (and stop
>> calling them barriers).
>> 
>> See commit 28e7d184521 (block: drop barrier ordering by queue draining).
>
> Yes, I realise that, even if I wasn't very clear about how I wrote
> it. ;)
>
> Correct me if I'm wrong: AFAIA, dispatch ordering (i.e. the "IO
> barrier") is still enforced by the scheduler via REQ_FUA|REQ_FLUSH
> -> ELEVATOR_INSERT_FLUSH -> REQ_SOFTBARRIER and subsequent IO
> scheduler calls to elv_dispatch_sort() that don't pass
> REQ_SOFTBARRIER in the queue.

This part is right.

> IOWs, if we queue a bunch of REQ_WRITE IOs followed by a
> REQ_WRITE|REQ_FLUSH IO, all of the prior REQ_WRITE IOs will be
> dispatched before the REQ_WRITE|REQ_FLUSH IO and hence be captured
> by the cache flush.

But this part is not.  It is up to the I/O scheduler to decide when to
dispatch requests.  It can hold on to them for a variety of reasons.
Flush requests, however, do not go through the I/O scheduler.  At the
very moment that the flush request is inserted, it goes directly to the
dispatch queue (assuming no other flush is in progress).  The prior
requests may still be waiting in the I/O scheduler's internal lists.

So, any newly dispatched I/Os will certainly not get past the REQ_FLUSH.
However, the REQ_FLUSH is very likely to jump ahead of prior I/Os in the
queue.
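
A toy model may make the reordering easier to see.  This is only a
sketch, not kernel code: the ToyQueue class, its two lists, and the
request names W1/W2/W3/FLUSH are all made up for illustration, and the
"no other flush in progress" case is the only one modelled.  Ordinary
writes sit in the scheduler's internal list until the scheduler decides
to release them, while a flush is appended to the dispatch queue
immediately.

```python
from collections import deque


class ToyQueue:
    """Hypothetical model: scheduler-internal list plus a dispatch queue."""

    def __init__(self):
        self.sched = deque()     # requests still held by the I/O scheduler
        self.dispatch = deque()  # requests handed to the driver, in order

    def submit(self, name, flush=False):
        if flush:
            # Assumption: flush requests bypass the scheduler entirely
            # and land on the dispatch queue at insertion time.
            self.dispatch.append(name)
        else:
            # Ordinary requests wait until the scheduler releases them.
            self.sched.append(name)

    def scheduler_run(self):
        # The scheduler may release requests whenever it likes; this toy
        # releases everything it holds, in submission order.
        while self.sched:
            self.dispatch.append(self.sched.popleft())


q = ToyQueue()
q.submit("W1")
q.submit("W2")
q.submit("FLUSH", flush=True)  # submitted after W1 and W2
q.submit("W3")
q.scheduler_run()

order = list(q.dispatch)
print(order)  # ['FLUSH', 'W1', 'W2', 'W3']
```

In this model the flush lands ahead of W1 and W2 even though it was
submitted after them, and nothing dispatched later gets in front of it.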

> Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO
> to complete, we know that all the earlier REQ_WRITE IOs are on
> stable storage, too. Hence there's no need for the elevator to drain
> the queue to guarantee completion ordering - the dispatch ordering
> and flush/fua write semantics guarantee that when the flush/fua
> completes, all the IOs dispatched prior to that flush/fua write are
> also on stable storage...

Does XFS rely on this model for correctness?  If so, I'd say we've got a
problem.
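
To spell out why it would be a problem, here is a second toy sketch
(again not kernel code; ToyDevice, run(), and the request names are
hypothetical).  It assumes a device with a volatile write cache where a
flush only persists writes that have already reached the device.  If
the flush is dispatched ahead of the earlier writes, they are not
covered by it; if the file system waits for the writes to complete and
only then issues the flush, they are.

```python
class ToyDevice:
    """Hypothetical device with a volatile cache and stable storage."""

    def __init__(self):
        self.cache = []   # completed writes that are not yet persistent
        self.stable = []  # writes that have been made persistent

    def execute(self, req):
        if req == "FLUSH":
            # A flush persists only what is already in the cache.
            self.stable.extend(self.cache)
            self.cache.clear()
        else:
            self.cache.append(req)


def run(dispatch_order):
    """Feed requests to a fresh device and return what ended up stable."""
    dev = ToyDevice()
    for req in dispatch_order:
        dev.execute(req)
    return dev.stable


# Flush overtook the earlier writes: nothing is stable when it completes.
overtaken = run(["FLUSH", "W1", "W2"])

# Writes completed first, then the flush was issued: both are stable.
waited = run(["W1", "W2", "FLUSH"])

print(overtaken, waited)
```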

Cheers,
Jeff
