From: Dave Chinner <david@fromorbit.com>
To: Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-kernel@vger.kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	Dan Williams <dan.j.williams@intel.com>,
	Ingo Molnar <mingo@redhat.com>, Jan Kara <jack@suse.com>,
	Jeff Layton <jlayton@poochiereds.net>,
	Matthew Wilcox <willy@linux.intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org,
	xfs@oss.sgi.com, Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <matthew.r.wilcox@intel.com>
Subject: Re: [RFC 00/11] DAX fsynx/msync support
Date: Mon, 2 Nov 2015 10:29:48 +1100
Message-ID: <20151101232948.GF10656@dastard>
In-Reply-To: <20151030183938.GC24643@linux.intel.com>

On Fri, Oct 30, 2015 at 12:39:38PM -0600, Ross Zwisler wrote:
> On Fri, Oct 30, 2015 at 02:55:33PM +1100, Dave Chinner wrote:
> > On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote:
> > > This patch series adds support for fsync/msync to DAX.
> > > 
> > > Patches 1 through 8 add various utilities that the DAX code will eventually
> > > need, and the DAX code itself is added by patch 9.  Patches 10 and 11 are
> > > filesystem changes that are needed after the DAX code is added, but these
> > > patches may change slightly as the filesystem fault handling for DAX is
> > > being modified ([1] and [2]).
> > > 
> > > I've marked this series as RFC because I'm still testing, but I wanted to
> > > get this out there so people would see the direction I was going and
> > > hopefully comment on any big red flags sooner rather than later.
> > > 
> > > I realize that we are getting pretty dang close to the v4.4 merge window,
> > > but I think that if we can get this reviewed and working it's a much better
> > > solution than the "big hammer" approach that blindly flushes entire PMEM
> > > namespaces [3].
> > 
> > We need the "big hammer" regardless of fsync. If REQ_FLUSH and
> > REQ_FUA don't do the right thing when it comes to ordering journal
> > writes against other IO operations, then the filesystems are not
> > crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding
> > changes back to stable storage, just like they do for existing
> > storage....
> 
> I think that what I've got here (when it's fully working) will protect all the
> cases that we need.
> 
> AFAIK there are three ways that data can be written to a PMEM namespace:
> 
> 1) Through the PMEM driver via either pmem_make_request(), pmem_rw_page() or
> pmem_rw_bytes().  All of these paths sync the newly written data durably to
> media before the I/O completes so they shouldn't have any reliance on
> REQ_FUA/REQ_FLUSH.

I suspect that not all future pmem devices will use this
driver/interface/semantics.

Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable
storage" commands. They are also IO barriers that affect scheduling
of IOs in progress and in the request queues.  A REQ_FLUSH/REQ_FUA
IO cannot be dispatched before all prior IO has been dispatched and
drained from the request queue, and IO submitted after a queued
REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued
REQ_FLUSH/REQ_FUA operation.

IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable
storage, they also guarantee the order of IO dispatch and
completion when concurrent IO is in progress.
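
To make the ordering described above concrete, here is a toy userspace
model in Python. It is only a sketch of the semantics as stated in this
message, not of the block layer implementation: `Req`, `barrier` and
`dispatch_order()` are made-up names, and sorting by sector merely
stands in for whatever reordering a scheduler might do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Req:
    sector: int
    # Hypothetical flag standing in for REQ_FLUSH/REQ_FUA in this model.
    barrier: bool = False


def dispatch_order(queue):
    """Return the order in which a toy scheduler dispatches `queue`.

    Requests between two barriers may be reordered (here: sorted by
    sector), but nothing crosses a barrier in either direction.
    """
    out, segment = [], []
    for req in queue:
        if req.barrier:
            # Drain everything queued before the barrier first ...
            out.extend(sorted(segment, key=lambda r: r.sector))
            # ... then dispatch the barrier itself.
            out.append(req)
            segment = []
        else:
            segment.append(req)
    # Requests submitted after the last barrier are dispatched last.
    out.extend(sorted(segment, key=lambda r: r.sector))
    return out
```

In this model, IO queued behind a barrier can never be dispatched ahead
of it, no matter how attractive its sector number is to the scheduler.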


> 2) Through the DAX I/O path, dax_io().  As with PMEM we flush the newly
> written data durably to media before the I/O operation completes, so this path
> shouldn't have any reliance on REQ_FUA/REQ_FLUSH.

That's fine, but that's not the problem we need solved ;)

> 3) Through mmaps set up by DAX.  This is the path we are trying to protect
> with the dirty page tracking and flushing in this patch set, and I think that
> this is the only path that has reliance on REQ_FLUSH.

Quite possibly this is the case for the current Intel pmem driver,
but I don't look at the functionality from that perspective.

Dirty page tracking is needed to enable "data writeback", whether it
be CPU cachelines via pcommit() or dirty pages via submit_bio(). How
the pages get dirty is irrelevant - the fact is they are dirty and
we need to do /something/ to ensure they are correctly written back
to the storage layer.

REQ_FLUSH is needed to guarantee all data that has been written back
to the storage layer is persistent in that layer.  How a /driver/
manages that is up to the driver - the actual implementation is
irrelevant to the higher layers. i.e. what we are concerned about at
the filesystem level is that:

	a) "data writeback" is started correctly;
	b) the "data writeback" is completed; and
	c) volatile caches are completely flushed before we write
	   the metadata changes that reference that data to the
	   journal via FUA

e.g. we could have pmem, but we are using buffered IO (i.e. non-DAX)
and a hardware driver that doesn't flush CPU cachelines in the
physical IO path. This requires that driver to flush CPU cachelines
and place memory barriers in REQ_FLUSH operations, as well as after
writing the data in REQ_FUA operations.  Yes, this is different to
the way the Intel pmem driver works (i.e. as noted in 1) above),
but it is /not wrong/ as long as REQ_FLUSH/REQ_FUA also flush dirty
CPU cachelines.
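
The a) to c) sequence above, followed by the FUA journal write, can be
sketched as a toy userspace model in Python. Everything in it is
hypothetical (`ToyDevice`, `toy_fsync()`, the block numbers): the
`volatile` dict stands in for whatever the driver leaves unflushed, be
it CPU cachelines or a device write cache, and writeback is synchronous
here so that a) and b) collapse into one loop.

```python
class ToyDevice:
    """Storage with a volatile layer in front of stable media."""

    def __init__(self):
        self.volatile = {}  # e.g. dirty CPU cachelines or a write cache
        self.stable = {}    # what survives a crash in this model

    def write(self, block, data, fua=False):
        if fua:
            # FUA: this write is stable before it completes.
            self.stable[block] = data
            self.volatile.pop(block, None)
        else:
            # A plain write may sit in the volatile layer indefinitely.
            self.volatile[block] = data

    def flush(self):
        # REQ_FLUSH analogue: how the device does this is its business.
        self.stable.update(self.volatile)
        self.volatile.clear()


def toy_fsync(dev, dirty_pages, journal_block, journal_record):
    # a) + b): start data writeback and wait for it (synchronous here).
    for block, data in dirty_pages.items():
        dev.write(block, data)
    dirty_pages.clear()
    # c): flush volatile caches before the metadata references the data.
    dev.flush()
    # Finally, write the journal record with FUA.
    dev.write(journal_block, journal_record, fua=True)
```

The higher layer only ever calls flush() and a FUA write; it never
looks at how the device makes the data stable, which is the property
the fsync code for DAX should preserve.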

IOWs, the high level code we write that implements fsync
for DAX needs to be generic enough so that when something slightly
different comes along we don't have to throw everything away and
start again. I think your code will end up being generic enough to
handle this, but let's make sure we don't implement something that
can only work with pmem hardware/drivers that do all IO as fully
synchronous to the stable domain...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 27+ messages
2015-10-29 20:12 [RFC 00/11] DAX fsynx/msync support Ross Zwisler
2015-10-29 20:12 ` [RFC 01/11] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
2015-10-29 20:12 ` [RFC 02/11] mm: add pmd_mkclean() Ross Zwisler
2015-10-29 20:12 ` [RFC 03/11] pmem: enable REQ_FLUSH handling Ross Zwisler
2015-10-29 20:12 ` [RFC 04/11] dax: support dirty DAX entries in radix tree Ross Zwisler
2015-10-29 20:12 ` [RFC 05/11] mm: add follow_pte_pmd() Ross Zwisler
2015-10-29 20:12 ` [RFC 06/11] mm: add pgoff_mkclean() Ross Zwisler
2015-10-29 20:12 ` [RFC 07/11] mm: add find_get_entries_tag() Ross Zwisler
2015-10-29 20:12 ` [RFC 08/11] fs: add get_block() to struct inode_operations Ross Zwisler
2015-10-29 20:12 ` [RFC 09/11] dax: add support for fsync/sync Ross Zwisler
2015-10-29 20:12 ` [RFC 10/11] xfs, ext2: call dax_pfn_mkwrite() on write fault Ross Zwisler
2015-10-29 20:12 ` [RFC 11/11] ext4: add ext4_dax_pfn_mkwrite() Ross Zwisler
2015-10-29 22:49 ` [RFC 00/11] DAX fsynx/msync support Ross Zwisler
2015-10-30  3:55 ` Dave Chinner
2015-10-30 18:39   ` Ross Zwisler
2015-11-01 23:29     ` Dave Chinner [this message]
2015-11-02 14:22       ` Jeff Moyer
2015-11-02 20:10         ` Dave Chinner
2015-11-02 21:02           ` Jeff Moyer
2015-11-04 18:34             ` Jeff Moyer
2015-11-05  8:33             ` Dave Chinner
2015-11-05 19:49               ` Jeff Moyer
2015-11-05 20:54               ` Jens Axboe
2015-10-30 18:34 ` Dan Williams
2015-10-30 19:43   ` Ross Zwisler
2015-10-30 19:51     ` Dan Williams
2015-11-01 23:36       ` Dave Chinner
