All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: david <david@fromorbit.com>
Cc: jmoyer <jmoyer@redhat.com>, Eric Sandeen <sandeen@sandeen.net>,
	zwisler@kernel.org, Christoph Hellwig <hch@lst.de>,
	Jan Kara <jack@suse.cz>, linux-xfs <linux-xfs@vger.kernel.org>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH 0/3] ext2, ext4, xfs: hard fail dax mount on unsupported devices
Date: Wed, 17 Oct 2018 19:01:43 -0700	[thread overview]
Message-ID: <CAPcyv4gSAsDg0uQkZwAWn-pASmxH0LKwx6MwQuWs-YKwnS1eRA@mail.gmail.com> (raw)
In-Reply-To: <20181018010500.GD6311@dastard>

On Wed, Oct 17, 2018 at 6:05 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Oct 17, 2018 at 02:44:55PM -0700, Dan Williams wrote:
> > On Wed, Oct 17, 2018 at 2:31 PM Jeff Moyer <jmoyer@redhat.com> wrote:
> > >
> > > Eric Sandeen <sandeen@sandeen.net> writes:
> > >
> > > > I've been thinking about the per-inode stuff a bit, and while I don't know
> > > > how to resolve some of the trickier issues, at least the expected behavior
> > > > seems like something we can narrow down and specify.
> > > >
> > > > Because it's an on-disk flag (in xfs today, in any case) it seems that
> > > > the only sane behavior to expect is either/or, i.e.:
> > > >
> > > > Mount option: All files always dax, per-inode flags ignored (or rejected)
> > > > Per-inode: Mount option cannot be specified; only inodes explicitly flagged are dax
> > > >
> > > > Think about it; what would mount-option-plus-per-inode mean?  We have
> > > > no "negative" dax flag, so while mount-option-with-flag surely means
> > > > "dax", what the heck does mount-option-without-flag mean, and how is it
> > > > distinguishable from mount option only?
> > > >
> > > > I submit that flags can only have meaning w/o the fs-wide mount option
> > > > enabled, so the question of "should we hard fail mount -o dax for devices
> > > > that cannot support it" seems to be orthogonal to the per-inode question.
> > > >
> > > > i.e. mount -o dax really can only mean "I want dax on everything" and so
> > > > again, I think we probably need to fail the mount if that can't be honored.
> > >
> > > I hate to even open up this can of worms, but what about killing the dax
> > > mount option?
> > >
> > > To quote Christoph:
> > >   How does an application "make use of DAX"?  What actual user visible
> > >   semantics are associated with a file that has this flag set?
> > >
> > > We're already talking about making caching decisions automatically, so
> > > does DAX even mean anything at that point?  If the storage and the file
> > > system support it, enable it.
> > >
> > > From what we've seen so far, aplications want:
> > > 1) to be able to make data persistent from userspace
> > >    For this, we have MAP_SYNC.
> > > 2) to determine whether or not page cache will be used
> > >    For this, we have O_DIRECT for read/write access, and MAP_SYNC for
> > >    mmap access (and maybe a third option coming, we'll see).
> >
> > As Jan has said, it's not safe to assume that 'no page cache' is
> > implied with MAP_SYNC. It's a side effect not a contract of the
> > current implementation.
>
> Even MAP_DIRECT shouldn't mean "no page cache". O_DIRECT is a hint,
> not a guarantee, and so it may very well use the page cache if it
> needs to (as I've just explained in detail in a different thread).
>
> > > The only thing users gain from a mount option is the ability to turn OFF
> > > dax.  I suppose there might be a use case that wants this, but I'm not
> > > aware of it.
> >
> > I think we're stuck with it as many scripts would break if it ever
> > went completely away. However, we could mark it deprecated / ignored
>
> I don't really care that much about this - it is still marked
> experimental.
>
> That said, deprecation is the best way forward here if we are going
> to remove the mount option. We've done this for other XFS mount
> options recently (e.g. barrier/nobarrier) where the functionality is
> now fully baked into the fileystem and there's no user option to
> control it anymore.
>
> What we really need is a document describing the expected behaviour
> of filesysetms on dax-capable storage. Let's nail down exactly what
> we need to do to pull DAX out of the experimental state before we
> start changing things. We've been doing things in a very ad-hoc way
> for a while now, and we're not really converging on an endpoint where we
> can say "we're done, have at it".
>
> I think we need to decide on:
>
> - default filesystem behaviour on dax-capable block devices
> - what information aout DAX do applications actually need? What
>   makes sense to provide them with that information?
> - how to provide hints to the kernel for desired behaviour
>   - on-disk inode flags, or something else?
>   - dax/nodax mount options or root dir inode flags become default
>     global hints?
>   - is a single hint flag sufficient or do we also need an
>     explicit "do not use dax" flag?
> - behaviour of MAP_SYNC w.r.t. non-DAX filesystems that can provide
>   required MAP_SYNC semnatics
> - behaviour of MAP_DIRECT - hint like O_DIRECT or guarantee?
> - default read/write path behaviour of dax-capable block devices
>   - automatically bypass the pagecache if bdev is capable?
> - default mmap behaviour on dax capable devices
>   - use dax always?
> - DAX vs get_user_pages_longterm
>   - turns off DAX dynamically?
>   - how do DAX-enabled filesystems interact with page fault capable
>     hardware? Can we allow DAX in those cases?
>
> I'm sure there's a heap more we need to document and nail down.
> There's a lot of stuff to sort out before we start hammering on
> random bits of code....

Nice, yes, I'll add some more:

- Is MADV_DIRECT_ACCESS a hint or a requirement?
- How does the kernel communicate the effective mode of a mapping
  taking into account madvise(), inode flags, mount options, and / or
  default fs behavior? New madvice() syscall?
- What is the behavior of dax in the presence of reflink'd extents?
  Just failing seems the 'experimental' behavior. What to do about
  page->index when page belongs to more than 1 file via reflink?
- Is there ever a case to force disable dax operation? To date we've
  only ever thought about interfaces to force *enable* dax operation
- The virtio-pmem use case wants dax mappings but requires an explicit
  fsync() instead of MAP_SYNC to flush software buffers, it's a DAX
  sub-set, should it have it's own name?
- DAX operation is loosely tied to block devices. There has been
  discussions of mounting filesystems on /dev/dax devices directly.
  Should we take that to its logical conclusion and support a
  block-layer-less conversion of dax-capable file systems?
- Willy has proposed that the Xarray cache file-offset-to-physical
  address lookups, currently it only tracks dirty mapping state
- The NVDIMM sub-system tracks badblocks, but the filesytem currently
  only finds out about them late when it attempts dax_direct_access().
  Applications want to be able to list files+offsets that have
  experienced media corruption.

> > provided we had a way for applications to query and override if DAX is
> > enabled. I also think it's important to keep separate the dax-mmap
> > behavior from the dax-read/write behavior. dax-mmap is where an
> > application would make different decisions if it can get a mapping
> > without page cache,
>
> The functionality people keep saying "requires DAX" really doesn't -
> what it really requires is that mmap() exposes filesystem tracked
> pmem in a CPU addressable memory range. DAX is not the only way to
> do that - a filesystem with a pmem-based persistent page cache can
> provide MAP_SYNC semantics to userspace without being a DAX
> filesystem.

*nod*

      reply	other threads:[~2018-10-18 10:00 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-08 19:32 [PATCH 0/3] ext2, ext4, xfs: hard fail dax mount on unsupported devices Eric Sandeen
2018-10-08 19:32 ` [PATCH 1/3] " Eric Sandeen
2018-10-08 22:10   ` Dave Chinner
2018-10-08 19:32 ` [PATCH 2/3] ext4: " Eric Sandeen
2018-12-04  5:48   ` Theodore Y. Ts'o
2018-10-08 19:32 ` [PATCH 3/3] ext2: " Eric Sandeen
2018-10-11 10:36 ` [PATCH 0/3] ext2, ext4, xfs: " Jan Kara
2018-10-11 18:08   ` Dan Williams
2018-10-11 18:38     ` Eric Sandeen
2018-10-12  2:21       ` Theodore Y. Ts'o
2018-10-12  8:21       ` Christoph Hellwig
2018-10-13 16:05         ` Ross Zwisler
2018-10-17 19:42           ` Eric Sandeen
2018-10-17 19:51             ` Ross Zwisler
2018-10-17 19:52             ` Dan Williams
2018-10-17 21:31             ` Jeff Moyer
2018-10-17 21:44               ` Dan Williams
2018-10-18  1:05                 ` Dave Chinner
2018-10-18  2:01                   ` Dan Williams [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPcyv4gSAsDg0uQkZwAWn-pASmxH0LKwx6MwQuWs-YKwnS1eRA@mail.gmail.com \
    --to=dan.j.williams@intel.com \
    --cc=david@fromorbit.com \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=jmoyer@redhat.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=sandeen@sandeen.net \
    --cc=zwisler@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.