Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>, jmoyer <jmoyer@redhat.com>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Dave Jiang <dave.jiang@intel.com>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Linux MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps
Date: Wed, 31 Oct 2018 09:49:04 +1100
Message-ID: <20181030224904.GT19305@dastard> (raw)
In-Reply-To: <CAPcyv4ixoAh7HEMfm+B4sRDx1Qwm6SHGjtQ+5r3EKsxreRydrA@mail.gmail.com>

On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote:
> On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> > > On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > > > MAP_SYNC
> > > > > - file system guarantees that metadata required to reach faulted-in file
> > > > >   data is consistent on media before a write fault is completed.  A
> > > > >   side-effect is that the page cache will not be used for
> > > > >   writably-mapped pages.
> > > >
> > > > I think you are conflating current implementation with API
> > > > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > > > use. The man page definition simply says "supported only for files
> > > > supporting DAX" and that it provides certain data integrity
> > > > guarantees. It does not define the implementation.
....
> > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> > > filesystems) so usually people just won't notice. If fallback for
> > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.
> >
> > Which is just like the situation where O_DIRECT on ext3 was not very
> > useful, but on other filesystems like XFS it was fully functional.
> >
> > IMO, the fact that a specific filesytem has a suboptimal fallback
> > path for an uncommon behaviour isn't an argument against MAP_DIRECT
> > as a hint - it's actually a feature. If MAP_DIRECT can't be used
> > until it's always direct access, then most filesystems wouldn't be
> > able to provide any faster paths at all. It's much better to have
> > partial functionality now than it is to never have the functionality
> > at all, and so we need to design in the flexibility we need to
> > iteratively improve implementations without needing API changes that
> > will break applications.
> 
> The hard guarantee requirement still remains though because an
> application that expects combined MAP_SYNC|MAP_DIRECT semantics will
> be surprised if the MAP_DIRECT property silently disappears.

Why would they be surprised? They won't even notice it if the
filesystem can provide MAP_SYNC without MAP_DIRECT.

And that's the whole point.

MAP_DIRECT is a private mapping state. So is MAP_SYNC. They are not
visible to the filesystem and the filesystem does nothing to enforce
them. If someone does something that requires the page cache (e.g.
calls do_splice_direct()) then that MAP_DIRECT mapping has a whole
heap of new work to do. And, in some cases, the filesystem may not
be able to provide MAP_DIRECT as a result..

IOWs, the filesystem cannot guarantee MAP_DIRECT and the
circumstances under which MAP_DIRECT will and will not work are
dynamic. If MAP_DIRECT is supposed to be a guarantee then we'll have
applications randomly segfaulting in production as things like
backups, indexers, etc run over the filesystem and do their work.

This is why MAP_DIRECT needs to be an optimisation, not a
requirement - things will still work if MAP_DIRECT is not used. What
matters to these applications is MAP_SYNC - if we break MAP_SYNC,
then the application data integrity model is violated. That's not an
acceptible outcome.

The problem, it seems to me, is that people are unable to separate
MAP_DIRECT and MAP_SYNC. I suspect that is because, at present,
MAP_SYNC on XFS and ext4 requires MAP_DIRECT. i.e. we can only
provide MAP_SYNC functionality on DAX mappings. However, that's a
/filesystem implementation issue/, not an API guarantee we need to
provide to userspace.

If we implement a persistent page cache (e.g. allocate page cache
pages out of ZONE_DEVICE pmem), then filesystems like XFS and ext4
could provide applications with the MAP_SYNC data integrity model
without MAP_DIRECT. Indeed, those filesystems would not even be able
to provide MAP_DIRECT semantics because they aren't backed by pmem.

Hence if applications that want MAP_SYNC are hard coded
MAP_SYNC|MAP_DIRECT and we make MAP_DIRECT a hard guarantee, then
those applications are going to fail on a filesystem that provides
only MAP_SYNC. This is despite the fact the applications would
function correctly and the data integrity model would be maintained.
i.e. the failure is because applications have assumed MAP_SYNC can
only be provided by a DAX implementation, not because MAP_SYNC is
not supported.

MAP_SYNC really isn't about DAX at all. It's about enabling a data
integrity model that requires the filesystem to provide userspace
access to CPU addressable persistent memory.  DAX+MAP_DIRECT is just
one method of providing this functionality, but it's not the only
method. Our API needs to be future proof rather than an encoding of
the existing implementation limitations, otherwise apps will have to
be re-written as every new MAP_SYNC capable technology comes along.

In summary:

	MAP_DIRECT is an access hint.

	MAP_SYNC provides a data integrity model guarantee.

	MAP_SYNC may imply MAP_DIRECT for specific implementations,
	but it does not require or guarantee MAP_DIRECT.

Let's compare that with O_DIRECT:

	O_DIRECT in an access hint.

	O_DSYNC provides a data integrity model guarantee.

	O_DSYNC may imply O_DIRECT for specific implementations, but
	it does not require or guarantee O_DIRECT.

Consistency in access and data integrity models is a good thing. DAX
and pmem is not an exception. We need to use a model we know works
and has proven itself over a long period of time.

> I think
> it still makes some sense as a hint for apps that want to minimize
> page cache, but for the applications with a flush from userspace model
> I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This
> still gives the filesystem the option to inject page-cache at will,
> but with an application coordination point.

Why make it more complex for applications than it needs to be? 

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply index

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-02 10:05 Jan Kara
2018-10-02 10:50 ` Michal Hocko
2018-10-02 13:32   ` Jan Kara
2018-10-02 12:10 ` Johannes Thumshirn
2018-10-02 14:20   ` Johannes Thumshirn
2018-10-02 14:45     ` Christoph Hellwig
2018-10-02 15:01       ` Johannes Thumshirn
2018-10-02 15:06         ` Christoph Hellwig
2018-10-04 10:09           ` Johannes Thumshirn
2018-10-05  6:25             ` Christoph Hellwig
2018-10-05  6:35               ` Johannes Thumshirn
2018-10-06  1:17                 ` Dan Williams
2018-10-14 15:47                   ` Dan Williams
2018-10-17 20:01                     ` Dan Williams
2018-10-18 17:43                       ` Jan Kara
2018-10-18 19:10                         ` Dan Williams
2018-10-19  3:01                           ` Dave Chinner
2018-10-02 14:29   ` Jan Kara
2018-10-02 14:37     ` Christoph Hellwig
2018-10-02 14:44       ` Johannes Thumshirn
2018-10-02 14:52         ` Christoph Hellwig
2018-10-02 15:31           ` Jan Kara
2018-10-02 20:18             ` Dan Williams
2018-10-03 12:50               ` Jan Kara
2018-10-03 14:38                 ` Dan Williams
2018-10-03 15:06                   ` Jan Kara
2018-10-03 15:13                     ` Dan Williams
2018-10-03 16:44                       ` Jan Kara
2018-10-03 21:13                         ` Dan Williams
2018-10-04 10:04                         ` Johannes Thumshirn
2018-10-02 15:07       ` Jan Kara
2018-10-17 20:23     ` Jeff Moyer
2018-10-18  0:25       ` Dave Chinner
2018-10-18 14:55         ` Jan Kara
2018-10-19  0:43           ` Dave Chinner
2018-10-30  6:30             ` Dan Williams
2018-10-30 22:49               ` Dave Chinner [this message]
2018-10-30 22:59                 ` Dan Williams
2018-10-31  5:59                 ` y-goto
2018-11-01 23:00                   ` Dave Chinner
2018-11-02  1:43                     ` y-goto
2018-10-18 21:05         ` Jeff Moyer
2018-10-09 19:43 ` Jeff Moyer
2018-10-16  8:25   ` Jan Kara
2018-10-16 12:35     ` Jeff Moyer

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181030224904.GT19305@dastard \
    --to=david@fromorbit.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=jack@suse.cz \
    --cc=jmoyer@redhat.com \
    --cc=jthumshirn@suse.de \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git