Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: david <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>, jmoyer <jmoyer@redhat.com>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Dave Jiang <dave.jiang@intel.com>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Linux MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps
Date: Tue, 30 Oct 2018 15:59:09 -0700
Message-ID: <CAPcyv4gwBJ65170Gy+EaDLpomA93kxuWTM10mZeaPYV6Rq=SRw@mail.gmail.com> (raw)
In-Reply-To: <20181030224904.GT19305@dastard>

On Tue, Oct 30, 2018 at 3:49 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote:
> > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote:
> > > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> > > > On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > > > > MAP_SYNC
> > > > > > - file system guarantees that metadata required to reach faulted-in file
> > > > > >   data is consistent on media before a write fault is completed.  A
> > > > > >   side-effect is that the page cache will not be used for
> > > > > >   writably-mapped pages.
> > > > >
> > > > > I think you are conflating current implementation with API
> > > > > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > > > > use. The man page definition simply says "supported only for files
> > > > > supporting DAX" and that it provides certain data integrity
> > > > > guarantees. It does not define the implementation.
> ....
> > > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> > > > filesystems) so usually people just won't notice. If fallback for
> > > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.
> > >
> > > Which is just like the situation where O_DIRECT on ext3 was not very
> > > useful, but on other filesystems like XFS it was fully functional.
> > >
> > > IMO, the fact that a specific filesytem has a suboptimal fallback
> > > path for an uncommon behaviour isn't an argument against MAP_DIRECT
> > > as a hint - it's actually a feature. If MAP_DIRECT can't be used
> > > until it's always direct access, then most filesystems wouldn't be
> > > able to provide any faster paths at all. It's much better to have
> > > partial functionality now than it is to never have the functionality
> > > at all, and so we need to design in the flexibility we need to
> > > iteratively improve implementations without needing API changes that
> > > will break applications.
> >
> > The hard guarantee requirement still remains though because an
> > application that expects combined MAP_SYNC|MAP_DIRECT semantics will
> > be surprised if the MAP_DIRECT property silently disappears.
>
> Why would they be surprised? They won't even notice it if the
> filesystem can provide MAP_SYNC without MAP_DIRECT.
>
> And that's the whole point.
>
> MAP_DIRECT is a private mapping state. So is MAP_SYNC. They are not
> visible to the filesystem and the filesystem does nothing to enforce
> them. If someone does something that requires the page cache (e.g.
> calls do_splice_direct()) then that MAP_DIRECT mapping has a whole
> heap of new work to do. And, in some cases, the filesystem may not
> be able to provide MAP_DIRECT as a result..
>
> IOWs, the filesystem cannot guarantee MAP_DIRECT and the
> circumstances under which MAP_DIRECT will and will not work are
> dynamic. If MAP_DIRECT is supposed to be a guarantee then we'll have
> applications randomly segfaulting in production as things like
> backups, indexers, etc run over the filesystem and do their work.
>
> This is why MAP_DIRECT needs to be an optimisation, not a
> requirement - things will still work if MAP_DIRECT is not used. What
> matters to these applications is MAP_SYNC - if we break MAP_SYNC,
> then the application data integrity model is violated. That's not an
> acceptible outcome.
>
> The problem, it seems to me, is that people are unable to separate
> MAP_DIRECT and MAP_SYNC. I suspect that is because, at present,
> MAP_SYNC on XFS and ext4 requires MAP_DIRECT. i.e. we can only
> provide MAP_SYNC functionality on DAX mappings. However, that's a
> /filesystem implementation issue/, not an API guarantee we need to
> provide to userspace.
>
> If we implement a persistent page cache (e.g. allocate page cache
> pages out of ZONE_DEVICE pmem), then filesystems like XFS and ext4
> could provide applications with the MAP_SYNC data integrity model
> without MAP_DIRECT. Indeed, those filesystems would not even be able
> to provide MAP_DIRECT semantics because they aren't backed by pmem.
>
> Hence if applications that want MAP_SYNC are hard coded
> MAP_SYNC|MAP_DIRECT and we make MAP_DIRECT a hard guarantee, then
> those applications are going to fail on a filesystem that provides
> only MAP_SYNC. This is despite the fact the applications would
> function correctly and the data integrity model would be maintained.
> i.e. the failure is because applications have assumed MAP_SYNC can
> only be provided by a DAX implementation, not because MAP_SYNC is
> not supported.
>
> MAP_SYNC really isn't about DAX at all. It's about enabling a data
> integrity model that requires the filesystem to provide userspace
> access to CPU addressable persistent memory.  DAX+MAP_DIRECT is just
> one method of providing this functionality, but it's not the only
> method. Our API needs to be future proof rather than an encoding of
> the existing implementation limitations, otherwise apps will have to
> be re-written as every new MAP_SYNC capable technology comes along.
>
> In summary:
>
>         MAP_DIRECT is an access hint.
>
>         MAP_SYNC provides a data integrity model guarantee.
>
>         MAP_SYNC may imply MAP_DIRECT for specific implementations,
>         but it does not require or guarantee MAP_DIRECT.
>
> Let's compare that with O_DIRECT:
>
>         O_DIRECT in an access hint.
>
>         O_DSYNC provides a data integrity model guarantee.
>
>         O_DSYNC may imply O_DIRECT for specific implementations, but
>         it does not require or guarantee O_DIRECT.
>
> Consistency in access and data integrity models is a good thing. DAX
> and pmem is not an exception. We need to use a model we know works
> and has proven itself over a long period of time.
>
> > I think
> > it still makes some sense as a hint for apps that want to minimize
> > page cache, but for the applications with a flush from userspace model
> > I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This
> > still gives the filesystem the option to inject page-cache at will,
> > but with an application coordination point.
>
> Why make it more complex for applications than it needs to be?

With the clarification that MAP_SYNC implies "cpu cache flush to
persistent memory page-cache *or* dax to persistent memory" I think
all of the concerns are addressed. I was conflating MAP_DIRECT as "no
page cache indirection", but the indirection does not matter if the
page cache itself is persisted.

  reply index

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-02 10:05 Jan Kara
2018-10-02 10:50 ` Michal Hocko
2018-10-02 13:32   ` Jan Kara
2018-10-02 12:10 ` Johannes Thumshirn
2018-10-02 14:20   ` Johannes Thumshirn
2018-10-02 14:45     ` Christoph Hellwig
2018-10-02 15:01       ` Johannes Thumshirn
2018-10-02 15:06         ` Christoph Hellwig
2018-10-04 10:09           ` Johannes Thumshirn
2018-10-05  6:25             ` Christoph Hellwig
2018-10-05  6:35               ` Johannes Thumshirn
2018-10-06  1:17                 ` Dan Williams
2018-10-14 15:47                   ` Dan Williams
2018-10-17 20:01                     ` Dan Williams
2018-10-18 17:43                       ` Jan Kara
2018-10-18 19:10                         ` Dan Williams
2018-10-19  3:01                           ` Dave Chinner
2018-10-02 14:29   ` Jan Kara
2018-10-02 14:37     ` Christoph Hellwig
2018-10-02 14:44       ` Johannes Thumshirn
2018-10-02 14:52         ` Christoph Hellwig
2018-10-02 15:31           ` Jan Kara
2018-10-02 20:18             ` Dan Williams
2018-10-03 12:50               ` Jan Kara
2018-10-03 14:38                 ` Dan Williams
2018-10-03 15:06                   ` Jan Kara
2018-10-03 15:13                     ` Dan Williams
2018-10-03 16:44                       ` Jan Kara
2018-10-03 21:13                         ` Dan Williams
2018-10-04 10:04                         ` Johannes Thumshirn
2018-10-02 15:07       ` Jan Kara
2018-10-17 20:23     ` Jeff Moyer
2018-10-18  0:25       ` Dave Chinner
2018-10-18 14:55         ` Jan Kara
2018-10-19  0:43           ` Dave Chinner
2018-10-30  6:30             ` Dan Williams
2018-10-30 22:49               ` Dave Chinner
2018-10-30 22:59                 ` Dan Williams [this message]
2018-10-31  5:59                 ` y-goto
2018-11-01 23:00                   ` Dave Chinner
2018-11-02  1:43                     ` y-goto
2018-10-18 21:05         ` Jeff Moyer
2018-10-09 19:43 ` Jeff Moyer
2018-10-16  8:25   ` Jan Kara
2018-10-16 12:35     ` Jeff Moyer

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAPcyv4gwBJ65170Gy+EaDLpomA93kxuWTM10mZeaPYV6Rq=SRw@mail.gmail.com' \
    --to=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=david@fromorbit.com \
    --cc=jack@suse.cz \
    --cc=jmoyer@redhat.com \
    --cc=jthumshirn@suse.de \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git