Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
From: "y-goto@fujitsu.com" <y-goto@fujitsu.com>
To: 'Dave Chinner' <david@fromorbit.com>,
	Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>, jmoyer <jmoyer@redhat.com>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Dave Jiang <dave.jiang@intel.com>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Linux MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>
Subject: RE: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps
Date: Wed, 31 Oct 2018 05:59:17 +0000
Message-ID: <TYAPR01MB32619CCA488DD0DA86EDB17E90CD0@TYAPR01MB3261.jpnprd01.prod.outlook.com> (raw)
In-Reply-To: <20181030224904.GT19305@dastard>

Hello,

> On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote:
> > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote:
> > > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> > > > On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > > > > MAP_SYNC
> > > > > > - file system guarantees that metadata required to reach faulted-in file
> > > > > >   data is consistent on media before a write fault is completed.  A
> > > > > >   side-effect is that the page cache will not be used for
> > > > > >   writably-mapped pages.
> > > > >
> > > > > I think you are conflating current implementation with API
> > > > > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > > > > use. The man page definition simply says "supported only for files
> > > > > supporting DAX" and that it provides certain data integrity
> > > > > guarantees. It does not define the implementation.
> ....
> > > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> > > > filesystems) so usually people just won't notice. If fallback for
> > > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.
> > >
> > > Which is just like the situation where O_DIRECT on ext3 was not very
> > > useful, but on other filesystems like XFS it was fully functional.
> > >
> > > IMO, the fact that a specific filesytem has a suboptimal fallback
> > > path for an uncommon behaviour isn't an argument against MAP_DIRECT
> > > as a hint - it's actually a feature. If MAP_DIRECT can't be used
> > > until it's always direct access, then most filesystems wouldn't be
> > > able to provide any faster paths at all. It's much better to have
> > > partial functionality now than it is to never have the functionality
> > > at all, and so we need to design in the flexibility we need to
> > > iteratively improve implementations without needing API changes that
> > > will break applications.
> >
> > The hard guarantee requirement still remains though because an
> > application that expects combined MAP_SYNC|MAP_DIRECT semantics will
> > be surprised if the MAP_DIRECT property silently disappears.
> 
> Why would they be surprised? They won't even notice it if the
> filesystem can provide MAP_SYNC without MAP_DIRECT.
> 
> And that's the whole point.
> 
> MAP_DIRECT is a private mapping state. So is MAP_SYNC. They are not
> visible to the filesystem and the filesystem does nothing to enforce
> them. If someone does something that requires the page cache (e.g.
> calls do_splice_direct()) then that MAP_DIRECT mapping has a whole
> heap of new work to do. And, in some cases, the filesystem may not
> be able to provide MAP_DIRECT as a result..
> 
> IOWs, the filesystem cannot guarantee MAP_DIRECT and the
> circumstances under which MAP_DIRECT will and will not work are
> dynamic. If MAP_DIRECT is supposed to be a guarantee then we'll have
> applications randomly segfaulting in production as things like
> backups, indexers, etc run over the filesystem and do their work.
> 
> This is why MAP_DIRECT needs to be an optimisation, not a
> requirement - things will still work if MAP_DIRECT is not used. What
> matters to these applications is MAP_SYNC - if we break MAP_SYNC,
> then the application data integrity model is violated. That's not an
> acceptible outcome.
> 
> The problem, it seems to me, is that people are unable to separate
> MAP_DIRECT and MAP_SYNC. I suspect that is because, at present,
> MAP_SYNC on XFS and ext4 requires MAP_DIRECT. i.e. we can only
> provide MAP_SYNC functionality on DAX mappings. However, that's a
> /filesystem implementation issue/, not an API guarantee we need to
> provide to userspace.
> 
> If we implement a persistent page cache (e.g. allocate page cache
> pages out of ZONE_DEVICE pmem), then filesystems like XFS and ext4
> could provide applications with the MAP_SYNC data integrity model
> without MAP_DIRECT. Indeed, those filesystems would not even be able
> to provide MAP_DIRECT semantics because they aren't backed by pmem.
> 
> Hence if applications that want MAP_SYNC are hard coded
> MAP_SYNC|MAP_DIRECT and we make MAP_DIRECT a hard guarantee, then
> those applications are going to fail on a filesystem that provides
> only MAP_SYNC. This is despite the fact the applications would
> function correctly and the data integrity model would be maintained.
> i.e. the failure is because applications have assumed MAP_SYNC can
> only be provided by a DAX implementation, not because MAP_SYNC is
> not supported.
> 
> MAP_SYNC really isn't about DAX at all. It's about enabling a data
> integrity model that requires the filesystem to provide userspace
> access to CPU addressable persistent memory.  DAX+MAP_DIRECT is just
> one method of providing this functionality, but it's not the only
> method. Our API needs to be future proof rather than an encoding of
> the existing implementation limitations, otherwise apps will have to
> be re-written as every new MAP_SYNC capable technology comes along.
> 
> In summary:
> 
> 	MAP_DIRECT is an access hint.
> 
> 	MAP_SYNC provides a data integrity model guarantee.
> 
> 	MAP_SYNC may imply MAP_DIRECT for specific implementations,
> 	but it does not require or guarantee MAP_DIRECT.
> 
> Let's compare that with O_DIRECT:
> 
> 	O_DIRECT in an access hint.
> 
> 	O_DSYNC provides a data integrity model guarantee.
> 
> 	O_DSYNC may imply O_DIRECT for specific implementations, but
> 	it does not require or guarantee O_DIRECT.
> 
> Consistency in access and data integrity models is a good thing. DAX
> and pmem is not an exception. We need to use a model we know works
> and has proven itself over a long period of time.

Hmmm, then, I would like to know all of the reasons of breakage of MAP_DIRECT.
(I'm not opposed to your opinion, but I need to know it.)

In O_DIRECT case, in my understanding, the reason of breakage of O_DIRECT is 
"wrong alignment is specified by application", right?

When filesystem can not use O_DIRECT and it uses page cache instead,
then system uses more memory resource than user's expectation.
So, there is a side effect, and it may cause other trouble.
(memory pressure, expected performance can not be gained, and so on ..)
 
In such case its administrator (or technical support engineer) needs to struggle to
investigate what is the reason.
If the reason of the breakage is clear, then it is helpful to find the root cause,
and they can require the developer of wrong application to fix the problem.
"Please fix the alignment!".

So, I would like to know in MAP_DIRECT case, what is the reasons? 
I think it will be helpful for users.
Only splice?

(Maybe such document will be necessary....)

Thanks,

> 
> > I think
> > it still makes some sense as a hint for apps that want to minimize
> > page cache, but for the applications with a flush from userspace model
> > I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This
> > still gives the filesystem the option to inject page-cache at will,
> > but with an application coordination point.
> 
> Why make it more complex for applications than it needs to be?
> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

  parent reply index

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-02 10:05 Jan Kara
2018-10-02 10:50 ` Michal Hocko
2018-10-02 13:32   ` Jan Kara
2018-10-02 12:10 ` Johannes Thumshirn
2018-10-02 14:20   ` Johannes Thumshirn
2018-10-02 14:45     ` Christoph Hellwig
2018-10-02 15:01       ` Johannes Thumshirn
2018-10-02 15:06         ` Christoph Hellwig
2018-10-04 10:09           ` Johannes Thumshirn
2018-10-05  6:25             ` Christoph Hellwig
2018-10-05  6:35               ` Johannes Thumshirn
2018-10-06  1:17                 ` Dan Williams
2018-10-14 15:47                   ` Dan Williams
2018-10-17 20:01                     ` Dan Williams
2018-10-18 17:43                       ` Jan Kara
2018-10-18 19:10                         ` Dan Williams
2018-10-19  3:01                           ` Dave Chinner
2018-10-02 14:29   ` Jan Kara
2018-10-02 14:37     ` Christoph Hellwig
2018-10-02 14:44       ` Johannes Thumshirn
2018-10-02 14:52         ` Christoph Hellwig
2018-10-02 15:31           ` Jan Kara
2018-10-02 20:18             ` Dan Williams
2018-10-03 12:50               ` Jan Kara
2018-10-03 14:38                 ` Dan Williams
2018-10-03 15:06                   ` Jan Kara
2018-10-03 15:13                     ` Dan Williams
2018-10-03 16:44                       ` Jan Kara
2018-10-03 21:13                         ` Dan Williams
2018-10-04 10:04                         ` Johannes Thumshirn
2018-10-02 15:07       ` Jan Kara
2018-10-17 20:23     ` Jeff Moyer
2018-10-18  0:25       ` Dave Chinner
2018-10-18 14:55         ` Jan Kara
2018-10-19  0:43           ` Dave Chinner
2018-10-30  6:30             ` Dan Williams
2018-10-30 22:49               ` Dave Chinner
2018-10-30 22:59                 ` Dan Williams
2018-10-31  5:59                 ` y-goto [this message]
2018-11-01 23:00                   ` Dave Chinner
2018-11-02  1:43                     ` y-goto
2018-10-18 21:05         ` Jeff Moyer
2018-10-09 19:43 ` Jeff Moyer
2018-10-16  8:25   ` Jan Kara
2018-10-16 12:35     ` Jeff Moyer

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=TYAPR01MB32619CCA488DD0DA86EDB17E90CD0@TYAPR01MB3261.jpnprd01.prod.outlook.com \
    --to=y-goto@fujitsu.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=david@fromorbit.com \
    --cc=jack@suse.cz \
    --cc=jmoyer@redhat.com \
    --cc=jthumshirn@suse.de \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org linux-fsdevel@archiver.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox