Linux-Fsdevel Archive on lore.kernel.org
From: Dan Williams <dan.j.williams@intel.com>
To: david <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>, jmoyer <jmoyer@redhat.com>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Dave Jiang <dave.jiang@intel.com>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Linux MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps
Date: Mon, 29 Oct 2018 23:30:41 -0700
Message-ID: <CAPcyv4ixoAh7HEMfm+B4sRDx1Qwm6SHGjtQ+5r3EKsxreRydrA@mail.gmail.com> (raw)
In-Reply-To: <20181019004303.GI6311@dastard>

On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> > On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > > MAP_SYNC
> > > > - file system guarantees that metadata required to reach faulted-in file
> > > >   data is consistent on media before a write fault is completed.  A
> > > >   side-effect is that the page cache will not be used for
> > > >   writably-mapped pages.
> > >
> > > I think you are conflating current implementation with API
> > > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > > use. The man page definition simply says "supported only for files
> > > supporting DAX" and that it provides certain data integrity
> > > guarantees. It does not define the implementation.
> > >
> > > We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
> > > because that's the only way we can currently provide the required
> > > behaviour to userspace. However, if a filesystem can use the page
> > > cache to provide the required functionality, then it's free to do
> > > so.
> > >
> > > i.e. if someone implements a pmem-based page cache, MAP_SYNC data
> > > integrity could be provided /without DAX/ by any filesystem using
> > > that persistent page cache. i.e. MAP_SYNC really only requires
> > > mmap() of CPU addressable persistent memory - it does not require
> > > DAX. Right now, however, the only way to get this functionality is
> > > through a DAX capable filesystem on dax capable storage.
> > >
> > > And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
> > > COWs new pages in pmem and attaches them to a special per-inode cache
> > > on clean->dirty transition. Then on data sync, background writeback
> > > or crash recovery, it migrates them from the cache into the file map
> > > proper via atomic metadata pointer swaps.
> > >
> > > IOWs, NOVA provides the correct MAP_SYNC semantics by using a
> > > separate persistent per-inode write cache to provide the correct
> > > crash recovery semantics for MAP_SYNC.
> >
> > Correct. NOVA would be able to provide MAP_SYNC semantics without DAX. But
> > effectively it will be also able to provide MAP_DIRECT semantics, right?
>
> Yes, I think so. It still needs to do COW on first write fault,
> but then the app has direct access to the data buffer until it is
> cleaned and put back in place. The "put back in place" is just an
> atomic swap of metadata pointers, so it doesn't need the page cache
> at all...
>
> > Because there won't be DRAM between app and persistent storage and I don't
> > think COW tricks or other data integrity methods are that interesting for
> > the application.
>
> Not for the application, but the filesystem still wants to support
> snapshots and other such functionality that requires COW. And NOVA
> doesn't have write-in-place functionality at all - it always COWs
> on the clean->dirty transition.
>
> > Most users of O_DIRECT are concerned about getting close
> > to media speed performance and low DRAM usage...
>
> *nod*
>
> > > > and what I think Dan had proposed:
> > > >
> > > > mmap flag, MAP_DIRECT
> > > > - file system guarantees that page cache will not be used to front storage.
> > > >   storage MUST be directly addressable.  This *almost* implies MAP_SYNC.
> > > >   The subtle difference is that a write fault /may/ not result in metadata
> > > >   being written back to media.
> > >
> > > Similar to O_DIRECT, these semantics do not allow userspace apps to
> > > replace msync/fsync with CPU cache flush operations. So any
> > > application that uses this mode still needs to use either MAP_SYNC
> > > or issue msync/fsync for data integrity.
> > >
> > > If the app is using MAP_DIRECT, then what do we do if the filesystem
> > > can't provide the required semantics for that specific operation? In
> > > the case of O_DIRECT, we fall back to buffered IO because it has the
> > > same data integrity semantics as O_DIRECT and will always work. It's
> > > just slower and consumes more memory, but the app continues to work
> > > just fine.
> > >
> > > Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
> > > without using the pagecache seems extremely problematic to me.  e.g.
> > > an app already has an open MAP_DIRECT file, and a third party
> > > reflinks it or dedupes it and the fs has to fall back to buffered IO
> > > to do COW operations. This isn't the app's fault - the kernel should
> > > just fall back transparently to using the page cache for the
> > > MAP_DIRECT app and just keep working, just like it would if it was
> > > using O_DIRECT read/write.
> >
> > There's another option of failing reflink / dedupe with EBUSY if the file
> > is mapped with MAP_DIRECT and the filesystem cannot support reflink &
> > MAP_DIRECT together. But there are downsides to that as well.
>
> Yup, not the least that setting MAP_DIRECT can race with a
> reflink....
>
> > > The point I'm trying to make here is that O_DIRECT is a /hint/, not
> > > a guarantee, and it's done that way to prevent applications from
> > > being presented with transient, potentially fatal error cases
> > > because a filesystem implementation can't do a specific operation
> > > through the direct IO path.
> > >
> > > IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a
> > > guarantee. Over time we'll end up with filesystems that can
> > > guarantee that MAP_DIRECT is always going to use DAX, in the same
> > > way we have filesystems that guarantee O_DIRECT will always be
> > > O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee
> > > no page cache will ever be used, then we are basically saying
> > > "filesystems won't provide MAP_DIRECT even in common, useful cases
> > > because they can't provide MAP_DIRECT in all cases." And that
> > > doesn't seem like a very good solution to me.
> >
> > These are good points. I'm just somewhat wary of the situation where users
> > will map files with MAP_DIRECT and then the machine starts thrashing
> > because the file got reflinked and thus pagecache gets used suddenly.
>
> It's still better than apps randomly getting SIGBUS.
>
> FWIW, this suggests that we probably need to be able to host both
> DAX pages and page cache pages on the same file at the same time,
> and be able to handle page faults based on the type of page being
> mapped (different sets of fault ops for different page types?)
> and have fallback paths when the page type needs to be changed
> between direct and cached during the fault....
>
> > With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> > filesystems) so usually people just won't notice. If fallback for
> > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.
>
> Which is just like the situation where O_DIRECT on ext3 was not very
> useful, but on other filesystems like XFS it was fully functional.
>
> IMO, the fact that a specific filesystem has a suboptimal fallback
> path for an uncommon behaviour isn't an argument against MAP_DIRECT
> as a hint - it's actually a feature. If MAP_DIRECT can't be used
> until it's always direct access, then most filesystems wouldn't be
> able to provide any faster paths at all. It's much better to have
> partial functionality now than it is to never have the functionality
> at all, and so we need to design in the flexibility we need to
> iteratively improve implementations without needing API changes that
> will break applications.

The hard guarantee requirement still remains though because an
application that expects combined MAP_SYNC|MAP_DIRECT semantics will
be surprised if the MAP_DIRECT property silently disappears. I think
it still makes some sense as a hint for apps that want to minimize
page cache, but for the applications with a flush from userspace model
I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This
still gives the filesystem the option to inject page-cache at will,
but with an application coordination point.
