Linux-Fsdevel Archive on lore.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	linux-nvdimm@lists.01.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-api@vger.kernel.org
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps
Date: Fri, 19 Oct 2018 11:43:03 +1100
Message-ID: <20181019004303.GI6311@dastard>
In-Reply-To: <20181018145555.GS23493@quack2.suse.cz>

On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > MAP_SYNC
> > > - file system guarantees that metadata required to reach faulted-in file
> > >   data is consistent on media before a write fault is completed.  A
> > >   side-effect is that the page cache will not be used for
> > >   writably-mapped pages.
> > 
> > I think you are conflating current implementation with API
> > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > use. The man page definition simply says "supported only for files
> > supporting DAX" and that it provides certain data integrity
> > guarantees. It does not define the implementation.
> > 
> > We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
> > because that's the only way we can currently provide the required
> > behaviour to userspace. However, if a filesystem can use the page
> > cache to provide the required functionality, then it's free to do
> > so.
> > 
> > i.e. if someone implements a pmem-based page cache, MAP_SYNC data
> > integrity could be provided /without DAX/ by any filesystem using
> > that persistent page cache. i.e. MAP_SYNC really only requires
> > mmap() of CPU addressable persistent memory - it does not require
> > DAX. Right now, however, the only way to get this functionality is
> > through a DAX capable filesystem on DAX capable storage.
> > 
> > And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
> > COWs new pages in pmem and attaches them to a special per-inode cache
> > on clean->dirty transition. Then on data sync, background writeback
> > or crash recovery, it migrates them from the cache into the file map
> > proper via atomic metadata pointer swaps.
> > 
> > IOWs, NOVA provides the correct MAP_SYNC semantics by using a
> > separate persistent per-inode write cache to provide the correct
> > crash recovery semantics for MAP_SYNC.
> 
> Correct. NOVA would be able to provide MAP_SYNC semantics without DAX. But
> effectively it will also be able to provide MAP_DIRECT semantics, right?

Yes, I think so. It still needs to do COW on first write fault,
but then the app has direct access to the data buffer until it is
cleaned and put back in place. The "put back in place" is just an
atomic swap of metadata pointers, so it doesn't need the page cache
at all...
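
To make the flow described above concrete, here is a minimal toy model in
Python. It is a sketch under stated assumptions, not NOVA's implementation:
the class CowFile, its methods and the dict-based "file map" are hypothetical
names, and ordinary Python objects stand in for pmem pages. It shows the copy
on the clean->dirty transition, direct writes to that copy afterwards, and a
pointer swap on sync that leaves an earlier snapshot untouched.

```python
class CowFile:
    """Toy model of COW on first write fault (hypothetical, not NOVA code)."""

    def __init__(self, blocks):
        self.file_map = {i: bytearray(b) for i, b in enumerate(blocks)}
        self.write_cache = {}  # per-inode cache of dirty copies

    def write_fault(self, idx):
        # Clean->dirty transition: copy the block once, then hand the
        # same buffer to the application for all later writes.
        if idx not in self.write_cache:
            self.write_cache[idx] = bytearray(self.file_map[idx])
        return self.write_cache[idx]

    def committed(self, idx):
        # What the file map proper currently points at.
        return bytes(self.file_map[idx])

    def snapshot(self):
        # A snapshot shares block references; no data is copied.
        return dict(self.file_map)

    def sync(self):
        # "Put back in place": swap the pointer in the file map to the
        # dirty copy. The old block is never modified.
        for idx in list(self.write_cache):
            self.file_map[idx] = self.write_cache.pop(idx)


f = CowFile([b"aaaa", b"bbbb"])
snap = f.snapshot()
buf = f.write_fault(0)
buf[:2] = b"XY"              # direct write into the dirty copy
before = f.committed(0)      # file map still points at the old block
f.sync()
after = f.committed(0)
print(before, after, bytes(snap[0]))  # prints b'aaaa' b'XYaa' b'aaaa'
```

Because committed blocks are only ever replaced, never rewritten, the
snapshot keeps seeing the old data without any extra copying.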

> Because there won't be DRAM between app and persistent storage and I don't
> think COW tricks or other data integrity methods are that interesting for
> the application.

Not for the application, but the filesystem still wants to support
snapshots and other such functionality that requires COW. And NOVA
doesn't have write-in-place functionality at all - it always COWs
on the clean->dirty transition.

> Most users of O_DIRECT are concerned about getting close
> to media speed performance and low DRAM usage...

*nod*

> > > and what I think Dan had proposed:
> > > 
> > > mmap flag, MAP_DIRECT
> > > - file system guarantees that page cache will not be used to front storage.
> > >   storage MUST be directly addressable.  This *almost* implies MAP_SYNC.
> > >   The subtle difference is that a write fault /may/ not result in metadata
> > >   being written back to media.
> > 
> > Similar to O_DIRECT, these semantics do not allow userspace apps to
> > replace msync/fsync with CPU cache flush operations. So any
> > application that uses this mode still needs to use either MAP_SYNC
> > or issue msync/fsync for data integrity.
> > 
> > If the app is using MAP_DIRECT, then what do we do if the filesystem
> > can't provide the required semantics for that specific operation? In
> > the case of O_DIRECT, we fall back to buffered IO because it has the
> > same data integrity semantics as O_DIRECT and will always work. It's
> > just slower and consumes more memory, but the app continues to work
> > just fine.
> > 
> > Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
> > without using the pagecache seems extremely problematic to me.  e.g.
> > an app already has an open MAP_DIRECT file, and a third party
> > reflinks it or dedupes it and the fs has to fall back to buffered IO
> > to do COW operations. This isn't the app's fault - the kernel should
> > just fall back transparently to using the page cache for the
> > MAP_DIRECT app and just keep working, just like it would if it was
> > using O_DIRECT read/write.
> 
> There's another option of failing reflink / dedupe with EBUSY if the file
> is mapped with MAP_DIRECT and the filesystem cannot support reflink &
> MAP_DIRECT together. But there are downsides to that as well.

Yup, not the least that setting MAP_DIRECT can race with a
reflink....

> > The point I'm trying to make here is that O_DIRECT is a /hint/, not
> > a guarantee, and it's done that way to prevent applications from
> > being presented with transient, potentially fatal error cases
> > because a filesystem implementation can't do a specific operation
> > through the direct IO path.
> > 
> > IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a
> > guarantee. Over time we'll end up with filesystems that can
> > guarantee that MAP_DIRECT is always going to use DAX, in the same
> > way we have filesystems that guarantee O_DIRECT will always be
> > O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee
> > no page cache will ever be used, then we are basically saying
> > "filesystems won't provide MAP_DIRECT even in common, useful cases
> > because they can't provide MAP_DIRECT in all cases." And that
> > doesn't seem like a very good solution to me.
> 
> These are good points. I'm just somewhat wary of the situation where users
> will map files with MAP_DIRECT and then the machine starts thrashing
> because the file got reflinked and thus pagecache gets used suddenly.

It's still better than apps randomly getting SIGBUS.

FWIW, this suggests that we probably need to be able to host both
DAX pages and page cache pages on the same file at the same time,
and be able to handle page faults based on the type of page being
mapped (different sets of fault ops for different page types?)
and have fallback paths when the page type needs to be changed
between direct and cached during the fault....
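
One way to picture that idea is a dispatch table keyed by page type. The
sketch below is purely illustrative: PageType, NeedsPageCache, FAULT_OPS,
handle_fault and the "shared_blocks" rule are all hypothetical names invented
for this example, and nothing here corresponds to existing kernel code. It
only shows a fault handler that picks its ops by page type and converts a
page from direct to cached when the direct path cannot service the fault.

```python
from enum import Enum


class PageType(Enum):
    DAX = "dax"        # mapped straight from persistent memory
    CACHED = "cached"  # backed by a page cache page in DRAM


class NeedsPageCache(Exception):
    """Raised by the direct path when it cannot service a fault."""


def dax_fault(state, idx, write):
    # Assumed rule for this sketch: a write to a block shared through
    # reflink/dedupe cannot be done in place, so the direct path gives up.
    if write and idx in state["shared_blocks"]:
        raise NeedsPageCache(idx)
    return "direct"


def cached_fault(state, idx, write):
    # The page cache path is assumed to always work, only slower.
    return "page cache"


FAULT_OPS = {PageType.DAX: dax_fault, PageType.CACHED: cached_fault}


def handle_fault(state, idx, write):
    page_type = state["page_types"].get(idx, PageType.DAX)
    try:
        return FAULT_OPS[page_type](state, idx, write)
    except NeedsPageCache:
        # Fallback path: change the page type and retry, rather than
        # failing the fault.
        state["page_types"][idx] = PageType.CACHED
        return FAULT_OPS[PageType.CACHED](state, idx, write)


state = {"page_types": {}, "shared_blocks": {1}}
results = [
    handle_fault(state, 0, write=True),   # unshared block: stays direct
    handle_fault(state, 1, write=False),  # read of shared block: direct
    handle_fault(state, 1, write=True),   # write to shared block: falls back
    handle_fault(state, 1, write=False),  # page is now cached
]
print(results)
```

In this model only the one page that needed COW changes type; the rest of
the file keeps using the direct path.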

> With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> filesystems) so usually people just won't notice. If fallback for
> MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.

Which is just like the situation where O_DIRECT on ext3 was not very
useful, but on other filesystems like XFS it was fully functional.

IMO, the fact that a specific filesystem has a suboptimal fallback
path for an uncommon behaviour isn't an argument against MAP_DIRECT
as a hint - it's actually a feature. If MAP_DIRECT can't be used
until it's always direct access, then most filesystems wouldn't be
able to provide any faster paths at all. It's much better to have
partial functionality now than it is to never have the functionality
at all, and so we need to design in the flexibility we need to
iteratively improve implementations without needing API changes that
will break applications.
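
MAP_DIRECT is only a proposal in this thread, so the sketch below uses the
existing MAP_SYNC flag to show what "try the fast path, keep working
otherwise" can look like from the application side. Assumptions: the numeric
values of MAP_SHARED_VALIDATE and MAP_SYNC are copied from Linux's generic
uapi headers (Python's mmap module does not export them, and a few
architectures may use other values), and map_file, store and demo are
hypothetical helper names. Durability always goes through msync() here,
which is the path the quoted text says remains required without MAP_SYNC.

```python
import mmap
import os
import tempfile

# Assumed values from Linux's generic uapi mman headers (x86-64 and most
# other architectures). Python's mmap module does not export them.
MAP_SHARED_VALIDATE = 0x03
MAP_SYNC = 0x080000

PROT_RW = mmap.PROT_READ | mmap.PROT_WRITE


def map_file(fd, length):
    """Return (mapping, got_map_sync); fall back to plain MAP_SHARED."""
    try:
        m = mmap.mmap(fd, length, flags=MAP_SHARED_VALIDATE | MAP_SYNC,
                      prot=PROT_RW)
        return m, True
    except (OSError, ValueError):
        # Typically EOPNOTSUPP: the file is not on DAX capable storage.
        m = mmap.mmap(fd, length, flags=mmap.MAP_SHARED, prot=PROT_RW)
        return m, False


def store(m, data):
    """Write data at offset 0 and make it durable with msync()."""
    m[:len(data)] = data
    m.flush()  # msync(); valid with or without MAP_SYNC


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data")
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            os.ftruncate(fd, 4096)
            m, got_sync = map_file(fd, 4096)
            store(m, b"hello")
            m.close()
        finally:
            os.close(fd)
        with open(path, "rb") as f:
            return f.read(5), got_sync


data, got_sync = demo()
print(data, "MAP_SYNC" if got_sync else "MAP_SHARED + msync")
```

The caller never sees an error when the stronger mapping is unavailable; it
only learns which path it got, which is the behaviour argued for above.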

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
