From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jeff Moyer <jmoyer@redhat.com>, Jan Kara <jack@suse.cz>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	linux-nvdimm@lists.01.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-api@vger.kernel.org
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps
Date: Thu, 18 Oct 2018 16:55:55 +0200
Message-ID: <20181018145555.GS23493@quack2.suse.cz>
In-Reply-To: <20181018002510.GC6311@dastard>

On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > MAP_SYNC
> > - file system guarantees that metadata required to reach faulted-in file
> >   data is consistent on media before a write fault is completed.  A
> >   side-effect is that the page cache will not be used for
> >   writably-mapped pages.
> 
> I think you are conflating current implementation with API
> requirements - MAP_SYNC doesn't guarantee anything about page cache
> use. The man page definition simply says "supported only for files
> supporting DAX" and that it provides certain data integrity
> guarantees. It does not define the implementation.
> 
> We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
> because that's the only way we can currently provide the required
> behaviour to userspace. However, if a filesystem can use the page
> cache to provide the required functionality, then it's free to do
> so.
> 
> i.e. if someone implements a pmem-based page cache, MAP_SYNC data
> integrity could be provided /without DAX/ by any filesystem using
> that persistent page cache. i.e. MAP_SYNC really only requires
> mmap() of CPU addressable persistent memory - it does not require
> DAX. Right now, however, the only way to get this functionality is
> through a DAX capable filesystem on dax capable storage.
> 
> And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
> COWs new pages in pmem and attaches them to a special per-inode cache
> on clean->dirty transition. Then on data sync, background writeback
> or crash recovery, it migrates them from the cache into the file map
> proper via atomic metadata pointer swaps.
> 
> IOWs, NOVA provides the correct MAP_SYNC semantics by using a
> separate persistent per-inode write cache to provide the correct
> crash recovery semantics for MAP_SYNC.

Correct. NOVA would be able to provide MAP_SYNC semantics without DAX. But
effectively it will also be able to provide MAP_DIRECT semantics, right?
Because there won't be DRAM between app and persistent storage and I don't
think COW tricks or other data integrity methods are that interesting for
the application. Most users of O_DIRECT are concerned about getting close
to media speed performance and low DRAM usage...
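
For reference, here is a minimal userspace sketch of how an application can
request MAP_SYNC as it exists today and keep working when the filesystem
cannot provide it. The struct and function names (pmem_map,
map_file_sync_or_fallback, persist_range, demo_roundtrip) are made up for
illustration, and the fallback #defines assume the asm-generic flag values
(as on x86-64). It only covers MAP_SYNC; MAP_DIRECT is a proposal and has no
flag to pass.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers. Assumption: asm-generic values. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/* Hypothetical bookkeeping for one mapping. */
struct pmem_map {
    void  *addr;
    size_t len;
    int    synchronous; /* 1 if the kernel accepted MAP_SYNC */
};

/*
 * Try MAP_SYNC first. MAP_SHARED_VALIDATE makes the kernel reject flags it
 * does not support instead of silently ignoring them. If the request is
 * refused, fall back to a plain shared mapping.
 */
static int map_file_sync_or_fallback(int fd, size_t len, struct pmem_map *out)
{
    assert(out != NULL && len > 0);

    int sync = 1;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED) {
        /* EOPNOTSUPP: no MAP_SYNC for this file; EINVAL: older kernel. */
        if (errno != EOPNOTSUPP && errno != EINVAL)
            return -1;
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return -1;
        sync = 0;
    }
    out->addr = p;
    out->len = len;
    out->synchronous = sync;
    return 0;
}

/*
 * msync() is correct for both kinds of mapping. When MAP_SYNC was granted,
 * an application could instead flush CPU caches from userspace; that part
 * is architecture-specific and not shown here.
 */
static int persist_range(const struct pmem_map *m)
{
    return msync(m->addr, m->len, MS_SYNC);
}

/* Map a scratch file, write to it and make the write durable. */
static int demo_roundtrip(void)
{
    char path[] = "/tmp/mapsync-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    const size_t len = 4096;
    struct pmem_map m;
    int rc = -1;
    if (ftruncate(fd, (off_t)len) == 0 &&
        map_file_sync_or_fallback(fd, len, &m) == 0) {
        memcpy(m.addr, "hello", 6);
        rc = persist_range(&m);
        munmap(m.addr, m.len);
    }
    close(fd);
    return rc;
}
```

On a file without DAX the first mmap() is expected to fail and the sketch
ends up on the page-cache path, so the caller should check m.synchronous
before relying on cache flushes alone.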

> > and what I think Dan had proposed:
> > 
> > mmap flag, MAP_DIRECT
> > - file system guarantees that page cache will not be used to front storage.
> >   storage MUST be directly addressable.  This *almost* implies MAP_SYNC.
> >   The subtle difference is that a write fault /may/ not result in metadata
> >   being written back to media.
> 
> Similar to O_DIRECT, these semantics do not allow userspace apps to
> replace msync/fsync with CPU cache flush operations. So any
> application that uses this mode still needs to use either MAP_SYNC
> or issue msync/fsync for data integrity.
> 
> If the app is using MAP_DIRECT, then what do we do if the filesystem
> can't provide the required semantics for that specific operation? In
> the case of O_DIRECT, we fall back to buffered IO because it has the
> same data integrity semantics as O_DIRECT and will always work. It's
> just slower and consumes more memory, but the app continues to work
> just fine.
> 
> Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
> without using the pagecache seems extremely problematic to me.  e.g.
> an app already has an open MAP_DIRECT file, and a third party
> reflinks it or dedupes it and the fs has to fall back to buffered IO
> to do COW operations. This isn't the app's fault - the kernel should
> just fall back transparently to using the page cache for the
> MAP_DIRECT app and just keep working, just like it would if it was
> using O_DIRECT read/write.

There's another option of failing reflink / dedupe with EBUSY if the file
is mapped with MAP_DIRECT and the filesystem cannot support reflink &
MAP_DIRECT together. But there are downsides to that as well.

> The point I'm trying to make here is that O_DIRECT is a /hint/, not
> a guarantee, and it's done that way to prevent applications from
> being presented with transient, potentially fatal error cases
> because a filesystem implementation can't do a specific operation
> through the direct IO path.
> 
> IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a
> guarantee. Over time we'll end up with filesystems that can
> guarantee that MAP_DIRECT is always going to use DAX, in the same
> way we have filesystems that guarantee O_DIRECT will always be
> O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee
> no page cache will ever be used, then we are basically saying
> "filesystems won't provide MAP_DIRECT even in common, useful cases
> because they can't provide MAP_DIRECT in all cases." And that
> doesn't seem like a very good solution to me.

These are good points. I'm just somewhat wary of the situation where users
will map files with MAP_DIRECT and then the machine starts thrashing
because the file got reflinked and thus pagecache gets used suddenly.
With O_DIRECT the fallback to buffered IO is quite rare (at least for major
filesystems) so usually people just won't notice. If fallback for
MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


