From: Jan Kara <jack@suse.cz>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Jonathan Halliday <jonathan.halliday@redhat.com>,
	Jeff Moyer <jmoyer@redhat.com>, Christoph Hellwig <hch@lst.de>,
	Dave Chinner <david@fromorbit.com>,
	"Weiny, Ira" <ira.weiny@intel.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	"Theodore Y. Ts'o" <tytso@mit.edu>, Jan Kara <jack@suse.cz>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
Date: Wed, 26 Feb 2020 18:20:34 +0100
Message-ID: <20200226172034.GV10728@quack2.suse.cz>
In-Reply-To: <CAPcyv4iuWpHi-0SK_HS0zmfH87=G64U47VhthhpTjDCw_BMG8A@mail.gmail.com>

On Wed 26-02-20 08:46:42, Dan Williams wrote:
> On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday
> <jonathan.halliday@redhat.com> wrote:
> >
> >
> > Hi All
> >
> > I'm a middleware developer, focused on how Java (JVM) workloads can
> > benefit from app-direct mode pmem. Initially the target is apps that
> > need a fast binary log for fault tolerance: the classic database WAL use
> > case; transaction coordination systems; enterprise message bus
> > persistence and suchlike. Critically, there are cases where we use log
> > based storage, i.e. it's not the strict 'read rarely, only on recovery'
> > model that a classic db may have, but more of an 'append only, read many
> > times' event stream model.
> >
> > Think of the log oriented data storage as having logical segments (let's
> > implement them as files), of which the most recent is being appended to
> > (read_write) and the remaining N-1 older segments are full and sealed,
> > so effectively immutable (read_only) until discarded. The tail segment
> > needs to be in DAX mode for optimal write performance, as the size of
> > the append may be sub-block and we don't want the overhead of the kernel
> > call anyhow. So that's clearly a good fit for putting on a DAX fs mount
> > and using mmap with MAP_SYNC.
> >
> > However, we want fast read access into the segments, to retrieve stored
> > records. The small access index can be built in volatile RAM (assuming
> > we're willing to take the startup overhead of a full file scan at
> > recovery time) but the data itself is big and we don't want to move it
> > all off pmem. Which means the requirements are now different: we want
> > the O/S cache to pull hot data into fast volatile RAM for us, which DAX
> > explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather
> > than app-direct mode, except here we're using the O/S rather than the
> > hardware memory controller to do the cache management for us.
> >
> > Currently this requires closing the full (read_write) file, then copying
> > it to a non-DAX device and reopening it (read_only) there. Clearly
> > that's expensive and rather tedious. Instead, I'd like to close the
> > MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode
> > that will instead go via the O/S cache in the traditional manner. Bonus
> > points if I can do it over non-overlapping ranges in a file without
> > closing the DAX mode mmap, since then the segments are entirely logical
> > instead of needing separate physical files.
> 
> Hi John,
> 
> IIRC we chatted about this at PIRL, right?
> 
> At the time it sounded more like mixed mode dax, i.e. dax writes, but
> cached reads. To me that's an optimization to optionally use dax for
> direct-I/O writes, with its existing set of page-cache coherence
> warts, and not a capability to dynamically switch the dax-mode.
> mmap+MAP_SYNC seems the wrong interface for this. This writeup
> mentions bypassing kernel call overhead, but I don't see how a
> dax-write syscall is cheaper than an mmap syscall plus fault. If
> direct-I/O to a dax capable file bypasses the block layer, isn't that
> about the maximum of kernel overhead that can be cut out of this use
> case? Otherwise MAP_SYNC is a facility to achieve efficient sub-block
> update-in-place writes not append writes.

Well, even for appends you'll pay the fault cost only once per page (or maybe
even once per huge page) when using MAP_SYNC. With a syscall you'll pay once
per write. So although it would be good to check real numbers, the design
isn't nonsensical to me.
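
To make the accounting above concrete, here is a minimal userspace sketch. It
is not taken from the patch series, and every helper name in it
(write_entries, mmap_fault_bound, map_log, append_with_write,
append_with_mmap) is hypothetical. It assumes a preallocated log file, appends
starting at offset 0, and base pages only. The fallback constants are the x86
values from the Linux UAPI headers. The fault count is a simple upper-bound
model, not a measurement.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Older libc headers may lack these flags (x86 values shown). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/* Kernel entries for n appends done with write(2): one syscall per record. */
static size_t write_entries(size_t n)
{
	return n;
}

/*
 * Upper bound on page faults for n appends of rec_size bytes through a
 * mapping: at most one fault per page touched for the first time.
 * Huge pages would lower this further.
 */
static size_t mmap_fault_bound(size_t n, size_t rec_size, size_t page_size)
{
	assert(rec_size > 0 && page_size > 0);
	return (n * rec_size + page_size - 1) / page_size;
}

/*
 * Map len bytes of fd for appending. MAP_SYNC is tried first; it only
 * succeeds on a DAX-capable file, so fall back to a plain shared mapping.
 * *synced reports which one we got.
 */
static void *map_log(int fd, size_t len, int *synced)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

	*synced = (p != MAP_FAILED);
	if (p == MAP_FAILED)
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	return p;
}

/* Append n records with write(2); returns the syscall count, or -1. */
static long append_with_write(int fd, const void *rec, size_t rec_size,
			      size_t n)
{
	long calls = 0;

	for (size_t i = 0; i < n; i++) {
		if (write(fd, rec, rec_size) != (ssize_t)rec_size)
			return -1;
		calls++;
	}
	return calls;
}

/*
 * Append n records through the mapping: plain stores, no syscall per record.
 * Persistence is NOT handled here. With MAP_SYNC the caller still has to
 * flush CPU caches; without it, msync(2) is needed.
 */
static void append_with_mmap(char *base, const void *rec, size_t rec_size,
			     size_t n)
{
	for (size_t i = 0; i < n; i++)
		memcpy(base + i * rec_size, rec, rec_size);
}
```

Under this model, 1000 appends of 64 bytes each cost 1000 write(2) calls but
at most 16 faults with 4096-byte pages. Whether that wins in practice is
exactly the "real numbers" question.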

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Thread overview: 54+ messages
2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
2020-02-21  0:41 ` [PATCH V4 01/13] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
2020-02-21  1:26   ` Dave Chinner
2020-02-27 17:52     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 02/13] fs/xfs: Clarify lockdep dependency for xfs_isilocked() ira.weiny
2020-02-21  1:34   ` Dave Chinner
2020-02-21 23:00     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 03/13] fs: Remove unneeded IS_DAX() check ira.weiny
2020-02-21  1:42   ` Dave Chinner
2020-02-21 23:04     ` Ira Weiny
2020-02-21 17:42   ` Christoph Hellwig
2020-02-21  0:41 ` [PATCH V4 04/13] fs/stat: Define DAX statx attribute ira.weiny
2020-02-21  0:41 ` [PATCH V4 05/13] fs/xfs: Isolate the physical DAX flag from enabled ira.weiny
2020-02-21  0:41 ` [PATCH V4 06/13] fs/xfs: Create function xfs_inode_enable_dax() ira.weiny
2020-02-22  0:28   ` Darrick J. Wong
2020-02-23 15:07     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state ira.weiny
2020-02-21 17:44   ` Christoph Hellwig
2020-02-21 22:44     ` Dave Chinner
2020-02-21 23:26       ` Dan Williams
2020-02-24 17:56       ` Christoph Hellwig
2020-02-25  0:09         ` Dave Chinner
2020-02-25 17:36           ` Christoph Hellwig
2020-02-25 19:37             ` Jeff Moyer
2020-02-26  9:28               ` Jonathan Halliday
2020-02-26 11:31                 ` Jan Kara
2020-02-26 11:56                   ` Jonathan Halliday
2020-02-26 16:10                 ` Ira Weiny
2020-02-26 16:46                 ` Dan Williams
2020-02-26 17:20                   ` Jan Kara [this message]
2020-02-26 17:54                     ` Dan Williams
2020-02-25 21:03             ` Ira Weiny
2020-02-26 11:17           ` Jan Kara
2020-02-26 15:57             ` Ira Weiny
2020-02-22  0:33   ` Darrick J. Wong
2020-02-23 15:03     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 08/13] fs: Prevent DAX state change if file is mmap'ed ira.weiny
2020-02-21  0:41 ` [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer ira.weiny
2020-02-22  0:31   ` Darrick J. Wong
2020-02-23 15:04     ` Ira Weiny
2020-02-24  0:34   ` Dave Chinner
2020-02-24 19:57     ` Ira Weiny
2020-02-24 22:32       ` Dave Chinner
2020-02-25 21:12         ` Ira Weiny
2020-02-25 22:59           ` Dave Chinner
2020-02-26 18:02             ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 10/13] fs/xfs: Clean up locking in dax invalidate ira.weiny
2020-02-21 17:45   ` Christoph Hellwig
2020-02-21 18:06     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 11/13] fs/xfs: Allow toggle of effective DAX flag ira.weiny
2020-02-21  0:41 ` [PATCH V4 12/13] fs/xfs: Remove xfs_diflags_to_linux() ira.weiny
2020-02-21  0:41 ` [PATCH V4 13/13] Documentation/dax: Update Usage section ira.weiny
2020-02-26 22:48 ` [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 Jeff Moyer
2020-02-27  2:43   ` Ira Weiny
