linux-fsdevel.vger.kernel.org archive mirror
From: Jonathan Halliday <jonathan.halliday@redhat.com>
To: Jeff Moyer <jmoyer@redhat.com>, Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>,
	ira.weiny@intel.com, linux-kernel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Dan Williams <dan.j.williams@intel.com>,
	"Theodore Y. Ts'o" <tytso@mit.edu>, Jan Kara <jack@suse.cz>,
	linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
Date: Wed, 26 Feb 2020 09:28:57 +0000
Message-ID: <a126276c-d252-6050-b6ee-4d6448d45fac@redhat.com>
In-Reply-To: <x49fteyh313.fsf@segfault.boston.devel.redhat.com>


Hi All

I'm a middleware developer, focused on how Java (JVM) workloads can 
benefit from app-direct mode pmem. Initially the target is apps that 
need a fast binary log for fault tolerance: the classic database WAL use 
case; transaction coordination systems; enterprise message bus 
persistence and suchlike. Critically, there are cases where we use log 
based storage, i.e. it's not the strict 'read rarely, only on recovery' 
model that a classic db may have, but more of an 'append only, read many 
times' event stream model.

Think of the log oriented data storage as having logical segments (let's 
implement them as files), of which the most recent is being appended to 
(read_write) and the remaining N-1 older segments are full and sealed, 
so effectively immutable (read_only) until discarded. The tail segment 
needs to be in DAX mode for optimal write performance, as the size of 
the append may be sub-block and we don't want the overhead of the kernel 
call anyhow. So that's clearly a good fit for putting on a DAX fs mount 
and using mmap with MAP_SYNC.
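
A minimal sketch of that tail-segment mapping, assuming the segment file 
is already open read/write and already at least `len` bytes long. The 
helper name `map_segment_sync` is hypothetical; the fallback flag values 
are the ones from the Linux UAPI headers, for libc headers that predate 
them.

```c
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for older libc headers (values from the Linux UAPI headers). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * map_segment_sync() is a hypothetical helper, not a library call.
 * Maps the first `len` bytes of `fd` read/write with MAP_SYNC.
 * Returns the mapping, or NULL with errno set by mmap(); errno is
 * EOPNOTSUPP when the file does not support MAP_SYNC (e.g. it is not
 * on a DAX mount).
 */
void *map_segment_sync(int fd, size_t len)
{
    /* MAP_SYNC is only honoured together with MAP_SHARED_VALIDATE. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

A caller would append records by writing into the returned region and 
release it with munmap() once the segment is sealed.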

However, we want fast read access into the segments, to retrieve stored 
records. The small access index can be built in volatile RAM (assuming 
we're willing to take the startup overhead of a full file scan at 
recovery time) but the data itself is big and we don't want to move it 
all off pmem. Which means the requirements are now different: we want 
the O/S cache to pull hot data into fast volatile RAM for us, which DAX 
explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather 
than app-direct mode, except here we're using the O/S rather than the 
hardware memory controller to do the cache management for us.

Currently this requires closing the full (read_write) file, then copying 
it to a non-DAX device and reopening it (read_only) there. Clearly 
that's expensive and rather tedious. Instead, I'd like to close the 
MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode 
that will instead go via the O/S cache in the traditional manner. Bonus 
points if I can do it over non-overlapping ranges in a file without 
closing the DAX mode mmap, since then the segments are entirely logical 
instead of needing separate physical files.
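
For illustration, the current copy-and-reopen workaround might look 
roughly like the sketch below. The helper name `copy_and_map_readonly` 
is hypothetical, and `dst` is assumed to live on a non-DAX filesystem so 
that reads go through the page cache. The copy loop is the part whose 
cost grows with the segment size.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * copy_and_map_readonly() is a hypothetical helper. Copies the sealed
 * segment at `src` to `dst`, then maps the copy read-only. Stores the
 * mapped length in *len_out and returns the mapping, or NULL on error.
 */
void *copy_and_map_readonly(const char *src, const char *dst, size_t *len_out)
{
    char buf[65536];
    struct stat st;
    void *map = NULL;
    void *p;
    ssize_t n;
    int in, out;

    in = open(src, O_RDONLY);
    if (in < 0)
        return NULL;
    out = open(dst, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (out < 0) {
        close(in);
        return NULL;
    }

    /* The expensive part: every byte of the segment is copied. */
    while ((n = read(in, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0)
                goto done;
            off += w;
        }
    }
    if (n < 0)
        goto done;

    if (fstat(out, &st) != 0 || st.st_size == 0)
        goto done;

    /* Ordinary shared, read-only mapping of the copy. */
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, out, 0);
    if (p != MAP_FAILED) {
        map = p;
        *len_out = (size_t)st.st_size;
    }
done:
    close(in);
    close(out);
    return map;
}
```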

I note a comment below regarding a per-directory setting, but don't have 
the background to fully understand what's being suggested. However, I'll 
note here that I can live with a per-directory granularity, as relinking 
a file into a new dir is a constant time operation, whilst the move 
described above isn't. So if a per-directory granularity is easier than 
a per-file one that's fine, though as a person with only passing 
knowledge of filesystem design I don't see how having multiple links to 
a file can work cleanly in that case.
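
As a small sketch of the relinking step, assuming both directories are 
on the same filesystem: rename() only rewrites directory entries, so its 
cost does not depend on the segment size. The helper name 
`move_sealed_segment` is hypothetical. Whether a moved file would 
actually pick up the new directory's DAX setting depends on how a 
per-directory flag ends up being defined, which is not settled here.

```c
#include <stdio.h>

/*
 * move_sealed_segment() is a hypothetical helper. Moves a sealed
 * segment into another directory on the same filesystem. No file data
 * is copied; rename() fails with EXDEV if the paths are on different
 * filesystems. Returns 0 on success, -1 with errno set on failure.
 */
int move_sealed_segment(const char *old_path, const char *new_path)
{
    return rename(old_path, new_path);
}
```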

Hope that helps.

Jonathan

P.S. I'll cheekily take the opportunity of having your attention to tack 
on one minor gripe about the current system: The only way to know if a 
mmap with MAP_SYNC will work is to try it and catch the error. Which 
would be reasonable if it were free of side effects.  However, the 
process requires first expanding the file to at least the size of the 
desired map, which is done non-atomically, i.e. it is user visible. There 
are thus nasty race conditions in the cleanup, where after a failed mmap 
attempt (e.g. the device doesn't support DAX), we try to shrink the file 
back to its original size, but something else has already opened it at 
its new, larger size. This is not theoretical: I got caught by it whilst 
adapting some of our middleware to use pmem.  Therefore, some way to 
probe the file path for its capability would be nice, much the same as I 
can e.g. inspect file permissions to (more or less) evaluate if I can 
write it without actually mutating it.  Thanks!
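
The try-and-roll-back sequence described above can be sketched as 
follows. The helper name `try_map_sync` is hypothetical, and the 
comments mark where the side effect and the race window sit; this is an 
illustration of the problem, not a fix for it.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older libc headers (values from the Linux UAPI headers). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * try_map_sync() is a hypothetical helper. Grows the file to `len`
 * bytes if needed, attempts a MAP_SYNC mapping, and on failure tries
 * to restore the original size. Returns the mapping, or NULL with
 * errno describing the first failure.
 */
void *try_map_sync(const char *path, size_t len)
{
    struct stat st;
    void *p = MAP_FAILED;
    int grew = 0;
    int saved;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0)
        goto out;

    if ((size_t)st.st_size < len) {
        /* Side effect: the larger size is visible to every other
         * process from this point on. */
        if (ftruncate(fd, (off_t)len) != 0)
            goto out;
        grew = 1;
    }

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
out:
    saved = errno;
    if (p == MAP_FAILED && grew) {
        /* Race window: another process may already have opened or
         * mapped the file at the larger size, and this shrink then
         * truncates it underneath that process. Best effort only. */
        int rc = ftruncate(fd, st.st_size);
        (void)rc;
    }
    close(fd);
    errno = saved;
    return p == MAP_FAILED ? NULL : p;
}
```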



On 25/02/2020 19:37, Jeff Moyer wrote:
> Christoph Hellwig <hch@lst.de> writes:
> 
>> And my point is that if we ensure S_DAX can only be checked if there
>> are no blocks on the file, it is fairly easy to provide the same
>> guarantee.  And I've not heard any argument that we really need more
>> flexibility than that.  In fact I think just being able to change it
>> on the parent directory and inheriting the flag might be more than
>> plenty, which would lead to a very simple implementation without any
>> of the crazy overhead in this series.
> 
> I know of one user who had at least mentioned it to me, so I cc'd him.
> Jonathan, can you describe your use case for being able to change a
> file between dax and non-dax modes?  Or, if I'm misremembering, just
> correct me?
> 
> Thanks!
> Jeff
> 


