From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
From: Ross Zwisler
Subject: [PATCH 0/2] DAX bdev fixes - move flushing calls to FS
Date: Sun, 7 Feb 2016 00:19:11 -0700
Message-Id: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com>
Sender: owner-linux-mm@kvack.org
To: linux-kernel@vger.kernel.org
Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
 Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, Matthew Wilcox,
 linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com
List-ID:

The first patch in the series just adds a bdev argument to
dax_clear_blocks(), and should be relatively straightforward.

The second patch is slightly more controversial. During testing of raw
block devices + DAX I noticed that the struct block_device that we were
using for DAX operations was incorrect. For the fault handlers, etc. we
can just get the correct bdev via get_block(), which is passed in as a
function pointer, but for the flushing code we don't have access to
get_block(). This is also an issue for XFS real-time devices, whenever
we get those working.

In short, somehow we need to get dax_writeback_mapping_range() a valid
bdev. Right now it is called via filemap_write_and_wait_range(), which
can provide neither the bdev nor a get_block() function pointer. So, our
options seem to be:

  a) Move the calls to dax_writeback_mapping_range() into the
     filesystems. This is implemented by patch 2 in this series.

  b) Keep the calls to dax_writeback_mapping_range() in the mm code, and
     provide a generic way to ask a filesystem for an inode's bdev. I did
     a version of this using a superblock operation here:

     https://lkml.org/lkml/2016/2/2/941

It has been noted that we may need to expand the coverage of our DAX
flushing code to include support for the sync() and syncfs() userspace
calls. This is still under discussion, but if we do end up needing to
add support for sync(), I don't think that it is v4.5 material for the
reasons stated here:

https://lkml.org/lkml/2016/2/4/962

I think that for v4.5 we either need patch 2 of this series, or the
get_bdev() patch listed for solution b) above.

Ross Zwisler (2):
  dax: pass bdev argument to dax_clear_blocks()
  dax: move writeback calls into the filesystems

 fs/block_dev.c         |  7 +++++++
 fs/dax.c               |  9 ++++-----
 fs/ext2/file.c         | 10 ++++++++++
 fs/ext2/inode.c        |  5 +++--
 fs/ext4/fsync.c        | 10 +++++++++-
 fs/xfs/xfs_aops.c      |  2 +-
 fs/xfs/xfs_aops.h      |  1 +
 fs/xfs/xfs_bmap_util.c |  4 +++-
 fs/xfs/xfs_file.c      | 12 ++++++++++--
 include/linux/dax.h    |  7 ++++---
 mm/filemap.c           |  6 ------
 11 files changed, 52 insertions(+), 21 deletions(-)

--
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Tue, 9 Feb 2016 11:06:47 -0700
From: Ross Zwisler
Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems
Message-ID: <20160209180647.GA2450@linux.intel.com>
References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com>
 <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com>
 <20160207215047.GJ31407@dastard>
 <20160208201808.GK27429@dastard>
 <20160209094353.GF9451@quack.suse.cz>
 <20160209160134.GA12245@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160209160134.GA12245@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
To: Jan Kara
Cc: Dan Williams, Dave Chinner, Ross Zwisler,
 "linux-kernel@vger.kernel.org", Theodore Ts'o, Alexander Viro,
 Andreas Dilger, Andrew Morton, Jan Kara, Matthew Wilcox, linux-ext4,
 linux-fsdevel, Linux MM, "linux-nvdimm@lists.01.org", XFS Developers,
 jmoyer
List-ID:

On Tue, Feb 09, 2016 at 05:01:34PM
+0100, Jan Kara wrote:
> On Tue 09-02-16 10:43:53, Jan Kara wrote:
> > On Mon 08-02-16 12:55:24, Dan Williams wrote:
> > > On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner wrote:
> > > [..]
> > > >> Setting aside the current block zeroing problem you seem to be assuming
> > > >> that DAX will always be faster and that may not be true at a media
> > > >> level. Waiting years for some applications to determine if DAX makes
> > > >> sense for their use case seems completely reasonable. In the meantime
> > > >> the apps that are already making these changes want to know that a DAX
> > > >> mapping request has not silently dropped back to page cache. They
> > > >> also want to know if they successfully jumped through all the hoops to
> > > >> get a larger than pte mapping.
> > > >>
> > > >> I agree it is useful to be able to force DAX on an unmodified
> > > >> application to see what happens, and it follows that if those
> > > >> applications want to run in that mode they will need functional
> > > >> fsync()...
> > > >>
> > > >> I would feel better if we were talking about specific applications and
> > > >> performance numbers to know if forcing DAX on application is a debug
> > > >> facility or a production level capability. You seem to have already
> > > >> made that determination and I'm curious what I'm missing.
> > > >
> > > > I'm not setting any policy here at all. This whole argument is
> > > > based around the DAX mount option doing "global fs enable or
> > > > silently turning it off" and the application not knowing about that.
> > > >
> > > > The whole point of having a persistent per-inode DAX flags is that
> > > > it is a policy mechanism, not a policy. The application can, if it
> > > > is DAX aware, directly control whether DAX is used on a file or not.
> > > > The application can even query and clear that persistent inode flag
> > > > if it is configured not to (or cannot) use DAX.
> > > >
> > > > If the filesystem cannot support DAX, then we can error out attempts
> > > > to set the DAX flag and then the app knows DAX is not available.
> > > > i.e. the attempt to set policy failed. If the flag is set, then the
> > > > inode will *always* use DAX - there is no "fall back to page cache"
> > > > when DAX is enabled.
> > > >
> > > > If the application is not DAX aware, then the admin can control the
> > > > DAX policy by manipulating these flags themselves, and hence control
> > > > whether DAX is used by the application or not.
> > > >
> > > > If you think I'm dictating policy for DAX users and application,
> > > > then you haven't understood anything I've previously said about why
> > > > the DAX mount option needs to die before any of this is considered
> > > > production ready. DAX is not an opaque "all or nothing" option. XFS
> > > > will provide apps and admins with fine-grained, persistent,
> > > > discoverable policy flags to allow admins and applications to set
> > > > DAX policies however they see fit. This simply cannot be done if the
> > > > only knob you have is a mount option that may or may not stick.
> > >
> > > I agree the mount option needs to die, and I fully grok the reasoning.
> > > What I'm concerned with is that a system using fully-DAX-aware
> > > applications is forced to incur the overhead of maintaining *sync
> > > semantics, periodic sync(2) in particular, even if it is not relying
> > > on those semantics.
> >
> > Let me somewhat correct this: IMO hard requirement is maintaining sync(2)
> > semantics. Periodic writeback does not have any hard durability guarantees
> > and we are free to ignore such requests in ->writepages() (that function
> > has enough information in the writeback_control structure to differentiate
> > between periodic writeback and data integrity sync) if we decide it is
> > useful. Actually, we could do that even for 4.5.
>
> Attached is a version of Ross' patch that will work for sync(2) and
> fsync(2) and we won't flush caches during periodic writeback. The patch is
> only compile-tested. Ross?

This looks great. I'll send out a v2 with this and with the
dax_clear_sectors() changes after I'm done testing.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Tue, 9 Feb 2016 17:01:34 +0100
From: Jan Kara
Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems
Message-ID: <20160209160134.GA12245@quack.suse.cz>
References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com>
 <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com>
 <20160207215047.GJ31407@dastard>
 <20160208201808.GK27429@dastard>
 <20160209094353.GF9451@quack.suse.cz>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="LQksG6bCIzRHxTLp"
Content-Disposition: inline
In-Reply-To: <20160209094353.GF9451@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
To: Dan Williams
Cc: Dave Chinner, Ross Zwisler, "linux-kernel@vger.kernel.org",
 Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara,
 Matthew Wilcox, linux-ext4, linux-fsdevel, Linux MM,
 "linux-nvdimm@lists.01.org", XFS Developers, jmoyer
List-ID:

--LQksG6bCIzRHxTLp
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue 09-02-16 10:43:53, Jan Kara wrote:
> On Mon 08-02-16 12:55:24, Dan Williams wrote:
> > On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner wrote:
> > [..]
> > >> Setting aside the current block zeroing problem you seem to be assuming
> > >> that DAX will always be faster and that may not be true at a media
> > >> level. Waiting years for some applications to determine if DAX makes
> > >> sense for their use case seems completely reasonable. In the meantime
> > >> the apps that are already making these changes want to know that a DAX
> > >> mapping request has not silently dropped back to page cache. They
> > >> also want to know if they successfully jumped through all the hoops to
> > >> get a larger than pte mapping.
> > >>
> > >> I agree it is useful to be able to force DAX on an unmodified
> > >> application to see what happens, and it follows that if those
> > >> applications want to run in that mode they will need functional
> > >> fsync()...
> > >>
> > >> I would feel better if we were talking about specific applications and
> > >> performance numbers to know if forcing DAX on application is a debug
> > >> facility or a production level capability. You seem to have already
> > >> made that determination and I'm curious what I'm missing.
> > >
> > > I'm not setting any policy here at all. This whole argument is
> > > based around the DAX mount option doing "global fs enable or
> > > silently turning it off" and the application not knowing about that.
> > >
> > > The whole point of having a persistent per-inode DAX flags is that
> > > it is a policy mechanism, not a policy. The application can, if it
> > > is DAX aware, directly control whether DAX is used on a file or not.
> > > The application can even query and clear that persistent inode flag
> > > if it is configured not to (or cannot) use DAX.
> > >
> > > If the filesystem cannot support DAX, then we can error out attempts
> > > to set the DAX flag and then the app knows DAX is not available.
> > > i.e. the attempt to set policy failed. If the flag is set, then the
> > > inode will *always* use DAX - there is no "fall back to page cache"
> > > when DAX is enabled.
> > >
> > > If the application is not DAX aware, then the admin can control the
> > > DAX policy by manipulating these flags themselves, and hence control
> > > whether DAX is used by the application or not.
> > >
> > > If you think I'm dictating policy for DAX users and application,
> > > then you haven't understood anything I've previously said about why
> > > the DAX mount option needs to die before any of this is considered
> > > production ready. DAX is not an opaque "all or nothing" option. XFS
> > > will provide apps and admins with fine-grained, persistent,
> > > discoverable policy flags to allow admins and applications to set
> > > DAX policies however they see fit. This simply cannot be done if the
> > > only knob you have is a mount option that may or may not stick.
> >
> > I agree the mount option needs to die, and I fully grok the reasoning.
> > What I'm concerned with is that a system using fully-DAX-aware
> > applications is forced to incur the overhead of maintaining *sync
> > semantics, periodic sync(2) in particular, even if it is not relying
> > on those semantics.
>
> Let me somewhat correct this: IMO hard requirement is maintaining sync(2)
> semantics. Periodic writeback does not have any hard durability guarantees
> and we are free to ignore such requests in ->writepages() (that function
> has enough information in the writeback_control structure to differentiate
> between periodic writeback and data integrity sync) if we decide it is
> useful. Actually, we could do that even for 4.5.

Attached is a version of Ross' patch that will work for sync(2) and
fsync(2) and we won't flush caches during periodic writeback. The patch is
only compile-tested. Ross?

Honza
--
Jan Kara
SUSE Labs, CR

--LQksG6bCIzRHxTLp
Content-Type: text/x-patch; charset=us-ascii
Content-Disposition: attachment; filename="0001-dax-move-writeback-calls-into-the-filesystems.patch"

>>From f7280a34d235031c5dbf3f5a345c4b64e452f097 Mon Sep 17 00:00:00 2001
From: Ross Zwisler
Date: Sun, 7 Feb 2016 00:19:13 -0700
Subject: [PATCH] dax: move writeback calls into the filesystems

Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
dax_writeback_mapping_range() needs a struct block_device, and it used to
get that from inode->i_sb->s_bdev. This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device. This also fixes DAX code to properly flush caches in response to
sync(2).

Signed-off-by: Ross Zwisler
Signed-off-by: Jan Kara
---
 fs/block_dev.c      | 13 ++++++++++++-
 fs/dax.c            | 12 +++++++-----
 fs/ext2/inode.c     |  8 ++++++++
 fs/ext4/fsync.c     |  1 -
 fs/ext4/inode.c     |  4 ++++
 fs/xfs/xfs_aops.c   |  5 +++++
 include/linux/dax.h |  7 +++++--
 mm/filemap.c        | 12 ++++--------
 8 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 39b3a174a425..271d38aa6cbb 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1693,13 +1693,24 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
 	return try_to_free_buffers(page);
 }
 
+static int blkdev_writepages(struct address_space *mapping,
+			     struct writeback_control *wbc)
+{
+	if (dax_mapping(mapping)) {
+		struct block_device *bdev = I_BDEV(mapping->host);
+
+		return dax_writeback_mapping_range(mapping, bdev, wbc);
+	}
+	return generic_writepages(mapping, wbc);
+}
+
 static const struct address_space_operations def_blk_aops = {
 	.readpage	= blkdev_readpage,
 	.readpages	= blkdev_readpages,
 	.writepage	= blkdev_writepage,
 	.write_begin	= blkdev_write_begin,
 	.write_end	= blkdev_write_end,
-	.writepages	= generic_writepages,
+	.writepages	= blkdev_writepages,
 	.releasepage	= blkdev_releasepage,
 	.direct_IO	= blkdev_direct_IO,
 	.is_dirty_writeback = buffer_check_dirty_writeback,
diff --git a/fs/dax.c b/fs/dax.c
index fc2e3141138b..2f4965214783 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -485,11 +485,10 @@ static int dax_writeback_one(struct block_device *bdev,
  * end]. This is required by data integrity operations to ensure file data is
  * on persistent storage prior to completion of the operation.
  */
-int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
-		loff_t end)
+int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	struct block_device *bdev = inode->i_sb->s_bdev;
 	pgoff_t start_index, end_index, pmd_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
@@ -500,8 +499,11 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
 
-	start_index = start >> PAGE_CACHE_SHIFT;
-	end_index = end >> PAGE_CACHE_SHIFT;
+	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
+		return 0;
+
+	start_index = wbc->range_start >> PAGE_CACHE_SHIFT;
+	end_index = wbc->range_end >> PAGE_CACHE_SHIFT;
 	pmd_index = DAX_PMD_INDEX(start_index);
 
 	rcu_read_lock();
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 338eefda70c6..ee05e945f40c 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -874,6 +874,14 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 static int
 ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
+#ifdef CONFIG_FS_DAX
+	if (dax_mapping(mapping)) {
+		return dax_writeback_mapping_range(mapping,
+						   mapping->host->i_sb->s_bdev,
+						   wbc);
+	}
+#endif
+
 	return mpage_writepages(mapping, wbc, ext2_get_block);
 }
 
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 8850254136ae..b7136227d0f8 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -83,7 +83,6 @@ static int ext4_sync_parent(struct inode *inode)
  * What we do is just kick off a commit and wait on it. This will snapshot the
  * inode to disk.
  */
-
 int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 {
 	struct inode *inode = file->f_mapping->host;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 83bc8bfb3bea..19989c12187a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2450,6 +2450,10 @@ static int ext4_writepages(struct address_space *mapping,
 
 	trace_ext4_writepages(inode, wbc);
 
+	if (dax_mapping(mapping))
+		return dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
+						   wbc);
+
 	/*
 	 * No pages to write? This is mainly a kludge to avoid starting
 	 * a transaction for special inodes like journal inode on last iput()
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 379c089fb051..fd0839278442 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1208,6 +1208,11 @@ xfs_vm_writepages(
 	struct writeback_control *wbc)
 {
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
+	if (dax_mapping(mapping)) {
+		return dax_writeback_mapping_range(mapping,
+				xfs_find_bdev_for_inode(mapping->host), wbc);
+	}
+
 	return generic_writepages(mapping, wbc);
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 818e45078929..05d7d043d3bd 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -52,6 +52,9 @@ static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
-int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
-		loff_t end);
+
+struct writeback_control;
+
+int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, struct writeback_control *wbc);
 #endif
diff --git a/mm/filemap.c b/mm/filemap.c
index bc943867d68c..af3eec1a8c5e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -446,7 +446,8 @@ int filemap_write_and_wait(struct address_space *mapping)
 {
 	int err = 0;
 
-	if (mapping->nrpages) {
+	if ((!dax_mapping(mapping) && mapping->nrpages) ||
+	    (dax_mapping(mapping) && mapping->nrexceptional)) {
 		err = filemap_fdatawrite(mapping);
 		/*
 		 * Even if the above returned error, the pages may be
@@ -482,13 +483,8 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 {
 	int err = 0;
 
-	if (dax_mapping(mapping) && mapping->nrexceptional) {
-		err = dax_writeback_mapping_range(mapping, lstart, lend);
-		if (err)
-			return err;
-	}
-
-	if (mapping->nrpages) {
+	if ((!dax_mapping(mapping) && mapping->nrpages) ||
+	    (dax_mapping(mapping) && mapping->nrexceptional)) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
 		/* See comment of filemap_write_and_wait() */
--
2.6.2

--LQksG6bCIzRHxTLp--

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Tue, 9 Feb 2016 10:43:53 +0100
From: Jan Kara
Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems
Message-ID: <20160209094353.GF9451@quack.suse.cz>
References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com>
 <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com>
 <20160207215047.GJ31407@dastard>
 <20160208201808.GK27429@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
To: Dan Williams
Cc: Dave Chinner, Ross Zwisler, "linux-kernel@vger.kernel.org",
 Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara,
 Matthew Wilcox, linux-ext4, linux-fsdevel, Linux MM,
 "linux-nvdimm@lists.01.org", XFS Developers, jmoyer
List-ID:

On Mon 08-02-16 12:55:24, Dan Williams wrote:
> On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner wrote:
> [..]
> >> Setting aside the current block zeroing problem you seem to be assuming
> >> that DAX will always be faster and that may not be true at a media
> >> level. Waiting years for some applications to determine if DAX makes
> >> sense for their use case seems completely reasonable. In the meantime
> >> the apps that are already making these changes want to know that a DAX
> >> mapping request has not silently dropped back to page cache. They
> >> also want to know if they successfully jumped through all the hoops to
> >> get a larger than pte mapping.
> >>
> >> I agree it is useful to be able to force DAX on an unmodified
> >> application to see what happens, and it follows that if those
> >> applications want to run in that mode they will need functional
> >> fsync()...
> >>
> >> I would feel better if we were talking about specific applications and
> >> performance numbers to know if forcing DAX on application is a debug
> >> facility or a production level capability. You seem to have already
> >> made that determination and I'm curious what I'm missing.
> >
> > I'm not setting any policy here at all. This whole argument is
> > based around the DAX mount option doing "global fs enable or
> > silently turning it off" and the application not knowing about that.
> >
> > The whole point of having a persistent per-inode DAX flags is that
> > it is a policy mechanism, not a policy. The application can, if it
> > is DAX aware, directly control whether DAX is used on a file or not.
> > The application can even query and clear that persistent inode flag
> > if it is configured not to (or cannot) use DAX.
> >
> > If the filesystem cannot support DAX, then we can error out attempts
> > to set the DAX flag and then the app knows DAX is not available.
> > i.e. the attempt to set policy failed. If the flag is set, then the
> > inode will *always* use DAX - there is no "fall back to page cache"
> > when DAX is enabled.
> >
> > If the application is not DAX aware, then the admin can control the
> > DAX policy by manipulating these flags themselves, and hence control
> > whether DAX is used by the application or not.
> >
> > If you think I'm dictating policy for DAX users and application,
> > then you haven't understood anything I've previously said about why
> > the DAX mount option needs to die before any of this is considered
> > production ready. DAX is not an opaque "all or nothing" option. XFS
> > will provide apps and admins with fine-grained, persistent,
> > discoverable policy flags to allow admins and applications to set
> > DAX policies however they see fit. This simply cannot be done if the
> > only knob you have is a mount option that may or may not stick.
>
> I agree the mount option needs to die, and I fully grok the reasoning.
> What I'm concerned with is that a system using fully-DAX-aware
> applications is forced to incur the overhead of maintaining *sync
> semantics, periodic sync(2) in particular, even if it is not relying
> on those semantics.

Let me somewhat correct this: IMO hard requirement is maintaining sync(2)
semantics. Periodic writeback does not have any hard durability guarantees
and we are free to ignore such requests in ->writepages() (that function
has enough information in the writeback_control structure to differentiate
between periodic writeback and data integrity sync) if we decide it is
useful. Actually, we could do that even for 4.5.

Honza
--
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
MIME-Version: 1.0
In-Reply-To:
References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com>
 <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com>
 <20160207215047.GJ31407@dastard>
 <20160208201808.GK27429@dastard>
Date: Mon, 8 Feb 2016 14:05:34 -0800
Message-ID:
Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems
From: Dan Williams
Content-Type: text/plain; charset=UTF-8
Sender: owner-linux-mm@kvack.org
To: Jeff Moyer
Cc: Dave Chinner, Ross Zwisler, "linux-kernel@vger.kernel.org",
 Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara,
 Matthew Wilcox, linux-ext4, linux-fsdevel, Linux MM,
 "linux-nvdimm@lists.01.org", XFS Developers
List-ID:

On Mon, Feb 8, 2016 at 12:58 PM, Jeff Moyer wrote:
> Dan Williams writes:
>
>> I agree the mount option needs to die, and I fully grok the reasoning.
>> What I'm concerned with is that a system using fully-DAX-aware
>> applications is forced to incur the overhead of maintaining *sync
>> semantics, periodic sync(2) in particular, even if it is not relying
>> on those semantics.
>>
>> However, like I said in my other mail, we can solve that with
>> alternate interfaces to persistent memory if that becomes an issue and
>> not require that "disable *sync" capability to come through DAX.
>
> What do you envision these alternate interfaces looking like?

Well, plan-A was making DAX be explicit opt-in for applications, I
haven't thought too much about plan-B. I expect it to be driven by
real performance numbers and application use cases once the *sync
compat work completes.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
MIME-Version: 1.0
In-Reply-To: <20160208201808.GK27429@dastard>
References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com>
 <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com>
 <20160207215047.GJ31407@dastard>
 <20160208201808.GK27429@dastard>
Date: Mon, 8 Feb 2016 12:55:24 -0800
Message-ID:
Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems
From: Dan Williams
Content-Type: text/plain; charset=UTF-8
Sender: owner-linux-mm@kvack.org
To: Dave Chinner
Cc: Ross Zwisler, "linux-kernel@vger.kernel.org", Theodore Ts'o,
 Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara, Matthew Wilcox,
 linux-ext4, linux-fsdevel, Linux MM, "linux-nvdimm@lists.01.org",
 XFS Developers, jmoyer
List-ID:

On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner wrote:
[..]
>> Setting aside the current block zeroing problem you seem to be assuming
>> that DAX will always be faster and that may not be true at a media
>> level. Waiting years for some applications to determine if DAX makes
>> sense for their use case seems completely reasonable. In the meantime
>> the apps that are already making these changes want to know that a DAX
>> mapping request has not silently dropped back to page cache. They
>> also want to know if they successfully jumped through all the hoops to
>> get a larger than pte mapping.
>>
>> I agree it is useful to be able to force DAX on an unmodified
>> application to see what happens, and it follows that if those
>> applications want to run in that mode they will need functional
>> fsync()...
>>
>> I would feel better if we were talking about specific applications and
>> performance numbers to know if forcing DAX on application is a debug
>> facility or a production level capability. You seem to have already
>> made that determination and I'm curious what I'm missing.
>
> I'm not setting any policy here at all. This whole argument is
> based around the DAX mount option doing "global fs enable or
> silently turning it off" and the application not knowing about that.
>
> The whole point of having a persistent per-inode DAX flags is that
> it is a policy mechanism, not a policy. The application can, if it
> is DAX aware, directly control whether DAX is used on a file or not.
> The application can even query and clear that persistent inode flag
> if it is configured not to (or cannot) use DAX.
>
> If the filesystem cannot support DAX, then we can error out attempts
> to set the DAX flag and then the app knows DAX is not available.
> i.e. the attempt to set policy failed. If the flag is set, then the
> inode will *always* use DAX - there is no "fall back to page cache"
> when DAX is enabled.
>
> If the application is not DAX aware, then the admin can control the
> DAX policy by manipulating these flags themselves, and hence control
> whether DAX is used by the application or not.
>
> If you think I'm dictating policy for DAX users and application,
> then you haven't understood anything I've previously said about why
> the DAX mount option needs to die before any of this is considered
> production ready. DAX is not an opaque "all or nothing" option. XFS
> will provide apps and admins with fine-grained, persistent,
> discoverable policy flags to allow admins and applications to set
> DAX policies however they see fit. This simply cannot be done if the
> only knob you have is a mount option that may or may not stick.

I agree the mount option needs to die, and I fully grok the reasoning.
What I'm concerned with is that a system using fully-DAX-aware
applications is forced to incur the overhead of maintaining *sync
semantics, periodic sync(2) in particular, even if it is not relying
on those semantics.

However, like I said in my other mail, we can solve that with
alternate interfaces to persistent memory if that becomes an issue and
not require that "disable *sync" capability to come through DAX.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Tue, 9 Feb 2016 07:18:08 +1100
From: Dave Chinner
Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems
Message-ID: <20160208201808.GK27429@dastard>
References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com>
 <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com>
 <20160207215047.GJ31407@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
To: Dan Williams
Cc: Ross Zwisler, "linux-kernel@vger.kernel.org", Theodore Ts'o,
 Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara, Matthew Wilcox,
 linux-ext4, linux-fsdevel, Linux MM, "linux-nvdimm@lists.01.org",
 XFS Developers, jmoyer
List-ID:

On Mon, Feb 08, 2016 at 12:18:11AM -0800, Dan Williams wrote:
> On Sun, Feb 7, 2016 at 1:50 PM, Dave Chinner wrote:
> > On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote:
> >> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler wrote:
> >> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> >> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> >> > dax_writeback_mapping_range() needs a struct block_device, and it used to
> >> > get that from inode->i_sb->s_bdev. This is correct for normal inodes
> >> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> >> > block devices and for XFS real-time files.
> >> > > >> > Instead, call dax_writeback_mapping_range() directly from the filesystem or > >> > raw block device fsync/msync code so that they can supply us with a valid > >> > block device. > >> > > >> > It should be noted that this will reduce the number of calls to > >> > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > >> > called in the various filesystems for operations other than just > >> > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > >> > of ->fsync for hole punch, truncate, and block relocation > >> > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). > >> > > >> > I don't believe that these extra flushes are necessary in the DAX case. In > >> > the page cache case when we have dirty data in the page cache, that data > >> > will be actively lost if we evict a dirty page cache page without flushing > >> > it to media first. For DAX, though, the data will remain consistent with > >> > the physical address to which it was written regardless of whether it's in > >> > the processor cache or not - really the only reason I see to flush is in > >> > response to a fsync or msync so that our data is durable on media in case > >> > of a power loss. The case where we could throw dirty data out of the page > >> > cache and essentially lose writes simply doesn't exist. > >> > > >> > Signed-off-by: Ross Zwisler > >> > --- > >> > fs/block_dev.c | 7 +++++++ > >> > fs/dax.c | 5 ++--- > >> > fs/ext2/file.c | 10 ++++++++++ > >> > fs/ext4/fsync.c | 10 +++++++++- > >> > fs/xfs/xfs_file.c | 12 ++++++++++-- > >> > include/linux/dax.h | 4 ++-- > >> > mm/filemap.c | 6 ------ > >> > 7 files changed, 40 insertions(+), 14 deletions(-) > >> > >> This sprinkling of dax specific fixups outside of vm_operations_struct > >> routines still has me thinking that we are going in the wrong > >> direction for fsync/msync support. 
> >> > >> If an application is both unaware of DAX and doing mmap I/O it is > >> better served by the page cache where writeback is durable by default. > >> We expect DAX-aware applications to assume responsibility for cpu > >> cache management [1]. Making DAX mmap semantics explicit opt-in > >> solves not only durability support, but also the current problem that > >> DAX gets silently disabled leaving an app to wonder if it really got a > >> direct mapping. DAX also silently picks pud, pmd, or pte mappings > >> which is information an application would really like to know at map > >> time. > >> > >> The proposal: make applications explicitly request DAX semantics with > >> a new MAP_DAX flag and fail if DAX is unavailable. > > > > No. > > > > As I've stated before, the entire purpose of enabling DAX through > > existing filesystems like XFS and ext4 is so that existing > > applications work with DAX *without modification*. > > > > That is, applications can be entirely unaware of the fact that the > > filesystem is giving them direct access to the storage because the > > access and failure semantics of DAX enabled mmap are *identical to > > the existing mmap semantics*. > > > > Given this, the app doesn't need to care whether DAX is enabled or > > not; all that will be seen is a difference in speed of access. > > Enabling and disabling DAX is, at this point, purely an > > administration decision - if the hardware and filesystem supports > > it, it can be turned on without having to wait years for application > > developers to add support for it.... > > Setting aside the current block zeroing problem you seem to be assuming > that DAX will always be faster and that may not be true at a media > level. Waiting years for some applications to determine if DAX makes > sense for their use case seems completely reasonable. In the meantime > the apps that are already making these changes want to know that a DAX > mapping request has not silently dropped back to page cache.
They > also want to know if they successfully jumped through all the hoops to > get a larger than pte mapping. > > I agree it is useful to be able to force DAX on an unmodified > application to see what happens, and it follows that if those > applications want to run in that mode they will need functional > fsync()... > > I would feel better if we were talking about specific applications and > performance numbers to know if forcing DAX on an application is a debug > facility or a production level capability. You seem to have already > made that determination and I'm curious what I'm missing. I'm not setting any policy here at all. This whole argument is based around the DAX mount option doing "global fs enable or silently turning it off" and the application not knowing about that. The whole point of having persistent per-inode DAX flags is that it is a policy mechanism, not a policy. The application can, if it is DAX aware, directly control whether DAX is used on a file or not. The application can even query and clear that persistent inode flag if it is configured not to (or cannot) use DAX. If the filesystem cannot support DAX, then we can error out attempts to set the DAX flag and then the app knows DAX is not available. i.e. the attempt to set policy failed. If the flag is set, then the inode will *always* use DAX - there is no "fall back to page cache" when DAX is enabled. If the application is not DAX aware, then the admin can control the DAX policy by manipulating these flags themselves, and hence control whether DAX is used by the application or not. If you think I'm dictating policy for DAX users and applications, then you haven't understood anything I've previously said about why the DAX mount option needs to die before any of this is considered production ready. DAX is not an opaque "all or nothing" option.
XFS will provide apps and admins with fine-grained, persistent, discoverable policy flags to allow admins and applications to set DAX policies however they see fit. This simply cannot be done if the only knob you have is a mount option that may or may not stick. Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <20160208183112.GF2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160208183112.GF2343@linux.intel.com> Date: Mon, 8 Feb 2016 11:23:56 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer List-ID: On Mon, Feb 8, 2016 at 10:31 AM, Ross Zwisler wrote: > On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote: >> The proposal: make applications explicitly request DAX semantics with >> a new MAP_DAX flag and fail if DAX is unavailable. Document that a >> successful MAP_DAX request mandates that the application assumes >> responsibility for cpu cache management. > >> Require that all applications that mmap the file agree on MAP_DAX. > > I think this proposal could run into issues with aliasing. For example, say > you have two threads accessing the same region, and one wants to use DAX and > the other wants to use the page cache. What happens?
> > If we satisfy both requests, we end up with one user reading and writing to > the page cache, while the other is reading and writing directly to the media. > They can't see each other's changes, and you get data corruption. > > If we satisfy the request of whoever asked first, sort of lock the inode into > that mode, and then return an error to the second thread because they are > asking for the other mode, we have now introduced a new weird failure case > where mmaps can randomly fail based on the behavior of other applications. > I think this is where you were going with the last line quoted above, but I > don't understand how it would work in an acceptable way. > > It seems like we have to have the decision about whether or not to use DAX > made in the same way for all users of the inode so that we don't run into > these types of conflicts. We haven't solved the conflict problem by pushing it out to the inode, see the recent revert of blkdev_daxset(). We're heading in a direction where an application can't develop its own policies about DAX usage; it's always an administrative decision. However, maybe that is ok. Dave is right that if an application is using an existing filesystem it should get all the existing semantics. If the existing semantics (or overhead of maintaining the existing semantics) turn out not to fit a given pmem-aware application then we may just need new interfaces (separate from fs/dax.c) to persistent memory. I admit we're a ways off from knowing if that is needed.
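The aliasing concern quoted above can be made concrete with a toy user-space model. This is only a sketch under stated assumptions, not kernel code: `struct toy_file` and every `toy_` function are hypothetical names, and writeback is assumed to copy the whole cached region back to the media. It shows a page-cache user and a direct-access user of the same file missing each other's writes, and the direct write being lost at writeback.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_SIZE 16

/* Hypothetical model of one file with two access paths. */
struct toy_file {
	char media[TOY_SIZE];   /* contents of persistent memory */
	char cache[TOY_SIZE];   /* page-cache copy of the same range */
	int cache_valid;        /* cache has been filled from media */
	int cache_dirty;        /* cache holds writes not yet on media */
};

/* Page-cache path: populate the cache from media on first touch. */
static void toy_cache_fill(struct toy_file *f)
{
	if (!f->cache_valid) {
		memcpy(f->cache, f->media, TOY_SIZE);
		f->cache_valid = 1;
	}
}

void toy_cached_write(struct toy_file *f, size_t off, char c)
{
	assert(off < TOY_SIZE);
	toy_cache_fill(f);
	f->cache[off] = c;
	f->cache_dirty = 1;
}

char toy_cached_read(struct toy_file *f, size_t off)
{
	assert(off < TOY_SIZE);
	toy_cache_fill(f);
	return f->cache[off];
}

/* Direct-access path: loads and stores go straight to the media. */
void toy_dax_write(struct toy_file *f, size_t off, char c)
{
	assert(off < TOY_SIZE);
	f->media[off] = c;
}

char toy_dax_read(const struct toy_file *f, size_t off)
{
	assert(off < TOY_SIZE);
	return f->media[off];
}

/* Writeback copies the whole cached range over the media. */
void toy_writeback(struct toy_file *f)
{
	if (f->cache_dirty) {
		memcpy(f->media, f->cache, TOY_SIZE);
		f->cache_dirty = 0;
	}
}
```

In this model a byte stored with `toy_dax_write()` after the cache was filled is invisible to `toy_cached_read()`, and a later `toy_writeback()` overwrites it with the stale cached value, which is the kind of corruption that motivates making the DAX decision once per inode.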
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 8 Feb 2016 11:31:12 -0700 From: Ross Zwisler Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160208183112.GF2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer List-ID: On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote: > The proposal: make applications explicitly request DAX semantics with > a new MAP_DAX flag and fail if DAX is unavailable. Document that a > successful MAP_DAX request mandates that the application assumes > responsibility for cpu cache management. > Require that all applications that mmap the file agree on MAP_DAX. I think this proposal could run into issues with aliasing. For example, say you have two threads accessing the same region, and one wants to use DAX and the other wants to use the page cache. What happens? If we satisfy both requests, we end up with one user reading and writing to the page cache, while the other is reading and writing directly to the media. They can't see each other's changes, and you get data corruption. If we satisfy the request of whoever asked first, sort of lock the inode into that mode, and then return an error to the second thread because they are asking for the other mode, we have now introduced a new weird failure case where mmaps can randomly fail based on the behavior of other applications.
I think this is where you were going with the last line quoted above, but I don't understand how it would work in an acceptable way. It seems like we have to have the decision about whether or not to use DAX made in the same way for all users of the inode so that we don't run into these types of conflicts. > This also solves > the future problem of DAX support on virtually tagged cache > architectures where it is difficult for the kernel to know what alias > addresses need flushing. > > [1]: https://github.com/pmem/nvml From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 8 Feb 2016 09:12:11 -0700 From: Ross Zwisler Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160208161211.GE2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160208104849.GB9451@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160208104849.GB9451@quack.suse.cz> Sender: owner-linux-mm@kvack.org To: Jan Kara Cc: Ross Zwisler , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com List-ID: On Mon, Feb 08, 2016 at 11:48:50AM +0100, Jan Kara wrote: > On Sun 07-02-16 00:19:13, Ross Zwisler wrote: > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > get that from inode->i_sb->s_bdev.
This is correct for normal inodes > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time files. > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem or > > raw block device fsync/msync code so that they can supply us with a valid > > block device. > > > > It should be noted that this will reduce the number of calls to > > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > > called in the various filesystems for operations other than just > > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > > of ->fsync for hole punch, truncate, and block relocation > > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). > > > > I don't believe that these extra flushes are necessary in the DAX case. In > > the page cache case when we have dirty data in the page cache, that data > > will be actively lost if we evict a dirty page cache page without flushing > > it to media first. For DAX, though, the data will remain consistent with > > the physical address to which it was written regardless of whether it's in > > the processor cache or not - really the only reason I see to flush is in > > response to a fsync or msync so that our data is durable on media in case > > of a power loss. The case where we could throw dirty data out of the page > > cache and essentially lose writes simply doesn't exist. > > You should at least note that sync(2) won't make data durable with this > patch in the changelog. Dave and Christoph have told you that Linux users > depend on sync(2) to make data durable and I fully agree with them. Given > current options, I think we can live with this for 4.5 but long term this > is IMO unacceptable. > > Honza I agree. I'll add a note to the changelog and will work on adding support for sync(2). 
> > > > Signed-off-by: Ross Zwisler > > --- > > fs/block_dev.c | 7 +++++++ > > fs/dax.c | 5 ++--- > > fs/ext2/file.c | 10 ++++++++++ > > fs/ext4/fsync.c | 10 +++++++++- > > fs/xfs/xfs_file.c | 12 ++++++++++-- > > include/linux/dax.h | 4 ++-- > > mm/filemap.c | 6 ------ > > 7 files changed, 40 insertions(+), 14 deletions(-) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index fa0507a..312ad44 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -356,8 +356,15 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync) > > { > > struct inode *bd_inode = bdev_file_inode(filp); > > struct block_device *bdev = I_BDEV(bd_inode); > > + struct address_space *mapping = bd_inode->i_mapping; > > int error; > > > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > > + error = dax_writeback_mapping_range(mapping, bdev, start, end); > > + if (error) > > + return error; > > + } > > + > > error = filemap_write_and_wait_range(filp->f_mapping, start, end); > > if (error) > > return error; > > diff --git a/fs/dax.c b/fs/dax.c > > index 4592241..4b5006a 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev, > > * end]. This is required by data integrity operations to ensure file data is > > * on persistent storage prior to completion of the operation. 
> > */ > > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > > - loff_t end) > > +int dax_writeback_mapping_range(struct address_space *mapping, > > + struct block_device *bdev, loff_t start, loff_t end) > > { > > struct inode *inode = mapping->host; > > - struct block_device *bdev = inode->i_sb->s_bdev; > > pgoff_t start_index, end_index, pmd_index; > > pgoff_t indices[PAGEVEC_SIZE]; > > struct pagevec pvec; > > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > > index 2c88d68..d1abf53 100644 > > --- a/fs/ext2/file.c > > +++ b/fs/ext2/file.c > > @@ -162,6 +162,16 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) > > int ret; > > struct super_block *sb = file->f_mapping->host->i_sb; > > struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; > > +#ifdef CONFIG_FS_DAX > > + struct address_space *inode_mapping = file->f_inode->i_mapping; > > + > > + if (dax_mapping(inode_mapping) && inode_mapping->nrexceptional) { > > + ret = dax_writeback_mapping_range(inode_mapping, sb->s_bdev, > > + start, end); > > + if (ret) > > + return ret; > > + } > > +#endif > > > > ret = generic_file_fsync(file, start, end, datasync); > > if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) { > > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c > > index 8850254..e9cf53b 100644 > > --- a/fs/ext4/fsync.c > > +++ b/fs/ext4/fsync.c > > @@ -27,6 +27,7 @@ > > #include > > #include > > #include > > +#include > > > > #include "ext4.h" > > #include "ext4_jbd2.h" > > @@ -83,10 +84,10 @@ static int ext4_sync_parent(struct inode *inode) > > * What we do is just kick off a commit and wait on it. This will snapshot the > > * inode to disk. 
> > */ > > - > > int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > > { > > struct inode *inode = file->f_mapping->host; > > + struct address_space *mapping = inode->i_mapping; > > struct ext4_inode_info *ei = EXT4_I(inode); > > journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; > > int ret = 0, err; > > @@ -97,6 +98,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > > > > trace_ext4_sync_file_enter(file, datasync); > > > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > > + err = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, > > + start, end); > > + if (err) > > + goto out; > > + } > > + > > if (inode->i_sb->s_flags & MS_RDONLY) { > > /* Make sure that we read updated s_mount_flags value */ > > smp_rmb(); > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > > index 52883ac..84e95cc 100644 > > --- a/fs/xfs/xfs_file.c > > +++ b/fs/xfs/xfs_file.c > > @@ -209,7 +209,8 @@ xfs_file_fsync( > > loff_t end, > > int datasync) > > { > > - struct inode *inode = file->f_mapping->host; > > + struct address_space *mapping = file->f_mapping; > > + struct inode *inode = mapping->host; > > struct xfs_inode *ip = XFS_I(inode); > > struct xfs_mount *mp = ip->i_mount; > > int error = 0; > > @@ -218,7 +219,14 @@ xfs_file_fsync( > > > > trace_xfs_file_fsync(ip); > > > > - error = filemap_write_and_wait_range(inode->i_mapping, start, end); > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > > + error = dax_writeback_mapping_range(mapping, > > + xfs_find_bdev_for_inode(inode), start, end); > > + if (error) > > + return error; > > + } > > + > > + error = filemap_write_and_wait_range(mapping, start, end); > > if (error) > > return error; > > > > diff --git a/include/linux/dax.h b/include/linux/dax.h > > index bad27b0..8e9f114 100644 > > --- a/include/linux/dax.h > > +++ b/include/linux/dax.h > > @@ -42,6 +42,6 @@ static inline bool dax_mapping(struct address_space *mapping) > > { > > 
return mapping->host && IS_DAX(mapping->host); > > } > > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > > - loff_t end); > > +int dax_writeback_mapping_range(struct address_space *mapping, > > + struct block_device *bdev, loff_t start, loff_t end); > > #endif > > diff --git a/mm/filemap.c b/mm/filemap.c > > index bc94386..c4286eb 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -482,12 +482,6 @@ int filemap_write_and_wait_range(struct address_space *mapping, > > { > > int err = 0; > > > > - if (dax_mapping(mapping) && mapping->nrexceptional) { > > - err = dax_writeback_mapping_range(mapping, lstart, lend); > > - if (err) > > - return err; > > - } > > - > > if (mapping->nrpages) { > > err = __filemap_fdatawrite_range(mapping, lstart, lend, > > WB_SYNC_ALL); > > -- > > 2.5.0 > > > > > -- > Jan Kara > SUSE Labs, CR
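The interface change in the patch quoted above can be modelled outside the kernel. What follows is a toy user-space sketch, not the kernel implementation: the `toy_` structures, the `rt_bdev` field, and `toy_find_bdev_for_inode()` are assumptions made for illustration, loosely mirroring the way the XFS hunk calls xfs_find_bdev_for_inode(). It shows why taking the device from the superblock picks the wrong device for a file whose data lives elsewhere, and how letting the caller pass the device avoids that.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for kernel structures; all names are hypothetical. */
struct toy_bdev {
	const char *name;
	int flushes;            /* number of writeback calls seen */
};

struct toy_sb {
	struct toy_bdev *s_bdev; /* main device of the filesystem */
};

struct toy_inode {
	struct toy_sb *i_sb;
	struct toy_bdev *rt_bdev; /* non-NULL if data is on a second device */
};

/* Old shape: the flushing code guesses the device from the superblock. */
struct toy_bdev *toy_writeback_old(struct toy_inode *inode)
{
	struct toy_bdev *bdev = inode->i_sb->s_bdev;

	bdev->flushes++;
	return bdev;
}

/* New shape: the caller says which device backs the data. */
struct toy_bdev *toy_writeback_new(struct toy_inode *inode,
				   struct toy_bdev *bdev)
{
	(void)inode;
	assert(bdev != NULL);
	bdev->flushes++;
	return bdev;
}

/* Filesystem-specific knowledge: which device holds this inode's data. */
struct toy_bdev *toy_find_bdev_for_inode(struct toy_inode *inode)
{
	return inode->rt_bdev ? inode->rt_bdev : inode->i_sb->s_bdev;
}

/* fsync as the filesystem would implement it after the change. */
int toy_fsync(struct toy_inode *inode)
{
	toy_writeback_new(inode, toy_find_bdev_for_inode(inode));
	return 0;
}
```

For an inode with `rt_bdev` set, `toy_writeback_old()` flushes the superblock device while `toy_fsync()` flushes the device that actually holds the data. The cost of this shape, as noted in the review, is that every fsync implementation has to remember to make the call.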
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 8 Feb 2016 11:48:50 +0100 From: Jan Kara Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160208104849.GB9451@quack.suse.cz> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com List-ID: On Sun 07-02-16 00:19:13, Ross Zwisler wrote: > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > dax_writeback_mapping_range() needs a struct block_device, and it used to > get that from inode->i_sb->s_bdev. This is correct for normal inodes > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time files. > > Instead, call dax_writeback_mapping_range() directly from the filesystem or > raw block device fsync/msync code so that they can supply us with a valid > block device. > > It should be noted that this will reduce the number of calls to > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > called in the various filesystems for operations other than just > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > of ->fsync for hole punch, truncate, and block relocation > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()).
> > I don't believe that these extra flushes are necessary in the DAX case. In > the page cache case when we have dirty data in the page cache, that data > will be actively lost if we evict a dirty page cache page without flushing > it to media first. For DAX, though, the data will remain consistent with > the physical address to which it was written regardless of whether it's in > the processor cache or not - really the only reason I see to flush is in > response to a fsync or msync so that our data is durable on media in case > of a power loss. The case where we could throw dirty data out of the page > cache and essentially lose writes simply doesn't exist. You should at least note that sync(2) won't make data durable with this patch in the changelog. Dave and Christoph have told you that Linux users depend on sync(2) to make data durable and I fully agree with them. Given current options, I think we can live with this for 4.5 but long term this is IMO unacceptable. Honza > > Signed-off-by: Ross Zwisler > --- > fs/block_dev.c | 7 +++++++ > fs/dax.c | 5 ++--- > fs/ext2/file.c | 10 ++++++++++ > fs/ext4/fsync.c | 10 +++++++++- > fs/xfs/xfs_file.c | 12 ++++++++++-- > include/linux/dax.h | 4 ++-- > mm/filemap.c | 6 ------ > 7 files changed, 40 insertions(+), 14 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index fa0507a..312ad44 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -356,8 +356,15 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync) > { > struct inode *bd_inode = bdev_file_inode(filp); > struct block_device *bdev = I_BDEV(bd_inode); > + struct address_space *mapping = bd_inode->i_mapping; > int error; > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > + error = dax_writeback_mapping_range(mapping, bdev, start, end); > + if (error) > + return error; > + } > + > error = filemap_write_and_wait_range(filp->f_mapping, start, end); > if (error) > return error; > diff --git a/fs/dax.c b/fs/dax.c > 
index 4592241..4b5006a 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev, > * end]. This is required by data integrity operations to ensure file data is > * on persistent storage prior to completion of the operation. > */ > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > - loff_t end) > +int dax_writeback_mapping_range(struct address_space *mapping, > + struct block_device *bdev, loff_t start, loff_t end) > { > struct inode *inode = mapping->host; > - struct block_device *bdev = inode->i_sb->s_bdev; > pgoff_t start_index, end_index, pmd_index; > pgoff_t indices[PAGEVEC_SIZE]; > struct pagevec pvec; > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index 2c88d68..d1abf53 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -162,6 +162,16 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) > int ret; > struct super_block *sb = file->f_mapping->host->i_sb; > struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; > +#ifdef CONFIG_FS_DAX > + struct address_space *inode_mapping = file->f_inode->i_mapping; > + > + if (dax_mapping(inode_mapping) && inode_mapping->nrexceptional) { > + ret = dax_writeback_mapping_range(inode_mapping, sb->s_bdev, > + start, end); > + if (ret) > + return ret; > + } > +#endif > > ret = generic_file_fsync(file, start, end, datasync); > if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) { > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c > index 8850254..e9cf53b 100644 > --- a/fs/ext4/fsync.c > +++ b/fs/ext4/fsync.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > > #include "ext4.h" > #include "ext4_jbd2.h" > @@ -83,10 +84,10 @@ static int ext4_sync_parent(struct inode *inode) > * What we do is just kick off a commit and wait on it. This will snapshot the > * inode to disk. 
> */ > - > int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > { > struct inode *inode = file->f_mapping->host; > + struct address_space *mapping = inode->i_mapping; > struct ext4_inode_info *ei = EXT4_I(inode); > journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; > int ret = 0, err; > @@ -97,6 +98,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > > trace_ext4_sync_file_enter(file, datasync); > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > + err = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, > + start, end); > + if (err) > + goto out; > + } > + > if (inode->i_sb->s_flags & MS_RDONLY) { > /* Make sure that we read updated s_mount_flags value */ > smp_rmb(); > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 52883ac..84e95cc 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -209,7 +209,8 @@ xfs_file_fsync( > loff_t end, > int datasync) > { > - struct inode *inode = file->f_mapping->host; > + struct address_space *mapping = file->f_mapping; > + struct inode *inode = mapping->host; > struct xfs_inode *ip = XFS_I(inode); > struct xfs_mount *mp = ip->i_mount; > int error = 0; > @@ -218,7 +219,14 @@ xfs_file_fsync( > > trace_xfs_file_fsync(ip); > > - error = filemap_write_and_wait_range(inode->i_mapping, start, end); > + if (dax_mapping(mapping) && mapping->nrexceptional) { > + error = dax_writeback_mapping_range(mapping, > + xfs_find_bdev_for_inode(inode), start, end); > + if (error) > + return error; > + } > + > + error = filemap_write_and_wait_range(mapping, start, end); > if (error) > return error; > > diff --git a/include/linux/dax.h b/include/linux/dax.h > index bad27b0..8e9f114 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -42,6 +42,6 @@ static inline bool dax_mapping(struct address_space *mapping) > { > return mapping->host && IS_DAX(mapping->host); > } > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t 
start, > - loff_t end); > +int dax_writeback_mapping_range(struct address_space *mapping, > + struct block_device *bdev, loff_t start, loff_t end); > #endif > diff --git a/mm/filemap.c b/mm/filemap.c > index bc94386..c4286eb 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -482,12 +482,6 @@ int filemap_write_and_wait_range(struct address_space *mapping, > { > int err = 0; > > - if (dax_mapping(mapping) && mapping->nrexceptional) { > - err = dax_writeback_mapping_range(mapping, lstart, lend); > - if (err) > - return err; > - } > - > if (mapping->nrpages) { > err = __filemap_fdatawrite_range(mapping, lstart, lend, > WB_SYNC_ALL); > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <20160207215047.GJ31407@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> Date: Mon, 8 Feb 2016 00:18:11 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Dave Chinner Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer List-ID: On Sun, Feb 7, 2016 at 1:50 PM, Dave Chinner wrote: > On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote: >> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler >> wrote: >> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems >> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
>> > dax_writeback_mapping_range() needs a struct block_device, and it used to >> > get that from inode->i_sb->s_bdev. This is correct for normal inodes >> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw >> > block devices and for XFS real-time files. >> > >> > Instead, call dax_writeback_mapping_range() directly from the filesystem or >> > raw block device fsync/msync code so that they can supply us with a valid >> > block device. >> > >> > It should be noted that this will reduce the number of calls to >> > dax_writeback_mapping_range() because filemap_write_and_wait_range() is >> > called in the various filesystems for operations other than just >> > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside >> > of ->fsync for hole punch, truncate, and block relocation >> > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). >> > >> > I don't believe that these extra flushes are necessary in the DAX case. In >> > the page cache case when we have dirty data in the page cache, that data >> > will be actively lost if we evict a dirty page cache page without flushing >> > it to media first. For DAX, though, the data will remain consistent with >> > the physical address to which it was written regardless of whether it's in >> > the processor cache or not - really the only reason I see to flush is in >> > response to a fsync or msync so that our data is durable on media in case >> > of a power loss. The case where we could throw dirty data out of the page >> > cache and essentially lose writes simply doesn't exist. 
>> > >> > Signed-off-by: Ross Zwisler >> > --- >> > fs/block_dev.c | 7 +++++++ >> > fs/dax.c | 5 ++--- >> > fs/ext2/file.c | 10 ++++++++++ >> > fs/ext4/fsync.c | 10 +++++++++- >> > fs/xfs/xfs_file.c | 12 ++++++++++-- >> > include/linux/dax.h | 4 ++-- >> > mm/filemap.c | 6 ------ >> > 7 files changed, 40 insertions(+), 14 deletions(-) >> >> This sprinkling of dax specific fixups outside of vm_operations_struct >> routines still has me thinking that we are going in the wrong >> direction for fsync/msync support. >> >> If an application is both unaware of DAX and doing mmap I/O it is >> better served by the page cache where writeback is durable by default. >> We expect DAX-aware applications to assume responsibility for cpu >> cache management [1]. Making DAX mmap semantics explicit opt-in >> solves not only durability support, but also the current problem that >> DAX gets silently disabled leaving an app to wonder if it really got a >> direct mapping. DAX also silently picks pud, pmd, or pte mappings >> which is information an application would really like to know at map >> time. >> >> The proposal: make applications explicitly request DAX semantics with >> a new MAP_DAX flag and fail if DAX is unavailable. > > No. > > As I've stated before, the entire purpose of enabling DAX through > existing filesytsems like XFS and ext4 is so that existing > applications work with DAX *without modification*. > > That is, applications can be entirely unaware of the fact that the > filesystem is giving them direct access to the storage because the > access and failure semantics of DAX enabled mmap are *identical to > the existing mmap semantics*. > > Given this, the app doesn't need to care whether DAX is enabled or > not; all that will be seen is a difference in speed of access. 
> Enabling and disabling DAX is, at this point, purely an > administration decision - if the hardware and filesystem supports > it, it can be turned on without having to wait years for application > developers to add support for it.... Setting aside the current block zeroing problem, you seem to be assuming that DAX will always be faster, and that may not be true at a media level. Waiting years for some applications to determine if DAX makes sense for their use case seems completely reasonable. In the meantime the apps that are already making these changes want to know that a DAX mapping request has not silently dropped back to the page cache. They also want to know if they successfully jumped through all the hoops to get a larger-than-pte mapping. I agree it is useful to be able to force DAX on an unmodified application to see what happens, and it follows that if those applications want to run in that mode they will need functional fsync()... I would feel better if we were talking about specific applications and performance numbers to know if forcing DAX on an application is a debug facility or a production level capability. You seem to have already made that determination and I'm curious what I'm missing. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 8 Feb 2016 08:50:47 +1100 From: Dave Chinner Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160207215047.GJ31407@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer List-ID: On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote: > On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler > wrote: > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time files. > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem or > > raw block device fsync/msync code so that they can supply us with a valid > > block device. > > > > It should be noted that this will reduce the number of calls to > > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > > called in the various filesystems for operations other than just > > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > > of ->fsync for hole punch, truncate, and block relocation > > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). 
> > > > I don't believe that these extra flushes are necessary in the DAX case. In > > the page cache case when we have dirty data in the page cache, that data > > will be actively lost if we evict a dirty page cache page without flushing > > it to media first. For DAX, though, the data will remain consistent with > > the physical address to which it was written regardless of whether it's in > > the processor cache or not - really the only reason I see to flush is in > > response to a fsync or msync so that our data is durable on media in case > > of a power loss. The case where we could throw dirty data out of the page > > cache and essentially lose writes simply doesn't exist. > > > > Signed-off-by: Ross Zwisler > > --- > > fs/block_dev.c | 7 +++++++ > > fs/dax.c | 5 ++--- > > fs/ext2/file.c | 10 ++++++++++ > > fs/ext4/fsync.c | 10 +++++++++- > > fs/xfs/xfs_file.c | 12 ++++++++++-- > > include/linux/dax.h | 4 ++-- > > mm/filemap.c | 6 ------ > > 7 files changed, 40 insertions(+), 14 deletions(-) > > This sprinkling of dax specific fixups outside of vm_operations_struct > routines still has me thinking that we are going in the wrong > direction for fsync/msync support. > > If an application is both unaware of DAX and doing mmap I/O it is > better served by the page cache where writeback is durable by default. > We expect DAX-aware applications to assume responsibility for cpu > cache management [1]. Making DAX mmap semantics explicit opt-in > solves not only durability support, but also the current problem that > DAX gets silently disabled leaving an app to wonder if it really got a > direct mapping. DAX also silently picks pud, pmd, or pte mappings > which is information an application would really like to know at map > time. > > The proposal: make applications explicitly request DAX semantics with > a new MAP_DAX flag and fail if DAX is unavailable. No. 
As I've stated before, the entire purpose of enabling DAX through existing filesytsems like XFS and ext4 is so that existing applications work with DAX *without modification*. That is, applications can be entirely unaware of the fact that the filesystem is giving them direct access to the storage because the access and failure semantics of DAX enabled mmap are *identical to the existing mmap semantics*. Given this, the app doesn't need to care whether DAX is enabled or not; all that will be seen is a difference in speed of access. Enabling and disabling DAX is, at this point, purely an administration decision - if the hardware and filesystem supports it, it can be turned on without having to wait years for application developers to add support for it.... -Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> Date: Sun, 7 Feb 2016 11:13:51 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Ross Zwisler Cc: "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer List-ID: On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler wrote: > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). 
> dax_writeback_mapping_range() needs a struct block_device, and it used to > get that from inode->i_sb->s_bdev. This is correct for normal inodes > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time files. > > Instead, call dax_writeback_mapping_range() directly from the filesystem or > raw block device fsync/msync code so that they can supply us with a valid > block device. > > It should be noted that this will reduce the number of calls to > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > called in the various filesystems for operations other than just > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > of ->fsync for hole punch, truncate, and block relocation > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). > > I don't believe that these extra flushes are necessary in the DAX case. In > the page cache case when we have dirty data in the page cache, that data > will be actively lost if we evict a dirty page cache page without flushing > it to media first. For DAX, though, the data will remain consistent with > the physical address to which it was written regardless of whether it's in > the processor cache or not - really the only reason I see to flush is in > response to a fsync or msync so that our data is durable on media in case > of a power loss. The case where we could throw dirty data out of the page > cache and essentially lose writes simply doesn't exist. > > Signed-off-by: Ross Zwisler > --- > fs/block_dev.c | 7 +++++++ > fs/dax.c | 5 ++--- > fs/ext2/file.c | 10 ++++++++++ > fs/ext4/fsync.c | 10 +++++++++- > fs/xfs/xfs_file.c | 12 ++++++++++-- > include/linux/dax.h | 4 ++-- > mm/filemap.c | 6 ------ > 7 files changed, 40 insertions(+), 14 deletions(-) This sprinkling of dax specific fixups outside of vm_operations_struct routines still has me thinking that we are going in the wrong direction for fsync/msync support. 
If an application is both unaware of DAX and doing mmap I/O it is better served by the page cache where writeback is durable by default. We expect DAX-aware applications to assume responsibility for cpu cache management [1]. Making DAX mmap semantics explicit opt-in solves not only durability support, but also the current problem that DAX gets silently disabled leaving an app to wonder if it really got a direct mapping. DAX also silently picks pud, pmd, or pte mappings which is information an application would really like to know at map time. The proposal: make applications explicitly request DAX semantics with a new MAP_DAX flag and fail if DAX is unavailable. Document that a successful MAP_DAX request mandates that the application assumes responsibility for cpu cache management. Require that all applications that mmap the file agree on MAP_DAX. This also solves the future problem of DAX support on virtually tagged cache architectures where it is difficult for the kernel to know what alias addresses need flushing. [1]: https://github.com/pmem/nvml -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
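Editorial note: the durability contract at stake in this exchange is the one an unmodified mmap application already relies on, namely that stores to a shared mapping are durable once msync() (or fsync()) returns. A minimal user-space sketch of that pattern follows. The helper name mmap_store_and_sync() and the DEMO_MAP_LEN constant are hypothetical, chosen only for illustration; the code uses plain POSIX calls and does not touch the MAP_DAX flag proposed above, which is only a proposal in this thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_MAP_LEN 4096 /* hypothetical mapping size for the demo */

/*
 * Store len bytes into a shared file mapping, then ask the kernel to
 * make them durable with msync(MS_SYNC).
 * Returns 0 on success, -1 on failure.
 */
static int mmap_store_and_sync(const char *path, const void *buf, size_t len)
{
	void *addr;
	int fd, ret = -1;

	assert(len <= DEMO_MAP_LEN);

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, DEMO_MAP_LEN) < 0)
		goto out_close;

	addr = mmap(NULL, DEMO_MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED,
		    fd, 0);
	if (addr == MAP_FAILED)
		goto out_close;

	/* An ordinary store through the mapping; no explicit cache flushes. */
	memcpy(addr, buf, len);

	/* The application depends on this call for durability. */
	if (msync(addr, DEMO_MAP_LEN, MS_SYNC) == 0)
		ret = 0;

	munmap(addr, DEMO_MAP_LEN);
out_close:
	close(fd);
	return ret;
}
```

Whether the file sits on a page-cache or a DAX mapping, the application code is the same; the disagreement in the thread is about whether the kernel should keep honouring this call for DAX or hand cache management to the application.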
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [PATCH 2/2] dax: move writeback calls into the filesystems Date: Sun, 7 Feb 2016 00:19:13 -0700 Message-Id: <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com List-ID: Previously calls to dax_writeback_mapping_range() for all DAX filesystems (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). dax_writeback_mapping_range() needs a struct block_device, and it used to get that from inode->i_sb->s_bdev. This is correct for normal inodes mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time files. Instead, call dax_writeback_mapping_range() directly from the filesystem or raw block device fsync/msync code so that they can supply us with a valid block device. It should be noted that this will reduce the number of calls to dax_writeback_mapping_range() because filemap_write_and_wait_range() is called in the various filesystems for operations other than just fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside of ->fsync for hole punch, truncate, and block relocation (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). I don't believe that these extra flushes are necessary in the DAX case. In the page cache case when we have dirty data in the page cache, that data will be actively lost if we evict a dirty page cache page without flushing it to media first. 
For DAX, though, the data will remain consistent with the physical address to which it was written regardless of whether it's in the processor cache or not - really the only reason I see to flush is in response to a fsync or msync so that our data is durable on media in case of a power loss. The case where we could throw dirty data out of the page cache and essentially lose writes simply doesn't exist. Signed-off-by: Ross Zwisler --- fs/block_dev.c | 7 +++++++ fs/dax.c | 5 ++--- fs/ext2/file.c | 10 ++++++++++ fs/ext4/fsync.c | 10 +++++++++- fs/xfs/xfs_file.c | 12 ++++++++++-- include/linux/dax.h | 4 ++-- mm/filemap.c | 6 ------ 7 files changed, 40 insertions(+), 14 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index fa0507a..312ad44 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -356,8 +356,15 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync) { struct inode *bd_inode = bdev_file_inode(filp); struct block_device *bdev = I_BDEV(bd_inode); + struct address_space *mapping = bd_inode->i_mapping; int error; + if (dax_mapping(mapping) && mapping->nrexceptional) { + error = dax_writeback_mapping_range(mapping, bdev, start, end); + if (error) + return error; + } + error = filemap_write_and_wait_range(filp->f_mapping, start, end); if (error) return error; diff --git a/fs/dax.c b/fs/dax.c index 4592241..4b5006a 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev, * end]. This is required by data integrity operations to ensure file data is * on persistent storage prior to completion of the operation. 
*/ -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end) +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, loff_t start, loff_t end) { struct inode *inode = mapping->host; - struct block_device *bdev = inode->i_sb->s_bdev; pgoff_t start_index, end_index, pmd_index; pgoff_t indices[PAGEVEC_SIZE]; struct pagevec pvec; diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 2c88d68..d1abf53 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -162,6 +162,16 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) int ret; struct super_block *sb = file->f_mapping->host->i_sb; struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; +#ifdef CONFIG_FS_DAX + struct address_space *inode_mapping = file->f_inode->i_mapping; + + if (dax_mapping(inode_mapping) && inode_mapping->nrexceptional) { + ret = dax_writeback_mapping_range(inode_mapping, sb->s_bdev, + start, end); + if (ret) + return ret; + } +#endif ret = generic_file_fsync(file, start, end, datasync); if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) { diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 8850254..e9cf53b 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "ext4.h" #include "ext4_jbd2.h" @@ -83,10 +84,10 @@ static int ext4_sync_parent(struct inode *inode) * What we do is just kick off a commit and wait on it. This will snapshot the * inode to disk. 
*/ - int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) { struct inode *inode = file->f_mapping->host; + struct address_space *mapping = inode->i_mapping; struct ext4_inode_info *ei = EXT4_I(inode); journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; int ret = 0, err; @@ -97,6 +98,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) trace_ext4_sync_file_enter(file, datasync); + if (dax_mapping(mapping) && mapping->nrexceptional) { + err = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, + start, end); + if (err) + goto out; + } + if (inode->i_sb->s_flags & MS_RDONLY) { /* Make sure that we read updated s_mount_flags value */ smp_rmb(); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 52883ac..84e95cc 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -209,7 +209,8 @@ xfs_file_fsync( loff_t end, int datasync) { - struct inode *inode = file->f_mapping->host; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; struct xfs_inode *ip = XFS_I(inode); struct xfs_mount *mp = ip->i_mount; int error = 0; @@ -218,7 +219,14 @@ xfs_file_fsync( trace_xfs_file_fsync(ip); - error = filemap_write_and_wait_range(inode->i_mapping, start, end); + if (dax_mapping(mapping) && mapping->nrexceptional) { + error = dax_writeback_mapping_range(mapping, + xfs_find_bdev_for_inode(inode), start, end); + if (error) + return error; + } + + error = filemap_write_and_wait_range(mapping, start, end); if (error) return error; diff --git a/include/linux/dax.h b/include/linux/dax.h index bad27b0..8e9f114 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -42,6 +42,6 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end); +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, loff_t 
start, loff_t end); #endif diff --git a/mm/filemap.c b/mm/filemap.c index bc94386..c4286eb 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -482,12 +482,6 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; - if (dax_mapping(mapping) && mapping->nrexceptional) { - err = dax_writeback_mapping_range(mapping, lstart, lend); - if (err) - return err; - } - if (mapping->nrpages) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 8 Feb 2016 08:34:43 -0700 From: Ross Zwisler Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160208153443.GC2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> <20160207220329.GK31407@dastard> <20160208014409.GA2343@linux.intel.com> <20160208051725.GM31407@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160208051725.GM31407@dastard> Sender: owner-linux-mm@kvack.org To: Dave Chinner Cc: Ross Zwisler , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com List-ID: On Mon, Feb 08, 2016 at 04:17:25PM +1100, Dave Chinner wrote: > On Sun, Feb 07, 2016 at 06:44:09PM -0700, Ross Zwisler wrote: > > On Mon, Feb 08, 2016 at 09:03:29AM +1100, Dave Chinner wrote: > > > On Sun, Feb 07, 2016 at 12:19:12AM -0700, Ross Zwisler wrote: > > > > dax_clear_blocks() needs a valid struct block_device and previously it was > > > > using 
inode->i_sb->s_bdev in all cases. This is correct for normal inodes > > > > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > > > block devices and for XFS real-time devices. > > > > > > > > Instead, have the caller pass in a struct block_device pointer which it > > > > knows to be correct. > > > .... > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > > > index 07ef29b..f722ba2 100644 > > > > --- a/fs/xfs/xfs_bmap_util.c > > > > +++ b/fs/xfs/xfs_bmap_util.c > > > > @@ -73,9 +73,11 @@ xfs_zero_extent( > > > > xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb); > > > > sector_t block = XFS_BB_TO_FSBT(mp, sector); > > > > ssize_t size = XFS_FSB_TO_B(mp, count_fsb); > > > > + struct inode *inode = VFS_I(ip); > > > > > > > > if (IS_DAX(VFS_I(ip))) > > > > - return dax_clear_blocks(VFS_I(ip), block, size); > > > > + return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), > > > > + block, size); > > > > > > Get rid of the local inode variable and use VFS_I(ip) like the code > > > originally did. Do not change code that is unrelated to the > > > modifcation being made, especially when it results in making > > > the code an inconsistent mess of mixed pointer constructs.... > > > > The local 'inode' variable was added to avoid multiple calls for VFS_I() for > > the same 'ip'. > > My point is you didn't achieve that. The end result of your patch > is: > > struct inode *inode = VFS_I(ip); > > if (IS_DAX(VFS_I(ip))) > return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), > block, size); > > So now we have a local variable, but we still have 2 calls to > VFS_I(ip). i.e. this makes the code harder to read and understand > than before for no benefit. *facepalm* Yep, thanks for the correction. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 8 Feb 2016 16:17:25 +1100 From: Dave Chinner Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160208051725.GM31407@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> <20160207220329.GK31407@dastard> <20160208014409.GA2343@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160208014409.GA2343@linux.intel.com> Sender: owner-linux-mm@kvack.org To: Ross Zwisler , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com List-ID: On Sun, Feb 07, 2016 at 06:44:09PM -0700, Ross Zwisler wrote: > On Mon, Feb 08, 2016 at 09:03:29AM +1100, Dave Chinner wrote: > > On Sun, Feb 07, 2016 at 12:19:12AM -0700, Ross Zwisler wrote: > > > dax_clear_blocks() needs a valid struct block_device and previously it was > > > using inode->i_sb->s_bdev in all cases. This is correct for normal inodes > > > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > > block devices and for XFS real-time devices. > > > > > > Instead, have the caller pass in a struct block_device pointer which it > > > knows to be correct. > > .... 
> > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > > index 07ef29b..f722ba2 100644 > > > --- a/fs/xfs/xfs_bmap_util.c > > > +++ b/fs/xfs/xfs_bmap_util.c > > > @@ -73,9 +73,11 @@ xfs_zero_extent( > > > xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb); > > > sector_t block = XFS_BB_TO_FSBT(mp, sector); > > > ssize_t size = XFS_FSB_TO_B(mp, count_fsb); > > > + struct inode *inode = VFS_I(ip); > > > > > > if (IS_DAX(VFS_I(ip))) > > > - return dax_clear_blocks(VFS_I(ip), block, size); > > > + return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), > > > + block, size); > > > > Get rid of the local inode variable and use VFS_I(ip) like the code > > originally did. Do not change code that is unrelated to the > > modifcation being made, especially when it results in making > > the code an inconsistent mess of mixed pointer constructs.... > > The local 'inode' variable was added to avoid multiple calls for VFS_I() for > the same 'ip'. My point is you didn't achieve that. The end result of your patch is: struct inode *inode = VFS_I(ip); if (IS_DAX(VFS_I(ip))) return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), block, size); So now we have a local variable, but we still have 2 calls to VFS_I(ip). i.e. this makes the code harder to read and understand than before for no benefit. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (1.0) Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() From: Ross Zwisler In-Reply-To: <20160208014601.GB2343@linux.intel.com> Date: Sun, 7 Feb 2016 21:29:38 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <00FE872A-9B2A-4492-A83C-59025ACB1F4A@gmail.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> <20160208014601.GB2343@linux.intel.com> Sender: owner-linux-mm@kvack.org To: Ross Zwisler Cc: Dan Williams , Theodore Ts'o , "linux-nvdimm@lists.01.org" , Dave Chinner , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , linux-ext4 , Andrew Morton List-ID: > On Feb 7, 2016, at 6:46 PM, Ross Zwisler wrote: > >> On Sun, Feb 07, 2016 at 10:19:29AM -0800, Dan Williams wrote: >> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler >> wrote: >>> dax_clear_blocks() needs a valid struct block_device and previously it was >>> using inode->i_sb->s_bdev in all cases. This is correct for normal inodes >>> on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw >>> block devices and for XFS real-time devices. >>> >>> Instead, have the caller pass in a struct block_device pointer which it >>> knows to be correct. 
>>> >>> Signed-off-by: Ross Zwisler >>> --- >>> fs/dax.c | 4 ++-- >>> fs/ext2/inode.c | 5 +++-- >>> fs/xfs/xfs_aops.c | 2 +- >>> fs/xfs/xfs_aops.h | 1 + >>> fs/xfs/xfs_bmap_util.c | 4 +++- >>> include/linux/dax.h | 3 ++- >>> 6 files changed, 12 insertions(+), 7 deletions(-) >>> >>> diff --git a/fs/dax.c b/fs/dax.c >>> index 227974a..4592241 100644 >>> --- a/fs/dax.c >>> +++ b/fs/dax.c >>> @@ -83,9 +83,9 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n) >>> * and hence this means the stack from this point must follow GFP_NOFS >>> * semantics for all operations. >>> */ >>> -int dax_clear_blocks(struct inode *inode, sector_t block, long _size) >>> +int dax_clear_blocks(struct inode *inode, struct block_device *bdev, >>> + sector_t block, long _size) >> >> Since this is a bdev relative routine we should also resolve the >> sector, i.e. the signature should drop the inode: >> >> int dax_clear_sectors(struct block_device *bdev, sector_t sector, long _size) > > The inode is still needed because dax_clear_blocks() needs inode->i_blkbits. > Unless there is some easy way to get this from the bdev that I'm not seeing? Never mind, you are passing in the sector, not the block. Sure, this seems better - I'll fix this for v2. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
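Editorial note: the point settled above is that a block-based interface needs the inode's block size (i_blkbits) to find the device offset, while a sector-based interface can have the caller do that conversion up front and so needs only the bdev. The arithmetic can be sketched in user space as below. The helper name fs_block_to_sector() and the DEMO_SECTOR_SHIFT constant are hypothetical, and the sketch assumes the usual 512-byte sector unit; it is not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SECTOR_SHIFT 9 /* assumes 512-byte sectors */

/*
 * Convert a filesystem block number into a 512-byte sector number.
 * blkbits is log2 of the filesystem block size, e.g. 12 for 4096 bytes.
 */
static uint64_t fs_block_to_sector(uint64_t block, unsigned int blkbits)
{
	/* A filesystem block is never smaller than one sector here. */
	assert(blkbits >= DEMO_SECTOR_SHIFT);
	return block << (blkbits - DEMO_SECTOR_SHIFT);
}
```

With 4096-byte blocks (blkbits of 12), each block covers eight sectors, so block 1 starts at sector 8. Once the caller has done this conversion, the callee no longer needs the inode at all, which is the simplification Dan suggests.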
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Sun, 7 Feb 2016 18:46:01 -0700 From: Ross Zwisler Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160208014601.GB2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers List-ID: On Sun, Feb 07, 2016 at 10:19:29AM -0800, Dan Williams wrote: > On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler > wrote: > > dax_clear_blocks() needs a valid struct block_device and previously it was > > using inode->i_sb->s_bdev in all cases. This is correct for normal inodes > > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time devices. > > > > Instead, have the caller pass in a struct block_device pointer which it > > knows to be correct. > > > > Signed-off-by: Ross Zwisler > > --- > > fs/dax.c | 4 ++-- > > fs/ext2/inode.c | 5 +++-- > > fs/xfs/xfs_aops.c | 2 +- > > fs/xfs/xfs_aops.h | 1 + > > fs/xfs/xfs_bmap_util.c | 4 +++- > > include/linux/dax.h | 3 ++- > > 6 files changed, 12 insertions(+), 7 deletions(-) > > > > diff --git a/fs/dax.c b/fs/dax.c > > index 227974a..4592241 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ -83,9 +83,9 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n) > > * and hence this means the stack from this point must follow GFP_NOFS > > * semantics for all operations. 
> > */ > > -int dax_clear_blocks(struct inode *inode, sector_t block, long _size) > > +int dax_clear_blocks(struct inode *inode, struct block_device *bdev, > > + sector_t block, long _size) > > Since this is a bdev relative routine we should also resolve the > sector, i.e. the signature should drop the inode: > > int dax_clear_sectors(struct block_device *bdev, sector_t sector, long _size) The inode is still needed because dax_clear_blocks() needs inode->i_blkbits. Unless there is some easy way to get this from the bdev that I'm not seeing? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Sun, 7 Feb 2016 18:44:09 -0700 From: Ross Zwisler Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160208014409.GA2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> <20160207220329.GK31407@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160207220329.GK31407@dastard> Sender: owner-linux-mm@kvack.org To: Dave Chinner Cc: Ross Zwisler , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com List-ID: On Mon, Feb 08, 2016 at 09:03:29AM +1100, Dave Chinner wrote: > On Sun, Feb 07, 2016 at 12:19:12AM -0700, Ross Zwisler wrote: > > dax_clear_blocks() needs a valid struct block_device and previously it was > > using inode->i_sb->s_bdev in all cases. 
This is correct for normal inodes > > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time devices. > > > > Instead, have the caller pass in a struct block_device pointer which it > > knows to be correct. > .... > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > index 07ef29b..f722ba2 100644 > > --- a/fs/xfs/xfs_bmap_util.c > > +++ b/fs/xfs/xfs_bmap_util.c > > @@ -73,9 +73,11 @@ xfs_zero_extent( > > xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb); > > sector_t block = XFS_BB_TO_FSBT(mp, sector); > > ssize_t size = XFS_FSB_TO_B(mp, count_fsb); > > + struct inode *inode = VFS_I(ip); > > > > if (IS_DAX(VFS_I(ip))) > > - return dax_clear_blocks(VFS_I(ip), block, size); > > + return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), > > + block, size); > > Get rid of the local inode variable and use VFS_I(ip) like the code > originally did. Do not change code that is unrelated to the > modification being made, especially when it results in making > the code an inconsistent mess of mixed pointer constructs.... The local 'inode' variable was added to avoid multiple calls to VFS_I() for the same 'ip'. That said, I'm happy to make the change.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 8 Feb 2016 09:03:29 +1100 From: Dave Chinner Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160207220329.GK31407@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com List-ID: On Sun, Feb 07, 2016 at 12:19:12AM -0700, Ross Zwisler wrote: > dax_clear_blocks() needs a valid struct block_device and previously it was > using inode->i_sb->s_bdev in all cases. This is correct for normal inodes > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time devices. > > Instead, have the caller pass in a struct block_device pointer which it > knows to be correct. .... > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > index 07ef29b..f722ba2 100644 > --- a/fs/xfs/xfs_bmap_util.c > +++ b/fs/xfs/xfs_bmap_util.c > @@ -73,9 +73,11 @@ xfs_zero_extent( > xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb); > sector_t block = XFS_BB_TO_FSBT(mp, sector); > ssize_t size = XFS_FSB_TO_B(mp, count_fsb); > + struct inode *inode = VFS_I(ip); > > if (IS_DAX(VFS_I(ip))) > - return dax_clear_blocks(VFS_I(ip), block, size); > + return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), > + block, size); Get rid of the local inode variable and use VFS_I(ip) like the code originally did.
Do not change code that is unrelated to the modification being made, especially when it results in making the code an inconsistent mess of mixed pointer constructs.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> Date: Sun, 7 Feb 2016 10:19:29 -0800 Message-ID: Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Ross Zwisler Cc: "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers List-ID: On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler wrote: > dax_clear_blocks() needs a valid struct block_device and previously it was > using inode->i_sb->s_bdev in all cases. This is correct for normal inodes > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time devices. > > Instead, have the caller pass in a struct block_device pointer which it > knows to be correct.
> > Signed-off-by: Ross Zwisler > --- > fs/dax.c | 4 ++-- > fs/ext2/inode.c | 5 +++-- > fs/xfs/xfs_aops.c | 2 +- > fs/xfs/xfs_aops.h | 1 + > fs/xfs/xfs_bmap_util.c | 4 +++- > include/linux/dax.h | 3 ++- > 6 files changed, 12 insertions(+), 7 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 227974a..4592241 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -83,9 +83,9 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n) > * and hence this means the stack from this point must follow GFP_NOFS > * semantics for all operations. > */ > -int dax_clear_blocks(struct inode *inode, sector_t block, long _size) > +int dax_clear_blocks(struct inode *inode, struct block_device *bdev, > + sector_t block, long _size) Since this is a bdev relative routine we should also resolve the sector, i.e. the signature should drop the inode: int dax_clear_sectors(struct block_device *bdev, sector_t sector, long _size)
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Date: Sun, 7 Feb 2016 00:19:12 -0700 Message-Id: <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com List-ID: dax_clear_blocks() needs a valid struct block_device and previously it was using inode->i_sb->s_bdev in all cases.
This is correct for normal inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time devices. Instead, have the caller pass in a struct block_device pointer which it knows to be correct. Signed-off-by: Ross Zwisler --- fs/dax.c | 4 ++-- fs/ext2/inode.c | 5 +++-- fs/xfs/xfs_aops.c | 2 +- fs/xfs/xfs_aops.h | 1 + fs/xfs/xfs_bmap_util.c | 4 +++- include/linux/dax.h | 3 ++- 6 files changed, 12 insertions(+), 7 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 227974a..4592241 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -83,9 +83,9 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n) * and hence this means the stack from this point must follow GFP_NOFS * semantics for all operations. */ -int dax_clear_blocks(struct inode *inode, sector_t block, long _size) +int dax_clear_blocks(struct inode *inode, struct block_device *bdev, + sector_t block, long _size) { - struct block_device *bdev = inode->i_sb->s_bdev; struct blk_dax_ctl dax = { .sector = block << (inode->i_blkbits - 9), .size = _size, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 338eefd..277a32b 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -737,8 +737,9 @@ static int ext2_get_blocks(struct inode *inode, * so that it's not found by another thread before it's * initialised */ - err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key), - 1 << inode->i_blkbits); + err = dax_clear_blocks(inode, inode->i_sb->s_bdev, + le32_to_cpu(chain[depth-1].key), + 1 << inode->i_blkbits); if (err) { mutex_unlock(&ei->truncate_mutex); goto cleanup; diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 379c089..fc20518 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -55,7 +55,7 @@ xfs_count_page_state( } while ((bh = bh->b_this_page) != head); } -STATIC struct block_device * +struct block_device * xfs_find_bdev_for_inode( struct inode *inode) { diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h index f6ffc9a..a4343c6 100644 --- 
a/fs/xfs/xfs_aops.h +++ b/fs/xfs/xfs_aops.h @@ -62,5 +62,6 @@ int xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset, struct buffer_head *map_bh, int create); extern void xfs_count_page_state(struct page *, int *, int *); +extern struct block_device *xfs_find_bdev_for_inode(struct inode *); #endif /* __XFS_AOPS_H__ */ diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 07ef29b..f722ba2 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -73,9 +73,11 @@ xfs_zero_extent( xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb); sector_t block = XFS_BB_TO_FSBT(mp, sector); ssize_t size = XFS_FSB_TO_B(mp, count_fsb); + struct inode *inode = VFS_I(ip); if (IS_DAX(VFS_I(ip))) - return dax_clear_blocks(VFS_I(ip), block, size); + return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), + block, size); /* * let the block layer decide on the fastest method of diff --git a/include/linux/dax.h b/include/linux/dax.h index 8204c3d..bad27b0 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -7,7 +7,8 @@ ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t, get_block_t, dio_iodone_t, int flags); -int dax_clear_blocks(struct inode *, sector_t block, long size); +int dax_clear_blocks(struct inode *inode, struct block_device *bdev, + sector_t block, long _size); int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); int dax_truncate_page(struct inode *, loff_t from, get_block_t); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t, -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
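The effect of this patch can be illustrated outside the kernel. What follows is a minimal userspace sketch, not kernel code: the struct definitions are simplified stand-ins for the real ones, and is_realtime, struct clear_request, block_to_sector(), clear_blocks_old(), clear_blocks_new() and find_bdev_for_inode() are hypothetical names made up for this illustration. It shows the two points raised in this thread: the block-to-sector conversion that depends on inode->i_blkbits, and how a device derived from inode->i_sb->s_bdev can differ from the one the caller knows to be correct.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for kernel types; NOT the real definitions. */
typedef uint64_t sector_t;

struct block_device { const char *name; };
struct super_block { struct block_device *s_bdev; };

struct inode {
	struct super_block *i_sb;
	unsigned int i_blkbits;	/* log2 of the filesystem block size */
	int is_realtime;	/* hypothetical flag, used only by this sketch */
};

/* Records which device and sector a clear request would have targeted. */
struct clear_request {
	struct block_device *bdev;
	sector_t sector;
	long size;
};

/* Same arithmetic as the patch: filesystem block -> 512-byte sector. */
static sector_t block_to_sector(const struct inode *inode, sector_t block)
{
	assert(inode->i_blkbits >= 9);
	return block << (inode->i_blkbits - 9);
}

/* Old shape: the device is always derived from the superblock. */
static struct clear_request clear_blocks_old(struct inode *inode,
					     sector_t block, long size)
{
	struct clear_request req = {
		inode->i_sb->s_bdev, block_to_sector(inode, block), size
	};
	return req;
}

/* New shape: the caller passes in the device it knows to be correct. */
static struct clear_request clear_blocks_new(struct inode *inode,
					     struct block_device *bdev,
					     sector_t block, long size)
{
	struct clear_request req = {
		bdev, block_to_sector(inode, block), size
	};
	return req;
}

/*
 * Hypothetical lookup in the spirit of xfs_find_bdev_for_inode():
 * real-time inodes live on a separate device from the superblock's.
 */
static struct block_device *find_bdev_for_inode(struct inode *inode,
						struct block_device *rtdev)
{
	return inode->is_realtime ? rtdev : inode->i_sb->s_bdev;
}
```

With a 4096-byte block size (i_blkbits == 12), block 3 maps to sector 24 in both shapes; only the target device changes. For an inode flagged as real-time, clear_blocks_old() would target the superblock's device, while clear_blocks_new() combined with find_bdev_for_inode() targets the real-time device.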
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753596AbcBGHTf (ORCPT ); Sun, 7 Feb 2016 02:19:35 -0500 Received: from mga03.intel.com ([134.134.136.65]:64383 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753426AbcBGHTd (ORCPT ); Sun, 7 Feb 2016 02:19:33 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,409,1449561600"; d="scan'208";a="741902350" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com
Subject: [PATCH 2/2] dax: move writeback calls into the filesystems Date: Sun, 7 Feb 2016 00:19:13 -0700 Message-Id: <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.5.0 In-Reply-To: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Previously calls to dax_writeback_mapping_range() for all DAX filesystems (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). dax_writeback_mapping_range() needs a struct block_device, and it used to get that from inode->i_sb->s_bdev. This is correct for normal inodes mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time files. Instead, call dax_writeback_mapping_range() directly from the filesystem or raw block device fsync/msync code so that they can supply us with a valid block device. It should be noted that this will reduce the number of calls to dax_writeback_mapping_range() because filemap_write_and_wait_range() is called in the various filesystems for operations other than just fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside of ->fsync for hole punch, truncate, and block relocation (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). I don't believe that these extra flushes are necessary in the DAX case. In the page cache case when we have dirty data in the page cache, that data will be actively lost if we evict a dirty page cache page without flushing it to media first. For DAX, though, the data will remain consistent with the physical address to which it was written regardless of whether it's in the processor cache or not - really the only reason I see to flush is in response to a fsync or msync so that our data is durable on media in case of a power loss. 
The case where we could throw dirty data out of the page cache and essentially lose writes simply doesn't exist. Signed-off-by: Ross Zwisler --- fs/block_dev.c | 7 +++++++ fs/dax.c | 5 ++--- fs/ext2/file.c | 10 ++++++++++ fs/ext4/fsync.c | 10 +++++++++- fs/xfs/xfs_file.c | 12 ++++++++++-- include/linux/dax.h | 4 ++-- mm/filemap.c | 6 ------ 7 files changed, 40 insertions(+), 14 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index fa0507a..312ad44 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -356,8 +356,15 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync) { struct inode *bd_inode = bdev_file_inode(filp); struct block_device *bdev = I_BDEV(bd_inode); + struct address_space *mapping = bd_inode->i_mapping; int error; + if (dax_mapping(mapping) && mapping->nrexceptional) { + error = dax_writeback_mapping_range(mapping, bdev, start, end); + if (error) + return error; + } + error = filemap_write_and_wait_range(filp->f_mapping, start, end); if (error) return error; diff --git a/fs/dax.c b/fs/dax.c index 4592241..4b5006a 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev, * end]. This is required by data integrity operations to ensure file data is * on persistent storage prior to completion of the operation. 
*/ -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end) +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, loff_t start, loff_t end) { struct inode *inode = mapping->host; - struct block_device *bdev = inode->i_sb->s_bdev; pgoff_t start_index, end_index, pmd_index; pgoff_t indices[PAGEVEC_SIZE]; struct pagevec pvec; diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 2c88d68..d1abf53 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -162,6 +162,16 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) int ret; struct super_block *sb = file->f_mapping->host->i_sb; struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; +#ifdef CONFIG_FS_DAX + struct address_space *inode_mapping = file->f_inode->i_mapping; + + if (dax_mapping(inode_mapping) && inode_mapping->nrexceptional) { + ret = dax_writeback_mapping_range(inode_mapping, sb->s_bdev, + start, end); + if (ret) + return ret; + } +#endif ret = generic_file_fsync(file, start, end, datasync); if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) { diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 8850254..e9cf53b 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "ext4.h" #include "ext4_jbd2.h" @@ -83,10 +84,10 @@ static int ext4_sync_parent(struct inode *inode) * What we do is just kick off a commit and wait on it. This will snapshot the * inode to disk. 
*/ - int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) { struct inode *inode = file->f_mapping->host; + struct address_space *mapping = inode->i_mapping; struct ext4_inode_info *ei = EXT4_I(inode); journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; int ret = 0, err; @@ -97,6 +98,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) trace_ext4_sync_file_enter(file, datasync); + if (dax_mapping(mapping) && mapping->nrexceptional) { + err = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, + start, end); + if (err) + goto out; + } + if (inode->i_sb->s_flags & MS_RDONLY) { /* Make sure that we read updated s_mount_flags value */ smp_rmb(); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 52883ac..84e95cc 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -209,7 +209,8 @@ xfs_file_fsync( loff_t end, int datasync) { - struct inode *inode = file->f_mapping->host; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; struct xfs_inode *ip = XFS_I(inode); struct xfs_mount *mp = ip->i_mount; int error = 0; @@ -218,7 +219,14 @@ xfs_file_fsync( trace_xfs_file_fsync(ip); - error = filemap_write_and_wait_range(inode->i_mapping, start, end); + if (dax_mapping(mapping) && mapping->nrexceptional) { + error = dax_writeback_mapping_range(mapping, + xfs_find_bdev_for_inode(inode), start, end); + if (error) + return error; + } + + error = filemap_write_and_wait_range(mapping, start, end); if (error) return error; diff --git a/include/linux/dax.h b/include/linux/dax.h index bad27b0..8e9f114 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -42,6 +42,6 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end); +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, loff_t 
start, loff_t end); #endif diff --git a/mm/filemap.c b/mm/filemap.c index bc94386..c4286eb 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -482,12 +482,6 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; - if (dax_mapping(mapping) && mapping->nrexceptional) { - err = dax_writeback_mapping_range(mapping, lstart, lend); - if (err) - return err; - } - if (mapping->nrpages) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); -- 2.5.0
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754541AbcBGTNz (ORCPT ); Sun, 7 Feb 2016 14:13:55 -0500 Received: from mail-yw0-f177.google.com ([209.85.161.177]:32981 "EHLO mail-yw0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754121AbcBGTNw (ORCPT ); Sun, 7 Feb 2016 14:13:52 -0500 MIME-Version: 1.0 In-Reply-To: <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> Date: Sun, 7 Feb 2016 11:13:51 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems From: Dan Williams To: Ross Zwisler Cc: "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler wrote: > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > dax_writeback_mapping_range() needs a struct block_device, and it used to > get that from inode->i_sb->s_bdev. This is correct for normal inodes > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time files. > > Instead, call dax_writeback_mapping_range() directly from the filesystem or > raw block device fsync/msync code so that they can supply us with a valid > block device.
> > It should be noted that this will reduce the number of calls to > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > called in the various filesystems for operations other than just > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > of ->fsync for hole punch, truncate, and block relocation > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). > > I don't believe that these extra flushes are necessary in the DAX case. In > the page cache case when we have dirty data in the page cache, that data > will be actively lost if we evict a dirty page cache page without flushing > it to media first. For DAX, though, the data will remain consistent with > the physical address to which it was written regardless of whether it's in > the processor cache or not - really the only reason I see to flush is in > response to a fsync or msync so that our data is durable on media in case > of a power loss. The case where we could throw dirty data out of the page > cache and essentially lose writes simply doesn't exist. > > Signed-off-by: Ross Zwisler > --- > fs/block_dev.c | 7 +++++++ > fs/dax.c | 5 ++--- > fs/ext2/file.c | 10 ++++++++++ > fs/ext4/fsync.c | 10 +++++++++- > fs/xfs/xfs_file.c | 12 ++++++++++-- > include/linux/dax.h | 4 ++-- > mm/filemap.c | 6 ------ > 7 files changed, 40 insertions(+), 14 deletions(-) This sprinkling of dax specific fixups outside of vm_operations_struct routines still has me thinking that we are going in the wrong direction for fsync/msync support. If an application is both unaware of DAX and doing mmap I/O it is better served by the page cache where writeback is durable by default. We expect DAX-aware applications to assume responsibility for cpu cache management [1]. Making DAX mmap semantics explicit opt-in solves not only durability support, but also the current problem that DAX gets silently disabled leaving an app to wonder if it really got a direct mapping. 
DAX also silently picks pud, pmd, or pte mappings which is information an application would really like to know at map time. The proposal: make applications explicitly request DAX semantics with a new MAP_DAX flag and fail if DAX is unavailable. Document that a successful MAP_DAX request mandates that the application assumes responsibility for cpu cache management. Require that all applications that mmap the file agree on MAP_DAX. This also solves the future problem of DAX support on virtually tagged cache architectures where it is difficult for the kernel to know what alias addresses need flushing. [1]: https://github.com/pmem/nvml From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754876AbcBGVuy (ORCPT ); Sun, 7 Feb 2016 16:50:54 -0500 Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:49359 "EHLO ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750749AbcBGVuw (ORCPT ); Sun, 7 Feb 2016 16:50:52 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2CaDABou7dWPBATLHleKAECgw+BP4Jpg3qBeJ0/AQEBAQEBBotmhUSEB4YHAgIBAQKBH00BAQEBAQEHAQEBAUE/hEIBAQQnExwjEAgDGAklDwUlAwcaE4gavHABAQgCAR0YhTKEf4QWBoRQBZZ1jUeOfINSimyEWiguAYcagTgBAQE Date: Mon, 8 Feb 2016 08:50:47 +1100 From: Dave Chinner To: Dan Williams Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160207215047.GJ31407@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 
(2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote: > On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler > wrote: > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time files. > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem or > > raw block device fsync/msync code so that they can supply us with a valid > > block device. > > > > It should be noted that this will reduce the number of calls to > > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > > called in the various filesystems for operations other than just > > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > > of ->fsync for hole punch, truncate, and block relocation > > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). > > > > I don't believe that these extra flushes are necessary in the DAX case. In > > the page cache case when we have dirty data in the page cache, that data > > will be actively lost if we evict a dirty page cache page without flushing > > it to media first. For DAX, though, the data will remain consistent with > > the physical address to which it was written regardless of whether it's in > > the processor cache or not - really the only reason I see to flush is in > > response to a fsync or msync so that our data is durable on media in case > > of a power loss. The case where we could throw dirty data out of the page > > cache and essentially lose writes simply doesn't exist. 
> > > > Signed-off-by: Ross Zwisler > > --- > > fs/block_dev.c | 7 +++++++ > > fs/dax.c | 5 ++--- > > fs/ext2/file.c | 10 ++++++++++ > > fs/ext4/fsync.c | 10 +++++++++- > > fs/xfs/xfs_file.c | 12 ++++++++++-- > > include/linux/dax.h | 4 ++-- > > mm/filemap.c | 6 ------ > > 7 files changed, 40 insertions(+), 14 deletions(-) > > This sprinkling of dax specific fixups outside of vm_operations_struct > routines still has me thinking that we are going in the wrong > direction for fsync/msync support. > > If an application is both unaware of DAX and doing mmap I/O it is > better served by the page cache where writeback is durable by default. > We expect DAX-aware applications to assume responsibility for cpu > cache management [1]. Making DAX mmap semantics explicit opt-in > solves not only durability support, but also the current problem that > DAX gets silently disabled leaving an app to wonder if it really got a > direct mapping. DAX also silently picks pud, pmd, or pte mappings > which is information an application would really like to know at map > time. > > The proposal: make applications explicitly request DAX semantics with > a new MAP_DAX flag and fail if DAX is unavailable. No. As I've stated before, the entire purpose of enabling DAX through existing filesystems like XFS and ext4 is so that existing applications work with DAX *without modification*. That is, applications can be entirely unaware of the fact that the filesystem is giving them direct access to the storage because the access and failure semantics of DAX-enabled mmap are *identical to the existing mmap semantics*. Given this, the app doesn't need to care whether DAX is enabled or not; all that will be seen is a difference in speed of access. Enabling and disabling DAX is, at this point, purely an administration decision - if the hardware and filesystem support it, it can be turned on without having to wait years for application developers to add support for it.... -Dave.
-- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755006AbcBGWE4 (ORCPT ); Sun, 7 Feb 2016 17:04:56 -0500 Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:16834 "EHLO ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754750AbcBGWEd (ORCPT ); Sun, 7 Feb 2016 17:04:33 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2AfDADqvrdWPBATLHleKAECgw+BP4Jpg3qBeJ0/AQEBAQEBBotmhUSEB4YHBAICgSBNAQEBAQEBBwEBAQFBP4RCAQEEJxMcIxAIAxgJJQ8FJQMHGhOIGrxzAQseGIUyhH+IbAEElnWNR458RI16gmUZgVwoLohTAQEB Date: Mon, 8 Feb 2016 09:03:29 +1100 From: Dave Chinner To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160207220329.GK31407@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Feb 07, 2016 at 12:19:12AM -0700, Ross Zwisler wrote: > dax_clear_blocks() needs a valid struct block_device and previously it was > using inode->i_sb->s_bdev in all cases. This is correct for normal inodes > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time devices. 
> > Instead, have the caller pass in a struct block_device pointer which it > knows to be correct. .... > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > index 07ef29b..f722ba2 100644 > --- a/fs/xfs/xfs_bmap_util.c > +++ b/fs/xfs/xfs_bmap_util.c > @@ -73,9 +73,11 @@ xfs_zero_extent( > xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb); > sector_t block = XFS_BB_TO_FSBT(mp, sector); > ssize_t size = XFS_FSB_TO_B(mp, count_fsb); > + struct inode *inode = VFS_I(ip); > > if (IS_DAX(VFS_I(ip))) > - return dax_clear_blocks(VFS_I(ip), block, size); > + return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), > + block, size); Get rid of the local inode variable and use VFS_I(ip) like the code originally did. Do not change code that is unrelated to the modification being made, especially when it results in making the code an inconsistent mess of mixed pointer constructs.... Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753314AbcBHBoW (ORCPT ); Sun, 7 Feb 2016 20:44:22 -0500 Received: from mga01.intel.com ([192.55.52.88]:60607 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751933AbcBHBoU (ORCPT ); Sun, 7 Feb 2016 20:44:20 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,413,1449561600"; d="scan'208";a="44077435" Date: Sun, 7 Feb 2016 18:44:09 -0700 From: Ross Zwisler To: Dave Chinner Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160208014409.GA2343@linux.intel.com> Mail-Followup-To: Ross Zwisler , Dave Chinner , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro
, Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> <20160207220329.GK31407@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160207220329.GK31407@dastard> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 08, 2016 at 09:03:29AM +1100, Dave Chinner wrote: > On Sun, Feb 07, 2016 at 12:19:12AM -0700, Ross Zwisler wrote: > > dax_clear_blocks() needs a valid struct block_device and previously it was > > using inode->i_sb->s_bdev in all cases. This is correct for normal inodes > > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time devices. > > > > Instead, have the caller pass in a struct block_device pointer which it > > knows to be correct. > .... > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > index 07ef29b..f722ba2 100644 > > --- a/fs/xfs/xfs_bmap_util.c > > +++ b/fs/xfs/xfs_bmap_util.c > > @@ -73,9 +73,11 @@ xfs_zero_extent( > > xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb); > > sector_t block = XFS_BB_TO_FSBT(mp, sector); > > ssize_t size = XFS_FSB_TO_B(mp, count_fsb); > > + struct inode *inode = VFS_I(ip); > > > > if (IS_DAX(VFS_I(ip))) > > - return dax_clear_blocks(VFS_I(ip), block, size); > > + return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), > > + block, size); > > Get rid of the local inode variable and use VFS_I(ip) like the code > originally did. Do not change code that is unrelated to the > modifcation being made, especially when it results in making > the code an inconsistent mess of mixed pointer constructs.... 
The local 'inode' variable was added to avoid multiple calls for VFS_I() for the same 'ip'. That said, I'm happy to make the change. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754278AbcBHBqN (ORCPT ); Sun, 7 Feb 2016 20:46:13 -0500 Received: from mga01.intel.com ([192.55.52.88]:32324 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753260AbcBHBqL (ORCPT ); Sun, 7 Feb 2016 20:46:11 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,413,1449561600"; d="scan'208";a="907509021" Date: Sun, 7 Feb 2016 18:46:01 -0700 From: Ross Zwisler To: Dan Williams Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160208014601.GB2343@linux.intel.com> Mail-Followup-To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Feb 07, 2016 at 10:19:29AM -0800, Dan Williams wrote: > On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler > wrote: > > dax_clear_blocks() needs a valid struct block_device and previously it was > > using inode->i_sb->s_bdev in all cases. 
This is correct for normal inodes > > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time devices. > > > > Instead, have the caller pass in a struct block_device pointer which it > > knows to be correct. > > > > Signed-off-by: Ross Zwisler > > --- > > fs/dax.c | 4 ++-- > > fs/ext2/inode.c | 5 +++-- > > fs/xfs/xfs_aops.c | 2 +- > > fs/xfs/xfs_aops.h | 1 + > > fs/xfs/xfs_bmap_util.c | 4 +++- > > include/linux/dax.h | 3 ++- > > 6 files changed, 12 insertions(+), 7 deletions(-) > > > > diff --git a/fs/dax.c b/fs/dax.c > > index 227974a..4592241 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ -83,9 +83,9 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n) > > * and hence this means the stack from this point must follow GFP_NOFS > > * semantics for all operations. > > */ > > -int dax_clear_blocks(struct inode *inode, sector_t block, long _size) > > +int dax_clear_blocks(struct inode *inode, struct block_device *bdev, > > + sector_t block, long _size) > > Since this is a bdev relative routine we should also resolve the > sector, i.e. the signature should drop the inode: > > int dax_clear_sectors(struct block_device *bdev, sector_t sector, long _size) The inode is still needed because dax_clear_blocks() needs inode->i_blkbits. Unless there is some easy way to get this from the bdev that I'm not seeing? 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755290AbcBHE3n (ORCPT ); Sun, 7 Feb 2016 23:29:43 -0500 Received: from mail-ob0-f195.google.com ([209.85.214.195]:34223 "EHLO mail-ob0-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753052AbcBHE3l convert rfc822-to-8bit (ORCPT ); Sun, 7 Feb 2016 23:29:41 -0500 Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (1.0) Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() From: Ross Zwisler X-Mailer: iPad Mail (13D15) In-Reply-To: <20160208014601.GB2343@linux.intel.com> Date: Sun, 7 Feb 2016 21:29:38 -0700 Cc: Dan Williams , "Theodore Ts'o" , "linux-nvdimm@lists.01.org" , Dave Chinner , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , linux-ext4 , Andrew Morton Content-Transfer-Encoding: 8BIT Message-Id: <00FE872A-9B2A-4492-A83C-59025ACB1F4A@gmail.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> <20160208014601.GB2343@linux.intel.com> To: Ross Zwisler Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > On Feb 7, 2016, at 6:46 PM, Ross Zwisler wrote: > >> On Sun, Feb 07, 2016 at 10:19:29AM -0800, Dan Williams wrote: >> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler >> wrote: >>> dax_clear_blocks() needs a valid struct block_device and previously it was >>> using inode->i_sb->s_bdev in all cases. This is correct for normal inodes >>> on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw >>> block devices and for XFS real-time devices. >>> >>> Instead, have the caller pass in a struct block_device pointer which it >>> knows to be correct. 
>>> >>> Signed-off-by: Ross Zwisler >>> --- >>> fs/dax.c | 4 ++-- >>> fs/ext2/inode.c | 5 +++-- >>> fs/xfs/xfs_aops.c | 2 +- >>> fs/xfs/xfs_aops.h | 1 + >>> fs/xfs/xfs_bmap_util.c | 4 +++- >>> include/linux/dax.h | 3 ++- >>> 6 files changed, 12 insertions(+), 7 deletions(-) >>> >>> diff --git a/fs/dax.c b/fs/dax.c >>> index 227974a..4592241 100644 >>> --- a/fs/dax.c >>> +++ b/fs/dax.c >>> @@ -83,9 +83,9 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n) >>> * and hence this means the stack from this point must follow GFP_NOFS >>> * semantics for all operations. >>> */ >>> -int dax_clear_blocks(struct inode *inode, sector_t block, long _size) >>> +int dax_clear_blocks(struct inode *inode, struct block_device *bdev, >>> + sector_t block, long _size) >> >> Since this is a bdev relative routine we should also resolve the >> sector, i.e. the signature should drop the inode: >> >> int dax_clear_sectors(struct block_device *bdev, sector_t sector, long _size) > > The inode is still needed because dax_clear_blocks() needs inode->i_blkbits. > Unless there is some easy way to get this from the bdev that I'm not seeing? Never mind, you are passing in the sector, not the block. Sure, this seems better - I'll fix this for v2. 
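For readers following the block-versus-sector point above: a minimal userspace sketch, assuming 512-byte sectors, of the conversion a caller would do before handing a sector to a bdev-relative routine. This is why the inode (and its i_blkbits) can drop out of the callee's signature. The names block_to_sector() and the sector_t typedef here are illustrative stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's sector_t. */
typedef uint64_t sector_t;

/* Assumption: sectors are 512 bytes, i.e. a shift of 9. */
#define SKETCH_SECTOR_SHIFT 9

/*
 * Convert a filesystem block number into a 512-byte sector number.
 * blkbits plays the role of inode->i_blkbits (log2 of the block size).
 * block_to_sector() is an illustrative name, not a kernel function.
 */
static sector_t block_to_sector(sector_t block, unsigned int blkbits)
{
	/* A block can never be smaller than a sector in this sketch. */
	assert(blkbits >= SKETCH_SECTOR_SHIFT && blkbits < 64);
	return block << (blkbits - SKETCH_SECTOR_SHIFT);
}
```

With 4 KiB blocks (blkbits = 12), block 10 maps to sector 80. Once the caller has done this conversion, the callee only needs the bdev, the sector and the size.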
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932085AbcBHFRb (ORCPT ); Mon, 8 Feb 2016 00:17:31 -0500 Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:37087 "EHLO ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750820AbcBHFR3 (ORCPT ); Mon, 8 Feb 2016 00:17:29 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2AICgBLJLhWPBATLHldDhoBAoMPgT+GY4F4nUMDBotmhUSEB4YHBAICgSNNAQEBAQEBBwEBAQFBP4RCAQEEJxMcMwgDGAklDwUlAwcaARKIGrwqDB4YhTKEf4hsAQSWdY1HgWSHaYUvRIoog1KCZRmBDU8oLohTAQEB Date: Mon, 8 Feb 2016 16:17:25 +1100 From: Dave Chinner To: Ross Zwisler , linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160208051725.GM31407@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> <20160207220329.GK31407@dastard> <20160208014409.GA2343@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160208014409.GA2343@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Feb 07, 2016 at 06:44:09PM -0700, Ross Zwisler wrote: > On Mon, Feb 08, 2016 at 09:03:29AM +1100, Dave Chinner wrote: > > On Sun, Feb 07, 2016 at 12:19:12AM -0700, Ross Zwisler wrote: > > > dax_clear_blocks() needs a valid struct block_device and previously it was > > > using inode->i_sb->s_bdev in all cases. 
This is correct for normal inodes > > > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > > block devices and for XFS real-time devices. > > > > > > Instead, have the caller pass in a struct block_device pointer which it > > > knows to be correct. > > .... > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > > index 07ef29b..f722ba2 100644 > > > --- a/fs/xfs/xfs_bmap_util.c > > > +++ b/fs/xfs/xfs_bmap_util.c > > > @@ -73,9 +73,11 @@ xfs_zero_extent( > > > xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb); > > > sector_t block = XFS_BB_TO_FSBT(mp, sector); > > > ssize_t size = XFS_FSB_TO_B(mp, count_fsb); > > > + struct inode *inode = VFS_I(ip); > > > > > > if (IS_DAX(VFS_I(ip))) > > > - return dax_clear_blocks(VFS_I(ip), block, size); > > > + return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), > > > + block, size); > > > > Get rid of the local inode variable and use VFS_I(ip) like the code > > originally did. Do not change code that is unrelated to the > > modifcation being made, especially when it results in making > > the code an inconsistent mess of mixed pointer constructs.... > > The local 'inode' variable was added to avoid multiple calls for VFS_I() for > the same 'ip'. My point is you didn't achieve that. The end result of your patch is: struct inode *inode = VFS_I(ip); if (IS_DAX(VFS_I(ip))) return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), block, size); So now we have a local variable, but we still have 2 calls to VFS_I(ip). i.e. this makes the code harder to read and understand than before for no benefit. Cheers, Dave. 
-- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755777AbcBHISP (ORCPT ); Mon, 8 Feb 2016 03:18:15 -0500 Received: from mail-yw0-f176.google.com ([209.85.161.176]:36456 "EHLO mail-yw0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755584AbcBHISM (ORCPT ); Mon, 8 Feb 2016 03:18:12 -0500 MIME-Version: 1.0 In-Reply-To: <20160207215047.GJ31407@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> Date: Mon, 8 Feb 2016 00:18:11 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems From: Dan Williams To: Dave Chinner Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Feb 7, 2016 at 1:50 PM, Dave Chinner wrote: > On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote: >> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler >> wrote: >> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems >> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). >> > dax_writeback_mapping_range() needs a struct block_device, and it used to >> > get that from inode->i_sb->s_bdev. This is correct for normal inodes >> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw >> > block devices and for XFS real-time files. >> > >> > Instead, call dax_writeback_mapping_range() directly from the filesystem or >> > raw block device fsync/msync code so that they can supply us with a valid >> > block device. 
>> > >> > It should be noted that this will reduce the number of calls to >> > dax_writeback_mapping_range() because filemap_write_and_wait_range() is >> > called in the various filesystems for operations other than just >> > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside >> > of ->fsync for hole punch, truncate, and block relocation >> > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). >> > >> > I don't believe that these extra flushes are necessary in the DAX case. In >> > the page cache case when we have dirty data in the page cache, that data >> > will be actively lost if we evict a dirty page cache page without flushing >> > it to media first. For DAX, though, the data will remain consistent with >> > the physical address to which it was written regardless of whether it's in >> > the processor cache or not - really the only reason I see to flush is in >> > response to a fsync or msync so that our data is durable on media in case >> > of a power loss. The case where we could throw dirty data out of the page >> > cache and essentially lose writes simply doesn't exist. >> > >> > Signed-off-by: Ross Zwisler >> > --- >> > fs/block_dev.c | 7 +++++++ >> > fs/dax.c | 5 ++--- >> > fs/ext2/file.c | 10 ++++++++++ >> > fs/ext4/fsync.c | 10 +++++++++- >> > fs/xfs/xfs_file.c | 12 ++++++++++-- >> > include/linux/dax.h | 4 ++-- >> > mm/filemap.c | 6 ------ >> > 7 files changed, 40 insertions(+), 14 deletions(-) >> >> This sprinkling of dax specific fixups outside of vm_operations_struct >> routines still has me thinking that we are going in the wrong >> direction for fsync/msync support. >> >> If an application is both unaware of DAX and doing mmap I/O it is >> better served by the page cache where writeback is durable by default. >> We expect DAX-aware applications to assume responsibility for cpu >> cache management [1]. 
Making DAX mmap semantics explicit opt-in >> solves not only durability support, but also the current problem that >> DAX gets silently disabled leaving an app to wonder if it really got a >> direct mapping. DAX also silently picks pud, pmd, or pte mappings >> which is information an application would really like to know at map >> time. >> >> The proposal: make applications explicitly request DAX semantics with >> a new MAP_DAX flag and fail if DAX is unavailable. > > No. > > As I've stated before, the entire purpose of enabling DAX through > existing filesystems like XFS and ext4 is so that existing > applications work with DAX *without modification*. > > That is, applications can be entirely unaware of the fact that the > filesystem is giving them direct access to the storage because the > access and failure semantics of DAX enabled mmap are *identical to > the existing mmap semantics*. > > Given this, the app doesn't need to care whether DAX is enabled or > not; all that will be seen is a difference in speed of access. > Enabling and disabling DAX is, at this point, purely an > administration decision - if the hardware and filesystem supports > it, it can be turned on without having to wait years for application > developers to add support for it.... Setting aside the current block zeroing problem, you seem to be assuming that DAX will always be faster, and that may not be true at a media level. Waiting years for some applications to determine if DAX makes sense for their use case seems completely reasonable. In the meantime the apps that are already making these changes want to know that a DAX mapping request has not silently dropped back to page cache. They also want to know if they successfully jumped through all the hoops to get a larger-than-pte mapping. I agree it is useful to be able to force DAX on an unmodified application to see what happens, and it follows that if those applications want to run in that mode they will need functional fsync()...
I would feel better if we were talking about specific applications and performance numbers to know if forcing DAX on an application is a debug facility or a production-level capability. You seem to have already made that determination and I'm curious what I'm missing. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752777AbcBHKsi (ORCPT ); Mon, 8 Feb 2016 05:48:38 -0500 Received: from mx2.suse.de ([195.135.220.15]:46505 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752471AbcBHKsf (ORCPT ); Mon, 8 Feb 2016 05:48:35 -0500 Date: Mon, 8 Feb 2016 11:48:50 +0100 From: Jan Kara To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160208104849.GB9451@quack.suse.cz> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun 07-02-16 00:19:13, Ross Zwisler wrote: > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > dax_writeback_mapping_range() needs a struct block_device, and it used to > get that from inode->i_sb->s_bdev.
This is correct for normal inodes > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time files. > > Instead, call dax_writeback_mapping_range() directly from the filesystem or > raw block device fsync/msync code so that they can supply us with a valid > block device. > > It should be noted that this will reduce the number of calls to > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > called in the various filesystems for operations other than just > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > of ->fsync for hole punch, truncate, and block relocation > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). > > I don't believe that these extra flushes are necessary in the DAX case. In > the page cache case when we have dirty data in the page cache, that data > will be actively lost if we evict a dirty page cache page without flushing > it to media first. For DAX, though, the data will remain consistent with > the physical address to which it was written regardless of whether it's in > the processor cache or not - really the only reason I see to flush is in > response to a fsync or msync so that our data is durable on media in case > of a power loss. The case where we could throw dirty data out of the page > cache and essentially lose writes simply doesn't exist. You should at least note in the changelog that sync(2) won't make data durable with this patch. Dave and Christoph have told you that Linux users depend on sync(2) to make data durable and I fully agree with them. Given current options, I think we can live with this for 4.5 but long term this is IMO unacceptable.
Honza > > Signed-off-by: Ross Zwisler > --- > fs/block_dev.c | 7 +++++++ > fs/dax.c | 5 ++--- > fs/ext2/file.c | 10 ++++++++++ > fs/ext4/fsync.c | 10 +++++++++- > fs/xfs/xfs_file.c | 12 ++++++++++-- > include/linux/dax.h | 4 ++-- > mm/filemap.c | 6 ------ > 7 files changed, 40 insertions(+), 14 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index fa0507a..312ad44 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -356,8 +356,15 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync) > { > struct inode *bd_inode = bdev_file_inode(filp); > struct block_device *bdev = I_BDEV(bd_inode); > + struct address_space *mapping = bd_inode->i_mapping; > int error; > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > + error = dax_writeback_mapping_range(mapping, bdev, start, end); > + if (error) > + return error; > + } > + > error = filemap_write_and_wait_range(filp->f_mapping, start, end); > if (error) > return error; > diff --git a/fs/dax.c b/fs/dax.c > index 4592241..4b5006a 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev, > * end]. This is required by data integrity operations to ensure file data is > * on persistent storage prior to completion of the operation. 
> */ > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > - loff_t end) > +int dax_writeback_mapping_range(struct address_space *mapping, > + struct block_device *bdev, loff_t start, loff_t end) > { > struct inode *inode = mapping->host; > - struct block_device *bdev = inode->i_sb->s_bdev; > pgoff_t start_index, end_index, pmd_index; > pgoff_t indices[PAGEVEC_SIZE]; > struct pagevec pvec; > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index 2c88d68..d1abf53 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -162,6 +162,16 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) > int ret; > struct super_block *sb = file->f_mapping->host->i_sb; > struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; > +#ifdef CONFIG_FS_DAX > + struct address_space *inode_mapping = file->f_inode->i_mapping; > + > + if (dax_mapping(inode_mapping) && inode_mapping->nrexceptional) { > + ret = dax_writeback_mapping_range(inode_mapping, sb->s_bdev, > + start, end); > + if (ret) > + return ret; > + } > +#endif > > ret = generic_file_fsync(file, start, end, datasync); > if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) { > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c > index 8850254..e9cf53b 100644 > --- a/fs/ext4/fsync.c > +++ b/fs/ext4/fsync.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > > #include "ext4.h" > #include "ext4_jbd2.h" > @@ -83,10 +84,10 @@ static int ext4_sync_parent(struct inode *inode) > * What we do is just kick off a commit and wait on it. This will snapshot the > * inode to disk. 
> */ > - > int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > { > struct inode *inode = file->f_mapping->host; > + struct address_space *mapping = inode->i_mapping; > struct ext4_inode_info *ei = EXT4_I(inode); > journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; > int ret = 0, err; > @@ -97,6 +98,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > > trace_ext4_sync_file_enter(file, datasync); > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > + err = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, > + start, end); > + if (err) > + goto out; > + } > + > if (inode->i_sb->s_flags & MS_RDONLY) { > /* Make sure that we read updated s_mount_flags value */ > smp_rmb(); > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 52883ac..84e95cc 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -209,7 +209,8 @@ xfs_file_fsync( > loff_t end, > int datasync) > { > - struct inode *inode = file->f_mapping->host; > + struct address_space *mapping = file->f_mapping; > + struct inode *inode = mapping->host; > struct xfs_inode *ip = XFS_I(inode); > struct xfs_mount *mp = ip->i_mount; > int error = 0; > @@ -218,7 +219,14 @@ xfs_file_fsync( > > trace_xfs_file_fsync(ip); > > - error = filemap_write_and_wait_range(inode->i_mapping, start, end); > + if (dax_mapping(mapping) && mapping->nrexceptional) { > + error = dax_writeback_mapping_range(mapping, > + xfs_find_bdev_for_inode(inode), start, end); > + if (error) > + return error; > + } > + > + error = filemap_write_and_wait_range(mapping, start, end); > if (error) > return error; > > diff --git a/include/linux/dax.h b/include/linux/dax.h > index bad27b0..8e9f114 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -42,6 +42,6 @@ static inline bool dax_mapping(struct address_space *mapping) > { > return mapping->host && IS_DAX(mapping->host); > } > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t 
start, > - loff_t end); > +int dax_writeback_mapping_range(struct address_space *mapping, > + struct block_device *bdev, loff_t start, loff_t end); > #endif > diff --git a/mm/filemap.c b/mm/filemap.c > index bc94386..c4286eb 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -482,12 +482,6 @@ int filemap_write_and_wait_range(struct address_space *mapping, > { > int err = 0; > > - if (dax_mapping(mapping) && mapping->nrexceptional) { > - err = dax_writeback_mapping_range(mapping, lstart, lend); > - if (err) > - return err; > - } > - > if (mapping->nrpages) { > err = __filemap_fdatawrite_range(mapping, lstart, lend, > WB_SYNC_ALL); > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754870AbcBHPez (ORCPT ); Mon, 8 Feb 2016 10:34:55 -0500 Received: from mga04.intel.com ([192.55.52.120]:26705 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754571AbcBHPex (ORCPT ); Mon, 8 Feb 2016 10:34:53 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,416,1449561600"; d="scan'208";a="907863244" Date: Mon, 8 Feb 2016 08:34:43 -0700 From: Ross Zwisler To: Dave Chinner Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160208153443.GC2343@linux.intel.com> Mail-Followup-To: Ross Zwisler , Dave Chinner , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com References: 
<1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> <20160207220329.GK31407@dastard> <20160208014409.GA2343@linux.intel.com> <20160208051725.GM31407@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160208051725.GM31407@dastard> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 08, 2016 at 04:17:25PM +1100, Dave Chinner wrote: > On Sun, Feb 07, 2016 at 06:44:09PM -0700, Ross Zwisler wrote: > > On Mon, Feb 08, 2016 at 09:03:29AM +1100, Dave Chinner wrote: > > > On Sun, Feb 07, 2016 at 12:19:12AM -0700, Ross Zwisler wrote: > > > > dax_clear_blocks() needs a valid struct block_device and previously it was > > > > using inode->i_sb->s_bdev in all cases. This is correct for normal inodes > > > > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > > > block devices and for XFS real-time devices. > > > > > > > > Instead, have the caller pass in a struct block_device pointer which it > > > > knows to be correct. > > > .... > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > > > index 07ef29b..f722ba2 100644 > > > > --- a/fs/xfs/xfs_bmap_util.c > > > > +++ b/fs/xfs/xfs_bmap_util.c > > > > @@ -73,9 +73,11 @@ xfs_zero_extent( > > > > xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb); > > > > sector_t block = XFS_BB_TO_FSBT(mp, sector); > > > > ssize_t size = XFS_FSB_TO_B(mp, count_fsb); > > > > + struct inode *inode = VFS_I(ip); > > > > > > > > if (IS_DAX(VFS_I(ip))) > > > > - return dax_clear_blocks(VFS_I(ip), block, size); > > > > + return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), > > > > + block, size); > > > > > > Get rid of the local inode variable and use VFS_I(ip) like the code > > > originally did. 
Do not change code that is unrelated to the > > > modification being made, especially when it results in making > > > the code an inconsistent mess of mixed pointer constructs.... > > > > The local 'inode' variable was added to avoid multiple calls for VFS_I() for > > the same 'ip'. > > My point is you didn't achieve that. The end result of your patch > is: > > struct inode *inode = VFS_I(ip); > > if (IS_DAX(VFS_I(ip))) > return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), > block, size); > > So now we have a local variable, but we still have 2 calls to > VFS_I(ip). i.e. this makes the code harder to read and understand > than before for no benefit. *facepalm* Yep, thanks for the correction. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755476AbcBHQMt (ORCPT ); Mon, 8 Feb 2016 11:12:49 -0500 Received: from mga03.intel.com ([134.134.136.65]:38153 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753840AbcBHQMq (ORCPT ); Mon, 8 Feb 2016 11:12:46 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,416,1449561600"; d="scan'208";a="649041833" Date: Mon, 8 Feb 2016 09:12:11 -0700 From: Ross Zwisler To: Jan Kara Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160208161211.GE2343@linux.intel.com> Mail-Followup-To: Ross Zwisler , Jan Kara , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com References: 
<1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160208104849.GB9451@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160208104849.GB9451@quack.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 08, 2016 at 11:48:50AM +0100, Jan Kara wrote: > On Sun 07-02-16 00:19:13, Ross Zwisler wrote: > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time files. > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem or > > raw block device fsync/msync code so that they can supply us with a valid > > block device. > > > > It should be noted that this will reduce the number of calls to > > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > > called in the various filesystems for operations other than just > > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > > of ->fsync for hole punch, truncate, and block relocation > > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). > > > > I don't believe that these extra flushes are necessary in the DAX case. In > > the page cache case when we have dirty data in the page cache, that data > > will be actively lost if we evict a dirty page cache page without flushing > > it to media first. 
For DAX, though, the data will remain consistent with > > the physical address to which it was written regardless of whether it's in > > the processor cache or not - really the only reason I see to flush is in > > response to a fsync or msync so that our data is durable on media in case > > of a power loss. The case where we could throw dirty data out of the page > > cache and essentially lose writes simply doesn't exist. > > You should at least note that sync(2) won't make data durable with this > patch in the changelog. Dave and Christoph have told you that Linux users > depend on sync(2) to make data durable and I fully agree with them. Given > current options, I think we can live with this for 4.5 but long term this > is IMO unacceptable. > > Honza I agree. I'll add a note to the changelog and will work on adding support for sync(2). > > > > Signed-off-by: Ross Zwisler > > --- > > fs/block_dev.c | 7 +++++++ > > fs/dax.c | 5 ++--- > > fs/ext2/file.c | 10 ++++++++++ > > fs/ext4/fsync.c | 10 +++++++++- > > fs/xfs/xfs_file.c | 12 ++++++++++-- > > include/linux/dax.h | 4 ++-- > > mm/filemap.c | 6 ------ > > 7 files changed, 40 insertions(+), 14 deletions(-) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index fa0507a..312ad44 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -356,8 +356,15 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync) > > { > > struct inode *bd_inode = bdev_file_inode(filp); > > struct block_device *bdev = I_BDEV(bd_inode); > > + struct address_space *mapping = bd_inode->i_mapping; > > int error; > > > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > > + error = dax_writeback_mapping_range(mapping, bdev, start, end); > > + if (error) > > + return error; > > + } > > + > > error = filemap_write_and_wait_range(filp->f_mapping, start, end); > > if (error) > > return error; > > diff --git a/fs/dax.c b/fs/dax.c > > index 4592241..4b5006a 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ 
-484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev, > > * end]. This is required by data integrity operations to ensure file data is > > * on persistent storage prior to completion of the operation. > > */ > > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > > - loff_t end) > > +int dax_writeback_mapping_range(struct address_space *mapping, > > + struct block_device *bdev, loff_t start, loff_t end) > > { > > struct inode *inode = mapping->host; > > - struct block_device *bdev = inode->i_sb->s_bdev; > > pgoff_t start_index, end_index, pmd_index; > > pgoff_t indices[PAGEVEC_SIZE]; > > struct pagevec pvec; > > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > > index 2c88d68..d1abf53 100644 > > --- a/fs/ext2/file.c > > +++ b/fs/ext2/file.c > > @@ -162,6 +162,16 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) > > int ret; > > struct super_block *sb = file->f_mapping->host->i_sb; > > struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; > > +#ifdef CONFIG_FS_DAX > > + struct address_space *inode_mapping = file->f_inode->i_mapping; > > + > > + if (dax_mapping(inode_mapping) && inode_mapping->nrexceptional) { > > + ret = dax_writeback_mapping_range(inode_mapping, sb->s_bdev, > > + start, end); > > + if (ret) > > + return ret; > > + } > > +#endif > > > > ret = generic_file_fsync(file, start, end, datasync); > > if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) { > > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c > > index 8850254..e9cf53b 100644 > > --- a/fs/ext4/fsync.c > > +++ b/fs/ext4/fsync.c > > @@ -27,6 +27,7 @@ > > #include > > #include > > #include > > +#include > > > > #include "ext4.h" > > #include "ext4_jbd2.h" > > @@ -83,10 +84,10 @@ static int ext4_sync_parent(struct inode *inode) > > * What we do is just kick off a commit and wait on it. This will snapshot the > > * inode to disk. 
> > */ > > - > > int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > > { > > struct inode *inode = file->f_mapping->host; > > + struct address_space *mapping = inode->i_mapping; > > struct ext4_inode_info *ei = EXT4_I(inode); > > journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; > > int ret = 0, err; > > @@ -97,6 +98,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > > > > trace_ext4_sync_file_enter(file, datasync); > > > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > > + err = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, > > + start, end); > > + if (err) > > + goto out; > > + } > > + > > if (inode->i_sb->s_flags & MS_RDONLY) { > > /* Make sure that we read updated s_mount_flags value */ > > smp_rmb(); > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > > index 52883ac..84e95cc 100644 > > --- a/fs/xfs/xfs_file.c > > +++ b/fs/xfs/xfs_file.c > > @@ -209,7 +209,8 @@ xfs_file_fsync( > > loff_t end, > > int datasync) > > { > > - struct inode *inode = file->f_mapping->host; > > + struct address_space *mapping = file->f_mapping; > > + struct inode *inode = mapping->host; > > struct xfs_inode *ip = XFS_I(inode); > > struct xfs_mount *mp = ip->i_mount; > > int error = 0; > > @@ -218,7 +219,14 @@ xfs_file_fsync( > > > > trace_xfs_file_fsync(ip); > > > > - error = filemap_write_and_wait_range(inode->i_mapping, start, end); > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > > + error = dax_writeback_mapping_range(mapping, > > + xfs_find_bdev_for_inode(inode), start, end); > > + if (error) > > + return error; > > + } > > + > > + error = filemap_write_and_wait_range(mapping, start, end); > > if (error) > > return error; > > > > diff --git a/include/linux/dax.h b/include/linux/dax.h > > index bad27b0..8e9f114 100644 > > --- a/include/linux/dax.h > > +++ b/include/linux/dax.h > > @@ -42,6 +42,6 @@ static inline bool dax_mapping(struct address_space *mapping) > > { > > 
return mapping->host && IS_DAX(mapping->host); > > } > > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > > - loff_t end); > > +int dax_writeback_mapping_range(struct address_space *mapping, > > + struct block_device *bdev, loff_t start, loff_t end); > > #endif > > diff --git a/mm/filemap.c b/mm/filemap.c > > index bc94386..c4286eb 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -482,12 +482,6 @@ int filemap_write_and_wait_range(struct address_space *mapping, > > { > > int err = 0; > > > > - if (dax_mapping(mapping) && mapping->nrexceptional) { > > - err = dax_writeback_mapping_range(mapping, lstart, lend); > > - if (err) > > - return err; > > - } > > - > > if (mapping->nrpages) { > > err = __filemap_fdatawrite_range(mapping, lstart, lend, > > WB_SYNC_ALL); > > -- > > 2.5.0 > > > > > -- > Jan Kara > SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754935AbcBHSbz (ORCPT ); Mon, 8 Feb 2016 13:31:55 -0500 Received: from mga01.intel.com ([192.55.52.88]:38857 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753292AbcBHSbv (ORCPT ); Mon, 8 Feb 2016 13:31:51 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,417,1449561600"; d="scan'208";a="911007210" Date: Mon, 8 Feb 2016 11:31:12 -0700 From: Ross Zwisler To: Dan Williams Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160208183112.GF2343@linux.intel.com> Mail-Followup-To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4 , 
linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote: > The proposal: make applications explicitly request DAX semantics with > a new MAP_DAX flag and fail if DAX is unavailable. Document that a > successful MAP_DAX request mandates that the application assumes > responsibility for cpu cache management. > Require that all applications that mmap the file agree on MAP_DAX. I think this proposal could run into issues with aliasing. For example, say you have two threads accessing the same region, and one wants to use DAX and the other wants to use the page cache. What happens? If we satisfy both requests, we end up with one user reading and writing to the page cache, while the other is reading and writing directly to the media. They can't see each other's changes, and you get data corruption. If we satisfy the request of whoever asked first, sort of lock the inode into that mode, and then return an error to the second thread because they are asking for the other mode, we have now introduced a new weird failure case where mmaps can randomly fail based on the behavior of other applications. I think this is where you were going with the last line quoted above, but I don't understand how it would work in an acceptable way. It seems like we have to have the decision about whether or not to use DAX made in the same way for all users of the inode so that we don't run into these types of conflicts. 
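The aliasing hazard described above can be sketched in user space. This is only a toy model under stated assumptions: the "media" and "page cache" are plain byte arrays, and every toy_* name is hypothetical and not a kernel interface. It shows one user storing directly to the media while another stores to a cached copy of the same range, so neither sees the other's change, and a later writeback of the cached page overwrites the direct store.

```c
#include <string.h>

#define TOY_REGION_SIZE 8

/* Toy model (hypothetical): one range of media plus a cached copy of it. */
struct toy_file {
	char media[TOY_REGION_SIZE];      /* stands in for persistent memory */
	char page_cache[TOY_REGION_SIZE]; /* stands in for a page cache page */
};

static void toy_init(struct toy_file *f)
{
	memset(f->media, '.', TOY_REGION_SIZE);
	/* The page-cache user faults the range in once, before any stores. */
	memcpy(f->page_cache, f->media, TOY_REGION_SIZE);
}

/* DAX-style user: loads and stores hit the media directly. */
static void toy_dax_store(struct toy_file *f, int off, char c)
{
	f->media[off] = c;
}

static char toy_dax_load(const struct toy_file *f, int off)
{
	return f->media[off];
}

/* Page-cache-style user: loads and stores hit the cached copy. */
static void toy_cached_store(struct toy_file *f, int off, char c)
{
	f->page_cache[off] = c;
}

static char toy_cached_load(const struct toy_file *f, int off)
{
	return f->page_cache[off];
}

/* Writeback copies the whole cached page over the media. */
static void toy_writeback(struct toy_file *f)
{
	memcpy(f->media, f->page_cache, TOY_REGION_SIZE);
}

/*
 * Returns 1 when both problems show up: the two users cannot see each
 * other's stores, and writeback then discards the direct store.
 */
int toy_alias_demo(void)
{
	struct toy_file f;
	int blind, lost;

	toy_init(&f);
	toy_dax_store(&f, 0, 'D');    /* user A writes through the DAX view */
	toy_cached_store(&f, 1, 'P'); /* user B writes through the cache */

	blind = toy_cached_load(&f, 0) != 'D' && toy_dax_load(&f, 1) != 'P';

	toy_writeback(&f);            /* B's page goes to media, A's byte is gone */
	lost = toy_dax_load(&f, 0) != 'D';

	return blind && lost;
}
```

In this model the corruption comes purely from having two views of the same range, which is why the reply argues that the DAX decision has to be the same for every user of the inode.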
> This also solves > the future problem of DAX support on virtually tagged cache > architectures where it is difficult for the kernel to know what alias > addresses need flushing. > > [1]: https://github.com/pmem/nvml From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753142AbcBHTYB (ORCPT ); Mon, 8 Feb 2016 14:24:01 -0500 Received: from mail-yw0-f182.google.com ([209.85.161.182]:35292 "EHLO mail-yw0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751929AbcBHTX5 (ORCPT ); Mon, 8 Feb 2016 14:23:57 -0500 MIME-Version: 1.0 In-Reply-To: <20160208183112.GF2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160208183112.GF2343@linux.intel.com> Date: Mon, 8 Feb 2016 11:23:56 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems From: Dan Williams To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 8, 2016 at 10:31 AM, Ross Zwisler wrote: > On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote: >> The proposal: make applications explicitly request DAX semantics with >> a new MAP_DAX flag and fail if DAX is unavailable. Document that a >> successful MAP_DAX request mandates that the application assumes >> responsibility for cpu cache management. > >> Require that all applications that mmap the file agree on MAP_DAX. > > I think this proposal could run into issues with aliasing. 
For example, say > you have two threads accessing the same region, and one wants to use DAX and > the other wants to use the page cache. What happens? > > If we satisfy both requests, we end up with one user reading and writing to > the page cache, while the other is reading and writing directly to the media. > They can't see each other's changes, and you get data corruption. > > If we satisfy the request of whoever asked first, sort of lock the inode into > that mode, and then return an error to the second thread because they are > asking for the other mode, we have now introduced a new weird failure case > where mmaps can randomly fail based on the behavior of other applications. > I think this is where you were going with the last line quoted above, but I > don't understand how it would work in an acceptable way. > > It seems like we have to have the decision about whether or not to use DAX > made in the same way for all users of the inode so that we don't run into > these types of conflicts. We haven't solved the conflict problem by pushing it out to the inode; see the recent revert of blkdev_daxset(). We're heading in a direction where an application can't develop its own policies about DAX usage; it's always an administrative decision. However, maybe that is ok. Dave is right that if an application is using an existing filesystem it should get all the existing semantics. If the existing semantics (or overhead of maintaining the existing semantics) turn out not to fit a given pmem-aware application then we may just need new interfaces (separate from fs/dax.c) to persistent memory. I admit we're a ways off from knowing if that is needed. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755944AbcBHU6t (ORCPT ); Mon, 8 Feb 2016 15:58:49 -0500 Received: from mx1.redhat.com ([209.132.183.28]:39470 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753464AbcBHU6r (ORCPT ); Mon, 8 Feb 2016 15:58:47 -0500 From: Jeff Moyer To: Dan Williams Cc: Dave Chinner , Ross Zwisler , "linux-kernel\@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm\@lists.01.org" , XFS Developers Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> <20160208201808.GK27429@dastard> X-PGP-KeyID: 1F78E1B4 X-PGP-CertKey: F6FE 280D 8293 F72C 65FD 5A58 1FF8 A7CA 1F78 E1B4 X-PCLoadLetter: What the f**k does that mean? Date: Mon, 08 Feb 2016 15:58:44 -0500 In-Reply-To: (Dan Williams's message of "Mon, 8 Feb 2016 12:55:24 -0800") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Dan Williams writes: > I agree the mount option needs to die, and I fully grok the reasoning. > What I'm concerned with is that a system using fully-DAX-aware > applications is forced to incur the overhead of maintaining *sync > semantics, periodic sync(2) in particular, even if it is not relying > on those semantics. > > However, like I said in my other mail, we can solve that with > alternate interfaces to persistent memory if that becomes an issue and > not require that "disable *sync" capability to come through DAX. What do you envision these alternate interfaces looking like? 
-Jeff From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755478AbcBHUSP (ORCPT ); Mon, 8 Feb 2016 15:18:15 -0500 Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:12716 "EHLO ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750781AbcBHUSN (ORCPT ); Mon, 8 Feb 2016 15:18:13 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2CCCAAL97hWPBATLHleKAECgw+BP4ZjgXidSwaLZoVEhAeGBwICAQECgS5NAQEBAQEBBwEBAQFBP4RCAQEEJxMcIxAIAxgJJQ8FJQMHGhOIGr1uAQEIAgEdGIUyhH+EFgaEUAWHUIcGiB+NR458g1KKbIJlGYFcKC4BhxqBOAEBAQ Date: Tue, 9 Feb 2016 07:18:08 +1100 From: Dave Chinner To: Dan Williams Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160208201808.GK27429@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 08, 2016 at 12:18:11AM -0800, Dan Williams wrote: > On Sun, Feb 7, 2016 at 1:50 PM, Dave Chinner wrote: > > On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote: > >> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler > >> wrote: > >> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > >> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). 
> >> > dax_writeback_mapping_range() needs a struct block_device, and it used to > >> > get that from inode->i_sb->s_bdev. This is correct for normal inodes > >> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > >> > block devices and for XFS real-time files. > >> > > >> > Instead, call dax_writeback_mapping_range() directly from the filesystem or > >> > raw block device fsync/msync code so that they can supply us with a valid > >> > block device. > >> > > >> > It should be noted that this will reduce the number of calls to > >> > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > >> > called in the various filesystems for operations other than just > >> > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > >> > of ->fsync for hole punch, truncate, and block relocation > >> > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). > >> > > >> > I don't believe that these extra flushes are necessary in the DAX case. In > >> > the page cache case when we have dirty data in the page cache, that data > >> > will be actively lost if we evict a dirty page cache page without flushing > >> > it to media first. For DAX, though, the data will remain consistent with > >> > the physical address to which it was written regardless of whether it's in > >> > the processor cache or not - really the only reason I see to flush is in > >> > response to a fsync or msync so that our data is durable on media in case > >> > of a power loss. The case where we could throw dirty data out of the page > >> > cache and essentially lose writes simply doesn't exist. 
> >> > > >> > Signed-off-by: Ross Zwisler > >> > --- > >> > fs/block_dev.c | 7 +++++++ > >> > fs/dax.c | 5 ++--- > >> > fs/ext2/file.c | 10 ++++++++++ > >> > fs/ext4/fsync.c | 10 +++++++++- > >> > fs/xfs/xfs_file.c | 12 ++++++++++-- > >> > include/linux/dax.h | 4 ++-- > >> > mm/filemap.c | 6 ------ > >> > 7 files changed, 40 insertions(+), 14 deletions(-) > >> > >> This sprinkling of dax specific fixups outside of vm_operations_struct > >> routines still has me thinking that we are going in the wrong > >> direction for fsync/msync support. > >> > >> If an application is both unaware of DAX and doing mmap I/O it is > >> better served by the page cache where writeback is durable by default. > >> We expect DAX-aware applications to assume responsibility for cpu > >> cache management [1]. Making DAX mmap semantics explicit opt-in > >> solves not only durability support, but also the current problem that > >> DAX gets silently disabled leaving an app to wonder if it really got a > >> direct mapping. DAX also silently picks pud, pmd, or pte mappings > >> which is information an application would really like to know at map > >> time. > >> > >> The proposal: make applications explicitly request DAX semantics with > >> a new MAP_DAX flag and fail if DAX is unavailable. > > > > No. > > > > As I've stated before, the entire purpose of enabling DAX through > > existing filesystems like XFS and ext4 is so that existing > > applications work with DAX *without modification*. > > > > That is, applications can be entirely unaware of the fact that the > > filesystem is giving them direct access to the storage because the > > access and failure semantics of DAX enabled mmap are *identical to > > the existing mmap semantics*. > > > > Given this, the app doesn't need to care whether DAX is enabled or > > not; all that will be seen is a difference in speed of access. 
> > Enabling and disabling DAX is, at this point, purely an > > administration decision - if the hardware and filesystem supports > > it, it can be turned on without having to wait years for application > > developers to add support for it.... > > Setting aside the current block zeroing problem you seem to assuming > that DAX will always be faster and that may not be true at a media > level. Waiting years for some applications to determine if DAX makes > sense for their use case seems completely reasonable. In the meantime > the apps that are already making these changes want to know that a DAX > mapping request has not silently dropped backed to page cache. They > also want to know if they successfully jumped through all the hoops to > get a larger than pte mapping. > > I agree it is useful to be able to force DAX on an unmodified > application to see what happens, and it follows that if those > applications want to run in that mode they will need functional > fsync()... > > I would feel better if we were talking about specific applications and > performance numbers to know if forcing DAX on application is a debug > facility or a production level capability. You seem to have already > made that determination and I'm curious what I'm missing. I'm not setting any policy here at all. This whole argument is based around the DAX mount option doing "global fs enable or silently turning it off" and the application not knowing about that. The whole point of having a persistent per-inode DAX flags is that it is a policy mechanism, not a policy. The application can, if it is DAX aware, directly control whether DAX is used on a file or not. The application can even query and clear that persistent inode flag if it is configured not to (or cannot) use DAX. If the filesystem cannot support DAX, then we can error out attempts to set the DAX flag and then the app knows DAX is not available. i.e. the attempt to set policy failed. 
If the flag is set, then the inode will *always* use DAX - there is no
"fall back to page cache" when DAX is enabled.

If the application is not DAX aware, then the admin can control the
DAX policy by manipulating these flags themselves, and hence control
whether DAX is used by the application or not.

If you think I'm dictating policy for DAX users and applications,
then you haven't understood anything I've previously said about why
the DAX mount option needs to die before any of this is considered
production ready. DAX is not an opaque "all or nothing" option. XFS
will provide apps and admins with fine-grained, persistent,
discoverable policy flags to allow admins and applications to set
DAX policies however they see fit. This simply cannot be done if the
only knob you have is a mount option that may or may not stick.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755975AbcBHUz1 (ORCPT ); Mon, 8 Feb 2016 15:55:27 -0500 Received: from mail-yw0-f181.google.com ([209.85.161.181]:35305 "EHLO mail-yw0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755754AbcBHUzZ (ORCPT ); Mon, 8 Feb 2016 15:55:25 -0500 MIME-Version: 1.0 In-Reply-To: <20160208201808.GK27429@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> <20160208201808.GK27429@dastard> Date: Mon, 8 Feb 2016 12:55:24 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems From: Dan Williams To: Dave Chinner Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer Content-Type: text/plain; charset=UTF-8 Sender:
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner wrote: [..] >> Setting aside the current block zeroing problem you seem to assuming >> that DAX will always be faster and that may not be true at a media >> level. Waiting years for some applications to determine if DAX makes >> sense for their use case seems completely reasonable. In the meantime >> the apps that are already making these changes want to know that a DAX >> mapping request has not silently dropped backed to page cache. They >> also want to know if they successfully jumped through all the hoops to >> get a larger than pte mapping. >> >> I agree it is useful to be able to force DAX on an unmodified >> application to see what happens, and it follows that if those >> applications want to run in that mode they will need functional >> fsync()... >> >> I would feel better if we were talking about specific applications and >> performance numbers to know if forcing DAX on application is a debug >> facility or a production level capability. You seem to have already >> made that determination and I'm curious what I'm missing. > > I'm not setting any policy here at all. This whole argument is > based around the DAX mount option doing "global fs enable or > silently turning it off" and the application not knowing about that. > > The whole point of having a persistent per-inode DAX flags is that > it is a policy mechanism, not a policy. The application can, if it > is DAX aware, directly control whether DAX is used on a file or not. > The application can even query and clear that persistent inode flag > if it is configured not to (or cannot) use DAX. > > If the filesystem cannot support DAX, then we can error out attempts > to set the DAX flag and then the app knows DAX is not available. > i.e. the attempt to set policy failed. 
If the flag is set, then the
> inode will *always* use DAX - there is no "fall back to page cache"
> when DAX is enabled.
>
> If the application is not DAX aware, then the admin can control the
> DAX policy by manipulating these flags themselves, and hence control
> whether DAX is used by the application or not.
>
> If you think I'm dictating policy for DAX users and applications,
> then you haven't understood anything I've previously said about why
> the DAX mount option needs to die before any of this is considered
> production ready. DAX is not an opaque "all or nothing" option. XFS
> will provide apps and admins with fine-grained, persistent,
> discoverable policy flags to allow admins and applications to set
> DAX policies however they see fit. This simply cannot be done if the
> only knob you have is a mount option that may or may not stick.

I agree the mount option needs to die, and I fully grok the reasoning.
What I'm concerned with is that a system using fully-DAX-aware
applications is forced to incur the overhead of maintaining *sync
semantics, periodic sync(2) in particular, even if it is not relying
on those semantics.

However, like I said in my other mail, we can solve that with
alternate interfaces to persistent memory if that becomes an issue and
not require that "disable *sync" capability to come through DAX.
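
[Archive note] The per-inode flag discussed above ("the application can
even query and clear that persistent inode flag") could be checked from
userspace roughly as follows. This is only a sketch under assumptions: it
assumes the flag is exposed as the FS_XFLAG_DAX bit through the
FS_IOC_FSGETXATTR ioctl in <linux/fs.h>, neither of which is named in this
thread, and the helper names xflags_has_dax() and file_dax_flag() are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FSGETXATTR, FS_XFLAG_DAX */

/* Hypothetical helper: does this xflags word carry the per-inode DAX bit? */
static bool xflags_has_dax(unsigned int xflags)
{
	return (xflags & FS_XFLAG_DAX) != 0;
}

/*
 * Hypothetical helper: query the extended inode flags of 'path'.
 * Returns 1 if the DAX bit is set, 0 if it is clear, and -1 with errno
 * set if the file cannot be opened or the ioctl is not supported.
 */
static int file_dax_flag(const char *path)
{
	struct fsxattr fa;
	int fd, saved_errno;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fa) < 0) {
		/* Preserve the ioctl error across close(). */
		saved_errno = errno;
		close(fd);
		errno = saved_errno;
		return -1;
	}

	close(fd);
	return xflags_has_dax(fa.fsx_xflags) ? 1 : 0;
}
```

A DAX-aware application could call file_dax_flag() on its data file and
treat a -1 return (typically ENOTTY when the filesystem has no such
interface) as "DAX policy cannot be queried here", instead of guessing
from mount options.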
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932276AbcBHWGM (ORCPT ); Mon, 8 Feb 2016 17:06:12 -0500 Received: from mail-yw0-f173.google.com ([209.85.161.173]:34774 "EHLO mail-yw0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756594AbcBHWFe (ORCPT ); Mon, 8 Feb 2016 17:05:34 -0500 MIME-Version: 1.0 In-Reply-To: References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> <20160208201808.GK27429@dastard> Date: Mon, 8 Feb 2016 14:05:34 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems From: Dan Williams To: Jeff Moyer Cc: Dave Chinner , Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 8, 2016 at 12:58 PM, Jeff Moyer wrote: > Dan Williams writes: > >> I agree the mount option needs to die, and I fully grok the reasoning. >> What I'm concerned with is that a system using fully-DAX-aware >> applications is forced to incur the overhead of maintaining *sync >> semantics, periodic sync(2) in particular, even if it is not relying >> on those semantics. >> >> However, like I said in my other mail, we can solve that with >> alternate interfaces to persistent memory if that becomes an issue and >> not require that "disable *sync" capability to come through DAX. > > What do you envision these alternate interfaces looking like? Well, plan-A was making DAX be explicit opt-in for applications, I haven't thought too much about plan-B. 
I expect it to be driven by real performance numbers and application use cases once the *sync compat work completes. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756436AbcBIJno (ORCPT ); Tue, 9 Feb 2016 04:43:44 -0500 Received: from mx2.suse.de ([195.135.220.15]:39321 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754559AbcBIJnk (ORCPT ); Tue, 9 Feb 2016 04:43:40 -0500 Date: Tue, 9 Feb 2016 10:43:53 +0100 From: Jan Kara To: Dan Williams Cc: Dave Chinner , Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160209094353.GF9451@quack.suse.cz> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> <20160208201808.GK27429@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 08-02-16 12:55:24, Dan Williams wrote: > On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner wrote: > [..] > >> Setting aside the current block zeroing problem you seem to assuming > >> that DAX will always be faster and that may not be true at a media > >> level. Waiting years for some applications to determine if DAX makes > >> sense for their use case seems completely reasonable. In the meantime > >> the apps that are already making these changes want to know that a DAX > >> mapping request has not silently dropped backed to page cache. 
They > >> also want to know if they successfully jumped through all the hoops to > >> get a larger than pte mapping. > >> > >> I agree it is useful to be able to force DAX on an unmodified > >> application to see what happens, and it follows that if those > >> applications want to run in that mode they will need functional > >> fsync()... > >> > >> I would feel better if we were talking about specific applications and > >> performance numbers to know if forcing DAX on application is a debug > >> facility or a production level capability. You seem to have already > >> made that determination and I'm curious what I'm missing. > > > > I'm not setting any policy here at all. This whole argument is > > based around the DAX mount option doing "global fs enable or > > silently turning it off" and the application not knowing about that. > > > > The whole point of having a persistent per-inode DAX flags is that > > it is a policy mechanism, not a policy. The application can, if it > > is DAX aware, directly control whether DAX is used on a file or not. > > The application can even query and clear that persistent inode flag > > if it is configured not to (or cannot) use DAX. > > > > If the filesystem cannot support DAX, then we can error out attempts > > to set the DAX flag and then the app knows DAX is not available. > > i.e. the attempt to set policy failed. If the flag is set, then the > > inode will *always* use DAX - there is no "fall back to page cache" > > when DAX is enabled. > > > > If the applicaiton is not DAX aware, then the admin can control the > > DAX policy by manipulating these flags themselves, and hence control > > whether DAX is used by the application or not. > > > > If you think I'm dictating policy for DAX users and application, > > then you haven't understood anything I've previously said about why > > the DAX mount option needs to die before any of this is considered > > production ready. DAX is not an opaque "all or nothing" option. 
XFS > > will provide apps and admins with fine-grained, persistent, > > discoverable policy flags to allow admins and applications to set > > DAX policies however they see fit. This simply cannot be done if the > > only knob you have is a mount option that may or may not stick. > > I agree the mount option needs to die, and I fully grok the reasoning. > What I'm concerned with is that a system using fully-DAX-aware > applications is forced to incur the overhead of maintaining *sync > semantics, periodic sync(2) in particular, even if it is not relying > on those semantics. Let me somewhat correct this: IMO hard requirement is maintaining sync(2) semantics. Periodic writeback does not have any hard durability guarantees and we are free to ignore such requests in ->writepages() (that function has enough information in the writeback_control structure to differentiate between periodic writeback and data integrity sync) if we decide it is useful. Actually, we could do that even for 4.5. Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932652AbcBIQBZ (ORCPT ); Tue, 9 Feb 2016 11:01:25 -0500 Received: from mx2.suse.de ([195.135.220.15]:53606 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756437AbcBIQBX (ORCPT ); Tue, 9 Feb 2016 11:01:23 -0500 Date: Tue, 9 Feb 2016 17:01:34 +0100 From: Jan Kara To: Dan Williams Cc: Dave Chinner , Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160209160134.GA12245@quack.suse.cz> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> 
<20160207215047.GJ31407@dastard> <20160208201808.GK27429@dastard> <20160209094353.GF9451@quack.suse.cz> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="LQksG6bCIzRHxTLp" Content-Disposition: inline In-Reply-To: <20160209094353.GF9451@quack.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org --LQksG6bCIzRHxTLp Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Tue 09-02-16 10:43:53, Jan Kara wrote: > On Mon 08-02-16 12:55:24, Dan Williams wrote: > > On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner wrote: > > [..] > > >> Setting aside the current block zeroing problem you seem to assuming > > >> that DAX will always be faster and that may not be true at a media > > >> level. Waiting years for some applications to determine if DAX makes > > >> sense for their use case seems completely reasonable. In the meantime > > >> the apps that are already making these changes want to know that a DAX > > >> mapping request has not silently dropped backed to page cache. They > > >> also want to know if they successfully jumped through all the hoops to > > >> get a larger than pte mapping. > > >> > > >> I agree it is useful to be able to force DAX on an unmodified > > >> application to see what happens, and it follows that if those > > >> applications want to run in that mode they will need functional > > >> fsync()... > > >> > > >> I would feel better if we were talking about specific applications and > > >> performance numbers to know if forcing DAX on application is a debug > > >> facility or a production level capability. You seem to have already > > >> made that determination and I'm curious what I'm missing. > > > > > > I'm not setting any policy here at all. This whole argument is > > > based around the DAX mount option doing "global fs enable or > > > silently turning it off" and the application not knowing about that. 
> > > > > > The whole point of having a persistent per-inode DAX flags is that > > > it is a policy mechanism, not a policy. The application can, if it > > > is DAX aware, directly control whether DAX is used on a file or not. > > > The application can even query and clear that persistent inode flag > > > if it is configured not to (or cannot) use DAX. > > > > > > If the filesystem cannot support DAX, then we can error out attempts > > > to set the DAX flag and then the app knows DAX is not available. > > > i.e. the attempt to set policy failed. If the flag is set, then the > > > inode will *always* use DAX - there is no "fall back to page cache" > > > when DAX is enabled. > > > > > > If the applicaiton is not DAX aware, then the admin can control the > > > DAX policy by manipulating these flags themselves, and hence control > > > whether DAX is used by the application or not. > > > > > > If you think I'm dictating policy for DAX users and application, > > > then you haven't understood anything I've previously said about why > > > the DAX mount option needs to die before any of this is considered > > > production ready. DAX is not an opaque "all or nothing" option. XFS > > > will provide apps and admins with fine-grained, persistent, > > > discoverable policy flags to allow admins and applications to set > > > DAX policies however they see fit. This simply cannot be done if the > > > only knob you have is a mount option that may or may not stick. > > > > I agree the mount option needs to die, and I fully grok the reasoning. > > What I'm concerned with is that a system using fully-DAX-aware > > applications is forced to incur the overhead of maintaining *sync > > semantics, periodic sync(2) in particular, even if it is not relying > > on those semantics. > > Let me somewhat correct this: IMO hard requirement is maintaining sync(2) > semantics. 
Periodic writeback does not have any hard durability guarantees
> and we are free to ignore such requests in ->writepages() (that function
> has enough information in the writeback_control structure to differentiate
> between periodic writeback and data integrity sync) if we decide it is
> useful. Actually, we could do that even for 4.5.

Attached is a version of Ross' patch that will work for sync(2) and
fsync(2) and we won't flush caches during periodic writeback. The patch is
only compile-tested. Ross?

Honza
-- 
Jan Kara
SUSE Labs, CR

--LQksG6bCIzRHxTLp
Content-Type: text/x-patch; charset=us-ascii
Content-Disposition: attachment; filename="0001-dax-move-writeback-calls-into-the-filesystems.patch"

From f7280a34d235031c5dbf3f5a345c4b64e452f097 Mon Sep 17 00:00:00 2001
From: Ross Zwisler
Date: Sun, 7 Feb 2016 00:19:13 -0700
Subject: [PATCH] dax: move writeback calls into the filesystems

Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

dax_writeback_mapping_range() needs a struct block_device, and it used to
get that from inode->i_sb->s_bdev. This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device. This also fixes DAX code to properly flush caches in response to
sync(2).
Signed-off-by: Ross Zwisler Signed-off-by: Jan Kara --- fs/block_dev.c | 13 ++++++++++++- fs/dax.c | 12 +++++++----- fs/ext2/inode.c | 8 ++++++++ fs/ext4/fsync.c | 1 - fs/ext4/inode.c | 4 ++++ fs/xfs/xfs_aops.c | 5 +++++ include/linux/dax.h | 7 +++++-- mm/filemap.c | 12 ++++-------- 8 files changed, 45 insertions(+), 17 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 39b3a174a425..271d38aa6cbb 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1693,13 +1693,24 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) return try_to_free_buffers(page); } +static int blkdev_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + if (dax_mapping(mapping)) { + struct block_device *bdev = I_BDEV(mapping->host); + + return dax_writeback_mapping_range(mapping, bdev, wbc); + } + return generic_writepages(mapping, wbc); +} + static const struct address_space_operations def_blk_aops = { .readpage = blkdev_readpage, .readpages = blkdev_readpages, .writepage = blkdev_writepage, .write_begin = blkdev_write_begin, .write_end = blkdev_write_end, - .writepages = generic_writepages, + .writepages = blkdev_writepages, .releasepage = blkdev_releasepage, .direct_IO = blkdev_direct_IO, .is_dirty_writeback = buffer_check_dirty_writeback, diff --git a/fs/dax.c b/fs/dax.c index fc2e3141138b..2f4965214783 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -485,11 +485,10 @@ static int dax_writeback_one(struct block_device *bdev, * end]. This is required by data integrity operations to ensure file data is * on persistent storage prior to completion of the operation. 
*/ -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end) +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, struct writeback_control *wbc) { struct inode *inode = mapping->host; - struct block_device *bdev = inode->i_sb->s_bdev; pgoff_t start_index, end_index, pmd_index; pgoff_t indices[PAGEVEC_SIZE]; struct pagevec pvec; @@ -500,8 +499,11 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT)) return -EIO; - start_index = start >> PAGE_CACHE_SHIFT; - end_index = end >> PAGE_CACHE_SHIFT; + if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL) + return 0; + + start_index = wbc->range_start >> PAGE_CACHE_SHIFT; + end_index = wbc->range_end >> PAGE_CACHE_SHIFT; pmd_index = DAX_PMD_INDEX(start_index); rcu_read_lock(); diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 338eefda70c6..ee05e945f40c 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -874,6 +874,14 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset) static int ext2_writepages(struct address_space *mapping, struct writeback_control *wbc) { +#ifdef CONFIG_FS_DAX + if (dax_mapping(mapping)) { + return dax_writeback_mapping_range(mapping, + mapping->host->i_sb->s_bdev, + wbc); + } +#endif + return mpage_writepages(mapping, wbc, ext2_get_block); } diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 8850254136ae..b7136227d0f8 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -83,7 +83,6 @@ static int ext4_sync_parent(struct inode *inode) * What we do is just kick off a commit and wait on it. This will snapshot the * inode to disk. 
*/ - int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) { struct inode *inode = file->f_mapping->host; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 83bc8bfb3bea..19989c12187a 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2450,6 +2450,10 @@ static int ext4_writepages(struct address_space *mapping, trace_ext4_writepages(inode, wbc); + if (dax_mapping(mapping)) + return dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, + wbc); + /* * No pages to write? This is mainly a kludge to avoid starting * a transaction for special inodes like journal inode on last iput() diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 379c089fb051..fd0839278442 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -1208,6 +1208,11 @@ xfs_vm_writepages( struct writeback_control *wbc) { xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED); + if (dax_mapping(mapping)) { + return dax_writeback_mapping_range(mapping, + xfs_find_bdev_for_inode(mapping->host), wbc); + } + return generic_writepages(mapping, wbc); } diff --git a/include/linux/dax.h b/include/linux/dax.h index 818e45078929..05d7d043d3bd 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -52,6 +52,9 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end); + +struct writeback_control; + +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, struct writeback_control *wbc); #endif diff --git a/mm/filemap.c b/mm/filemap.c index bc943867d68c..af3eec1a8c5e 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -446,7 +446,8 @@ int filemap_write_and_wait(struct address_space *mapping) { int err = 0; - if (mapping->nrpages) { + if ((!dax_mapping(mapping) && mapping->nrpages) || + (dax_mapping(mapping) && mapping->nrexceptional)) { err = filemap_fdatawrite(mapping); /* * Even if the above 
returned error, the pages may be @@ -482,13 +483,8 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; - if (dax_mapping(mapping) && mapping->nrexceptional) { - err = dax_writeback_mapping_range(mapping, lstart, lend); - if (err) - return err; - } - - if (mapping->nrpages) { + if ((!dax_mapping(mapping) && mapping->nrpages) || + (dax_mapping(mapping) && mapping->nrexceptional)) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); /* See comment of filemap_write_and_wait() */ -- 2.6.2 --LQksG6bCIzRHxTLp-- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756410AbcBISGz (ORCPT ); Tue, 9 Feb 2016 13:06:55 -0500 Received: from mga01.intel.com ([192.55.52.88]:43589 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753084AbcBISGw (ORCPT ); Tue, 9 Feb 2016 13:06:52 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,421,1449561600"; d="scan'208";a="45097506" Date: Tue, 9 Feb 2016 11:06:47 -0700 From: Ross Zwisler To: Jan Kara Cc: Dan Williams , Dave Chinner , Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160209180647.GA2450@linux.intel.com> Mail-Followup-To: Ross Zwisler , Jan Kara , Dan Williams , Dave Chinner , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> <20160208201808.GK27429@dastard> 
<20160209094353.GF9451@quack.suse.cz> <20160209160134.GA12245@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160209160134.GA12245@quack.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 09, 2016 at 05:01:34PM +0100, Jan Kara wrote: > On Tue 09-02-16 10:43:53, Jan Kara wrote: > > On Mon 08-02-16 12:55:24, Dan Williams wrote: > > > On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner wrote: > > > [..] > > > >> Setting aside the current block zeroing problem you seem to assuming > > > >> that DAX will always be faster and that may not be true at a media > > > >> level. Waiting years for some applications to determine if DAX makes > > > >> sense for their use case seems completely reasonable. In the meantime > > > >> the apps that are already making these changes want to know that a DAX > > > >> mapping request has not silently dropped backed to page cache. They > > > >> also want to know if they successfully jumped through all the hoops to > > > >> get a larger than pte mapping. > > > >> > > > >> I agree it is useful to be able to force DAX on an unmodified > > > >> application to see what happens, and it follows that if those > > > >> applications want to run in that mode they will need functional > > > >> fsync()... > > > >> > > > >> I would feel better if we were talking about specific applications and > > > >> performance numbers to know if forcing DAX on application is a debug > > > >> facility or a production level capability. You seem to have already > > > >> made that determination and I'm curious what I'm missing. > > > > > > > > I'm not setting any policy here at all. This whole argument is > > > > based around the DAX mount option doing "global fs enable or > > > > silently turning it off" and the application not knowing about that. 
> > > > > > > > The whole point of having a persistent per-inode DAX flags is that > > > > it is a policy mechanism, not a policy. The application can, if it > > > > is DAX aware, directly control whether DAX is used on a file or not. > > > > The application can even query and clear that persistent inode flag > > > > if it is configured not to (or cannot) use DAX. > > > > > > > > If the filesystem cannot support DAX, then we can error out attempts > > > > to set the DAX flag and then the app knows DAX is not available. > > > > i.e. the attempt to set policy failed. If the flag is set, then the > > > > inode will *always* use DAX - there is no "fall back to page cache" > > > > when DAX is enabled. > > > > > > > > If the applicaiton is not DAX aware, then the admin can control the > > > > DAX policy by manipulating these flags themselves, and hence control > > > > whether DAX is used by the application or not. > > > > > > > > If you think I'm dictating policy for DAX users and application, > > > > then you haven't understood anything I've previously said about why > > > > the DAX mount option needs to die before any of this is considered > > > > production ready. DAX is not an opaque "all or nothing" option. XFS > > > > will provide apps and admins with fine-grained, persistent, > > > > discoverable policy flags to allow admins and applications to set > > > > DAX policies however they see fit. This simply cannot be done if the > > > > only knob you have is a mount option that may or may not stick. > > > > > > I agree the mount option needs to die, and I fully grok the reasoning. > > > What I'm concerned with is that a system using fully-DAX-aware > > > applications is forced to incur the overhead of maintaining *sync > > > semantics, periodic sync(2) in particular, even if it is not relying > > > on those semantics. > > > > Let me somewhat correct this: IMO hard requirement is maintaining sync(2) > > semantics. 
Periodic writeback does not have any hard durability guarantees > > and we are free to ignore such requests in ->writepages() (that function > > has enough information in the writeback_control structure to differentiate > > between periodic writeback and data integrity sync) if we decide it is > > useful. Actually, we could do that even for 4.5. > > Attached is a version of Ross' patch that will work for sync(2) and > fsync(2) and we won't flush caches during periodic writeback. The patch is > only compile-tested. Ross? This looks great. I'll send out a v2 with this and with the dax_clear_sectors() changes after I'm done testing. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Jeff Moyer To: Dan Williams Cc: Dave Chinner , Ross Zwisler , "linux-kernel\@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm\@lists.01.org" , XFS Developers Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> <20160208201808.GK27429@dastard> Date: Mon, 08 Feb 2016 15:58:44 -0500 In-Reply-To: (Dan Williams's message of "Mon, 8 Feb 2016 12:55:24 -0800") Message-ID: MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: Dan Williams writes: > I agree the mount option needs to die, and I fully grok the reasoning. > What I'm concerned with is that a system using fully-DAX-aware > applications is forced to incur the overhead of maintaining *sync > semantics, periodic sync(2) in particular, even if it is not relying > on those semantics. > > However, like I said in my other mail, we can solve that with > alternate interfaces to persistent memory if that becomes an issue and > not require that "disable *sync" capability to come through DAX. 
What do you envision these alternate interfaces looking like? -Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: [PATCH 2/2] dax: move writeback calls into the filesystems Date: Sun, 7 Feb 2016 00:19:13 -0700 Message-ID: <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> Cc: Ross Zwisler , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com To: linux-kernel@vger.kernel.org Return-path: In-Reply-To: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org Previously calls to dax_writeback_mapping_range() for all DAX filesystems (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). dax_writeback_mapping_range() needs a struct block_device, and it used to get that from inode->i_sb->s_bdev. This is correct for normal inodes mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time files. Instead, call dax_writeback_mapping_range() directly from the filesystem or raw block device fsync/msync code so that they can supply us with a valid block device. It should be noted that this will reduce the number of calls to dax_writeback_mapping_range() because filemap_write_and_wait_range() is called in the various filesystems for operations other than just fsync/msync. 
Both ext4 & XFS call filemap_write_and_wait_range() outside of ->fsync for hole punch, truncate, and block relocation (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). I don't believe that these extra flushes are necessary in the DAX case. In the page cache case when we have dirty data in the page cache, that data will be actively lost if we evict a dirty page cache page without flushing it to media first. For DAX, though, the data will remain consistent with the physical address to which it was written regardless of whether it's in the processor cache or not - really the only reason I see to flush is in response to a fsync or msync so that our data is durable on media in case of a power loss. The case where we could throw dirty data out of the page cache and essentially lose writes simply doesn't exist. Signed-off-by: Ross Zwisler --- fs/block_dev.c | 7 +++++++ fs/dax.c | 5 ++--- fs/ext2/file.c | 10 ++++++++++ fs/ext4/fsync.c | 10 +++++++++- fs/xfs/xfs_file.c | 12 ++++++++++-- include/linux/dax.h | 4 ++-- mm/filemap.c | 6 ------ 7 files changed, 40 insertions(+), 14 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index fa0507a..312ad44 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -356,8 +356,15 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync) { struct inode *bd_inode = bdev_file_inode(filp); struct block_device *bdev = I_BDEV(bd_inode); + struct address_space *mapping = bd_inode->i_mapping; int error; + if (dax_mapping(mapping) && mapping->nrexceptional) { + error = dax_writeback_mapping_range(mapping, bdev, start, end); + if (error) + return error; + } + error = filemap_write_and_wait_range(filp->f_mapping, start, end); if (error) return error; diff --git a/fs/dax.c b/fs/dax.c index 4592241..4b5006a 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev, * end]. 
This is required by data integrity operations to ensure file data is * on persistent storage prior to completion of the operation. */ -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end) +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, loff_t start, loff_t end) { struct inode *inode = mapping->host; - struct block_device *bdev = inode->i_sb->s_bdev; pgoff_t start_index, end_index, pmd_index; pgoff_t indices[PAGEVEC_SIZE]; struct pagevec pvec; diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 2c88d68..d1abf53 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -162,6 +162,16 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) int ret; struct super_block *sb = file->f_mapping->host->i_sb; struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; +#ifdef CONFIG_FS_DAX + struct address_space *inode_mapping = file->f_inode->i_mapping; + + if (dax_mapping(inode_mapping) && inode_mapping->nrexceptional) { + ret = dax_writeback_mapping_range(inode_mapping, sb->s_bdev, + start, end); + if (ret) + return ret; + } +#endif ret = generic_file_fsync(file, start, end, datasync); if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) { diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 8850254..e9cf53b 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "ext4.h" #include "ext4_jbd2.h" @@ -83,10 +84,10 @@ static int ext4_sync_parent(struct inode *inode) * What we do is just kick off a commit and wait on it. This will snapshot the * inode to disk. 
*/ - int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) { struct inode *inode = file->f_mapping->host; + struct address_space *mapping = inode->i_mapping; struct ext4_inode_info *ei = EXT4_I(inode); journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; int ret = 0, err; @@ -97,6 +98,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) trace_ext4_sync_file_enter(file, datasync); + if (dax_mapping(mapping) && mapping->nrexceptional) { + err = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, + start, end); + if (err) + goto out; + } + if (inode->i_sb->s_flags & MS_RDONLY) { /* Make sure that we read updated s_mount_flags value */ smp_rmb(); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 52883ac..84e95cc 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -209,7 +209,8 @@ xfs_file_fsync( loff_t end, int datasync) { - struct inode *inode = file->f_mapping->host; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; struct xfs_inode *ip = XFS_I(inode); struct xfs_mount *mp = ip->i_mount; int error = 0; @@ -218,7 +219,14 @@ xfs_file_fsync( trace_xfs_file_fsync(ip); - error = filemap_write_and_wait_range(inode->i_mapping, start, end); + if (dax_mapping(mapping) && mapping->nrexceptional) { + error = dax_writeback_mapping_range(mapping, + xfs_find_bdev_for_inode(inode), start, end); + if (error) + return error; + } + + error = filemap_write_and_wait_range(mapping, start, end); if (error) return error; diff --git a/include/linux/dax.h b/include/linux/dax.h index bad27b0..8e9f114 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -42,6 +42,6 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end); +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, loff_t 
start, loff_t end); #endif diff --git a/mm/filemap.c b/mm/filemap.c index bc94386..c4286eb 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -482,12 +482,6 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; - if (dax_mapping(mapping) && mapping->nrexceptional) { - err = dax_writeback_mapping_range(mapping, lstart, lend); - if (err) - return err; - } - if (mapping->nrpages) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); -- 2.5.0 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dan Williams Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Date: Mon, 8 Feb 2016 00:18:11 -0800 Message-ID: References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Cc: Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer To: Dave Chinner Return-path: In-Reply-To: <20160207215047.GJ31407@dastard> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org On Sun, Feb 7, 2016 at 1:50 PM, Dave Chinner wrote: > On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote: >> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler >> wrote: >> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems >> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). >> > dax_writeback_mapping_range() needs a struct block_device, and it used to >> > get that from inode->i_sb->s_bdev. This is correct for normal inodes >> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw >> > block devices and for XFS real-time files. 
>> > >> > Instead, call dax_writeback_mapping_range() directly from the filesystem or >> > raw block device fsync/msync code so that they can supply us with a valid >> > block device. >> > >> > It should be noted that this will reduce the number of calls to >> > dax_writeback_mapping_range() because filemap_write_and_wait_range() is >> > called in the various filesystems for operations other than just >> > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside >> > of ->fsync for hole punch, truncate, and block relocation >> > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). >> > >> > I don't believe that these extra flushes are necessary in the DAX case. In >> > the page cache case when we have dirty data in the page cache, that data >> > will be actively lost if we evict a dirty page cache page without flushing >> > it to media first. For DAX, though, the data will remain consistent with >> > the physical address to which it was written regardless of whether it's in >> > the processor cache or not - really the only reason I see to flush is in >> > response to a fsync or msync so that our data is durable on media in case >> > of a power loss. The case where we could throw dirty data out of the page >> > cache and essentially lose writes simply doesn't exist. >> > >> > Signed-off-by: Ross Zwisler >> > --- >> > fs/block_dev.c | 7 +++++++ >> > fs/dax.c | 5 ++--- >> > fs/ext2/file.c | 10 ++++++++++ >> > fs/ext4/fsync.c | 10 +++++++++- >> > fs/xfs/xfs_file.c | 12 ++++++++++-- >> > include/linux/dax.h | 4 ++-- >> > mm/filemap.c | 6 ------ >> > 7 files changed, 40 insertions(+), 14 deletions(-) >> >> This sprinkling of dax specific fixups outside of vm_operations_struct >> routines still has me thinking that we are going in the wrong >> direction for fsync/msync support. >> >> If an application is both unaware of DAX and doing mmap I/O it is >> better served by the page cache where writeback is durable by default. 
>> We expect DAX-aware applications to assume responsibility for cpu
>> cache management [1]. Making DAX mmap semantics explicit opt-in
>> solves not only durability support, but also the current problem that
>> DAX gets silently disabled leaving an app to wonder if it really got a
>> direct mapping. DAX also silently picks pud, pmd, or pte mappings
>> which is information an application would really like to know at map
>> time.
>>
>> The proposal: make applications explicitly request DAX semantics with
>> a new MAP_DAX flag and fail if DAX is unavailable.
>
> No.
>
> As I've stated before, the entire purpose of enabling DAX through
> existing filesystems like XFS and ext4 is so that existing
> applications work with DAX *without modification*.
>
> That is, applications can be entirely unaware of the fact that the
> filesystem is giving them direct access to the storage because the
> access and failure semantics of DAX enabled mmap are *identical to
> the existing mmap semantics*.
>
> Given this, the app doesn't need to care whether DAX is enabled or
> not; all that will be seen is a difference in speed of access.
> Enabling and disabling DAX is, at this point, purely an
> administration decision - if the hardware and filesystem supports
> it, it can be turned on without having to wait years for application
> developers to add support for it....

Setting aside the current block zeroing problem you seem to be assuming
that DAX will always be faster and that may not be true at a media
level. Waiting years for some applications to determine if DAX makes
sense for their use case seems completely reasonable. In the meantime
the apps that are already making these changes want to know that a DAX
mapping request has not silently dropped back to page cache. They
also want to know if they successfully jumped through all the hoops to
get a larger than pte mapping.
I agree it is useful to be able to force DAX on an unmodified application to see what happens, and it follows that if those applications want to run in that mode they will need functional fsync()... I would feel better if we were talking about specific applications and performance numbers to know if forcing DAX on application is a debug facility or a production level capability. You seem to have already made that determination and I'm curious what I'm missing. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Date: Mon, 8 Feb 2016 08:34:43 -0700 Message-ID: <20160208153443.GC2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> <20160207220329.GK31407@dastard> <20160208014409.GA2343@linux.intel.com> <20160208051725.GM31407@dastard> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Ross Zwisler , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com To: Dave Chinner Return-path: Content-Disposition: inline In-Reply-To: <20160208051725.GM31407@dastard> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org On Mon, Feb 08, 2016 at 04:17:25PM +1100, Dave Chinner wrote: > On Sun, Feb 07, 2016 at 06:44:09PM -0700, Ross Zwisler wrote: > > On Mon, Feb 08, 2016 at 09:03:29AM +1100, Dave Chinner wrote: > > > On Sun, Feb 07, 2016 at 12:19:12AM -0700, Ross Zwisler wrote: > > > > dax_clear_blocks() needs a valid struct block_device and previously it was > > > > using inode->i_sb->s_bdev in all cases. 
This is correct for normal inodes
> > > > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> > > > block devices and for XFS real-time devices.
> > > >
> > > > Instead, have the caller pass in a struct block_device pointer which it
> > > > knows to be correct.
> > > ....
> > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > > index 07ef29b..f722ba2 100644
> > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > @@ -73,9 +73,11 @@ xfs_zero_extent(
> > > >      xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb);
> > > >      sector_t block = XFS_BB_TO_FSBT(mp, sector);
> > > >      ssize_t size = XFS_FSB_TO_B(mp, count_fsb);
> > > > +    struct inode *inode = VFS_I(ip);
> > > >
> > > >      if (IS_DAX(VFS_I(ip)))
> > > > -        return dax_clear_blocks(VFS_I(ip), block, size);
> > > > +        return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode),
> > > > +                block, size);
> > >
> > > Get rid of the local inode variable and use VFS_I(ip) like the code
> > > originally did. Do not change code that is unrelated to the
> > > modification being made, especially when it results in making
> > > the code an inconsistent mess of mixed pointer constructs....
> >
> > The local 'inode' variable was added to avoid multiple calls to VFS_I() for
> > the same 'ip'.
>
> My point is you didn't achieve that. The end result of your patch
> is:
>
>     struct inode *inode = VFS_I(ip);
>
>     if (IS_DAX(VFS_I(ip)))
>         return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode),
>                 block, size);
>
> So now we have a local variable, but we still have 2 calls to
> VFS_I(ip). i.e. this makes the code harder to read and understand
> than before for no benefit.

*facepalm* Yep, thanks for the correction.
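
As a rough illustration of why inode->i_sb->s_bdev is not always the device that holds a file's data, here is a user-space sketch of the lookup that the patch delegates to xfs_find_bdev_for_inode(). Every type and name below (model_mount, model_inode, is_realtime, model_find_bdev_for_inode()) is a hypothetical stand-in invented for this sketch and not the kernel's definition; the only thing taken from the thread is the idea that a real-time file lives on a device other than the superblock's s_bdev.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are NOT the kernel definitions. */
struct block_device {
	const char *name;
};

struct super_block {
	struct block_device *s_bdev;	/* the filesystem's main (data) device */
};

/* Assumed model of a mount that has a separate real-time device. */
struct model_mount {
	struct block_device *data_bdev;
	struct block_device *rt_bdev;
};

struct model_inode {
	struct super_block *i_sb;
	struct model_mount *mount;
	int is_realtime;		/* nonzero if file data lives on rt_bdev */
};

/*
 * Illustrative analogue of xfs_find_bdev_for_inode(): return the device
 * that actually backs the file's data, which may differ from i_sb->s_bdev.
 */
struct block_device *model_find_bdev_for_inode(const struct model_inode *inode)
{
	assert(inode != NULL && inode->mount != NULL);
	return inode->is_realtime ? inode->mount->rt_bdev
				  : inode->mount->data_bdev;
}
```

For a regular file in this model the helper and i_sb->s_bdev agree; for a real-time file they differ, which is the case where a callee that derives the bdev from the superblock would operate on the wrong device.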
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ross Zwisler
Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems
Date: Tue, 9 Feb 2016 11:06:47 -0700
Message-ID: <20160209180647.GA2450@linux.intel.com>
References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> <20160208201808.GK27429@dastard> <20160209094353.GF9451@quack.suse.cz> <20160209160134.GA12245@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Dan Williams, Dave Chinner, Ross Zwisler, "linux-kernel@vger.kernel.org", Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, Linux MM, "linux-nvdimm@lists.01.org", XFS Developers, jmoyer
To: Jan Kara
Return-path:
Content-Disposition: inline
In-Reply-To: <20160209160134.GA12245@quack.suse.cz>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: linux-ext4.vger.kernel.org

On Tue, Feb 09, 2016 at 05:01:34PM +0100, Jan Kara wrote:
> On Tue 09-02-16 10:43:53, Jan Kara wrote:
> > On Mon 08-02-16 12:55:24, Dan Williams wrote:
> > > On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner wrote:
> > > [..]
> > > >> Setting aside the current block zeroing problem you seem to be assuming
> > > >> that DAX will always be faster and that may not be true at a media
> > > >> level. Waiting years for some applications to determine if DAX makes
> > > >> sense for their use case seems completely reasonable. In the meantime
> > > >> the apps that are already making these changes want to know that a DAX
> > > >> mapping request has not silently dropped back to page cache. They
> > > >> also want to know if they successfully jumped through all the hoops to
> > > >> get a larger than pte mapping.
> > > >>
> > > >> I agree it is useful to be able to force DAX on an unmodified
> > > >> application to see what happens, and it follows that if those
> > > >> applications want to run in that mode they will need functional
> > > >> fsync()...
> > > >>
> > > >> I would feel better if we were talking about specific applications and
> > > >> performance numbers to know if forcing DAX on application is a debug
> > > >> facility or a production level capability. You seem to have already
> > > >> made that determination and I'm curious what I'm missing.
> > > >
> > > > I'm not setting any policy here at all. This whole argument is
> > > > based around the DAX mount option doing "global fs enable or
> > > > silently turning it off" and the application not knowing about that.
> > > >
> > > > The whole point of having a persistent per-inode DAX flag is that
> > > > it is a policy mechanism, not a policy. The application can, if it
> > > > is DAX aware, directly control whether DAX is used on a file or not.
> > > > The application can even query and clear that persistent inode flag
> > > > if it is configured not to (or cannot) use DAX.
> > > >
> > > > If the filesystem cannot support DAX, then we can error out attempts
> > > > to set the DAX flag and then the app knows DAX is not available.
> > > > i.e. the attempt to set policy failed. If the flag is set, then the
> > > > inode will *always* use DAX - there is no "fall back to page cache"
> > > > when DAX is enabled.
> > > >
> > > > If the application is not DAX aware, then the admin can control the
> > > > DAX policy by manipulating these flags themselves, and hence control
> > > > whether DAX is used by the application or not.
> > > >
> > > > If you think I'm dictating policy for DAX users and application,
> > > > then you haven't understood anything I've previously said about why
> > > > the DAX mount option needs to die before any of this is considered
> > > > production ready.
DAX is not an opaque "all or nothing" option. XFS > > > > will provide apps and admins with fine-grained, persistent, > > > > discoverable policy flags to allow admins and applications to set > > > > DAX policies however they see fit. This simply cannot be done if the > > > > only knob you have is a mount option that may or may not stick. > > > > > > I agree the mount option needs to die, and I fully grok the reasoning. > > > What I'm concerned with is that a system using fully-DAX-aware > > > applications is forced to incur the overhead of maintaining *sync > > > semantics, periodic sync(2) in particular, even if it is not relying > > > on those semantics. > > > > Let me somewhat correct this: IMO hard requirement is maintaining sync(2) > > semantics. Periodic writeback does not have any hard durability guarantees > > and we are free to ignore such requests in ->writepages() (that function > > has enough information in the writeback_control structure to differentiate > > between periodic writeback and data integrity sync) if we decide it is > > useful. Actually, we could do that even for 4.5. > > Attached is a version of Ross' patch that will work for sync(2) and > fsync(2) and we won't flush caches during periodic writeback. The patch is > only compile-tested. Ross? This looks great. I'll send out a v2 with this and with the dax_clear_sectors() changes after I'm done testing. 
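
Jan's remark that ->writepages() can tell periodic writeback apart from a data integrity sync can be sketched in user space. The kernel's writeback_control does carry a sync_mode field that is WB_SYNC_ALL for integrity syncs and WB_SYNC_NONE otherwise; everything else below (the *_model types, flush_count, model_dax_writepages()) is hypothetical and only illustrates the control flow. It is not the patch that was attached to the mail.

```c
#include <assert.h>
#include <stddef.h>

/* Same enumerator names as the kernel; the rest of this file is a model. */
enum writeback_sync_modes {
	WB_SYNC_NONE,	/* periodic/background writeback, no durability promise */
	WB_SYNC_ALL	/* data integrity sync: sync(2), fsync(2), msync(2) */
};

/* Hypothetical, trimmed-down stand-in for struct writeback_control. */
struct writeback_control_model {
	enum writeback_sync_modes sync_mode;
};

/* Hypothetical stand-in for an address_space with DAX radix tree entries. */
struct mapping_model {
	unsigned long nrexceptional;	/* number of tracked DAX entries */
	unsigned long flush_count;	/* how many times caches were flushed */
};

/*
 * Sketch of a DAX-aware ->writepages(): skip the cache flush unless the
 * caller asked for data integrity and there is something to flush.
 */
int model_dax_writepages(struct mapping_model *mapping,
			 const struct writeback_control_model *wbc)
{
	assert(mapping != NULL && wbc != NULL);

	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
		return 0;	/* periodic writeback or nothing dirty: no-op */

	mapping->flush_count++;	/* stands in for the real cache flush */
	return 0;
}
```

In this model a periodic pass (WB_SYNC_NONE) leaves flush_count untouched, while a WB_SYNC_ALL pass over a mapping with tracked entries flushes once, which matches the behaviour described above: sync(2) and fsync(2) stay durable, and periodic writeback costs nothing.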
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 752457CA3 for ; Sun, 7 Feb 2016 01:19:36 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay1.corp.sgi.com (Postfix) with ESMTP id 6739F8F8039 for ; Sat, 6 Feb 2016 23:19:33 -0800 (PST) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id 9s400sPCSjgcOWvg for ; Sat, 06 Feb 2016 23:19:31 -0800 (PST) From: Ross Zwisler Subject: [PATCH 0/2] DAX bdev fixes - move flushing calls to FS Date: Sun, 7 Feb 2016 00:19:11 -0700 Message-Id: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Theodore Ts'o , Andrew Morton , linux-nvdimm@lists.01.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams The first patch in the series just adds a bdev argument to dax_clear_blocks(), and should be relatively straightforward. The second patch is slightly more controversial. During testing of raw block devices + DAX I noticed that the struct block_device that we were using for DAX operations was incorrect. For the fault handlers, etc. we can just get the correct bdev via get_block(), which is passed in as a function pointer, but for the flushing code we don't have access to get_block(). This is also an issue for XFS real-time devices, whenever we get those working. In short, somehow we need to get dax_writeback_mapping_range() a valid bdev. 
Right now it is called via filemap_write_and_wait_range(), which can't
provide either the bdev or a get_block() function pointer. So, our options
seem to be:

  a) Move the calls to dax_writeback_mapping_range() into the filesystems.
  This is implemented by patch 2 in this series.

  b) Keep the calls to dax_writeback_mapping_range() in the mm code, and
  provide a generic way to ask a filesystem for an inode's bdev. I did a
  version of this using a superblock operation here:

  https://lkml.org/lkml/2016/2/2/941

It has been noted that we may need to expand the coverage of our DAX
flushing code to include support for the sync() and syncfs() userspace
calls. This is still under discussion, but if we do end up needing to add
support for sync(), I don't think that it is v4.5 material for the reasons
stated here:

https://lkml.org/lkml/2016/2/4/962

I think that for v4.5 we either need patch 2 of this series, or the
get_bdev() patch listed for solution b) above.

Ross Zwisler (2):
  dax: pass bdev argument to dax_clear_blocks()
  dax: move writeback calls into the filesystems

 fs/block_dev.c         |  7 +++++++
 fs/dax.c               |  9 ++++-----
 fs/ext2/file.c         | 10 ++++++++++
 fs/ext2/inode.c        |  5 +++--
 fs/ext4/fsync.c        | 10 +++++++++-
 fs/xfs/xfs_aops.c      |  2 +-
 fs/xfs/xfs_aops.h      |  1 +
 fs/xfs/xfs_bmap_util.c |  4 +++-
 fs/xfs/xfs_file.c      | 12 ++++++++++--
 include/linux/dax.h    |  7 ++++---
 mm/filemap.c           |  6 ------
 11 files changed, 52 insertions(+), 21 deletions(-)

-- 
2.5.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 2AB5A7CA3 for ; Sun, 7 Feb 2016 01:19:38 -0600 (CST)
Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay3.corp.sgi.com (Postfix) with ESMTP id A20EDAC001 for ; Sat, 6 Feb 2016 23:19:34 -0800 (PST)
Received: from mga03.intel.com (mga03.intel.com
[134.134.136.65]) by cuda.sgi.com with ESMTP id LrydzCQQ3lH43TRJ for ; Sat, 06 Feb 2016 23:19:32 -0800 (PST) From: Ross Zwisler Subject: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Date: Sun, 7 Feb 2016 00:19:12 -0700 Message-Id: <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Theodore Ts'o , Andrew Morton , linux-nvdimm@lists.01.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams dax_clear_blocks() needs a valid struct block_device and previously it was using inode->i_sb->s_bdev in all cases. This is correct for normal inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time devices. Instead, have the caller pass in a struct block_device pointer which it knows to be correct. Signed-off-by: Ross Zwisler --- fs/dax.c | 4 ++-- fs/ext2/inode.c | 5 +++-- fs/xfs/xfs_aops.c | 2 +- fs/xfs/xfs_aops.h | 1 + fs/xfs/xfs_bmap_util.c | 4 +++- include/linux/dax.h | 3 ++- 6 files changed, 12 insertions(+), 7 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 227974a..4592241 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -83,9 +83,9 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n) * and hence this means the stack from this point must follow GFP_NOFS * semantics for all operations. 
*/ -int dax_clear_blocks(struct inode *inode, sector_t block, long _size) +int dax_clear_blocks(struct inode *inode, struct block_device *bdev, + sector_t block, long _size) { - struct block_device *bdev = inode->i_sb->s_bdev; struct blk_dax_ctl dax = { .sector = block << (inode->i_blkbits - 9), .size = _size, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 338eefd..277a32b 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -737,8 +737,9 @@ static int ext2_get_blocks(struct inode *inode, * so that it's not found by another thread before it's * initialised */ - err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key), - 1 << inode->i_blkbits); + err = dax_clear_blocks(inode, inode->i_sb->s_bdev, + le32_to_cpu(chain[depth-1].key), + 1 << inode->i_blkbits); if (err) { mutex_unlock(&ei->truncate_mutex); goto cleanup; diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 379c089..fc20518 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -55,7 +55,7 @@ xfs_count_page_state( } while ((bh = bh->b_this_page) != head); } -STATIC struct block_device * +struct block_device * xfs_find_bdev_for_inode( struct inode *inode) { diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h index f6ffc9a..a4343c6 100644 --- a/fs/xfs/xfs_aops.h +++ b/fs/xfs/xfs_aops.h @@ -62,5 +62,6 @@ int xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset, struct buffer_head *map_bh, int create); extern void xfs_count_page_state(struct page *, int *, int *); +extern struct block_device *xfs_find_bdev_for_inode(struct inode *); #endif /* __XFS_AOPS_H__ */ diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 07ef29b..f722ba2 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -73,9 +73,11 @@ xfs_zero_extent( xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb); sector_t block = XFS_BB_TO_FSBT(mp, sector); ssize_t size = XFS_FSB_TO_B(mp, count_fsb); + struct inode *inode = VFS_I(ip); if (IS_DAX(VFS_I(ip))) - return dax_clear_blocks(VFS_I(ip), block, 
size); + return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode), + block, size); /* * let the block layer decide on the fastest method of diff --git a/include/linux/dax.h b/include/linux/dax.h index 8204c3d..bad27b0 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -7,7 +7,8 @@ ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t, get_block_t, dio_iodone_t, int flags); -int dax_clear_blocks(struct inode *, sector_t block, long size); +int dax_clear_blocks(struct inode *inode, struct block_device *bdev, + sector_t block, long _size); int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); int dax_truncate_page(struct inode *, loff_t from, get_block_t); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t, -- 2.5.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 5FA6729DF5 for ; Sun, 7 Feb 2016 12:19:35 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id 50D47304043 for ; Sun, 7 Feb 2016 10:19:32 -0800 (PST) Received: from mail-yw0-f173.google.com (mail-yw0-f173.google.com [209.85.161.173]) by cuda.sgi.com with ESMTP id 0bQc4pkeJ02qFeMf (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Sun, 07 Feb 2016 10:19:30 -0800 (PST) Received: by mail-yw0-f173.google.com with SMTP id g127so88032686ywf.2 for ; Sun, 07 Feb 2016 10:19:30 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> Date: Sun, 7 Feb 2016 10:19:29 -0800 Message-ID: Subject: Re: [PATCH 1/2] dax: pass bdev argument 
to dax_clear_blocks() From: Dan Williams List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Matthew Wilcox , Andrew Morton , linux-ext4 On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler wrote: > dax_clear_blocks() needs a valid struct block_device and previously it was > using inode->i_sb->s_bdev in all cases. This is correct for normal inodes > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time devices. > > Instead, have the caller pass in a struct block_device pointer which it > knows to be correct. > > Signed-off-by: Ross Zwisler > --- > fs/dax.c | 4 ++-- > fs/ext2/inode.c | 5 +++-- > fs/xfs/xfs_aops.c | 2 +- > fs/xfs/xfs_aops.h | 1 + > fs/xfs/xfs_bmap_util.c | 4 +++- > include/linux/dax.h | 3 ++- > 6 files changed, 12 insertions(+), 7 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 227974a..4592241 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -83,9 +83,9 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n) > * and hence this means the stack from this point must follow GFP_NOFS > * semantics for all operations. > */ > -int dax_clear_blocks(struct inode *inode, sector_t block, long _size) > +int dax_clear_blocks(struct inode *inode, struct block_device *bdev, > + sector_t block, long _size) Since this is a bdev relative routine we should also resolve the sector, i.e. 
the signature should drop the inode: int dax_clear_sectors(struct block_device *bdev, sector_t sector, long _size) _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 5D47B29DF5 for ; Sun, 7 Feb 2016 13:13:55 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay2.corp.sgi.com (Postfix) with ESMTP id 21D36304032 for ; Sun, 7 Feb 2016 11:13:55 -0800 (PST) Received: from mail-yw0-f179.google.com (mail-yw0-f179.google.com [209.85.161.179]) by cuda.sgi.com with ESMTP id sZhyZ1DCL2hUom5S (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Sun, 07 Feb 2016 11:13:52 -0800 (PST) Received: by mail-yw0-f179.google.com with SMTP id q190so88695217ywd.3 for ; Sun, 07 Feb 2016 11:13:52 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> Date: Sun, 7 Feb 2016 11:13:51 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems From: Dan Williams List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , jmoyer , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Matthew Wilcox , Andrew Morton , linux-ext4 On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler wrote: > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > (ext2, ext4 & xfs) were 
centralized in filemap_write_and_wait_range(). > dax_writeback_mapping_range() needs a struct block_device, and it used to > get that from inode->i_sb->s_bdev. This is correct for normal inodes > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time files. > > Instead, call dax_writeback_mapping_range() directly from the filesystem or > raw block device fsync/msync code so that they can supply us with a valid > block device. > > It should be noted that this will reduce the number of calls to > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > called in the various filesystems for operations other than just > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > of ->fsync for hole punch, truncate, and block relocation > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). > > I don't believe that these extra flushes are necessary in the DAX case. In > the page cache case when we have dirty data in the page cache, that data > will be actively lost if we evict a dirty page cache page without flushing > it to media first. For DAX, though, the data will remain consistent with > the physical address to which it was written regardless of whether it's in > the processor cache or not - really the only reason I see to flush is in > response to a fsync or msync so that our data is durable on media in case > of a power loss. The case where we could throw dirty data out of the page > cache and essentially lose writes simply doesn't exist. 
> > Signed-off-by: Ross Zwisler > --- > fs/block_dev.c | 7 +++++++ > fs/dax.c | 5 ++--- > fs/ext2/file.c | 10 ++++++++++ > fs/ext4/fsync.c | 10 +++++++++- > fs/xfs/xfs_file.c | 12 ++++++++++-- > include/linux/dax.h | 4 ++-- > mm/filemap.c | 6 ------ > 7 files changed, 40 insertions(+), 14 deletions(-) This sprinkling of dax specific fixups outside of vm_operations_struct routines still has me thinking that we are going in the wrong direction for fsync/msync support. If an application is both unaware of DAX and doing mmap I/O it is better served by the page cache where writeback is durable by default. We expect DAX-aware applications to assume responsibility for cpu cache management [1]. Making DAX mmap semantics explicit opt-in solves not only durability support, but also the current problem that DAX gets silently disabled leaving an app to wonder if it really got a direct mapping. DAX also silently picks pud, pmd, or pte mappings which is information an application would really like to know at map time. The proposal: make applications explicitly request DAX semantics with a new MAP_DAX flag and fail if DAX is unavailable. Document that a successful MAP_DAX request mandates that the application assumes responsibility for cpu cache management. Require that all applications that mmap the file agree on MAP_DAX. This also solves the future problem of DAX support on virtually tagged cache architectures where it is difficult for the kernel to know what alias addresses need flushing. 
[1]: https://github.com/pmem/nvml _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id AFBC329DF5 for ; Sun, 7 Feb 2016 15:50:58 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay3.corp.sgi.com (Postfix) with ESMTP id 35C65AC001 for ; Sun, 7 Feb 2016 13:50:54 -0800 (PST) Received: from ipmail06.adl2.internode.on.net (ipmail06.adl2.internode.on.net [150.101.137.129]) by cuda.sgi.com with ESMTP id McFJB8rnCxZHeEn1 for ; Sun, 07 Feb 2016 13:50:50 -0800 (PST) Date: Mon, 8 Feb 2016 08:50:47 +1100 From: Dave Chinner Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160207215047.GJ31407@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dan Williams Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , jmoyer , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Matthew Wilcox , Ross Zwisler , linux-ext4 , Andrew Morton On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote: > On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler > wrote: > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > get that from inode->i_sb->s_bdev. 
This is correct for normal inodes > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time files. > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem or > > raw block device fsync/msync code so that they can supply us with a valid > > block device. > > > > It should be noted that this will reduce the number of calls to > > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > > called in the various filesystems for operations other than just > > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > > of ->fsync for hole punch, truncate, and block relocation > > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). > > > > I don't believe that these extra flushes are necessary in the DAX case. In > > the page cache case when we have dirty data in the page cache, that data > > will be actively lost if we evict a dirty page cache page without flushing > > it to media first. For DAX, though, the data will remain consistent with > > the physical address to which it was written regardless of whether it's in > > the processor cache or not - really the only reason I see to flush is in > > response to a fsync or msync so that our data is durable on media in case > > of a power loss. The case where we could throw dirty data out of the page > > cache and essentially lose writes simply doesn't exist. > > > > Signed-off-by: Ross Zwisler > > --- > > fs/block_dev.c | 7 +++++++ > > fs/dax.c | 5 ++--- > > fs/ext2/file.c | 10 ++++++++++ > > fs/ext4/fsync.c | 10 +++++++++- > > fs/xfs/xfs_file.c | 12 ++++++++++-- > > include/linux/dax.h | 4 ++-- > > mm/filemap.c | 6 ------ > > 7 files changed, 40 insertions(+), 14 deletions(-) > > This sprinkling of dax specific fixups outside of vm_operations_struct > routines still has me thinking that we are going in the wrong > direction for fsync/msync support. 
>
> If an application is both unaware of DAX and doing mmap I/O it is
> better served by the page cache where writeback is durable by default.
> We expect DAX-aware applications to assume responsibility for cpu
> cache management [1]. Making DAX mmap semantics explicit opt-in
> solves not only durability support, but also the current problem that
> DAX gets silently disabled leaving an app to wonder if it really got a
> direct mapping. DAX also silently picks pud, pmd, or pte mappings
> which is information an application would really like to know at map
> time.
>
> The proposal: make applications explicitly request DAX semantics with
> a new MAP_DAX flag and fail if DAX is unavailable.

No.

As I've stated before, the entire purpose of enabling DAX through
existing filesystems like XFS and ext4 is so that existing
applications work with DAX *without modification*. That is,
applications can be entirely unaware of the fact that the filesystem
is giving them direct access to the storage because the access and
failure semantics of DAX-enabled mmap are *identical to the existing
mmap semantics*.

Given this, the app doesn't need to care whether DAX is enabled or
not; all that will be seen is a difference in speed of access.
Enabling and disabling DAX is, at this point, purely an
administration decision - if the hardware and filesystem support it,
it can be turned on without having to wait years for application
developers to add support for it....

-Dave.
-- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 4C43C29DF5 for ; Sun, 7 Feb 2016 16:04:38 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay3.corp.sgi.com (Postfix) with ESMTP id E12DFAC003 for ; Sun, 7 Feb 2016 14:04:34 -0800 (PST) Received: from ipmail06.adl2.internode.on.net (ipmail06.adl2.internode.on.net [150.101.137.129]) by cuda.sgi.com with ESMTP id 1pHiaihkcLN0al1P for ; Sun, 07 Feb 2016 14:04:32 -0800 (PST) Date: Mon, 8 Feb 2016 09:03:29 +1100 From: Dave Chinner Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160207220329.GK31407@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: Theodore Ts'o , linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Andrew Morton , linux-ext4@vger.kernel.org, Dan Williams On Sun, Feb 07, 2016 at 12:19:12AM -0700, Ross Zwisler wrote: > dax_clear_blocks() needs a valid struct block_device and previously it was > using inode->i_sb->s_bdev in all cases.
This is correct for normal inodes
> on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> block devices and for XFS real-time devices.
>
> Instead, have the caller pass in a struct block_device pointer which it
> knows to be correct.
....
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 07ef29b..f722ba2 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -73,9 +73,11 @@ xfs_zero_extent(
>  	xfs_daddr_t		sector = xfs_fsb_to_db(ip, start_fsb);
>  	sector_t		block = XFS_BB_TO_FSBT(mp, sector);
>  	ssize_t			size = XFS_FSB_TO_B(mp, count_fsb);
> +	struct inode		*inode = VFS_I(ip);
>
>  	if (IS_DAX(VFS_I(ip)))
> -		return dax_clear_blocks(VFS_I(ip), block, size);
> +		return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode),
> +				block, size);

Get rid of the local inode variable and use VFS_I(ip) like the code
originally did. Do not change code that is unrelated to the
modification being made, especially when it results in making
the code an inconsistent mess of mixed pointer constructs....

Cheers,

Dave.
-- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 6AAC17CA2 for ; Sun, 7 Feb 2016 19:44:24 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id 4CDED304032 for ; Sun, 7 Feb 2016 17:44:21 -0800 (PST) Received: from mga11.intel.com ([192.55.52.93]) by cuda.sgi.com with ESMTP id dxC8POGslFyPjW1D for ; Sun, 07 Feb 2016 17:44:20 -0800 (PST) Date: Sun, 7 Feb 2016 18:44:09 -0700 From: Ross Zwisler Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160208014409.GA2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> <20160207220329.GK31407@dastard> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20160207220329.GK31407@dastard> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: Theodore Ts'o , linux-nvdimm@lists.01.org, Dan Williams , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Andrew Morton On Mon, Feb 08, 2016 at 09:03:29AM +1100, Dave Chinner wrote: > On Sun, Feb 07, 2016 at 12:19:12AM -0700, Ross Zwisler wrote: > > dax_clear_blocks() needs a valid struct block_device and previously it was > > using inode->i_sb->s_bdev in all cases.
This is correct for normal inodes
> > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> > block devices and for XFS real-time devices.
> >
> > Instead, have the caller pass in a struct block_device pointer which it
> > knows to be correct.
> ....
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index 07ef29b..f722ba2 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -73,9 +73,11 @@ xfs_zero_extent(
> >  	xfs_daddr_t		sector = xfs_fsb_to_db(ip, start_fsb);
> >  	sector_t		block = XFS_BB_TO_FSBT(mp, sector);
> >  	ssize_t			size = XFS_FSB_TO_B(mp, count_fsb);
> > +	struct inode		*inode = VFS_I(ip);
> >
> >  	if (IS_DAX(VFS_I(ip)))
> > -		return dax_clear_blocks(VFS_I(ip), block, size);
> > +		return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode),
> > +				block, size);
>
> Get rid of the local inode variable and use VFS_I(ip) like the code
> originally did. Do not change code that is unrelated to the
> modification being made, especially when it results in making
> the code an inconsistent mess of mixed pointer constructs....

The local 'inode' variable was added to avoid multiple calls for VFS_I() for
the same 'ip'. That said, I'm happy to make the change.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 2DBB97CA2 for ; Sun, 7 Feb 2016 19:46:12 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay1.corp.sgi.com (Postfix) with ESMTP id EB2768F8049 for ; Sun, 7 Feb 2016 17:46:11 -0800 (PST) Received: from mga04.intel.com ([192.55.52.120]) by cuda.sgi.com with ESMTP id iuL2AcHzymiUKyMp for ; Sun, 07 Feb 2016 17:46:11 -0800 (PST) Date: Sun, 7 Feb 2016 18:46:01 -0700 From: Ross Zwisler Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160208014601.GB2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dan Williams Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Matthew Wilcox , Ross Zwisler , linux-ext4 , Andrew Morton On Sun, Feb 07, 2016 at 10:19:29AM -0800, Dan Williams wrote: > On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler > wrote: > > dax_clear_blocks() needs a valid struct block_device and previously it was > > using inode->i_sb->s_bdev in all cases. This is correct for normal inodes > > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time devices.
> > > > Instead, have the caller pass in a struct block_device pointer which it > > knows to be correct. > > > > Signed-off-by: Ross Zwisler > > --- > > fs/dax.c | 4 ++-- > > fs/ext2/inode.c | 5 +++-- > > fs/xfs/xfs_aops.c | 2 +- > > fs/xfs/xfs_aops.h | 1 + > > fs/xfs/xfs_bmap_util.c | 4 +++- > > include/linux/dax.h | 3 ++- > > 6 files changed, 12 insertions(+), 7 deletions(-) > > > > diff --git a/fs/dax.c b/fs/dax.c > > index 227974a..4592241 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ -83,9 +83,9 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n) > > * and hence this means the stack from this point must follow GFP_NOFS > > * semantics for all operations. > > */ > > -int dax_clear_blocks(struct inode *inode, sector_t block, long _size) > > +int dax_clear_blocks(struct inode *inode, struct block_device *bdev, > > + sector_t block, long _size) > > Since this is a bdev relative routine we should also resolve the > sector, i.e. the signature should drop the inode: > > int dax_clear_sectors(struct block_device *bdev, sector_t sector, long _size) The inode is still needed because dax_clear_blocks() needs inode->i_blkbits. Unless there is some easy way to get this from the bdev that I'm not seeing? 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id B30047CAA for ; Sun, 7 Feb 2016 22:29:41 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay1.corp.sgi.com (Postfix) with ESMTP id A24A88F8035 for ; Sun, 7 Feb 2016 20:29:41 -0800 (PST) Received: from mail-ob0-f195.google.com (mail-ob0-f195.google.com [209.85.214.195]) by cuda.sgi.com with ESMTP id 5yIlblzCKU35Wa5E (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Sun, 07 Feb 2016 20:29:40 -0800 (PST) Received: by mail-ob0-f195.google.com with SMTP id wg8so7338132obc.3 for ; Sun, 07 Feb 2016 20:29:40 -0800 (PST) Mime-Version: 1.0 (1.0) Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() From: Ross Zwisler In-Reply-To: <20160208014601.GB2343@linux.intel.com> Date: Sun, 7 Feb 2016 21:29:38 -0700 Message-Id: <00FE872A-9B2A-4492-A83C-59025ACB1F4A@gmail.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> <20160208014601.GB2343@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Dan Williams , linux-ext4 , Andrew Morton > On Feb 7, 2016, at 6:46 PM, Ross Zwisler wrote: > >> On Sun, Feb 07, 2016 at 10:19:29AM -0800, Dan Williams wrote: >> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler >> wrote: >>> dax_clear_blocks() needs a valid
struct block_device and previously it was >>> using inode->i_sb->s_bdev in all cases. This is correct for normal inodes >>> on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw >>> block devices and for XFS real-time devices. >>> >>> Instead, have the caller pass in a struct block_device pointer which it >>> knows to be correct. >>> >>> Signed-off-by: Ross Zwisler >>> --- >>> fs/dax.c | 4 ++-- >>> fs/ext2/inode.c | 5 +++-- >>> fs/xfs/xfs_aops.c | 2 +- >>> fs/xfs/xfs_aops.h | 1 + >>> fs/xfs/xfs_bmap_util.c | 4 +++- >>> include/linux/dax.h | 3 ++- >>> 6 files changed, 12 insertions(+), 7 deletions(-) >>> >>> diff --git a/fs/dax.c b/fs/dax.c >>> index 227974a..4592241 100644 >>> --- a/fs/dax.c >>> +++ b/fs/dax.c >>> @@ -83,9 +83,9 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n) >>> * and hence this means the stack from this point must follow GFP_NOFS >>> * semantics for all operations. >>> */ >>> -int dax_clear_blocks(struct inode *inode, sector_t block, long _size) >>> +int dax_clear_blocks(struct inode *inode, struct block_device *bdev, >>> + sector_t block, long _size) >> >> Since this is a bdev relative routine we should also resolve the >> sector, i.e. the signature should drop the inode: >> >> int dax_clear_sectors(struct block_device *bdev, sector_t sector, long _size) > > The inode is still needed because dax_clear_blocks() needs inode->i_blkbits. > Unless there is some easy way to get this from the bdev that I'm not seeing? Never mind, you are passing in the sector, not the block. Sure, this seems better - I'll fix this for v2. 
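For readers following the block-versus-sector point above, here is a minimal userspace sketch of the arithmetic involved. It assumes 512-byte sectors and a filesystem block size of 1 << blkbits bytes. The helper name fs_block_to_sector() and the EXAMPLE_SECTOR_SHIFT constant are hypothetical and are not kernel interfaces; the sketch only illustrates why a routine that is handed a sector no longer needs inode->i_blkbits, since the caller has already done the conversion.

```c
#include <stdint.h>

/* Assumption for this sketch: a sector is 512 bytes, i.e. 1 << 9. */
#define EXAMPLE_SECTOR_SHIFT 9

/*
 * Hypothetical helper (not a kernel API): convert a filesystem block
 * number into a 512-byte sector number.  blkbits is log2 of the
 * filesystem block size in bytes and must be >= EXAMPLE_SECTOR_SHIFT.
 */
static uint64_t fs_block_to_sector(uint64_t block, unsigned int blkbits)
{
	return block << (blkbits - EXAMPLE_SECTOR_SHIFT);
}
```

With 4096-byte blocks (blkbits of 12) each block spans eight sectors, so a caller holding a block number would shift once and pass the resulting sector along with the bdev.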
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 4AC207CA2 for ; Sun, 7 Feb 2016 23:17:34 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay2.corp.sgi.com (Postfix) with ESMTP id 295FD304039 for ; Sun, 7 Feb 2016 21:17:31 -0800 (PST) Received: from ipmail06.adl2.internode.on.net (ipmail06.adl2.internode.on.net [150.101.137.129]) by cuda.sgi.com with ESMTP id BrvN2lTqGT489lqh for ; Sun, 07 Feb 2016 21:17:28 -0800 (PST) Date: Mon, 8 Feb 2016 16:17:25 +1100 From: Dave Chinner Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160208051725.GM31407@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> <20160207220329.GK31407@dastard> <20160208014409.GA2343@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20160208014409.GA2343@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com On Sun, Feb 07, 2016 at 06:44:09PM -0700, Ross Zwisler wrote: > On Mon, Feb 08, 2016 at 09:03:29AM +1100, Dave Chinner wrote: > > On Sun, Feb 07, 2016 at 12:19:12AM -0700, Ross Zwisler wrote: > > > dax_clear_blocks() needs a valid struct block_device and previously it was > > > using
inode->i_sb->s_bdev in all cases. This is correct for normal inodes
> > > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> > > block devices and for XFS real-time devices.
> > >
> > > Instead, have the caller pass in a struct block_device pointer which it
> > > knows to be correct.
> > ....
> > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > index 07ef29b..f722ba2 100644
> > > --- a/fs/xfs/xfs_bmap_util.c
> > > +++ b/fs/xfs/xfs_bmap_util.c
> > > @@ -73,9 +73,11 @@ xfs_zero_extent(
> > >  	xfs_daddr_t		sector = xfs_fsb_to_db(ip, start_fsb);
> > >  	sector_t		block = XFS_BB_TO_FSBT(mp, sector);
> > >  	ssize_t			size = XFS_FSB_TO_B(mp, count_fsb);
> > > +	struct inode		*inode = VFS_I(ip);
> > >
> > >  	if (IS_DAX(VFS_I(ip)))
> > > -		return dax_clear_blocks(VFS_I(ip), block, size);
> > > +		return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode),
> > > +				block, size);
> >
> > Get rid of the local inode variable and use VFS_I(ip) like the code
> > originally did. Do not change code that is unrelated to the
> > modification being made, especially when it results in making
> > the code an inconsistent mess of mixed pointer constructs....
>
> The local 'inode' variable was added to avoid multiple calls for VFS_I() for
> the same 'ip'.

My point is you didn't achieve that. The end result of your patch
is:

	struct inode            *inode = VFS_I(ip);

	if (IS_DAX(VFS_I(ip)))
		return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode),
				block, size);

So now we have a local variable, but we still have 2 calls to
VFS_I(ip). i.e. this makes the code harder to read and understand
than before for no benefit.

Cheers,

Dave.
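As an illustration of the form requested in this review, the following is a self-contained userspace mock, not the actual XFS change. Everything in it is an assumption made for the sketch: the struct layouts, the simplified VFS_I() and IS_DAX() macros, and the hypothetical mock_find_bdev(), mock_clear_blocks() and zero_extent_sketch() functions. It only shows the DAX branch written with VFS_I(ip) throughout and no extra local variable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mock stand-ins for kernel types; the layouts are assumptions. */
struct block_device { int id; };
struct inode { bool dax; struct block_device *bdev; };
struct xfs_inode { struct inode i_vnode; };

/* Simplified mocks of the kernel macros, for illustration only. */
#define VFS_I(ip)	(&(ip)->i_vnode)
#define IS_DAX(inode)	((inode)->dax)

/* Hypothetical stand-in for a per-inode bdev lookup. */
static struct block_device *mock_find_bdev(struct inode *inode)
{
	return inode->bdev;
}

/* Hypothetical stand-in for the clearing routine: 0 on success. */
static int mock_clear_blocks(struct inode *inode, struct block_device *bdev,
			     long size)
{
	(void)inode;
	return (bdev != NULL && size >= 0) ? 0 : -1;
}

/* DAX branch written with VFS_I(ip) used consistently, no local. */
static int zero_extent_sketch(struct xfs_inode *ip, long size)
{
	if (IS_DAX(VFS_I(ip)))
		return mock_clear_blocks(VFS_I(ip),
					 mock_find_bdev(VFS_I(ip)), size);
	return 1;	/* non-DAX path elided in this sketch */
}
```

The sketch is only about readability: one accessor style is used everywhere in the function, which is the consistency the review asks for.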
-- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 6F8D47CA4 for ; Sun, 7 Feb 2016 01:19:37 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id 52982304032 for ; Sat, 6 Feb 2016 23:19:34 -0800 (PST) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id jZYAuZ44tW5D7RGK for ; Sat, 06 Feb 2016 23:19:33 -0800 (PST) From: Ross Zwisler Subject: [PATCH 2/2] dax: move writeback calls into the filesystems Date: Sun, 7 Feb 2016 00:19:13 -0700 Message-Id: <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Theodore Ts'o , Andrew Morton , linux-nvdimm@lists.01.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams Previously calls to dax_writeback_mapping_range() for all DAX filesystems (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). dax_writeback_mapping_range() needs a struct block_device, and it used to get that from inode->i_sb->s_bdev. This is correct for normal inodes mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time files. 
Instead, call dax_writeback_mapping_range() directly from the filesystem or raw block device fsync/msync code so that they can supply us with a valid block device. It should be noted that this will reduce the number of calls to dax_writeback_mapping_range() because filemap_write_and_wait_range() is called in the various filesystems for operations other than just fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside of ->fsync for hole punch, truncate, and block relocation (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). I don't believe that these extra flushes are necessary in the DAX case. In the page cache case when we have dirty data in the page cache, that data will be actively lost if we evict a dirty page cache page without flushing it to media first. For DAX, though, the data will remain consistent with the physical address to which it was written regardless of whether it's in the processor cache or not - really the only reason I see to flush is in response to a fsync or msync so that our data is durable on media in case of a power loss. The case where we could throw dirty data out of the page cache and essentially lose writes simply doesn't exist. 
Signed-off-by: Ross Zwisler --- fs/block_dev.c | 7 +++++++ fs/dax.c | 5 ++--- fs/ext2/file.c | 10 ++++++++++ fs/ext4/fsync.c | 10 +++++++++- fs/xfs/xfs_file.c | 12 ++++++++++-- include/linux/dax.h | 4 ++-- mm/filemap.c | 6 ------ 7 files changed, 40 insertions(+), 14 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index fa0507a..312ad44 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -356,8 +356,15 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync) { struct inode *bd_inode = bdev_file_inode(filp); struct block_device *bdev = I_BDEV(bd_inode); + struct address_space *mapping = bd_inode->i_mapping; int error; + if (dax_mapping(mapping) && mapping->nrexceptional) { + error = dax_writeback_mapping_range(mapping, bdev, start, end); + if (error) + return error; + } + error = filemap_write_and_wait_range(filp->f_mapping, start, end); if (error) return error; diff --git a/fs/dax.c b/fs/dax.c index 4592241..4b5006a 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev, * end]. This is required by data integrity operations to ensure file data is * on persistent storage prior to completion of the operation. 
*/ -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end) +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, loff_t start, loff_t end) { struct inode *inode = mapping->host; - struct block_device *bdev = inode->i_sb->s_bdev; pgoff_t start_index, end_index, pmd_index; pgoff_t indices[PAGEVEC_SIZE]; struct pagevec pvec; diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 2c88d68..d1abf53 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -162,6 +162,16 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) int ret; struct super_block *sb = file->f_mapping->host->i_sb; struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; +#ifdef CONFIG_FS_DAX + struct address_space *inode_mapping = file->f_inode->i_mapping; + + if (dax_mapping(inode_mapping) && inode_mapping->nrexceptional) { + ret = dax_writeback_mapping_range(inode_mapping, sb->s_bdev, + start, end); + if (ret) + return ret; + } +#endif ret = generic_file_fsync(file, start, end, datasync); if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) { diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 8850254..e9cf53b 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "ext4.h" #include "ext4_jbd2.h" @@ -83,10 +84,10 @@ static int ext4_sync_parent(struct inode *inode) * What we do is just kick off a commit and wait on it. This will snapshot the * inode to disk. 
*/ - int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) { struct inode *inode = file->f_mapping->host; + struct address_space *mapping = inode->i_mapping; struct ext4_inode_info *ei = EXT4_I(inode); journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; int ret = 0, err; @@ -97,6 +98,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) trace_ext4_sync_file_enter(file, datasync); + if (dax_mapping(mapping) && mapping->nrexceptional) { + err = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, + start, end); + if (err) + goto out; + } + if (inode->i_sb->s_flags & MS_RDONLY) { /* Make sure that we read updated s_mount_flags value */ smp_rmb(); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 52883ac..84e95cc 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -209,7 +209,8 @@ xfs_file_fsync( loff_t end, int datasync) { - struct inode *inode = file->f_mapping->host; + struct address_space *mapping = file->f_mapping; + struct inode *inode = mapping->host; struct xfs_inode *ip = XFS_I(inode); struct xfs_mount *mp = ip->i_mount; int error = 0; @@ -218,7 +219,14 @@ xfs_file_fsync( trace_xfs_file_fsync(ip); - error = filemap_write_and_wait_range(inode->i_mapping, start, end); + if (dax_mapping(mapping) && mapping->nrexceptional) { + error = dax_writeback_mapping_range(mapping, + xfs_find_bdev_for_inode(inode), start, end); + if (error) + return error; + } + + error = filemap_write_and_wait_range(mapping, start, end); if (error) return error; diff --git a/include/linux/dax.h b/include/linux/dax.h index bad27b0..8e9f114 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -42,6 +42,6 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end); +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, loff_t 
start, loff_t end); #endif diff --git a/mm/filemap.c b/mm/filemap.c index bc94386..c4286eb 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -482,12 +482,6 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; - if (dax_mapping(mapping) && mapping->nrexceptional) { - err = dax_writeback_mapping_range(mapping, lstart, lend); - if (err) - return err; - } - if (mapping->nrpages) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); -- 2.5.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 4D70E7CA2 for ; Mon, 8 Feb 2016 04:48:39 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay3.corp.sgi.com (Postfix) with ESMTP id C58DCAC003 for ; Mon, 8 Feb 2016 02:48:38 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) by cuda.sgi.com with ESMTP id w7xhkZIhTZew5WQj (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NO) for ; Mon, 08 Feb 2016 02:48:35 -0800 (PST) Date: Mon, 8 Feb 2016 11:48:50 +0100 From: Jan Kara Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160208104849.GB9451@quack.suse.cz> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: Theodore Ts'o , linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, xfs@oss.sgi.com, 
linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Andrew Morton , linux-ext4@vger.kernel.org, Dan Williams On Sun 07-02-16 00:19:13, Ross Zwisler wrote: > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > dax_writeback_mapping_range() needs a struct block_device, and it used to > get that from inode->i_sb->s_bdev. This is correct for normal inodes > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time files. > > Instead, call dax_writeback_mapping_range() directly from the filesystem or > raw block device fsync/msync code so that they can supply us with a valid > block device. > > It should be noted that this will reduce the number of calls to > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > called in the various filesystems for operations other than just > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > of ->fsync for hole punch, truncate, and block relocation > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). > > I don't believe that these extra flushes are necessary in the DAX case. In > the page cache case when we have dirty data in the page cache, that data > will be actively lost if we evict a dirty page cache page without flushing > it to media first. For DAX, though, the data will remain consistent with > the physical address to which it was written regardless of whether it's in > the processor cache or not - really the only reason I see to flush is in > response to a fsync or msync so that our data is durable on media in case > of a power loss. The case where we could throw dirty data out of the page > cache and essentially lose writes simply doesn't exist. You should at least note that sync(2) won't make data durable with this patch in the changelog. 
Dave and Christoph have told you that Linux users depend on sync(2) to make data durable and I fully agree with them. Given current options, I think we can live with this for 4.5 but long term this is IMO unacceptable. Honza > > Signed-off-by: Ross Zwisler > --- > fs/block_dev.c | 7 +++++++ > fs/dax.c | 5 ++--- > fs/ext2/file.c | 10 ++++++++++ > fs/ext4/fsync.c | 10 +++++++++- > fs/xfs/xfs_file.c | 12 ++++++++++-- > include/linux/dax.h | 4 ++-- > mm/filemap.c | 6 ------ > 7 files changed, 40 insertions(+), 14 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index fa0507a..312ad44 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -356,8 +356,15 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync) > { > struct inode *bd_inode = bdev_file_inode(filp); > struct block_device *bdev = I_BDEV(bd_inode); > + struct address_space *mapping = bd_inode->i_mapping; > int error; > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > + error = dax_writeback_mapping_range(mapping, bdev, start, end); > + if (error) > + return error; > + } > + > error = filemap_write_and_wait_range(filp->f_mapping, start, end); > if (error) > return error; > diff --git a/fs/dax.c b/fs/dax.c > index 4592241..4b5006a 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev, > * end]. This is required by data integrity operations to ensure file data is > * on persistent storage prior to completion of the operation. 
> */ > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > - loff_t end) > +int dax_writeback_mapping_range(struct address_space *mapping, > + struct block_device *bdev, loff_t start, loff_t end) > { > struct inode *inode = mapping->host; > - struct block_device *bdev = inode->i_sb->s_bdev; > pgoff_t start_index, end_index, pmd_index; > pgoff_t indices[PAGEVEC_SIZE]; > struct pagevec pvec; > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index 2c88d68..d1abf53 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -162,6 +162,16 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) > int ret; > struct super_block *sb = file->f_mapping->host->i_sb; > struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; > +#ifdef CONFIG_FS_DAX > + struct address_space *inode_mapping = file->f_inode->i_mapping; > + > + if (dax_mapping(inode_mapping) && inode_mapping->nrexceptional) { > + ret = dax_writeback_mapping_range(inode_mapping, sb->s_bdev, > + start, end); > + if (ret) > + return ret; > + } > +#endif > > ret = generic_file_fsync(file, start, end, datasync); > if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) { > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c > index 8850254..e9cf53b 100644 > --- a/fs/ext4/fsync.c > +++ b/fs/ext4/fsync.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > > #include "ext4.h" > #include "ext4_jbd2.h" > @@ -83,10 +84,10 @@ static int ext4_sync_parent(struct inode *inode) > * What we do is just kick off a commit and wait on it. This will snapshot the > * inode to disk. 
> */ > - > int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > { > struct inode *inode = file->f_mapping->host; > + struct address_space *mapping = inode->i_mapping; > struct ext4_inode_info *ei = EXT4_I(inode); > journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; > int ret = 0, err; > @@ -97,6 +98,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > > trace_ext4_sync_file_enter(file, datasync); > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > + err = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, > + start, end); > + if (err) > + goto out; > + } > + > if (inode->i_sb->s_flags & MS_RDONLY) { > /* Make sure that we read updated s_mount_flags value */ > smp_rmb(); > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 52883ac..84e95cc 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -209,7 +209,8 @@ xfs_file_fsync( > loff_t end, > int datasync) > { > - struct inode *inode = file->f_mapping->host; > + struct address_space *mapping = file->f_mapping; > + struct inode *inode = mapping->host; > struct xfs_inode *ip = XFS_I(inode); > struct xfs_mount *mp = ip->i_mount; > int error = 0; > @@ -218,7 +219,14 @@ xfs_file_fsync( > > trace_xfs_file_fsync(ip); > > - error = filemap_write_and_wait_range(inode->i_mapping, start, end); > + if (dax_mapping(mapping) && mapping->nrexceptional) { > + error = dax_writeback_mapping_range(mapping, > + xfs_find_bdev_for_inode(inode), start, end); > + if (error) > + return error; > + } > + > + error = filemap_write_and_wait_range(mapping, start, end); > if (error) > return error; > > diff --git a/include/linux/dax.h b/include/linux/dax.h > index bad27b0..8e9f114 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -42,6 +42,6 @@ static inline bool dax_mapping(struct address_space *mapping) > { > return mapping->host && IS_DAX(mapping->host); > } > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t 
start, > - loff_t end); > +int dax_writeback_mapping_range(struct address_space *mapping, > + struct block_device *bdev, loff_t start, loff_t end); > #endif > diff --git a/mm/filemap.c b/mm/filemap.c > index bc94386..c4286eb 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -482,12 +482,6 @@ int filemap_write_and_wait_range(struct address_space *mapping, > { > int err = 0; > > - if (dax_mapping(mapping) && mapping->nrexceptional) { > - err = dax_writeback_mapping_range(mapping, lstart, lend); > - if (err) > - return err; > - } > - > if (mapping->nrpages) { > err = __filemap_fdatawrite_range(mapping, lstart, lend, > WB_SYNC_ALL); > -- > 2.5.0 > > -- Jan Kara SUSE Labs, CR _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 2658B7CA2 for ; Mon, 8 Feb 2016 02:18:18 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay2.corp.sgi.com (Postfix) with ESMTP id 1D6DF304048 for ; Mon, 8 Feb 2016 00:18:14 -0800 (PST) Received: from mail-yw0-f182.google.com (mail-yw0-f182.google.com [209.85.161.182]) by cuda.sgi.com with ESMTP id EIirQHNFbBZiojwt (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Mon, 08 Feb 2016 00:18:12 -0800 (PST) Received: by mail-yw0-f182.google.com with SMTP id u200so18168632ywf.0 for ; Mon, 08 Feb 2016 00:18:12 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20160207215047.GJ31407@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> Date: Mon, 8 Feb 2016 00:18:11 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems From: Dan Williams List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: 
List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , jmoyer , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Matthew Wilcox , Ross Zwisler , linux-ext4 , Andrew Morton On Sun, Feb 7, 2016 at 1:50 PM, Dave Chinner wrote: > On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote: >> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler >> wrote: >> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems >> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). >> > dax_writeback_mapping_range() needs a struct block_device, and it used to >> > get that from inode->i_sb->s_bdev. This is correct for normal inodes >> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw >> > block devices and for XFS real-time files. >> > >> > Instead, call dax_writeback_mapping_range() directly from the filesystem or >> > raw block device fsync/msync code so that they can supply us with a valid >> > block device. >> > >> > It should be noted that this will reduce the number of calls to >> > dax_writeback_mapping_range() because filemap_write_and_wait_range() is >> > called in the various filesystems for operations other than just >> > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside >> > of ->fsync for hole punch, truncate, and block relocation >> > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). >> > >> > I don't believe that these extra flushes are necessary in the DAX case. In >> > the page cache case when we have dirty data in the page cache, that data >> > will be actively lost if we evict a dirty page cache page without flushing >> > it to media first. 
For DAX, though, the data will remain consistent with >> > the physical address to which it was written regardless of whether it's in >> > the processor cache or not - really the only reason I see to flush is in >> > response to a fsync or msync so that our data is durable on media in case >> > of a power loss. The case where we could throw dirty data out of the page >> > cache and essentially lose writes simply doesn't exist. >> > >> > Signed-off-by: Ross Zwisler >> > --- >> > fs/block_dev.c | 7 +++++++ >> > fs/dax.c | 5 ++--- >> > fs/ext2/file.c | 10 ++++++++++ >> > fs/ext4/fsync.c | 10 +++++++++- >> > fs/xfs/xfs_file.c | 12 ++++++++++-- >> > include/linux/dax.h | 4 ++-- >> > mm/filemap.c | 6 ------ >> > 7 files changed, 40 insertions(+), 14 deletions(-) >> >> This sprinkling of dax specific fixups outside of vm_operations_struct >> routines still has me thinking that we are going in the wrong >> direction for fsync/msync support. >> >> If an application is both unaware of DAX and doing mmap I/O it is >> better served by the page cache where writeback is durable by default. >> We expect DAX-aware applications to assume responsibility for cpu >> cache management [1]. Making DAX mmap semantics explicit opt-in >> solves not only durability support, but also the current problem that >> DAX gets silently disabled leaving an app to wonder if it really got a >> direct mapping. DAX also silently picks pud, pmd, or pte mappings >> which is information an application would really like to know at map >> time. >> >> The proposal: make applications explicitly request DAX semantics with >> a new MAP_DAX flag and fail if DAX is unavailable. > > No. > > As I've stated before, the entire purpose of enabling DAX through > existing filesystems like XFS and ext4 is so that existing > applications work with DAX *without modification*.
> > That is, applications can be entirely unaware of the fact that the > filesystem is giving them direct access to the storage because the > access and failure semantics of DAX enabled mmap are *identical to > the existing mmap semantics*. > > Given this, the app doesn't need to care whether DAX is enabled or > not; all that will be seen is a difference in speed of access. > Enabling and disabling DAX is, at this point, purely an > administration decision - if the hardware and filesystem supports > it, it can be turned on without having to wait years for application > developers to add support for it.... Setting aside the current block zeroing problem, you seem to be assuming that DAX will always be faster, and that may not be true at a media level. Waiting years for some applications to determine if DAX makes sense for their use case seems completely reasonable. In the meantime the apps that are already making these changes want to know that a DAX mapping request has not silently dropped back to page cache. They also want to know if they successfully jumped through all the hoops to get a larger than pte mapping. I agree it is useful to be able to force DAX on an unmodified application to see what happens, and it follows that if those applications want to run in that mode they will need functional fsync()... I would feel better if we were talking about specific applications and performance numbers to know if forcing DAX on an application is a debug facility or a production level capability. You seem to have already made that determination and I'm curious what I'm missing.
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id E9AFD29DF5 for ; Mon, 8 Feb 2016 10:12:50 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id D9057304043 for ; Mon, 8 Feb 2016 08:12:47 -0800 (PST) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by cuda.sgi.com with ESMTP id kxdC56C2d9zUYT6f for ; Mon, 08 Feb 2016 08:12:46 -0800 (PST) Date: Mon, 8 Feb 2016 09:12:11 -0700 From: Ross Zwisler Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160208161211.GE2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160208104849.GB9451@quack.suse.cz> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20160208104849.GB9451@quack.suse.cz> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Jan Kara Cc: Theodore Ts'o , linux-nvdimm@lists.01.org, Dan Williams , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Andrew Morton On Mon, Feb 08, 2016 at 11:48:50AM +0100, Jan Kara wrote: > On Sun 07-02-16 00:19:13, Ross Zwisler wrote: > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). 
> > dax_writeback_mapping_range() needs a struct block_device, and it used to > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time files. > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem or > > raw block device fsync/msync code so that they can supply us with a valid > > block device. > > > > It should be noted that this will reduce the number of calls to > > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > > called in the various filesystems for operations other than just > > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > > of ->fsync for hole punch, truncate, and block relocation > > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). > > > > I don't believe that these extra flushes are necessary in the DAX case. In > > the page cache case when we have dirty data in the page cache, that data > > will be actively lost if we evict a dirty page cache page without flushing > > it to media first. For DAX, though, the data will remain consistent with > > the physical address to which it was written regardless of whether it's in > > the processor cache or not - really the only reason I see to flush is in > > response to a fsync or msync so that our data is durable on media in case > > of a power loss. The case where we could throw dirty data out of the page > > cache and essentially lose writes simply doesn't exist. > > You should at least note that sync(2) won't make data durable with this > patch in the changelog. Dave and Christoph have told you that Linux users > depend on sync(2) to make data durable and I fully agree with them. Given > current options, I think we can live with this for 4.5 but long term this > is IMO unacceptable. > > Honza I agree. I'll add a note to the changelog and will work on adding support for sync(2). 
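[Editorial illustration, not part of the original message.] The distinction discussed above can be seen from userspace. The following is a minimal sketch, assuming a regular file on a filesystem that allows shared mappings; the helper name write_durably() is hypothetical and is not taken from the patch. It stores data through a MAP_SHARED mapping and then requests durability with msync(MS_SYNC), which per the patch description is one of the per-file fsync/msync paths where dax_writeback_mapping_range() is now called. A system-wide sync(2) is the path this patch does not cover.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper for illustration only.
 *
 * Sizes the file at `path` to `len` bytes, copies `src` into it through a
 * shared mapping, and asks the kernel to make the range durable with
 * msync(MS_SYNC). Returns 0 on success and -1 on any failure.
 */
int write_durably(const char *path, const void *src, size_t len)
{
	int fd, ret = -1;
	void *map;

	if (len == 0)
		return -1;	/* mmap() rejects zero-length mappings */

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	if (ftruncate(fd, (off_t)len) < 0)
		goto out_close;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Plain stores: with DAX these may still sit in CPU caches. */
	memcpy(map, src, len);

	/*
	 * Per-file flush request. This is the kind of call the patch
	 * relies on; it does not depend on a later sync(2).
	 */
	if (msync(map, len, MS_SYNC) == 0)
		ret = 0;

	if (munmap(map, len) < 0)
		ret = -1;
out_close:
	if (close(fd) < 0)
		ret = -1;
	return ret;
}
```

An application that instead wrote through the mapping and relied only on sync(2) would be in the situation Jan describes: with this patch applied, nothing in that path flushes the DAX mapping.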
> > > > Signed-off-by: Ross Zwisler > > --- > > fs/block_dev.c | 7 +++++++ > > fs/dax.c | 5 ++--- > > fs/ext2/file.c | 10 ++++++++++ > > fs/ext4/fsync.c | 10 +++++++++- > > fs/xfs/xfs_file.c | 12 ++++++++++-- > > include/linux/dax.h | 4 ++-- > > mm/filemap.c | 6 ------ > > 7 files changed, 40 insertions(+), 14 deletions(-) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index fa0507a..312ad44 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -356,8 +356,15 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync) > > { > > struct inode *bd_inode = bdev_file_inode(filp); > > struct block_device *bdev = I_BDEV(bd_inode); > > + struct address_space *mapping = bd_inode->i_mapping; > > int error; > > > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > > + error = dax_writeback_mapping_range(mapping, bdev, start, end); > > + if (error) > > + return error; > > + } > > + > > error = filemap_write_and_wait_range(filp->f_mapping, start, end); > > if (error) > > return error; > > diff --git a/fs/dax.c b/fs/dax.c > > index 4592241..4b5006a 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev, > > * end]. This is required by data integrity operations to ensure file data is > > * on persistent storage prior to completion of the operation. 
> > */ > > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > > - loff_t end) > > +int dax_writeback_mapping_range(struct address_space *mapping, > > + struct block_device *bdev, loff_t start, loff_t end) > > { > > struct inode *inode = mapping->host; > > - struct block_device *bdev = inode->i_sb->s_bdev; > > pgoff_t start_index, end_index, pmd_index; > > pgoff_t indices[PAGEVEC_SIZE]; > > struct pagevec pvec; > > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > > index 2c88d68..d1abf53 100644 > > --- a/fs/ext2/file.c > > +++ b/fs/ext2/file.c > > @@ -162,6 +162,16 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) > > int ret; > > struct super_block *sb = file->f_mapping->host->i_sb; > > struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; > > +#ifdef CONFIG_FS_DAX > > + struct address_space *inode_mapping = file->f_inode->i_mapping; > > + > > + if (dax_mapping(inode_mapping) && inode_mapping->nrexceptional) { > > + ret = dax_writeback_mapping_range(inode_mapping, sb->s_bdev, > > + start, end); > > + if (ret) > > + return ret; > > + } > > +#endif > > > > ret = generic_file_fsync(file, start, end, datasync); > > if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) { > > diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c > > index 8850254..e9cf53b 100644 > > --- a/fs/ext4/fsync.c > > +++ b/fs/ext4/fsync.c > > @@ -27,6 +27,7 @@ > > #include > > #include > > #include > > +#include > > > > #include "ext4.h" > > #include "ext4_jbd2.h" > > @@ -83,10 +84,10 @@ static int ext4_sync_parent(struct inode *inode) > > * What we do is just kick off a commit and wait on it. This will snapshot the > > * inode to disk. 
> > */ > > - > > int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > > { > > struct inode *inode = file->f_mapping->host; > > + struct address_space *mapping = inode->i_mapping; > > struct ext4_inode_info *ei = EXT4_I(inode); > > journal_t *journal = EXT4_SB(inode->i_sb)->s_journal; > > int ret = 0, err; > > @@ -97,6 +98,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) > > > > trace_ext4_sync_file_enter(file, datasync); > > > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > > + err = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, > > + start, end); > > + if (err) > > + goto out; > > + } > > + > > if (inode->i_sb->s_flags & MS_RDONLY) { > > /* Make sure that we read updated s_mount_flags value */ > > smp_rmb(); > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > > index 52883ac..84e95cc 100644 > > --- a/fs/xfs/xfs_file.c > > +++ b/fs/xfs/xfs_file.c > > @@ -209,7 +209,8 @@ xfs_file_fsync( > > loff_t end, > > int datasync) > > { > > - struct inode *inode = file->f_mapping->host; > > + struct address_space *mapping = file->f_mapping; > > + struct inode *inode = mapping->host; > > struct xfs_inode *ip = XFS_I(inode); > > struct xfs_mount *mp = ip->i_mount; > > int error = 0; > > @@ -218,7 +219,14 @@ xfs_file_fsync( > > > > trace_xfs_file_fsync(ip); > > > > - error = filemap_write_and_wait_range(inode->i_mapping, start, end); > > + if (dax_mapping(mapping) && mapping->nrexceptional) { > > + error = dax_writeback_mapping_range(mapping, > > + xfs_find_bdev_for_inode(inode), start, end); > > + if (error) > > + return error; > > + } > > + > > + error = filemap_write_and_wait_range(mapping, start, end); > > if (error) > > return error; > > > > diff --git a/include/linux/dax.h b/include/linux/dax.h > > index bad27b0..8e9f114 100644 > > --- a/include/linux/dax.h > > +++ b/include/linux/dax.h > > @@ -42,6 +42,6 @@ static inline bool dax_mapping(struct address_space *mapping) > > { > > 
return mapping->host && IS_DAX(mapping->host); > > } > > -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, > > - loff_t end); > > +int dax_writeback_mapping_range(struct address_space *mapping, > > + struct block_device *bdev, loff_t start, loff_t end); > > #endif > > diff --git a/mm/filemap.c b/mm/filemap.c > > index bc94386..c4286eb 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -482,12 +482,6 @@ int filemap_write_and_wait_range(struct address_space *mapping, > > { > > int err = 0; > > > > - if (dax_mapping(mapping) && mapping->nrexceptional) { > > - err = dax_writeback_mapping_range(mapping, lstart, lend); > > - if (err) > > - return err; > > - } > > - > > if (mapping->nrpages) { > > err = __filemap_fdatawrite_range(mapping, lstart, lend, > > WB_SYNC_ALL); > > -- > > 2.5.0 > > > > > -- > Jan Kara > SUSE Labs, CR _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id C7FAF7CA2 for ; Mon, 8 Feb 2016 12:31:30 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay3.corp.sgi.com (Postfix) with ESMTP id 43F49AC003 for ; Mon, 8 Feb 2016 10:31:30 -0800 (PST) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by cuda.sgi.com with ESMTP id IAv8Ku4ga4L2pNle for ; Mon, 08 Feb 2016 10:31:29 -0800 (PST) Date: Mon, 8 Feb 2016 11:31:12 -0700 From: Ross Zwisler Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160208183112.GF2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dan Williams Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , jmoyer , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Matthew Wilcox , Ross Zwisler , linux-ext4 , Andrew Morton On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote: > The proposal: make applications explicitly request DAX semantics with > a new MAP_DAX flag and fail if DAX is unavailable. Document that a > successful MAP_DAX request mandates that the application assumes > responsibility for cpu cache management. > Require that all applications that mmap the file agree on MAP_DAX. I think this proposal could run into issues with aliasing. For example, say you have two threads accessing the same region, and one wants to use DAX and the other wants to use the page cache. What happens? If we satisfy both requests, we end up with one user reading and writing to the page cache, while the other is reading and writing directly to the media. They can't see each other's changes, and you get data corruption. If we satisfy the request of whoever asked first, sort of lock the inode into that mode, and then return an error to the second thread because they are asking for the other mode, we have now introduced a new weird failure case where mmaps can randomly fail based on the behavior of other applications. I think this is where you were going with the last line quoted above, but I don't understand how it would work in an acceptable way. It seems like we have to have the decision about whether or not to use DAX made in the same way for all users of the inode so that we don't run into these types of conflicts. 
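[Editorial illustration, not part of the original message.] The aliasing hazard described above can be sketched with a toy userspace model. This is only a simulation under stated assumptions: struct toy_file and the toy_*() helpers are hypothetical names, a single 16-byte "block" stands in for a page, and nothing here touches the real page cache or DAX code. One user writes to a cached copy, the other writes directly to the "media", and writeback later copies the whole cached block back.

```c
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16

/* Toy model: one block of "media" plus a separate cached copy of it. */
struct toy_file {
	char media[TOY_BLOCK_SIZE];	/* what a direct (DAX-style) user sees */
	char cache[TOY_BLOCK_SIZE];	/* what a page-cache-style user sees */
	int cache_dirty;
};

/* Fill the media with `initial` and populate the cache from it. */
void toy_init(struct toy_file *f, const char *initial)
{
	memset(f, 0, sizeof(*f));
	strncpy(f->media, initial, TOY_BLOCK_SIZE - 1);
	memcpy(f->cache, f->media, TOY_BLOCK_SIZE);
}

/* Cached user: the store lands in the copy and stays there for now. */
void toy_cached_write(struct toy_file *f, size_t off, char c)
{
	if (off < TOY_BLOCK_SIZE) {
		f->cache[off] = c;
		f->cache_dirty = 1;
	}
}

/* Direct user: the store goes straight to the media. */
void toy_direct_write(struct toy_file *f, size_t off, char c)
{
	if (off < TOY_BLOCK_SIZE)
		f->media[off] = c;
}

/* Writeback copies the whole cached block over the media. */
void toy_writeback(struct toy_file *f)
{
	if (f->cache_dirty) {
		memcpy(f->media, f->cache, TOY_BLOCK_SIZE);
		f->cache_dirty = 0;
	}
}
```

If the block starts as "AAAA", the cached user writes 'C' at offset 0, and the direct user writes 'D' at offset 1, then neither sees the other's store before writeback. After toy_writeback() the media holds the cached copy, so the direct user's 'D' is silently overwritten. That is the data-corruption case the message describes when both mapping modes are allowed on the same inode.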
> This also solves > the future problem of DAX support on virtually tagged cache > architectures where it is difficult for the kernel to know what alias > addresses need flushing. > > [1]: https://github.com/pmem/nvml _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 4B1AE7CA2 for ; Mon, 8 Feb 2016 13:24:04 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay3.corp.sgi.com (Postfix) with ESMTP id E0B87AC003 for ; Mon, 8 Feb 2016 11:24:00 -0800 (PST) Received: from mail-yw0-f181.google.com (mail-yw0-f181.google.com [209.85.161.181]) by cuda.sgi.com with ESMTP id 8mb44bIsBi7Zzqpl (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Mon, 08 Feb 2016 11:23:58 -0800 (PST) Received: by mail-yw0-f181.google.com with SMTP id h129so110048181ywb.1 for ; Mon, 08 Feb 2016 11:23:58 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20160208183112.GF2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160208183112.GF2343@linux.intel.com> Date: Mon, 8 Feb 2016 11:23:56 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems From: Dan Williams List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler , Dan Williams , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers , jmoyer On Mon, Feb 
8, 2016 at 10:31 AM, Ross Zwisler wrote:
> On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote:
>> The proposal: make applications explicitly request DAX semantics with
>> a new MAP_DAX flag and fail if DAX is unavailable. Document that a
>> successful MAP_DAX request mandates that the application assumes
>> responsibility for cpu cache management.
>
>> Require that all applications that mmap the file agree on MAP_DAX.
>
> I think this proposal could run into issues with aliasing. For example, say
> you have two threads accessing the same region, and one wants to use DAX and
> the other wants to use the page cache. What happens?
>
> If we satisfy both requests, we end up with one user reading and writing to
> the page cache, while the other is reading and writing directly to the media.
> They can't see each other's changes, and you get data corruption.
>
> If we satisfy the request of whoever asked first, sort of lock the inode into
> that mode, and then return an error to the second thread because they are
> asking for the other mode, we have now introduced a new weird failure case
> where mmaps can randomly fail based on the behavior of other applications.
> I think this is where you were going with the last line quoted above, but I
> don't understand how it would work in an acceptable way.
>
> It seems like we have to have the decision about whether or not to use DAX
> made in the same way for all users of the inode so that we don't run into
> these types of conflicts.

We haven't solved the conflict problem by pushing it out to the inode,
see the recent revert of blkdev_daxset(). We're heading in a direction
where an application can't develop its own policies about DAX usage,
it's always an administrative decision.

However, maybe that is ok. Dave is right that if an application is
using an existing filesystem it should get all the existing semantics.
If the existing semantics (or overhead of maintaining the existing semantics) turn out not to fit a given pmem-aware application then we may just need new interfaces (separate from fs/dax.c) to persistent memory. I admit we're a ways off from knowing if that is needed. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 19A657CA2 for ; Mon, 8 Feb 2016 14:18:18 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay1.corp.sgi.com (Postfix) with ESMTP id 0E49D8F8040 for ; Mon, 8 Feb 2016 12:18:15 -0800 (PST) Received: from ipmail06.adl2.internode.on.net (ipmail06.adl2.internode.on.net [150.101.137.129]) by cuda.sgi.com with ESMTP id qtkiMBN215QYiHAE for ; Mon, 08 Feb 2016 12:18:10 -0800 (PST) Date: Tue, 9 Feb 2016 07:18:08 +1100 From: Dave Chinner Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160208201808.GK27429@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dan Williams Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , jmoyer , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Matthew Wilcox , Ross Zwisler , linux-ext4 , Andrew Morton On Mon, Feb 08, 2016 at 12:18:11AM -0800, Dan Williams wrote: > On Sun, Feb 7, 2016 at 1:50 PM, Dave Chinner wrote: > > On Sun, Feb 
07, 2016 at 11:13:51AM -0800, Dan Williams wrote: > >> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler > >> wrote: > >> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > >> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > >> > dax_writeback_mapping_range() needs a struct block_device, and it used to > >> > get that from inode->i_sb->s_bdev. This is correct for normal inodes > >> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > >> > block devices and for XFS real-time files. > >> > > >> > Instead, call dax_writeback_mapping_range() directly from the filesystem or > >> > raw block device fsync/msync code so that they can supply us with a valid > >> > block device. > >> > > >> > It should be noted that this will reduce the number of calls to > >> > dax_writeback_mapping_range() because filemap_write_and_wait_range() is > >> > called in the various filesystems for operations other than just > >> > fsync/msync. Both ext4 & XFS call filemap_write_and_wait_range() outside > >> > of ->fsync for hole punch, truncate, and block relocation > >> > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()). > >> > > >> > I don't believe that these extra flushes are necessary in the DAX case. In > >> > the page cache case when we have dirty data in the page cache, that data > >> > will be actively lost if we evict a dirty page cache page without flushing > >> > it to media first. For DAX, though, the data will remain consistent with > >> > the physical address to which it was written regardless of whether it's in > >> > the processor cache or not - really the only reason I see to flush is in > >> > response to a fsync or msync so that our data is durable on media in case > >> > of a power loss. The case where we could throw dirty data out of the page > >> > cache and essentially lose writes simply doesn't exist. 
> >> > > >> > Signed-off-by: Ross Zwisler > >> > --- > >> > fs/block_dev.c | 7 +++++++ > >> > fs/dax.c | 5 ++--- > >> > fs/ext2/file.c | 10 ++++++++++ > >> > fs/ext4/fsync.c | 10 +++++++++- > >> > fs/xfs/xfs_file.c | 12 ++++++++++-- > >> > include/linux/dax.h | 4 ++-- > >> > mm/filemap.c | 6 ------ > >> > 7 files changed, 40 insertions(+), 14 deletions(-) > >> > >> This sprinkling of dax specific fixups outside of vm_operations_struct > >> routines still has me thinking that we are going in the wrong > >> direction for fsync/msync support. > >> > >> If an application is both unaware of DAX and doing mmap I/O it is > >> better served by the page cache where writeback is durable by default. > >> We expect DAX-aware applications to assume responsibility for cpu > >> cache management [1]. Making DAX mmap semantics explicit opt-in > >> solves not only durability support, but also the current problem that > >> DAX gets silently disabled leaving an app to wonder if it really got a > >> direct mapping. DAX also silently picks pud, pmd, or pte mappings > >> which is information an application would really like to know at map > >> time. > >> > >> The proposal: make applications explicitly request DAX semantics with > >> a new MAP_DAX flag and fail if DAX is unavailable. > > > > No. > > > > As I've stated before, the entire purpose of enabling DAX through > > existing filesytsems like XFS and ext4 is so that existing > > applications work with DAX *without modification*. > > > > That is, applications can be entirely unaware of the fact that the > > filesystem is giving them direct access to the storage because the > > access and failure semantics of DAX enabled mmap are *identical to > > the existing mmap semantics*. > > > > Given this, the app doesn't need to care whether DAX is enabled or > > not; all that will be seen is a difference in speed of access. 
> > Enabling and disabling DAX is, at this point, purely an > > administration decision - if the hardware and filesystem supports > > it, it can be turned on without having to wait years for application > > developers to add support for it.... > > Setting aside the current block zeroing problem you seem to assuming > that DAX will always be faster and that may not be true at a media > level. Waiting years for some applications to determine if DAX makes > sense for their use case seems completely reasonable. In the meantime > the apps that are already making these changes want to know that a DAX > mapping request has not silently dropped backed to page cache. They > also want to know if they successfully jumped through all the hoops to > get a larger than pte mapping. > > I agree it is useful to be able to force DAX on an unmodified > application to see what happens, and it follows that if those > applications want to run in that mode they will need functional > fsync()... > > I would feel better if we were talking about specific applications and > performance numbers to know if forcing DAX on application is a debug > facility or a production level capability. You seem to have already > made that determination and I'm curious what I'm missing. I'm not setting any policy here at all. This whole argument is based around the DAX mount option doing "global fs enable or silently turning it off" and the application not knowing about that. The whole point of having a persistent per-inode DAX flags is that it is a policy mechanism, not a policy. The application can, if it is DAX aware, directly control whether DAX is used on a file or not. The application can even query and clear that persistent inode flag if it is configured not to (or cannot) use DAX. If the filesystem cannot support DAX, then we can error out attempts to set the DAX flag and then the app knows DAX is not available. i.e. the attempt to set policy failed. 
If the flag is set, then the inode will *always* use DAX - there is no "fall back to page cache" when DAX is enabled. If the applicaiton is not DAX aware, then the admin can control the DAX policy by manipulating these flags themselves, and hence control whether DAX is used by the application or not. If you think I'm dictating policy for DAX users and application, then you haven't understood anything I've previously said about why the DAX mount option needs to die before any of this is considered production ready. DAX is not an opaque "all or nothing" option. XFS will provide apps and admins with fine-grained, persistent, discoverable policy flags to allow admins and applications to set DAX policies however they see fit. This simply cannot be done if the only knob you have is a mount option that may or may not stick. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id C15327CA2 for ; Mon, 8 Feb 2016 14:55:30 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay3.corp.sgi.com (Postfix) with ESMTP id 43C54AC005 for ; Mon, 8 Feb 2016 12:55:27 -0800 (PST) Received: from mail-yw0-f170.google.com (mail-yw0-f170.google.com [209.85.161.170]) by cuda.sgi.com with ESMTP id sIznThIdF7mcEiZw (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Mon, 08 Feb 2016 12:55:24 -0800 (PST) Received: by mail-yw0-f170.google.com with SMTP id g127so111759114ywf.2 for ; Mon, 08 Feb 2016 12:55:24 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20160208201808.GK27429@dastard> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> 
<20160208201808.GK27429@dastard> Date: Mon, 8 Feb 2016 12:55:24 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems From: Dan Williams List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , jmoyer , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Matthew Wilcox , Ross Zwisler , linux-ext4 , Andrew Morton On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner wrote: [..] >> Setting aside the current block zeroing problem you seem to assuming >> that DAX will always be faster and that may not be true at a media >> level. Waiting years for some applications to determine if DAX makes >> sense for their use case seems completely reasonable. In the meantime >> the apps that are already making these changes want to know that a DAX >> mapping request has not silently dropped backed to page cache. They >> also want to know if they successfully jumped through all the hoops to >> get a larger than pte mapping. >> >> I agree it is useful to be able to force DAX on an unmodified >> application to see what happens, and it follows that if those >> applications want to run in that mode they will need functional >> fsync()... >> >> I would feel better if we were talking about specific applications and >> performance numbers to know if forcing DAX on application is a debug >> facility or a production level capability. You seem to have already >> made that determination and I'm curious what I'm missing. > > I'm not setting any policy here at all. This whole argument is > based around the DAX mount option doing "global fs enable or > silently turning it off" and the application not knowing about that. 
>
> The whole point of having a persistent per-inode DAX flag is that
> it is a policy mechanism, not a policy. The application can, if it
> is DAX aware, directly control whether DAX is used on a file or not.
> The application can even query and clear that persistent inode flag
> if it is configured not to (or cannot) use DAX.
>
> If the filesystem cannot support DAX, then we can error out attempts
> to set the DAX flag and then the app knows DAX is not available.
> i.e. the attempt to set policy failed. If the flag is set, then the
> inode will *always* use DAX - there is no "fall back to page cache"
> when DAX is enabled.
>
> If the application is not DAX aware, then the admin can control the
> DAX policy by manipulating these flags themselves, and hence control
> whether DAX is used by the application or not.
>
> If you think I'm dictating policy for DAX users and application,
> then you haven't understood anything I've previously said about why
> the DAX mount option needs to die before any of this is considered
> production ready. DAX is not an opaque "all or nothing" option. XFS
> will provide apps and admins with fine-grained, persistent,
> discoverable policy flags to allow admins and applications to set
> DAX policies however they see fit. This simply cannot be done if the
> only knob you have is a mount option that may or may not stick.

I agree the mount option needs to die, and I fully grok the reasoning.
What I'm concerned with is that a system using fully-DAX-aware
applications is forced to incur the overhead of maintaining *sync
semantics, periodic sync(2) in particular, even if it is not relying
on those semantics.

However, like I said in my other mail, we can solve that with
alternate interfaces to persistent memory if that becomes an issue and
not require that "disable *sync" capability to come through DAX.
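
For readers following the per-inode flag argument quoted above, here is a
minimal userspace sketch of what "query, set and clear the persistent DAX
flag" could look like. It assumes the flag is exposed as FS_XFLAG_DAX through
the generic FS_IOC_FSGETXATTR / FS_IOC_FSSETXATTR ioctls from <linux/fs.h>;
the thread itself does not name the interface, and the helper names
dax_flag_get() and dax_flag_set() are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/* Assumption: value taken from later uapi headers, used only if missing. */
#ifndef FS_XFLAG_DAX
#define FS_XFLAG_DAX 0x00008000
#endif

/* Hypothetical helper: 1 if the DAX flag is set, 0 if clear, -1 on error. */
static int dax_flag_get(const char *path)
{
	struct fsxattr fsx;
	int fd, ret, saved;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
	saved = errno;		/* close() may overwrite errno */
	close(fd);
	errno = saved;

	if (ret < 0)
		return -1;
	return (fsx.fsx_xflags & FS_XFLAG_DAX) ? 1 : 0;
}

/*
 * Hypothetical helper: set (enable != 0) or clear the DAX flag.
 * Returns 0 on success and -1 on error; a filesystem that cannot
 * support DAX is expected to fail the set, which is how the caller
 * learns that "the attempt to set policy failed".
 */
static int dax_flag_set(const char *path, int enable)
{
	struct fsxattr fsx;
	int fd, ret, saved;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Read-modify-write so the other xflags are preserved. */
	ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
	if (ret == 0) {
		if (enable)
			fsx.fsx_xflags |= FS_XFLAG_DAX;
		else
			fsx.fsx_xflags &= ~FS_XFLAG_DAX;
		ret = ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
	}
	saved = errno;
	close(fd);
	errno = saved;

	return ret < 0 ? -1 : 0;
}
```

An administrator could drive the same two calls from a small tool for
applications that are not DAX aware, which is the "policy mechanism, not a
policy" split described in the quoted text.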
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 7EC347CA2 for ; Mon, 8 Feb 2016 14:58:49 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay3.corp.sgi.com (Postfix) with ESMTP id 0EC91AC005 for ; Mon, 8 Feb 2016 12:58:48 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by cuda.sgi.com with ESMTP id PGDLL8VLWaqa4o6q (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO) for ; Mon, 08 Feb 2016 12:58:48 -0800 (PST) From: Jeff Moyer Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> <20160208201808.GK27429@dastard> Date: Mon, 08 Feb 2016 15:58:44 -0500 In-Reply-To: (Dan Williams's message of "Mon, 8 Feb 2016 12:55:24 -0800") Message-ID: MIME-Version: 1.0 List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dan Williams Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Matthew Wilcox , Ross Zwisler , linux-ext4 , Andrew Morton Dan Williams writes: > I agree the mount option needs to die, and I fully grok the reasoning. > What I'm concerned with is that a system using fully-DAX-aware > applications is forced to incur the overhead of maintaining *sync > semantics, periodic sync(2) in particular, even if it is not relying > on those semantics. 
> > However, like I said in my other mail, we can solve that with > alternate interfaces to persistent memory if that becomes an issue and > not require that "disable *sync" capability to come through DAX. What do you envision these alternate interfaces looking like? -Jeff _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 444C07CA2 for ; Mon, 8 Feb 2016 16:05:39 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay1.corp.sgi.com (Postfix) with ESMTP id 343A48F8040 for ; Mon, 8 Feb 2016 14:05:36 -0800 (PST) Received: from mail-yw0-f170.google.com (mail-yw0-f170.google.com [209.85.161.170]) by cuda.sgi.com with ESMTP id aX1d6S6gEX8Ov3zB (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Mon, 08 Feb 2016 14:05:34 -0800 (PST) Received: by mail-yw0-f170.google.com with SMTP id q190so113406520ywd.3 for ; Mon, 08 Feb 2016 14:05:34 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> <20160208201808.GK27429@dastard> Date: Mon, 8 Feb 2016 14:05:34 -0800 Message-ID: Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems From: Dan Williams List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Jeff Moyer Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Matthew Wilcox , Ross Zwisler , linux-ext4 , Andrew 
Morton On Mon, Feb 8, 2016 at 12:58 PM, Jeff Moyer wrote: > Dan Williams writes: > >> I agree the mount option needs to die, and I fully grok the reasoning. >> What I'm concerned with is that a system using fully-DAX-aware >> applications is forced to incur the overhead of maintaining *sync >> semantics, periodic sync(2) in particular, even if it is not relying >> on those semantics. >> >> However, like I said in my other mail, we can solve that with >> alternate interfaces to persistent memory if that becomes an issue and >> not require that "disable *sync" capability to come through DAX. > > What do you envision these alternate interfaces looking like? Well, plan-A was making DAX be explicit opt-in for applications, I haven't thought too much about plan-B. I expect it to be driven by real performance numbers and application use cases once the *sync compat work completes. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id C281A29DF5 for ; Mon, 8 Feb 2016 09:34:54 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay3.corp.sgi.com (Postfix) with ESMTP id 5E44CAC001 for ; Mon, 8 Feb 2016 07:34:54 -0800 (PST) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by cuda.sgi.com with ESMTP id LkV3j53B2N0TXoAN for ; Mon, 08 Feb 2016 07:34:53 -0800 (PST) Date: Mon, 8 Feb 2016 08:34:43 -0700 From: Ross Zwisler Subject: Re: [PATCH 1/2] dax: pass bdev argument to dax_clear_blocks() Message-ID: <20160208153443.GC2343@linux.intel.com> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-2-git-send-email-ross.zwisler@linux.intel.com> <20160207220329.GK31407@dastard> <20160208014409.GA2343@linux.intel.com> <20160208051725.GM31407@dastard> MIME-Version: 1.0 
Content-Disposition: inline In-Reply-To: <20160208051725.GM31407@dastard> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: Theodore Ts'o , linux-nvdimm@lists.01.org, Dan Williams , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Andrew Morton On Mon, Feb 08, 2016 at 04:17:25PM +1100, Dave Chinner wrote: > On Sun, Feb 07, 2016 at 06:44:09PM -0700, Ross Zwisler wrote: > > On Mon, Feb 08, 2016 at 09:03:29AM +1100, Dave Chinner wrote: > > > On Sun, Feb 07, 2016 at 12:19:12AM -0700, Ross Zwisler wrote: > > > > dax_clear_blocks() needs a valid struct block_device and previously it was > > > > using inode->i_sb->s_bdev in all cases. This is correct for normal inodes > > > > on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > > > block devices and for XFS real-time devices. > > > > > > > > Instead, have the caller pass in a struct block_device pointer which it > > > > knows to be correct. > > > .... 
> > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > > index 07ef29b..f722ba2 100644
> > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > @@ -73,9 +73,11 @@ xfs_zero_extent(
> > > >  	xfs_daddr_t sector = xfs_fsb_to_db(ip, start_fsb);
> > > >  	sector_t block = XFS_BB_TO_FSBT(mp, sector);
> > > >  	ssize_t size = XFS_FSB_TO_B(mp, count_fsb);
> > > > +	struct inode *inode = VFS_I(ip);
> > > >
> > > >  	if (IS_DAX(VFS_I(ip)))
> > > > -		return dax_clear_blocks(VFS_I(ip), block, size);
> > > > +		return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode),
> > > > +				block, size);
> > >
> > > Get rid of the local inode variable and use VFS_I(ip) like the code
> > > originally did. Do not change code that is unrelated to the
> > > modification being made, especially when it results in making
> > > the code an inconsistent mess of mixed pointer constructs....
> >
> > The local 'inode' variable was added to avoid multiple calls for VFS_I() for
> > the same 'ip'.
>
> My point is you didn't achieve that. The end result of your patch
> is:
>
> 	struct inode *inode = VFS_I(ip);
>
> 	if (IS_DAX(VFS_I(ip)))
> 		return dax_clear_blocks(inode, xfs_find_bdev_for_inode(inode),
> 				block, size);
>
> So now we have a local variable, but we still have 2 calls to
> VFS_I(ip). i.e. this makes the code harder to read and understand
> than before for no benefit.

*facepalm* Yep, thanks for the correction.
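
A sketch of the shape the review above seems to be asking for: no local
variable, VFS_I(ip) used consistently, and the bdev passed in explicitly.
This is a standalone model, not the XFS code. All types and helper bodies
below are simplified stand-ins that reuse the kernel names only for
readability, and xfs_zero_extent_model() is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in types; the real kernel structures are far larger. */
struct block_device { int id; };
struct inode { bool is_dax; struct block_device *bdev; };
struct xfs_inode { struct inode i_vnode; };

/* Model of VFS_I(): map the XFS inode to its embedded VFS inode. */
static inline struct inode *VFS_I(struct xfs_inode *ip)
{
	return &ip->i_vnode;
}

/* Model of IS_DAX(): the real macro tests an inode flag bit. */
static inline bool IS_DAX(const struct inode *inode)
{
	return inode->is_dax;
}

/* Stub: the real helper picks the data or real-time device. */
static struct block_device *xfs_find_bdev_for_inode(struct inode *inode)
{
	return inode->bdev;
}

/* Stub with the new signature from patch 1: bdev is an argument. */
static int dax_clear_blocks(struct inode *inode, struct block_device *bdev,
			    long block, long size)
{
	(void)block;
	(void)size;
	return (inode != NULL && bdev != NULL) ? 0 : -1;
}

/* Returns 0 after the DAX path, 1 where the non-DAX path is elided. */
static int xfs_zero_extent_model(struct xfs_inode *ip, long block, long size)
{
	if (IS_DAX(VFS_I(ip)))
		return dax_clear_blocks(VFS_I(ip),
					xfs_find_bdev_for_inode(VFS_I(ip)),
					block, size);
	return 1;
}
```

Whether a follow-up version of the patch took exactly this form is not shown
in this part of the thread.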
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id F10EC7CA2 for ; Tue, 9 Feb 2016 03:43:44 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay3.corp.sgi.com (Postfix) with ESMTP id 6B86EAC004 for ; Tue, 9 Feb 2016 01:43:44 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) by cuda.sgi.com with ESMTP id NC5Jbw1i4SZt9PFW (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NO) for ; Tue, 09 Feb 2016 01:43:40 -0800 (PST) Date: Tue, 9 Feb 2016 10:43:53 +0100 From: Jan Kara Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems Message-ID: <20160209094353.GF9451@quack.suse.cz> References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com> <1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com> <20160207215047.GJ31407@dastard> <20160208201808.GK27429@dastard> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dan Williams Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , jmoyer , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Matthew Wilcox , Ross Zwisler , linux-ext4 , Andrew Morton On Mon 08-02-16 12:55:24, Dan Williams wrote: > On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner wrote: > [..] > >> Setting aside the current block zeroing problem you seem to assuming > >> that DAX will always be faster and that may not be true at a media > >> level. 
Waiting years for some applications to determine if DAX makes > >> sense for their use case seems completely reasonable. In the meantime > >> the apps that are already making these changes want to know that a DAX > >> mapping request has not silently dropped backed to page cache. They > >> also want to know if they successfully jumped through all the hoops to > >> get a larger than pte mapping. > >> > >> I agree it is useful to be able to force DAX on an unmodified > >> application to see what happens, and it follows that if those > >> applications want to run in that mode they will need functional > >> fsync()... > >> > >> I would feel better if we were talking about specific applications and > >> performance numbers to know if forcing DAX on application is a debug > >> facility or a production level capability. You seem to have already > >> made that determination and I'm curious what I'm missing. > > > > I'm not setting any policy here at all. This whole argument is > > based around the DAX mount option doing "global fs enable or > > silently turning it off" and the application not knowing about that. > > > > The whole point of having a persistent per-inode DAX flags is that > > it is a policy mechanism, not a policy. The application can, if it > > is DAX aware, directly control whether DAX is used on a file or not. > > The application can even query and clear that persistent inode flag > > if it is configured not to (or cannot) use DAX. > > > > If the filesystem cannot support DAX, then we can error out attempts > > to set the DAX flag and then the app knows DAX is not available. > > i.e. the attempt to set policy failed. If the flag is set, then the > > inode will *always* use DAX - there is no "fall back to page cache" > > when DAX is enabled. > > > > If the applicaiton is not DAX aware, then the admin can control the > > DAX policy by manipulating these flags themselves, and hence control > > whether DAX is used by the application or not. 
> >
> > If you think I'm dictating policy for DAX users and application,
> > then you haven't understood anything I've previously said about why
> > the DAX mount option needs to die before any of this is considered
> > production ready. DAX is not an opaque "all or nothing" option. XFS
> > will provide apps and admins with fine-grained, persistent,
> > discoverable policy flags to allow admins and applications to set
> > DAX policies however they see fit. This simply cannot be done if the
> > only knob you have is a mount option that may or may not stick.
>
> I agree the mount option needs to die, and I fully grok the reasoning.
> What I'm concerned with is that a system using fully-DAX-aware
> applications is forced to incur the overhead of maintaining *sync
> semantics, periodic sync(2) in particular, even if it is not relying
> on those semantics.

Let me somewhat correct this: IMO the hard requirement is maintaining sync(2)
semantics. Periodic writeback does not have any hard durability guarantees
and we are free to ignore such requests in ->writepages() (that function
has enough information in the writeback_control structure to differentiate
between periodic writeback and data integrity sync) if we decide it is
useful. Actually, we could do that even for 4.5.
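
To make the ->writepages() point concrete, here is a small userspace model of
the check being described. The enum values WB_SYNC_NONE and WB_SYNC_ALL and
the sync_mode field mirror the kernel's struct writeback_control, cut down to
what the example needs. model_dax_flush() and model_dax_writepages() are
hypothetical names, and this is a sketch of the idea, not the attached patch.

```c
/* Cut-down stand-ins for the definitions in include/linux/writeback.h. */
enum writeback_sync_modes {
	WB_SYNC_NONE,	/* periodic/background writeback: no durability promise */
	WB_SYNC_ALL,	/* data integrity sync: sync(2), fsync(2), msync(2) */
};

struct writeback_control {
	long nr_to_write;
	enum writeback_sync_modes sync_mode;
};

/* Counts how often the (modelled) CPU cache flush actually runs. */
static int flush_calls;

/* Hypothetical stand-in for flushing dirty DAX ranges to media. */
static int model_dax_flush(void)
{
	flush_calls++;
	return 0;
}

/*
 * Model of a DAX-aware ->writepages(): skip the flush for periodic
 * writeback and do the work only for data integrity syncs.
 */
static int model_dax_writepages(struct writeback_control *wbc)
{
	if (wbc->sync_mode != WB_SYNC_ALL)
		return 0;	/* nothing is lost by deferring for DAX */

	return model_dax_flush();
}
```

The design choice is that a DAX mapping has no page cache copy to evict, so
skipping the periodic case costs nothing in correctness and only the
integrity callers pay for the flush.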
Honza
--
Jan Kara
SUSE Labs, CR
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111])
	by oss.sgi.com (Postfix) with ESMTP id B418B29DF5;
	Tue, 9 Feb 2016 10:01:31 -0600 (CST)
Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11])
	by relay1.corp.sgi.com (Postfix) with ESMTP id 79D298F804C;
	Tue, 9 Feb 2016 08:01:28 -0800 (PST)
Received: from mx2.suse.de (mx2.suse.de [195.135.220.15])
	by cuda.sgi.com with ESMTP id eg8HO4iDUsDLwfQz
	(version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NO);
	Tue, 09 Feb 2016 08:01:22 -0800 (PST)
Date: Tue, 9 Feb 2016 17:01:34 +0100
From: Jan Kara
Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems
Message-ID: <20160209160134.GA12245@quack.suse.cz>
References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com>
	<1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com>
	<20160207215047.GJ31407@dastard>
	<20160208201808.GK27429@dastard>
	<20160209094353.GF9451@quack.suse.cz>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="LQksG6bCIzRHxTLp"
Content-Disposition: inline
In-Reply-To: <20160209094353.GF9451@quack.suse.cz>
List-Id: XFS Filesystem from SGI
Errors-To: xfs-bounces@oss.sgi.com
Sender: xfs-bounces@oss.sgi.com
To: Dan Williams
Cc: Theodore Ts'o, "linux-nvdimm@lists.01.org",
	"linux-kernel@vger.kernel.org", XFS Developers, Linux MM, jmoyer,
	Andreas Dilger, Alexander Viro, Jan Kara, linux-fsdevel,
	Matthew Wilcox, Ross Zwisler, linux-ext4, Andrew Morton

--LQksG6bCIzRHxTLp
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue 09-02-16 10:43:53, Jan Kara wrote:
> On Mon 08-02-16 12:55:24, Dan Williams wrote:
> > On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner wrote:
> > [..]
> > >> Setting aside the current block zeroing problem you seem to be assuming
> > >> that DAX will always be faster and that may not be true at a media
> > >> level. Waiting years for some applications to determine if DAX makes
> > >> sense for their use case seems completely reasonable. In the meantime
> > >> the apps that are already making these changes want to know that a DAX
> > >> mapping request has not silently dropped back to page cache. They
> > >> also want to know if they successfully jumped through all the hoops to
> > >> get a larger than pte mapping.
> > >>
> > >> I agree it is useful to be able to force DAX on an unmodified
> > >> application to see what happens, and it follows that if those
> > >> applications want to run in that mode they will need functional
> > >> fsync()...
> > >>
> > >> I would feel better if we were talking about specific applications and
> > >> performance numbers to know if forcing DAX on application is a debug
> > >> facility or a production level capability. You seem to have already
> > >> made that determination and I'm curious what I'm missing.
> > >
> > > I'm not setting any policy here at all. This whole argument is
> > > based around the DAX mount option doing "global fs enable or
> > > silently turning it off" and the application not knowing about that.
> > >
> > > The whole point of having a persistent per-inode DAX flags is that
> > > it is a policy mechanism, not a policy. The application can, if it
> > > is DAX aware, directly control whether DAX is used on a file or not.
> > > The application can even query and clear that persistent inode flag
> > > if it is configured not to (or cannot) use DAX.
> > >
> > > If the filesystem cannot support DAX, then we can error out attempts
> > > to set the DAX flag and then the app knows DAX is not available.
> > > i.e. the attempt to set policy failed. If the flag is set, then the
> > > inode will *always* use DAX - there is no "fall back to page cache"
> > > when DAX is enabled.
> > >
> > > If the application is not DAX aware, then the admin can control the
> > > DAX policy by manipulating these flags themselves, and hence control
> > > whether DAX is used by the application or not.
> > >
> > > If you think I'm dictating policy for DAX users and application,
> > > then you haven't understood anything I've previously said about why
> > > the DAX mount option needs to die before any of this is considered
> > > production ready. DAX is not an opaque "all or nothing" option. XFS
> > > will provide apps and admins with fine-grained, persistent,
> > > discoverable policy flags to allow admins and applications to set
> > > DAX policies however they see fit. This simply cannot be done if the
> > > only knob you have is a mount option that may or may not stick.
> >
> > I agree the mount option needs to die, and I fully grok the reasoning.
> > What I'm concerned with is that a system using fully-DAX-aware
> > applications is forced to incur the overhead of maintaining *sync
> > semantics, periodic sync(2) in particular, even if it is not relying
> > on those semantics.
>
> Let me somewhat correct this: IMO hard requirement is maintaining sync(2)
> semantics. Periodic writeback does not have any hard durability guarantees
> and we are free to ignore such requests in ->writepages() (that function
> has enough information in the writeback_control structure to differentiate
> between periodic writeback and data integrity sync) if we decide it is
> useful. Actually, we could do that even for 4.5.

Attached is a version of Ross' patch that will work for sync(2) and
fsync(2) and we won't flush caches during periodic writeback. The patch is
only compile-tested. Ross?

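The distinction drawn above between periodic writeback and a data
integrity sync can be sketched in userspace C. This is only an
illustration under stated assumptions: the enum and struct are
simplified stand-ins for the kernel's enum writeback_sync_modes and
struct writeback_control, and dax_should_flush() is a hypothetical
helper name that mirrors the early-return check in the attached patch
rather than any existing kernel function.

```c
#include <stdbool.h>

/*
 * Simplified stand-ins for the kernel definitions (assumption: the real
 * ones in include/linux/writeback.h carry many more fields).
 */
enum writeback_sync_modes {
	WB_SYNC_NONE,	/* periodic/background writeback, no durability promise */
	WB_SYNC_ALL,	/* data integrity sync, e.g. sync(2) or fsync(2) */
};

struct writeback_control {
	enum writeback_sync_modes sync_mode;
	long long range_start;	/* byte offset where writeback starts */
	long long range_end;	/* byte offset where writeback ends */
};

/*
 * Hypothetical helper: decide whether a DAX mapping needs its CPU caches
 * flushed for this writeback request. Nothing is flushed when there are
 * no dirty (exceptional) entries or when the request is not a data
 * integrity sync.
 */
static bool dax_should_flush(const struct writeback_control *wbc,
			     unsigned long nrexceptional)
{
	if (nrexceptional == 0)
		return false;
	return wbc->sync_mode == WB_SYNC_ALL;
}
```

With this model, a WB_SYNC_NONE request is ignored regardless of how
many dirty entries exist, which is the behaviour the paragraph above
argues is permissible for periodic writeback.
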
Honza
--
Jan Kara
SUSE Labs, CR

--LQksG6bCIzRHxTLp
Content-Type: text/x-patch; charset=us-ascii
Content-Disposition: attachment; filename="0001-dax-move-writeback-calls-into-the-filesystems.patch"

>>From f7280a34d235031c5dbf3f5a345c4b64e452f097 Mon Sep 17 00:00:00 2001
From: Ross Zwisler
Date: Sun, 7 Feb 2016 00:19:13 -0700
Subject: [PATCH] dax: move writeback calls into the filesystems

Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

dax_writeback_mapping_range() needs a struct block_device, and it used to
get that from inode->i_sb->s_bdev. This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device. This also fixes DAX code to properly flush caches in response to
sync(2).

Signed-off-by: Ross Zwisler
Signed-off-by: Jan Kara
---
 fs/block_dev.c      | 13 ++++++++++++-
 fs/dax.c            | 12 +++++++-----
 fs/ext2/inode.c     |  8 ++++++++
 fs/ext4/fsync.c     |  1 -
 fs/ext4/inode.c     |  4 ++++
 fs/xfs/xfs_aops.c   |  5 +++++
 include/linux/dax.h |  7 +++++--
 mm/filemap.c        | 12 ++++--------
 8 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 39b3a174a425..271d38aa6cbb 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1693,13 +1693,24 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
 	return try_to_free_buffers(page);
 }
 
+static int blkdev_writepages(struct address_space *mapping,
+			     struct writeback_control *wbc)
+{
+	if (dax_mapping(mapping)) {
+		struct block_device *bdev = I_BDEV(mapping->host);
+
+		return dax_writeback_mapping_range(mapping, bdev, wbc);
+	}
+	return generic_writepages(mapping, wbc);
+}
+
 static const struct address_space_operations def_blk_aops = {
 	.readpage	= blkdev_readpage,
 	.readpages	= blkdev_readpages,
 	.writepage	= blkdev_writepage,
 	.write_begin	= blkdev_write_begin,
 	.write_end	= blkdev_write_end,
-	.writepages	= generic_writepages,
+	.writepages	= blkdev_writepages,
 	.releasepage	= blkdev_releasepage,
 	.direct_IO	= blkdev_direct_IO,
 	.is_dirty_writeback = buffer_check_dirty_writeback,
diff --git a/fs/dax.c b/fs/dax.c
index fc2e3141138b..2f4965214783 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -485,11 +485,10 @@ static int dax_writeback_one(struct block_device *bdev,
  * end]. This is required by data integrity operations to ensure file data is
  * on persistent storage prior to completion of the operation.
  */
-int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
-		loff_t end)
+int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	struct block_device *bdev = inode->i_sb->s_bdev;
 	pgoff_t start_index, end_index, pmd_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
@@ -500,8 +499,11 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
 
-	start_index = start >> PAGE_CACHE_SHIFT;
-	end_index = end >> PAGE_CACHE_SHIFT;
+	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
+		return 0;
+
+	start_index = wbc->range_start >> PAGE_CACHE_SHIFT;
+	end_index = wbc->range_end >> PAGE_CACHE_SHIFT;
 	pmd_index = DAX_PMD_INDEX(start_index);
 
 	rcu_read_lock();
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 338eefda70c6..ee05e945f40c 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -874,6 +874,14 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 static int
 ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
+#ifdef CONFIG_FS_DAX
+	if (dax_mapping(mapping)) {
+		return dax_writeback_mapping_range(mapping,
+						   mapping->host->i_sb->s_bdev,
+						   wbc);
+	}
+#endif
+
 	return mpage_writepages(mapping, wbc, ext2_get_block);
 }
 
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 8850254136ae..b7136227d0f8 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -83,7 +83,6 @@ static int ext4_sync_parent(struct inode *inode)
  * What we do is just kick off a commit and wait on it. This will snapshot the
  * inode to disk.
  */
-
 int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 {
 	struct inode *inode = file->f_mapping->host;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 83bc8bfb3bea..19989c12187a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2450,6 +2450,10 @@ static int ext4_writepages(struct address_space *mapping,
 
 	trace_ext4_writepages(inode, wbc);
 
+	if (dax_mapping(mapping))
+		return dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
+						   wbc);
+
 	/*
 	 * No pages to write? This is mainly a kludge to avoid starting
 	 * a transaction for special inodes like journal inode on last iput()
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 379c089fb051..fd0839278442 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1208,6 +1208,11 @@ xfs_vm_writepages(
 	struct writeback_control *wbc)
 {
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
+	if (dax_mapping(mapping)) {
+		return dax_writeback_mapping_range(mapping,
+				xfs_find_bdev_for_inode(mapping->host), wbc);
+	}
+
 	return generic_writepages(mapping, wbc);
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 818e45078929..05d7d043d3bd 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -52,6 +52,9 @@ static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
-int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
-		loff_t end);
+
+struct writeback_control;
+
+int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, struct writeback_control *wbc);
 #endif
diff --git a/mm/filemap.c b/mm/filemap.c
index bc943867d68c..af3eec1a8c5e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -446,7 +446,8 @@ int filemap_write_and_wait(struct address_space *mapping)
 {
 	int err = 0;
 
-	if (mapping->nrpages) {
+	if ((!dax_mapping(mapping) && mapping->nrpages) ||
+	    (dax_mapping(mapping) && mapping->nrexceptional)) {
 		err = filemap_fdatawrite(mapping);
 		/*
 		 * Even if the above returned error, the pages may be
@@ -482,13 +483,8 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 {
 	int err = 0;
 
-	if (dax_mapping(mapping) && mapping->nrexceptional) {
-		err = dax_writeback_mapping_range(mapping, lstart, lend);
-		if (err)
-			return err;
-	}
-
-	if (mapping->nrpages) {
+	if ((!dax_mapping(mapping) && mapping->nrpages) ||
+	    (dax_mapping(mapping) && mapping->nrexceptional)) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
 		/* See comment of filemap_write_and_wait() */
--
2.6.2

--LQksG6bCIzRHxTLp
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

--LQksG6bCIzRHxTLp--

From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111])
	by oss.sgi.com (Postfix) with ESMTP id 40B6D29DF5;
	Tue, 9 Feb 2016 12:07:30 -0600 (CST)
Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25])
	by relay1.corp.sgi.com (Postfix) with ESMTP id 1F68E8F8064;
	Tue, 9 Feb 2016 10:07:27 -0800 (PST)
Received: from mga09.intel.com (mga09.intel.com [134.134.136.24])
	by cuda.sgi.com with ESMTP id KZGJZgHn7eePIpDn;
	Tue, 09 Feb 2016 10:07:26 -0800 (PST)
Date: Tue, 9 Feb 2016 11:06:47 -0700
From: Ross Zwisler
Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems
Message-ID: <20160209180647.GA2450@linux.intel.com>
References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com>
	<1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com>
	<20160207215047.GJ31407@dastard>
	<20160208201808.GK27429@dastard>
	<20160209094353.GF9451@quack.suse.cz>
	<20160209160134.GA12245@quack.suse.cz>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20160209160134.GA12245@quack.suse.cz>
List-Id: XFS Filesystem from SGI
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: xfs-bounces@oss.sgi.com
Sender: xfs-bounces@oss.sgi.com
To: Jan Kara
Cc: Theodore Ts'o, Andrew Morton, "linux-nvdimm@lists.01.org",
	"linux-kernel@vger.kernel.org", XFS Developers, Linux MM, jmoyer,
	Andreas Dilger, Alexander Viro, Jan Kara, linux-fsdevel,
	Matthew Wilcox, Dan Williams, linux-ext4, Ross Zwisler

On Tue, Feb 09, 2016 at 05:01:34PM +0100, Jan Kara wrote:
> On Tue 09-02-16 10:43:53, Jan Kara wrote:
> > On Mon 08-02-16 12:55:24, Dan Williams wrote:
> > > On Mon, Feb 8, 2016 at 12:18 PM, Dave Chinner wrote:
> > > [..]
> > > >> Setting aside the current block zeroing problem you seem to be assuming
> > > >> that DAX will always be faster and that may not be true at a media
> > > >> level. Waiting years for some applications to determine if DAX makes
> > > >> sense for their use case seems completely reasonable. In the meantime
> > > >> the apps that are already making these changes want to know that a DAX
> > > >> mapping request has not silently dropped back to page cache. They
> > > >> also want to know if they successfully jumped through all the hoops to
> > > >> get a larger than pte mapping.
> > > >>
> > > >> I agree it is useful to be able to force DAX on an unmodified
> > > >> application to see what happens, and it follows that if those
> > > >> applications want to run in that mode they will need functional
> > > >> fsync()...
> > > >>
> > > >> I would feel better if we were talking about specific applications and
> > > >> performance numbers to know if forcing DAX on application is a debug
> > > >> facility or a production level capability. You seem to have already
> > > >> made that determination and I'm curious what I'm missing.
> > > >
> > > > I'm not setting any policy here at all. This whole argument is
> > > > based around the DAX mount option doing "global fs enable or
> > > > silently turning it off" and the application not knowing about that.
> > > >
> > > > The whole point of having a persistent per-inode DAX flags is that
> > > > it is a policy mechanism, not a policy. The application can, if it
> > > > is DAX aware, directly control whether DAX is used on a file or not.
> > > > The application can even query and clear that persistent inode flag
> > > > if it is configured not to (or cannot) use DAX.
> > > >
> > > > If the filesystem cannot support DAX, then we can error out attempts
> > > > to set the DAX flag and then the app knows DAX is not available.
> > > > i.e. the attempt to set policy failed. If the flag is set, then the
> > > > inode will *always* use DAX - there is no "fall back to page cache"
> > > > when DAX is enabled.
> > > >
> > > > If the application is not DAX aware, then the admin can control the
> > > > DAX policy by manipulating these flags themselves, and hence control
> > > > whether DAX is used by the application or not.
> > > >
> > > > If you think I'm dictating policy for DAX users and application,
> > > > then you haven't understood anything I've previously said about why
> > > > the DAX mount option needs to die before any of this is considered
> > > > production ready. DAX is not an opaque "all or nothing" option. XFS
> > > > will provide apps and admins with fine-grained, persistent,
> > > > discoverable policy flags to allow admins and applications to set
> > > > DAX policies however they see fit. This simply cannot be done if the
> > > > only knob you have is a mount option that may or may not stick.
> > >
> > > I agree the mount option needs to die, and I fully grok the reasoning.
> > > What I'm concerned with is that a system using fully-DAX-aware
> > > applications is forced to incur the overhead of maintaining *sync
> > > semantics, periodic sync(2) in particular, even if it is not relying
> > > on those semantics.
> >
> > Let me somewhat correct this: IMO hard requirement is maintaining sync(2)
> > semantics. Periodic writeback does not have any hard durability guarantees
> > and we are free to ignore such requests in ->writepages() (that function
> > has enough information in the writeback_control structure to differentiate
> > between periodic writeback and data integrity sync) if we decide it is
> > useful. Actually, we could do that even for 4.5.
>
> Attached is a version of Ross' patch that will work for sync(2) and
> fsync(2) and we won't flush caches during periodic writeback. The patch is
> only compile-tested. Ross?

This looks great. I'll send out a v2 with this and with the
dax_clear_sectors() changes after I'm done testing.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mail-qk0-f177.google.com (mail-qk0-f177.google.com [209.85.220.177])
	by kanga.kvack.org (Postfix) with ESMTP id C81E6828E8;
	Mon, 8 Feb 2016 15:58:48 -0500 (EST)
Received: by mail-qk0-f177.google.com with SMTP id o6so63620054qkc.2;
	Mon, 08 Feb 2016 12:58:48 -0800 (PST)
Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28])
	by mx.google.com with ESMTPS id u108si32236481qge.50.2016.02.08.12.58.48
	(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
	Mon, 08 Feb 2016 12:58:48 -0800 (PST)
From: Jeff Moyer
Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems
References: <1454829553-29499-1-git-send-email-ross.zwisler@linux.intel.com>
	<1454829553-29499-3-git-send-email-ross.zwisler@linux.intel.com>
	<20160207215047.GJ31407@dastard>
	<20160208201808.GK27429@dastard>
Date: Mon, 08 Feb 2016 15:58:44 -0500
In-Reply-To: (Dan Williams's message of "Mon, 8 Feb 2016 12:55:24 -0800")
MIME-Version: 1.0
Content-Type: text/plain
Sender: owner-linux-mm@kvack.org
To: Dan Williams
Cc: Dave Chinner, Ross Zwisler, "linux-kernel@vger.kernel.org",
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, Linux MM,
	"linux-nvdimm@lists.01.org", XFS Developers

Dan Williams writes:

> I agree the mount option needs to die, and I fully grok the reasoning.
> What I'm concerned with is that a system using fully-DAX-aware
> applications is forced to incur the overhead of maintaining *sync
> semantics, periodic sync(2) in particular, even if it is not relying
> on those semantics.
>
> However, like I said in my other mail, we can solve that with
> alternate interfaces to persistent memory if that becomes an issue and
> not require that "disable *sync" capability to come through DAX.

What do you envision these alternate interfaces looking like?

-Jeff

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
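
For reference, the gating condition that the patch in this thread adds
to filemap_write_and_wait() and filemap_write_and_wait_range() can be
modelled in isolation. This is a sketch only: struct mapping_model and
mapping_needs_writeback() are hypothetical names introduced for
illustration, not kernel API, and the three fields are assumed
stand-ins for dax_mapping(mapping), mapping->nrpages and
mapping->nrexceptional.

```c
#include <stdbool.h>

/* Hypothetical, reduced model of the address_space fields the check reads. */
struct mapping_model {
	bool is_dax;			/* stands in for dax_mapping(mapping) */
	unsigned long nrpages;		/* page cache pages */
	unsigned long nrexceptional;	/* exceptional (DAX) radix tree entries */
};

/*
 * Mirrors the condition in the patch:
 *   (!dax_mapping(mapping) && mapping->nrpages) ||
 *   (dax_mapping(mapping) && mapping->nrexceptional)
 * A DAX mapping is judged by its exceptional entries, a regular mapping
 * by its page cache pages.
 */
static bool mapping_needs_writeback(const struct mapping_model *m)
{
	if (m->is_dax)
		return m->nrexceptional != 0;
	return m->nrpages != 0;
}
```

The point of the split is that a DAX mapping tracks dirty state through
exceptional entries rather than page cache pages, so testing nrpages
alone would skip writeback for DAX files entirely.
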