From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Fri, 12 Feb 2016 21:59:12 -0700
From: Ross Zwisler
Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
Message-ID: <20160213045912.GA22595@linux.intel.com>
References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com>
 <20160211124304.GI21760@quack.suse.cz>
 <20160212190320.GA24857@linux.intel.com>
 <20160213023849.GD14668@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160213023849.GD14668@dastard>
Sender: owner-linux-mm@kvack.org
To: Dave Chinner
Cc: Ross Zwisler, Jan Kara, linux-kernel@vger.kernel.org, Theodore Ts'o,
 Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams, Jan Kara,
 Matthew Wilcox, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com
List-ID:

On Sat, Feb 13, 2016 at 01:38:49PM +1100, Dave Chinner wrote:
> On Fri, Feb 12, 2016 at 12:03:20PM -0700, Ross Zwisler wrote:
> > On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote:
> > > On Wed 10-02-16 13:48:54, Ross Zwisler wrote:
> > > > 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue
> > > > the writeback in the case that DAX is enabled but we only have a nonzero
> > > > mapping->nrpages. As with 1) and 2), I believe this is necessary to
> > > > properly writeback metadata changes. If this sounds wrong, please let me
> > > > know and I'll get more info.
> > >
> > > And I'm surprised here as well. If there are dax_mapping() inodes that have
> > > pagecache pages, then we have issues with radix tree handling as well. So
> > > how come dax_mapping() inodes have pages attached? If it is about block
> > > device inodes, then I find it buggy, that S_DAX gets set for such inodes
> > > when filesystem is mounted on them because in such cases we are IMO asking
> > > for data corruption sooner rather than later...
> >
> > I think I've figured this one out, at least partially.
> >
> > For ext2 the issues I was seeing were due to the fact that directory inodes
> > have S_DAX set, but have dirty page cache pages. In testing with
> > generic/002, I see two ext2 inodes with S_DAX trying to do a writeback while
> > they have dirty page cache pages. The first has i_ino=2, which is the
> > EXT2_ROOT_INO.
> ....
> > As far as I can see, XFS does not have these issues - returning immediately
> > having done just the DAX writeback in xfs_vm_writepages() lets all my xfstests
> > pass.
>
> XFS will not have issues because it does not dirty directory inodes
> at the VFS level, nor does it use the page cache for directory data.
> However, looking at the code I think it does still set S_DAX on
> directory inodes, which it shouldn't be doing.
>
> I've got a couple of fixes I need to do in this area - hopefully
> I'll get it done on Monday.

Cool. I've got a quick patch that stops S_DAX from being set on everything
but regular inodes for ext2 and ext4. This solved a lot of my xfstests
failures. Even after that I'm seeing two last failures with ext4 - I'll keep
working on those.

- Ross

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Sat, 13 Feb 2016 13:38:49 +1100
From: Dave Chinner
Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
Message-ID: <20160213023849.GD14668@dastard>
References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com>
 <20160211124304.GI21760@quack.suse.cz>
 <20160212190320.GA24857@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160212190320.GA24857@linux.intel.com>
Sender: owner-linux-mm@kvack.org
To: Ross Zwisler, Jan Kara, linux-kernel@vger.kernel.org, Theodore Ts'o,
 Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams, Jan Kara,
 Matthew Wilcox, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com
List-ID:

On Fri, Feb 12, 2016 at 12:03:20PM -0700, Ross Zwisler wrote:
> On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote:
> > On Wed 10-02-16 13:48:54, Ross Zwisler wrote:
> > > 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue
> > > the writeback in the case that DAX is enabled but we only have a nonzero
> > > mapping->nrpages. As with 1) and 2), I believe this is necessary to
> > > properly writeback metadata changes. If this sounds wrong, please let me
> > > know and I'll get more info.
> >
> > And I'm surprised here as well. If there are dax_mapping() inodes that have
> > pagecache pages, then we have issues with radix tree handling as well. So
> > how come dax_mapping() inodes have pages attached? If it is about block
> > device inodes, then I find it buggy, that S_DAX gets set for such inodes
> > when filesystem is mounted on them because in such cases we are IMO asking
> > for data corruption sooner rather than later...
>
> I think I've figured this one out, at least partially.
>
> For ext2 the issues I was seeing were due to the fact that directory inodes
> have S_DAX set, but have dirty page cache pages. In testing with
> generic/002, I see two ext2 inodes with S_DAX trying to do a writeback while
> they have dirty page cache pages. The first has i_ino=2, which is the
> EXT2_ROOT_INO.
....
> As far as I can see, XFS does not have these issues - returning immediately
> having done just the DAX writeback in xfs_vm_writepages() lets all my xfstests
> pass.

XFS will not have issues because it does not dirty directory inodes
at the VFS level, nor does it use the page cache for directory data.
However, looking at the code I think it does still set S_DAX on
directory inodes, which it shouldn't be doing.

I've got a couple of fixes I need to do in this area - hopefully
I'll get it done on Monday.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Fri, 12 Feb 2016 12:03:20 -0700
From: Ross Zwisler
Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
Message-ID: <20160212190320.GA24857@linux.intel.com>
References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com>
 <20160211124304.GI21760@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160211124304.GI21760@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
To: Jan Kara
Cc: Ross Zwisler, linux-kernel@vger.kernel.org, Theodore Ts'o,
 Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
 Jan Kara, Matthew Wilcox, linux-ext4@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-nvdimm@lists.01.org, xfs@oss.sgi.com
List-ID:

On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote:
> On Wed 10-02-16 13:48:54, Ross Zwisler wrote:
> > 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue
> > the writeback in the case that DAX is enabled but we only have a nonzero
> > mapping->nrpages. As with 1) and 2), I believe this is necessary to
> > properly writeback metadata changes. If this sounds wrong, please let me
> > know and I'll get more info.
>
> And I'm surprised here as well. If there are dax_mapping() inodes that have
> pagecache pages, then we have issues with radix tree handling as well. So
> how come dax_mapping() inodes have pages attached? If it is about block
> device inodes, then I find it buggy, that S_DAX gets set for such inodes
> when filesystem is mounted on them because in such cases we are IMO asking
> for data corruption sooner rather than later...

I think I've figured this one out, at least partially.

For ext2 the issues I was seeing were due to the fact that directory inodes
have S_DAX set, but have dirty page cache pages.
In testing with generic/002, I see two ext2 inodes with S_DAX trying to do a
writeback while they have dirty page cache pages. The first has i_ino=2,
which is the EXT2_ROOT_INO. The second inode changes from run to run, but
for my last run was 155649. The test failed because that directory inode was
found to be corrupt by fsck.ext2:

*** fsck.ext2 output ***
fsck from util-linux 2.26.2
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Directory inode 155649, block #0, offset 0: directory corrupted

If I change the code in ext2_writepages() so that it does the
mpage_writepages() even for DAX inodes, all my xfstests pass. I'm not sure
this is the right fix, though - should it instead be that ext2 directory
inodes don't have S_DAX set?

A similar problem occurs with ext4, though I haven't yet tracked it down to
an inode type. It could be that ext4 directory inodes have the same issue,
and Eric Sandeen suggested we might also have an issue with XATTRS attached
to inodes. As with ext2, if I allow the normal writeback to occur in
ext4_writepages() even for DAX inodes, the issues go away, but I'm not sure
whether or not this is the correct fix.

As far as I can see, XFS does not have these issues - returning immediately
having done just the DAX writeback in xfs_vm_writepages() lets all my
xfstests pass.

For v4.5 should I send out an updated version of this series that does the
regular page writeback for ext2 & ext4, or should we work to clear S_DAX for
regular filesystem inodes that have dirty page cache data?

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Fri, 12 Feb 2016 07:50:49 +1100
From: Dave Chinner
Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
Message-ID: <20160211205049.GJ19486@dastard>
References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com>
 <20160211124304.GI21760@quack.suse.cz>
 <20160211194922.GA5260@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160211194922.GA5260@linux.intel.com>
Sender: owner-linux-mm@kvack.org
To: Ross Zwisler, Jan Kara, linux-kernel@vger.kernel.org, Theodore Ts'o,
 Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams, Jan Kara,
 Matthew Wilcox, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com
List-ID:

On Thu, Feb 11, 2016 at 12:49:22PM -0700, Ross Zwisler wrote:
> I think the plan of unsetting S_DAX on bdev->bd_inode when we mount will save
> us from this, as long as we do it super early in the mount process.

I think that S_DAX should not be set on the block device by default
in the first place. If we've been surprised by unexpected behaviour,
then I'm sure there are going to be other surprises waiting for us.
DAX default policy should be opt-in, not opt-out.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Thu, 11 Feb 2016 12:49:22 -0700
From: Ross Zwisler
Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
Message-ID: <20160211194922.GA5260@linux.intel.com>
References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com>
 <20160211124304.GI21760@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160211124304.GI21760@quack.suse.cz>
Sender: owner-linux-mm@kvack.org
To: Jan Kara
Cc: Ross Zwisler, linux-kernel@vger.kernel.org, Theodore Ts'o,
 Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
 Jan Kara, Matthew Wilcox, linux-ext4@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-nvdimm@lists.01.org, xfs@oss.sgi.com
List-ID:

On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote:
> On Wed 10-02-16 13:48:54, Ross Zwisler wrote:
> > During testing of raw block devices + DAX I noticed that the struct
> > block_device that we were using for DAX operations was incorrect. For the
> > fault handlers, etc. we can just get the correct bdev via get_block(),
> > which is passed in as a function pointer, but for the *sync code and for
> > sector zeroing we don't have access to get_block(). This is also an issue
> > for XFS real-time devices, whenever we get those working.
> >
> > Patch one of this series fixes the DAX sector zeroing code by explicitly
> > passing in a valid struct block_device.
> >
> > Patch two of this series fixes DAX *sync support by moving calls to
> > dax_writeback_mapping_range() out of filemap_write_and_wait_range() and
> > into the filesystem/block device ->writepages function so that it can
> > supply us with a valid block device. This also fixes DAX code to properly
> > flush caches in response to sync(2).
> >
> > Thanks to Jan Kara for his initial draft of patch 2:
> > https://lkml.org/lkml/2016/2/9/485
> >
> > Here are the changes that I've made to that patch:
> >
> > 1) For DAX mappings, only return after calling
> > dax_writeback_mapping_range() if we encountered an error. In the non-error
> > case we still need to write back normal pages, else we lose metadata
> > updates.
> >
> > 2) In dax_writeback_mapping_range(), move the new check for
> >     if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
> > above the i_blkbits check. In my testing I found cases where
> > dax_writeback_mapping_range() was called for inodes with i_blkbits !=
> > PAGE_SHIFT - I'm assuming these are internal metadata inodes? They have no
> > exceptional DAX entries to flush, so we have no work to do, but if we
> > return error from the i_blkbits check we will fail the overall writeback
> > operation. Please let me know if it seems wrong for us to be seeing inodes
> > set to use DAX but with i_blkbits != PAGE_SHIFT and I'll get more info.
>
> So I'm wondering - how come S_DAX flag got set for inode where i_blkbits !=
> PAGE_SHIFT? That would seem to be a bug? I specifically ordered the checks
> like this to catch such issues.

I've isolated this one - this happens for all three filesystems (ext2, ext4
& XFS), and does indeed have to do with the fact that S_DAX is set for
bdev->bd_inode. Here is one failure path:

[  102.866637]  [] dump_stack+0x85/0xc2
[  102.867101]  [] dax_writeback_mapping_range+0x60/0xe0
[  102.867738]  [] blkdev_writepages+0x3f/0x50
[  102.868272]  [] do_writepages+0x21/0x30
[  102.868784]  [] __filemap_fdatawrite_range+0xc6/0x100
[  102.869378]  [] filemap_write_and_wait+0x4a/0xa0
[  102.869933]  [] set_blocksize+0x70/0xd0
[  102.870424]  [] sb_set_blocksize+0x1d/0x50
[  102.870933]  [] ext4_fill_super+0x75b/0x3360
[  102.871487]  [] ? vsnprintf+0x201/0x4c0
[  102.872005]  [] ? snprintf+0x49/0x60
[  102.872499]  [] mount_bdev+0x180/0x1b0
[  102.872981]  [] ? ext4_calculate_overhead+0x370/0x370
[  102.873580]  [] ext4_mount+0x15/0x20
[  102.874042]  [] mount_fs+0x38/0x170
[  102.874524]  [] vfs_kern_mount+0x6b/0x150
[  102.875041]  [] do_mount+0x24f/0xe90
[  102.875508]  [] ? mntput+0x24/0x40
[  102.875958]  [] ? __kmalloc_track_caller+0xea/0x240
[  102.876542]  [] ? copy_mount_options+0x2c/0x210
[  102.877087]  [] SyS_mount+0x95/0xe0
[  102.877573]  [] entry_SYSCALL_64_fastpath+0x12/0x76

In set_blocksize() we are actually updating bdev->bd_inode->i_blkbits to be
12, but before that happens we do a sync_blockdev() with i_blkbits at 10,
which causes the failure.

This can be reproduced easily just by mounting an ext2 or ext4 filesystem.

I think the plan of unsetting S_DAX on bdev->bd_inode when we mount will save
us from this, as long as we do it super early in the mount process.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Thu, 11 Feb 2016 13:43:04 +0100
From: Jan Kara
Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
Message-ID: <20160211124304.GI21760@quack.suse.cz>
References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com>
Sender: owner-linux-mm@kvack.org
To: Ross Zwisler
Cc: linux-kernel@vger.kernel.org, Theodore Ts'o, Alexander Viro,
 Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner, Jan Kara,
 Matthew Wilcox, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com
List-ID:

On Wed 10-02-16 13:48:54, Ross Zwisler wrote:
> During testing of raw block devices + DAX I noticed that the struct
> block_device that we were using for DAX operations was incorrect. For the
> fault handlers, etc. we can just get the correct bdev via get_block(),
> which is passed in as a function pointer, but for the *sync code and for
> sector zeroing we don't have access to get_block(). This is also an issue
> for XFS real-time devices, whenever we get those working.
>
> Patch one of this series fixes the DAX sector zeroing code by explicitly
> passing in a valid struct block_device.
>
> Patch two of this series fixes DAX *sync support by moving calls to
> dax_writeback_mapping_range() out of filemap_write_and_wait_range() and
> into the filesystem/block device ->writepages function so that it can
> supply us with a valid block device. This also fixes DAX code to properly
> flush caches in response to sync(2).
>
> Thanks to Jan Kara for his initial draft of patch 2:
> https://lkml.org/lkml/2016/2/9/485
>
> Here are the changes that I've made to that patch:
>
> 1) For DAX mappings, only return after calling
> dax_writeback_mapping_range() if we encountered an error. In the non-error
> case we still need to write back normal pages, else we lose metadata
> updates.
>
> 2) In dax_writeback_mapping_range(), move the new check for
>     if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
> above the i_blkbits check. In my testing I found cases where
> dax_writeback_mapping_range() was called for inodes with i_blkbits !=
> PAGE_SHIFT - I'm assuming these are internal metadata inodes? They have no
> exceptional DAX entries to flush, so we have no work to do, but if we
> return error from the i_blkbits check we will fail the overall writeback
> operation. Please let me know if it seems wrong for us to be seeing inodes
> set to use DAX but with i_blkbits != PAGE_SHIFT and I'll get more info.

So I'm wondering - how come S_DAX flag got set for inode where i_blkbits !=
PAGE_SHIFT? That would seem to be a bug? I specifically ordered the checks
like this to catch such issues.

> 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue
> the writeback in the case that DAX is enabled but we only have a nonzero
> mapping->nrpages. As with 1) and 2), I believe this is necessary to
> properly writeback metadata changes. If this sounds wrong, please let me
> know and I'll get more info.

And I'm surprised here as well. If there are dax_mapping() inodes that have
pagecache pages, then we have issues with radix tree handling as well. So
how come dax_mapping() inodes have pages attached? If it is about block
device inodes, then I find it buggy, that S_DAX gets set for such inodes
when filesystem is mounted on them because in such cases we are IMO asking
for data corruption sooner rather than later...

								Honza
--
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
From: Ross Zwisler
Subject: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
Date: Wed, 10 Feb 2016 13:48:54 -0700
Message-Id: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com>
Sender: owner-linux-mm@kvack.org
To: linux-kernel@vger.kernel.org
Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
 Andrew Morton, Dan Williams, Dave Chinner, Jan Kara, Matthew Wilcox,
 linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com
List-ID:

During testing of raw block devices + DAX I noticed that the struct
block_device that we were using for DAX operations was incorrect. For the
fault handlers, etc. we can just get the correct bdev via get_block(),
which is passed in as a function pointer, but for the *sync code and for
sector zeroing we don't have access to get_block(). This is also an issue
for XFS real-time devices, whenever we get those working.

Patch one of this series fixes the DAX sector zeroing code by explicitly
passing in a valid struct block_device.

Patch two of this series fixes DAX *sync support by moving calls to
dax_writeback_mapping_range() out of filemap_write_and_wait_range() and
into the filesystem/block device ->writepages function so that it can
supply us with a valid block device. This also fixes DAX code to properly
flush caches in response to sync(2).

Thanks to Jan Kara for his initial draft of patch 2:
https://lkml.org/lkml/2016/2/9/485

Here are the changes that I've made to that patch:

1) For DAX mappings, only return after calling
dax_writeback_mapping_range() if we encountered an error. In the non-error
case we still need to write back normal pages, else we lose metadata
updates.

2) In dax_writeback_mapping_range(), move the new check for
    if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
above the i_blkbits check. In my testing I found cases where
dax_writeback_mapping_range() was called for inodes with i_blkbits !=
PAGE_SHIFT - I'm assuming these are internal metadata inodes? They have no
exceptional DAX entries to flush, so we have no work to do, but if we
return error from the i_blkbits check we will fail the overall writeback
operation. Please let me know if it seems wrong for us to be seeing inodes
set to use DAX but with i_blkbits != PAGE_SHIFT and I'll get more info.

3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue
the writeback in the case that DAX is enabled but we only have a nonzero
mapping->nrpages. As with 1) and 2), I believe this is necessary to
properly writeback metadata changes. If this sounds wrong, please let me
know and I'll get more info.
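
[Editorial aside] The effect of the reordering described in 2) can be
modelled in a few lines of userspace C. This is only a sketch under stated
assumptions, not the kernel code: every model_* name and MODEL_PAGE_SHIFT
is made up for illustration, the structs keep only the two fields the
checks look at, and the -EIO return value is picked for the model rather
than taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types; all names here are hypothetical. */
#define MODEL_PAGE_SHIFT 12

enum model_sync_mode { MODEL_WB_SYNC_NONE, MODEL_WB_SYNC_ALL };

struct model_mapping {
	unsigned long nrexceptional;	/* number of DAX radix tree entries */
	unsigned int i_blkbits;		/* log2 of the inode block size */
};

struct model_wbc {
	enum model_sync_mode sync_mode;
};

/*
 * Check ordering as described in change 2): return early when there is
 * nothing to flush, and only afterwards insist on
 * i_blkbits == PAGE_SHIFT.
 */
static int model_dax_writeback(const struct model_mapping *m,
			       const struct model_wbc *w)
{
	assert(m != NULL && w != NULL);

	if (!m->nrexceptional || w->sync_mode != MODEL_WB_SYNC_ALL)
		return 0;	/* no DAX entries, or not a WB_SYNC_ALL pass */

	if (m->i_blkbits != MODEL_PAGE_SHIFT)
		return -EIO;	/* error value chosen for this model only */

	return 0;		/* the real code would flush the entries here */
}
```

With this ordering, a mapping that has i_blkbits at 10 and no exceptional
entries returns 0 instead of failing the writeback. With the two checks
swapped, the same mapping would hit the error path first.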

A working tree can be found here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_bdev_v2

Ross Zwisler (2):
  dax: supply DAX clearing code with correct bdev
  dax: move writeback calls into the filesystems

 fs/block_dev.c         | 16 +++++++++++++++-
 fs/dax.c               | 22 ++++++++++++----------
 fs/ext2/inode.c        | 17 +++++++++++++++--
 fs/ext4/inode.c        |  7 +++++++
 fs/xfs/xfs_aops.c      | 11 ++++++++++-
 fs/xfs/xfs_aops.h      |  1 +
 fs/xfs/xfs_bmap_util.c |  3 ++-
 include/linux/dax.h    |  8 +++++---
 mm/filemap.c           | 12 ++++--------
 9 files changed, 71 insertions(+), 26 deletions(-)

--
2.5.0

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Fri, 12 Feb 2016 10:44:15 +1100
From: Dave Chinner
Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
Message-ID: <20160211234415.GM19486@dastard>
References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com>
 <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com>
 <20160210220312.GP14668@dastard>
 <20160210224340.GA30938@linux.intel.com>
 <20160211125044.GJ21760@quack.suse.cz>
 <20160211204635.GI19486@dastard>
 <20160211224616.GL19486@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
To: Dan Williams
Cc: Jan Kara, Ross Zwisler, "linux-kernel@vger.kernel.org", Theodore Ts'o,
 Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara, Matthew Wilcox,
 linux-ext4, linux-fsdevel, Linux MM, "linux-nvdimm@lists.01.org",
 XFS Developers
List-ID:

On Thu, Feb 11, 2016 at 02:59:14PM -0800, Dan Williams wrote:
> On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner wrote:
> > On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote:
> >> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner wrote:
> >> Maybe I don't need to worry because it's already the case that a
> >> mmap of the raw device may not see the most up to date data for a
> >> file that has dirty fs-page-cache data.
> >
> > It goes both ways. What happens if mkfs or fsck modifies the
> > block device via mmap+DAX and then the filesystem mounts the block
> > device and tries to read that metadata via the block device page
> > cache?
> >
> > Quite frankly, DAX on the block device is a can of worms we really
> > don't need to deal with right now. IMO it's a solution looking for a
> > problem to solve,
>
> Virtualization use cases want to give large ranges to guest-VMs, and
> it is currently the only way to reliably get 1GiB mappings.

Precisely my point - block devices are not the best way to solve
this problem.

A file, on XFS, with a 1GB extent size hint and preallocated to be
aligned to 1GB addresses (i.e. mkfs.xfs -d su=1G,sw=1 on the host
filesystem) will give reliable 1GB aligned blocks for DAX mappings,
just like a block device will. Performance wise it's little
different to using the block device directly. Management wise it's
way more flexible, especially as such image files can be recycled
for new VMs almost instantly via FALLOC_FL_ZERO_RANGE.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
MIME-Version: 1.0
In-Reply-To: <20160211224616.GL19486@dastard>
References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com>
 <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com>
 <20160210220312.GP14668@dastard>
 <20160210224340.GA30938@linux.intel.com>
 <20160211125044.GJ21760@quack.suse.cz>
 <20160211204635.GI19486@dastard>
 <20160211224616.GL19486@dastard>
Date: Thu, 11 Feb 2016 14:59:14 -0800
Message-ID:
Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
From: Dan Williams
Content-Type: text/plain; charset=UTF-8
Sender: owner-linux-mm@kvack.org
To: Dave Chinner
Cc: Jan Kara, Ross Zwisler, "linux-kernel@vger.kernel.org", Theodore Ts'o,
 Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara, Matthew Wilcox,
 linux-ext4, linux-fsdevel, Linux MM, "linux-nvdimm@lists.01.org",
 XFS Developers
List-ID:

On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner wrote:
> On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote:
>> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner wrote:
>> [..]
>> >> It seems to me we need to modify the
>> >> metadata i/o paths to bypass the page cache,
>> >
>> > XFS doesn't use the block device page cache for its metadata - it
>> > has its own internal metadata cache structures and uses get_pages
>> > or heap memory to back its metadata. But that doesn't make mixing
>> > DAX and pages in the block device mapping tree sane.
>> >
>> > What you are missing here is that the underlying architecture of
>> > journalling filesystems means they can't use DAX for their metadata.
>> > Modifications have to be buffered, because they have to be written
>> > to the journal first before they are written back in place. IOWs, we
>> > need to buffer changes in volatile memory for some time, and that
>> > means we can't use DAX during transactional modifications.
>> >
>> > And to put the final nail in that coffin, metadata in XFS can be
>> > discontiguous multi-block objects - in those situations we vmap the
>> > underlying pages so they appear to the code to be a contiguous
>> > buffer, and that's something we can't do with DAX....
>>
>> Sorry, I wasn't clear when I said "bypass page cache" I meant a
>> solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition
>> table reads".
>
> So there's already bandaids to prevent bad shit from happening in
> the block layer, let alone when we consider all the ways that
> userspace can screw this all up.
>
>> However, I suspect that is broken if the filesystem is not ready
>> to see a new page allocated for every I/O. I assume one
>> thread will want to insert a page in the radix for another thread
>> to find/manipulate before metadata gets written back to storage.
>
> Right, you can't do that, especially as the struct page has a 1-1
> relationship with the bufferhead that is attached to it as the
> bufferhead carries the filesystem state for the given cached page.
>
>> >> or teach the fsync code how to flush populated data pages out
>> >> of the radix.
>> >
>> > That doesn't solve the problem. Filesystems free and reallocate
>> > filesystem blocks without intermediate block device mapping
>> > invalidation calls, so what is one minute a data block accessed
>> > by DAX may become a metadata block that is accessed via buffered
>> > IO. It all goes to crap very quickly....
>> >
>> > However, I'd say fsync is not the place to address this. This
>> > block device cache aliasing issue is supposed to be what
>> > unmap_underlying_metadata() solves, right?
>>
>> I'll take a look at this. Right now I'm trying to implement the
>> "clear block-device-inode S_DAX on fs mount" approach. My concern
>> though is that we need to disable block device mmap while a
>> filesystem is mounted...
>
> /me chokes on his coffee.
>
> When did mmapping the block device behind the back of a mounted
> filesystem become a valid use case? It's not supported for normal
> block devices and for the same reasons it won't be supported for DAX
> enabled block devices, either. i.e. I'm going to tell anyone who has
> an application that does this to go and take a hike when (not if!)
> they report filesystem corruption problems.

Right, but we need to not confuse the fsync code regardless of how bad
of an idea this is :-).

>> Maybe I don't need to worry because it's already the case that a
>> mmap of the raw device may not see the most up to date data for a
>> file that has dirty fs-page-cache data.
>
> It goes both ways. What happens if mkfs or fsck modifies the
> block device via mmap+DAX and then the filesystem mounts the block
> device and tries to read that metadata via the block device page
> cache?
>
> Quite frankly, DAX on the block device is a can of worms we really
> don't need to deal with right now. IMO it's a solution looking for a
> problem to solve,

Virtualization use cases want to give large ranges to guest-VMs, and
it is currently the only way to reliably get 1GiB mappings.

> the "default to on" policy is wrong (DAX is
> opt-in, not opt-out) and given this we should turn it off until
> we've solved the more important problems we need to solve. i.e. We
> need to concentrate on getting data integrity working correctly
> first, then address the cache aliasing issues, then address the
> "safe access" issues, and then we can re-introduce block device DAX
> access...

Agreed. Note that the "default-on policy" came from commit
bbab37ddc20b "block: Add support for DAX reads/writes to block
devices" way back in 4.2. We're just now noticing. Credit Ross for
good sanity checking.
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Fri, 12 Feb 2016 09:46:16 +1100 From: Dave Chinner Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160211224616.GL19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> <20160211204635.GI19486@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Jan Kara , Ross Zwisler , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers List-ID: On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote: > On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner wrote: > [..] > >> It seems to me we need to modify the > >> metadata i/o paths to bypass the page cache, > > > > XFS doesn't use the block device page cache for it's metadata - it > > has it's own internal metadata cache structures and uses get_pages > > or heap memory to back it's metadata. But that doesn't make mixing > > DAX and pages in the block device mapping tree sane. > > > > What you are missing here is that the underlying architecture of > > journalling filesystems mean they can't use DAX for their metadata. > > Modifications have to be buffered, because they have to be written > > to the journal first before they are written back in place. IOWs, we > > need to buffer changes in volatile memory for some time, and that > > means we can't use DAX during transactional modifications. 
> >
> > And to put the final nail in that coffin, metadata in XFS can be
> > discontiguous multi-block objects - in those situations we vmap the
> > underlying pages so they appear to the code to be a contiguous
> > buffer, and that's something we can't do with DAX....
>
> Sorry, I wasn't clear when I said "bypass page cache" I meant a
> solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition
> table reads".

So there are already band-aids to prevent bad shit from happening in
the block layer, let alone when we consider all the ways that
userspace can screw this all up.

> However, I suspect that is broken if the filesystem is not ready
> to see a new page allocated for every I/O. I assume one
> thread will want to insert a page in the radix for another thread
> to find/manipulate before metadata gets written back to storage.

Right, you can't do that, especially as the struct page has a 1-1
relationship with the bufferhead that is attached to it as the
bufferhead carries the filesystem state for the given cached page.

> >> or teach the fsync code how to flush populated data pages out
> >> of the radix.
> >
> > That doesn't solve the problem. Filesystems free and reallocate
> > filesystem blocks without intermediate block device mapping
> > invalidation calls, so what is one minute a data block accessed
> > by DAX may become a metadata block that is accessed via buffered
> > IO. It all goes to crap very quickly....
> >
> > However, I'd say fsync is not the place to address this. This
> > block device cache aliasing issue is supposed to be what
> > unmap_underlying_metadata() solves, right?
>
> I'll take a look at this. Right now I'm trying to implement the
> "clear block-device-inode S_DAX on fs mount" approach. My concern
> though is that we need to disable block device mmap while a
> filesystem is mounted...

/me chokes on his coffee.

When did mmapping the block device behind the back of a mounted
filesystem become a valid use case?
It's not supported for normal block devices and for the same reasons it won't be supported for DAX enabled block devices, either. i.e. I'm going to tell anyone who has an application that does this to go and take a hike when (not if!) they report filesystem corruption problems. > Maybe I don't need to worry because it's already the case that a > mmap of the raw device may not see the most up to date data for a > file that has dirty fs-page-cache data. It goes both ways. What happens if mkfs or fsck modifies the block device via mmap+DAX and then the filesystem mounts the block device and tries to read that metadata via the block device page cache? Quite frankly, DAX on the block device is a can of worms we really don't need to deal with right now. IMO it's a solution looking for a problem to solve, the "default to on" policy is wrong (DAX is opt-in, not opt-out) and given this we should turn it off until we've solved the more important problems we need to solve. i.e. We need to concentrate on getting data integrity working correctly first, then address the cache aliasing issues, then address the "safe access" issues, and then we can re-introduce block device DAX access... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
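The aliasing hazard discussed in this message (one path writing the media directly, as DAX does, while another path reads through a cached copy) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct toy_bdev, toy_cached_read(), toy_dax_write() and toy_invalidate() are hypothetical names invented for illustration and are not kernel interfaces, and the model shows the case where the cached copy was filled before the direct write happened.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16

/* Hypothetical model of one block that can be reached two ways. */
struct toy_bdev {
	char media[TOY_BLOCK_SIZE];  /* what is really on the persistent media */
	char cache[TOY_BLOCK_SIZE];  /* page-cache style copy of the block */
	bool cached;                 /* true once the cache copy has been filled */
};

/* Buffered read: fill the cache on first use, then always serve from it. */
static void toy_cached_read(struct toy_bdev *b, char *out)
{
	if (!b->cached) {
		memcpy(b->cache, b->media, TOY_BLOCK_SIZE);
		b->cached = true;
	}
	memcpy(out, b->cache, TOY_BLOCK_SIZE);
}

/* DAX-style write: goes straight to the media and never touches the cache. */
static void toy_dax_write(struct toy_bdev *b, const char *data)
{
	memcpy(b->media, data, TOY_BLOCK_SIZE);
}

/* Explicit invalidation: drop the cached copy so the next read refills it. */
static void toy_invalidate(struct toy_bdev *b)
{
	b->cached = false;
}
```

In this model a buffered read that follows toy_dax_write() keeps returning the old contents until toy_invalidate() is called, which is the kind of stale view the thread is worried about when the two access paths are mixed without an invalidation step in between.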
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <20160211204635.GI19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> <20160211204635.GI19486@dastard> Date: Thu, 11 Feb 2016 12:58:38 -0800 Message-ID: Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Dave Chinner Cc: Jan Kara , Ross Zwisler , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers List-ID: On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner wrote: [..] >> It seems to me we need to modify the >> metadata i/o paths to bypass the page cache, > > XFS doesn't use the block device page cache for it's metadata - it > has it's own internal metadata cache structures and uses get_pages > or heap memory to back it's metadata. But that doesn't make mixing > DAX and pages in the block device mapping tree sane. > > What you are missing here is that the underlying architecture of > journalling filesystems mean they can't use DAX for their metadata. > Modifications have to be buffered, because they have to be written > to the journal first before they are written back in place. IOWs, we > need to buffer changes in volatile memory for some time, and that > means we can't use DAX during transactional modifications. > > And to put the final nail in that coffin, metadata in XFS can be > discontiguous multi-block objects - in those situations we vmap the > underlying pages so they appear to the code to be a contiguous > buffer, and that's something we can't do with DAX.... 
Sorry, I wasn't clear when I said "bypass page cache" I meant a solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition table reads". However, I suspect that is broken if the filesystem is not ready to see a new page allocated for every I/O. I assume one thread will want to insert a page in the radix for another thread to find/manipulate before metadata gets written back to storage. >> or teach the fsync code >> how to flush populated data pages out of the radix. > > That doesn't solve the problem. Filesystems free and reallocate > filesystem blocks without intermediate block device mapping > invalidation calls, so what is one minute a data block accessed by > DAX may become a metadata block that accessed via buffered IO. It > all goes to crap very quickly.... > > However, I'd say fsync is not the place to address this. This block > device cache aliasing issue is supposed to be what > unmap_underlying_metadata() solves, right? I'll take a look at this. Right now I'm trying to implement the "clear block-device-inode S_DAX on fs mount" approach. My concern though is that we need to disable block device mmap while a filesystem is mounted... Maybe I don't need to worry because it's already the case that a mmap of the raw device may not see the most up to date data for a file that has dirty fs-page-cache data. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
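The "clear block-device-inode S_DAX on fs mount" approach mentioned in this message might behave roughly like the user-space model below. It is a sketch only: struct toy_bdev_inode, TOY_S_DAX and the toy_* helpers are hypothetical names, the flag value is not the kernel's, and the check in toy_bdev_open() that refuses DAX while a filesystem is mounted is an assumption of this sketch rather than something the thread settled on.

```c
#include <stdbool.h>

#define TOY_S_DAX 0x1u /* hypothetical flag bit, not the kernel's S_DAX value */

/* Hypothetical stand-in for the block device inode state the sketch needs. */
struct toy_bdev_inode {
	unsigned int flags;   /* models inode->i_flags */
	bool dax_capable;     /* models: has ->direct_access() and a page-aligned start */
	bool fs_mounted;      /* true while a filesystem is mounted on the device */
};

/* Raw open of the device node: only opt in to DAX if nothing is mounted. */
static void toy_bdev_open(struct toy_bdev_inode *bd)
{
	if (bd->dax_capable && !bd->fs_mounted)
		bd->flags |= TOY_S_DAX;
}

/* Filesystem mount: the filesystem owns the device now, so drop DAX. */
static void toy_fs_mount(struct toy_bdev_inode *bd)
{
	bd->fs_mounted = true;
	bd->flags &= ~TOY_S_DAX;
}

/* Predicate in the spirit of dax_mapping(): is the inode flagged for DAX? */
static bool toy_is_dax(const struct toy_bdev_inode *bd)
{
	return (bd->flags & TOY_S_DAX) != 0;
}
```

With this shape, a raw open before mount sets the flag, toy_fs_mount() clears it, and a later raw open while mounted leaves it clear, so the writeback path would never see a DAX-flagged block device inode that also carries page cache pages.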
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Fri, 12 Feb 2016 07:46:35 +1100 From: Dave Chinner Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160211204635.GI19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Jan Kara , Ross Zwisler , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers List-ID: On Thu, Feb 11, 2016 at 07:22:00AM -0800, Dan Williams wrote: > On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara wrote: > > On Wed 10-02-16 15:43:40, Ross Zwisler wrote: > >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to > >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > >> > > block devices and for XFS real-time files. > >> > > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem > >> > > ->writepages function so that it can supply us with a valid block > >> > > device. This also fixes DAX code to properly flush caches in response to > >> > > sync(2). 
> >> > > > >> > > Signed-off-by: Ross Zwisler > >> > > Signed-off-by: Jan Kara > >> > > --- > >> > > fs/block_dev.c | 16 +++++++++++++++- > >> > > fs/dax.c | 13 ++++++++----- > >> > > fs/ext2/inode.c | 11 +++++++++++ > >> > > fs/ext4/inode.c | 7 +++++++ > >> > > fs/xfs/xfs_aops.c | 9 +++++++++ > >> > > include/linux/dax.h | 6 ++++-- > >> > > mm/filemap.c | 12 ++++-------- > >> > > 7 files changed, 58 insertions(+), 16 deletions(-) > >> > > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c > >> > > index 39b3a17..fc01e43 100644 > >> > > --- a/fs/block_dev.c > >> > > +++ b/fs/block_dev.c > >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > >> > > return try_to_free_buffers(page); > >> > > } > >> > > > >> > > +static int blkdev_writepages(struct address_space *mapping, > >> > > + struct writeback_control *wbc) > >> > > +{ > >> > > + if (dax_mapping(mapping)) { > >> > > + struct block_device *bdev = I_BDEV(mapping->host); > >> > > + int error; > >> > > + > >> > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > >> > > + if (error) > >> > > + return error; > >> > > + } > >> > > + return generic_writepages(mapping, wbc); > >> > > +} > >> > > >> > Can you remind of the reason for calling generic_writepages() on DAX > >> > enabled address spaces? > >> > >> Sure. The initial version of this patch didn't do this, and during testing I > >> hit a bunch of xfstests failures. In ext2 at least I believe these were > >> happening because we were skipping the call into generic_writepages() for DAX > >> inodes. Without a lot of data to back this up, my guess is that this is due > >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > >> returns true), but having dirty page cache pages that need to be written back > >> as part of the writeback. > >> > >> Changing this so we always call generic_writepages() even in the DAX case > >> solved the xfstest failures. 
> >>
> >> If this sounds incorrect, please let me know and I'll go and gather more data.
> >
> > So I think a more correct fix is to not set S_DAX for inodes that will have
> > any pagecache pages - e.g. don't set S_DAX for block device inodes when
> > filesystem is mounted on it (probably the easiest is to just refuse to
> > mount filesystem on block device which has S_DAX set).
>
> I think we have a wider problem here. See __blkdev_get, we set S_DAX
> on all block devices that have ->direct_access() and have a
> page-aligned starting address.

That seems like a premature optimisation to me now. I didn't say
anything at the time because I was busy with other things and it
didn't affect XFS.

> It seems to me we need to modify the
> metadata i/o paths to bypass the page cache,

XFS doesn't use the block device page cache for its metadata - it
has its own internal metadata cache structures and uses get_pages
or heap memory to back its metadata. But that doesn't make mixing
DAX and pages in the block device mapping tree sane.

What you are missing here is that the underlying architecture of
journalling filesystems means they can't use DAX for their metadata.
Modifications have to be buffered, because they have to be written
to the journal first before they are written back in place. IOWs, we
need to buffer changes in volatile memory for some time, and that
means we can't use DAX during transactional modifications.

And to put the final nail in that coffin, metadata in XFS can be
discontiguous multi-block objects - in those situations we vmap the
underlying pages so they appear to the code to be a contiguous
buffer, and that's something we can't do with DAX....

> or teach the fsync code
> how to flush populated data pages out of the radix.

That doesn't solve the problem.
Filesystems free and reallocate
filesystem blocks without intermediate block device mapping
invalidation calls, so what is one minute a data block accessed by
DAX may become a metadata block that is accessed via buffered IO. It
all goes to crap very quickly....

However, I'd say fsync is not the place to address this. This block
device cache aliasing issue is supposed to be what
unmap_underlying_metadata() solves, right?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Thu, 11 Feb 2016 17:22:26 +0100
From: Jan Kara
Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
Message-ID: <20160211162226.GR21760@quack.suse.cz>
References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
To: Dan Williams
Cc: Jan Kara , Ross Zwisler , Dave Chinner , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers
List-ID:

On Thu 11-02-16 07:22:00, Dan Williams wrote:
> On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara wrote:
> > On Wed 10-02-16 15:43:40, Ross Zwisler wrote:
> >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
> >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
> >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> >> > > (ext2, ext4 & xfs) were centralized in
filemap_write_and_wait_range(). > >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to > >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > >> > > block devices and for XFS real-time files. > >> > > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem > >> > > ->writepages function so that it can supply us with a valid block > >> > > device. This also fixes DAX code to properly flush caches in response to > >> > > sync(2). > >> > > > >> > > Signed-off-by: Ross Zwisler > >> > > Signed-off-by: Jan Kara > >> > > --- > >> > > fs/block_dev.c | 16 +++++++++++++++- > >> > > fs/dax.c | 13 ++++++++----- > >> > > fs/ext2/inode.c | 11 +++++++++++ > >> > > fs/ext4/inode.c | 7 +++++++ > >> > > fs/xfs/xfs_aops.c | 9 +++++++++ > >> > > include/linux/dax.h | 6 ++++-- > >> > > mm/filemap.c | 12 ++++-------- > >> > > 7 files changed, 58 insertions(+), 16 deletions(-) > >> > > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c > >> > > index 39b3a17..fc01e43 100644 > >> > > --- a/fs/block_dev.c > >> > > +++ b/fs/block_dev.c > >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > >> > > return try_to_free_buffers(page); > >> > > } > >> > > > >> > > +static int blkdev_writepages(struct address_space *mapping, > >> > > + struct writeback_control *wbc) > >> > > +{ > >> > > + if (dax_mapping(mapping)) { > >> > > + struct block_device *bdev = I_BDEV(mapping->host); > >> > > + int error; > >> > > + > >> > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > >> > > + if (error) > >> > > + return error; > >> > > + } > >> > > + return generic_writepages(mapping, wbc); > >> > > +} > >> > > >> > Can you remind of the reason for calling generic_writepages() on DAX > >> > enabled address spaces? > >> > >> Sure. 
The initial version of this patch didn't do this, and during testing I > >> hit a bunch of xfstests failures. In ext2 at least I believe these were > >> happening because we were skipping the call into generic_writepages() for DAX > >> inodes. Without a lot of data to back this up, my guess is that this is due > >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > >> returns true), but having dirty page cache pages that need to be written back > >> as part of the writeback. > >> > >> Changing this so we always call generic_writepages() even in the DAX case > >> solved the xfstest failures. > >> > >> If this sounds incorrect, please let me know and I'll go and gather more data. > > > > So I think a more correct fix it to not set S_DAX for inodes that will have > > any pagecache pages - e.g. don't set S_DAX for block device inodes when > > filesystem is mounted on it (probably the easiest is to just refuse to > > mount filesystem on block device which has S_DAX set). > > I think we have a wider problem here. See __blkdev_get, we set S_DAX > on all block devices that have ->direct_access() and have a > page-aligned starting address. It seems to me we need to modify the > metadata i/o paths to bypass the page cache Heh, no way to do that easily. All the journalling machinery depends on buffers and pages... >, or teach the fsync code > how to flush populated data pages out of the radix. This might be doable but it will be difficult to avoid aliasing issues and data corruption. And mainly I don't see the point: When you mount a filesystem on top of block device, you do not want to mess with the block device directly, even less using DAX. So we just have to find a way how to set S_DAX for normal open but clear it from fs path. At worst, we could clear S_DAX on the block device in mount_bdev() or something like that... 
Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <20160211125044.GJ21760@quack.suse.cz> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> Date: Thu, 11 Feb 2016 07:22:00 -0800 Message-ID: Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: linux-fsdevel-owner@vger.kernel.org To: Jan Kara Cc: Ross Zwisler , Dave Chinner , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers List-ID: On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara wrote: > On Wed 10-02-16 15:43:40, Ross Zwisler wrote: >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw >> > > block devices and for XFS real-time files. >> > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem >> > > ->writepages function so that it can supply us with a valid block >> > > device. 
This also fixes DAX code to properly flush caches in response to >> > > sync(2). >> > > >> > > Signed-off-by: Ross Zwisler >> > > Signed-off-by: Jan Kara >> > > --- >> > > fs/block_dev.c | 16 +++++++++++++++- >> > > fs/dax.c | 13 ++++++++----- >> > > fs/ext2/inode.c | 11 +++++++++++ >> > > fs/ext4/inode.c | 7 +++++++ >> > > fs/xfs/xfs_aops.c | 9 +++++++++ >> > > include/linux/dax.h | 6 ++++-- >> > > mm/filemap.c | 12 ++++-------- >> > > 7 files changed, 58 insertions(+), 16 deletions(-) >> > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c >> > > index 39b3a17..fc01e43 100644 >> > > --- a/fs/block_dev.c >> > > +++ b/fs/block_dev.c >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) >> > > return try_to_free_buffers(page); >> > > } >> > > >> > > +static int blkdev_writepages(struct address_space *mapping, >> > > + struct writeback_control *wbc) >> > > +{ >> > > + if (dax_mapping(mapping)) { >> > > + struct block_device *bdev = I_BDEV(mapping->host); >> > > + int error; >> > > + >> > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); >> > > + if (error) >> > > + return error; >> > > + } >> > > + return generic_writepages(mapping, wbc); >> > > +} >> > >> > Can you remind of the reason for calling generic_writepages() on DAX >> > enabled address spaces? >> >> Sure. The initial version of this patch didn't do this, and during testing I >> hit a bunch of xfstests failures. In ext2 at least I believe these were >> happening because we were skipping the call into generic_writepages() for DAX >> inodes. Without a lot of data to back this up, my guess is that this is due >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping) >> returns true), but having dirty page cache pages that need to be written back >> as part of the writeback. >> >> Changing this so we always call generic_writepages() even in the DAX case >> solved the xfstest failures. 
>> >> If this sounds incorrect, please let me know and I'll go and gather more data. > > So I think a more correct fix it to not set S_DAX for inodes that will have > any pagecache pages - e.g. don't set S_DAX for block device inodes when > filesystem is mounted on it (probably the easiest is to just refuse to > mount filesystem on block device which has S_DAX set). I think we have a wider problem here. See __blkdev_get, we set S_DAX on all block devices that have ->direct_access() and have a page-aligned starting address. It seems to me we need to modify the metadata i/o paths to bypass the page cache, or teach the fsync code how to flush populated data pages out of the radix. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Thu, 11 Feb 2016 13:50:44 +0100 From: Jan Kara Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160211125044.GJ21760@quack.suse.cz> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160210224340.GA30938@linux.intel.com> Sender: owner-linux-mm@kvack.org To: Ross Zwisler Cc: Dave Chinner , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com, Jan Kara List-ID: On Wed 10-02-16 15:43:40, Ross Zwisler wrote: > On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). 
> > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > > block devices and for XFS real-time files. > > > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem > > > ->writepages function so that it can supply us with a valid block > > > device. This also fixes DAX code to properly flush caches in response to > > > sync(2). > > > > > > Signed-off-by: Ross Zwisler > > > Signed-off-by: Jan Kara > > > --- > > > fs/block_dev.c | 16 +++++++++++++++- > > > fs/dax.c | 13 ++++++++----- > > > fs/ext2/inode.c | 11 +++++++++++ > > > fs/ext4/inode.c | 7 +++++++ > > > fs/xfs/xfs_aops.c | 9 +++++++++ > > > include/linux/dax.h | 6 ++++-- > > > mm/filemap.c | 12 ++++-------- > > > 7 files changed, 58 insertions(+), 16 deletions(-) > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > index 39b3a17..fc01e43 100644 > > > --- a/fs/block_dev.c > > > +++ b/fs/block_dev.c > > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > > > return try_to_free_buffers(page); > > > } > > > > > > +static int blkdev_writepages(struct address_space *mapping, > > > + struct writeback_control *wbc) > > > +{ > > > + if (dax_mapping(mapping)) { > > > + struct block_device *bdev = I_BDEV(mapping->host); > > > + int error; > > > + > > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > > > + if (error) > > > + return error; > > > + } > > > + return generic_writepages(mapping, wbc); > > > +} > > > > Can you remind of the reason for calling generic_writepages() on DAX > > enabled address spaces? > > Sure. The initial version of this patch didn't do this, and during testing I > hit a bunch of xfstests failures. In ext2 at least I believe these were > happening because we were skipping the call into generic_writepages() for DAX > inodes. 
Without a lot of data to back this up, my guess is that this is due > to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > returns true), but having dirty page cache pages that need to be written back > as part of the writeback. > > Changing this so we always call generic_writepages() even in the DAX case > solved the xfstest failures. > > If this sounds incorrect, please let me know and I'll go and gather more data. So I think a more correct fix it to not set S_DAX for inodes that will have any pagecache pages - e.g. don't set S_DAX for block device inodes when filesystem is mounted on it (probably the easiest is to just refuse to mount filesystem on block device which has S_DAX set). Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Thu, 11 Feb 2016 10:44:00 +1100 From: Dave Chinner Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160210234400.GQ14668@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160210224340.GA30938@linux.intel.com> Sender: owner-linux-mm@kvack.org To: Ross Zwisler , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com, Jan Kara List-ID: On Wed, Feb 10, 2016 at 03:43:40PM -0700, Ross Zwisler wrote: > On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > > On Wed, Feb 
10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > > block devices and for XFS real-time files. > > > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem > > > ->writepages function so that it can supply us with a valid block > > > device. This also fixes DAX code to properly flush caches in response to > > > sync(2). > > > > > > Signed-off-by: Ross Zwisler > > > Signed-off-by: Jan Kara > > > --- > > > fs/block_dev.c | 16 +++++++++++++++- > > > fs/dax.c | 13 ++++++++----- > > > fs/ext2/inode.c | 11 +++++++++++ > > > fs/ext4/inode.c | 7 +++++++ > > > fs/xfs/xfs_aops.c | 9 +++++++++ > > > include/linux/dax.h | 6 ++++-- > > > mm/filemap.c | 12 ++++-------- > > > 7 files changed, 58 insertions(+), 16 deletions(-) > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > index 39b3a17..fc01e43 100644 > > > --- a/fs/block_dev.c > > > +++ b/fs/block_dev.c > > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > > > return try_to_free_buffers(page); > > > } > > > > > > +static int blkdev_writepages(struct address_space *mapping, > > > + struct writeback_control *wbc) > > > +{ > > > + if (dax_mapping(mapping)) { > > > + struct block_device *bdev = I_BDEV(mapping->host); > > > + int error; > > > + > > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > > > + if (error) > > > + return error; > > > + } > > > + return generic_writepages(mapping, wbc); > > > +} > > > > Can you remind of the reason for calling generic_writepages() on DAX > > enabled address spaces? > > Sure. 
The initial version of this patch didn't do this, and during testing I > hit a bunch of xfstests failures. In ext2 at least I believe these were > happening because we were skipping the call into generic_writepages() for DAX > inodes. Without a lot of data to back this up, my guess is that this is due > to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > returns true), but having dirty page cache pages that need to be written back > as part of the writeback. Hmmm - the ext2 filesystem metadata uses the block device page cache to buffer inode writeback, and so writeback doesn't occur until sync_blockdev() is called. But the data access should be through the ext2 inode address space, not the block device address space, so DAX flushing occurs in ext2_writepages. So how is the block device inode being marked as a DAX inode? If it is being marked as a DAX inode, how is this valid when the filesystem metadata uses bufferheads and requires struct pages to be found in the block device mapping tree? e.g. mkfs writes the metadata into the bdev via DAX, resulting in an DAX exceptional entry in the bdev radix tree, then __bread_gfp() comes along to read the same metadata after mount and expects to find pages in the blockdev radix tree? FWIW, this seems to be specifically a block device inode issue, though, not something that affects regular files in a filesystem. i.e. filesystem inodes can only be either DAX or non-DAX, and so there is no mixed mode flushing required, right? > Changing this so we always call generic_writepages() even in the > DAX case solved the xfstest failures. > > If this sounds incorrect, please let me know and I'll go and > gather more data. It seems to me that there's a problem here with DAX on block device inodes, but not for the filesystem mappings. At minimum, the block device needs a bloody big comment explaining this landmine so people don't forget why it is a special snowflake... Cheers, Dave. 
-- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Wed, 10 Feb 2016 15:43:40 -0700 From: Ross Zwisler Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160210224340.GA30938@linux.intel.com> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160210220312.GP14668@dastard> Sender: owner-linux-mm@kvack.org To: Dave Chinner Cc: Ross Zwisler , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com, Jan Kara List-ID: On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time files. > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem > > ->writepages function so that it can supply us with a valid block > > device. This also fixes DAX code to properly flush caches in response to > > sync(2). 
> > > > Signed-off-by: Ross Zwisler > > Signed-off-by: Jan Kara > > --- > > fs/block_dev.c | 16 +++++++++++++++- > > fs/dax.c | 13 ++++++++----- > > fs/ext2/inode.c | 11 +++++++++++ > > fs/ext4/inode.c | 7 +++++++ > > fs/xfs/xfs_aops.c | 9 +++++++++ > > include/linux/dax.h | 6 ++++-- > > mm/filemap.c | 12 ++++-------- > > 7 files changed, 58 insertions(+), 16 deletions(-) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index 39b3a17..fc01e43 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > > return try_to_free_buffers(page); > > } > > > > +static int blkdev_writepages(struct address_space *mapping, > > + struct writeback_control *wbc) > > +{ > > + if (dax_mapping(mapping)) { > > + struct block_device *bdev = I_BDEV(mapping->host); > > + int error; > > + > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > > + if (error) > > + return error; > > + } > > + return generic_writepages(mapping, wbc); > > +} > > Can you remind of the reason for calling generic_writepages() on DAX > enabled address spaces? Sure. The initial version of this patch didn't do this, and during testing I hit a bunch of xfstests failures. In ext2 at least I believe these were happening because we were skipping the call into generic_writepages() for DAX inodes. Without a lot of data to back this up, my guess is that this is due to metadata inodes or something being marked as DAX (so dax_mapping(mapping) returns true), but having dirty page cache pages that need to be written back as part of the writeback. Changing this so we always call generic_writepages() even in the DAX case solved the xfstest failures. If this sounds incorrect, please let me know and I'll go and gather more data. - Ross -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Thu, 11 Feb 2016 09:03:12 +1100 From: Dave Chinner Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160210220312.GP14668@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com, Jan Kara List-ID: On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > dax_writeback_mapping_range() needs a struct block_device, and it used to > get that from inode->i_sb->s_bdev. This is correct for normal inodes > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time files. > > Instead, call dax_writeback_mapping_range() directly from the filesystem > ->writepages function so that it can supply us with a valid block > device. This also fixes DAX code to properly flush caches in response to > sync(2). 
> > Signed-off-by: Ross Zwisler > Signed-off-by: Jan Kara > --- > fs/block_dev.c | 16 +++++++++++++++- > fs/dax.c | 13 ++++++++----- > fs/ext2/inode.c | 11 +++++++++++ > fs/ext4/inode.c | 7 +++++++ > fs/xfs/xfs_aops.c | 9 +++++++++ > include/linux/dax.h | 6 ++++-- > mm/filemap.c | 12 ++++-------- > 7 files changed, 58 insertions(+), 16 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 39b3a17..fc01e43 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > return try_to_free_buffers(page); > } > > +static int blkdev_writepages(struct address_space *mapping, > + struct writeback_control *wbc) > +{ > + if (dax_mapping(mapping)) { > + struct block_device *bdev = I_BDEV(mapping->host); > + int error; > + > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > + if (error) > + return error; > + } > + return generic_writepages(mapping, wbc); > +} Can you remind of the reason for calling generic_writepages() on DAX enabled address spaces? Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [PATCH v2 2/2] dax: move writeback calls into the filesystems Date: Wed, 10 Feb 2016 13:48:56 -0700 Message-Id: <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com, Jan Kara List-ID: Previously calls to dax_writeback_mapping_range() for all DAX filesystems (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). dax_writeback_mapping_range() needs a struct block_device, and it used to get that from inode->i_sb->s_bdev. This is correct for normal inodes mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time files. Instead, call dax_writeback_mapping_range() directly from the filesystem ->writepages function so that it can supply us with a valid block device. This also fixes DAX code to properly flush caches in response to sync(2). 
Signed-off-by: Ross Zwisler Signed-off-by: Jan Kara --- fs/block_dev.c | 16 +++++++++++++++- fs/dax.c | 13 ++++++++----- fs/ext2/inode.c | 11 +++++++++++ fs/ext4/inode.c | 7 +++++++ fs/xfs/xfs_aops.c | 9 +++++++++ include/linux/dax.h | 6 ++++-- mm/filemap.c | 12 ++++-------- 7 files changed, 58 insertions(+), 16 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 39b3a17..fc01e43 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) return try_to_free_buffers(page); } +static int blkdev_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + if (dax_mapping(mapping)) { + struct block_device *bdev = I_BDEV(mapping->host); + int error; + + error = dax_writeback_mapping_range(mapping, bdev, wbc); + if (error) + return error; + } + return generic_writepages(mapping, wbc); +} + static const struct address_space_operations def_blk_aops = { .readpage = blkdev_readpage, .readpages = blkdev_readpages, .writepage = blkdev_writepage, .write_begin = blkdev_write_begin, .write_end = blkdev_write_end, - .writepages = generic_writepages, + .writepages = blkdev_writepages, .releasepage = blkdev_releasepage, .direct_IO = blkdev_direct_IO, .is_dirty_writeback = buffer_check_dirty_writeback, diff --git a/fs/dax.c b/fs/dax.c index 9a173dd..034dd02 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev, * end]. This is required by data integrity operations to ensure file data is * on persistent storage prior to completion of the operation. 
*/ -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end) +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, struct writeback_control *wbc) { struct inode *inode = mapping->host; - struct block_device *bdev = inode->i_sb->s_bdev; pgoff_t start_index, end_index, pmd_index; pgoff_t indices[PAGEVEC_SIZE]; struct pagevec pvec; @@ -496,11 +495,15 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, int i, ret = 0; void *entry; + + if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL) + return 0; + if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT)) return -EIO; - start_index = start >> PAGE_CACHE_SHIFT; - end_index = end >> PAGE_CACHE_SHIFT; + start_index = wbc->range_start >> PAGE_CACHE_SHIFT; + end_index = wbc->range_end >> PAGE_CACHE_SHIFT; pmd_index = DAX_PMD_INDEX(start_index); rcu_read_lock(); diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index b6b965b..7e44fc3 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -876,6 +876,17 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset) static int ext2_writepages(struct address_space *mapping, struct writeback_control *wbc) { +#ifdef CONFIG_FS_DAX + if (dax_mapping(mapping)) { + int error; + + error = dax_writeback_mapping_range(mapping, + mapping->host->i_sb->s_bdev, wbc); + if (error) + return error; + } +#endif + return mpage_writepages(mapping, wbc, ext2_get_block); } diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 83bc8bf..8c42020 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2450,6 +2450,13 @@ static int ext4_writepages(struct address_space *mapping, trace_ext4_writepages(inode, wbc); + if (dax_mapping(mapping)) { + ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, + wbc); + if (ret) + goto out_writepages; + } + /* * No pages to write? 
This is mainly a kludge to avoid starting * a transaction for special inodes like journal inode on last iput() diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index fc20518..1139ecd 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -1208,6 +1208,15 @@ xfs_vm_writepages( struct writeback_control *wbc) { xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED); + if (dax_mapping(mapping)) { + int error; + + error = dax_writeback_mapping_range(mapping, + xfs_find_bdev_for_inode(mapping->host), wbc); + if (error) + return error; + } + return generic_writepages(mapping, wbc); } diff --git a/include/linux/dax.h b/include/linux/dax.h index 7b6bced..636dd59 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -52,6 +52,8 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end); + +struct writeback_control; +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, struct writeback_control *wbc); #endif diff --git a/mm/filemap.c b/mm/filemap.c index bc94386..a829779 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -446,7 +446,8 @@ int filemap_write_and_wait(struct address_space *mapping) { int err = 0; - if (mapping->nrpages) { + if (mapping->nrpages || + (dax_mapping(mapping) && mapping->nrexceptional)) { err = filemap_fdatawrite(mapping); /* * Even if the above returned error, the pages may be @@ -482,13 +483,8 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; - if (dax_mapping(mapping) && mapping->nrexceptional) { - err = dax_writeback_mapping_range(mapping, lstart, lend); - if (err) - return err; - } - - if (mapping->nrpages) { + if (mapping->nrpages || + (dax_mapping(mapping) && mapping->nrexceptional)) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); /* See comment of filemap_write_and_wait() */ -- 2.5.0 -- To 
unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Ross Zwisler Subject: [PATCH v2 1/2] dax: supply DAX clearing code with correct bdev Date: Wed, 10 Feb 2016 13:48:55 -0700 Message-Id: <1455137336-28720-2-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com List-ID: dax_clear_blocks() needs a valid struct block_device and previously it was using inode->i_sb->s_bdev in all cases. This is correct for normal inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time devices. Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change its arguments to take a bdev and a sector instead of an inode and a block. This better reflects what the function does, and it allows the filesystem and raw block device code to pass in an appropriate struct block_device. 
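Because dax_clear_sectors() takes a sector rather than a filesystem block, callers do the block-to-sector conversion themselves, as the ext2 hunk does with `<< (inode->i_blkbits - 9)`. A minimal userspace sketch of that arithmetic, assuming 512-byte sectors; block_to_sector() and SECTOR_SHIFT_MODEL are hypothetical names used only for illustration and are not kernel symbols:

```c
#include <stdint.h>

/* Assumption: a sector is 512 bytes, i.e. 1 << 9, as in the "- 9" above. */
#define SECTOR_SHIFT_MODEL 9

/*
 * Convert a filesystem block number to a 512-byte sector number.
 * blkbits is log2 of the block size (12 for 4096-byte blocks) and
 * must be at least SECTOR_SHIFT_MODEL for the shift to be valid.
 */
static uint64_t block_to_sector(uint64_t block, unsigned int blkbits)
{
	return block << (blkbits - SECTOR_SHIFT_MODEL);
}
```

For example, with 4096-byte blocks each block covers eight sectors, so block 10 starts at sector 80.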
Signed-off-by: Ross Zwisler Suggested-by: Dan Williams --- fs/dax.c | 9 ++++----- fs/ext2/inode.c | 6 ++++-- fs/xfs/xfs_aops.c | 2 +- fs/xfs/xfs_aops.h | 1 + fs/xfs/xfs_bmap_util.c | 3 ++- include/linux/dax.h | 2 +- 6 files changed, 13 insertions(+), 10 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index fc2e314..9a173dd 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -79,15 +79,14 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n) } /* - * dax_clear_blocks() is called from within transaction context from XFS, + * dax_clear_sectors() is called from within transaction context from XFS, * and hence this means the stack from this point must follow GFP_NOFS * semantics for all operations. */ -int dax_clear_blocks(struct inode *inode, sector_t block, long _size) +int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size) { - struct block_device *bdev = inode->i_sb->s_bdev; struct blk_dax_ctl dax = { - .sector = block << (inode->i_blkbits - 9), + .sector = _sector, .size = _size, }; @@ -109,7 +108,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long _size) wmb_pmem(); return 0; } -EXPORT_SYMBOL_GPL(dax_clear_blocks); +EXPORT_SYMBOL_GPL(dax_clear_sectors); /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */ static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 338eefd..b6b965b 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -737,8 +737,10 @@ static int ext2_get_blocks(struct inode *inode, * so that it's not found by another thread before it's * initialised */ - err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key), - 1 << inode->i_blkbits); + err = dax_clear_sectors(inode->i_sb->s_bdev, + le32_to_cpu(chain[depth-1].key) << + (inode->i_blkbits - 9), + 1 << inode->i_blkbits); if (err) { mutex_unlock(&ei->truncate_mutex); goto cleanup; diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 379c089..fc20518 100644 --- 
a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -55,7 +55,7 @@ xfs_count_page_state( } while ((bh = bh->b_this_page) != head); } -STATIC struct block_device * +struct block_device * xfs_find_bdev_for_inode( struct inode *inode) { diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h index f6ffc9a..a4343c6 100644 --- a/fs/xfs/xfs_aops.h +++ b/fs/xfs/xfs_aops.h @@ -62,5 +62,6 @@ int xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset, struct buffer_head *map_bh, int create); extern void xfs_count_page_state(struct page *, int *, int *); +extern struct block_device *xfs_find_bdev_for_inode(struct inode *); #endif /* __XFS_AOPS_H__ */ diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 07ef29b..ae9d755 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -75,7 +75,8 @@ xfs_zero_extent( ssize_t size = XFS_FSB_TO_B(mp, count_fsb); if (IS_DAX(VFS_I(ip))) - return dax_clear_blocks(VFS_I(ip), block, size); + return dax_clear_sectors(xfs_find_bdev_for_inode(VFS_I(ip)), + sector, size); /* * let the block layer decide on the fastest method of diff --git a/include/linux/dax.h b/include/linux/dax.h index 818e450..7b6bced 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -7,7 +7,7 @@ ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t, get_block_t, dio_iodone_t, int flags); -int dax_clear_blocks(struct inode *, sector_t block, long size); +int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size); int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); int dax_truncate_page(struct inode *, loff_t from, get_block_t); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t, -- 2.5.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750825AbcBJUtS (ORCPT ); Wed, 10 Feb 2016 15:49:18 -0500 Received: from mga14.intel.com ([192.55.52.115]:38539 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750713AbcBJUtQ (ORCPT ); Wed, 10 Feb 2016 15:49:16 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,427,1449561600"; d="scan'208";a="881530460" From: Ross Zwisler To: linux-kernel@vger.kernel.org Cc: Ross Zwisler , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com Subject: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Date: Wed, 10 Feb 2016 13:48:54 -0700 Message-Id: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> X-Mailer: git-send-email 2.5.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org During testing of raw block devices + DAX I noticed that the struct block_device that we were using for DAX operations was incorrect. For the fault handlers, etc. we can just get the correct bdev via get_block(), which is passed in as a function pointer, but for the *sync code and for sector zeroing we don't have access to get_block(). This is also an issue for XFS real-time devices, whenever we get those working. Patch one of this series fixes the DAX sector zeroing code by explicitly passing in a valid struct block_device. Patch two of this series fixes DAX *sync support by moving calls to dax_writeback_mapping_range() out of filemap_write_and_wait_range() and into the filesystem/block device ->writepages function so that it can supply us with a valid block device. This also fixes DAX code to properly flush caches in response to sync(2). 
Thanks to Jan Kara for his initial draft of patch 2:
https://lkml.org/lkml/2016/2/9/485

Here are the changes that I've made to that patch:

1) For DAX mappings, only return after calling
   dax_writeback_mapping_range() if we encountered an error.  In the
   non-error case we still need to write back normal pages, else we lose
   metadata updates.

2) In dax_writeback_mapping_range(), move the new check for

	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)

   above the i_blkbits check.  In my testing I found cases where
   dax_writeback_mapping_range() was called for inodes with i_blkbits !=
   PAGE_SHIFT - I'm assuming these are internal metadata inodes?  They have
   no exceptional DAX entries to flush, so we have no work to do, but if we
   return error from the i_blkbits check we will fail the overall writeback
   operation.  Please let me know if it seems wrong for us to be seeing
   inodes set to use DAX but with i_blkbits != PAGE_SHIFT and I'll get more
   info.

3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue
   the writeback in the case that DAX is enabled but we only have a nonzero
   mapping->nrpages.  As with 1) and 2), I believe this is necessary to
   properly writeback metadata changes.  If this sounds wrong, please let me
   know and I'll get more info.
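The ordering described in 1) and 2) can be sketched in userspace C. This is only a model under stated assumptions: struct mapping_model, the model_* helpers, MODEL_PAGE_SHIFT and MODEL_EIO are hypothetical stand-ins, not kernel types or APIs, and the actual flushing of exceptional entries is left out.

```c
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4096-byte pages */
#define MODEL_EIO 5		/* stand-in for the kernel's EIO */

enum sync_mode { SYNC_NONE, SYNC_ALL };

/* Hypothetical stand-in for the parts of an address_space used here. */
struct mapping_model {
	bool is_dax;			/* models dax_mapping(mapping) */
	unsigned long nrexceptional;	/* DAX exceptional entries */
	unsigned long nrpages;		/* ordinary page cache pages */
	unsigned int blkbits;		/* models inode->i_blkbits */
	int generic_calls;		/* times the page cache path ran */
};

/* Models change 2): test for work to do before the block size check. */
static int model_dax_writeback(struct mapping_model *m, enum sync_mode mode)
{
	if (!m->nrexceptional || mode != SYNC_ALL)
		return 0;
	if (m->blkbits != MODEL_PAGE_SHIFT)
		return -MODEL_EIO;
	/* A real implementation would flush each exceptional entry here. */
	return 0;
}

/* Models the page cache writeback path; only counts invocations. */
static int model_generic_writepages(struct mapping_model *m)
{
	m->generic_calls++;
	return 0;
}

/* Models change 1): return early only when the DAX step failed. */
static int model_writepages(struct mapping_model *m, enum sync_mode mode)
{
	if (m->is_dax) {
		int error = model_dax_writeback(m, mode);

		if (error)
			return error;
	}
	return model_generic_writepages(m);
}
```

In this model, a DAX mapping with 1 KiB blocks and no exceptional entries returns 0 from model_writepages() and still reaches the page cache path, while the same mapping with exceptional entries present returns the error without reaching it.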
A working tree can be found here: https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_bdev_v2 Ross Zwisler (2): dax: supply DAX clearing code with correct bdev dax: move writeback calls into the filesystems fs/block_dev.c | 16 +++++++++++++++- fs/dax.c | 22 ++++++++++++---------- fs/ext2/inode.c | 17 +++++++++++++++-- fs/ext4/inode.c | 7 +++++++ fs/xfs/xfs_aops.c | 11 ++++++++++- fs/xfs/xfs_aops.h | 1 + fs/xfs/xfs_bmap_util.c | 3 ++- include/linux/dax.h | 8 +++++--- mm/filemap.c | 12 ++++-------- 9 files changed, 71 insertions(+), 26 deletions(-) -- 2.5.0
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751432AbcBJWDT (ORCPT ); Wed, 10 Feb 2016 17:03:19 -0500 Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:22294 "EHLO
ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750801AbcBJWDQ (ORCPT ); Wed, 10 Feb 2016 17:03:16 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2AkCQBlsrtWPBATLHleKAECgw+BP4ZjgXmdVQEBAQEBAQaLaoVFhAeGBwICAQECgThNAQEBAQEBBwEBAQFBP4RCAQEEJxMcIxAIAxgJJQ8FJQMHGhOIGsBTAQEBBwIBHRiFMoR/iGwBBJZ4jUiOfYNTimyEWiguAYhSAQEB Date: Thu, 11 Feb 2016 09:03:12 +1100 From: Dave Chinner To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com, Jan Kara Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160210220312.GP14668@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > dax_writeback_mapping_range() needs a struct block_device, and it used to > get that from inode->i_sb->s_bdev. This is correct for normal inodes > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time files. > > Instead, call dax_writeback_mapping_range() directly from the filesystem > ->writepages function so that it can supply us with a valid block > device. 
This also fixes DAX code to properly flush caches in response to > sync(2). > > Signed-off-by: Ross Zwisler > Signed-off-by: Jan Kara > --- > fs/block_dev.c | 16 +++++++++++++++- > fs/dax.c | 13 ++++++++----- > fs/ext2/inode.c | 11 +++++++++++ > fs/ext4/inode.c | 7 +++++++ > fs/xfs/xfs_aops.c | 9 +++++++++ > include/linux/dax.h | 6 ++++-- > mm/filemap.c | 12 ++++-------- > 7 files changed, 58 insertions(+), 16 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 39b3a17..fc01e43 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > return try_to_free_buffers(page); > } > > +static int blkdev_writepages(struct address_space *mapping, > + struct writeback_control *wbc) > +{ > + if (dax_mapping(mapping)) { > + struct block_device *bdev = I_BDEV(mapping->host); > + int error; > + > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > + if (error) > + return error; > + } > + return generic_writepages(mapping, wbc); > +} Can you remind me of the reason for calling generic_writepages() on DAX enabled address spaces? Cheers, Dave.
-- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751721AbcBJWnz (ORCPT ); Wed, 10 Feb 2016 17:43:55 -0500 Received: from mga04.intel.com ([192.55.52.120]:3936 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750759AbcBJWnw (ORCPT ); Wed, 10 Feb 2016 17:43:52 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,427,1449561600"; d="scan'208";a="743926718" Date: Wed, 10 Feb 2016 15:43:40 -0700 From: Ross Zwisler To: Dave Chinner Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com, Jan Kara Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160210224340.GA30938@linux.intel.com> Mail-Followup-To: Ross Zwisler , Dave Chinner , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com, Jan Kara References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160210220312.GP14668@dastard> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > (ext2, ext4 & xfs) were 
centralized in filemap_write_and_wait_range(). > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > block devices and for XFS real-time files. > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem > > ->writepages function so that it can supply us with a valid block > > device. This also fixes DAX code to properly flush caches in response to > > sync(2). > > > > Signed-off-by: Ross Zwisler > > Signed-off-by: Jan Kara > > --- > > fs/block_dev.c | 16 +++++++++++++++- > > fs/dax.c | 13 ++++++++----- > > fs/ext2/inode.c | 11 +++++++++++ > > fs/ext4/inode.c | 7 +++++++ > > fs/xfs/xfs_aops.c | 9 +++++++++ > > include/linux/dax.h | 6 ++++-- > > mm/filemap.c | 12 ++++-------- > > 7 files changed, 58 insertions(+), 16 deletions(-) > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > index 39b3a17..fc01e43 100644 > > --- a/fs/block_dev.c > > +++ b/fs/block_dev.c > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > > return try_to_free_buffers(page); > > } > > > > +static int blkdev_writepages(struct address_space *mapping, > > + struct writeback_control *wbc) > > +{ > > + if (dax_mapping(mapping)) { > > + struct block_device *bdev = I_BDEV(mapping->host); > > + int error; > > + > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > > + if (error) > > + return error; > > + } > > + return generic_writepages(mapping, wbc); > > +} > > Can you remind of the reason for calling generic_writepages() on DAX > enabled address spaces? Sure. The initial version of this patch didn't do this, and during testing I hit a bunch of xfstests failures. In ext2 at least I believe these were happening because we were skipping the call into generic_writepages() for DAX inodes. 
Without a lot of data to back this up, my guess is that this is due to metadata inodes or something being marked as DAX (so dax_mapping(mapping) returns true), but having dirty page cache pages that need to be written back as part of the writeback. Changing this so we always call generic_writepages() even in the DAX case solved the xfstest failures. If this sounds incorrect, please let me know and I'll go and gather more data. - Ross From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751410AbcBJXoH (ORCPT ); Wed, 10 Feb 2016 18:44:07 -0500 Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:32554 "EHLO ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750728AbcBJXoE (ORCPT ); Wed, 10 Feb 2016 18:44:04 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2CYDADqybtWPBATLHleDhoBAoMPgT+CaIN6gXmdVAEBAQEBAQaLZ4VFhAeGBwICAQECgThNAQEBAQEBBwEBAQFBP4RBAQEBAwEnExwoCwgDGAklDwUlAwcaARKIEwfAXQELHhiFMoR+iGwFlneNSIFkhEODJoUvg1OKbIQLTyguAYhSAQEB Date: Thu, 11 Feb 2016 10:44:00 +1100 From: Dave Chinner To: Ross Zwisler , linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com, Jan Kara Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160210234400.GQ14668@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160210224340.GA30938@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Feb 10, 2016 at 03:43:40PM -0700, Ross Zwisler wrote: > On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > > block devices and for XFS real-time files. > > > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem > > > ->writepages function so that it can supply us with a valid block > > > device. This also fixes DAX code to properly flush caches in response to > > > sync(2). > > > > > > Signed-off-by: Ross Zwisler > > > Signed-off-by: Jan Kara > > > --- > > > fs/block_dev.c | 16 +++++++++++++++- > > > fs/dax.c | 13 ++++++++----- > > > fs/ext2/inode.c | 11 +++++++++++ > > > fs/ext4/inode.c | 7 +++++++ > > > fs/xfs/xfs_aops.c | 9 +++++++++ > > > include/linux/dax.h | 6 ++++-- > > > mm/filemap.c | 12 ++++-------- > > > 7 files changed, 58 insertions(+), 16 deletions(-) > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > index 39b3a17..fc01e43 100644 > > > --- a/fs/block_dev.c > > > +++ b/fs/block_dev.c > > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > > > return try_to_free_buffers(page); > > > } > > > > > > +static int blkdev_writepages(struct address_space *mapping, > > > + struct writeback_control *wbc) > > > +{ > > > + if (dax_mapping(mapping)) { > > > + struct block_device *bdev = I_BDEV(mapping->host); > > > + int error; > > > + > > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > > > + if (error) > > > + return 
error; > > > + } > > > + return generic_writepages(mapping, wbc); > > > +} > > > > Can you remind of the reason for calling generic_writepages() on DAX > > enabled address spaces? > > Sure. The initial version of this patch didn't do this, and during testing I > hit a bunch of xfstests failures. In ext2 at least I believe these were > happening because we were skipping the call into generic_writepages() for DAX > inodes. Without a lot of data to back this up, my guess is that this is due > to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > returns true), but having dirty page cache pages that need to be written back > as part of the writeback. Hmmm - the ext2 filesystem metadata uses the block device page cache to buffer inode writeback, and so writeback doesn't occur until sync_blockdev() is called. But the data access should be through the ext2 inode address space, not the block device address space, so DAX flushing occurs in ext2_writepages. So how is the block device inode being marked as a DAX inode? If it is being marked as a DAX inode, how is this valid when the filesystem metadata uses bufferheads and requires struct pages to be found in the block device mapping tree? e.g. mkfs writes the metadata into the bdev via DAX, resulting in a DAX exceptional entry in the bdev radix tree, then __bread_gfp() comes along to read the same metadata after mount and expects to find pages in the blockdev radix tree? FWIW, this seems to be specifically a block device inode issue, though, not something that affects regular files in a filesystem. i.e. filesystem inodes can only be either DAX or non-DAX, and so there is no mixed mode flushing required, right? > Changing this so we always call generic_writepages() even in the > DAX case solved the xfstest failures. > > If this sounds incorrect, please let me know and I'll go and > gather more data.
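The control flow under discussion (flush the DAX entries first, return early only on error, otherwise always fall through to ordinary page writeback) can be modelled in userspace roughly as follows. This is only a sketch: struct mapping_model and the model_*() functions are hypothetical stand-ins for the address_space, dax_writeback_mapping_range() and generic_writepages(), and they do nothing but count calls.

```c
#include <stdbool.h>

/* Hypothetical userspace model of an address_space; not the kernel struct. */
struct mapping_model {
	bool is_dax;            /* stands in for dax_mapping(mapping) */
	int dax_flush_result;   /* value the modelled DAX flush returns */
	int dax_flushes;        /* number of times the DAX flush ran */
	int generic_writebacks; /* number of times page writeback ran */
};

/* Models dax_writeback_mapping_range(): flush DAX entries, possibly fail. */
int model_dax_writeback(struct mapping_model *m)
{
	m->dax_flushes++;
	return m->dax_flush_result;
}

/* Models generic_writepages(): write back ordinary page cache pages. */
int model_generic_writepages(struct mapping_model *m)
{
	m->generic_writebacks++;
	return 0;
}

/* Same shape as blkdev_writepages() in the quoted patch. */
int model_writepages(struct mapping_model *m)
{
	if (m->is_dax) {
		int error = model_dax_writeback(m);

		/* A failed DAX flush aborts before page writeback. */
		if (error)
			return error;
	}
	/* Reached for non-DAX mappings and for DAX mappings alike. */
	return model_generic_writepages(m);
}
```

In this model a DAX mapping whose flush succeeds still goes through page writeback, which is the behaviour being questioned for mappings that should not have page cache pages at all.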
It seems to me that there's a problem here with DAX on block device inodes, but not for the filesystem mappings. At minimum, the block device needs a bloody big comment explaining this landmine so people don't forget why it is a special snowflake... Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752799AbcBKMmx (ORCPT ); Thu, 11 Feb 2016 07:42:53 -0500 Received: from mx2.suse.de ([195.135.220.15]:51936 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751175AbcBKMmv (ORCPT ); Thu, 11 Feb 2016 07:42:51 -0500 Date: Thu, 11 Feb 2016 13:43:04 +0100 From: Jan Kara To: Ross Zwisler Cc: linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Message-ID: <20160211124304.GI21760@quack.suse.cz> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 10-02-16 13:48:54, Ross Zwisler wrote: > During testing of raw block devices + DAX I noticed that the struct > block_device that we were using for DAX operations was incorrect. For the > fault handlers, etc. we can just get the correct bdev via get_block(), > which is passed in as a function pointer, but for the *sync code and for > sector zeroing we don't have access to get_block(). This is also an issue > for XFS real-time devices, whenever we get those working. 
> > Patch one of this series fixes the DAX sector zeroing code by explicitly > passing in a valid struct block_device. > > Patch two of this series fixes DAX *sync support by moving calls to > dax_writeback_mapping_range() out of filemap_write_and_wait_range() and > into the filesystem/block device ->writepages function so that it can > supply us with a valid block device. This also fixes DAX code to properly > flush caches in response to sync(2). > > Thanks to Jan Kara for his initial draft of patch 2: > https://lkml.org/lkml/2016/2/9/485 > > Here are the changes that I've made to that patch: > > 1) For DAX mappings, only return after calling > dax_writeback_mapping_range() if we encountered an error. In the non-error > case we still need to write back normal pages, else we lose metadata > updates. > > 2) In dax_writeback_mapping_range(), move the new check for > if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL) > above the i_blkbits check. In my testing I found cases where > dax_writeback_mapping_range() was called for inodes with i_blkbits != > PAGE_SHIFT - I'm assuming these are internal metadata inodes? They have no > exceptional DAX entries to flush, so we have no work to do, but if we > return error from the i_blkbits check we will fail the overall writeback > operation. Please let me know if it seems wrong for us to be seeing inodes > set to use DAX but with i_blkbits != PAGE_SHIFT and I'll get more info. So I'm wondering - how come the S_DAX flag got set for an inode where i_blkbits != PAGE_SHIFT? That would seem to be a bug? I specifically ordered the checks like this to catch such issues. > 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue > the writeback in the case that DAX is enabled but we only have a nonzero > mapping->nrpages. As with 1) and 2), I believe this is necessary to > properly writeback metadata changes. If this sounds wrong, please let me > know and I'll get more info. And I'm surprised here as well.
If there are dax_mapping() inodes that have pagecache pages, then we have issues with radix tree handling as well. So how come dax_mapping() inodes have pages attached? If it is about block device inodes, then I find it buggy, that S_DAX gets set for such inodes when filesystem is mounted on them because in such cases we are IMO asking for data corruption sooner rather than later... Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752870AbcBKMub (ORCPT ); Thu, 11 Feb 2016 07:50:31 -0500 Received: from mx2.suse.de ([195.135.220.15]:52518 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752236AbcBKMu3 (ORCPT ); Thu, 11 Feb 2016 07:50:29 -0500 Date: Thu, 11 Feb 2016 13:50:44 +0100 From: Jan Kara To: Ross Zwisler Cc: Dave Chinner , linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com, Jan Kara Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160211125044.GJ21760@quack.suse.cz> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160210224340.GA30938@linux.intel.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 10-02-16 15:43:40, Ross Zwisler wrote: > On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > > > Previously calls to 
dax_writeback_mapping_range() for all DAX filesystems > > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > > block devices and for XFS real-time files. > > > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem > > > ->writepages function so that it can supply us with a valid block > > > device. This also fixes DAX code to properly flush caches in response to > > > sync(2). > > > > > > Signed-off-by: Ross Zwisler > > > Signed-off-by: Jan Kara > > > --- > > > fs/block_dev.c | 16 +++++++++++++++- > > > fs/dax.c | 13 ++++++++----- > > > fs/ext2/inode.c | 11 +++++++++++ > > > fs/ext4/inode.c | 7 +++++++ > > > fs/xfs/xfs_aops.c | 9 +++++++++ > > > include/linux/dax.h | 6 ++++-- > > > mm/filemap.c | 12 ++++-------- > > > 7 files changed, 58 insertions(+), 16 deletions(-) > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > index 39b3a17..fc01e43 100644 > > > --- a/fs/block_dev.c > > > +++ b/fs/block_dev.c > > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > > > return try_to_free_buffers(page); > > > } > > > > > > +static int blkdev_writepages(struct address_space *mapping, > > > + struct writeback_control *wbc) > > > +{ > > > + if (dax_mapping(mapping)) { > > > + struct block_device *bdev = I_BDEV(mapping->host); > > > + int error; > > > + > > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > > > + if (error) > > > + return error; > > > + } > > > + return generic_writepages(mapping, wbc); > > > +} > > > > Can you remind of the reason for calling generic_writepages() on DAX > > enabled address spaces? > > Sure. The initial version of this patch didn't do this, and during testing I > hit a bunch of xfstests failures. 
In ext2 at least I believe these were > happening because we were skipping the call into generic_writepages() for DAX > inodes. Without a lot of data to back this up, my guess is that this is due > to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > returns true), but having dirty page cache pages that need to be written back > as part of the writeback. > > Changing this so we always call generic_writepages() even in the DAX case > solved the xfstest failures. > > If this sounds incorrect, please let me know and I'll go and gather more data. So I think a more correct fix is to not set S_DAX for inodes that will have any pagecache pages - e.g. don't set S_DAX for block device inodes when a filesystem is mounted on them (probably the easiest is to just refuse to mount a filesystem on a block device which has S_DAX set). Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751730AbcBKPWG (ORCPT ); Thu, 11 Feb 2016 10:22:06 -0500 Received: from mail-yk0-f173.google.com ([209.85.160.173]:34510 "EHLO mail-yk0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750802AbcBKPWB (ORCPT ); Thu, 11 Feb 2016 10:22:01 -0500 MIME-Version: 1.0 In-Reply-To: <20160211125044.GJ21760@quack.suse.cz> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> Date: Thu, 11 Feb 2016 07:22:00 -0800 Message-ID: Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems From: Dan Williams To: Jan Kara Cc: Ross Zwisler , Dave Chinner , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS
Developers Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara wrote: > On Wed 10-02-16 15:43:40, Ross Zwisler wrote: >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw >> > > block devices and for XFS real-time files. >> > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem >> > > ->writepages function so that it can supply us with a valid block >> > > device. This also fixes DAX code to properly flush caches in response to >> > > sync(2). 
>> > > >> > > Signed-off-by: Ross Zwisler >> > > Signed-off-by: Jan Kara >> > > --- >> > > fs/block_dev.c | 16 +++++++++++++++- >> > > fs/dax.c | 13 ++++++++----- >> > > fs/ext2/inode.c | 11 +++++++++++ >> > > fs/ext4/inode.c | 7 +++++++ >> > > fs/xfs/xfs_aops.c | 9 +++++++++ >> > > include/linux/dax.h | 6 ++++-- >> > > mm/filemap.c | 12 ++++-------- >> > > 7 files changed, 58 insertions(+), 16 deletions(-) >> > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c >> > > index 39b3a17..fc01e43 100644 >> > > --- a/fs/block_dev.c >> > > +++ b/fs/block_dev.c >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) >> > > return try_to_free_buffers(page); >> > > } >> > > >> > > +static int blkdev_writepages(struct address_space *mapping, >> > > + struct writeback_control *wbc) >> > > +{ >> > > + if (dax_mapping(mapping)) { >> > > + struct block_device *bdev = I_BDEV(mapping->host); >> > > + int error; >> > > + >> > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); >> > > + if (error) >> > > + return error; >> > > + } >> > > + return generic_writepages(mapping, wbc); >> > > +} >> > >> > Can you remind of the reason for calling generic_writepages() on DAX >> > enabled address spaces? >> >> Sure. The initial version of this patch didn't do this, and during testing I >> hit a bunch of xfstests failures. In ext2 at least I believe these were >> happening because we were skipping the call into generic_writepages() for DAX >> inodes. Without a lot of data to back this up, my guess is that this is due >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping) >> returns true), but having dirty page cache pages that need to be written back >> as part of the writeback. >> >> Changing this so we always call generic_writepages() even in the DAX case >> solved the xfstest failures. >> >> If this sounds incorrect, please let me know and I'll go and gather more data. 
> > So I think a more correct fix it to not set S_DAX for inodes that will have > any pagecache pages - e.g. don't set S_DAX for block device inodes when > filesystem is mounted on it (probably the easiest is to just refuse to > mount filesystem on block device which has S_DAX set). I think we have a wider problem here. See __blkdev_get, we set S_DAX on all block devices that have ->direct_access() and have a page-aligned starting address. It seems to me we need to modify the metadata i/o paths to bypass the page cache, or teach the fsync code how to flush populated data pages out of the radix. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751529AbcBKQWP (ORCPT ); Thu, 11 Feb 2016 11:22:15 -0500 Received: from mx2.suse.de ([195.135.220.15]:37105 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750894AbcBKQWN (ORCPT ); Thu, 11 Feb 2016 11:22:13 -0500 Date: Thu, 11 Feb 2016 17:22:26 +0100 From: Jan Kara To: Dan Williams Cc: Jan Kara , Ross Zwisler , Dave Chinner , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160211162226.GR21760@quack.suse.cz> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 11-02-16 07:22:00, Dan Williams wrote: > On Thu, Feb 11, 2016 at 4:50 
AM, Jan Kara wrote: > > On Wed 10-02-16 15:43:40, Ross Zwisler wrote: > >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to > >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > >> > > block devices and for XFS real-time files. > >> > > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem > >> > > ->writepages function so that it can supply us with a valid block > >> > > device. This also fixes DAX code to properly flush caches in response to > >> > > sync(2). > >> > > > >> > > Signed-off-by: Ross Zwisler > >> > > Signed-off-by: Jan Kara > >> > > --- > >> > > fs/block_dev.c | 16 +++++++++++++++- > >> > > fs/dax.c | 13 ++++++++----- > >> > > fs/ext2/inode.c | 11 +++++++++++ > >> > > fs/ext4/inode.c | 7 +++++++ > >> > > fs/xfs/xfs_aops.c | 9 +++++++++ > >> > > include/linux/dax.h | 6 ++++-- > >> > > mm/filemap.c | 12 ++++-------- > >> > > 7 files changed, 58 insertions(+), 16 deletions(-) > >> > > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c > >> > > index 39b3a17..fc01e43 100644 > >> > > --- a/fs/block_dev.c > >> > > +++ b/fs/block_dev.c > >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > >> > > return try_to_free_buffers(page); > >> > > } > >> > > > >> > > +static int blkdev_writepages(struct address_space *mapping, > >> > > + struct writeback_control *wbc) > >> > > +{ > >> > > + if (dax_mapping(mapping)) { > >> > > + struct block_device *bdev = I_BDEV(mapping->host); > >> > > + int error; > >> > > + > >> > > + error = dax_writeback_mapping_range(mapping, 
bdev, wbc); > >> > > + if (error) > >> > > + return error; > >> > > + } > >> > > + return generic_writepages(mapping, wbc); > >> > > +} > >> > > >> > Can you remind of the reason for calling generic_writepages() on DAX > >> > enabled address spaces? > >> > >> Sure. The initial version of this patch didn't do this, and during testing I > >> hit a bunch of xfstests failures. In ext2 at least I believe these were > >> happening because we were skipping the call into generic_writepages() for DAX > >> inodes. Without a lot of data to back this up, my guess is that this is due > >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > >> returns true), but having dirty page cache pages that need to be written back > >> as part of the writeback. > >> > >> Changing this so we always call generic_writepages() even in the DAX case > >> solved the xfstest failures. > >> > >> If this sounds incorrect, please let me know and I'll go and gather more data. > > > > So I think a more correct fix it to not set S_DAX for inodes that will have > > any pagecache pages - e.g. don't set S_DAX for block device inodes when > > filesystem is mounted on it (probably the easiest is to just refuse to > > mount filesystem on block device which has S_DAX set). > > I think we have a wider problem here. See __blkdev_get, we set S_DAX > on all block devices that have ->direct_access() and have a > page-aligned starting address. It seems to me we need to modify the > metadata i/o paths to bypass the page cache Heh, no way to do that easily. All the journalling machinery depends on buffers and pages... >, or teach the fsync code > how to flush populated data pages out of the radix. This might be doable but it will be difficult to avoid aliasing issues and data corruption. And mainly I don't see the point: When you mount a filesystem on top of block device, you do not want to mess with the block device directly, even less using DAX. 
So we just have to find a way how to set S_DAX for normal open but clear it from fs path. At worst, we could clear S_DAX on the block device in mount_bdev() or something like that... Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751120AbcBKTtj (ORCPT ); Thu, 11 Feb 2016 14:49:39 -0500 Received: from mga11.intel.com ([192.55.52.93]:32491 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750911AbcBKTte (ORCPT ); Thu, 11 Feb 2016 14:49:34 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,432,1449561600"; d="scan'208";a="650842291" Date: Thu, 11 Feb 2016 12:49:22 -0700 From: Ross Zwisler To: Jan Kara Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Message-ID: <20160211194922.GA5260@linux.intel.com> Mail-Followup-To: Ross Zwisler , Jan Kara , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <20160211124304.GI21760@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160211124304.GI21760@quack.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote: > On Wed 10-02-16 13:48:54, Ross Zwisler wrote: > > During 
testing of raw block devices + DAX I noticed that the struct > > block_device that we were using for DAX operations was incorrect. For the > > fault handlers, etc. we can just get the correct bdev via get_block(), > > which is passed in as a function pointer, but for the *sync code and for > > sector zeroing we don't have access to get_block(). This is also an issue > > for XFS real-time devices, whenever we get those working. > > > > Patch one of this series fixes the DAX sector zeroing code by explicitly > > passing in a valid struct block_device. > > > > Patch two of this series fixes DAX *sync support by moving calls to > > dax_writeback_mapping_range() out of filemap_write_and_wait_range() and > > into the filesystem/block device ->writepages function so that it can > > supply us with a valid block device. This also fixes DAX code to properly > > flush caches in response to sync(2). > > > > Thanks to Jan Kara for his initial draft of patch 2: > > https://lkml.org/lkml/2016/2/9/485 > > > > Here are the changes that I've made to that patch: > > > > 1) For DAX mappings, only return after calling > > dax_writeback_mapping_range() if we encountered an error. In the non-error > > case we still need to write back normal pages, else we lose metadata > > updates. > > > > 2) In dax_writeback_mapping_range(), move the new check for > > if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL) > > above the i_blkbits check. In my testing I found cases where > > dax_writeback_mapping_range() was called for inodes with i_blkbits != > > PAGE_SHIFT - I'm assuming these are internal metadata inodes? They have no > > exceptional DAX entries to flush, so we have no work to do, but if we > > return error from the i_blkbits check we will fail the overall writeback > > operation. Please let me know if it seems wrong for us to be seeing inodes > > set to use DAX but with i_blkbits != PAGE_SHIFT and I'll get more info. 
>
> So I'm wondering - how come S_DAX flag got set for inode where i_blkbits !=
> PAGE_SHIFT? That would seem to be a bug? I specifically ordered the checks
> like this to catch such issues.

I've isolated this one - this happens for all three filesystems (ext2, ext4 &
XFS), and does indeed have to do with the fact that S_DAX is set for
bdev->bd_inode. Here is one failure path:

[ 102.866637] [] dump_stack+0x85/0xc2
[ 102.867101] [] dax_writeback_mapping_range+0x60/0xe0
[ 102.867738] [] blkdev_writepages+0x3f/0x50
[ 102.868272] [] do_writepages+0x21/0x30
[ 102.868784] [] __filemap_fdatawrite_range+0xc6/0x100
[ 102.869378] [] filemap_write_and_wait+0x4a/0xa0
[ 102.869933] [] set_blocksize+0x70/0xd0
[ 102.870424] [] sb_set_blocksize+0x1d/0x50
[ 102.870933] [] ext4_fill_super+0x75b/0x3360
[ 102.871487] [] ? vsnprintf+0x201/0x4c0
[ 102.872005] [] ? snprintf+0x49/0x60
[ 102.872499] [] mount_bdev+0x180/0x1b0
[ 102.872981] [] ? ext4_calculate_overhead+0x370/0x370
[ 102.873580] [] ext4_mount+0x15/0x20
[ 102.874042] [] mount_fs+0x38/0x170
[ 102.874524] [] vfs_kern_mount+0x6b/0x150
[ 102.875041] [] do_mount+0x24f/0xe90
[ 102.875508] [] ? mntput+0x24/0x40
[ 102.875958] [] ? __kmalloc_track_caller+0xea/0x240
[ 102.876542] [] ? copy_mount_options+0x2c/0x210
[ 102.877087] [] SyS_mount+0x95/0xe0
[ 102.877573] [] entry_SYSCALL_64_fastpath+0x12/0x76

In set_blocksize() we are actually updating bdev->bd_inode->i_blkbits to be 12,
but before that happens we do a sync_blockdev() with i_blkbits at 10, which
causes the failure. This can be reproduced easily just by mounting an ext2 or
ext4 filesystem.

I think the plan of unsetting S_DAX on bdev->bd_inode when we mount will save
us from this, as long as we do it super early in the mount process.
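
For illustration only, here is a minimal userspace C model of the check
ordering discussed in this message. It is a sketch under stated assumptions,
not the fs/dax.c code: the model_* structs and functions are hypothetical
stand-ins, the -EIO return value is chosen for illustration, and 4 KiB pages
(PAGE_SHIFT of 12) are assumed. It shows why a block device inode that still
has i_blkbits at 10 and no DAX entries fails writeback when the block size
check runs first, and passes when the "nothing to flush" check runs first.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical, simplified stand-in for the mapping/inode state. */
struct model_mapping {
    unsigned int i_blkbits;      /* log2 of the block size; 10 early in mount */
    unsigned long nrexceptional; /* number of DAX entries that may need flushing */
};

/* Hypothetical stand-in for struct writeback_control. */
struct model_wbc {
    bool sync_all; /* models wbc->sync_mode == WB_SYNC_ALL */
};

/* Ordering A: block size check first. Fails even with nothing to flush. */
int model_writeback_blkbits_first(const struct model_mapping *m,
                                  const struct model_wbc *wbc)
{
    if (m->i_blkbits != MODEL_PAGE_SHIFT)
        return -EIO; /* error value chosen for illustration */
    if (!m->nrexceptional || !wbc->sync_all)
        return 0;
    return 0; /* a real implementation would flush the entries here */
}

/* Ordering B: "nothing to do" check first, then the block size check. */
int model_writeback_empty_first(const struct model_mapping *m,
                                const struct model_wbc *wbc)
{
    if (!m->nrexceptional || !wbc->sync_all)
        return 0;
    if (m->i_blkbits != MODEL_PAGE_SHIFT)
        return -EIO; /* still catches DAX entries on a mis-sized inode */
    return 0; /* a real implementation would flush the entries here */
}

/* Self-check for the early-mount case: i_blkbits == 10, no DAX entries. */
void model_selfcheck(void)
{
    struct model_mapping bdev = { .i_blkbits = 10, .nrexceptional = 0 };
    struct model_wbc wbc = { .sync_all = true };

    assert(model_writeback_blkbits_first(&bdev, &wbc) == -EIO);
    assert(model_writeback_empty_first(&bdev, &wbc) == 0);
}
```

Calling model_selfcheck() from a small main() exercises both orderings. Note
that ordering B only hides the symptom in this model; it does not address the
question of whether S_DAX should be set on bdev->bd_inode at all.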
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751402AbcBKUqm (ORCPT ); Thu, 11 Feb 2016 15:46:42 -0500 Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:38131 "EHLO ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751067AbcBKUqk (ORCPT ); Thu, 11 Feb 2016 15:46:40 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2D9CgDL8rxWPBATLHleKAECgw+BP4ZigXmdYAEBBoFoigOFRYQIhgcCAgEBAoE0TQEBAQEBAQcBAQEBQT+EQQEBAQMBJxMcIwULCAMSBgklDwUlAwcGFBOIEgfBSQEBAQcCAR0YhTKEfohsBZZ3jUqBZoRDiFWDU4prhFsoLgGIUgEBAQ Date: Fri, 12 Feb 2016 07:46:35 +1100 From: Dave Chinner To: Dan Williams Cc: Jan Kara , Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160211204635.GI19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 11, 2016 at 07:22:00AM -0800, Dan Williams wrote: > On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara wrote: > > On Wed 10-02-16 15:43:40, Ross Zwisler wrote: > >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > >> > > (ext2, ext4 & xfs) were 
centralized in filemap_write_and_wait_range(). > >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to > >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > >> > > block devices and for XFS real-time files. > >> > > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem > >> > > ->writepages function so that it can supply us with a valid block > >> > > device. This also fixes DAX code to properly flush caches in response to > >> > > sync(2). > >> > > > >> > > Signed-off-by: Ross Zwisler > >> > > Signed-off-by: Jan Kara > >> > > --- > >> > > fs/block_dev.c | 16 +++++++++++++++- > >> > > fs/dax.c | 13 ++++++++----- > >> > > fs/ext2/inode.c | 11 +++++++++++ > >> > > fs/ext4/inode.c | 7 +++++++ > >> > > fs/xfs/xfs_aops.c | 9 +++++++++ > >> > > include/linux/dax.h | 6 ++++-- > >> > > mm/filemap.c | 12 ++++-------- > >> > > 7 files changed, 58 insertions(+), 16 deletions(-) > >> > > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c > >> > > index 39b3a17..fc01e43 100644 > >> > > --- a/fs/block_dev.c > >> > > +++ b/fs/block_dev.c > >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > >> > > return try_to_free_buffers(page); > >> > > } > >> > > > >> > > +static int blkdev_writepages(struct address_space *mapping, > >> > > + struct writeback_control *wbc) > >> > > +{ > >> > > + if (dax_mapping(mapping)) { > >> > > + struct block_device *bdev = I_BDEV(mapping->host); > >> > > + int error; > >> > > + > >> > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > >> > > + if (error) > >> > > + return error; > >> > > + } > >> > > + return generic_writepages(mapping, wbc); > >> > > +} > >> > > >> > Can you remind of the reason for calling generic_writepages() on DAX > >> > enabled address spaces? > >> > >> Sure. 
The initial version of this patch didn't do this, and during testing I
> >> hit a bunch of xfstests failures. In ext2 at least I believe these were
> >> happening because we were skipping the call into generic_writepages() for DAX
> >> inodes. Without a lot of data to back this up, my guess is that this is due
> >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
> >> returns true), but having dirty page cache pages that need to be written back
> >> as part of the writeback.
> >>
> >> Changing this so we always call generic_writepages() even in the DAX case
> >> solved the xfstest failures.
> >>
> >> If this sounds incorrect, please let me know and I'll go and gather more data.
> >
> > So I think a more correct fix is to not set S_DAX for inodes that will have
> > any pagecache pages - e.g. don't set S_DAX for block device inodes when
> > filesystem is mounted on it (probably the easiest is to just refuse to
> > mount filesystem on block device which has S_DAX set).
>
> I think we have a wider problem here. See __blkdev_get, we set S_DAX
> on all block devices that have ->direct_access() and have a
> page-aligned starting address.

That's seeming like a premature optimisation to me now. I didn't say
anything at the time because I was busy with other things and it
didn't affect XFS.

> It seems to me we need to modify the
> metadata i/o paths to bypass the page cache,

XFS doesn't use the block device page cache for its metadata - it
has its own internal metadata cache structures and uses get_pages
or heap memory to back its metadata. But that doesn't make mixing
DAX and pages in the block device mapping tree sane.

What you are missing here is that the underlying architecture of
journalling filesystems means they can't use DAX for their metadata.
Modifications have to be buffered, because they have to be written
to the journal first before they are written back in place.
IOWs, we need to buffer changes in volatile memory for some time, and
that means we can't use DAX during transactional modifications.

And to put the final nail in that coffin, metadata in XFS can be
discontiguous multi-block objects - in those situations we vmap the
underlying pages so they appear to the code to be a contiguous
buffer, and that's something we can't do with DAX....

> or teach the fsync code
> how to flush populated data pages out of the radix.

That doesn't solve the problem. Filesystems free and reallocate
filesystem blocks without intermediate block device mapping
invalidation calls, so what is one minute a data block accessed by
DAX may become a metadata block that is accessed via buffered IO. It
all goes to crap very quickly....

However, I'd say fsync is not the place to address this. This block
device cache aliasing issue is supposed to be what
unmap_underlying_metadata() solves, right?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751484AbcBKUvU (ORCPT ); Thu, 11 Feb 2016 15:51:20 -0500
Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:2090 "EHLO ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751104AbcBKUvR (ORCPT ); Thu, 11 Feb 2016 15:51:17 -0500
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: A2AMDwDw87xWPBATLHleDhoBAoMPgT+GYoF5nWABAQaLaziFDYQIhgcEAgKBNE0BAQEBAQEHAQEBAUE/hEIBAQQnExwzCAMYCSUPBSUDBxoBEogZwUQqGIUyhH6IbAEElneNSo5+jj6EDE8oLohTAQEB
Date: Fri, 12 Feb 2016 07:50:49 +1100
From: Dave Chinner
To: Ross Zwisler , Jan Kara , linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com
Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move
flushing calls to FS Message-ID: <20160211205049.GJ19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <20160211124304.GI21760@quack.suse.cz> <20160211194922.GA5260@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160211194922.GA5260@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 11, 2016 at 12:49:22PM -0700, Ross Zwisler wrote: > I think the plan of unsetting S_DAX on bdev->bd_inode when we mount will save > us from this, as long as we do it super early in the mount process. I think that S_DAX should not be set on the block device by default in the first place. If we've been surprised by unexpected behaviour, then I'm sure there are going to be other surprises waiting for us. DAX default policy should be opt-in, not opt-out. Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751369AbcBKU6n (ORCPT ); Thu, 11 Feb 2016 15:58:43 -0500 Received: from mail-yw0-f181.google.com ([209.85.161.181]:34902 "EHLO mail-yw0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751067AbcBKU6j (ORCPT ); Thu, 11 Feb 2016 15:58:39 -0500 MIME-Version: 1.0 In-Reply-To: <20160211204635.GI19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> <20160211204635.GI19486@dastard> Date: Thu, 11 Feb 2016 12:58:38 -0800 Message-ID: Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems From: Dan Williams To: Dave Chinner Cc: Jan Kara , Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore 
Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner wrote: [..] >> It seems to me we need to modify the >> metadata i/o paths to bypass the page cache, > > XFS doesn't use the block device page cache for it's metadata - it > has it's own internal metadata cache structures and uses get_pages > or heap memory to back it's metadata. But that doesn't make mixing > DAX and pages in the block device mapping tree sane. > > What you are missing here is that the underlying architecture of > journalling filesystems mean they can't use DAX for their metadata. > Modifications have to be buffered, because they have to be written > to the journal first before they are written back in place. IOWs, we > need to buffer changes in volatile memory for some time, and that > means we can't use DAX during transactional modifications. > > And to put the final nail in that coffin, metadata in XFS can be > discontiguous multi-block objects - in those situations we vmap the > underlying pages so they appear to the code to be a contiguous > buffer, and that's something we can't do with DAX.... Sorry, I wasn't clear when I said "bypass page cache" I meant a solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition table reads". However, I suspect that is broken if the filesystem is not ready to see a new page allocated for every I/O. I assume one thread will want to insert a page in the radix for another thread to find/manipulate before metadata gets written back to storage. >> or teach the fsync code >> how to flush populated data pages out of the radix. > > That doesn't solve the problem. 
Filesystems free and reallocate > filesystem blocks without intermediate block device mapping > invalidation calls, so what is one minute a data block accessed by > DAX may become a metadata block that accessed via buffered IO. It > all goes to crap very quickly.... > > However, I'd say fsync is not the place to address this. This block > device cache aliasing issue is supposed to be what > unmap_underlying_metadata() solves, right? I'll take a look at this. Right now I'm trying to implement the "clear block-device-inode S_DAX on fs mount" approach. My concern though is that we need to disable block device mmap while a filesystem is mounted... Maybe I don't need to worry because it's already the case that a mmap of the raw device may not see the most up to date data for a file that has dirty fs-page-cache data. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751166AbcBKWrT (ORCPT ); Thu, 11 Feb 2016 17:47:19 -0500 Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:29493 "EHLO ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751016AbcBKWrL (ORCPT ); Thu, 11 Feb 2016 17:47:11 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2D8CgDlDr1WPBATLHleKAECgw+BP4ZigXmdYgEBBotrhUWECIYHAgIBAQKBNE0BAQEBAQEHAQEBAUE/hEEBAQEDATocEQsHBQsIAxIGCSUPBSUDBwYUExuHdwfBVgEBAQcCAR0YhTKEfoQchFAFlneNSoFmh2mFL44+gmQZgV4oLocbgTgBAQE Date: Fri, 12 Feb 2016 09:46:16 +1100 From: Dave Chinner To: Dan Williams Cc: Jan Kara , Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160211224616.GL19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> 
<1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> <20160211204635.GI19486@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote: > On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner wrote: > [..] > >> It seems to me we need to modify the > >> metadata i/o paths to bypass the page cache, > > > > XFS doesn't use the block device page cache for it's metadata - it > > has it's own internal metadata cache structures and uses get_pages > > or heap memory to back it's metadata. But that doesn't make mixing > > DAX and pages in the block device mapping tree sane. > > > > What you are missing here is that the underlying architecture of > > journalling filesystems mean they can't use DAX for their metadata. > > Modifications have to be buffered, because they have to be written > > to the journal first before they are written back in place. IOWs, we > > need to buffer changes in volatile memory for some time, and that > > means we can't use DAX during transactional modifications. > > > > And to put the final nail in that coffin, metadata in XFS can be > > discontiguous multi-block objects - in those situations we vmap the > > underlying pages so they appear to the code to be a contiguous > > buffer, and that's something we can't do with DAX.... > > Sorry, I wasn't clear when I said "bypass page cache" I meant a > solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition > table reads". So there's already bandaids to prevent bad shit from happening in the block layer, let alone when we consider all the ways that userspace can screw this all up. 
> However, I suspect that is broken if the filesystem is not ready
> to see a new page allocated for every I/O. I assume one
> thread will want to insert a page in the radix for another thread
> to find/manipulate before metadata gets written back to storage.

Right, you can't do that, especially as the struct page has a 1-1
relationship with the bufferhead that is attached to it as the
bufferhead carries the filesystem state for the given cached page.

> >> or teach the fsync code how to flush populated data pages out
> >> of the radix.
> >
> > That doesn't solve the problem. Filesystems free and reallocate
> > filesystem blocks without intermediate block device mapping
> > invalidation calls, so what is one minute a data block accessed
> > by DAX may become a metadata block that is accessed via buffered
> > IO. It all goes to crap very quickly....
> >
> > However, I'd say fsync is not the place to address this. This
> > block device cache aliasing issue is supposed to be what
> > unmap_underlying_metadata() solves, right?
>
> I'll take a look at this. Right now I'm trying to implement the
> "clear block-device-inode S_DAX on fs mount" approach. My concern
> though is that we need to disable block device mmap while a
> filesystem is mounted...

/me chokes on his coffee.

When did mmapping the block device behind the back of a mounted
filesystem become a valid use case? It's not supported for normal
block devices and for the same reasons it won't be supported for DAX
enabled block devices, either. i.e. I'm going to tell anyone who has
an application that does this to go and take a hike when (not if!)
they report filesystem corruption problems.

> Maybe I don't need to worry because it's already the case that a
> mmap of the raw device may not see the most up to date data for a
> file that has dirty fs-page-cache data.

It goes both ways.
What happens if mkfs or fsck modifies the block device via mmap+DAX and then the filesystem mounts the block device and tries to read that metadata via the block device page cache? Quite frankly, DAX on the block device is a can of worms we really don't need to deal with right now. IMO it's a solution looking for a problem to solve, the "default to on" policy is wrong (DAX is opt-in, not opt-out) and given this we should turn it off until we've solved the more important problems we need to solve. i.e. We need to concentrate on getting data integrity working correctly first, then address the cache aliasing issues, then address the "safe access" issues, and then we can re-introduce block device DAX access... Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751489AbcBKW7T (ORCPT ); Thu, 11 Feb 2016 17:59:19 -0500 Received: from mail-yk0-f175.google.com ([209.85.160.175]:35265 "EHLO mail-yk0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751417AbcBKW7P (ORCPT ); Thu, 11 Feb 2016 17:59:15 -0500 MIME-Version: 1.0 In-Reply-To: <20160211224616.GL19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> <20160211204635.GI19486@dastard> <20160211224616.GL19486@dastard> Date: Thu, 11 Feb 2016 14:59:14 -0800 Message-ID: Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems From: Dan Williams To: Dave Chinner Cc: Jan Kara , Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers Content-Type: text/plain; charset=UTF-8 
Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner wrote: > On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote: >> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner wrote: >> [..] >> >> It seems to me we need to modify the >> >> metadata i/o paths to bypass the page cache, >> > >> > XFS doesn't use the block device page cache for it's metadata - it >> > has it's own internal metadata cache structures and uses get_pages >> > or heap memory to back it's metadata. But that doesn't make mixing >> > DAX and pages in the block device mapping tree sane. >> > >> > What you are missing here is that the underlying architecture of >> > journalling filesystems mean they can't use DAX for their metadata. >> > Modifications have to be buffered, because they have to be written >> > to the journal first before they are written back in place. IOWs, we >> > need to buffer changes in volatile memory for some time, and that >> > means we can't use DAX during transactional modifications. >> > >> > And to put the final nail in that coffin, metadata in XFS can be >> > discontiguous multi-block objects - in those situations we vmap the >> > underlying pages so they appear to the code to be a contiguous >> > buffer, and that's something we can't do with DAX.... >> >> Sorry, I wasn't clear when I said "bypass page cache" I meant a >> solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition >> table reads". > > So there's already bandaids to prevent bad shit from happening in > the block layer, let alone when we consider all the ways that > userspace can screw this all up. > >> However, I suspect that is broken if the filesystem is not ready >> to see a new page allocated for every I/O. I assume one >> thread will want to insert a page in the radix for another thread >> to find/manipulate before metadata gets written back to storage. 
>
> Right, you can't do that, especially as the struct page has a 1-1
> relationship with the bufferhead that is attached to it as the
> bufferhead carries the filesystem state for the given cached page.
>
>> >> or teach the fsync code how to flush populated data pages out
>> >> of the radix.
>> >
>> > That doesn't solve the problem. Filesystems free and reallocate
>> > filesystem blocks without intermediate block device mapping
>> > invalidation calls, so what is one minute a data block accessed
>> > by DAX may become a metadata block that is accessed via buffered
>> > IO. It all goes to crap very quickly....
>> >
>> > However, I'd say fsync is not the place to address this. This
>> > block device cache aliasing issue is supposed to be what
>> > unmap_underlying_metadata() solves, right?
>>
>> I'll take a look at this. Right now I'm trying to implement the
>> "clear block-device-inode S_DAX on fs mount" approach. My concern
>> though is that we need to disable block device mmap while a
>> filesystem is mounted...
>
> /me chokes on his coffee.
>
> When did mmapping the block device behind the back of a mounted
> filesystem become a valid use case? It's not supported for normal
> block devices and for the same reasons it won't be supported for DAX
> enabled block devices, either. i.e. I'm going to tell anyone who has
> an application that does this to go and take a hike when (not if!)
> they report filesystem corruption problems.

Right, but we need to not confuse the fsync code regardless of how bad
an idea this is :-).

>> Maybe I don't need to worry because it's already the case that a
>> mmap of the raw device may not see the most up to date data for a
>> file that has dirty fs-page-cache data.
>
> It goes both ways. What happens if mkfs or fsck modifies the
> block device via mmap+DAX and then the filesystem mounts the block
> device and tries to read that metadata via the block device page
> cache?
> > Quite frankly, DAX on the block device is a can of worms we really > don't need to deal with right now. IMO it's a solution looking for a > problem to solve, Virtualization use cases want to give large ranges to guest-VMs, and it is currently the only way to reliably get 1GiB mappings. > the "default to on" policy is wrong (DAX is > opt-in, not opt-out) and given this we should turn it off until > we've solved the more important problems we need to solve. i.e. We > need to concentrate on getting data integrity working correctly > first, then address the cache aliasing issues, then address the > "safe access" issues, and then we can re-introduce block device DAX > access... Agreed. Note that the "default-on policy" came from commit bbab37ddc20b "block: Add support for DAX reads/writes to block devices" way back in 4.2. We're just now noticing. Credit Ross for good sanity checking. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752093AbcBKXo3 (ORCPT ); Thu, 11 Feb 2016 18:44:29 -0500 Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:43723 "EHLO ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752011AbcBKXoU (ORCPT ); Thu, 11 Feb 2016 18:44:20 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2D6CgDnG71WPBATLHleKAECgw+BP4ZigXmdYwEBBotrhUWECIYHAgIBAQKBNU0BAQEBAQEHAQEBAUE/hEIBAQQ6HCMQCAMYCSUPBSUDBxoTG4d+wUcBKRiFMoR+iGwFlneNSoFmh2mFL44+hFsoLohTAQEB Date: Fri, 12 Feb 2016 10:44:15 +1100 From: Dave Chinner To: Dan Williams Cc: Jan Kara , Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160211234415.GM19486@dastard> References: 
<1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> <20160211204635.GI19486@dastard> <20160211224616.GL19486@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 11, 2016 at 02:59:14PM -0800, Dan Williams wrote: > On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner wrote: > > On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote: > >> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner wrote: > >> Maybe I don't need to worry because it's already the case that a > >> mmap of the raw device may not see the most up to date data for a > >> file that has dirty fs-page-cache data. > > > > It goes both ways. What happens if mkfs or fsck modifies the > > block device via mmap+DAX and then the filesystem mounts the block > > device and tries to read that metadata via the block device page > > cache? > > > > Quite frankly, DAX on the block device is a can of worms we really > > don't need to deal with right now. IMO it's a solution looking for a > > problem to solve, > > Virtualization use cases want to give large ranges to guest-VMs, and > it is currently the only way to reliably get 1GiB mappings. Precisely my point - block devices are not the best way to solve this problem. A file, on XFS, with a 1GB extent size hint and preallocated to be aligned to 1GB addresses (i.e. mkfs.xfs -d su=1G,sw=1 on the host filesystem) will give reliable 1GB aligned blocks for DAX mappings, just like a block device will. Performance wise it's little different to using the block device directly.
Management wise it's way more flexible, especially as such image files can be recycled for new VMs almost instantly via FALLOC_FL_ZERO_RANGE. Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751205AbcBLTDg (ORCPT ); Fri, 12 Feb 2016 14:03:36 -0500 Received: from mga01.intel.com ([192.55.52.88]:22035 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750707AbcBLTDd (ORCPT ); Fri, 12 Feb 2016 14:03:33 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,436,1449561600"; d="scan'208";a="745068115" Date: Fri, 12 Feb 2016 12:03:20 -0700 From: Ross Zwisler To: Jan Kara Cc: Ross Zwisler , linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Message-ID: <20160212190320.GA24857@linux.intel.com> Mail-Followup-To: Ross Zwisler , Jan Kara , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <20160211124304.GI21760@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160211124304.GI21760@quack.suse.cz> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote: > On Wed 10-02-16 13:48:54, Ross Zwisler wrote: > > 3) In
filemap_write_and_wait() and filemap_write_and_wait_range(), continue > > the writeback in the case that DAX is enabled but we only have a nonzero > > mapping->nrpages. As with 1) and 2), I believe this is necessary to > > properly writeback metadata changes. If this sounds wrong, please let me > > know and I'll get more info. > > And I'm surprised here as well. If there are dax_mapping() inodes that have > pagecache pages, then we have issues with radix tree handling as well. So > how come dax_mapping() inodes have pages attached? If it is about block > device inodes, then I find it buggy, that S_DAX gets set for such inodes > when filesystem is mounted on them because in such cases we are IMO asking > for data corruption sooner rather than later... I think I've figured this one out, at least partially. For ext2 the issues I was seeing were due to the fact that directory inodes have S_DAX set, but have dirty page cache pages. In testing with generic/002, I see two ext2 inodes with S_DAX trying to do a writeback while they have dirty page cache pages. The first has i_ino=2, which is the EXT2_ROOT_INO. The second inode changes from run to run, but for my last run was 155649. The test failed because that directory inode was found to be corrupt by fsck.ext2: *** fsck.ext2 output *** fsck from util-linux 2.26.2 e2fsck 1.42.12 (29-Aug-2014) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Directory inode 155649, block #0, offset 0: directory corrupted If I change the code in ext2_writepages() so that it does the mpage_writepages() even for DAX inodes, all my xfstests pass. I'm not sure this is the right fix, though - should it instead be that ext2 directory inodes don't have S_DAX set? A similar problem occurs with ext4, though I haven't yet tracked it down to an inode type. It could be that ext4 directory inodes have the same issue, and Eric Sandeen suggested we might also have an issue with XATTRS attached to inodes. 
As with ext2, if I allow the normal writeback to occur in ext4_writepages() even for DAX inodes, the issues go away, but I'm not sure whether or not this is the correct fix. As far as I can see, XFS does not have these issues - returning immediately having done just the DAX writeback in xfs_vm_writepages() lets all my xfstests pass. For v4.5 should I send out an updated version of this series that does the regular page writeback for ext2 & ext4, or should we work to clear S_DAX for regular filesystem inodes that have dirty page cache data? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751175AbcBMCjV (ORCPT ); Fri, 12 Feb 2016 21:39:21 -0500 Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:35011 "EHLO ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750853AbcBMCjT (ORCPT ); Fri, 12 Feb 2016 21:39:19 -0500 Date: Sat, 13 Feb 2016 13:38:49 +1100 From: Dave Chinner To: Ross Zwisler , Jan Kara , linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Message-ID: <20160213023849.GD14668@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <20160211124304.GI21760@quack.suse.cz> <20160212190320.GA24857@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160212190320.GA24857@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Feb 12, 2016 at 12:03:20PM -0700, Ross Zwisler wrote: > On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote: > > On Wed 10-02-16 
13:48:54, Ross Zwisler wrote: > > > 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue > > > the writeback in the case that DAX is enabled but we only have a nonzero > > > mapping->nrpages. As with 1) and 2), I believe this is necessary to > > > properly writeback metadata changes. If this sounds wrong, please let me > > > know and I'll get more info. > > > > And I'm surprised here as well. If there are dax_mapping() inodes that have > > pagecache pages, then we have issues with radix tree handling as well. So > > how come dax_mapping() inodes have pages attached? If it is about block > > device inodes, then I find it buggy, that S_DAX gets set for such inodes > > when filesystem is mounted on them because in such cases we are IMO asking > > for data corruption sooner rather than later... > > I think I've figured this one out, at least partially. > > For ext2 the issues I was seeing were due to the fact that directory inodes > have S_DAX set, but have dirty page cache pages. In testing with > generic/002, I see two ext2 inodes with S_DAX trying to do a writeback while > they have dirty page cache pages. The first has i_ino=2, which is the > EXT2_ROOT_INO. .... > As far as I can see, XFS does not have these issues - returning immediately > having done just the DAX writeback in xfs_vm_writepages() lets all my xfstests > pass. XFS will not have issues because it does not dirty directory inodes at the VFS level, nor does it use the page cache for directory data. However, looking at the code I think it does still set S_DAX on directory inodes, which it shouldn't be doing. I've got a couple of fixes I need to do in this area - hopefully I'll get it done on Monday. Cheers, Dave. 
-- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751164AbcBME72 (ORCPT ); Fri, 12 Feb 2016 23:59:28 -0500 Received: from mga02.intel.com ([134.134.136.20]:34650 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750917AbcBME7Z (ORCPT ); Fri, 12 Feb 2016 23:59:25 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.22,438,1449561600"; d="scan'208";a="914314314" Date: Fri, 12 Feb 2016 21:59:12 -0700 From: Ross Zwisler To: Dave Chinner Cc: Ross Zwisler , Jan Kara , linux-kernel@vger.kernel.org, "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, xfs@oss.sgi.com Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Message-ID: <20160213045912.GA22595@linux.intel.com> Mail-Followup-To: Ross Zwisler , Dave Chinner , Jan Kara , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <20160211124304.GI21760@quack.suse.cz> <20160212190320.GA24857@linux.intel.com> <20160213023849.GD14668@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160213023849.GD14668@dastard> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Feb 13, 2016 at 01:38:49PM +1100, Dave Chinner wrote: > On Fri, Feb 12, 2016 at 12:03:20PM -0700, Ross Zwisler wrote: > > On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote: > > > On Wed 10-02-16 13:48:54, Ross 
Zwisler wrote: > > > > 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue > > > > the writeback in the case that DAX is enabled but we only have a nonzero > > > > mapping->nrpages. As with 1) and 2), I believe this is necessary to > > > > properly writeback metadata changes. If this sounds wrong, please let me > > > > know and I'll get more info. > > > > > > And I'm surprised here as well. If there are dax_mapping() inodes that have > > > pagecache pages, then we have issues with radix tree handling as well. So > > > how come dax_mapping() inodes have pages attached? If it is about block > > > device inodes, then I find it buggy, that S_DAX gets set for such inodes > > > when filesystem is mounted on them because in such cases we are IMO asking > > > for data corruption sooner rather than later... > > > > I think I've figured this one out, at least partially. > > > > For ext2 the issues I was seeing were due to the fact that directory inodes > > have S_DAX set, but have dirty page cache pages. In testing with > > generic/002, I see two ext2 inodes with S_DAX trying to do a writeback while > > they have dirty page cache pages. The first has i_ino=2, which is the > > EXT2_ROOT_INO. > .... > > As far as I can see, XFS does not have these issues - returning immediately > > having done just the DAX writeback in xfs_vm_writepages() lets all my xfstests > > pass. > > XFS will not have issues because it does not dirty directory inodes > at the VFS level, nor does it use the page cache for directory data. > However, looking at the code I think it does still set S_DAX on > directory inodes, which it shouldn't be doing. > > I've got a couple of fixes I need to do in this area - hopefully > I'll get it done on Monday. Cool. I've got a quick patch that stops S_DAX from being set on everything but regular inodes for ext2 and ext4. This solved a lot of my xfstests failures. 
Even after that I'm seeing two last failures with ext4 - I'll keep working on those. - Ross From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dan Williams Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Date: Thu, 11 Feb 2016 07:22:00 -0800 Message-ID: References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Matthew Wilcox , Ross Zwisler , linux-ext4 , Andrew Morton To: Jan Kara Return-path: In-Reply-To: <20160211125044.GJ21760@quack.suse.cz> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com List-Id: linux-ext4.vger.kernel.org On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara wrote: > On Wed 10-02-16 15:43:40, Ross Zwisler wrote: >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw >> > > block devices and for XFS real-time files. >> > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem >> > > ->writepages function so that it can supply us with a valid block >> > > device. 
This also fixes DAX code to properly flush caches in response to >> > > sync(2). >> > > >> > > Signed-off-by: Ross Zwisler >> > > Signed-off-by: Jan Kara >> > > --- >> > > fs/block_dev.c | 16 +++++++++++++++- >> > > fs/dax.c | 13 ++++++++----- >> > > fs/ext2/inode.c | 11 +++++++++++ >> > > fs/ext4/inode.c | 7 +++++++ >> > > fs/xfs/xfs_aops.c | 9 +++++++++ >> > > include/linux/dax.h | 6 ++++-- >> > > mm/filemap.c | 12 ++++-------- >> > > 7 files changed, 58 insertions(+), 16 deletions(-) >> > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c >> > > index 39b3a17..fc01e43 100644 >> > > --- a/fs/block_dev.c >> > > +++ b/fs/block_dev.c >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) >> > > return try_to_free_buffers(page); >> > > } >> > > >> > > +static int blkdev_writepages(struct address_space *mapping, >> > > + struct writeback_control *wbc) >> > > +{ >> > > + if (dax_mapping(mapping)) { >> > > + struct block_device *bdev = I_BDEV(mapping->host); >> > > + int error; >> > > + >> > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); >> > > + if (error) >> > > + return error; >> > > + } >> > > + return generic_writepages(mapping, wbc); >> > > +} >> > >> > Can you remind of the reason for calling generic_writepages() on DAX >> > enabled address spaces? >> >> Sure. The initial version of this patch didn't do this, and during testing I >> hit a bunch of xfstests failures. In ext2 at least I believe these were >> happening because we were skipping the call into generic_writepages() for DAX >> inodes. Without a lot of data to back this up, my guess is that this is due >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping) >> returns true), but having dirty page cache pages that need to be written back >> as part of the writeback. >> >> Changing this so we always call generic_writepages() even in the DAX case >> solved the xfstest failures. 
>> >> If this sounds incorrect, please let me know and I'll go and gather more data. > > So I think a more correct fix it to not set S_DAX for inodes that will have > any pagecache pages - e.g. don't set S_DAX for block device inodes when > filesystem is mounted on it (probably the easiest is to just refuse to > mount filesystem on block device which has S_DAX set). I think we have a wider problem here. See __blkdev_get, we set S_DAX on all block devices that have ->direct_access() and have a page-aligned starting address. It seems to me we need to modify the metadata i/o paths to bypass the page cache, or teach the fsync code how to flush populated data pages out of the radix. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Date: Thu, 11 Feb 2016 17:22:26 +0100 Message-ID: <20160211162226.GR21760@quack.suse.cz> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jan Kara , Ross Zwisler , Dave Chinner , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers To: Dan Williams Return-path: Content-Disposition: inline In-Reply-To: Sender: linux-fsdevel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org On Thu 11-02-16 07:22:00, Dan Williams wrote: > On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara wrote: > > On Wed 10-02-16 15:43:40, Ross Zwisler wrote: > >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > >> > On Wed, Feb 10, 2016 at
01:48:56PM -0700, Ross Zwisler wrote: > >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to > >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > >> > > block devices and for XFS real-time files. > >> > > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem > >> > > ->writepages function so that it can supply us with a valid block > >> > > device. This also fixes DAX code to properly flush caches in response to > >> > > sync(2). > >> > > > >> > > Signed-off-by: Ross Zwisler > >> > > Signed-off-by: Jan Kara > >> > > --- > >> > > fs/block_dev.c | 16 +++++++++++++++- > >> > > fs/dax.c | 13 ++++++++----- > >> > > fs/ext2/inode.c | 11 +++++++++++ > >> > > fs/ext4/inode.c | 7 +++++++ > >> > > fs/xfs/xfs_aops.c | 9 +++++++++ > >> > > include/linux/dax.h | 6 ++++-- > >> > > mm/filemap.c | 12 ++++-------- > >> > > 7 files changed, 58 insertions(+), 16 deletions(-) > >> > > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c > >> > > index 39b3a17..fc01e43 100644 > >> > > --- a/fs/block_dev.c > >> > > +++ b/fs/block_dev.c > >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > >> > > return try_to_free_buffers(page); > >> > > } > >> > > > >> > > +static int blkdev_writepages(struct address_space *mapping, > >> > > + struct writeback_control *wbc) > >> > > +{ > >> > > + if (dax_mapping(mapping)) { > >> > > + struct block_device *bdev = I_BDEV(mapping->host); > >> > > + int error; > >> > > + > >> > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > >> > > + if (error) > >> > > + return error; > >> > > + } > >> > > + return generic_writepages(mapping, wbc); > >> > > +} > >> > > >> > Can you remind of 
the reason for calling generic_writepages() on DAX > >> > enabled address spaces? > >> > >> Sure. The initial version of this patch didn't do this, and during testing I > >> hit a bunch of xfstests failures. In ext2 at least I believe these were > >> happening because we were skipping the call into generic_writepages() for DAX > >> inodes. Without a lot of data to back this up, my guess is that this is due > >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > >> returns true), but having dirty page cache pages that need to be written back > >> as part of the writeback. > >> > >> Changing this so we always call generic_writepages() even in the DAX case > >> solved the xfstest failures. > >> > >> If this sounds incorrect, please let me know and I'll go and gather more data. > > > > So I think a more correct fix it to not set S_DAX for inodes that will have > > any pagecache pages - e.g. don't set S_DAX for block device inodes when > > filesystem is mounted on it (probably the easiest is to just refuse to > > mount filesystem on block device which has S_DAX set). > > I think we have a wider problem here. See __blkdev_get, we set S_DAX > on all block devices that have ->direct_access() and have a > page-aligned starting address. It seems to me we need to modify the > metadata i/o paths to bypass the page cache Heh, no way to do that easily. All the journalling machinery depends on buffers and pages... >, or teach the fsync code > how to flush populated data pages out of the radix. This might be doable but it will be difficult to avoid aliasing issues and data corruption. And mainly I don't see the point: When you mount a filesystem on top of block device, you do not want to mess with the block device directly, even less using DAX. So we just have to find a way how to set S_DAX for normal open but clear it from fs path. At worst, we could clear S_DAX on the block device in mount_bdev() or something like that... 
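A rough sketch of that last option ("set S_DAX for normal open but clear it from fs path") might look like the following. This is a user-space model only, not a patch from this thread: `MODEL_S_DAX`, `struct model_bdev_inode`, `model_blkdev_open()` and `model_mount_bdev()` are hypothetical names, and the open-time condition simply mirrors the description of __blkdev_get quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_S_DAX 0x1u  /* hypothetical flag bit, not the kernel's S_DAX value */

/* Hypothetical model of the block device inode state under discussion. */
struct model_bdev_inode {
    unsigned int flags;
    bool has_direct_access;   /* driver provides ->direct_access() */
    bool start_page_aligned;  /* device starts on a page boundary */
};

/* Plain open of the raw device: enable DAX when the device supports it. */
static void model_blkdev_open(struct model_bdev_inode *inode)
{
    assert(inode != NULL);
    if (inode->has_direct_access && inode->start_page_aligned)
        inode->flags |= MODEL_S_DAX;
}

/* Mount path: clear the flag so the bdev inode only sees page cache I/O. */
static void model_mount_bdev(struct model_bdev_inode *inode)
{
    assert(inode != NULL);
    inode->flags &= ~MODEL_S_DAX;
}
```

In this model a DAX-capable device gets the flag on a raw open and loses it again as soon as a filesystem is mounted on it, so buffered metadata I/O never runs against a DAX-flagged block device inode.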
Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Date: Thu, 11 Feb 2016 12:49:22 -0700 Message-ID: <20160211194922.GA5260@linux.intel.com> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <20160211124304.GI21760@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Ross Zwisler , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com To: Jan Kara Return-path: Received: from mga11.intel.com ([192.55.52.93]:32491 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750911AbcBKTte (ORCPT ); Thu, 11 Feb 2016 14:49:34 -0500 Content-Disposition: inline In-Reply-To: <20160211124304.GI21760@quack.suse.cz> Sender: linux-ext4-owner@vger.kernel.org List-ID: On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote: > On Wed 10-02-16 13:48:54, Ross Zwisler wrote: > > During testing of raw block devices + DAX I noticed that the struct > > block_device that we were using for DAX operations was incorrect. For the > > fault handlers, etc. we can just get the correct bdev via get_block(), > > which is passed in as a function pointer, but for the *sync code and for > > sector zeroing we don't have access to get_block(). This is also an issue > > for XFS real-time devices, whenever we get those working. > > > > Patch one of this series fixes the DAX sector zeroing code by explicitly > > passing in a valid struct block_device. 
> > > > Patch two of this series fixes DAX *sync support by moving calls to > > dax_writeback_mapping_range() out of filemap_write_and_wait_range() and > > into the filesystem/block device ->writepages function so that it can > > supply us with a valid block device. This also fixes DAX code to properly > > flush caches in response to sync(2). > > > > Thanks to Jan Kara for his initial draft of patch 2: > > https://lkml.org/lkml/2016/2/9/485 > > > > Here are the changes that I've made to that patch: > > > > 1) For DAX mappings, only return after calling > > dax_writeback_mapping_range() if we encountered an error. In the non-error > > case we still need to write back normal pages, else we lose metadata > > updates. > > > > 2) In dax_writeback_mapping_range(), move the new check for > > if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL) > > above the i_blkbits check. In my testing I found cases where > > dax_writeback_mapping_range() was called for inodes with i_blkbits != > > PAGE_SHIFT - I'm assuming these are internal metadata inodes? They have no > > exceptional DAX entries to flush, so we have no work to do, but if we > > return error from the i_blkbits check we will fail the overall writeback > > operation. Please let me know if it seems wrong for us to be seeing inodes > > set to use DAX but with i_blkbits != PAGE_SHIFT and I'll get more info. > > So I'm wondering - how come S_DAX flag got set for inode where i_blkbis != > PAGE_SHIFT? That would seem to be a bug? I specifically ordered the checks > like this to catch such issues. I've isolated this one - this happens for all three filesystems (ext2, ext4 & XFS), and does indeed have to do with the fact that S_DAX is set for bdev->bd_inode. 
Here is one failure path: [ 102.866637] [] dump_stack+0x85/0xc2 [ 102.867101] [] dax_writeback_mapping_range+0x60/0xe0 [ 102.867738] [] blkdev_writepages+0x3f/0x50 [ 102.868272] [] do_writepages+0x21/0x30 [ 102.868784] [] __filemap_fdatawrite_range+0xc6/0x100 [ 102.869378] [] filemap_write_and_wait+0x4a/0xa0 [ 102.869933] [] set_blocksize+0x70/0xd0 [ 102.870424] [] sb_set_blocksize+0x1d/0x50 [ 102.870933] [] ext4_fill_super+0x75b/0x3360 [ 102.871487] [] ? vsnprintf+0x201/0x4c0 [ 102.872005] [] ? snprintf+0x49/0x60 [ 102.872499] [] mount_bdev+0x180/0x1b0 [ 102.872981] [] ? ext4_calculate_overhead+0x370/0x370 [ 102.873580] [] ext4_mount+0x15/0x20 [ 102.874042] [] mount_fs+0x38/0x170 [ 102.874524] [] vfs_kern_mount+0x6b/0x150 [ 102.875041] [] do_mount+0x24f/0xe90 [ 102.875508] [] ? mntput+0x24/0x40 [ 102.875958] [] ? __kmalloc_track_caller+0xea/0x240 [ 102.876542] [] ? copy_mount_options+0x2c/0x210 [ 102.877087] [] SyS_mount+0x95/0xe0 [ 102.877573] [] entry_SYSCALL_64_fastpath+0x12/0x76 In set_blocksize() we are actually updating bdev->bd_inode->i_blkbits to be 12, but before that happens we do a sync_blockdev() with i_blkbits at 10, which causes the failure. This can be reproduced easily just by mounting an ext2 or ext4 filesystem. I think the plan of unsetting S_DAX on bdev->bd_inode when we mount will save us from this, as long as we do it super early in the mount process. 
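The check ordering from item 2) of the cover letter can be modelled in user-space C as below. This is only a sketch under stated assumptions, not the kernel code: `struct model_mapping`, `model_dax_writeback()` and `MODEL_PAGE_SHIFT` are hypothetical stand-ins, the `-EIO` return value is assumed for illustration, and the real flush is reduced to a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12  /* assumes 4 KiB pages */

/* Hypothetical stand-in for the few inode/address_space fields involved. */
struct model_mapping {
    unsigned int  blkbits;        /* models inode->i_blkbits */
    unsigned long nrexceptional;  /* models mapping->nrexceptional (DAX entries) */
};

/*
 * Return early when there are no DAX entries or the writeback is not
 * WB_SYNC_ALL, and only then reject a block size that differs from the
 * page size.
 */
static int model_dax_writeback(const struct model_mapping *m, bool sync_all)
{
    assert(m != NULL);

    if (m->nrexceptional == 0 || !sync_all)
        return 0;                 /* nothing to flush */

    if (m->blkbits != MODEL_PAGE_SHIFT)
        return -EIO;              /* assumed error code for this sketch */

    /* A real implementation would flush the dirty DAX entries here. */
    return 0;
}
```

With `blkbits` at 10 and no exceptional entries, which is the state during the sync_blockdev() call in set_blocksize(), the model returns 0. With the two checks swapped, the same call would fail the writeback.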
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Chinner Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Date: Fri, 12 Feb 2016 07:46:35 +1100 Message-ID: <20160211204635.GI19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jan Kara , Ross Zwisler , "linux-kernel@vger.kernel.org" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers To: Dan Williams Return-path: Content-Disposition: inline In-Reply-To: Sender: linux-fsdevel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org On Thu, Feb 11, 2016 at 07:22:00AM -0800, Dan Williams wrote: > On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara wrote: > > On Wed 10-02-16 15:43:40, Ross Zwisler wrote: > >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to > >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > >> > > block devices and for XFS real-time files. > >> > > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem > >> > > ->writepages function so that it can supply us with a valid block > >> > > device. This also fixes DAX code to properly flush caches in response to > >> > > sync(2). 
> >> > > > >> > > Signed-off-by: Ross Zwisler > >> > > Signed-off-by: Jan Kara > >> > > --- > >> > > fs/block_dev.c | 16 +++++++++++++++- > >> > > fs/dax.c | 13 ++++++++----- > >> > > fs/ext2/inode.c | 11 +++++++++++ > >> > > fs/ext4/inode.c | 7 +++++++ > >> > > fs/xfs/xfs_aops.c | 9 +++++++++ > >> > > include/linux/dax.h | 6 ++++-- > >> > > mm/filemap.c | 12 ++++-------- > >> > > 7 files changed, 58 insertions(+), 16 deletions(-) > >> > > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c > >> > > index 39b3a17..fc01e43 100644 > >> > > --- a/fs/block_dev.c > >> > > +++ b/fs/block_dev.c > >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > >> > > return try_to_free_buffers(page); > >> > > } > >> > > > >> > > +static int blkdev_writepages(struct address_space *mapping, > >> > > + struct writeback_control *wbc) > >> > > +{ > >> > > + if (dax_mapping(mapping)) { > >> > > + struct block_device *bdev = I_BDEV(mapping->host); > >> > > + int error; > >> > > + > >> > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > >> > > + if (error) > >> > > + return error; > >> > > + } > >> > > + return generic_writepages(mapping, wbc); > >> > > +} > >> > > >> > Can you remind of the reason for calling generic_writepages() on DAX > >> > enabled address spaces? > >> > >> Sure. The initial version of this patch didn't do this, and during testing I > >> hit a bunch of xfstests failures. In ext2 at least I believe these were > >> happening because we were skipping the call into generic_writepages() for DAX > >> inodes. Without a lot of data to back this up, my guess is that this is due > >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > >> returns true), but having dirty page cache pages that need to be written back > >> as part of the writeback. > >> > >> Changing this so we always call generic_writepages() even in the DAX case > >> solved the xfstest failures. 
> >> > >> If this sounds incorrect, please let me know and I'll go and gather more data. > > > > So I think a more correct fix it to not set S_DAX for inodes that will have > > any pagecache pages - e.g. don't set S_DAX for block device inodes when > > filesystem is mounted on it (probably the easiest is to just refuse to > > mount filesystem on block device which has S_DAX set). > > I think we have a wider problem here. See __blkdev_get, we set S_DAX > on all block devices that have ->direct_access() and have a > page-aligned starting address. That's seeming like a premature optimisation to me now. I didn't say anything at the time because I was busy with other things and it didn't affect XFS. > It seems to me we need to modify the > metadata i/o paths to bypass the page cache, XFS doesn't use the block device page cache for it's metadata - it has it's own internal metadata cache structures and uses get_pages or heap memory to back it's metadata. But that doesn't make mixing DAX and pages in the block device mapping tree sane. What you are missing here is that the underlying architecture of journalling filesystems mean they can't use DAX for their metadata. Modifications have to be buffered, because they have to be written to the journal first before they are written back in place. IOWs, we need to buffer changes in volatile memory for some time, and that means we can't use DAX during transactional modifications. And to put the final nail in that coffin, metadata in XFS can be discontiguous multi-block objects - in those situations we vmap the underlying pages so they appear to the code to be a contiguous buffer, and that's something we can't do with DAX.... > or teach the fsync code > how to flush populated data pages out of the radix. That doesn't solve the problem. 
Filesystems free and reallocate filesystem blocks without intermediate block device mapping invalidation calls, so what is one minute a data block accessed by DAX may become a metadata block that accessed via buffered IO. It all goes to crap very quickly.... However, I'd say fsync is not the place to address this. This block device cache aliasing issue is supposed to be what unmap_underlying_metadata() solves, right? Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dan Williams Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Date: Thu, 11 Feb 2016 12:58:38 -0800 Message-ID: References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> <20160211204635.GI19486@dastard> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Cc: Jan Kara , Ross Zwisler , "linux-kernel@vger.kernel.org" , "Theodore Ts'o" , Alexander Viro , Andreas Dilger , Andrew Morton , Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , XFS Developers To: Dave Chinner Return-path: In-Reply-To: <20160211204635.GI19486@dastard> Sender: linux-fsdevel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner wrote: [..] >> It seems to me we need to modify the >> metadata i/o paths to bypass the page cache, > > XFS doesn't use the block device page cache for it's metadata - it > has it's own internal metadata cache structures and uses get_pages > or heap memory to back it's metadata. But that doesn't make mixing > DAX and pages in the block device mapping tree sane. > > What you are missing here is that the underlying architecture of > journalling filesystems mean they can't use DAX for their metadata. 
> Modifications have to be buffered, because they have to be written
> to the journal first before they are written back in place. IOWs, we
> need to buffer changes in volatile memory for some time, and that
> means we can't use DAX during transactional modifications.
>
> And to put the final nail in that coffin, metadata in XFS can be
> discontiguous multi-block objects - in those situations we vmap the
> underlying pages so they appear to the code to be a contiguous
> buffer, and that's something we can't do with DAX....

Sorry, I wasn't clear. When I said "bypass page cache" I meant a solution
similar to commit d1a5f2b4d8a1 "block: use DAX for partition table
reads". However, I suspect that is broken if the filesystem is not
ready to see a new page allocated for every I/O. I assume one thread
will want to insert a page in the radix for another thread to
find/manipulate before metadata gets written back to storage.

>> or teach the fsync code
>> how to flush populated data pages out of the radix.
>
> That doesn't solve the problem. Filesystems free and reallocate
> filesystem blocks without intermediate block device mapping
> invalidation calls, so what is one minute a data block accessed by
> DAX may become a metadata block that accessed via buffered IO. It
> all goes to crap very quickly....
>
> However, I'd say fsync is not the place to address this. This block
> device cache aliasing issue is supposed to be what
> unmap_underlying_metadata() solves, right?

I'll take a look at this. Right now I'm trying to implement the
"clear block-device-inode S_DAX on fs mount" approach. My concern
though is that we need to disable block device mmap while a filesystem
is mounted... Maybe I don't need to worry because it's already the
case that a mmap of the raw device may not see the most up-to-date
data for a file that has dirty fs-page-cache data.
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Zwisler Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Date: Fri, 12 Feb 2016 12:03:20 -0700 Message-ID: <20160212190320.GA24857@linux.intel.com> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <20160211124304.GI21760@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Ross Zwisler , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Dave Chinner , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20160211124304.GI21760@quack.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote: > On Wed 10-02-16 13:48:54, Ross Zwisler wrote: > > 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue > > the writeback in the case that DAX is enabled but we only have a nonzero > > mapping->nrpages. As with 1) and 2), I believe this is necessary to > > properly writeback metadata changes. If this sounds wrong, please let me > > know and I'll get more info. > > And I'm surprised here as well. If there are dax_mapping() inodes that have > pagecache pages, then we have issues with radix tree handling as well. So > how come dax_mapping() inodes have pages attached? If it is about block > device inodes, then I find it buggy, that S_DAX gets set for such inodes > when filesystem is mounted on them because in such cases we are IMO asking > for data corruption sooner rather than later... I think I've figured this one out, at least partially. For ext2 the issues I was seeing were due to the fact that directory inodes have S_DAX set, but have dirty page cache pages. 
In testing with generic/002, I see two ext2 inodes with S_DAX trying to do a
writeback while they have dirty page cache pages. The first has i_ino=2,
which is the EXT2_ROOT_INO. The second inode changes from run to run, but
for my last run was 155649.

The test failed because that directory inode was found to be corrupt by
fsck.ext2:

	*** fsck.ext2 output ***
	fsck from util-linux 2.26.2
	e2fsck 1.42.12 (29-Aug-2014)
	Pass 1: Checking inodes, blocks, and sizes
	Pass 2: Checking directory structure
	Directory inode 155649, block #0, offset 0: directory corrupted

If I change the code in ext2_writepages() so that it does the
mpage_writepages() even for DAX inodes, all my xfstests pass. I'm not sure
this is the right fix, though - should it instead be that ext2 directory
inodes don't have S_DAX set?

A similar problem occurs with ext4, though I haven't yet tracked it down to
an inode type. It could be that ext4 directory inodes have the same issue,
and Eric Sandeen suggested we might also have an issue with XATTRS attached
to inodes. As with ext2, if I allow the normal writeback to occur in
ext4_writepages() even for DAX inodes, the issues go away, but I'm not sure
whether or not this is the correct fix.

As far as I can see, XFS does not have these issues - returning immediately
having done just the DAX writeback in xfs_vm_writepages() lets all my
xfstests pass.

For v4.5 should I send out an updated version of this series that does the
regular page writeback for ext2 & ext4, or should we work to clear S_DAX for
regular filesystem inodes that have dirty page cache data?
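
One of the two options asked about above, keeping S_DAX off any inode that
still relies on the page cache, can be pictured with a small userspace
model. This is only a sketch and not kernel code: enum model_inode_kind and
model_inode_wants_dax() are hypothetical names made up for illustration, and
the "mounted with DAX" flag is an assumed stand-in for the mount option.

```c
/* Userspace model only; the enum and function below are hypothetical. */
#include <assert.h>
#include <stdbool.h>

enum model_inode_kind {
	MODEL_REGULAR,
	MODEL_DIRECTORY,
	MODEL_SYMLINK,
	MODEL_SPECIAL,
};

/*
 * Decide whether an inode should be flagged for DAX. Only regular files
 * qualify in this model, so a directory can never be flagged as DAX and
 * also hold dirty page cache pages.
 */
bool model_inode_wants_dax(enum model_inode_kind kind, bool mounted_with_dax)
{
	assert(kind <= MODEL_SPECIAL);
	return mounted_with_dax && kind == MODEL_REGULAR;
}
```

With a policy like this, the ->writepages path for a directory would never
see a DAX mapping, so the ordinary page cache writeback would not be
skipped for it.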
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 801DB7CB0 for ; Wed, 10 Feb 2016 14:49:17 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay1.corp.sgi.com (Postfix) with ESMTP id 7443D8F8033 for ; Wed, 10 Feb 2016 12:49:17 -0800 (PST) Received: from mga11.intel.com ([192.55.52.93]) by cuda.sgi.com with ESMTP id TXPVbDPvxGEv49Wk for ; Wed, 10 Feb 2016 12:49:16 -0800 (PST) From: Ross Zwisler Subject: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Date: Wed, 10 Feb 2016 13:48:54 -0700 Message-Id: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Theodore Ts'o , Andrew Morton , linux-nvdimm@lists.01.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams During testing of raw block devices + DAX I noticed that the struct block_device that we were using for DAX operations was incorrect. For the fault handlers, etc. we can just get the correct bdev via get_block(), which is passed in as a function pointer, but for the *sync code and for sector zeroing we don't have access to get_block(). This is also an issue for XFS real-time devices, whenever we get those working. Patch one of this series fixes the DAX sector zeroing code by explicitly passing in a valid struct block_device. 
Patch two of this series fixes DAX *sync support by moving calls to
dax_writeback_mapping_range() out of filemap_write_and_wait_range() and into
the filesystem/block device ->writepages function so that it can supply us
with a valid block device. This also fixes DAX code to properly flush caches
in response to sync(2).

Thanks to Jan Kara for his initial draft of patch 2:
https://lkml.org/lkml/2016/2/9/485

Here are the changes that I've made to that patch:

1) For DAX mappings, only return after calling
dax_writeback_mapping_range() if we encountered an error. In the non-error
case we still need to write back normal pages, else we lose metadata
updates.

2) In dax_writeback_mapping_range(), move the new check for
	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
above the i_blkbits check. In my testing I found cases where
dax_writeback_mapping_range() was called for inodes with i_blkbits !=
PAGE_SHIFT - I'm assuming these are internal metadata inodes? They have no
exceptional DAX entries to flush, so we have no work to do, but if we return
error from the i_blkbits check we will fail the overall writeback operation.
Please let me know if it seems wrong for us to be seeing inodes set to use
DAX but with i_blkbits != PAGE_SHIFT and I'll get more info.

3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue
the writeback in the case that DAX is enabled but we only have a nonzero
mapping->nrpages. As with 1) and 2), I believe this is necessary to
properly writeback metadata changes. If this sounds wrong, please let me
know and I'll get more info.
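
The ordering described in changes 1) and 2) can be modelled in a few lines
of userspace C. This is a sketch under stated assumptions, not the kernel
implementation: struct model_mapping, model_dax_writeback() and
model_writepages() are hypothetical stand-ins for struct address_space,
dax_writeback_mapping_range() and a filesystem ->writepages method, the
4 KiB page size is assumed, and the real flush and page writeback are
reduced to a preset error code and a flag.

```c
/* Userspace model only; every name below is hypothetical, not kernel API. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumes 4 KiB pages */

struct model_mapping {
	bool is_dax;			/* stands in for dax_mapping(mapping) */
	unsigned long nrexceptional;	/* DAX entries that may need flushing */
	unsigned int blkbits;		/* stands in for inode->i_blkbits */
	int flush_error;		/* result the simulated DAX flush reports */
	bool pages_written;		/* set once page cache writeback has run */
};

int model_dax_writeback(const struct model_mapping *m, bool sync_all)
{
	assert(m != NULL);
	/* Change 2: nothing to flush, so succeed before the block size check. */
	if (!m->nrexceptional || !sync_all)
		return 0;
	if (m->blkbits != MODEL_PAGE_SHIFT)
		return -EIO;
	return m->flush_error;
}

int model_writepages(struct model_mapping *m, bool sync_all)
{
	assert(m != NULL);
	if (m->is_dax) {
		int error = model_dax_writeback(m, sync_all);

		/* Change 1: only an error stops us early. */
		if (error)
			return error;
	}
	/* Ordinary page cache writeback still runs for DAX mappings. */
	m->pages_written = true;
	return 0;
}
```

In this model a DAX-flagged mapping with no exceptional entries and a small
block size still gets its pages written, which is the metadata case the
list above is concerned with. The same mapping with exceptional entries
fails with -EIO before any page writeback happens.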
A working tree can be found here: https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_bdev_v2 Ross Zwisler (2): dax: supply DAX clearing code with correct bdev dax: move writeback calls into the filesystems fs/block_dev.c | 16 +++++++++++++++- fs/dax.c | 22 ++++++++++++---------- fs/ext2/inode.c | 17 +++++++++++++++-- fs/ext4/inode.c | 7 +++++++ fs/xfs/xfs_aops.c | 11 ++++++++++- fs/xfs/xfs_aops.h | 1 + fs/xfs/xfs_bmap_util.c | 3 ++- include/linux/dax.h | 8 +++++--- mm/filemap.c | 12 ++++-------- 9 files changed, 71 insertions(+), 26 deletions(-) -- 2.5.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 7FD9B7CB4 for ; Wed, 10 Feb 2016 14:49:19 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay3.corp.sgi.com (Postfix) with ESMTP id E6C7BAC008 for ; Wed, 10 Feb 2016 12:49:18 -0800 (PST) Received: from mga11.intel.com ([192.55.52.93]) by cuda.sgi.com with ESMTP id 8GEG3NQp3drXhPy5 for ; Wed, 10 Feb 2016 12:49:17 -0800 (PST) From: Ross Zwisler Subject: [PATCH v2 1/2] dax: supply DAX clearing code with correct bdev Date: Wed, 10 Feb 2016 13:48:55 -0700 Message-Id: <1455137336-28720-2-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Theodore Ts'o , Andrew Morton , linux-nvdimm@lists.01.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , 
Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams dax_clear_blocks() needs a valid struct block_device and previously it was using inode->i_sb->s_bdev in all cases. This is correct for normal inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time devices. Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change its arguments to take a bdev and a sector instead of an inode and a block. This better reflects what the function does, and it allows the filesystem and raw block device code to pass in an appropriate struct block_device. Signed-off-by: Ross Zwisler Suggested-by: Dan Williams --- fs/dax.c | 9 ++++----- fs/ext2/inode.c | 6 ++++-- fs/xfs/xfs_aops.c | 2 +- fs/xfs/xfs_aops.h | 1 + fs/xfs/xfs_bmap_util.c | 3 ++- include/linux/dax.h | 2 +- 6 files changed, 13 insertions(+), 10 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index fc2e314..9a173dd 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -79,15 +79,14 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n) } /* - * dax_clear_blocks() is called from within transaction context from XFS, + * dax_clear_sectors() is called from within transaction context from XFS, * and hence this means the stack from this point must follow GFP_NOFS * semantics for all operations. 
*/ -int dax_clear_blocks(struct inode *inode, sector_t block, long _size) +int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size) { - struct block_device *bdev = inode->i_sb->s_bdev; struct blk_dax_ctl dax = { - .sector = block << (inode->i_blkbits - 9), + .sector = _sector, .size = _size, }; @@ -109,7 +108,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long _size) wmb_pmem(); return 0; } -EXPORT_SYMBOL_GPL(dax_clear_blocks); +EXPORT_SYMBOL_GPL(dax_clear_sectors); /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */ static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 338eefd..b6b965b 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -737,8 +737,10 @@ static int ext2_get_blocks(struct inode *inode, * so that it's not found by another thread before it's * initialised */ - err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key), - 1 << inode->i_blkbits); + err = dax_clear_sectors(inode->i_sb->s_bdev, + le32_to_cpu(chain[depth-1].key) << + (inode->i_blkbits - 9), + 1 << inode->i_blkbits); if (err) { mutex_unlock(&ei->truncate_mutex); goto cleanup; diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 379c089..fc20518 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -55,7 +55,7 @@ xfs_count_page_state( } while ((bh = bh->b_this_page) != head); } -STATIC struct block_device * +struct block_device * xfs_find_bdev_for_inode( struct inode *inode) { diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h index f6ffc9a..a4343c6 100644 --- a/fs/xfs/xfs_aops.h +++ b/fs/xfs/xfs_aops.h @@ -62,5 +62,6 @@ int xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset, struct buffer_head *map_bh, int create); extern void xfs_count_page_state(struct page *, int *, int *); +extern struct block_device *xfs_find_bdev_for_inode(struct inode *); #endif /* __XFS_AOPS_H__ */ diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 
07ef29b..ae9d755 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -75,7 +75,8 @@ xfs_zero_extent( ssize_t size = XFS_FSB_TO_B(mp, count_fsb); if (IS_DAX(VFS_I(ip))) - return dax_clear_blocks(VFS_I(ip), block, size); + return dax_clear_sectors(xfs_find_bdev_for_inode(VFS_I(ip)), + sector, size); /* * let the block layer decide on the fastest method of diff --git a/include/linux/dax.h b/include/linux/dax.h index 818e450..7b6bced 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -7,7 +7,7 @@ ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t, get_block_t, dio_iodone_t, int flags); -int dax_clear_blocks(struct inode *, sector_t block, long size); +int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size); int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); int dax_truncate_page(struct inode *, loff_t from, get_block_t); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t, -- 2.5.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id DF0307CB8 for ; Wed, 10 Feb 2016 14:49:22 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id D0637304039 for ; Wed, 10 Feb 2016 12:49:19 -0800 (PST) Received: from mga11.intel.com ([192.55.52.93]) by cuda.sgi.com with ESMTP id rLl0668mEENoPR2r for ; Wed, 10 Feb 2016 12:49:18 -0800 (PST) From: Ross Zwisler Subject: [PATCH v2 2/2] dax: move writeback calls into the filesystems Date: Wed, 10 Feb 2016 13:48:56 -0700 Message-Id: <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> In-Reply-To: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> References: 
<1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: linux-kernel@vger.kernel.org Cc: Theodore Ts'o , Andrew Morton , linux-nvdimm@lists.01.org, Jan Kara , xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Dan Williams Previously calls to dax_writeback_mapping_range() for all DAX filesystems (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). dax_writeback_mapping_range() needs a struct block_device, and it used to get that from inode->i_sb->s_bdev. This is correct for normal inodes mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time files. Instead, call dax_writeback_mapping_range() directly from the filesystem ->writepages function so that it can supply us with a valid block device. This also fixes DAX code to properly flush caches in response to sync(2). 
Signed-off-by: Ross Zwisler Signed-off-by: Jan Kara --- fs/block_dev.c | 16 +++++++++++++++- fs/dax.c | 13 ++++++++----- fs/ext2/inode.c | 11 +++++++++++ fs/ext4/inode.c | 7 +++++++ fs/xfs/xfs_aops.c | 9 +++++++++ include/linux/dax.h | 6 ++++-- mm/filemap.c | 12 ++++-------- 7 files changed, 58 insertions(+), 16 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 39b3a17..fc01e43 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) return try_to_free_buffers(page); } +static int blkdev_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + if (dax_mapping(mapping)) { + struct block_device *bdev = I_BDEV(mapping->host); + int error; + + error = dax_writeback_mapping_range(mapping, bdev, wbc); + if (error) + return error; + } + return generic_writepages(mapping, wbc); +} + static const struct address_space_operations def_blk_aops = { .readpage = blkdev_readpage, .readpages = blkdev_readpages, .writepage = blkdev_writepage, .write_begin = blkdev_write_begin, .write_end = blkdev_write_end, - .writepages = generic_writepages, + .writepages = blkdev_writepages, .releasepage = blkdev_releasepage, .direct_IO = blkdev_direct_IO, .is_dirty_writeback = buffer_check_dirty_writeback, diff --git a/fs/dax.c b/fs/dax.c index 9a173dd..034dd02 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev, * end]. This is required by data integrity operations to ensure file data is * on persistent storage prior to completion of the operation. 
*/ -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end) +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, struct writeback_control *wbc) { struct inode *inode = mapping->host; - struct block_device *bdev = inode->i_sb->s_bdev; pgoff_t start_index, end_index, pmd_index; pgoff_t indices[PAGEVEC_SIZE]; struct pagevec pvec; @@ -496,11 +495,15 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, int i, ret = 0; void *entry; + + if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL) + return 0; + if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT)) return -EIO; - start_index = start >> PAGE_CACHE_SHIFT; - end_index = end >> PAGE_CACHE_SHIFT; + start_index = wbc->range_start >> PAGE_CACHE_SHIFT; + end_index = wbc->range_end >> PAGE_CACHE_SHIFT; pmd_index = DAX_PMD_INDEX(start_index); rcu_read_lock(); diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index b6b965b..7e44fc3 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -876,6 +876,17 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset) static int ext2_writepages(struct address_space *mapping, struct writeback_control *wbc) { +#ifdef CONFIG_FS_DAX + if (dax_mapping(mapping)) { + int error; + + error = dax_writeback_mapping_range(mapping, + mapping->host->i_sb->s_bdev, wbc); + if (error) + return error; + } +#endif + return mpage_writepages(mapping, wbc, ext2_get_block); } diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 83bc8bf..8c42020 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2450,6 +2450,13 @@ static int ext4_writepages(struct address_space *mapping, trace_ext4_writepages(inode, wbc); + if (dax_mapping(mapping)) { + ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, + wbc); + if (ret) + goto out_writepages; + } + /* * No pages to write? 
This is mainly a kludge to avoid starting * a transaction for special inodes like journal inode on last iput() diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index fc20518..1139ecd 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -1208,6 +1208,15 @@ xfs_vm_writepages( struct writeback_control *wbc) { xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED); + if (dax_mapping(mapping)) { + int error; + + error = dax_writeback_mapping_range(mapping, + xfs_find_bdev_for_inode(mapping->host), wbc); + if (error) + return error; + } + return generic_writepages(mapping, wbc); } diff --git a/include/linux/dax.h b/include/linux/dax.h index 7b6bced..636dd59 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -52,6 +52,8 @@ static inline bool dax_mapping(struct address_space *mapping) { return mapping->host && IS_DAX(mapping->host); } -int dax_writeback_mapping_range(struct address_space *mapping, loff_t start, - loff_t end); + +struct writeback_control; +int dax_writeback_mapping_range(struct address_space *mapping, + struct block_device *bdev, struct writeback_control *wbc); #endif diff --git a/mm/filemap.c b/mm/filemap.c index bc94386..a829779 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -446,7 +446,8 @@ int filemap_write_and_wait(struct address_space *mapping) { int err = 0; - if (mapping->nrpages) { + if (mapping->nrpages || + (dax_mapping(mapping) && mapping->nrexceptional)) { err = filemap_fdatawrite(mapping); /* * Even if the above returned error, the pages may be @@ -482,13 +483,8 @@ int filemap_write_and_wait_range(struct address_space *mapping, { int err = 0; - if (dax_mapping(mapping) && mapping->nrexceptional) { - err = dax_writeback_mapping_range(mapping, lstart, lend); - if (err) - return err; - } - - if (mapping->nrpages) { + if (mapping->nrpages || + (dax_mapping(mapping) && mapping->nrexceptional)) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); /* See comment of filemap_write_and_wait() */ -- 2.5.0 
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 812427CBE for ; Wed, 10 Feb 2016 16:03:23 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay2.corp.sgi.com (Postfix) with ESMTP id 52FA6304053 for ; Wed, 10 Feb 2016 14:03:20 -0800 (PST) Received: from ipmail06.adl2.internode.on.net (ipmail06.adl2.internode.on.net [150.101.137.129]) by cuda.sgi.com with ESMTP id wLuWHK0WJHUlJ8wM for ; Wed, 10 Feb 2016 14:03:14 -0800 (PST) Date: Thu, 11 Feb 2016 09:03:12 +1100 From: Dave Chinner Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160210220312.GP14668@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: Theodore Ts'o , linux-nvdimm@lists.01.org, Jan Kara , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Andrew Morton , linux-ext4@vger.kernel.org, Dan Williams On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > dax_writeback_mapping_range() needs a struct block_device, and it used to > get that from inode->i_sb->s_bdev. 
This is correct for normal inodes > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > block devices and for XFS real-time files. > > Instead, call dax_writeback_mapping_range() directly from the filesystem > ->writepages function so that it can supply us with a valid block > device. This also fixes DAX code to properly flush caches in response to > sync(2). > > Signed-off-by: Ross Zwisler > Signed-off-by: Jan Kara > --- > fs/block_dev.c | 16 +++++++++++++++- > fs/dax.c | 13 ++++++++----- > fs/ext2/inode.c | 11 +++++++++++ > fs/ext4/inode.c | 7 +++++++ > fs/xfs/xfs_aops.c | 9 +++++++++ > include/linux/dax.h | 6 ++++-- > mm/filemap.c | 12 ++++-------- > 7 files changed, 58 insertions(+), 16 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 39b3a17..fc01e43 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > return try_to_free_buffers(page); > } > > +static int blkdev_writepages(struct address_space *mapping, > + struct writeback_control *wbc) > +{ > + if (dax_mapping(mapping)) { > + struct block_device *bdev = I_BDEV(mapping->host); > + int error; > + > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > + if (error) > + return error; > + } > + return generic_writepages(mapping, wbc); > +} Can you remind of the reason for calling generic_writepages() on DAX enabled address spaces? Cheers, Dave. 
-- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 0D61729DF5 for ; Wed, 10 Feb 2016 16:43:57 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay1.corp.sgi.com (Postfix) with ESMTP id EE8C48F8064 for ; Wed, 10 Feb 2016 14:43:53 -0800 (PST) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by cuda.sgi.com with ESMTP id u04Ybtxq1t0oItc4 for ; Wed, 10 Feb 2016 14:43:52 -0800 (PST) Date: Wed, 10 Feb 2016 15:43:40 -0700 From: Ross Zwisler Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160210224340.GA30938@linux.intel.com> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20160210220312.GP14668@dastard> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: Theodore Ts'o , linux-nvdimm@lists.01.org, Jan Kara , Dan Williams , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Andrew Morton On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). 
> > dax_writeback_mapping_range() needs a struct block_device, and it used to
> > get that from inode->i_sb->s_bdev. This is correct for normal inodes
> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> > block devices and for XFS real-time files.
> >
> > Instead, call dax_writeback_mapping_range() directly from the filesystem
> > ->writepages function so that it can supply us with a valid block
> > device. This also fixes DAX code to properly flush caches in response to
> > sync(2).
> >
> > Signed-off-by: Ross Zwisler
> > Signed-off-by: Jan Kara
> > ---
> >  fs/block_dev.c      | 16 +++++++++++++++-
> >  fs/dax.c            | 13 ++++++++-----
> >  fs/ext2/inode.c     | 11 +++++++++++
> >  fs/ext4/inode.c     |  7 +++++++
> >  fs/xfs/xfs_aops.c   |  9 +++++++++
> >  include/linux/dax.h |  6 ++++--
> >  mm/filemap.c        | 12 ++++--------
> >  7 files changed, 58 insertions(+), 16 deletions(-)
> >
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 39b3a17..fc01e43 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
> >          return try_to_free_buffers(page);
> >  }
> >
> > +static int blkdev_writepages(struct address_space *mapping,
> > +                             struct writeback_control *wbc)
> > +{
> > +        if (dax_mapping(mapping)) {
> > +                struct block_device *bdev = I_BDEV(mapping->host);
> > +                int error;
> > +
> > +                error = dax_writeback_mapping_range(mapping, bdev, wbc);
> > +                if (error)
> > +                        return error;
> > +        }
> > +        return generic_writepages(mapping, wbc);
> > +}
>
> Can you remind me of the reason for calling generic_writepages() on DAX
> enabled address spaces?

Sure. The initial version of this patch didn't do this, and during testing I
hit a bunch of xfstests failures. In ext2 at least I believe these were
happening because we were skipping the call into generic_writepages() for DAX
inodes.
Without a lot of data to back this up, my guess is that this is due to metadata inodes or something being marked as DAX (so dax_mapping(mapping) returns true), but having dirty page cache pages that need to be written back as part of the writeback. Changing this so we always call generic_writepages() even in the DAX case solved the xfstest failures. If this sounds incorrect, please let me know and I'll go and gather more data. - Ross _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 1DA7B7CBC for ; Wed, 10 Feb 2016 17:44:08 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay2.corp.sgi.com (Postfix) with ESMTP id 0A357304051 for ; Wed, 10 Feb 2016 15:44:04 -0800 (PST) Received: from ipmail06.adl2.internode.on.net (ipmail06.adl2.internode.on.net [150.101.137.129]) by cuda.sgi.com with ESMTP id 5bcoaQAcGgF6rqCo for ; Wed, 10 Feb 2016 15:44:02 -0800 (PST) Date: Thu, 11 Feb 2016 10:44:00 +1100 From: Dave Chinner Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160210234400.GQ14668@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20160210224340.GA30938@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan 
Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com, Jan Kara On Wed, Feb 10, 2016 at 03:43:40PM -0700, Ross Zwisler wrote: > On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > > block devices and for XFS real-time files. > > > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem > > > ->writepages function so that it can supply us with a valid block > > > device. This also fixes DAX code to properly flush caches in response to > > > sync(2). 
> > > > > > Signed-off-by: Ross Zwisler > > > Signed-off-by: Jan Kara > > > --- > > > fs/block_dev.c | 16 +++++++++++++++- > > > fs/dax.c | 13 ++++++++----- > > > fs/ext2/inode.c | 11 +++++++++++ > > > fs/ext4/inode.c | 7 +++++++ > > > fs/xfs/xfs_aops.c | 9 +++++++++ > > > include/linux/dax.h | 6 ++++-- > > > mm/filemap.c | 12 ++++-------- > > > 7 files changed, 58 insertions(+), 16 deletions(-) > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > index 39b3a17..fc01e43 100644 > > > --- a/fs/block_dev.c > > > +++ b/fs/block_dev.c > > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > > > return try_to_free_buffers(page); > > > } > > > > > > +static int blkdev_writepages(struct address_space *mapping, > > > + struct writeback_control *wbc) > > > +{ > > > + if (dax_mapping(mapping)) { > > > + struct block_device *bdev = I_BDEV(mapping->host); > > > + int error; > > > + > > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > > > + if (error) > > > + return error; > > > + } > > > + return generic_writepages(mapping, wbc); > > > +} > > > > Can you remind of the reason for calling generic_writepages() on DAX > > enabled address spaces? > > Sure. The initial version of this patch didn't do this, and during testing I > hit a bunch of xfstests failures. In ext2 at least I believe these were > happening because we were skipping the call into generic_writepages() for DAX > inodes. Without a lot of data to back this up, my guess is that this is due > to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > returns true), but having dirty page cache pages that need to be written back > as part of the writeback. Hmmm - the ext2 filesystem metadata uses the block device page cache to buffer inode writeback, and so writeback doesn't occur until sync_blockdev() is called. 
But the data access should be through the ext2 inode address space,
not the block device address space, so DAX flushing occurs in
ext2_writepages.

So how is the block device inode being marked as a DAX inode? If it
is being marked as a DAX inode, how is this valid when the filesystem
metadata uses bufferheads and requires struct pages to be found in
the block device mapping tree? e.g. mkfs writes the metadata into the
bdev via DAX, resulting in a DAX exceptional entry in the bdev radix
tree, then __bread_gfp() comes along to read the same metadata after
mount and expects to find pages in the blockdev radix tree?

FWIW, this seems to be specifically a block device inode issue,
though, not something that affects regular files in a filesystem.
i.e. filesystem inodes can only be either DAX or non-DAX, and so
there is no mixed mode flushing required, right?

> Changing this so we always call generic_writepages() even in the
> DAX case solved the xfstest failures.
>
> If this sounds incorrect, please let me know and I'll go and
> gather more data.

It seems to me that there's a problem here with DAX on block device
inodes, but not for the filesystem mappings. At minimum, the block
device needs a bloody big comment explaining this landmine so people
don't forget why it is a special snowflake...

Cheers,

Dave.
-- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 186C57CA2 for ; Thu, 11 Feb 2016 06:42:53 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay2.corp.sgi.com (Postfix) with ESMTP id DFB31304043 for ; Thu, 11 Feb 2016 04:42:52 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) by cuda.sgi.com with ESMTP id FmQNBp2HikIcmb5N (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NO) for ; Thu, 11 Feb 2016 04:42:50 -0800 (PST) Date: Thu, 11 Feb 2016 13:43:04 +0100 From: Jan Kara Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Message-ID: <20160211124304.GI21760@quack.suse.cz> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: Theodore Ts'o , linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Andrew Morton , linux-ext4@vger.kernel.org, Dan Williams On Wed 10-02-16 13:48:54, Ross Zwisler wrote: > During testing of raw block devices + DAX I noticed that the struct > block_device that we were using for DAX operations was incorrect. For the > fault handlers, etc. 
we can just get the correct bdev via get_block(),
> which is passed in as a function pointer, but for the *sync code and for
> sector zeroing we don't have access to get_block(). This is also an issue
> for XFS real-time devices, whenever we get those working.
>
> Patch one of this series fixes the DAX sector zeroing code by explicitly
> passing in a valid struct block_device.
>
> Patch two of this series fixes DAX *sync support by moving calls to
> dax_writeback_mapping_range() out of filemap_write_and_wait_range() and
> into the filesystem/block device ->writepages function so that it can
> supply us with a valid block device. This also fixes DAX code to properly
> flush caches in response to sync(2).
>
> Thanks to Jan Kara for his initial draft of patch 2:
> https://lkml.org/lkml/2016/2/9/485
>
> Here are the changes that I've made to that patch:
>
> 1) For DAX mappings, only return after calling
> dax_writeback_mapping_range() if we encountered an error. In the non-error
> case we still need to write back normal pages, else we lose metadata
> updates.
>
> 2) In dax_writeback_mapping_range(), move the new check for
>         if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
> above the i_blkbits check. In my testing I found cases where
> dax_writeback_mapping_range() was called for inodes with i_blkbits !=
> PAGE_SHIFT - I'm assuming these are internal metadata inodes? They have no
> exceptional DAX entries to flush, so we have no work to do, but if we
> return error from the i_blkbits check we will fail the overall writeback
> operation. Please let me know if it seems wrong for us to be seeing inodes
> set to use DAX but with i_blkbits != PAGE_SHIFT and I'll get more info.

So I'm wondering - how come the S_DAX flag got set for an inode where
i_blkbits != PAGE_SHIFT? That would seem to be a bug? I specifically
ordered the checks like this to catch such issues.
> 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue > the writeback in the case that DAX is enabled but we only have a nonzero > mapping->nrpages. As with 1) and 2), I believe this is necessary to > properly writeback metadata changes. If this sounds wrong, please let me > know and I'll get more info. And I'm surprised here as well. If there are dax_mapping() inodes that have pagecache pages, then we have issues with radix tree handling as well. So how come dax_mapping() inodes have pages attached? If it is about block device inodes, then I find it buggy, that S_DAX gets set for such inodes when filesystem is mounted on them because in such cases we are IMO asking for data corruption sooner rather than later... Honza -- Jan Kara SUSE Labs, CR _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id EE1C87CA2 for ; Thu, 11 Feb 2016 06:50:36 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay1.corp.sgi.com (Postfix) with ESMTP id F020E8F804C for ; Thu, 11 Feb 2016 04:50:30 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) by cuda.sgi.com with ESMTP id 6rMwEWvyIOAwLSXj (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NO) for ; Thu, 11 Feb 2016 04:50:28 -0800 (PST) Date: Thu, 11 Feb 2016 13:50:44 +0100 From: Jan Kara Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160211125044.GJ21760@quack.suse.cz> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20160210224340.GA30938@linux.intel.com> List-Id: XFS 
Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler Cc: Theodore Ts'o , linux-nvdimm@lists.01.org, Jan Kara , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Andrew Morton , linux-ext4@vger.kernel.org, Dan Williams On Wed 10-02-16 15:43:40, Ross Zwisler wrote: > On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > > > dax_writeback_mapping_range() needs a struct block_device, and it used to > > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > > > block devices and for XFS real-time files. > > > > > > Instead, call dax_writeback_mapping_range() directly from the filesystem > > > ->writepages function so that it can supply us with a valid block > > > device. This also fixes DAX code to properly flush caches in response to > > > sync(2). 
> > > > > > Signed-off-by: Ross Zwisler > > > Signed-off-by: Jan Kara > > > --- > > > fs/block_dev.c | 16 +++++++++++++++- > > > fs/dax.c | 13 ++++++++----- > > > fs/ext2/inode.c | 11 +++++++++++ > > > fs/ext4/inode.c | 7 +++++++ > > > fs/xfs/xfs_aops.c | 9 +++++++++ > > > include/linux/dax.h | 6 ++++-- > > > mm/filemap.c | 12 ++++-------- > > > 7 files changed, 58 insertions(+), 16 deletions(-) > > > > > > diff --git a/fs/block_dev.c b/fs/block_dev.c > > > index 39b3a17..fc01e43 100644 > > > --- a/fs/block_dev.c > > > +++ b/fs/block_dev.c > > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > > > return try_to_free_buffers(page); > > > } > > > > > > +static int blkdev_writepages(struct address_space *mapping, > > > + struct writeback_control *wbc) > > > +{ > > > + if (dax_mapping(mapping)) { > > > + struct block_device *bdev = I_BDEV(mapping->host); > > > + int error; > > > + > > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > > > + if (error) > > > + return error; > > > + } > > > + return generic_writepages(mapping, wbc); > > > +} > > > > Can you remind of the reason for calling generic_writepages() on DAX > > enabled address spaces? > > Sure. The initial version of this patch didn't do this, and during testing I > hit a bunch of xfstests failures. In ext2 at least I believe these were > happening because we were skipping the call into generic_writepages() for DAX > inodes. Without a lot of data to back this up, my guess is that this is due > to metadata inodes or something being marked as DAX (so dax_mapping(mapping) > returns true), but having dirty page cache pages that need to be written back > as part of the writeback. > > Changing this so we always call generic_writepages() even in the DAX case > solved the xfstest failures. > > If this sounds incorrect, please let me know and I'll go and gather more data. 
So I think a more correct fix is to not set S_DAX for inodes that will have
any pagecache pages - e.g. don't set S_DAX for block device inodes when a
filesystem is mounted on them (probably the easiest is to just refuse to
mount a filesystem on a block device which has S_DAX set).

                                                                Honza
-- 
Jan Kara
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111])
        by oss.sgi.com (Postfix) with ESMTP id B9F0E7CA3
        for ; Thu, 11 Feb 2016 14:51:21 -0600 (CST)
Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25])
        by relay1.corp.sgi.com (Postfix) with ESMTP id 9BC568F8050
        for ; Thu, 11 Feb 2016 12:51:18 -0800 (PST)
Received: from ipmail06.adl2.internode.on.net (ipmail06.adl2.internode.on.net [150.101.137.129])
        by cuda.sgi.com with ESMTP id fK1SOtjezfBUBy7Y
        for ; Thu, 11 Feb 2016 12:51:16 -0800 (PST)
Date: Fri, 12 Feb 2016 07:50:49 +1100
From: Dave Chinner
Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
Message-ID: <20160211205049.GJ19486@dastard>
References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com>
        <20160211124304.GI21760@quack.suse.cz>
        <20160211194922.GA5260@linux.intel.com>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20160211194922.GA5260@linux.intel.com>
List-Id: XFS Filesystem from SGI
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: xfs-bounces@oss.sgi.com
Sender: xfs-bounces@oss.sgi.com
To: Ross Zwisler , Jan Kara , linux-kernel@vger.kernel.org, Theodore Ts'o ,
        Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams ,
        Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org,
        linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
        linux-nvdimm@lists.01.org, xfs@oss.sgi.com

On Thu, Feb 11, 2016 at 12:49:22PM
-0700, Ross Zwisler wrote: > I think the plan of unsetting S_DAX on bdev->bd_inode when we mount will save > us from this, as long as we do it super early in the mount process. I think that S_DAX should not be set on the block device by default in the first place. If we've been surprised by unexpected behaviour, then I'm sure there are going to be other surprises waiting for us. DAX default policy should be opt-in, not opt-out. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id ADF7D7CA2 for ; Thu, 11 Feb 2016 16:47:15 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay1.corp.sgi.com (Postfix) with ESMTP id 74A808F8035 for ; Thu, 11 Feb 2016 14:47:12 -0800 (PST) Received: from ipmail06.adl2.internode.on.net (ipmail06.adl2.internode.on.net [150.101.137.129]) by cuda.sgi.com with ESMTP id BogmEtesoEHeQLsM for ; Thu, 11 Feb 2016 14:47:09 -0800 (PST) Date: Fri, 12 Feb 2016 09:46:16 +1100 From: Dave Chinner Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160211224616.GL19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> <20160211204635.GI19486@dastard> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dan Williams Cc: Theodore Ts'o , 
"linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , Matthew Wilcox , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Jan Kara , Ross Zwisler , linux-ext4 , Andrew Morton On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote: > On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner wrote: > [..] > >> It seems to me we need to modify the > >> metadata i/o paths to bypass the page cache, > > > > XFS doesn't use the block device page cache for it's metadata - it > > has it's own internal metadata cache structures and uses get_pages > > or heap memory to back it's metadata. But that doesn't make mixing > > DAX and pages in the block device mapping tree sane. > > > > What you are missing here is that the underlying architecture of > > journalling filesystems mean they can't use DAX for their metadata. > > Modifications have to be buffered, because they have to be written > > to the journal first before they are written back in place. IOWs, we > > need to buffer changes in volatile memory for some time, and that > > means we can't use DAX during transactional modifications. > > > > And to put the final nail in that coffin, metadata in XFS can be > > discontiguous multi-block objects - in those situations we vmap the > > underlying pages so they appear to the code to be a contiguous > > buffer, and that's something we can't do with DAX.... > > Sorry, I wasn't clear when I said "bypass page cache" I meant a > solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition > table reads". So there's already bandaids to prevent bad shit from happening in the block layer, let alone when we consider all the ways that userspace can screw this all up. > However, I suspect that is broken if the filesystem is not ready > to see a new page allocated for every I/O. I assume one > thread will want to insert a page in the radix for another thread > to find/manipulate before metadata gets written back to storage. 
Right, you can't do that, especially as the struct page has a 1-1
relationship with the bufferhead that is attached to it as the
bufferhead carries the filesystem state for the given cached page.

> >> or teach the fsync code how to flush populated data pages out
> >> of the radix.
> >
> > That doesn't solve the problem. Filesystems free and reallocate
> > filesystem blocks without intermediate block device mapping
> > invalidation calls, so what is one minute a data block accessed
> > by DAX may become a metadata block that is accessed via buffered
> > IO. It all goes to crap very quickly....
> >
> > However, I'd say fsync is not the place to address this. This
> > block device cache aliasing issue is supposed to be what
> > unmap_underlying_metadata() solves, right?
>
> I'll take a look at this. Right now I'm trying to implement the
> "clear block-device-inode S_DAX on fs mount" approach. My concern
> though is that we need to disable block device mmap while a
> filesystem is mounted...

/me chokes on his coffee.

When did mmaping the block device behind the back of a mounted
filesystem become a valid use case? It's not supported for normal
block devices and for the same reasons it won't be supported for DAX
enabled block devices, either. i.e. I'm going to tell anyone who has
an application that does this to go and take a hike when (not if!)
they report filesystem corruption problems.

> Maybe I don't need to worry because it's already the case that a
> mmap of the raw device may not see the most up to date data for a
> file that has dirty fs-page-cache data.

It goes both ways. What happens if mkfs or fsck modifies the
block device via mmap+DAX and then the filesystem mounts the block
device and tries to read that metadata via the block device page
cache?

Quite frankly, DAX on the block device is a can of worms we really
don't need to deal with right now.
IMO it's a solution looking for a problem to solve, the "default to on" policy is wrong (DAX is opt-in, not opt-out) and given this we should turn it off until we've solved the more important problems we need to solve. i.e. We need to concentrate on getting data integrity working correctly first, then address the cache aliasing issues, then address the "safe access" issues, and then we can re-introduce block device DAX access... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 7BCCA7CA2 for ; Thu, 11 Feb 2016 16:59:18 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay1.corp.sgi.com (Postfix) with ESMTP id 382368F804B for ; Thu, 11 Feb 2016 14:59:18 -0800 (PST) Received: from mail-yk0-f177.google.com (mail-yk0-f177.google.com [209.85.160.177]) by cuda.sgi.com with ESMTP id 4VkwH3odHbfsQ2rG (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Thu, 11 Feb 2016 14:59:14 -0800 (PST) Received: by mail-yk0-f177.google.com with SMTP id z7so27549583yka.3 for ; Thu, 11 Feb 2016 14:59:15 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20160211224616.GL19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> <20160211204635.GI19486@dastard> <20160211224616.GL19486@dastard> Date: Thu, 11 Feb 2016 14:59:14 -0800 Message-ID: Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems From: Dan Williams List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: 
text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , Matthew Wilcox , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Jan Kara , Ross Zwisler , linux-ext4 , Andrew Morton On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner wrote: > On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote: >> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner wrote: >> [..] >> >> It seems to me we need to modify the >> >> metadata i/o paths to bypass the page cache, >> > >> > XFS doesn't use the block device page cache for it's metadata - it >> > has it's own internal metadata cache structures and uses get_pages >> > or heap memory to back it's metadata. But that doesn't make mixing >> > DAX and pages in the block device mapping tree sane. >> > >> > What you are missing here is that the underlying architecture of >> > journalling filesystems mean they can't use DAX for their metadata. >> > Modifications have to be buffered, because they have to be written >> > to the journal first before they are written back in place. IOWs, we >> > need to buffer changes in volatile memory for some time, and that >> > means we can't use DAX during transactional modifications. >> > >> > And to put the final nail in that coffin, metadata in XFS can be >> > discontiguous multi-block objects - in those situations we vmap the >> > underlying pages so they appear to the code to be a contiguous >> > buffer, and that's something we can't do with DAX.... >> >> Sorry, I wasn't clear when I said "bypass page cache" I meant a >> solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition >> table reads". > > So there's already bandaids to prevent bad shit from happening in > the block layer, let alone when we consider all the ways that > userspace can screw this all up. 
>
>> However, I suspect that is broken if the filesystem is not ready
>> to see a new page allocated for every I/O. I assume one
>> thread will want to insert a page in the radix for another thread
>> to find/manipulate before metadata gets written back to storage.
>
> Right, you can't do that, especially as the struct page has a 1-1
> relationship with the bufferhead that is attached to it as the
> bufferhead carries the filesystem state for the given cached page.
>
>> >> or teach the fsync code how to flush populated data pages out
>> >> of the radix.
>> >
>> > That doesn't solve the problem. Filesystems free and reallocate
>> > filesystem blocks without intermediate block device mapping
>> > invalidation calls, so what is one minute a data block accessed
>> > by DAX may become a metadata block that is accessed via buffered
>> > IO. It all goes to crap very quickly....
>> >
>> > However, I'd say fsync is not the place to address this. This
>> > block device cache aliasing issue is supposed to be what
>> > unmap_underlying_metadata() solves, right?
>>
>> I'll take a look at this. Right now I'm trying to implement the
>> "clear block-device-inode S_DAX on fs mount" approach. My concern
>> though is that we need to disable block device mmap while a
>> filesystem is mounted...
>
> /me chokes on his coffee.
>
> When did mmaping the block device behind the back of a mounted
> filesystem become a valid use case? It's not supported for normal
> block devices and for the same reasons it won't be supported for DAX
> enabled block devices, either. i.e. I'm going to tell anyone who has
> an application that does this to go and take a hike when (not if!)
> they report filesystem corruption problems.

Right, but we need to not confuse the fsync code regardless of how bad
of an idea this is ::-).

>> Maybe I don't need to worry because it's already the case that a
>> mmap of the raw device may not see the most up to date data for a
>> file that has dirty fs-page-cache data.
>
> It goes both ways. What happens if mkfs or fsck modifies the
> block device via mmap+DAX and then the filesystem mounts the block
> device and tries to read that metadata via the block device page
> cache?
>
> Quite frankly, DAX on the block device is a can of worms we really
> don't need to deal with right now. IMO it's a solution looking for a
> problem to solve,

Virtualization use cases want to give large ranges to guest-VMs, and
it is currently the only way to reliably get 1GiB mappings.

> the "default to on" policy is wrong (DAX is
> opt-in, not opt-out) and given this we should turn it off until
> we've solved the more important problems we need to solve. i.e. We
> need to concentrate on getting data integrity working correctly
> first, then address the cache aliasing issues, then address the
> "safe access" issues, and then we can re-introduce block device DAX
> access...

Agreed. Note that the "default-on policy" came from commit
bbab37ddc20b "block: Add support for DAX reads/writes to block
devices" way back in 4.2. We're just now noticing. Credit Ross for
good sanity checking.
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 83E4A8000 for ; Thu, 11 Feb 2016 17:44:20 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id 76087304051 for ; Thu, 11 Feb 2016 15:44:20 -0800 (PST) Received: from ipmail06.adl2.internode.on.net (ipmail06.adl2.internode.on.net [150.101.137.129]) by cuda.sgi.com with ESMTP id yeSuxkQDrynnHgBi for ; Thu, 11 Feb 2016 15:44:18 -0800 (PST) Date: Fri, 12 Feb 2016 10:44:15 +1100 From: Dave Chinner Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160211234415.GM19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> <20160211204635.GI19486@dastard> <20160211224616.GL19486@dastard> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dan Williams Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , Matthew Wilcox , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Jan Kara , Ross Zwisler , linux-ext4 , Andrew Morton On Thu, Feb 11, 2016 at 02:59:14PM -0800, Dan Williams wrote: > On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner wrote: > > On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote: > >> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner wrote: > >> 
Maybe I don't need to worry because it's already the case that a
> >> mmap of the raw device may not see the most up to date data for a
> >> file that has dirty fs-page-cache data.
> >
> > It goes both ways. What happens if mkfs or fsck modifies the
> > block device via mmap+DAX and then the filesystem mounts the block
> > device and tries to read that metadata via the block device page
> > cache?
> >
> > Quite frankly, DAX on the block device is a can of worms we really
> > don't need to deal with right now. IMO it's a solution looking for a
> > problem to solve,
>
> Virtualization use cases want to give large ranges to guest-VMs, and
> it is currently the only way to reliably get 1GiB mappings.

Precisely my point - block devices are not the best way to solve
this problem. A file on XFS with a 1GB extent size hint, preallocated
so that it is aligned to 1GB addresses (i.e. mkfs.xfs -d su=1G,sw=1 on
the host filesystem), will give reliable 1GB aligned blocks for DAX
mappings, just like a block device will.

Performance-wise it's little different to using the block device
directly. Management-wise it's way more flexible, especially as such
image files can be recycled for new VMs almost instantly via
FALLOC_FL_ZERO_RANGE.

Cheers,

Dave.
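For anyone wanting to try the image recycling step mentioned above, here is a minimal userspace sketch in C, assuming a Linux host and a filesystem that supports zero-range. fallocate(2) and FALLOC_FL_ZERO_RANGE are the real interface; recycle_image_range(), range_is_gib_aligned() and the insistence on 1GiB-aligned ranges are hypothetical choices made for this example, not something the thread prescribes.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* needed for fallocate() in <fcntl.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdint.h>

#define MODEL_GIB (UINT64_C(1) << 30)

/* Returns 1 if both offset and length are multiples of 1 GiB. */
int range_is_gib_aligned(uint64_t offset, uint64_t len)
{
	return (offset % MODEL_GIB) == 0 && (len % MODEL_GIB) == 0;
}

/*
 * Zero [offset, offset + len) of an already-open image file so it can
 * be handed to a new guest. The alignment requirement is an assumption
 * of this sketch (it keeps the 1GiB extents intact).
 * Returns 0 on success or a negative errno value.
 */
int recycle_image_range(int fd, uint64_t offset, uint64_t len)
{
	if (len == 0 || !range_is_gib_aligned(offset, len))
		return -EINVAL;
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, (off_t)offset, (off_t)len) < 0)
		return -errno;
	return 0;
}
```

A caller would open the image read-write and pass its descriptor, e.g. recycle_image_range(fd, 0, image_size). A filesystem without zero-range support reports EOPNOTSUPP, so a real tool would probably want a fallback path.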
-- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id CDDC57CA2 for ; Thu, 11 Feb 2016 10:22:18 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay3.corp.sgi.com (Postfix) with ESMTP id 51B40AC003 for ; Thu, 11 Feb 2016 08:22:15 -0800 (PST) Received: from mx2.suse.de (mx2.suse.de [195.135.220.15]) by cuda.sgi.com with ESMTP id OPBlvCK6DVGHariU (version=TLSv1 cipher=ECDHE-RSA-AES256-SHA bits=256 verify=NO) for ; Thu, 11 Feb 2016 08:22:12 -0800 (PST) Date: Thu, 11 Feb 2016 17:22:26 +0100 From: Jan Kara Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160211162226.GR21760@quack.suse.cz> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dan Williams Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , Matthew Wilcox , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Jan Kara , Ross Zwisler , linux-ext4 , Andrew Morton On Thu 11-02-16 07:22:00, Dan Williams wrote: > On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara wrote: > > On Wed 10-02-16 15:43:40, Ross Zwisler wrote: > >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > >> > On Wed, Feb 10, 2016 at 
01:48:56PM -0700, Ross Zwisler wrote: > >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems > >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to > >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > >> > > block devices and for XFS real-time files. > >> > > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem > >> > > ->writepages function so that it can supply us with a valid block > >> > > device. This also fixes DAX code to properly flush caches in response to > >> > > sync(2). > >> > > > >> > > Signed-off-by: Ross Zwisler > >> > > Signed-off-by: Jan Kara > >> > > --- > >> > > fs/block_dev.c | 16 +++++++++++++++- > >> > > fs/dax.c | 13 ++++++++----- > >> > > fs/ext2/inode.c | 11 +++++++++++ > >> > > fs/ext4/inode.c | 7 +++++++ > >> > > fs/xfs/xfs_aops.c | 9 +++++++++ > >> > > include/linux/dax.h | 6 ++++-- > >> > > mm/filemap.c | 12 ++++-------- > >> > > 7 files changed, 58 insertions(+), 16 deletions(-) > >> > > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c > >> > > index 39b3a17..fc01e43 100644 > >> > > --- a/fs/block_dev.c > >> > > +++ b/fs/block_dev.c > >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > >> > > return try_to_free_buffers(page); > >> > > } > >> > > > >> > > +static int blkdev_writepages(struct address_space *mapping, > >> > > + struct writeback_control *wbc) > >> > > +{ > >> > > + if (dax_mapping(mapping)) { > >> > > + struct block_device *bdev = I_BDEV(mapping->host); > >> > > + int error; > >> > > + > >> > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > >> > > + if (error) > >> > > + return error; > >> > > + } > >> > > + return generic_writepages(mapping, wbc); > >> > > +} > >> > > >> > Can you remind of 
the reason for calling generic_writepages() on DAX
> >> > enabled address spaces?
> >>
> >> Sure. The initial version of this patch didn't do this, and during testing I
> >> hit a bunch of xfstests failures. In ext2 at least I believe these were
> >> happening because we were skipping the call into generic_writepages() for DAX
> >> inodes. Without a lot of data to back this up, my guess is that this is due
> >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
> >> returns true), but having dirty page cache pages that need to be written back
> >> as part of the writeback.
> >>
> >> Changing this so we always call generic_writepages() even in the DAX case
> >> solved the xfstest failures.
> >>
> >> If this sounds incorrect, please let me know and I'll go and gather more data.
> >
> > So I think a more correct fix it to not set S_DAX for inodes that will have
> > any pagecache pages - e.g. don't set S_DAX for block device inodes when
> > filesystem is mounted on it (probably the easiest is to just refuse to
> > mount filesystem on block device which has S_DAX set).
>
> I think we have a wider problem here. See __blkdev_get, we set S_DAX
> on all block devices that have ->direct_access() and have a
> page-aligned starting address. It seems to me we need to modify the
> metadata i/o paths to bypass the page cache

Heh, no way to do that easily. All the journalling machinery depends on
buffers and pages...

>, or teach the fsync code
> how to flush populated data pages out of the radix.

This might be doable but it will be difficult to avoid aliasing issues and
data corruption. And mainly I don't see the point: When you mount a
filesystem on top of a block device, you do not want to mess with the block
device directly, even less using DAX. So we just have to find a way to
set S_DAX for normal open but clear it from the fs path. At worst, we could
clear S_DAX on the block device in mount_bdev() or something like that...

Honza -- Jan Kara SUSE Labs, CR _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id C02BB7CA2 for ; Thu, 11 Feb 2016 13:49:42 -0600 (CST) Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by relay2.corp.sgi.com (Postfix) with ESMTP id 981D0304059 for ; Thu, 11 Feb 2016 11:49:39 -0800 (PST) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by cuda.sgi.com with ESMTP id ojBYR5DLPvUPkNcq for ; Thu, 11 Feb 2016 11:49:38 -0800 (PST) Date: Thu, 11 Feb 2016 12:49:22 -0700 From: Ross Zwisler Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Message-ID: <20160211194922.GA5260@linux.intel.com> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <20160211124304.GI21760@quack.suse.cz> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20160211124304.GI21760@quack.suse.cz> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Jan Kara Cc: Theodore Ts'o , linux-nvdimm@lists.01.org, Dan Williams , linux-kernel@vger.kernel.org, xfs@oss.sgi.com, linux-mm@kvack.org, Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler , linux-ext4@vger.kernel.org, Andrew Morton On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote: > On Wed 10-02-16 13:48:54, Ross Zwisler wrote: > > During testing of raw block devices + DAX I noticed that the struct > > block_device that we were using for DAX operations was incorrect. For the > > fault handlers, etc. 
we can just get the correct bdev via get_block(), > > which is passed in as a function pointer, but for the *sync code and for > > sector zeroing we don't have access to get_block(). This is also an issue > > for XFS real-time devices, whenever we get those working. > > > > Patch one of this series fixes the DAX sector zeroing code by explicitly > > passing in a valid struct block_device. > > > > Patch two of this series fixes DAX *sync support by moving calls to > > dax_writeback_mapping_range() out of filemap_write_and_wait_range() and > > into the filesystem/block device ->writepages function so that it can > > supply us with a valid block device. This also fixes DAX code to properly > > flush caches in response to sync(2). > > > > Thanks to Jan Kara for his initial draft of patch 2: > > https://lkml.org/lkml/2016/2/9/485 > > > > Here are the changes that I've made to that patch: > > > > 1) For DAX mappings, only return after calling > > dax_writeback_mapping_range() if we encountered an error. In the non-error > > case we still need to write back normal pages, else we lose metadata > > updates. > > > > 2) In dax_writeback_mapping_range(), move the new check for > > if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL) > > above the i_blkbits check. In my testing I found cases where > > dax_writeback_mapping_range() was called for inodes with i_blkbits != > > PAGE_SHIFT - I'm assuming these are internal metadata inodes? They have no > > exceptional DAX entries to flush, so we have no work to do, but if we > > return error from the i_blkbits check we will fail the overall writeback > > operation. Please let me know if it seems wrong for us to be seeing inodes > > set to use DAX but with i_blkbits != PAGE_SHIFT and I'll get more info. > > So I'm wondering - how come S_DAX flag got set for inode where i_blkbis != > PAGE_SHIFT? That would seem to be a bug? I specifically ordered the checks > like this to catch such issues. 

I've isolated this one - this happens for all three filesystems (ext2, ext4 &
XFS), and does indeed have to do with the fact that S_DAX is set for
bdev->bd_inode. Here is one failure path:

[ 102.866637] [] dump_stack+0x85/0xc2
[ 102.867101] [] dax_writeback_mapping_range+0x60/0xe0
[ 102.867738] [] blkdev_writepages+0x3f/0x50
[ 102.868272] [] do_writepages+0x21/0x30
[ 102.868784] [] __filemap_fdatawrite_range+0xc6/0x100
[ 102.869378] [] filemap_write_and_wait+0x4a/0xa0
[ 102.869933] [] set_blocksize+0x70/0xd0
[ 102.870424] [] sb_set_blocksize+0x1d/0x50
[ 102.870933] [] ext4_fill_super+0x75b/0x3360
[ 102.871487] [] ? vsnprintf+0x201/0x4c0
[ 102.872005] [] ? snprintf+0x49/0x60
[ 102.872499] [] mount_bdev+0x180/0x1b0
[ 102.872981] [] ? ext4_calculate_overhead+0x370/0x370
[ 102.873580] [] ext4_mount+0x15/0x20
[ 102.874042] [] mount_fs+0x38/0x170
[ 102.874524] [] vfs_kern_mount+0x6b/0x150
[ 102.875041] [] do_mount+0x24f/0xe90
[ 102.875508] [] ? mntput+0x24/0x40
[ 102.875958] [] ? __kmalloc_track_caller+0xea/0x240
[ 102.876542] [] ? copy_mount_options+0x2c/0x210
[ 102.877087] [] SyS_mount+0x95/0xe0
[ 102.877573] [] entry_SYSCALL_64_fastpath+0x12/0x76

In set_blocksize() we are actually updating bdev->bd_inode->i_blkbits to be
12, but before that happens we do a sync_blockdev() with i_blkbits at 10,
which causes the failure.

This can be reproduced easily just by mounting an ext2 or ext4 filesystem.

I think the plan of unsetting S_DAX on bdev->bd_inode when we mount will save
us from this, as long as we do it super early in the mount process.
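To make the ordering problem above easier to follow, here is a minimal userspace sketch in C. It is not the kernel code: struct model_inode, MODEL_PAGE_SHIFT, all of the model_*() functions and the -EIO return value are assumptions made for illustration. Only the order of the two early checks, and the idea of clearing S_DAX before set_blocksize() runs, come from the discussion in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel state; not the real structures. */
#define MODEL_PAGE_SHIFT 12

struct model_inode {
	bool dax;			/* models S_DAX being set */
	unsigned int blkbits;		/* models inode->i_blkbits */
	unsigned long nrexceptional;	/* models mapping->nrexceptional */
};

/*
 * Models the two early checks in dax_writeback_mapping_range().
 * With blkbits_check_first the block size is validated before looking
 * for work; otherwise an inode with nothing to flush returns early.
 */
int model_dax_writeback(const struct model_inode *inode, bool sync_all,
			bool blkbits_check_first)
{
	assert(inode != NULL);

	bool nothing_to_do = inode->nrexceptional == 0 || !sync_all;
	bool bad_blkbits = inode->blkbits != MODEL_PAGE_SHIFT;

	if (blkbits_check_first) {
		if (bad_blkbits)
			return -EIO;	/* assumed error code */
		if (nothing_to_do)
			return 0;
	} else {
		if (nothing_to_do)
			return 0;
		if (bad_blkbits)
			return -EIO;
	}
	return 0;	/* a real implementation would flush entries here */
}

/* Models set_blocksize(): sync with the old block size, then update it. */
int model_set_blocksize(struct model_inode *bdev_inode, unsigned int new_bits,
			bool blkbits_check_first)
{
	int err = 0;

	if (bdev_inode->dax)
		err = model_dax_writeback(bdev_inode, true, blkbits_check_first);
	if (err)
		return err;
	bdev_inode->blkbits = new_bits;
	return 0;
}

/* Models the plan of clearing S_DAX on the bdev inode early in mount. */
int model_mount(struct model_inode *bdev_inode, bool clear_dax_first,
		bool blkbits_check_first)
{
	if (clear_dax_first)
		bdev_inode->dax = false;
	return model_set_blocksize(bdev_inode, MODEL_PAGE_SHIFT,
				   blkbits_check_first);
}
```

In this model a bdev inode with dax set and blkbits at 10 fails the mount when the block size check runs first, and mounts fine either when the "nothing to flush" check runs first or when the flag is cleared before set_blocksize().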
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 219CF7CA2 for ; Thu, 11 Feb 2016 14:46:41 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay2.corp.sgi.com (Postfix) with ESMTP id 0301F30405F for ; Thu, 11 Feb 2016 12:46:40 -0800 (PST) Received: from ipmail06.adl2.internode.on.net (ipmail06.adl2.internode.on.net [150.101.137.129]) by cuda.sgi.com with ESMTP id RZxlNlBjmBgbqt71 for ; Thu, 11 Feb 2016 12:46:38 -0800 (PST) Date: Fri, 12 Feb 2016 07:46:35 +1100 From: Dave Chinner Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems Message-ID: <20160211204635.GI19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dan Williams Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , Matthew Wilcox , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Jan Kara , Ross Zwisler , linux-ext4 , Andrew Morton On Thu, Feb 11, 2016 at 07:22:00AM -0800, Dan Williams wrote: > On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara wrote: > > On Wed 10-02-16 15:43:40, Ross Zwisler wrote: > >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote: > >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote: > >> > > 
Previously calls to dax_writeback_mapping_range() for all DAX filesystems > >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). > >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to > >> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes > >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw > >> > > block devices and for XFS real-time files. > >> > > > >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem > >> > > ->writepages function so that it can supply us with a valid block > >> > > device. This also fixes DAX code to properly flush caches in response to > >> > > sync(2). > >> > > > >> > > Signed-off-by: Ross Zwisler > >> > > Signed-off-by: Jan Kara > >> > > --- > >> > > fs/block_dev.c | 16 +++++++++++++++- > >> > > fs/dax.c | 13 ++++++++----- > >> > > fs/ext2/inode.c | 11 +++++++++++ > >> > > fs/ext4/inode.c | 7 +++++++ > >> > > fs/xfs/xfs_aops.c | 9 +++++++++ > >> > > include/linux/dax.h | 6 ++++-- > >> > > mm/filemap.c | 12 ++++-------- > >> > > 7 files changed, 58 insertions(+), 16 deletions(-) > >> > > > >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c > >> > > index 39b3a17..fc01e43 100644 > >> > > --- a/fs/block_dev.c > >> > > +++ b/fs/block_dev.c > >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait) > >> > > return try_to_free_buffers(page); > >> > > } > >> > > > >> > > +static int blkdev_writepages(struct address_space *mapping, > >> > > + struct writeback_control *wbc) > >> > > +{ > >> > > + if (dax_mapping(mapping)) { > >> > > + struct block_device *bdev = I_BDEV(mapping->host); > >> > > + int error; > >> > > + > >> > > + error = dax_writeback_mapping_range(mapping, bdev, wbc); > >> > > + if (error) > >> > > + return error; > >> > > + } > >> > > + return generic_writepages(mapping, wbc); > >> > > +} > >> > > >> > Can you remind of the reason for calling generic_writepages() on 
DAX
> >> > enabled address spaces?
> >>
> >> Sure. The initial version of this patch didn't do this, and during testing I
> >> hit a bunch of xfstests failures. In ext2 at least I believe these were
> >> happening because we were skipping the call into generic_writepages() for DAX
> >> inodes. Without a lot of data to back this up, my guess is that this is due
> >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
> >> returns true), but having dirty page cache pages that need to be written back
> >> as part of the writeback.
> >>
> >> Changing this so we always call generic_writepages() even in the DAX case
> >> solved the xfstest failures.
> >>
> >> If this sounds incorrect, please let me know and I'll go and gather more data.
> >
> > So I think a more correct fix it to not set S_DAX for inodes that will have
> > any pagecache pages - e.g. don't set S_DAX for block device inodes when
> > filesystem is mounted on it (probably the easiest is to just refuse to
> > mount filesystem on block device which has S_DAX set).
>
> I think we have a wider problem here. See __blkdev_get, we set S_DAX
> on all block devices that have ->direct_access() and have a
> page-aligned starting address.

That seems like a premature optimisation to me now. I didn't say
anything at the time because I was busy with other things and it
didn't affect XFS.

> It seems to me we need to modify the
> metadata i/o paths to bypass the page cache,

XFS doesn't use the block device page cache for its metadata - it
has its own internal metadata cache structures and uses get_pages
or heap memory to back its metadata. But that doesn't make mixing
DAX and pages in the block device mapping tree sane.

What you are missing here is that the underlying architecture of
journalling filesystems means they can't use DAX for their metadata.
Modifications have to be buffered, because they have to be written
to the journal first before they are written back in place.
IOWs, we need to buffer changes in volatile memory for some time,
and that means we can't use DAX during transactional modifications.

And to put the final nail in that coffin, metadata in XFS can be
discontiguous multi-block objects - in those situations we vmap the
underlying pages so they appear to the code to be a contiguous
buffer, and that's something we can't do with DAX....

> or teach the fsync code
> how to flush populated data pages out of the radix.

That doesn't solve the problem. Filesystems free and reallocate
filesystem blocks without intermediate block device mapping
invalidation calls, so what is one minute a data block accessed by
DAX may become a metadata block that is accessed via buffered IO. It
all goes to crap very quickly....

However, I'd say fsync is not the place to address this. This block
device cache aliasing issue is supposed to be what
unmap_underlying_metadata() solves, right?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (Postfix) with ESMTP id 3B9A37CA2 for ; Thu, 11 Feb 2016 14:58:42 -0600 (CST) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay3.corp.sgi.com (Postfix) with ESMTP id A8882AC002 for ; Thu, 11 Feb 2016 12:58:41 -0800 (PST) Received: from mail-yw0-f174.google.com (mail-yw0-f174.google.com [209.85.161.174]) by cuda.sgi.com with ESMTP id DOREEMIjVfBNLCU2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO) for ; Thu, 11 Feb 2016 12:58:39 -0800 (PST) Received: by mail-yw0-f174.google.com with SMTP id u200so49601002ywf.0 for ; Thu, 11 Feb 2016 12:58:39 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20160211204635.GI19486@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> 
<1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com> <20160210220312.GP14668@dastard> <20160210224340.GA30938@linux.intel.com> <20160211125044.GJ21760@quack.suse.cz> <20160211204635.GI19486@dastard> Date: Thu, 11 Feb 2016 12:58:38 -0800 Message-ID: Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems From: Dan Williams List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: Theodore Ts'o , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" , Matthew Wilcox , XFS Developers , Linux MM , Andreas Dilger , Alexander Viro , Jan Kara , linux-fsdevel , Jan Kara , Ross Zwisler , linux-ext4 , Andrew Morton On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner wrote: [..] >> It seems to me we need to modify the >> metadata i/o paths to bypass the page cache, > > XFS doesn't use the block device page cache for it's metadata - it > has it's own internal metadata cache structures and uses get_pages > or heap memory to back it's metadata. But that doesn't make mixing > DAX and pages in the block device mapping tree sane. > > What you are missing here is that the underlying architecture of > journalling filesystems mean they can't use DAX for their metadata. > Modifications have to be buffered, because they have to be written > to the journal first before they are written back in place. IOWs, we > need to buffer changes in volatile memory for some time, and that > means we can't use DAX during transactional modifications. > > And to put the final nail in that coffin, metadata in XFS can be > discontiguous multi-block objects - in those situations we vmap the > underlying pages so they appear to the code to be a contiguous > buffer, and that's something we can't do with DAX.... 

Sorry, I wasn't clear. When I said "bypass page cache" I meant a
solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition
table reads".

However, I suspect that is broken if the filesystem is not ready to
see a new page allocated for every I/O. I assume one thread will want
to insert a page in the radix for another thread to find/manipulate
before metadata gets written back to storage.

>> or teach the fsync code
>> how to flush populated data pages out of the radix.
>
> That doesn't solve the problem. Filesystems free and reallocate
> filesystem blocks without intermediate block device mapping
> invalidation calls, so what is one minute a data block accessed by
> DAX may become a metadata block that accessed via buffered IO. It
> all goes to crap very quickly....
>
> However, I'd say fsync is not the place to address this. This block
> device cache aliasing issue is supposed to be what
> unmap_underlying_metadata() solves, right?

I'll take a look at this. Right now I'm trying to implement the
"clear block-device-inode S_DAX on fs mount" approach. My concern
though is that we need to disable block device mmap while a filesystem
is mounted...

Maybe I don't need to worry because it's already the case that an
mmap of the raw device may not see the most up to date data for a file
that has dirty fs-page-cache data.
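The aliasing concern discussed above can be illustrated with a toy userspace model in C. Everything here is hypothetical (struct model_block and the model_*() helpers are invented for this sketch and have no kernel counterpart). It only shows why a direct (DAX-style) view and a page-cache view of the same block can disagree in both directions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MODEL_BLOCK_SIZE 16

/* Hypothetical model: one persistent block plus a page-cache copy of it. */
struct model_block {
	char media[MODEL_BLOCK_SIZE];	/* what a DAX mapping accesses */
	char cached[MODEL_BLOCK_SIZE];	/* page-cache copy for buffered I/O */
	bool cache_valid;
	bool cache_dirty;
};

/* Buffered write: only the cached copy changes until writeback. */
void model_buffered_write(struct model_block *b, const char *data)
{
	assert(b != NULL && data != NULL);
	snprintf(b->cached, sizeof(b->cached), "%s", data);
	b->cache_valid = true;
	b->cache_dirty = true;
}

/* DAX-style write: goes straight to the media, bypassing the cache. */
void model_dax_write(struct model_block *b, const char *data)
{
	assert(b != NULL && data != NULL);
	snprintf(b->media, sizeof(b->media), "%s", data);
}

/* Buffered read: served from the cache whenever a copy exists. */
const char *model_buffered_read(struct model_block *b)
{
	if (!b->cache_valid) {
		memcpy(b->cached, b->media, MODEL_BLOCK_SIZE);
		b->cache_valid = true;
	}
	return b->cached;
}

/* DAX-style read: always sees the media, never the cache. */
const char *model_dax_read(const struct model_block *b)
{
	return b->media;
}

/* Writeback: copy dirty cached data to the media. */
void model_writeback(struct model_block *b)
{
	if (b->cache_valid && b->cache_dirty) {
		memcpy(b->media, b->cached, MODEL_BLOCK_SIZE);
		b->cache_dirty = false;
	}
}
```

In this model a direct reader misses dirty cached data until writeback runs, and a buffered reader keeps returning its stale cached copy after a direct write, since nothing invalidates the cache.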
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29]) by oss.sgi.com (Postfix) with ESMTP id 3ACEE7CA2 for ; Fri, 12 Feb 2016 20:39:21 -0600 (CST) Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by relay2.corp.sgi.com (Postfix) with ESMTP id 248C4304051 for ; Fri, 12 Feb 2016 18:39:21 -0800 (PST) Received: from ipmail06.adl2.internode.on.net (ipmail06.adl2.internode.on.net [150.101.137.129]) by cuda.sgi.com with ESMTP id M4FE8YnpLIfJBLad for ; Fri, 12 Feb 2016 18:39:18 -0800 (PST) Date: Sat, 13 Feb 2016 13:38:49 +1100 From: Dave Chinner Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Message-ID: <20160213023849.GD14668@dastard> References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com> <20160211124304.GI21760@quack.suse.cz> <20160212190320.GA24857@linux.intel.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20160212190320.GA24857@linux.intel.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: Ross Zwisler , Jan Kara , linux-kernel@vger.kernel.org, Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton , Dan Williams , Jan Kara , Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@lists.01.org, xfs@oss.sgi.com On Fri, Feb 12, 2016 at 12:03:20PM -0700, Ross Zwisler wrote: > On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote: > > On Wed 10-02-16 13:48:54, Ross Zwisler wrote: > > > 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue > > > the writeback in the case that DAX is enabled but we only have a nonzero > > > 
mapping->nrpages. As with 1) and 2), I believe this is necessary to
> > > properly writeback metadata changes. If this sounds wrong, please let me
> > > know and I'll get more info.
> >
> > And I'm surprised here as well. If there are dax_mapping() inodes that have
> > pagecache pages, then we have issues with radix tree handling as well. So
> > how come dax_mapping() inodes have pages attached? If it is about block
> > device inodes, then I find it buggy, that S_DAX gets set for such inodes
> > when filesystem is mounted on them because in such cases we are IMO asking
> > for data corruption sooner rather than later...
>
> I think I've figured this one out, at least partially.
>
> For ext2 the issues I was seeing were due to the fact that directory inodes
> have S_DAX set, but have dirty page cache pages. In testing with
> generic/002, I see two ext2 inodes with S_DAX trying to do a writeback while
> they have dirty page cache pages. The first has i_ino=2, which is the
> EXT2_ROOT_INO.
....
> As far as I can see, XFS does not have these issues - returning immediately
> having done just the DAX writeback in xfs_vm_writepages() lets all my xfstests
> pass.

XFS will not have issues because it does not dirty directory inodes
at the VFS level, nor does it use the page cache for directory data.
However, looking at the code I think it does still set S_DAX on
directory inodes, which it shouldn't be doing.

I've got a couple of fixes I need to do in this area - hopefully
I'll get it done on Monday.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from relay.sgi.com (relay2.corp.sgi.com [137.38.102.29])
    by oss.sgi.com (Postfix) with ESMTP id C40587CBA
    for ; Fri, 12 Feb 2016 13:03:37 -0600 (CST)
Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25])
    by relay2.corp.sgi.com (Postfix) with ESMTP id B93D7304048
    for ; Fri, 12 Feb 2016 11:03:34 -0800 (PST)
Received: from mga02.intel.com (mga02.intel.com [134.134.136.20])
    by cuda.sgi.com with ESMTP id lqBsCr7HmfG1mFaD
    for ; Fri, 12 Feb 2016 11:03:33 -0800 (PST)
Date: Fri, 12 Feb 2016 12:03:20 -0700
From: Ross Zwisler
Subject: Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
Message-ID: <20160212190320.GA24857@linux.intel.com>
References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com>
    <20160211124304.GI21760@quack.suse.cz>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20160211124304.GI21760@quack.suse.cz>
List-Id: XFS Filesystem from SGI
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: xfs-bounces@oss.sgi.com
Sender: xfs-bounces@oss.sgi.com
To: Jan Kara
Cc: Theodore Ts'o , linux-nvdimm@lists.01.org, Dan Williams ,
    linux-kernel@vger.kernel.org, xfs@oss.sgi.com, linux-mm@kvack.org,
    Andreas Dilger , Alexander Viro , Jan Kara ,
    linux-fsdevel@vger.kernel.org, Matthew Wilcox , Ross Zwisler ,
    linux-ext4@vger.kernel.org, Andrew Morton

On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote:
> On Wed 10-02-16 13:48:54, Ross Zwisler wrote:
> > 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue
> > the writeback in the case that DAX is enabled but we only have a nonzero
> > mapping->nrpages. As with 1) and 2), I believe this is necessary to
> > properly writeback metadata changes.
> > If this sounds wrong, please let me
> > know and I'll get more info.
>
> And I'm surprised here as well. If there are dax_mapping() inodes that have
> pagecache pages, then we have issues with radix tree handling as well. So
> how come dax_mapping() inodes have pages attached? If it is about block
> device inodes, then I find it buggy, that S_DAX gets set for such inodes
> when filesystem is mounted on them because in such cases we are IMO asking
> for data corruption sooner rather than later...

I think I've figured this one out, at least partially.

For ext2 the issues I was seeing were due to the fact that directory inodes
have S_DAX set, but have dirty page cache pages. In testing with
generic/002, I see two ext2 inodes with S_DAX trying to do a writeback while
they have dirty page cache pages. The first has i_ino=2, which is the
EXT2_ROOT_INO. The second inode changes from run to run, but for my last run
was 155649.

The test failed because that directory inode was found to be corrupt by
fsck.ext2:

*** fsck.ext2 output ***
fsck from util-linux 2.26.2
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Directory inode 155649, block #0, offset 0: directory corrupted

If I change the code in ext2_writepages() so that it does the
mpage_writepages() even for DAX inodes, all my xfstests pass. I'm not sure
this is the right fix, though - should it instead be that ext2 directory
inodes don't have S_DAX set?

A similar problem occurs with ext4, though I haven't yet tracked it down to an
inode type. It could be that ext4 directory inodes have the same issue, and
Eric Sandeen suggested we might also have an issue with XATTRS attached to
inodes.

As with ext2, if I allow the normal writeback to occur in ext4_writepages()
even for DAX inodes, the issues go away, but I'm not sure whether or not this
is the correct fix.
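
The two ->writepages behaviours compared above can be modelled in a few
lines of userspace C. This is only an illustrative sketch, not the ext2 or
ext4 code and not the actual patch: struct toy_mapping and every toy_*
function are hypothetical stand-ins, and the counters merely model "dirty DAX
entries" and "dirty page cache pages" on one inode.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an inode's mapping; NOT struct address_space. */
struct toy_mapping {
    bool is_dax;           /* models S_DAX being set on the inode */
    int dirty_dax_entries; /* dirty DAX ranges that need a cache flush */
    int dirty_pages;       /* dirty page cache pages, e.g. directory blocks */
};

/* Models the DAX flush step: it cleans DAX entries and nothing else. */
static int toy_dax_writeback(struct toy_mapping *m)
{
    m->dirty_dax_entries = 0;
    return 0;
}

/* Models ordinary page cache writeback. */
static int toy_page_writeback(struct toy_mapping *m)
{
    m->dirty_pages = 0;
    return 0;
}

/* Variant A: a DAX inode returns right after the DAX flush. */
int toy_writepages_dax_only(struct toy_mapping *m)
{
    if (m->is_dax)
        return toy_dax_writeback(m);
    return toy_page_writeback(m);
}

/* Variant B: flush DAX first, then always do page cache writeback too. */
int toy_writepages_fallthrough(struct toy_mapping *m)
{
    if (m->is_dax) {
        int error = toy_dax_writeback(m);

        if (error)
            return error;
    }
    return toy_page_writeback(m);
}
```

For a mapping with is_dax set and three dirty page cache pages, variant A
leaves dirty_pages at 3, which models directory blocks that never reach the
media, while variant B brings it down to 0. The sketch says nothing about
which fix is right; it only shows why the fall-through hides the symptom.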
As far as I can see, XFS does not have these issues - returning immediately
having done just the DAX writeback in xfs_vm_writepages() lets all my xfstests
pass.

For v4.5 should I send out an updated version of this series that does the
regular page writeback for ext2 & ext4, or should we work to clear S_DAX for
regular filesystem inodes that have dirty page cache data?

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-yw0-f178.google.com (mail-yw0-f178.google.com [209.85.161.178])
    by kanga.kvack.org (Postfix) with ESMTP id CB524828E1
    for ; Thu, 11 Feb 2016 10:47:52 -0500 (EST)
Received: by mail-yw0-f178.google.com with SMTP id q190so42007341ywd.3
    for ; Thu, 11 Feb 2016 07:47:52 -0800 (PST)
Received: from mail-yk0-x235.google.com (mail-yk0-x235.google.com. [2607:f8b0:4002:c07::235])
    by mx.google.com with ESMTPS id l70si3751210ywb.45.2016.02.11.07.22.01
    for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
    Thu, 11 Feb 2016 07:22:01 -0800 (PST)
Received: by mail-yk0-x235.google.com with SMTP id z7so21891170yka.3
    for ; Thu, 11 Feb 2016 07:22:01 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <20160211125044.GJ21760@quack.suse.cz>
References: <1455137336-28720-1-git-send-email-ross.zwisler@linux.intel.com>
    <1455137336-28720-3-git-send-email-ross.zwisler@linux.intel.com>
    <20160210220312.GP14668@dastard>
    <20160210224340.GA30938@linux.intel.com>
    <20160211125044.GJ21760@quack.suse.cz>
Date: Thu, 11 Feb 2016 07:22:00 -0800
Message-ID:
Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
From: Dan Williams
Content-Type: text/plain; charset=UTF-8
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jan Kara
Cc: Ross Zwisler , Dave Chinner , "linux-kernel@vger.kernel.org" ,
    Theodore Ts'o , Alexander Viro , Andreas Dilger , Andrew Morton ,
    Jan Kara , Matthew Wilcox , linux-ext4 , linux-fsdevel , Linux MM ,
    "linux-nvdimm@lists.01.org" , XFS Developers

On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara wrote:
> On Wed 10-02-16 15:43:40, Ross Zwisler wrote:
>> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
>> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
>> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
>> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
>> > > dax_writeback_mapping_range() needs a struct block_device, and it used to
>> > > get that from inode->i_sb->s_bdev. This is correct for normal inodes
>> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
>> > > block devices and for XFS real-time files.
>> > >
>> > > Instead, call dax_writeback_mapping_range() directly from the filesystem
>> > > ->writepages function so that it can supply us with a valid block
>> > > device. This also fixes DAX code to properly flush caches in response to
>> > > sync(2).
>> > >
>> > > Signed-off-by: Ross Zwisler
>> > > Signed-off-by: Jan Kara
>> > > ---
>> > >  fs/block_dev.c      | 16 +++++++++++++++-
>> > >  fs/dax.c            | 13 ++++++++-----
>> > >  fs/ext2/inode.c     | 11 +++++++++++
>> > >  fs/ext4/inode.c     |  7 +++++++
>> > >  fs/xfs/xfs_aops.c   |  9 +++++++++
>> > >  include/linux/dax.h |  6 ++++--
>> > >  mm/filemap.c        | 12 ++++--------
>> > >  7 files changed, 58 insertions(+), 16 deletions(-)
>> > >
>> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
>> > > index 39b3a17..fc01e43 100644
>> > > --- a/fs/block_dev.c
>> > > +++ b/fs/block_dev.c
>> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
>> > >          return try_to_free_buffers(page);
>> > >  }
>> > >
>> > > +static int blkdev_writepages(struct address_space *mapping,
>> > > +                             struct writeback_control *wbc)
>> > > +{
>> > > +        if (dax_mapping(mapping)) {
>> > > +                struct block_device *bdev = I_BDEV(mapping->host);
>> > > +                int error;
>> > > +
>> > > +                error = dax_writeback_mapping_range(mapping, bdev, wbc);
>> > > +                if (error)
>> > > +                        return error;
>> > > +        }
>> > > +        return generic_writepages(mapping, wbc);
>> > > +}
>> >
>> > Can you remind of the reason for calling generic_writepages() on DAX
>> > enabled address spaces?
>>
>> Sure. The initial version of this patch didn't do this, and during testing I
>> hit a bunch of xfstests failures. In ext2 at least I believe these were
>> happening because we were skipping the call into generic_writepages() for DAX
>> inodes. Without a lot of data to back this up, my guess is that this is due
>> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
>> returns true), but having dirty page cache pages that need to be written back
>> as part of the writeback.
>>
>> Changing this so we always call generic_writepages() even in the DAX case
>> solved the xfstest failures.
>>
>> If this sounds incorrect, please let me know and I'll go and gather more data.
>
> So I think a more correct fix it to not set S_DAX for inodes that will have
> any pagecache pages - e.g. don't set S_DAX for block device inodes when
> filesystem is mounted on it (probably the easiest is to just refuse to
> mount filesystem on block device which has S_DAX set).

I think we have a wider problem here. See __blkdev_get, we set S_DAX
on all block devices that have ->direct_access() and have a
page-aligned starting address. It seems to me we need to modify the
metadata i/o paths to bypass the page cache, or teach the fsync code
how to flush populated data pages out of the radix.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org