linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
@ 2016-02-10 20:48 Ross Zwisler
  2016-02-10 20:48 ` [PATCH v2 1/2] dax: supply DAX clearing code with correct bdev Ross Zwisler
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Ross Zwisler @ 2016-02-10 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, xfs

During testing of raw block devices + DAX I noticed that the struct
block_device that we were using for DAX operations was incorrect.  For the
fault handlers, etc. we can just get the correct bdev via get_block(),
which is passed in as a function pointer, but for the *sync code and for
sector zeroing we don't have access to get_block().  This is also an issue
for XFS real-time devices, whenever we get those working.

Patch one of this series fixes the DAX sector zeroing code by explicitly
passing in a valid struct block_device.

Patch two of this series fixes DAX *sync support by moving calls to
dax_writeback_mapping_range() out of filemap_write_and_wait_range() and
into the filesystem/block device ->writepages function so that it can
supply us with a valid block device. This also fixes DAX code to properly
flush caches in response to sync(2).

Thanks to Jan Kara for his initial draft of patch 2:
https://lkml.org/lkml/2016/2/9/485

Here are the changes that I've made to that patch:

1) For DAX mappings, only return after calling
dax_writeback_mapping_range() if we encountered an error.  In the non-error
case we still need to write back normal pages, else we lose metadata
updates. 

2) In dax_writeback_mapping_range(), move the new check for 
        if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
above the i_blkbits check.  In my testing I found cases where
dax_writeback_mapping_range() was called for inodes with i_blkbits !=
PAGE_SHIFT - I'm assuming these are internal metadata inodes?  They have no
exceptional DAX entries to flush, so we have no work to do, but if we
return error from the i_blkbits check we will fail the overall writeback
operation.  Please let me know if it seems wrong for us to be seeing inodes
set to use DAX but with i_blkbits != PAGE_SHIFT and I'll get more info.

3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue
the writeback in the case that DAX is enabled but we only have a nonzero
mapping->nrpages.  As with 1) and 2), I believe this is necessary to
properly writeback metadata changes.  If this sounds wrong, please let me
know and I'll get more info.

A working tree can be found here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_bdev_v2

Ross Zwisler (2):
  dax: supply DAX clearing code with correct bdev
  dax: move writeback calls into the filesystems

 fs/block_dev.c         | 16 +++++++++++++++-
 fs/dax.c               | 22 ++++++++++++----------
 fs/ext2/inode.c        | 17 +++++++++++++++--
 fs/ext4/inode.c        |  7 +++++++
 fs/xfs/xfs_aops.c      | 11 ++++++++++-
 fs/xfs/xfs_aops.h      |  1 +
 fs/xfs/xfs_bmap_util.c |  3 ++-
 include/linux/dax.h    |  8 +++++---
 mm/filemap.c           | 12 ++++--------
 9 files changed, 71 insertions(+), 26 deletions(-)

-- 
2.5.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 1/2] dax: supply DAX clearing code with correct bdev
  2016-02-10 20:48 [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Ross Zwisler
@ 2016-02-10 20:48 ` Ross Zwisler
  2016-02-10 20:48 ` [PATCH v2 2/2] dax: move writeback calls into the filesystems Ross Zwisler
  2016-02-11 12:43 ` [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Jan Kara
  2 siblings, 0 replies; 20+ messages in thread
From: Ross Zwisler @ 2016-02-10 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, xfs

dax_clear_blocks() needs a valid struct block_device and previously it was
using inode->i_sb->s_bdev in all cases.  This is correct for normal inodes
on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time devices.

Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change its
arguments to take a bdev and a sector instead of an inode and a block.
This better reflects what the function does, and it allows the filesystem
and raw block device code to pass in an appropriate struct block_device.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c               | 9 ++++-----
 fs/ext2/inode.c        | 6 ++++--
 fs/xfs/xfs_aops.c      | 2 +-
 fs/xfs/xfs_aops.h      | 1 +
 fs/xfs/xfs_bmap_util.c | 3 ++-
 include/linux/dax.h    | 2 +-
 6 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fc2e314..9a173dd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -79,15 +79,14 @@ struct page *read_dax_sector(struct block_device *bdev, sector_t n)
 }
 
 /*
- * dax_clear_blocks() is called from within transaction context from XFS,
+ * dax_clear_sectors() is called from within transaction context from XFS,
  * and hence this means the stack from this point must follow GFP_NOFS
  * semantics for all operations.
  */
-int dax_clear_blocks(struct inode *inode, sector_t block, long _size)
+int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size)
 {
-	struct block_device *bdev = inode->i_sb->s_bdev;
 	struct blk_dax_ctl dax = {
-		.sector = block << (inode->i_blkbits - 9),
+		.sector = _sector,
 		.size = _size,
 	};
 
@@ -109,7 +108,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long _size)
 	wmb_pmem();
 	return 0;
 }
-EXPORT_SYMBOL_GPL(dax_clear_blocks);
+EXPORT_SYMBOL_GPL(dax_clear_sectors);
 
 /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
 static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 338eefd..b6b965b 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -737,8 +737,10 @@ static int ext2_get_blocks(struct inode *inode,
 		 * so that it's not found by another thread before it's
 		 * initialised
 		 */
-		err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
-						1 << inode->i_blkbits);
+		err = dax_clear_sectors(inode->i_sb->s_bdev,
+				le32_to_cpu(chain[depth-1].key) <<
+				(inode->i_blkbits - 9),
+				1 << inode->i_blkbits);
 		if (err) {
 			mutex_unlock(&ei->truncate_mutex);
 			goto cleanup;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 379c089..fc20518 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -55,7 +55,7 @@ xfs_count_page_state(
 	} while ((bh = bh->b_this_page) != head);
 }
 
-STATIC struct block_device *
+struct block_device *
 xfs_find_bdev_for_inode(
 	struct inode		*inode)
 {
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index f6ffc9a..a4343c6 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -62,5 +62,6 @@ int	xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
 			         struct buffer_head *map_bh, int create);
 
 extern void xfs_count_page_state(struct page *, int *, int *);
+extern struct block_device *xfs_find_bdev_for_inode(struct inode *);
 
 #endif /* __XFS_AOPS_H__ */
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 07ef29b..ae9d755 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -75,7 +75,8 @@ xfs_zero_extent(
 	ssize_t		size = XFS_FSB_TO_B(mp, count_fsb);
 
 	if (IS_DAX(VFS_I(ip)))
-		return dax_clear_blocks(VFS_I(ip), block, size);
+		return dax_clear_sectors(xfs_find_bdev_for_inode(VFS_I(ip)),
+				sector, size);
 
 	/*
 	 * let the block layer decide on the fastest method of
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 818e450..7b6bced 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -7,7 +7,7 @@
 
 ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
 		  get_block_t, dio_iodone_t, int flags);
-int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 2/2] dax: move writeback calls into the filesystems
  2016-02-10 20:48 [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Ross Zwisler
  2016-02-10 20:48 ` [PATCH v2 1/2] dax: supply DAX clearing code with correct bdev Ross Zwisler
@ 2016-02-10 20:48 ` Ross Zwisler
  2016-02-10 22:03   ` Dave Chinner
  2016-02-11 12:43 ` [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Jan Kara
  2 siblings, 1 reply; 20+ messages in thread
From: Ross Zwisler @ 2016-02-10 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, xfs, Jan Kara

Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
dax_writeback_mapping_range() needs a struct block_device, and it used to
get that from inode->i_sb->s_bdev.  This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device. This also fixes DAX code to properly flush caches in response to
sync(2).

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c      | 16 +++++++++++++++-
 fs/dax.c            | 13 ++++++++-----
 fs/ext2/inode.c     | 11 +++++++++++
 fs/ext4/inode.c     |  7 +++++++
 fs/xfs/xfs_aops.c   |  9 +++++++++
 include/linux/dax.h |  6 ++++--
 mm/filemap.c        | 12 ++++--------
 7 files changed, 58 insertions(+), 16 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 39b3a17..fc01e43 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
 	return try_to_free_buffers(page);
 }
 
+static int blkdev_writepages(struct address_space *mapping,
+			     struct writeback_control *wbc)
+{
+	if (dax_mapping(mapping)) {
+		struct block_device *bdev = I_BDEV(mapping->host);
+		int error;
+
+		error = dax_writeback_mapping_range(mapping, bdev, wbc);
+		if (error)
+			return error;
+	}
+	return generic_writepages(mapping, wbc);
+}
+
 static const struct address_space_operations def_blk_aops = {
 	.readpage	= blkdev_readpage,
 	.readpages	= blkdev_readpages,
 	.writepage	= blkdev_writepage,
 	.write_begin	= blkdev_write_begin,
 	.write_end	= blkdev_write_end,
-	.writepages	= generic_writepages,
+	.writepages	= blkdev_writepages,
 	.releasepage	= blkdev_releasepage,
 	.direct_IO	= blkdev_direct_IO,
 	.is_dirty_writeback = buffer_check_dirty_writeback,
diff --git a/fs/dax.c b/fs/dax.c
index 9a173dd..034dd02 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -484,11 +484,10 @@ static int dax_writeback_one(struct block_device *bdev,
  * end]. This is required by data integrity operations to ensure file data is
  * on persistent storage prior to completion of the operation.
  */
-int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
-		loff_t end)
+int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	struct block_device *bdev = inode->i_sb->s_bdev;
 	pgoff_t start_index, end_index, pmd_index;
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
@@ -496,11 +495,15 @@ int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
 	int i, ret = 0;
 	void *entry;
 
+
+	if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
+		return 0;
+
 	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
 		return -EIO;
 
-	start_index = start >> PAGE_CACHE_SHIFT;
-	end_index = end >> PAGE_CACHE_SHIFT;
+	start_index = wbc->range_start >> PAGE_CACHE_SHIFT;
+	end_index = wbc->range_end >> PAGE_CACHE_SHIFT;
 	pmd_index = DAX_PMD_INDEX(start_index);
 
 	rcu_read_lock();
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index b6b965b..7e44fc3 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -876,6 +876,17 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
 static int
 ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
+#ifdef CONFIG_FS_DAX
+	if (dax_mapping(mapping)) {
+		int error;
+
+		error = dax_writeback_mapping_range(mapping,
+				mapping->host->i_sb->s_bdev, wbc);
+		if (error)
+			return error;
+	}
+#endif
+
 	return mpage_writepages(mapping, wbc, ext2_get_block);
 }
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 83bc8bf..8c42020 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2450,6 +2450,13 @@ static int ext4_writepages(struct address_space *mapping,
 
 	trace_ext4_writepages(inode, wbc);
 
+	if (dax_mapping(mapping)) {
+		ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev,
+						   wbc);
+		if (ret)
+			goto out_writepages;
+	}
+
 	/*
 	 * No pages to write? This is mainly a kludge to avoid starting
 	 * a transaction for special inodes like journal inode on last iput()
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index fc20518..1139ecd 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1208,6 +1208,15 @@ xfs_vm_writepages(
 	struct writeback_control *wbc)
 {
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
+	if (dax_mapping(mapping)) {
+		int error;
+
+		error = dax_writeback_mapping_range(mapping,
+				xfs_find_bdev_for_inode(mapping->host), wbc);
+		if (error)
+			return error;
+	}
+
 	return generic_writepages(mapping, wbc);
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7b6bced..636dd59 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -52,6 +52,8 @@ static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
-int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
-		loff_t end);
+
+struct writeback_control;
+int dax_writeback_mapping_range(struct address_space *mapping,
+		struct block_device *bdev, struct writeback_control *wbc);
 #endif
diff --git a/mm/filemap.c b/mm/filemap.c
index bc94386..a829779 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -446,7 +446,8 @@ int filemap_write_and_wait(struct address_space *mapping)
 {
 	int err = 0;
 
-	if (mapping->nrpages) {
+	if (mapping->nrpages ||
+	    (dax_mapping(mapping) && mapping->nrexceptional)) {
 		err = filemap_fdatawrite(mapping);
 		/*
 		 * Even if the above returned error, the pages may be
@@ -482,13 +483,8 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 {
 	int err = 0;
 
-	if (dax_mapping(mapping) && mapping->nrexceptional) {
-		err = dax_writeback_mapping_range(mapping, lstart, lend);
-		if (err)
-			return err;
-	}
-
-	if (mapping->nrpages) {
+	if (mapping->nrpages ||
+	    (dax_mapping(mapping) && mapping->nrexceptional)) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
 		/* See comment of filemap_write_and_wait() */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
  2016-02-10 20:48 ` [PATCH v2 2/2] dax: move writeback calls into the filesystems Ross Zwisler
@ 2016-02-10 22:03   ` Dave Chinner
  2016-02-10 22:43     ` Ross Zwisler
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2016-02-10 22:03 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Jan Kara, Matthew Wilcox,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, xfs, Jan Kara

On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
> Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> dax_writeback_mapping_range() needs a struct block_device, and it used to
> get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> block devices and for XFS real-time files.
> 
> Instead, call dax_writeback_mapping_range() directly from the filesystem
> ->writepages function so that it can supply us with a valid block
> device. This also fixes DAX code to properly flush caches in response to
> sync(2).
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/block_dev.c      | 16 +++++++++++++++-
>  fs/dax.c            | 13 ++++++++-----
>  fs/ext2/inode.c     | 11 +++++++++++
>  fs/ext4/inode.c     |  7 +++++++
>  fs/xfs/xfs_aops.c   |  9 +++++++++
>  include/linux/dax.h |  6 ++++--
>  mm/filemap.c        | 12 ++++--------
>  7 files changed, 58 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 39b3a17..fc01e43 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
>  	return try_to_free_buffers(page);
>  }
>  
> +static int blkdev_writepages(struct address_space *mapping,
> +			     struct writeback_control *wbc)
> +{
> +	if (dax_mapping(mapping)) {
> +		struct block_device *bdev = I_BDEV(mapping->host);
> +		int error;
> +
> +		error = dax_writeback_mapping_range(mapping, bdev, wbc);
> +		if (error)
> +			return error;
> +	}
> +	return generic_writepages(mapping, wbc);
> +}

Can you remind of the reason for calling generic_writepages() on DAX
enabled address spaces?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
  2016-02-10 22:03   ` Dave Chinner
@ 2016-02-10 22:43     ` Ross Zwisler
  2016-02-10 23:44       ` Dave Chinner
  2016-02-11 12:50       ` Jan Kara
  0 siblings, 2 replies; 20+ messages in thread
From: Ross Zwisler @ 2016-02-10 22:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, xfs, Jan Kara

On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
> On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> > dax_writeback_mapping_range() needs a struct block_device, and it used to
> > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> > block devices and for XFS real-time files.
> > 
> > Instead, call dax_writeback_mapping_range() directly from the filesystem
> > ->writepages function so that it can supply us with a valid block
> > device. This also fixes DAX code to properly flush caches in response to
> > sync(2).
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/block_dev.c      | 16 +++++++++++++++-
> >  fs/dax.c            | 13 ++++++++-----
> >  fs/ext2/inode.c     | 11 +++++++++++
> >  fs/ext4/inode.c     |  7 +++++++
> >  fs/xfs/xfs_aops.c   |  9 +++++++++
> >  include/linux/dax.h |  6 ++++--
> >  mm/filemap.c        | 12 ++++--------
> >  7 files changed, 58 insertions(+), 16 deletions(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 39b3a17..fc01e43 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
> >  	return try_to_free_buffers(page);
> >  }
> >  
> > +static int blkdev_writepages(struct address_space *mapping,
> > +			     struct writeback_control *wbc)
> > +{
> > +	if (dax_mapping(mapping)) {
> > +		struct block_device *bdev = I_BDEV(mapping->host);
> > +		int error;
> > +
> > +		error = dax_writeback_mapping_range(mapping, bdev, wbc);
> > +		if (error)
> > +			return error;
> > +	}
> > +	return generic_writepages(mapping, wbc);
> > +}
> 
> Can you remind of the reason for calling generic_writepages() on DAX
> enabled address spaces?

Sure.  The initial version of this patch didn't do this, and during testing I
hit a bunch of xfstests failures.  In ext2 at least I believe these were
happening because we were skipping the call into generic_writepages() for DAX
inodes. Without a lot of data to back this up, my guess is that this is due
to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
returns true), but having dirty page cache pages that need to be written back
as part of the writeback.

Changing this so we always call generic_writepages() even in the DAX case
solved the xfstest failures. 

If this sounds incorrect, please let me know and I'll go and gather more data.

- Ross

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
  2016-02-10 22:43     ` Ross Zwisler
@ 2016-02-10 23:44       ` Dave Chinner
  2016-02-11 12:50       ` Jan Kara
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2016-02-10 23:44 UTC (permalink / raw)
  To: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, xfs, Jan Kara

On Wed, Feb 10, 2016 at 03:43:40PM -0700, Ross Zwisler wrote:
> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> > > dax_writeback_mapping_range() needs a struct block_device, and it used to
> > > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> > > block devices and for XFS real-time files.
> > > 
> > > Instead, call dax_writeback_mapping_range() directly from the filesystem
> > > ->writepages function so that it can supply us with a valid block
> > > device. This also fixes DAX code to properly flush caches in response to
> > > sync(2).
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  fs/block_dev.c      | 16 +++++++++++++++-
> > >  fs/dax.c            | 13 ++++++++-----
> > >  fs/ext2/inode.c     | 11 +++++++++++
> > >  fs/ext4/inode.c     |  7 +++++++
> > >  fs/xfs/xfs_aops.c   |  9 +++++++++
> > >  include/linux/dax.h |  6 ++++--
> > >  mm/filemap.c        | 12 ++++--------
> > >  7 files changed, 58 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 39b3a17..fc01e43 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
> > >  	return try_to_free_buffers(page);
> > >  }
> > >  
> > > +static int blkdev_writepages(struct address_space *mapping,
> > > +			     struct writeback_control *wbc)
> > > +{
> > > +	if (dax_mapping(mapping)) {
> > > +		struct block_device *bdev = I_BDEV(mapping->host);
> > > +		int error;
> > > +
> > > +		error = dax_writeback_mapping_range(mapping, bdev, wbc);
> > > +		if (error)
> > > +			return error;
> > > +	}
> > > +	return generic_writepages(mapping, wbc);
> > > +}
> > 
> > Can you remind of the reason for calling generic_writepages() on DAX
> > enabled address spaces?
> 
> Sure.  The initial version of this patch didn't do this, and during testing I
> hit a bunch of xfstests failures.  In ext2 at least I believe these were
> happening because we were skipping the call into generic_writepages() for DAX
> inodes. Without a lot of data to back this up, my guess is that this is due
> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
> returns true), but having dirty page cache pages that need to be written back
> as part of the writeback.

Hmmm - the ext2 filesystem metadata uses the block device page cache
to buffer inode writeback, and so writeback doesn't occur until
sync_blockdev() is called.

But the data access should be through the ext2 inode address space,
not the block device address space, so DAX flushing occurs in
ext2_writepages. So how is the block device inode being marked as
a DAX inode?

If it is being marked as a DAX inode, how is this valid when the
filesystem metadata uses bufferheads and requires struct pages to be
found in the block device mapping tree?  e.g. mkfs writes the
metadata into the bdev via DAX, resulting in an DAX exceptional
entry in the bdev radix tree, then __bread_gfp() comes along to read
the same metadata after mount and expects to find pages in the
blockdev radix tree?

FWIW, this seems to be specifically a block device inode issue,
though, not something that affects regular files in a filesystem.
i.e. filesystem inodes can only be either DAX or non-DAX, and so
there is no mixed mode flushing required, right?

> Changing this so we always call generic_writepages() even in the
> DAX case solved the xfstest failures. 
> 
> If this sounds incorrect, please let me know and I'll go and
> gather more data.

It seems to me that there's a problem here with DAX on block device
inodes, but not for the filesystem mappings. At minimum, the block
device needs a bloody big comment explaining this landmine so people
don't forget why it is a special snowflake...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
  2016-02-10 20:48 [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Ross Zwisler
  2016-02-10 20:48 ` [PATCH v2 1/2] dax: supply DAX clearing code with correct bdev Ross Zwisler
  2016-02-10 20:48 ` [PATCH v2 2/2] dax: move writeback calls into the filesystems Ross Zwisler
@ 2016-02-11 12:43 ` Jan Kara
  2016-02-11 19:49   ` Ross Zwisler
  2016-02-12 19:03   ` Ross Zwisler
  2 siblings, 2 replies; 20+ messages in thread
From: Jan Kara @ 2016-02-11 12:43 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Andrew Morton, Dan Williams, Dave Chinner, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, xfs

On Wed 10-02-16 13:48:54, Ross Zwisler wrote:
> During testing of raw block devices + DAX I noticed that the struct
> block_device that we were using for DAX operations was incorrect.  For the
> fault handlers, etc. we can just get the correct bdev via get_block(),
> which is passed in as a function pointer, but for the *sync code and for
> sector zeroing we don't have access to get_block().  This is also an issue
> for XFS real-time devices, whenever we get those working.
> 
> Patch one of this series fixes the DAX sector zeroing code by explicitly
> passing in a valid struct block_device.
> 
> Patch two of this series fixes DAX *sync support by moving calls to
> dax_writeback_mapping_range() out of filemap_write_and_wait_range() and
> into the filesystem/block device ->writepages function so that it can
> supply us with a valid block device. This also fixes DAX code to properly
> flush caches in response to sync(2).
> 
> Thanks to Jan Kara for his initial draft of patch 2:
> https://lkml.org/lkml/2016/2/9/485
> 
> Here are the changes that I've made to that patch:
> 
> 1) For DAX mappings, only return after calling
> dax_writeback_mapping_range() if we encountered an error.  In the non-error
> case we still need to write back normal pages, else we lose metadata
> updates. 
> 
> 2) In dax_writeback_mapping_range(), move the new check for 
>         if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
> above the i_blkbits check.  In my testing I found cases where
> dax_writeback_mapping_range() was called for inodes with i_blkbits !=
> PAGE_SHIFT - I'm assuming these are internal metadata inodes?  They have no
> exceptional DAX entries to flush, so we have no work to do, but if we
> return error from the i_blkbits check we will fail the overall writeback
> operation.  Please let me know if it seems wrong for us to be seeing inodes
> set to use DAX but with i_blkbits != PAGE_SHIFT and I'll get more info.

So I'm wondering - how come S_DAX flag got set for inode where i_blkbis !=
PAGE_SHIFT? That would seem to be a bug? I specifically ordered the checks
like this to catch such issues.

> 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue
> the writeback in the case that DAX is enabled but we only have a nonzero
> mapping->nrpages.  As with 1) and 2), I believe this is necessary to
> properly writeback metadata changes.  If this sounds wrong, please let me
> know and I'll get more info.

And I'm surprised here as well. If there are dax_mapping() inodes that have
pagecache pages, then we have issues with radix tree handling as well. So
how come dax_mapping() inodes have pages attached? If it is about block
device inodes, then I find it buggy, that S_DAX gets set for such inodes
when filesystem is mounted on them because in such cases we are IMO asking
for data corruption sooner rather than later...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
  2016-02-10 22:43     ` Ross Zwisler
  2016-02-10 23:44       ` Dave Chinner
@ 2016-02-11 12:50       ` Jan Kara
  2016-02-11 15:22         ` Dan Williams
  1 sibling, 1 reply; 20+ messages in thread
From: Jan Kara @ 2016-02-11 12:50 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Dave Chinner, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, xfs, Jan Kara

On Wed 10-02-16 15:43:40, Ross Zwisler wrote:
> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> > > dax_writeback_mapping_range() needs a struct block_device, and it used to
> > > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> > > block devices and for XFS real-time files.
> > > 
> > > Instead, call dax_writeback_mapping_range() directly from the filesystem
> > > ->writepages function so that it can supply us with a valid block
> > > device. This also fixes DAX code to properly flush caches in response to
> > > sync(2).
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  fs/block_dev.c      | 16 +++++++++++++++-
> > >  fs/dax.c            | 13 ++++++++-----
> > >  fs/ext2/inode.c     | 11 +++++++++++
> > >  fs/ext4/inode.c     |  7 +++++++
> > >  fs/xfs/xfs_aops.c   |  9 +++++++++
> > >  include/linux/dax.h |  6 ++++--
> > >  mm/filemap.c        | 12 ++++--------
> > >  7 files changed, 58 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 39b3a17..fc01e43 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
> > >  	return try_to_free_buffers(page);
> > >  }
> > >  
> > > +static int blkdev_writepages(struct address_space *mapping,
> > > +			     struct writeback_control *wbc)
> > > +{
> > > +	if (dax_mapping(mapping)) {
> > > +		struct block_device *bdev = I_BDEV(mapping->host);
> > > +		int error;
> > > +
> > > +		error = dax_writeback_mapping_range(mapping, bdev, wbc);
> > > +		if (error)
> > > +			return error;
> > > +	}
> > > +	return generic_writepages(mapping, wbc);
> > > +}
> > 
> > Can you remind of the reason for calling generic_writepages() on DAX
> > enabled address spaces?
> 
> Sure.  The initial version of this patch didn't do this, and during testing I
> hit a bunch of xfstests failures.  In ext2 at least I believe these were
> happening because we were skipping the call into generic_writepages() for DAX
> inodes. Without a lot of data to back this up, my guess is that this is due
> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
> returns true), but having dirty page cache pages that need to be written back
> as part of the writeback.
> 
> Changing this so we always call generic_writepages() even in the DAX case
> solved the xfstest failures. 
> 
> If this sounds incorrect, please let me know and I'll go and gather more data.

So I think a more correct fix it to not set S_DAX for inodes that will have
any pagecache pages - e.g. don't set S_DAX for block device inodes when
filesystem is mounted on it (probably the easiest is to just refuse to
mount filesystem on block device which has S_DAX set).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
  2016-02-11 12:50       ` Jan Kara
@ 2016-02-11 15:22         ` Dan Williams
  2016-02-11 16:22           ` Jan Kara
  2016-02-11 20:46           ` Dave Chinner
  0 siblings, 2 replies; 20+ messages in thread
From: Dan Williams @ 2016-02-11 15:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Dave Chinner, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm@lists.01.org, XFS Developers

On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 10-02-16 15:43:40, Ross Zwisler wrote:
>> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
>> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
>> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
>> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
>> > > dax_writeback_mapping_range() needs a struct block_device, and it used to
>> > > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
>> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
>> > > block devices and for XFS real-time files.
>> > >
>> > > Instead, call dax_writeback_mapping_range() directly from the filesystem
>> > > ->writepages function so that it can supply us with a valid block
>> > > device. This also fixes DAX code to properly flush caches in response to
>> > > sync(2).
>> > >
>> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> > > Signed-off-by: Jan Kara <jack@suse.cz>
>> > > ---
>> > >  fs/block_dev.c      | 16 +++++++++++++++-
>> > >  fs/dax.c            | 13 ++++++++-----
>> > >  fs/ext2/inode.c     | 11 +++++++++++
>> > >  fs/ext4/inode.c     |  7 +++++++
>> > >  fs/xfs/xfs_aops.c   |  9 +++++++++
>> > >  include/linux/dax.h |  6 ++++--
>> > >  mm/filemap.c        | 12 ++++--------
>> > >  7 files changed, 58 insertions(+), 16 deletions(-)
>> > >
>> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
>> > > index 39b3a17..fc01e43 100644
>> > > --- a/fs/block_dev.c
>> > > +++ b/fs/block_dev.c
>> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
>> > >   return try_to_free_buffers(page);
>> > >  }
>> > >
>> > > +static int blkdev_writepages(struct address_space *mapping,
>> > > +                      struct writeback_control *wbc)
>> > > +{
>> > > + if (dax_mapping(mapping)) {
>> > > +         struct block_device *bdev = I_BDEV(mapping->host);
>> > > +         int error;
>> > > +
>> > > +         error = dax_writeback_mapping_range(mapping, bdev, wbc);
>> > > +         if (error)
>> > > +                 return error;
>> > > + }
>> > > + return generic_writepages(mapping, wbc);
>> > > +}
>> >
>> > Can you remind of the reason for calling generic_writepages() on DAX
>> > enabled address spaces?
>>
>> Sure.  The initial version of this patch didn't do this, and during testing I
>> hit a bunch of xfstests failures.  In ext2 at least I believe these were
>> happening because we were skipping the call into generic_writepages() for DAX
>> inodes. Without a lot of data to back this up, my guess is that this is due
>> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
>> returns true), but having dirty page cache pages that need to be written back
>> as part of the writeback.
>>
>> Changing this so we always call generic_writepages() even in the DAX case
>> solved the xfstest failures.
>>
>> If this sounds incorrect, please let me know and I'll go and gather more data.
>
> So I think a more correct fix it to not set S_DAX for inodes that will have
> any pagecache pages - e.g. don't set S_DAX for block device inodes when
> filesystem is mounted on it (probably the easiest is to just refuse to
> mount filesystem on block device which has S_DAX set).

I think we have a wider problem here.  See __blkdev_get, we set S_DAX
on all block devices that have ->direct_access() and have a
page-aligned starting address.  It seems to me we need to modify the
metadata i/o paths to bypass the page cache, or teach the fsync code
how to flush populated data pages out of the radix.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
  2016-02-11 15:22         ` Dan Williams
@ 2016-02-11 16:22           ` Jan Kara
  2016-02-11 20:46           ` Dave Chinner
  1 sibling, 0 replies; 20+ messages in thread
From: Jan Kara @ 2016-02-11 16:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Ross Zwisler, Dave Chinner, linux-kernel,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm@lists.01.org, XFS Developers

On Thu 11-02-16 07:22:00, Dan Williams wrote:
> On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 10-02-16 15:43:40, Ross Zwisler wrote:
> >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
> >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
> >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to
> >> > > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> >> > > block devices and for XFS real-time files.
> >> > >
> >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem
> >> > > ->writepages function so that it can supply us with a valid block
> >> > > device. This also fixes DAX code to properly flush caches in response to
> >> > > sync(2).
> >> > >
> >> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> > > Signed-off-by: Jan Kara <jack@suse.cz>
> >> > > ---
> >> > >  fs/block_dev.c      | 16 +++++++++++++++-
> >> > >  fs/dax.c            | 13 ++++++++-----
> >> > >  fs/ext2/inode.c     | 11 +++++++++++
> >> > >  fs/ext4/inode.c     |  7 +++++++
> >> > >  fs/xfs/xfs_aops.c   |  9 +++++++++
> >> > >  include/linux/dax.h |  6 ++++--
> >> > >  mm/filemap.c        | 12 ++++--------
> >> > >  7 files changed, 58 insertions(+), 16 deletions(-)
> >> > >
> >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> >> > > index 39b3a17..fc01e43 100644
> >> > > --- a/fs/block_dev.c
> >> > > +++ b/fs/block_dev.c
> >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
> >> > >   return try_to_free_buffers(page);
> >> > >  }
> >> > >
> >> > > +static int blkdev_writepages(struct address_space *mapping,
> >> > > +                      struct writeback_control *wbc)
> >> > > +{
> >> > > + if (dax_mapping(mapping)) {
> >> > > +         struct block_device *bdev = I_BDEV(mapping->host);
> >> > > +         int error;
> >> > > +
> >> > > +         error = dax_writeback_mapping_range(mapping, bdev, wbc);
> >> > > +         if (error)
> >> > > +                 return error;
> >> > > + }
> >> > > + return generic_writepages(mapping, wbc);
> >> > > +}
> >> >
> >> > Can you remind of the reason for calling generic_writepages() on DAX
> >> > enabled address spaces?
> >>
> >> Sure.  The initial version of this patch didn't do this, and during testing I
> >> hit a bunch of xfstests failures.  In ext2 at least I believe these were
> >> happening because we were skipping the call into generic_writepages() for DAX
> >> inodes. Without a lot of data to back this up, my guess is that this is due
> >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
> >> returns true), but having dirty page cache pages that need to be written back
> >> as part of the writeback.
> >>
> >> Changing this so we always call generic_writepages() even in the DAX case
> >> solved the xfstest failures.
> >>
> >> If this sounds incorrect, please let me know and I'll go and gather more data.
> >
> > So I think a more correct fix it to not set S_DAX for inodes that will have
> > any pagecache pages - e.g. don't set S_DAX for block device inodes when
> > filesystem is mounted on it (probably the easiest is to just refuse to
> > mount filesystem on block device which has S_DAX set).
> 
> I think we have a wider problem here.  See __blkdev_get, we set S_DAX
> on all block devices that have ->direct_access() and have a
> page-aligned starting address.  It seems to me we need to modify the
> metadata i/o paths to bypass the page cache

Heh, no way to do that easily. All the journalling machinery depends on
buffers and pages...

>, or teach the fsync code
> how to flush populated data pages out of the radix.

This might be doable but it will be difficult to avoid aliasing issues and
data corruption. And mainly I don't see the point: When you mount a
filesystem on top of block device, you do not want to mess with the block
device directly, even less using DAX. So we just have to find a way how to
set S_DAX for normal open but clear it from fs path. At worst, we could
clear S_DAX on the block device in mount_bdev() or something like that...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
  2016-02-11 12:43 ` [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Jan Kara
@ 2016-02-11 19:49   ` Ross Zwisler
  2016-02-11 20:50     ` Dave Chinner
  2016-02-12 19:03   ` Ross Zwisler
  1 sibling, 1 reply; 20+ messages in thread
From: Ross Zwisler @ 2016-02-11 19:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, xfs

On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote:
> On Wed 10-02-16 13:48:54, Ross Zwisler wrote:
> > During testing of raw block devices + DAX I noticed that the struct
> > block_device that we were using for DAX operations was incorrect.  For the
> > fault handlers, etc. we can just get the correct bdev via get_block(),
> > which is passed in as a function pointer, but for the *sync code and for
> > sector zeroing we don't have access to get_block().  This is also an issue
> > for XFS real-time devices, whenever we get those working.
> > 
> > Patch one of this series fixes the DAX sector zeroing code by explicitly
> > passing in a valid struct block_device.
> > 
> > Patch two of this series fixes DAX *sync support by moving calls to
> > dax_writeback_mapping_range() out of filemap_write_and_wait_range() and
> > into the filesystem/block device ->writepages function so that it can
> > supply us with a valid block device. This also fixes DAX code to properly
> > flush caches in response to sync(2).
> > 
> > Thanks to Jan Kara for his initial draft of patch 2:
> > https://lkml.org/lkml/2016/2/9/485
> > 
> > Here are the changes that I've made to that patch:
> > 
> > 1) For DAX mappings, only return after calling
> > dax_writeback_mapping_range() if we encountered an error.  In the non-error
> > case we still need to write back normal pages, else we lose metadata
> > updates. 
> > 
> > 2) In dax_writeback_mapping_range(), move the new check for 
> >         if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
> > above the i_blkbits check.  In my testing I found cases where
> > dax_writeback_mapping_range() was called for inodes with i_blkbits !=
> > PAGE_SHIFT - I'm assuming these are internal metadata inodes?  They have no
> > exceptional DAX entries to flush, so we have no work to do, but if we
> > return error from the i_blkbits check we will fail the overall writeback
> > operation.  Please let me know if it seems wrong for us to be seeing inodes
> > set to use DAX but with i_blkbits != PAGE_SHIFT and I'll get more info.
> 
> So I'm wondering - how come S_DAX flag got set for inode where i_blkbis !=
> PAGE_SHIFT? That would seem to be a bug? I specifically ordered the checks
> like this to catch such issues.

I've isolated this one - this happens for all three filesystems (ext2, ext4 &
XFS), and does indeed have to do with the fact that S_DAX is set for
bdev->bd_inode.

Here is one failure path:

[  102.866637]  [<ffffffff81576d93>] dump_stack+0x85/0xc2
[  102.867101]  [<ffffffff812b9ee0>] dax_writeback_mapping_range+0x60/0xe0
[  102.867738]  [<ffffffff812a1d4f>] blkdev_writepages+0x3f/0x50
[  102.868272]  [<ffffffff811db011>] do_writepages+0x21/0x30
[  102.868784]  [<ffffffff811cb6a6>] __filemap_fdatawrite_range+0xc6/0x100
[  102.869378]  [<ffffffff811cb75a>] filemap_write_and_wait+0x4a/0xa0
[  102.869933]  [<ffffffff812a15e0>] set_blocksize+0x70/0xd0
[  102.870424]  [<ffffffff812a273d>] sb_set_blocksize+0x1d/0x50
[  102.870933]  [<ffffffff8132ac9b>] ext4_fill_super+0x75b/0x3360
[  102.871487]  [<ffffffff81583381>] ? vsnprintf+0x201/0x4c0
[  102.872005]  [<ffffffff815836d9>] ? snprintf+0x49/0x60
[  102.872499]  [<ffffffff81263010>] mount_bdev+0x180/0x1b0
[  102.872981]  [<ffffffff8132a540>] ? ext4_calculate_overhead+0x370/0x370
[  102.873580]  [<ffffffff8131ad95>] ext4_mount+0x15/0x20
[  102.874042]  [<ffffffff81263908>] mount_fs+0x38/0x170
[  102.874524]  [<ffffffff812839db>] vfs_kern_mount+0x6b/0x150
[  102.875041]  [<ffffffff8128670f>] do_mount+0x24f/0xe90
[  102.875508]  [<ffffffff81284444>] ? mntput+0x24/0x40
[  102.875958]  [<ffffffff812399ba>] ? __kmalloc_track_caller+0xea/0x240
[  102.876542]  [<ffffffff812862bc>] ? copy_mount_options+0x2c/0x210
[  102.877087]  [<ffffffff81287695>] SyS_mount+0x95/0xe0
[  102.877573]  [<ffffffff81a6af72>] entry_SYSCALL_64_fastpath+0x12/0x76

In set_blocksize() we are actually updating bdev->bd_inode->i_blkbits to be
12, but before that happens we do a sync_blockdev() with i_blkbits at 10,
which causes the failure.  This can be reproduced easily just by mounting an
ext2 or ext4 filesystem.

I think the plan of unsetting S_DAX on bdev->bd_inode when we mount will save
us from this, as long as we do it super early in the mount process.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
  2016-02-11 15:22         ` Dan Williams
  2016-02-11 16:22           ` Jan Kara
@ 2016-02-11 20:46           ` Dave Chinner
  2016-02-11 20:58             ` Dan Williams
  1 sibling, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2016-02-11 20:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Ross Zwisler, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm@lists.01.org, XFS Developers

On Thu, Feb 11, 2016 at 07:22:00AM -0800, Dan Williams wrote:
> On Thu, Feb 11, 2016 at 4:50 AM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 10-02-16 15:43:40, Ross Zwisler wrote:
> >> On Thu, Feb 11, 2016 at 09:03:12AM +1100, Dave Chinner wrote:
> >> > On Wed, Feb 10, 2016 at 01:48:56PM -0700, Ross Zwisler wrote:
> >> > > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> >> > > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> >> > > dax_writeback_mapping_range() needs a struct block_device, and it used to
> >> > > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> >> > > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> >> > > block devices and for XFS real-time files.
> >> > >
> >> > > Instead, call dax_writeback_mapping_range() directly from the filesystem
> >> > > ->writepages function so that it can supply us with a valid block
> >> > > device. This also fixes DAX code to properly flush caches in response to
> >> > > sync(2).
> >> > >
> >> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> > > Signed-off-by: Jan Kara <jack@suse.cz>
> >> > > ---
> >> > >  fs/block_dev.c      | 16 +++++++++++++++-
> >> > >  fs/dax.c            | 13 ++++++++-----
> >> > >  fs/ext2/inode.c     | 11 +++++++++++
> >> > >  fs/ext4/inode.c     |  7 +++++++
> >> > >  fs/xfs/xfs_aops.c   |  9 +++++++++
> >> > >  include/linux/dax.h |  6 ++++--
> >> > >  mm/filemap.c        | 12 ++++--------
> >> > >  7 files changed, 58 insertions(+), 16 deletions(-)
> >> > >
> >> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> >> > > index 39b3a17..fc01e43 100644
> >> > > --- a/fs/block_dev.c
> >> > > +++ b/fs/block_dev.c
> >> > > @@ -1693,13 +1693,27 @@ static int blkdev_releasepage(struct page *page, gfp_t wait)
> >> > >   return try_to_free_buffers(page);
> >> > >  }
> >> > >
> >> > > +static int blkdev_writepages(struct address_space *mapping,
> >> > > +                      struct writeback_control *wbc)
> >> > > +{
> >> > > + if (dax_mapping(mapping)) {
> >> > > +         struct block_device *bdev = I_BDEV(mapping->host);
> >> > > +         int error;
> >> > > +
> >> > > +         error = dax_writeback_mapping_range(mapping, bdev, wbc);
> >> > > +         if (error)
> >> > > +                 return error;
> >> > > + }
> >> > > + return generic_writepages(mapping, wbc);
> >> > > +}
> >> >
> >> > Can you remind of the reason for calling generic_writepages() on DAX
> >> > enabled address spaces?
> >>
> >> Sure.  The initial version of this patch didn't do this, and during testing I
> >> hit a bunch of xfstests failures.  In ext2 at least I believe these were
> >> happening because we were skipping the call into generic_writepages() for DAX
> >> inodes. Without a lot of data to back this up, my guess is that this is due
> >> to metadata inodes or something being marked as DAX (so dax_mapping(mapping)
> >> returns true), but having dirty page cache pages that need to be written back
> >> as part of the writeback.
> >>
> >> Changing this so we always call generic_writepages() even in the DAX case
> >> solved the xfstest failures.
> >>
> >> If this sounds incorrect, please let me know and I'll go and gather more data.
> >
> > So I think a more correct fix it to not set S_DAX for inodes that will have
> > any pagecache pages - e.g. don't set S_DAX for block device inodes when
> > filesystem is mounted on it (probably the easiest is to just refuse to
> > mount filesystem on block device which has S_DAX set).
> 
> I think we have a wider problem here.  See __blkdev_get, we set S_DAX
> on all block devices that have ->direct_access() and have a
> page-aligned starting address.

That's seeming like a premature optimisation to me now. I didn't
say anything at the time because I was busy with other things and it
didn't affect XFS.

> It seems to me we need to modify the
> metadata i/o paths to bypass the page cache,

XFS doesn't use the block device page cache for it's metadata - it
has it's own internal metadata cache structures and uses get_pages
or heap memory to back it's metadata. But that doesn't make mixing
DAX and pages in the block device mapping tree sane.

What you are missing here is that the underlying architecture of
journalling filesystems mean they can't use DAX for their metadata.
Modifications have to be buffered, because they have to be written
to the journal first before they are written back in place. IOWs, we
need to buffer changes in volatile memory for some time, and that
means we can't use DAX during transactional modifications.

And to put the final nail in that coffin, metadata in XFS can be
discontiguous multi-block objects - in those situations we vmap the
underlying pages so they appear to the code to be a contiguous
buffer, and that's something we can't do with DAX....

> or teach the fsync code
> how to flush populated data pages out of the radix.

That doesn't solve the problem. Filesystems free and reallocate
filesystem blocks without intermediate block device mapping
invalidation calls, so what is one minute a data block accessed by
DAX may become a metadata block that accessed via buffered IO.  It
all goes to crap very quickly....

However, I'd say fsync is not the place to address this. This block
device cache aliasing issue is supposed to be what
unmap_underlying_metadata() solves, right?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
  2016-02-11 19:49   ` Ross Zwisler
@ 2016-02-11 20:50     ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2016-02-11 20:50 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, xfs

On Thu, Feb 11, 2016 at 12:49:22PM -0700, Ross Zwisler wrote:
> I think the plan of unsetting S_DAX on bdev->bd_inode when we mount will save
> us from this, as long as we do it super early in the mount process.

I think that S_DAX should not be set on the block device by default
in the first place. If we've been surprised by unexpected behaviour,
then I'm sure there are going to be other surprises waiting for us.
DAX default policy should be opt-in, not opt-out.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
  2016-02-11 20:46           ` Dave Chinner
@ 2016-02-11 20:58             ` Dan Williams
  2016-02-11 22:46               ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Williams @ 2016-02-11 20:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Ross Zwisler, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm@lists.01.org, XFS Developers

On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@fromorbit.com> wrote:
[..]
>> It seems to me we need to modify the
>> metadata i/o paths to bypass the page cache,
>
> XFS doesn't use the block device page cache for it's metadata - it
> has it's own internal metadata cache structures and uses get_pages
> or heap memory to back it's metadata. But that doesn't make mixing
> DAX and pages in the block device mapping tree sane.
>
> What you are missing here is that the underlying architecture of
> journalling filesystems mean they can't use DAX for their metadata.
> Modifications have to be buffered, because they have to be written
> to the journal first before they are written back in place. IOWs, we
> need to buffer changes in volatile memory for some time, and that
> means we can't use DAX during transactional modifications.
>
> And to put the final nail in that coffin, metadata in XFS can be
> discontiguous multi-block objects - in those situations we vmap the
> underlying pages so they appear to the code to be a contiguous
> buffer, and that's something we can't do with DAX....

Sorry, I wasn't clear when I said "bypass page cache" I meant a
solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition
table reads".  However, I suspect that is broken if the filesystem is
not ready to see a new page allocated for every I/O.  I assume one
thread will want to insert a page in the radix for another thread to
find/manipulate before metadata gets written back to storage.

>> or teach the fsync code
>> how to flush populated data pages out of the radix.
>
> That doesn't solve the problem. Filesystems free and reallocate
> filesystem blocks without intermediate block device mapping
> invalidation calls, so what is one minute a data block accessed by
> DAX may become a metadata block that accessed via buffered IO.  It
> all goes to crap very quickly....
>
> However, I'd say fsync is not the place to address this. This block
> device cache aliasing issue is supposed to be what
> unmap_underlying_metadata() solves, right?

I'll take a look at this.  Right now I'm trying to implement the
"clear block-device-inode S_DAX on fs mount" approach.  My concern
though is that  we need to disable block device mmap while a
filesystem is mounted...

Maybe I don't need to worry because it's already the case that a mmap
of the raw device may not see the most up to date data for a file that
has dirty fs-page-cache data.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
  2016-02-11 20:58             ` Dan Williams
@ 2016-02-11 22:46               ` Dave Chinner
  2016-02-11 22:59                 ` Dan Williams
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2016-02-11 22:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Ross Zwisler, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm@lists.01.org, XFS Developers

On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote:
> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> [..]
> >> It seems to me we need to modify the
> >> metadata i/o paths to bypass the page cache,
> >
> > XFS doesn't use the block device page cache for it's metadata - it
> > has it's own internal metadata cache structures and uses get_pages
> > or heap memory to back it's metadata. But that doesn't make mixing
> > DAX and pages in the block device mapping tree sane.
> >
> > What you are missing here is that the underlying architecture of
> > journalling filesystems mean they can't use DAX for their metadata.
> > Modifications have to be buffered, because they have to be written
> > to the journal first before they are written back in place. IOWs, we
> > need to buffer changes in volatile memory for some time, and that
> > means we can't use DAX during transactional modifications.
> >
> > And to put the final nail in that coffin, metadata in XFS can be
> > discontiguous multi-block objects - in those situations we vmap the
> > underlying pages so they appear to the code to be a contiguous
> > buffer, and that's something we can't do with DAX....
> 
> Sorry, I wasn't clear when I said "bypass page cache" I meant a
> solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition
> table reads".

So there's already bandaids to prevent bad shit from happening in
the block layer, let alone when we consider all the ways that
userspace can screw this all up.

> However, I suspect that is broken if the filesystem is not ready
> to see a new page allocated for every I/O.  I assume one
> thread will want to insert a page in the radix for another thread
> to find/manipulate before metadata gets written back to storage.

Right, you can't do that, especially as the struct page has a 1-1
relationship with the bufferhead that is attached to it as the
bufferhead carries the filesystem state for the given cached page.

> >> or teach the fsync code how to flush populated data pages out
> >> of the radix.
> >
> > That doesn't solve the problem. Filesystems free and reallocate
> > filesystem blocks without intermediate block device mapping
> > invalidation calls, so what is one minute a data block accessed
> > by DAX may become a metadata block that accessed via buffered
> > IO.  It all goes to crap very quickly....
> >
> > However, I'd say fsync is not the place to address this. This
> > block device cache aliasing issue is supposed to be what
> > unmap_underlying_metadata() solves, right?
> 
> I'll take a look at this.  Right now I'm trying to implement the
> "clear block-device-inode S_DAX on fs mount" approach.  My concern
> though is that  we need to disable block device mmap while a
> filesystem is mounted...

/me chokes on his coffee.

When did mmaping the block device behind the back of a mounted
fileystem become a valid use case? It's not supported for normal
block devices and for the same reasons it won't be supported for DAX
enabled block devices, either. i.e. I'm going to tell anyone who has
an application that does this to go and take a hike when (not if!)
they report filesystem corruption problems.

> Maybe I don't need to worry because it's already the case that a
> mmap of the raw device may not see the most up to date data for a
> file that has dirty fs-page-cache data.

It goes both ways. What happens if mkfs or fsck modifies the
block device via mmap+DAX and then the filesystem mounts the block
device and tries to read that metadata via the block device page
cache?

Quite frankly, DAX on the block device is a can of worms we really
don't need to deal with right now. IMO it's a solution looking for a
problem to solve, the "default to on" policy is wrong (DAX is
opt-in, not opt-out) and given this we should turn it off until
we've solved the more important problems we need to solve. i.e. We
need to concentrate on getting data integrity working correctly
first, then address the cache aliasing issues, then address the
"safe access" issues, and then we can re-introduce block device DAX
access...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
  2016-02-11 22:46               ` Dave Chinner
@ 2016-02-11 22:59                 ` Dan Williams
  2016-02-11 23:44                   ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Williams @ 2016-02-11 22:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Ross Zwisler, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm@lists.01.org, XFS Developers

On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote:
>> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@fromorbit.com> wrote:
>> [..]
>> >> It seems to me we need to modify the
>> >> metadata i/o paths to bypass the page cache,
>> >
>> > XFS doesn't use the block device page cache for it's metadata - it
>> > has it's own internal metadata cache structures and uses get_pages
>> > or heap memory to back it's metadata. But that doesn't make mixing
>> > DAX and pages in the block device mapping tree sane.
>> >
>> > What you are missing here is that the underlying architecture of
>> > journalling filesystems mean they can't use DAX for their metadata.
>> > Modifications have to be buffered, because they have to be written
>> > to the journal first before they are written back in place. IOWs, we
>> > need to buffer changes in volatile memory for some time, and that
>> > means we can't use DAX during transactional modifications.
>> >
>> > And to put the final nail in that coffin, metadata in XFS can be
>> > discontiguous multi-block objects - in those situations we vmap the
>> > underlying pages so they appear to the code to be a contiguous
>> > buffer, and that's something we can't do with DAX....
>>
>> Sorry, I wasn't clear when I said "bypass page cache" I meant a
>> solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition
>> table reads".
>
> So there's already bandaids to prevent bad shit from happening in
> the block layer, let alone when we consider all the ways that
> userspace can screw this all up.
>
>> However, I suspect that is broken if the filesystem is not ready
>> to see a new page allocated for every I/O.  I assume one
>> thread will want to insert a page in the radix for another thread
>> to find/manipulate before metadata gets written back to storage.
>
> Right, you can't do that, especially as the struct page has a 1-1
> relationship with the bufferhead that is attached to it as the
> bufferhead carries the filesystem state for the given cached page.
>
>> >> or teach the fsync code how to flush populated data pages out
>> >> of the radix.
>> >
>> > That doesn't solve the problem. Filesystems free and reallocate
>> > filesystem blocks without intermediate block device mapping
>> > invalidation calls, so what is one minute a data block accessed
>> > by DAX may become a metadata block that accessed via buffered
>> > IO.  It all goes to crap very quickly....
>> >
>> > However, I'd say fsync is not the place to address this. This
>> > block device cache aliasing issue is supposed to be what
>> > unmap_underlying_metadata() solves, right?
>>
>> I'll take a look at this.  Right now I'm trying to implement the
>> "clear block-device-inode S_DAX on fs mount" approach.  My concern
>> though is that  we need to disable block device mmap while a
>> filesystem is mounted...
>
> /me chokes on his coffee.
>
> When did mmaping the block device behind the back of a mounted
> fileystem become a valid use case? It's not supported for normal
> block devices and for the same reasons it won't be supported for DAX
> enabled block devices, either. i.e. I'm going to tell anyone who has
> an application that does this to go and take a hike when (not if!)
> they report filesystem corruption problems.

Right, but we need to not confuse the fsync code regardless of how bad
of an idea this is ::-).

>> Maybe I don't need to worry because it's already the case that a
>> mmap of the raw device may not see the most up to date data for a
>> file that has dirty fs-page-cache data.
>
> It goes both ways. What happens if mkfs or fsck modifies the
> block device via mmap+DAX and then the filesystem mounts the block
> device and tries to read that metadata via the block device page
> cache?
>
> Quite frankly, DAX on the block device is a can of worms we really
> don't need to deal with right now. IMO it's a solution looking for a
> problem to solve,

Virtualization use cases want to give large ranges to guest-VMs, and
it is currently the only way to reliably get 1GiB mappings.

> the "default to on" policy is wrong (DAX is
> opt-in, not opt-out) and given this we should turn it off until
> we've solved the more important problems we need to solve. i.e. We
> need to concentrate on getting data integrity working correctly
> first, then address the cache aliasing issues, then address the
> "safe access" issues, and then we can re-introduce block device DAX
> access...

Agreed.

Note that the "default-on policy" came from commit bbab37ddc20b
"block: Add support for DAX reads/writes to block devices" way back in
4.2.  We're just now noticing.  Credit Ross for good sanity checking.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
  2016-02-11 22:59                 ` Dan Williams
@ 2016-02-11 23:44                   ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2016-02-11 23:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Ross Zwisler, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Jan Kara,
	Matthew Wilcox, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm@lists.01.org, XFS Developers

On Thu, Feb 11, 2016 at 02:59:14PM -0800, Dan Williams wrote:
> On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote:
> >> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> Maybe I don't need to worry because it's already the case that a
> >> mmap of the raw device may not see the most up to date data for a
> >> file that has dirty fs-page-cache data.
> >
> > It goes both ways. What happens if mkfs or fsck modifies the
> > block device via mmap+DAX and then the filesystem mounts the block
> > device and tries to read that metadata via the block device page
> > cache?
> >
> > Quite frankly, DAX on the block device is a can of worms we really
> > don't need to deal with right now. IMO it's a solution looking for a
> > problem to solve,
> 
> Virtualization use cases want to give large ranges to guest-VMs, and
> it is currently the only way to reliably get 1GiB mappings.

Precisely my point - block devices are not the best way to solve
this problem.

A file, on XFS, with a 1GB extent size hint and preallocated to be
aligned to 1GB addresses (i.e. mkfs.xfs -d su=1G,sw=1 on the host
filesystem) will give reliable 1GB aligned blocks for DAX mappings,
just like a block device will. Peformance wise it's little different
to using the block device directly. Management wise it's way more
flexible, especially as such image files can be recycled for new VMs
almost instantly via FALLOC_FL_FLAG_ZERO_RANGE.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
  2016-02-11 12:43 ` [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Jan Kara
  2016-02-11 19:49   ` Ross Zwisler
@ 2016-02-12 19:03   ` Ross Zwisler
  2016-02-13  2:38     ` Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Ross Zwisler @ 2016-02-12 19:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, xfs

On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote:
> On Wed 10-02-16 13:48:54, Ross Zwisler wrote:
> > 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue
> > the writeback in the case that DAX is enabled but we only have a nonzero
> > mapping->nrpages.  As with 1) and 2), I believe this is necessary to
> > properly writeback metadata changes.  If this sounds wrong, please let me
> > know and I'll get more info.
> 
> And I'm surprised here as well. If there are dax_mapping() inodes that have
> pagecache pages, then we have issues with radix tree handling as well. So
> how come dax_mapping() inodes have pages attached? If it is about block
> device inodes, then I find it buggy, that S_DAX gets set for such inodes
> when filesystem is mounted on them because in such cases we are IMO asking
> for data corruption sooner rather than later...

I think I've figured this one out, at least partially.

For ext2 the issues I was seeing were due to the fact that directory inodes
have S_DAX set, but have dirty page cache pages.   In testing with
generic/002, I see two ext2 inodes with S_DAX trying to do a writeback while
they have dirty page cache pages.  The first has i_ino=2, which is the
EXT2_ROOT_INO.

The second inode changes from run to run, but for my last run was 155649.  The
test failed because that directory inode was found to be corrupt by fsck.ext2:

	*** fsck.ext2 output ***
	fsck from util-linux 2.26.2
	e2fsck 1.42.12 (29-Aug-2014)
	Pass 1: Checking inodes, blocks, and sizes
	Pass 2: Checking directory structure
	Directory inode 155649, block #0, offset 0: directory corrupted

If I change the code in ext2_writepages() so that it does the
mpage_writepages() even for DAX inodes, all my xfstests pass.  I'm not sure
this is the right fix, though - should it instead be that ext2 directory
inodes don't have S_DAX set?

A similar problem occurs with ext4, though I haven't yet tracked it down to an
inode type.  It could be that ext4 directory inodes have the same issue, and
Eric Sandeen suggested we might also have an issue with XATTRS attached to
inodes.

As with ext2, if I allow the normal writeback to occur in ext4_writepages()
even for DAX inodes, the issues go away, but I'm not sure whether or not this
is the correct fix.

As far as I can see, XFS does not have these issues - returning immediately
having done just the DAX writeback in xfs_vm_writepages() lets all my xfstests
pass.

For v4.5 should I send out an updated version of this series that does the
regular page writeback for ext2 & ext4, or should we work to clear S_DAX for
regular filesystem inodes that have dirty page cache data?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
  2016-02-12 19:03   ` Ross Zwisler
@ 2016-02-13  2:38     ` Dave Chinner
  2016-02-13  4:59       ` Ross Zwisler
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2016-02-13  2:38 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, xfs

On Fri, Feb 12, 2016 at 12:03:20PM -0700, Ross Zwisler wrote:
> On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote:
> > On Wed 10-02-16 13:48:54, Ross Zwisler wrote:
> > > 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue
> > > the writeback in the case that DAX is enabled but we only have a nonzero
> > > mapping->nrpages.  As with 1) and 2), I believe this is necessary to
> > > properly writeback metadata changes.  If this sounds wrong, please let me
> > > know and I'll get more info.
> > 
> > And I'm surprised here as well. If there are dax_mapping() inodes that have
> > pagecache pages, then we have issues with radix tree handling as well. So
> > how come dax_mapping() inodes have pages attached? If it is about block
> > device inodes, then I find it buggy, that S_DAX gets set for such inodes
> > when filesystem is mounted on them because in such cases we are IMO asking
> > for data corruption sooner rather than later...
> 
> I think I've figured this one out, at least partially.
> 
> For ext2 the issues I was seeing were due to the fact that directory inodes
> have S_DAX set, but have dirty page cache pages.   In testing with
> generic/002, I see two ext2 inodes with S_DAX trying to do a writeback while
> they have dirty page cache pages.  The first has i_ino=2, which is the
> EXT2_ROOT_INO.
....
> As far as I can see, XFS does not have these issues - returning immediately
> having done just the DAX writeback in xfs_vm_writepages() lets all my xfstests
> pass.

XFS will not have issues because it does not dirty directory inodes
at the VFS level, nor does it use the page cache for directory data.
However, looking at the code I think it does still set S_DAX on
directory inodes, which it shouldn't be doing.

I've got a couple of fixes I need to do in this area - hopefully
I'll get it done on Monday.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS
  2016-02-13  2:38     ` Dave Chinner
@ 2016-02-13  4:59       ` Ross Zwisler
  0 siblings, 0 replies; 20+ messages in thread
From: Ross Zwisler @ 2016-02-13  4:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, Jan Kara, linux-kernel, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Jan Kara, Matthew Wilcox, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, xfs

On Sat, Feb 13, 2016 at 01:38:49PM +1100, Dave Chinner wrote:
> On Fri, Feb 12, 2016 at 12:03:20PM -0700, Ross Zwisler wrote:
> > On Thu, Feb 11, 2016 at 01:43:04PM +0100, Jan Kara wrote:
> > > On Wed 10-02-16 13:48:54, Ross Zwisler wrote:
> > > > 3) In filemap_write_and_wait() and filemap_write_and_wait_range(), continue
> > > > the writeback in the case that DAX is enabled but we only have a nonzero
> > > > mapping->nrpages.  As with 1) and 2), I believe this is necessary to
> > > > properly writeback metadata changes.  If this sounds wrong, please let me
> > > > know and I'll get more info.
> > > 
> > > And I'm surprised here as well. If there are dax_mapping() inodes that have
> > > pagecache pages, then we have issues with radix tree handling as well. So
> > > how come dax_mapping() inodes have pages attached? If it is about block
> > > device inodes, then I find it buggy, that S_DAX gets set for such inodes
> > > when filesystem is mounted on them because in such cases we are IMO asking
> > > for data corruption sooner rather than later...
> > 
> > I think I've figured this one out, at least partially.
> > 
> > For ext2 the issues I was seeing were due to the fact that directory inodes
> > have S_DAX set, but have dirty page cache pages.   In testing with
> > generic/002, I see two ext2 inodes with S_DAX trying to do a writeback while
> > they have dirty page cache pages.  The first has i_ino=2, which is the
> > EXT2_ROOT_INO.
> ....
> > As far as I can see, XFS does not have these issues - returning immediately
> > having done just the DAX writeback in xfs_vm_writepages() lets all my xfstests
> > pass.
> 
> XFS will not have issues because it does not dirty directory inodes
> at the VFS level, nor does it use the page cache for directory data.
> However, looking at the code I think it does still set S_DAX on
> directory inodes, which it shouldn't be doing.
> 
> I've got a couple of fixes I need to do in this area - hopefully
> I'll get it done on Monday.

Cool.  I've got a quick patch that stops S_DAX from being set on everything
but regular inodes for ext2 and ext4.  This solved a lot of my xfstests
failures.

Even after that I'm seeing two last failures with ext4 - I'll keep working on
those.

- Ross

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-02-13  4:59 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-10 20:48 [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Ross Zwisler
2016-02-10 20:48 ` [PATCH v2 1/2] dax: supply DAX clearing code with correct bdev Ross Zwisler
2016-02-10 20:48 ` [PATCH v2 2/2] dax: move writeback calls into the filesystems Ross Zwisler
2016-02-10 22:03   ` Dave Chinner
2016-02-10 22:43     ` Ross Zwisler
2016-02-10 23:44       ` Dave Chinner
2016-02-11 12:50       ` Jan Kara
2016-02-11 15:22         ` Dan Williams
2016-02-11 16:22           ` Jan Kara
2016-02-11 20:46           ` Dave Chinner
2016-02-11 20:58             ` Dan Williams
2016-02-11 22:46               ` Dave Chinner
2016-02-11 22:59                 ` Dan Williams
2016-02-11 23:44                   ` Dave Chinner
2016-02-11 12:43 ` [PATCH v2 0/2] DAX bdev fixes - move flushing calls to FS Jan Kara
2016-02-11 19:49   ` Ross Zwisler
2016-02-11 20:50     ` Dave Chinner
2016-02-12 19:03   ` Ross Zwisler
2016-02-13  2:38     ` Dave Chinner
2016-02-13  4:59       ` Ross Zwisler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).