* [RFC PATCH 0/6] xfs: DAX support
@ 2015-03-03 23:30 ` Dave Chinner
  0 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2015-03-03 23:30 UTC
  To: xfs; +Cc: linux-fsdevel, jack, willy

Hi folks,

This is the first pass I've made for supporting DAX in XFS. Most
of the XFS changes are straightforward and fairly self-contained,
and mostly seem to work.

The first patch, however, is changing the DAX infrastructure to take
a "block zeroing completion" function for dax_fault() so that we
don't need to abuse the mapping bufferhead to pass the completion
function. This is straightforward for XFS, but the ext4 code is,
well, it's already broken so I don't think I've made anything worse.

I note that Jan Kara is aware of the problems related to the
unwritten extent conversion in ext4, as mentioned here earlier
today on the ext4 list (items 1 and 2):

http://permalink.gmane.org/gmane.comp.file-systems.ext4/47943

So really my only concern here is cleaning up the interface to
remove the mapping bh hack, not whether ext4 works reliably or not.
The XFS implementation does not have any of the problems the ext4
code does.

The rest of the series adds all the DAX hooks into the required code
paths for block zeroing, page fault, io, truncate and finally the
mount path for the dax mount option.

The series passes xfstests without any serious problems - there are
a couple of tests where the extent layout isn't as the tests expect,
but these are minor issues that don't affect correctness.

Comments welcome, especially on the dax_fault callback patch.

-Dave.


* [PATCH 1/6] dax: don't abuse get_block mapping for endio callbacks
  2015-03-03 23:30 ` Dave Chinner
@ 2015-03-03 23:30   ` Dave Chinner
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2015-03-03 23:30 UTC
  To: xfs; +Cc: linux-fsdevel, jack, willy

From: Dave Chinner <dchinner@redhat.com>

dax_fault() currently relies on the get_block callback to attach an
io completion callback to the mapping buffer head so that it can
run unwritten extent conversion after zeroing allocated blocks.

Instead of this hack, pass the conversion callback directly into
dax_fault() similar to the get_block callback. When the filesystem
allocates unwritten extents, it will set the buffer_unwritten()
flag, and hence the dax_fault code can call the completion function
in the contexts where it is necessary without overloading the
mapping buffer head.
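
A minimal user-space sketch of the interface change described above,
not the kernel code. The struct buffer_head, BH_UNWRITTEN,
insert_mapping() and convert_unwritten() below are simplified stand-ins
(assumptions for illustration) for the kernel's buffer head, the
buffer_unwritten() flag, dax_insert_mapping() and a filesystem's
conversion routine. It shows the completion function arriving as an
explicit argument and being called only when the mapping is flagged
unwritten, which is why a filesystem that never reports unwritten
extents can pass NULL.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's struct buffer_head (assumption). */
struct buffer_head {
	unsigned long b_state;    /* mapping state flags */
	unsigned long b_blocknr;  /* mapped block number */
	int converted;            /* mock: set once "conversion" has run */
};

#define BH_UNWRITTEN (1UL << 0) /* hypothetical flag bit for this sketch */

static int buffer_unwritten(const struct buffer_head *bh)
{
	return (bh->b_state & BH_UNWRITTEN) != 0;
}

/* Same shape as the dax_iodone_t typedef added by the patch. */
typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);

/* Mock filesystem callback: mark the extent as converted on success. */
void convert_unwritten(struct buffer_head *bh, int uptodate)
{
	if (!uptodate)
		return;
	bh->b_state &= ~BH_UNWRITTEN;
	bh->converted = 1;
}

/*
 * Mock of the fault path: the completion function is a parameter and is
 * called only for unwritten mappings. It is never read back out of the
 * buffer head.
 */
int insert_mapping(struct buffer_head *bh, dax_iodone_t *complete_unwritten)
{
	/* ... zero the blocks and install the mapping here ... */
	if (buffer_unwritten(bh))
		complete_unwritten(bh, 1);
	return 0;
}
```

Calling insert_mapping() with a buffer head that has BH_UNWRITTEN set
and convert_unwritten as the callback clears the flag; calling it with
a NULL callback is only safe while the flag is never set, which is the
assumption the ext2 call sites rely on.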

Note: The changes to ext4 to use this interface are suspect at best.
In fact, the way ext4 did this end_io assignment in the first place
looks suspect because it only set a completion callback when there
wasn't already some other write() call taking place on the same
inode. The ext4 end_io code looks rather intricate and fragile with
all its reference counting and passing to different contexts for
modification via inode private pointers that aren't protected by
locks...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dax.c           | 15 ++++++++-------
 fs/ext2/file.c     |  4 ++--
 fs/ext4/file.c     | 16 ++++++++++++++--
 fs/ext4/inode.c    | 21 +++++++--------------
 include/linux/fs.h |  6 ++++--
 5 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ed1619e..d7b4dba 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -269,7 +269,8 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
 }
 
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
-			struct vm_area_struct *vma, struct vm_fault *vmf)
+			struct vm_area_struct *vma, struct vm_fault *vmf,
+			dax_iodone_t complete_unwritten)
 {
 	struct address_space *mapping = inode->i_mapping;
 	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
@@ -310,14 +311,14 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
  out:
 	i_mmap_unlock_read(mapping);
 
-	if (bh->b_end_io)
-		bh->b_end_io(bh, 1);
+	if (buffer_unwritten(bh))
+		complete_unwritten(bh, 1);
 
 	return error;
 }
 
 static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
-			get_block_t get_block)
+			get_block_t get_block, dax_iodone_t complete_unwritten)
 {
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
@@ -418,7 +419,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		page_cache_release(page);
 	}
 
-	error = dax_insert_mapping(inode, &bh, vma, vmf);
+	error = dax_insert_mapping(inode, &bh, vma, vmf, complete_unwritten);
 
  out:
 	if (error == -ENOMEM)
@@ -446,7 +447,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
  * fault handler for DAX files.
  */
 int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
-			get_block_t get_block)
+	      get_block_t get_block, dax_iodone_t complete_unwritten)
 {
 	int result;
 	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
@@ -455,7 +456,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
 	}
-	result = do_dax_fault(vma, vmf, get_block);
+	result = do_dax_fault(vma, vmf, get_block, complete_unwritten);
 	if (vmf->flags & FAULT_FLAG_WRITE)
 		sb_end_pagefault(sb);
 
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index e317017..8da747a 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -28,12 +28,12 @@
 #ifdef CONFIG_FS_DAX
 static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_fault(vma, vmf, ext2_get_block);
+	return dax_fault(vma, vmf, ext2_get_block, NULL);
 }
 
 static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_mkwrite(vma, vmf, ext2_get_block);
+	return dax_mkwrite(vma, vmf, ext2_get_block, NULL);
 }
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 33a09da..f7dabb1 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -192,15 +192,27 @@ errout:
 }
 
 #ifdef CONFIG_FS_DAX
+static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
+{
+	struct inode *inode = bh->b_assoc_map->host;
+	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
+	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
+	int err;
+	if (!uptodate)
+		return;
+	WARN_ON(!buffer_unwritten(bh));
+	err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
+}
+
 static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_fault(vma, vmf, ext4_get_block);
+	return dax_fault(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
 					/* Is this the right get_block? */
 }
 
 static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_mkwrite(vma, vmf, ext4_get_block);
+	return dax_mkwrite(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
 }
 
 static const struct vm_operations_struct ext4_dax_vm_ops = {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5cb9a21..43433de 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -657,18 +657,6 @@ has_zeroout:
 	return retval;
 }
 
-static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
-{
-	struct inode *inode = bh->b_assoc_map->host;
-	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
-	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
-	int err;
-	if (!uptodate)
-		return;
-	WARN_ON(!buffer_unwritten(bh));
-	err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
-}
-
 /* Maximum number of blocks we map for direct IO at once. */
 #define DIO_MAX_BLOCKS 4096
 
@@ -706,10 +694,15 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 
 		map_bh(bh, inode->i_sb, map.m_pblk);
 		bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
-		if (IS_DAX(inode) && buffer_unwritten(bh) && !io_end) {
+		if (IS_DAX(inode) && buffer_unwritten(bh)) {
+			/*
+			 * dgc: I suspect unwritten conversion on ext4+DAX is
+			 * fundamentally broken here when there are concurrent
+			 * read/write in progress on this inode.
+			 */
+			WARN_ON_ONCE(io_end);
 			bh->b_assoc_map = inode->i_mapping;
 			bh->b_private = (void *)(unsigned long)iblock;
-			bh->b_end_io = ext4_end_io_unwritten;
 		}
 		if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
 			set_buffer_defer_completion(bh);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 937e280..82100ae 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -70,6 +70,7 @@ typedef int (get_block_t)(struct inode *inode, sector_t iblock,
 			struct buffer_head *bh_result, int create);
 typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 			ssize_t bytes, void *private);
+typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
 
 #define MAY_EXEC		0x00000001
 #define MAY_WRITE		0x00000002
@@ -2603,8 +2604,9 @@ ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
-#define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
+		dax_iodone_t);
+#define dax_mkwrite(vma, vmf, gb, iod)	dax_fault(vma, vmf, gb, iod)
 
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
-- 
2.0.0


* [PATCH 2/6] xfs: add DAX block zeroing support
  2015-03-03 23:30 ` Dave Chinner
@ 2015-03-03 23:30   ` Dave Chinner
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2015-03-03 23:30 UTC
  To: xfs; +Cc: linux-fsdevel, jack, willy

From: Dave Chinner <dchinner@redhat.com>

Add initial support for DAX block zeroing operations to XFS. DAX
cannot use buffered IO through the page cache for zeroing, nor do we
need to issue IO for uncached block zeroing. In both cases, we can
simply call out to the dax block zeroing function.
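
As a rough illustration of the zeroing logic, here is a user-space
sketch under stated assumptions: struct extent, lookup(), zero_range()
and the pmem array are hypothetical names, and a plain byte array
stands in for the DAX backing device. Holes and unwritten extents are
skipped whole because they already read back as zeros; written blocks
are cleared directly in the backing memory, with no page cache and no
IO.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 16u  /* tiny block size so the example is easy to trace */

enum ext_state { EXT_HOLE, EXT_UNWRITTEN, EXT_WRITTEN };

/* Hypothetical extent record: a run of file blocks in one state. */
struct extent {
	size_t startoff;    /* first file block covered */
	size_t blockcount;  /* number of blocks covered */
	size_t startblock;  /* first backing block (unused for holes) */
	enum ext_state state;
};

/* Find the extent covering file block fsb, or NULL if none does. */
static const struct extent *lookup(const struct extent *map, size_t n,
				   size_t fsb)
{
	for (size_t i = 0; i < n; i++)
		if (fsb >= map[i].startoff &&
		    fsb < map[i].startoff + map[i].blockcount)
			return &map[i];
	return NULL;
}

/*
 * Zero file bytes [start, end] (inclusive). Returns the number of bytes
 * actually cleared in the backing memory, or -1 on an unmapped offset.
 */
long zero_range(unsigned char *pmem, const struct extent *map, size_t n,
		size_t start, size_t end)
{
	long zeroed = 0;
	size_t offset = start;

	while (offset <= end) {
		size_t fsb = offset / BLOCK_SIZE;
		const struct extent *ex = lookup(map, n, fsb);
		size_t last, dev;

		if (!ex)
			return -1;

		if (ex->state != EXT_WRITTEN) {
			/* Already reads as zeros: skip the entire extent. */
			offset = (ex->startoff + ex->blockcount) * BLOCK_SIZE;
			continue;
		}

		/* Clamp to the end of this block and of the requested range. */
		last = (fsb + 1) * BLOCK_SIZE - 1;
		if (last > end)
			last = end;

		/* Direct access: write zeros straight into backing memory. */
		dev = (ex->startblock + (fsb - ex->startoff)) * BLOCK_SIZE +
		      offset % BLOCK_SIZE;
		memset(pmem + dev, 0, last - offset + 1);
		zeroed += (long)(last - offset + 1);
		offset = last + 1;
	}
	return zeroed;
}
```

The per-block stepping mirrors the one-block-at-a-time loop in the
patch; a real implementation would also have to deal with locking and
transaction context, which this sketch leaves out.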

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 23 +++++++++++++++++++----
 fs/xfs/xfs_file.c      | 28 +++++++++++++++++-----------
 2 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 1bd5393..d1fe432 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1133,14 +1133,29 @@ xfs_zero_remaining_bytes(
 			break;
 		ASSERT(imap.br_blockcount >= 1);
 		ASSERT(imap.br_startoff == offset_fsb);
+		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
+
+		if (imap.br_startblock == HOLESTARTBLOCK ||
+		    imap.br_state == XFS_EXT_UNWRITTEN) {
+			/* skip the entire extent */
+			lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff +
+						      imap.br_blockcount) - 1;
+			continue;
+		}
+
 		lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff + 1) - 1;
 		if (lastoffset > endoff)
 			lastoffset = endoff;
-		if (imap.br_startblock == HOLESTARTBLOCK)
-			continue;
-		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
-		if (imap.br_state == XFS_EXT_UNWRITTEN)
+
+		/* DAX can just zero the backing device directly */
+		if (IS_DAX(VFS_I(ip))) {
+			error = dax_zero_page_range(VFS_I(ip), offset,
+						    lastoffset - offset + 1,
+						    xfs_get_blocks_dax);
+			if (error)
+				return error;
 			continue;
+		}
 
 		error = xfs_buf_read_uncached(XFS_IS_REALTIME_INODE(ip) ?
 				mp->m_rtdev_targp : mp->m_ddev_targp,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index dc5f609..bc0008f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -97,7 +97,8 @@ xfs_iozero(
 {
 	struct page		*page;
 	struct address_space	*mapping;
-	int			status;
+	int			status = 0;
+
 
 	mapping = VFS_I(ip)->i_mapping;
 	do {
@@ -109,20 +110,25 @@ xfs_iozero(
 		if (bytes > count)
 			bytes = count;
 
-		status = pagecache_write_begin(NULL, mapping, pos, bytes,
-					AOP_FLAG_UNINTERRUPTIBLE,
-					&page, &fsdata);
-		if (status)
-			break;
+		if (IS_DAX(VFS_I(ip)))
+			dax_zero_page_range(VFS_I(ip), pos, bytes,
+						   xfs_get_blocks_dax);
+		else {
+			status = pagecache_write_begin(NULL, mapping, pos, bytes,
+						AOP_FLAG_UNINTERRUPTIBLE,
+						&page, &fsdata);
+			if (status)
+				break;
 
-		zero_user(page, offset, bytes);
+			zero_user(page, offset, bytes);
 
-		status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
-					page, fsdata);
-		WARN_ON(status <= 0); /* can't return less than zero! */
+			status = pagecache_write_end(NULL, mapping, pos, bytes,
+						bytes, page, fsdata);
+			WARN_ON(status <= 0); /* can't return less than zero! */
+			status = 0;
+		}
 		pos += bytes;
 		count -= bytes;
-		status = 0;
 	} while (count);
 
 	return (-status);
-- 
2.0.0


* [PATCH 3/6] xfs: add DAX file operations support
  2015-03-03 23:30 ` Dave Chinner
@ 2015-03-03 23:30   ` Dave Chinner
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2015-03-03 23:30 UTC
  To: xfs; +Cc: linux-fsdevel, jack, willy

From: Dave Chinner <dchinner@redhat.com>

Add the initial support for DAX file operations to XFS. This
includes the necessary block allocation and mmap page fault hooks
for DAX to function.

Note that the current block allocation code abuses the mapping
buffer head to provide a completion callback for unwritten extent
allocation when DAX is clearing blocks. The DAX interface needs to
be changed to provide a callback similar to get_blocks for this
callback.
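
The lifecycle of the completion record can be sketched in user space as
follows. This is an illustration under assumptions, not the XFS code:
struct ioend, struct mapping, attach_ioend(), dax_complete() and the
g_stats counters are hypothetical names, and malloc()/free() stand in
for the ioend mempool. The record captures the offset and size when the
block is mapped, travels through the private pointer, and is released
at completion whether or not the zeroing succeeded.

```c
#include <stdlib.h>

/* Hypothetical per-fault record, loosely modelled on an XFS ioend. */
struct ioend {
	long long io_offset;  /* byte offset of the unwritten range */
	long long io_size;    /* length of the range in bytes */
};

/* Stand-in for the mapping buffer head: only the private pointer matters. */
struct mapping {
	void *b_private;
};

/* Bookkeeping so the sketch can be checked; not part of any real API. */
struct stats {
	int conversions;        /* ranges converted to written */
	int frees;              /* ioend records released */
	long long last_offset;  /* offset of the last converted range */
	long long last_size;    /* size of the last converted range */
};

struct stats g_stats;

/* Mapping time: remember which range will need converting later. */
int attach_ioend(struct mapping *m, long long offset, long long size)
{
	struct ioend *io = malloc(sizeof(*io));

	if (!io)
		return -1;
	io->io_offset = offset;
	io->io_size = size;
	m->b_private = io;
	return 0;
}

/*
 * Completion time: convert only if zeroing succeeded, but release the
 * record on both paths so a failed fault does not leak it.
 */
void dax_complete(struct mapping *m, int uptodate)
{
	struct ioend *io = m->b_private;

	if (uptodate) {
		g_stats.conversions++;
		g_stats.last_offset = io->io_offset;
		g_stats.last_size = io->io_size;
	}
	free(io);
	m->b_private = NULL;
	g_stats.frees++;
}
```

A successful fault calls attach_ioend() and then dax_complete(m, 1); a
fault whose zeroing failed calls dax_complete(m, 0), which skips the
conversion and still frees the record.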

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c |  72 ++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_aops.h |   7 +++-
 fs/xfs/xfs_file.c | 108 ++++++++++++++++++++++++++++++++++++++++++++----------
 fs/xfs/xfs_iops.c |   4 ++
 fs/xfs/xfs_iops.h |   6 +++
 5 files changed, 173 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a9b7a1..22cb03a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1233,13 +1233,63 @@ xfs_vm_releasepage(
 	return try_to_free_buffers(page);
 }
 
+/*
+ * For DAX we need a mapping buffer callback for unwritten extent conversion
+ * when page faults allocate blocks and then zero them.
+ */
+#ifdef CONFIG_FS_DAX
+static struct xfs_ioend *
+xfs_dax_alloc_ioend(
+	struct inode	*inode,
+	xfs_off_t	offset,
+	ssize_t		size)
+{
+	struct xfs_ioend *ioend;
+
+	ASSERT(IS_DAX(inode));
+	ioend = xfs_alloc_ioend(inode, XFS_IO_UNWRITTEN);
+	ioend->io_offset = offset;
+	ioend->io_size = size;
+	return ioend;
+}
+
+void
+xfs_get_blocks_dax_complete(
+	struct buffer_head	*bh,
+	int			uptodate)
+{
+	struct xfs_ioend	*ioend = bh->b_private;
+	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
+	int			error;
+
+	ASSERT(IS_DAX(ioend->io_inode));
+
+	/* if there was an error zeroing, then don't convert it */
+	if (!uptodate)
+		goto out_free;
+
+	error = xfs_iomap_write_unwritten(ip, ioend->io_offset, ioend->io_size);
+	if (error)
+		xfs_warn(ip->i_mount,
+"%s: conversion failed, ino 0x%llx, offset 0x%llx, len 0x%lx, error %d\n",
+			__func__, ip->i_ino, ioend->io_offset,
+			ioend->io_size, error);
+out_free:
+	mempool_free(ioend, xfs_ioend_pool);
+
+}
+#else
+#define xfs_dax_alloc_ioend(i,o,s)	NULL
+#endif
+
 STATIC int
 __xfs_get_blocks(
 	struct inode		*inode,
 	sector_t		iblock,
 	struct buffer_head	*bh_result,
 	int			create,
-	int			direct)
+	bool			direct,
+	bool			clear)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1304,6 +1354,7 @@ __xfs_get_blocks(
 			if (error)
 				return error;
 			new = 1;
+
 		} else {
 			/*
 			 * Delalloc reservations do not require a transaction,
@@ -1340,7 +1391,10 @@ __xfs_get_blocks(
 		if (create || !ISUNWRITTEN(&imap))
 			xfs_map_buffer(inode, bh_result, &imap, offset);
 		if (create && ISUNWRITTEN(&imap)) {
-			if (direct) {
+			if (clear) {
+				bh_result->b_private = xfs_dax_alloc_ioend(
+							inode, offset, size);
+			} else if (direct) {
 				bh_result->b_private = inode;
 				set_buffer_defer_completion(bh_result);
 			}
@@ -1425,7 +1479,7 @@ xfs_get_blocks(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, 0);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
 }
 
 STATIC int
@@ -1435,7 +1489,17 @@ xfs_get_blocks_direct(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, 1);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
+}
+
+int
+xfs_get_blocks_dax(
+	struct inode		*inode,
+	sector_t		iblock,
+	struct buffer_head	*bh_result,
+	int			create)
+{
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index ac644e0..7c6fb3f 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -53,7 +53,12 @@ typedef struct xfs_ioend {
 } xfs_ioend_t;
 
 extern const struct address_space_operations xfs_address_space_operations;
-extern int xfs_get_blocks(struct inode *, sector_t, struct buffer_head *, int);
+
+int	xfs_get_blocks(struct inode *inode, sector_t offset,
+		       struct buffer_head *map_bh, int create);
+int	xfs_get_blocks_dax(struct inode *inode, sector_t offset,
+			   struct buffer_head *map_bh, int create);
+void	xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate);
 
 extern void xfs_count_page_state(struct page *, int *, int *);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index bc0008f..4bfcba0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -654,7 +654,7 @@ xfs_file_dio_aio_write(
 					mp->m_rtdev_targp : mp->m_ddev_targp;
 
 	/* DIO must be aligned to device logical sector size */
-	if ((pos | count) & target->bt_logical_sectormask)
+	if (!IS_DAX(inode) && (pos | count) & target->bt_logical_sectormask)
 		return -EINVAL;
 
 	/* "unaligned" here means not aligned to a filesystem block */
@@ -724,8 +724,11 @@ xfs_file_dio_aio_write(
 out:
 	xfs_rw_iunlock(ip, iolock);
 
-	/* No fallback to buffered IO on errors for XFS. */
-	ASSERT(ret < 0 || ret == count);
+	/*
+	 * No fallback to buffered IO on errors for XFS. DAX can result in
+	 * partial writes, but direct IO will either complete fully or fail.
+	 */
+	ASSERT(ret < 0 || ret == count || IS_DAX(VFS_I(ip)));
 	return ret;
 }
 
@@ -810,7 +813,7 @@ xfs_file_write_iter(
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;
 
-	if (unlikely(file->f_flags & O_DIRECT))
+	if ((file->f_flags & O_DIRECT) || IS_DAX(inode))
 		ret = xfs_file_dio_aio_write(iocb, from);
 	else
 		ret = xfs_file_buffered_aio_write(iocb, from);
@@ -1031,17 +1034,6 @@ xfs_file_readdir(
 	return xfs_readdir(ip, ctx, bufsize);
 }
 
-STATIC int
-xfs_file_mmap(
-	struct file	*filp,
-	struct vm_area_struct *vma)
-{
-	vma->vm_ops = &xfs_file_vm_ops;
-
-	file_accessed(filp);
-	return 0;
-}
-
 /*
  * This type is designed to indicate the type of offset we would like
  * to search from page cache for xfs_seek_hole_data().
@@ -1466,6 +1458,71 @@ xfs_filemap_page_mkwrite(
 	return error;
 }
 
+static const struct vm_operations_struct xfs_file_vm_ops = {
+	.fault		= xfs_filemap_fault,
+	.map_pages	= filemap_map_pages,
+	.page_mkwrite	= xfs_filemap_page_mkwrite,
+};
+
+#ifdef CONFIG_FS_DAX
+static int
+xfs_filemap_dax_fault(
+	struct vm_area_struct	*vma,
+	struct vm_fault		*vmf)
+{
+	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
+	int			error;
+
+	trace_xfs_filemap_fault(ip);
+
+	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+	error = dax_fault(vma, vmf, xfs_get_blocks_dax,
+			  xfs_get_blocks_dax_complete);
+	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+
+	return error;
+}
+
+static int
+xfs_filemap_dax_page_mkwrite(
+	struct vm_area_struct	*vma,
+	struct vm_fault		*vmf)
+{
+	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
+	int			error;
+
+	trace_xfs_filemap_page_mkwrite(ip);
+
+	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+	error = dax_mkwrite(vma, vmf, xfs_get_blocks_dax,
+			    xfs_get_blocks_dax_complete);
+	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+
+	return error;
+}
+
+static const struct vm_operations_struct xfs_file_dax_vm_ops = {
+	.fault		= xfs_filemap_dax_fault,
+	.page_mkwrite	= xfs_filemap_dax_page_mkwrite,
+};
+#else
+#define xfs_file_dax_vm_ops	xfs_file_vm_ops
+#endif /* CONFIG_FS_DAX */
+
+STATIC int
+xfs_file_mmap(
+	struct file	*filp,
+	struct vm_area_struct *vma)
+{
+	file_accessed(filp);
+	if (IS_DAX(file_inode(filp))) {
+		vma->vm_ops = &xfs_file_dax_vm_ops;
+		vma->vm_flags |= VM_MIXEDMAP;
+	} else
+		vma->vm_ops = &xfs_file_vm_ops;
+	return 0;
+}
+
 const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read		= new_sync_read,
@@ -1497,8 +1554,21 @@ const struct file_operations xfs_dir_file_operations = {
 	.fsync		= xfs_dir_fsync,
 };
 
-static const struct vm_operations_struct xfs_file_vm_ops = {
-	.fault		= xfs_filemap_fault,
-	.map_pages	= filemap_map_pages,
-	.page_mkwrite	= xfs_filemap_page_mkwrite,
+#ifdef CONFIG_FS_DAX
+const struct file_operations xfs_file_dax_operations = {
+	.llseek		= xfs_file_llseek,
+	.read		= new_sync_read,
+	.write		= new_sync_write,
+	.read_iter	= xfs_file_read_iter,
+	.write_iter	= xfs_file_write_iter,
+	.unlocked_ioctl	= xfs_file_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= xfs_file_compat_ioctl,
+#endif
+	.mmap		= xfs_file_mmap,
+	.open		= xfs_file_open,
+	.release	= xfs_file_release,
+	.fsync		= xfs_file_fsync,
+	.fallocate	= xfs_file_fallocate,
 };
+#endif /* CONFIG_FS_DAX */
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8b9e688..9f38142 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1264,6 +1264,10 @@ xfs_setup_inode(
 	case S_IFREG:
 		inode->i_op = &xfs_inode_operations;
 		inode->i_fop = &xfs_file_operations;
+		if (IS_DAX(inode))
+			inode->i_fop = &xfs_file_dax_operations;
+		else
+			inode->i_fop = &xfs_file_operations;
 		inode->i_mapping->a_ops = &xfs_address_space_operations;
 		break;
 	case S_IFDIR:
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index a0f84ab..c08983e 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -23,6 +23,12 @@ struct xfs_inode;
 extern const struct file_operations xfs_file_operations;
 extern const struct file_operations xfs_dir_file_operations;
 
+#ifdef CONFIG_FS_DAX
+extern const struct file_operations xfs_file_dax_operations;
+#else
+#define xfs_file_dax_operations xfs_file_operations
+#endif
+
 extern ssize_t xfs_vn_listxattr(struct dentry *, char *data, size_t size);
 
 /*
-- 
2.0.0
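
A note on the xfs_file.c write-path hunk above. The following is a minimal
userspace sketch, not kernel code, of the alignment test that the patch
skips for DAX inodes. The function name dio_write_allowed and its
parameters are invented for illustration: sector_mask stands in for
target->bt_logical_sectormask and is_dax for IS_DAX(inode). The stated
reason for the bypass (DAX being byte addressable) is an assumption, since
the patch itself does not spell it out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model of the alignment test in xfs_file_dio_aio_write().
 * sector_mask is (logical sector size - 1). All names are illustrative.
 */
static bool dio_write_allowed(uint64_t pos, uint64_t count,
			      uint64_t sector_mask, bool is_dax)
{
	/* Assumed rationale: DAX writes are byte granular, so skip the test. */
	if (is_dax)
		return true;

	/* OR-ing pos and count lets one mask test catch either misalignment. */
	return ((pos | count) & sector_mask) == 0;
}
```

With 512 byte sectors (mask 511), a 512 byte write at offset 4096 passes,
a write at offset 100 is rejected, and the same unaligned write is accepted
once is_dax is true.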



* [PATCH 4/6] xfs: add DAX truncate support
  2015-03-03 23:30 ` Dave Chinner
@ 2015-03-03 23:30   ` Dave Chinner
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2015-03-03 23:30 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, jack, willy

From: Dave Chinner <dchinner@redhat.com>

When we truncate a DAX file, we need to call through the DAX page
truncation path rather than through block_truncate_page() so that
mappings and block zeroing are all handled correctly. Otherwise,
truncate does not need to change.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 9f38142..3ff24c3 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -851,7 +851,11 @@ xfs_setattr_size(
 	 * to hope that the caller sees ENOMEM and retries the truncate
 	 * operation.
 	 */
-	error = block_truncate_page(inode->i_mapping, newsize, xfs_get_blocks);
+	if (IS_DAX(inode))
+		error = dax_truncate_page(inode, newsize, xfs_get_blocks_dax);
+	else
+		error = block_truncate_page(inode->i_mapping, newsize,
+					    xfs_get_blocks);
 	if (error)
 		return error;
 	truncate_setsize(inode, newsize);
-- 
2.0.0
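
To illustrate what the truncate-page step has to do: the sketch below is a
userspace model, not the kernel implementation, and assumes that both
block_truncate_page() and dax_truncate_page() zero the bytes from the new
EOF to the end of the block containing it. The struct zero_range type and
truncate_zero_range() name are hypothetical.

```c
#include <stdint.h>

/* Byte range that must be zeroed when a file is truncated to newsize. */
struct zero_range {
	uint64_t start;	/* first byte to zero (the new EOF) */
	uint64_t len;	/* bytes to zero; 0 when newsize is block aligned */
};

/* blocksize must be a power of two for the mask arithmetic to hold. */
static struct zero_range truncate_zero_range(uint64_t newsize,
					     uint64_t blocksize)
{
	struct zero_range r = { newsize, 0 };
	uint64_t off = newsize & (blocksize - 1);

	/* Only a partial tail block needs zeroing. */
	if (off)
		r.len = blocksize - off;
	return r;
}
```

For a 4096 byte block size, truncating to 5000 bytes gives a range starting
at 5000 with length 3192, while truncating to 4096 needs no zeroing.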



* [PATCH 5/6] xfs: add DAX IO path support
  2015-03-03 23:30 ` Dave Chinner
@ 2015-03-03 23:30   ` Dave Chinner
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2015-03-03 23:30 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, jack, willy

From: Dave Chinner <dchinner@redhat.com>

DAX does not do buffered IO (can't buffer direct access!) and hence
all read/write IO is vectored through the direct IO path.  Hence we
need to add the DAX IO path callouts to the direct IO
infrastructure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 22cb03a..28b79c5 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1558,6 +1558,30 @@ xfs_end_io_direct_write(
 	}
 }
 
+static inline ssize_t
+xfs_vm_do_dio(
+	struct inode		*inode,
+	int			rw,
+	struct kiocb		*iocb,
+	struct iov_iter		*iter,
+	loff_t			offset,
+	void			(*endio)(struct kiocb	*iocb,
+					 loff_t		offset,
+					 ssize_t	size,
+					 void		*private),
+	int			flags)
+{
+	struct block_device	*bdev;
+
+	if (IS_DAX(inode))
+		return dax_do_io(rw, iocb, inode, iter, offset,
+				 xfs_get_blocks_direct, endio, 0);
+
+	bdev = xfs_find_bdev_for_inode(inode);
+	return __blockdev_direct_IO(rw, iocb, inode, bdev, iter, offset,
+				    xfs_get_blocks_direct, endio, NULL, flags);
+}
+
 STATIC ssize_t
 xfs_vm_direct_IO(
 	int			rw,
@@ -1566,17 +1590,12 @@ xfs_vm_direct_IO(
 	loff_t			offset)
 {
 	struct inode		*inode = iocb->ki_filp->f_mapping->host;
-	struct block_device	*bdev = xfs_find_bdev_for_inode(inode);
 
 	if (rw & WRITE) {
-		return __blockdev_direct_IO(rw, iocb, inode, bdev, iter,
-					    offset, xfs_get_blocks_direct,
-					    xfs_end_io_direct_write, NULL,
-					    DIO_ASYNC_EXTEND);
+		return xfs_vm_do_dio(inode, rw, iocb, iter, offset,
+				     xfs_end_io_direct_write, DIO_ASYNC_EXTEND);
 	}
-	return __blockdev_direct_IO(rw, iocb, inode, bdev, iter,
-				    offset, xfs_get_blocks_direct,
-				    NULL, NULL, 0);
+	return xfs_vm_do_dio(inode, rw, iocb, iter, offset, NULL, 0);
 }
 
 /*
-- 
2.0.0
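
The dispatch that xfs_vm_do_dio() introduces can be modelled in userspace
as below. This is only a sketch of the decision logic as it reads in the
diff: all type and function names are hypothetical, and DEMO_DIO_ASYNC_EXTEND
is a placeholder value, not the kernel's DIO_ASYNC_EXTEND.

```c
#include <stdbool.h>

#define DEMO_DIO_ASYNC_EXTEND	0x1	/* placeholder, not the kernel value */

enum io_path { IO_PATH_DAX, IO_PATH_BLOCKDEV };

struct io_request {
	bool is_dax;	/* stands in for IS_DAX(inode) */
	bool is_write;	/* stands in for (rw & WRITE) */
};

struct io_plan {
	enum io_path path;	/* which lower-level helper would be called */
	bool has_endio;		/* whether a completion callback is passed */
	int flags;		/* flags handed to the lower-level helper */
};

static struct io_plan plan_direct_io(const struct io_request *req)
{
	struct io_plan plan;

	/* Writes carry an end_io callback; reads pass NULL. */
	plan.has_endio = req->is_write;

	if (req->is_dax) {
		/* As written in the diff, the DAX call passes flags of 0. */
		plan.path = IO_PATH_DAX;
		plan.flags = 0;
	} else {
		plan.path = IO_PATH_BLOCKDEV;
		plan.flags = req->is_write ? DEMO_DIO_ASYNC_EXTEND : 0;
	}
	return plan;
}
```

The point of the helper is that xfs_vm_direct_IO() no longer needs to know
which path is taken; it only chooses the callback and flags.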



* [PATCH 6/6] xfs: add initial DAX support
  2015-03-03 23:30 ` Dave Chinner
@ 2015-03-03 23:30   ` Dave Chinner
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2015-03-03 23:30 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, jack, willy

From: Dave Chinner <dchinner@redhat.com>

Add initial DAX support to XFS. To do this we need a new mount
option to turn DAX on for the filesystem, and we need to propagate
this into the inode flags whenever an inode is instantiated so that
the per-inode checks throughout the code Do The Right Thing.

There are still some things remaining to be done:

	- needs per-inode flags to mark inodes as DAX enabled, and
	  an inheritance flag to enable automatic filesystem
	  propagation of the property
	- fails occasionally with zero length writes instead of
	  ENOSPC errors, so error propagation inside/from the DAX
	  code needs work
	- occasionally creates two extents rather than a single
	  larger extent like non-dax filesystems.
	- much more testing

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c  | 24 ++++++++++++------------
 fs/xfs/xfs_mount.h |  2 ++
 fs/xfs/xfs_super.c | 25 +++++++++++++++++++++++--
 3 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 3ff24c3..887d196 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1195,22 +1195,22 @@ xfs_diflags_to_iflags(
 	struct inode		*inode,
 	struct xfs_inode	*ip)
 {
-	if (ip->i_d.di_flags & XFS_DIFLAG_IMMUTABLE)
+	uint16_t		flags = ip->i_d.di_flags;
+
+	inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
+			    S_NOATIME | S_DAX);
+
+	if (flags & XFS_DIFLAG_IMMUTABLE)
 		inode->i_flags |= S_IMMUTABLE;
-	else
-		inode->i_flags &= ~S_IMMUTABLE;
-	if (ip->i_d.di_flags & XFS_DIFLAG_APPEND)
+	if (flags & XFS_DIFLAG_APPEND)
 		inode->i_flags |= S_APPEND;
-	else
-		inode->i_flags &= ~S_APPEND;
-	if (ip->i_d.di_flags & XFS_DIFLAG_SYNC)
+	if (flags & XFS_DIFLAG_SYNC)
 		inode->i_flags |= S_SYNC;
-	else
-		inode->i_flags &= ~S_SYNC;
-	if (ip->i_d.di_flags & XFS_DIFLAG_NOATIME)
+	if (flags & XFS_DIFLAG_NOATIME)
 		inode->i_flags |= S_NOATIME;
-	else
-		inode->i_flags &= ~S_NOATIME;
+	/* XXX: Also needs an on-disk per inode flag! */
+	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
+		inode->i_flags |= S_DAX;
 }
 
 /*
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8c995a2..cd44e88 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -179,6 +179,8 @@ typedef struct xfs_mount {
 						   allocator */
 #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
 
+#define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
+
 
 /*
  * Default minimum read and write sizes.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 3ad0b17..0f26d7a 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -112,6 +112,8 @@ static struct xfs_kobj xfs_dbg_kobj;	/* global debug sysfs attrs */
 #define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
 #define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
 
+#define MNTOPT_DAX	"dax"		/* Enable direct access to bdev pages */
+
 /*
  * Table driven mount option parser.
  *
@@ -363,6 +365,10 @@ xfs_parseargs(
 			mp->m_flags |= XFS_MOUNT_DISCARD;
 		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
 			mp->m_flags &= ~XFS_MOUNT_DISCARD;
+#ifdef CONFIG_FS_DAX
+		} else if (!strcmp(this_char, MNTOPT_DAX)) {
+			mp->m_flags |= XFS_MOUNT_DAX;
+#endif
 		} else {
 			xfs_warn(mp, "unknown mount option [%s].", this_char);
 			return -EINVAL;
@@ -452,8 +458,8 @@ done:
 }
 
 struct proc_xfs_info {
-	int	flag;
-	char	*str;
+	uint64_t	flag;
+	char		*str;
 };
 
 STATIC int
@@ -474,6 +480,7 @@ xfs_showargs(
 		{ XFS_MOUNT_GRPID,		"," MNTOPT_GRPID },
 		{ XFS_MOUNT_DISCARD,		"," MNTOPT_DISCARD },
 		{ XFS_MOUNT_SMALL_INUMS,	"," MNTOPT_32BITINODE },
+		{ XFS_MOUNT_DAX,		"," MNTOPT_DAX },
 		{ 0, NULL }
 	};
 	static struct proc_xfs_info xfs_info_unset[] = {
@@ -1501,6 +1508,20 @@ xfs_fs_fill_super(
 	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
 		sb->s_flags |= MS_I_VERSION;
 
+	if (mp->m_flags & XFS_MOUNT_DAX) {
+		xfs_warn(mp,
+	"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
+		if (sb->s_blocksize != PAGE_SIZE) {
+			xfs_alert(mp,
+		"Filesystem block size invalid for DAX. Turning DAX off.");
+			mp->m_flags &= ~XFS_MOUNT_DAX;
+		} else if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			xfs_alert(mp,
+		"Block device does not support DAX. Turning DAX off.");
+			mp->m_flags &= ~XFS_MOUNT_DAX;
+		}
+	}
+
 	error = xfs_mountfs(mp);
 	if (error)
 		goto out_filestream_unmount;
-- 
2.0.0
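
The two pieces of logic this patch adds, the mount-time validation and the
clear-then-set flag translation, can be sketched in userspace as follows.
This is an illustrative model only: every DEMO_ constant and both function
names are hypothetical, and only the position of the DAX mount bit (62) is
taken from the patch. That bit position is also why struct proc_xfs_info
widens its flag field to uint64_t: bit 62 is lost when narrowed to 32 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for this demo; only bit 62 mirrors the patch. */
#define DEMO_DIFLAG_IMMUTABLE	(1u << 0)
#define DEMO_DIFLAG_APPEND	(1u << 1)
#define DEMO_S_IMMUTABLE	(1u << 0)
#define DEMO_S_APPEND		(1u << 1)
#define DEMO_S_DAX		(1u << 2)
#define DEMO_MOUNT_DAX		(1ULL << 62)

/* Mount-time check: drop the DAX flag if the preconditions do not hold. */
static uint64_t validate_dax_mount(uint64_t m_flags, uint32_t blocksize,
				   uint32_t page_size, bool bdev_direct_access)
{
	if (!(m_flags & DEMO_MOUNT_DAX))
		return m_flags;
	if (blocksize != page_size || !bdev_direct_access)
		m_flags &= ~DEMO_MOUNT_DAX;
	return m_flags;
}

/* Clear every managed bit once, then set only those that apply. */
static uint32_t diflags_to_iflags(uint32_t i_flags, uint16_t di_flags,
				  uint64_t m_flags)
{
	i_flags &= ~(DEMO_S_IMMUTABLE | DEMO_S_APPEND | DEMO_S_DAX);

	if (di_flags & DEMO_DIFLAG_IMMUTABLE)
		i_flags |= DEMO_S_IMMUTABLE;
	if (di_flags & DEMO_DIFLAG_APPEND)
		i_flags |= DEMO_S_APPEND;
	/* DAX comes from the mount, not from an on-disk inode flag. */
	if (m_flags & DEMO_MOUNT_DAX)
		i_flags |= DEMO_S_DAX;
	return i_flags;
}
```

Clearing the whole mask up front replaces the per-flag else branches of the
old code, so a stale DEMO_S_DAX bit cannot survive a remount of the model
without the DAX mount flag.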



* Re: [PATCH 3/6] xfs: add DAX file operations support
  2015-03-03 23:30   ` Dave Chinner
@ 2015-03-04 10:09     ` Boaz Harrosh
  -1 siblings, 0 replies; 40+ messages in thread
From: Boaz Harrosh @ 2015-03-04 10:09 UTC (permalink / raw)
  To: Dave Chinner, xfs; +Cc: linux-fsdevel, jack, willy

On 03/04/2015 01:30 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add the initial support for DAX file operations to XFS. This
> includes the necessary block allocation and mmap page fault hooks
> for DAX to function.
> 
> Note that the current block allocation code abuses the mapping
> buffer head to provide a completion callback for unwritten extent
> allocation when DAX is clearing blocks. The DAX interface needs to
> be changed to provide a callback similar to get_blocks for this
> callback.

It looks like this comment is stale for this set.

A question below.

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c |  72 ++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_aops.h |   7 +++-
>  fs/xfs/xfs_file.c | 108 ++++++++++++++++++++++++++++++++++++++++++++----------
>  fs/xfs/xfs_iops.c |   4 ++
>  fs/xfs/xfs_iops.h |   6 +++
>  5 files changed, 173 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3a9b7a1..22cb03a 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1233,13 +1233,63 @@ xfs_vm_releasepage(
>  	return try_to_free_buffers(page);
>  }
>  
> +/*
> + * For DAX we need a mapping buffer callback for unwritten extent conversion
> + * when page faults allocation blocks and then zero them.
> + */
> +#ifdef CONFIG_FS_DAX
> +static struct xfs_ioend *
> +xfs_dax_alloc_ioend(
> +	struct inode	*inode,
> +	xfs_off_t	offset,
> +	ssize_t		size)
> +{
> +	struct xfs_ioend *ioend;
> +
> +	ASSERT(IS_DAX(inode));
> +	ioend = xfs_alloc_ioend(inode, XFS_IO_UNWRITTEN);
> +	ioend->io_offset = offset;
> +	ioend->io_size = size;
> +	return ioend;
> +}
> +
> +void
> +xfs_get_blocks_dax_complete(
> +	struct buffer_head	*bh,
> +	int			uptodate)
> +{
> +	struct xfs_ioend	*ioend = bh->b_private;
> +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> +	int			error;
> +
> +	ASSERT(IS_DAX(ioend->io_inode));
> +
> +	/* if there was an error zeroing, then don't convert it */
> +	if (!uptodate)
> +		goto out_free;
> +
> +	error = xfs_iomap_write_unwritten(ip, ioend->io_offset, ioend->io_size);
> +	if (error)
> +		xfs_warn(ip->i_mount,
> +"%s: conversion failed, ino 0x%llx, offset 0x%llx, len 0x%lx, error %d\n",
> +			__func__, ip->i_ino, ioend->io_offset,
> +			ioend->io_size, error);
> +out_free:
> +	mempool_free(ioend, xfs_ioend_pool);
> +
> +}
> +#else
> +#define xfs_dax_alloc_ioend(i,o,s)	NULL
> +#endif
> +
>  STATIC int
>  __xfs_get_blocks(
>  	struct inode		*inode,
>  	sector_t		iblock,
>  	struct buffer_head	*bh_result,
>  	int			create,
> -	int			direct)
> +	bool			direct,
> +	bool			clear)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1304,6 +1354,7 @@ __xfs_get_blocks(
>  			if (error)
>  				return error;
>  			new = 1;
> +
>  		} else {
>  			/*
>  			 * Delalloc reservations do not require a transaction,
> @@ -1340,7 +1391,10 @@ __xfs_get_blocks(
>  		if (create || !ISUNWRITTEN(&imap))
>  			xfs_map_buffer(inode, bh_result, &imap, offset);
>  		if (create && ISUNWRITTEN(&imap)) {
> -			if (direct) {
> +			if (clear) {
> +				bh_result->b_private = xfs_dax_alloc_ioend(
> +							inode, offset, size);
> +			} else if (direct) {
>  				bh_result->b_private = inode;
>  				set_buffer_defer_completion(bh_result);
>  			}
> @@ -1425,7 +1479,7 @@ xfs_get_blocks(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 0);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
>  }
>  
>  STATIC int
> @@ -1435,7 +1489,17 @@ xfs_get_blocks_direct(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 1);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
> +}
> +
> +int
> +xfs_get_blocks_dax(
> +	struct inode		*inode,
> +	sector_t		iblock,
> +	struct buffer_head	*bh_result,
> +	int			create)
> +{
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index ac644e0..7c6fb3f 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -53,7 +53,12 @@ typedef struct xfs_ioend {
>  } xfs_ioend_t;
>  
>  extern const struct address_space_operations xfs_address_space_operations;
> -extern int xfs_get_blocks(struct inode *, sector_t, struct buffer_head *, int);
> +
> +int	xfs_get_blocks(struct inode *inode, sector_t offset,
> +		       struct buffer_head *map_bh, int create);
> +int	xfs_get_blocks_dax(struct inode *inode, sector_t offset,
> +			   struct buffer_head *map_bh, int create);
> +void	xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate);
>  
>  extern void xfs_count_page_state(struct page *, int *, int *);
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index bc0008f..4bfcba0 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -654,7 +654,7 @@ xfs_file_dio_aio_write(
>  					mp->m_rtdev_targp : mp->m_ddev_targp;
>  
>  	/* DIO must be aligned to device logical sector size */
> -	if ((pos | count) & target->bt_logical_sectormask)
> +	if (!IS_DAX(inode) && (pos | count) & target->bt_logical_sectormask)
>  		return -EINVAL;
>  
>  	/* "unaligned" here means not aligned to a filesystem block */
> @@ -724,8 +724,11 @@ xfs_file_dio_aio_write(
>  out:
>  	xfs_rw_iunlock(ip, iolock);
>  
> -	/* No fallback to buffered IO on errors for XFS. */
> -	ASSERT(ret < 0 || ret == count);
> +	/*
> +	 * No fallback to buffered IO on errors for XFS. DAX can result in
> +	 * partial writes, but direct IO will either complete fully or fail.
> +	 */
> +	ASSERT(ret < 0 || ret == count || IS_DAX(VFS_I(ip)));
>  	return ret;
>  }
>  
> @@ -810,7 +813,7 @@ xfs_file_write_iter(
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
>  		return -EIO;
>  
> -	if (unlikely(file->f_flags & O_DIRECT))
> +	if ((file->f_flags & O_DIRECT) || IS_DAX(inode))
>  		ret = xfs_file_dio_aio_write(iocb, from);
>  	else
>  		ret = xfs_file_buffered_aio_write(iocb, from);
> @@ -1031,17 +1034,6 @@ xfs_file_readdir(
>  	return xfs_readdir(ip, ctx, bufsize);
>  }
>  
> -STATIC int
> -xfs_file_mmap(
> -	struct file	*filp,
> -	struct vm_area_struct *vma)
> -{
> -	vma->vm_ops = &xfs_file_vm_ops;
> -
> -	file_accessed(filp);
> -	return 0;
> -}
> -
>  /*
>   * This type is designed to indicate the type of offset we would like
>   * to search from page cache for xfs_seek_hole_data().
> @@ -1466,6 +1458,71 @@ xfs_filemap_page_mkwrite(
>  	return error;
>  }
>  
> +static const struct vm_operations_struct xfs_file_vm_ops = {
> +	.fault		= xfs_filemap_fault,
> +	.map_pages	= filemap_map_pages,
> +	.page_mkwrite	= xfs_filemap_page_mkwrite,
> +};
> +
> +#ifdef CONFIG_FS_DAX
> +static int
> +xfs_filemap_dax_fault(
> +	struct vm_area_struct	*vma,
> +	struct vm_fault		*vmf)
> +{
> +	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> +	int			error;
> +
> +	trace_xfs_filemap_fault(ip);
> +
> +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> +	error = dax_fault(vma, vmf, xfs_get_blocks_dax,
> +			  xfs_get_blocks_dax_complete);
> +	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> +
> +	return error;
> +}
> +
> +static int
> +xfs_filemap_dax_page_mkwrite(
> +	struct vm_area_struct	*vma,
> +	struct vm_fault		*vmf)
> +{
> +	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> +	int			error;
> +
> +	trace_xfs_filemap_page_mkwrite(ip);
> +
> +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> +	error = dax_mkwrite(vma, vmf, xfs_get_blocks_dax,
> +			    xfs_get_blocks_dax_complete);
> +	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> +
> +	return error;
> +}
> +
> +static const struct vm_operations_struct xfs_file_dax_vm_ops = {
> +	.fault		= xfs_filemap_dax_fault,
> +	.page_mkwrite	= xfs_filemap_dax_page_mkwrite,
> +};
> +#else
> +#define xfs_file_dax_vm_ops	xfs_file_vm_ops
> +#endif /* CONFIG_FS_DAX */
> +
> +STATIC int
> +xfs_file_mmap(
> +	struct file	*filp,
> +	struct vm_area_struct *vma)
> +{
> +	file_accessed(filp);
> +	if (IS_DAX(file_inode(filp))) {
> +		vma->vm_ops = &xfs_file_dax_vm_ops;
> +		vma->vm_flags |= VM_MIXEDMAP;
> +	} else
> +		vma->vm_ops = &xfs_file_vm_ops;
> +	return 0;
> +}
> +
>  const struct file_operations xfs_file_operations = {
>  	.llseek		= xfs_file_llseek,
>  	.read		= new_sync_read,
> @@ -1497,8 +1554,21 @@ const struct file_operations xfs_dir_file_operations = {
>  	.fsync		= xfs_dir_fsync,
>  };
>  
> -static const struct vm_operations_struct xfs_file_vm_ops = {
> -	.fault		= xfs_filemap_fault,
> -	.map_pages	= filemap_map_pages,
> -	.page_mkwrite	= xfs_filemap_page_mkwrite,
> +#ifdef CONFIG_FS_DAX
> +const struct file_operations xfs_file_dax_operations = {
> +	.llseek		= xfs_file_llseek,
> +	.read		= new_sync_read,
> +	.write		= new_sync_write,
> +	.read_iter	= xfs_file_read_iter,
> +	.write_iter	= xfs_file_write_iter,
> +	.unlocked_ioctl	= xfs_file_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl	= xfs_file_compat_ioctl,
> +#endif
> +	.mmap		= xfs_file_mmap,
> +	.open		= xfs_file_open,
> +	.release	= xfs_file_release,
> +	.fsync		= xfs_file_fsync,
> +	.fallocate	= xfs_file_fallocate,
>  };

Sigh, the same problem was in ext4. The reason you need
a second xfs_file_operations vector is to drop
	.splice_read	= xfs_file_splice_read,
	.splice_write	= iter_file_splice_write,
which do not exist for DAX.

Inspecting do_splice_from and do_splice_to, if I'm reading
the code right, the only difference when these vectors are
absent is the call to default_file_splice_write/read.

[default_file_splice_read, for example, would go ahead and use
 kernel_readv()]

Would it be cleaner to call default_file_splice_write/read
directly from xfs_file_splice_read/write in the DAX case
and only keep one vector?
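
A minimal userspace sketch of that single-vector idea, for illustration only.
All names here (model_inode, model_file_operations, model_splice_read, the
splice_path enum) are hypothetical stand-ins and not the kernel's types; the
two enum values merely label which path would be taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode; is_dax models IS_DAX(inode). */
struct model_inode {
	bool is_dax;
};

/* Labels for the path a splice read would take in this model. */
enum splice_path {
	SPLICE_PAGECACHE,	/* models the buffered xfs_file_splice_read() body */
	SPLICE_DEFAULT,		/* models default_file_splice_read() */
};

/* Hypothetical one-member operations vector. */
struct model_file_operations {
	enum splice_path (*splice_read)(const struct model_inode *inode);
};

/* One handler for both cases: DAX inodes are routed to the default path. */
static enum splice_path model_splice_read(const struct model_inode *inode)
{
	if (inode->is_dax)
		return SPLICE_DEFAULT;
	return SPLICE_PAGECACHE;
}

/* A single vector serves DAX and non-DAX inodes alike. */
static const struct model_file_operations model_file_ops = {
	.splice_read = model_splice_read,
};
```

The point of the sketch is only that the DAX decision moves from the choice of
vector at inode setup time into the handler itself.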

I have looked through the code, nothing stands out that I
can see. Do you have a public tree I can pull for easy
testing?

Thanks
Boaz

> +#endif /* CONFIG_FS_DAX */
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 8b9e688..9f38142 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1264,6 +1264,10 @@ xfs_setup_inode(
>  	case S_IFREG:
>  		inode->i_op = &xfs_inode_operations;
>  		inode->i_fop = &xfs_file_operations;
> +		if (IS_DAX(inode))
> +			inode->i_fop = &xfs_file_dax_operations;
> +		else
> +			inode->i_fop = &xfs_file_operations;
>  		inode->i_mapping->a_ops = &xfs_address_space_operations;
>  		break;
>  	case S_IFDIR:
> diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
> index a0f84ab..c08983e 100644
> --- a/fs/xfs/xfs_iops.h
> +++ b/fs/xfs/xfs_iops.h
> @@ -23,6 +23,12 @@ struct xfs_inode;
>  extern const struct file_operations xfs_file_operations;
>  extern const struct file_operations xfs_dir_file_operations;
>  
> +#ifdef CONFIG_FS_DAX
> +extern const struct file_operations xfs_file_dax_operations;
> +#else
> +#define xfs_file_dax_operations xfs_file_operations
> +#endif
> +
>  extern ssize_t xfs_vn_listxattr(struct dentry *, char *data, size_t size);
>  
>  /*
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/6] xfs: add DAX file operations support
  2015-03-04 10:09     ` Boaz Harrosh
@ 2015-03-04 13:01       ` Dave Chinner
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2015-03-04 13:01 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: xfs, linux-fsdevel, jack, willy

On Wed, Mar 04, 2015 at 12:09:40PM +0200, Boaz Harrosh wrote:
> On 03/04/2015 01:30 AM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Add the initial support for DAX file operations to XFS. This
> > includes the necessary block allocation and mmap page fault hooks
> > for DAX to function.
> > 
> > Note that the current block allocation code abuses the mapping
> > buffer head to provide a completion callback for unwritten extent
> > allocation when DAX is clearing blocks. The DAX interface needs to
> > be changed to provide a callback similar to get_blocks for this
> > callback.
> 
> It looks like this comment is stale for this set
> 
> A question below
.....

> >  	.fsync		= xfs_dir_fsync,
> >  };
> >  
> > -static const struct vm_operations_struct xfs_file_vm_ops = {
> > -	.fault		= xfs_filemap_fault,
> > -	.map_pages	= filemap_map_pages,
> > -	.page_mkwrite	= xfs_filemap_page_mkwrite,
> > +#ifdef CONFIG_FS_DAX
> > +const struct file_operations xfs_file_dax_operations = {
> > +	.llseek		= xfs_file_llseek,
> > +	.read		= new_sync_read,
> > +	.write		= new_sync_write,
> > +	.read_iter	= xfs_file_read_iter,
> > +	.write_iter	= xfs_file_write_iter,
> > +	.unlocked_ioctl	= xfs_file_ioctl,
> > +#ifdef CONFIG_COMPAT
> > +	.compat_ioctl	= xfs_file_compat_ioctl,
> > +#endif
> > +	.mmap		= xfs_file_mmap,
> > +	.open		= xfs_file_open,
> > +	.release	= xfs_file_release,
> > +	.fsync		= xfs_file_fsync,
> > +	.fallocate	= xfs_file_fallocate,
> >  };
> 
> sigh, The same problem was in ext4, the reason you need
> a second xfs_file_operations vector is because of the minus
> 	.splice_read	= xfs_file_splice_read,
> 	.splice_write	= iter_file_splice_write,
> Which do not exist for DAX

Right, because they use buffered IO, and DAX doesn't do buffered
IO.

> Would it be cleaner to call default_file_splice_write/read
> directly from xfs_file_splice_read/write in the DAX case
> and only keep one vector?

Umm, looking at the code, that is different to what I remember. It
used to be that if the filesystem did not support splice, it did not
set a vector. That has changed - it now calls a generic splice path
instead.

So, we definitely need splice to/from DAX enabled inodes to be
rejected. I'll have a look at that...
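
The behaviour described above can be sketched in userspace as follows. This
is a model under stated assumptions, not the kernel's code: the
model_splice_fn type and all function names are hypothetical, and -EINVAL is
an assumed error code (the thread does not say which errno a rejection would
use).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a filesystem's splice vector. */
typedef long (*model_splice_fn)(bool is_dax, size_t len);

/* Models the generic path (default_file_splice_read/write). */
static long model_default_splice(bool is_dax, size_t len)
{
	(void)is_dax;		/* the generic path does not know about DAX */
	return (long)len;
}

/* Dispatch as described above: a missing vector falls back, it does not fail. */
static long model_do_splice(model_splice_fn fs_vector, bool is_dax, size_t len)
{
	if (fs_vector)
		return fs_vector(is_dax, len);
	return model_default_splice(is_dax, len);
}

/* An explicit vector that rejects DAX inodes; -EINVAL is an assumption. */
static long model_fs_splice(bool is_dax, size_t len)
{
	if (is_dax)
		return -EINVAL;
	return (long)len;
}
```

In this model, leaving the vector unset lets a DAX splice through the generic
path, so rejection has to come from a vector that is actually installed.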

> I have looked through the code, nothing stands out that I
> can see. Do you have a public tree I can pull for easy
> testing?

No, not for a small RFC patchset. git am is your friend ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/6] xfs: add DAX file operations support
  2015-03-04 13:01       ` Dave Chinner
@ 2015-03-04 14:54         ` Boaz Harrosh
  -1 siblings, 0 replies; 40+ messages in thread
From: Boaz Harrosh @ 2015-03-04 14:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, jack, willy

On 03/04/2015 03:01 PM, Dave Chinner wrote:
> On Wed, Mar 04, 2015 at 12:09:40PM +0200, Boaz Harrosh wrote:
<>
> 
> So, we definitely need splice to/from DAX enabled inodes to be
> rejected. I'll have a look at that...
> 

default_file_splice_read uses kernel_readv which I think might actually
work. Do you know what xfstest(s) exercise splice?

Thanks
Boaz


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/6] dax: don't abuse get_block mapping for endio callbacks
  2015-03-03 23:30   ` Dave Chinner
@ 2015-03-04 15:54     ` Jan Kara
  -1 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2015-03-04 15:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, jack, willy

On Wed 04-03-15 10:30:22, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> dax_fault() currently relies on the get_block callback to attach an
> io completion callback to the mapping buffer head so that it can
> run unwritten extent conversion after zeroing allocated blocks.
> 
> Instead of this hack, pass the conversion callback directly into
> dax_fault() similar to the get_block callback. When the filesystem
> allocates unwritten extents, it will set the buffer_unwritten()
> flag, and hence the dax_fault code can call the completion function
> in the contexts where it is necessary without overloading the
> mapping buffer head.
> 
> Note: The changes to ext4 to use this interface are suspect at best.
> In fact, the way ext4 did this end_io assignment in the first place
> looks suspect because it only set a completion callback when there
> wasn't already some other write() call taking place on the same
> inode. The ext4 end_io code looks rather intricate and fragile with
> all its reference counting and passing to different contexts for
> modification via inode private pointers that aren't protected by
> locks...
  Yeah, ext4 is currently broken in that regard, so as long as we don't
make things worse, I'm OK.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/dax.c           | 15 ++++++++-------
>  fs/ext2/file.c     |  4 ++--
>  fs/ext4/file.c     | 16 ++++++++++++++--
>  fs/ext4/inode.c    | 21 +++++++--------------
>  include/linux/fs.h |  6 ++++--
>  5 files changed, 35 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ed1619e..d7b4dba 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -269,7 +269,8 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
>  }
>  
>  static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> -			struct vm_area_struct *vma, struct vm_fault *vmf)
> +			struct vm_area_struct *vma, struct vm_fault *vmf,
> +			dax_iodone_t complete_unwritten)
>  {
>  	struct address_space *mapping = inode->i_mapping;
>  	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
> @@ -310,14 +311,14 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>   out:
>  	i_mmap_unlock_read(mapping);
>  
> -	if (bh->b_end_io)
> -		bh->b_end_io(bh, 1);
> +	if (buffer_unwritten(bh))
> +		complete_unwritten(bh, 1);
>  
>  	return error;
>  }
  So frankly I don't see much point in passing the completion callback into
dax_insert_mapping() only to call it at the end of that function. We could
just as well call the completion function from do_dax_fault(), which would
seem more natural to me. But I don't feel too strongly about this.

Instead of the above, I was also thinking about some way to pass information
out of do_dax_fault() to the filesystem so that it could call the completion
handler itself, but the completion callback is the more standard interface, I
guess.
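
To make the first alternative concrete, here is a rough userspace sketch of
calling the completion from the do_dax_fault() level rather than threading it
through dax_insert_mapping(). It is only a model, not kernel code: struct
model_bh, model_iodone_t and every model_* function are hypothetical stand-ins
for the buffer head, dax_iodone_t and the functions in the patch, and the NULL
check on the callback is an added assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the mapping buffer head. */
struct model_bh {
	bool unwritten;		/* models buffer_unwritten(bh) */
	int completions;	/* how many times the callback ran */
};

/* Same shape as dax_iodone_t in the patch, on the model type. */
typedef void (model_iodone_t)(struct model_bh *bh, int uptodate);

/* Models dax_insert_mapping(): it no longer needs the callback. */
static int model_insert_mapping(struct model_bh *bh)
{
	assert(bh != NULL);
	return 0;		/* pretend the mapping was inserted */
}

/* Models do_dax_fault(): the completion is run here instead. */
static int model_do_fault(struct model_bh *bh,
			  model_iodone_t *complete_unwritten)
{
	int error = model_insert_mapping(bh);

	/* The NULL check is an assumption; ext2 passes NULL in the patch. */
	if (bh->unwritten && complete_unwritten)
		complete_unwritten(bh, 1);
	return error;
}

/* Example filesystem callback: "convert" the extent and count the call. */
static void model_convert_unwritten(struct model_bh *bh, int uptodate)
{
	if (uptodate)
		bh->unwritten = false;
	bh->completions++;
}
```

With this shape the insert helper keeps its original signature, and a
filesystem that never returns unwritten mappings can pass NULL without the
callback ever being dereferenced.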

								Honza

>  static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> -			get_block_t get_block)
> +			get_block_t get_block, dax_iodone_t complete_unwritten)
>  {
>  	struct file *file = vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
> @@ -418,7 +419,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		page_cache_release(page);
>  	}
>  
> -	error = dax_insert_mapping(inode, &bh, vma, vmf);
> +	error = dax_insert_mapping(inode, &bh, vma, vmf, complete_unwritten);
>  
>   out:
>  	if (error == -ENOMEM)
> @@ -446,7 +447,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>   * fault handler for DAX files.
>   */
>  int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> -			get_block_t get_block)
> +	      get_block_t get_block, dax_iodone_t complete_unwritten)
>  {
>  	int result;
>  	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> @@ -455,7 +456,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		sb_start_pagefault(sb);
>  		file_update_time(vma->vm_file);
>  	}
> -	result = do_dax_fault(vma, vmf, get_block);
> +	result = do_dax_fault(vma, vmf, get_block, complete_unwritten);
>  	if (vmf->flags & FAULT_FLAG_WRITE)
>  		sb_end_pagefault(sb);
>  
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index e317017..8da747a 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -28,12 +28,12 @@
>  #ifdef CONFIG_FS_DAX
>  static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_fault(vma, vmf, ext2_get_block);
> +	return dax_fault(vma, vmf, ext2_get_block, NULL);
>  }
>  
>  static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_mkwrite(vma, vmf, ext2_get_block);
> +	return dax_mkwrite(vma, vmf, ext2_get_block, NULL);
>  }
>  
>  static const struct vm_operations_struct ext2_dax_vm_ops = {
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 33a09da..f7dabb1 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -192,15 +192,27 @@ errout:
>  }
>  
>  #ifdef CONFIG_FS_DAX
> +static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
> +{
> +	struct inode *inode = bh->b_assoc_map->host;
> +	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
> +	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
> +	int err;
> +	if (!uptodate)
> +		return;
> +	WARN_ON(!buffer_unwritten(bh));
> +	err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
> +}
> +
>  static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_fault(vma, vmf, ext4_get_block);
> +	return dax_fault(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
>  					/* Is this the right get_block? */
>  }
>  
>  static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_mkwrite(vma, vmf, ext4_get_block);
> +	return dax_mkwrite(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
>  }
>  
>  static const struct vm_operations_struct ext4_dax_vm_ops = {
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5cb9a21..43433de 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -657,18 +657,6 @@ has_zeroout:
>  	return retval;
>  }
>  
> -static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
> -{
> -	struct inode *inode = bh->b_assoc_map->host;
> -	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
> -	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
> -	int err;
> -	if (!uptodate)
> -		return;
> -	WARN_ON(!buffer_unwritten(bh));
> -	err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
> -}
> -
>  /* Maximum number of blocks we map for direct IO at once. */
>  #define DIO_MAX_BLOCKS 4096
>  
> @@ -706,10 +694,15 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
>  
>  		map_bh(bh, inode->i_sb, map.m_pblk);
>  		bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
> -		if (IS_DAX(inode) && buffer_unwritten(bh) && !io_end) {
> +		if (IS_DAX(inode) && buffer_unwritten(bh)) {
> +			/*
> +			 * dgc: I suspect unwritten conversion on ext4+DAX is
> +			 * fundamentally broken here when there are concurrent
> +			 * read/write in progress on this inode.
> +			 */
> +			WARN_ON_ONCE(io_end);
>  			bh->b_assoc_map = inode->i_mapping;
>  			bh->b_private = (void *)(unsigned long)iblock;
> -			bh->b_end_io = ext4_end_io_unwritten;
>  		}
>  		if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
>  			set_buffer_defer_completion(bh);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 937e280..82100ae 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -70,6 +70,7 @@ typedef int (get_block_t)(struct inode *inode, sector_t iblock,
>  			struct buffer_head *bh_result, int create);
>  typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  			ssize_t bytes, void *private);
> +typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
>  
>  #define MAY_EXEC		0x00000001
>  #define MAY_WRITE		0x00000002
> @@ -2603,8 +2604,9 @@ ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
>  int dax_clear_blocks(struct inode *, sector_t block, long size);
>  int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
> -int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
> -#define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
> +int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
> +		dax_iodone_t);
> +#define dax_mkwrite(vma, vmf, gb, iod)	dax_fault(vma, vmf, gb, iod)
>  
>  #ifdef CONFIG_BLOCK
>  typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
> -- 
> 2.0.0
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/6] xfs: add DAX file operations support
  2015-03-03 23:30   ` Dave Chinner
@ 2015-03-04 16:18     ` Jan Kara
  -1 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2015-03-04 16:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, jack, willy

On Wed 04-03-15 10:30:24, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add the initial support for DAX file operations to XFS. This
> includes the necessary block allocation and mmap page fault hooks
> for DAX to function.
> 
> Note that the current block allocation code abuses the mapping
> buffer head to provide a completion callback for unwritten extent
> allocation when DAX is clearing blocks. The DAX interface needs to
> be changed to provide a callback similar to get_blocks for this
> callback.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c |  72 ++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_aops.h |   7 +++-
>  fs/xfs/xfs_file.c | 108 ++++++++++++++++++++++++++++++++++++++++++++----------
>  fs/xfs/xfs_iops.c |   4 ++
>  fs/xfs/xfs_iops.h |   6 +++
>  5 files changed, 173 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3a9b7a1..22cb03a 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1233,13 +1233,63 @@ xfs_vm_releasepage(
>  	return try_to_free_buffers(page);
>  }
>  
> +/*
> + * For DAX we need a mapping buffer callback for unwritten extent conversion
> + * when page faults allocate blocks and then zero them.
> + */
> +#ifdef CONFIG_FS_DAX
> +static struct xfs_ioend *
> +xfs_dax_alloc_ioend(
> +	struct inode	*inode,
> +	xfs_off_t	offset,
> +	ssize_t		size)
> +{
> +	struct xfs_ioend *ioend;
> +
> +	ASSERT(IS_DAX(inode));
> +	ioend = xfs_alloc_ioend(inode, XFS_IO_UNWRITTEN);
> +	ioend->io_offset = offset;
> +	ioend->io_size = size;
> +	return ioend;
> +}
> +
> +void
> +xfs_get_blocks_dax_complete(
> +	struct buffer_head	*bh,
> +	int			uptodate)
> +{
> +	struct xfs_ioend	*ioend = bh->b_private;
> +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> +	int			error;
> +
> +	ASSERT(IS_DAX(ioend->io_inode));
> +
> +	/* if there was an error zeroing, then don't convert it */
> +	if (!uptodate)
> +		goto out_free;
> +
> +	error = xfs_iomap_write_unwritten(ip, ioend->io_offset, ioend->io_size);
> +	if (error)
> +		xfs_warn(ip->i_mount,
> +"%s: conversion failed, ino 0x%llx, offset 0x%llx, len 0x%lx, error %d\n",
> +			__func__, ip->i_ino, ioend->io_offset,
> +			ioend->io_size, error);
> +out_free:
> +	mempool_free(ioend, xfs_ioend_pool);
> +
> +}
> +#else
> +#define xfs_dax_alloc_ioend(i,o,s)	NULL
> +#endif
> +
>  STATIC int
>  __xfs_get_blocks(
>  	struct inode		*inode,
>  	sector_t		iblock,
>  	struct buffer_head	*bh_result,
>  	int			create,
> -	int			direct)
> +	bool			direct,
> +	bool			clear)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1304,6 +1354,7 @@ __xfs_get_blocks(
>  			if (error)
>  				return error;
>  			new = 1;
> +
>  		} else {
>  			/*
>  			 * Delalloc reservations do not require a transaction,
> @@ -1340,7 +1391,10 @@ __xfs_get_blocks(
>  		if (create || !ISUNWRITTEN(&imap))
>  			xfs_map_buffer(inode, bh_result, &imap, offset);
>  		if (create && ISUNWRITTEN(&imap)) {
> -			if (direct) {
> +			if (clear) {
> +				bh_result->b_private = xfs_dax_alloc_ioend(
> +							inode, offset, size);
> +			} else if (direct) {
>  				bh_result->b_private = inode;
>  				set_buffer_defer_completion(bh_result);
>  			}
> @@ -1425,7 +1479,7 @@ xfs_get_blocks(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 0);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
>  }
>  
>  STATIC int
> @@ -1435,7 +1489,17 @@ xfs_get_blocks_direct(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 1);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
> +}
> +
> +int
> +xfs_get_blocks_dax(
> +	struct inode		*inode,
> +	sector_t		iblock,
> +	struct buffer_head	*bh_result,
> +	int			create)
> +{
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index ac644e0..7c6fb3f 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -53,7 +53,12 @@ typedef struct xfs_ioend {
>  } xfs_ioend_t;
>  
>  extern const struct address_space_operations xfs_address_space_operations;
> -extern int xfs_get_blocks(struct inode *, sector_t, struct buffer_head *, int);
> +
> +int	xfs_get_blocks(struct inode *inode, sector_t offset,
> +		       struct buffer_head *map_bh, int create);
> +int	xfs_get_blocks_dax(struct inode *inode, sector_t offset,
> +			   struct buffer_head *map_bh, int create);
> +void	xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate);
>  
>  extern void xfs_count_page_state(struct page *, int *, int *);
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index bc0008f..4bfcba0 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -654,7 +654,7 @@ xfs_file_dio_aio_write(
>  					mp->m_rtdev_targp : mp->m_ddev_targp;
>  
>  	/* DIO must be aligned to device logical sector size */
> -	if ((pos | count) & target->bt_logical_sectormask)
> +	if (!IS_DAX(inode) && (pos | count) & target->bt_logical_sectormask)
>  		return -EINVAL;
>  
>  	/* "unaligned" here means not aligned to a filesystem block */
> @@ -724,8 +724,11 @@ xfs_file_dio_aio_write(
>  out:
>  	xfs_rw_iunlock(ip, iolock);
>  
> -	/* No fallback to buffered IO on errors for XFS. */
> -	ASSERT(ret < 0 || ret == count);
> +	/*
> +	 * No fallback to buffered IO on errors for XFS. DAX can result in
> +	 * partial writes, but direct IO will either complete fully or fail.
> +	 */
> +	ASSERT(ret < 0 || ret == count || IS_DAX(VFS_I(ip)));
>  	return ret;
>  }
>  
> @@ -810,7 +813,7 @@ xfs_file_write_iter(
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
>  		return -EIO;
>  
> -	if (unlikely(file->f_flags & O_DIRECT))
> +	if ((file->f_flags & O_DIRECT) || IS_DAX(inode))
>  		ret = xfs_file_dio_aio_write(iocb, from);
>  	else
>  		ret = xfs_file_buffered_aio_write(iocb, from);
> @@ -1031,17 +1034,6 @@ xfs_file_readdir(
>  	return xfs_readdir(ip, ctx, bufsize);
>  }
>  
> -STATIC int
> -xfs_file_mmap(
> -	struct file	*filp,
> -	struct vm_area_struct *vma)
> -{
> -	vma->vm_ops = &xfs_file_vm_ops;
> -
> -	file_accessed(filp);
> -	return 0;
> -}
> -
>  /*
>   * This type is designed to indicate the type of offset we would like
>   * to search from page cache for xfs_seek_hole_data().
> @@ -1466,6 +1458,71 @@ xfs_filemap_page_mkwrite(
>  	return error;
>  }
>  
> +static const struct vm_operations_struct xfs_file_vm_ops = {
> +	.fault		= xfs_filemap_fault,
> +	.map_pages	= filemap_map_pages,
> +	.page_mkwrite	= xfs_filemap_page_mkwrite,
> +};
> +
> +#ifdef CONFIG_FS_DAX
> +static int
> +xfs_filemap_dax_fault(
> +	struct vm_area_struct	*vma,
> +	struct vm_fault		*vmf)
> +{
> +	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> +	int			error;
> +
> +	trace_xfs_filemap_fault(ip);
> +
> +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> +	error = dax_fault(vma, vmf, xfs_get_blocks_dax,
> +			  xfs_get_blocks_dax_complete);
> +	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> +
> +	return error;
> +}
> +
> +static int
> +xfs_filemap_dax_page_mkwrite(
> +	struct vm_area_struct	*vma,
> +	struct vm_fault		*vmf)
> +{
> +	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> +	int			error;
> +
> +	trace_xfs_filemap_page_mkwrite(ip);
> +
> +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
  So I think the lock ordering of XFS_MMAPLOCK and freeze protection is
suspicious (and, now that I look, the same holds for normal write faults -
I didn't realize that when I first read your MMAPLOCK patches). You take
XFS_MMAPLOCK outside of freeze protection, but usually we want freeze
protection to be the outermost lock - in particular, in xfs_file_fallocate()
you take XFS_MMAPLOCK inside freeze protection, I think.

So you'll need to do what ext4 needs to do - take freeze protection, take
the fs-specific locks, and then call do_dax_fault(). Matthew has a patch to
actually export do_dax_fault() (as __dax_fault()) for filesystems.

Similarly, for xfs_filemap_page_mkwrite() you need to use
__block_page_mkwrite() as the callback (as ext4 does) and take freeze
protection early by hand.
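
To illustrate the ordering problem, here is a small userspace model, not
kernel code. The two protections are plain flags, and all model_* names are
hypothetical; model_fault_locked() stands in for an exported __dax_fault()
style helper that expects its caller to hold both. The model only counts how
often freeze protection is taken while the fs lock is already held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two protections; plain flags, not real locks. */
struct model_fs {
	bool freeze_held;	/* models sb_start_pagefault()..sb_end_pagefault() */
	bool mmaplock_held;	/* models XFS_MMAPLOCK_SHARED */
	int order_violations;	/* freeze protection taken under the fs lock */
};

static void model_freeze_start(struct model_fs *fs)
{
	if (fs->mmaplock_held)
		fs->order_violations++;	/* freeze protection is not outermost */
	fs->freeze_held = true;
}

static void model_freeze_end(struct model_fs *fs)
{
	fs->freeze_held = false;
}

static void model_mmap_lock(struct model_fs *fs)
{
	fs->mmaplock_held = true;
}

static void model_mmap_unlock(struct model_fs *fs)
{
	fs->mmaplock_held = false;
}

/* Models the exported fault helper, which expects both to be held already. */
static int model_fault_locked(struct model_fs *fs)
{
	assert(fs->freeze_held && fs->mmaplock_held);
	return 0;
}

/* Ordering as posted: fs lock first, freeze protection inside dax_fault(). */
static int model_fault_as_posted(struct model_fs *fs)
{
	int ret;

	model_mmap_lock(fs);
	model_freeze_start(fs);
	ret = model_fault_locked(fs);
	model_freeze_end(fs);
	model_mmap_unlock(fs);
	return ret;
}

/* Suggested ordering: freeze protection, then fs lock, then the fault. */
static int model_fault_suggested(struct model_fs *fs)
{
	int ret;

	model_freeze_start(fs);
	model_mmap_lock(fs);
	ret = model_fault_locked(fs);
	model_mmap_unlock(fs);
	model_freeze_end(fs);
	return ret;
}
```

Running both paths on a fresh struct model_fs records one ordering violation
for the as-posted path and none for the suggested one, which is the inversion
against the fallocate path described above.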

								Honza

> +	error = dax_mkwrite(vma, vmf, xfs_get_blocks_dax,
> +			    xfs_get_blocks_dax_complete);
> +	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> +
> +	return error;
> +}
> +
> +static const struct vm_operations_struct xfs_file_dax_vm_ops = {
> +	.fault		= xfs_filemap_dax_fault,
> +	.page_mkwrite	= xfs_filemap_dax_page_mkwrite,
> +};
> +#else
> +#define xfs_file_dax_vm_ops	xfs_file_vm_ops
> +#endif /* CONFIG_FS_DAX */
> +
> +STATIC int
> +xfs_file_mmap(
> +	struct file	*filp,
> +	struct vm_area_struct *vma)
> +{
> +	file_accessed(filp);
> +	if (IS_DAX(file_inode(filp))) {
> +		vma->vm_ops = &xfs_file_dax_vm_ops;
> +		vma->vm_flags |= VM_MIXEDMAP;
> +	} else
> +		vma->vm_ops = &xfs_file_vm_ops;
> +	return 0;
> +}
> +
>  const struct file_operations xfs_file_operations = {
>  	.llseek		= xfs_file_llseek,
>  	.read		= new_sync_read,
> @@ -1497,8 +1554,21 @@ const struct file_operations xfs_dir_file_operations = {
>  	.fsync		= xfs_dir_fsync,
>  };
>  
> -static const struct vm_operations_struct xfs_file_vm_ops = {
> -	.fault		= xfs_filemap_fault,
> -	.map_pages	= filemap_map_pages,
> -	.page_mkwrite	= xfs_filemap_page_mkwrite,
> +#ifdef CONFIG_FS_DAX
> +const struct file_operations xfs_file_dax_operations = {
> +	.llseek		= xfs_file_llseek,
> +	.read		= new_sync_read,
> +	.write		= new_sync_write,
> +	.read_iter	= xfs_file_read_iter,
> +	.write_iter	= xfs_file_write_iter,
> +	.unlocked_ioctl	= xfs_file_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl	= xfs_file_compat_ioctl,
> +#endif
> +	.mmap		= xfs_file_mmap,
> +	.open		= xfs_file_open,
> +	.release	= xfs_file_release,
> +	.fsync		= xfs_file_fsync,
> +	.fallocate	= xfs_file_fallocate,
>  };
> +#endif /* CONFIG_FS_DAX */
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 8b9e688..9f38142 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1264,6 +1264,10 @@ xfs_setup_inode(
>  	case S_IFREG:
>  		inode->i_op = &xfs_inode_operations;
>  		inode->i_fop = &xfs_file_operations;
> +		if (IS_DAX(inode))
> +			inode->i_fop = &xfs_file_dax_operations;
> +		else
> +			inode->i_fop = &xfs_file_operations;
>  		inode->i_mapping->a_ops = &xfs_address_space_operations;
>  		break;
>  	case S_IFDIR:
> diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
> index a0f84ab..c08983e 100644
> --- a/fs/xfs/xfs_iops.h
> +++ b/fs/xfs/xfs_iops.h
> @@ -23,6 +23,12 @@ struct xfs_inode;
>  extern const struct file_operations xfs_file_operations;
>  extern const struct file_operations xfs_dir_file_operations;
>  
> +#ifdef CONFIG_FS_DAX
> +extern const struct file_operations xfs_file_dax_operations;
> +#else
> +#define xfs_file_dax_operations xfs_file_operations
> +#endif
> +
>  extern ssize_t xfs_vn_listxattr(struct dentry *, char *data, size_t size);
>  
>  /*
> -- 
> 2.0.0
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/6] xfs: add DAX file operations support
@ 2015-03-04 16:18     ` Jan Kara
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2015-03-04 16:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, willy, jack, xfs

On Wed 04-03-15 10:30:24, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add the initial support for DAX file operations to XFS. This
> includes the necessary block allocation and mmap page fault hooks
> for DAX to function.
> 
> Note that the current block allocation code abuses the mapping
> buffer head to provide a completion callback for unwritten extent
> allocation when DAX is clearing blocks. The DAX interface needs to
> be changed to provide a callback similar to get_blocks for this
> callback.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c |  72 ++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_aops.h |   7 +++-
>  fs/xfs/xfs_file.c | 108 ++++++++++++++++++++++++++++++++++++++++++++----------
>  fs/xfs/xfs_iops.c |   4 ++
>  fs/xfs/xfs_iops.h |   6 +++
>  5 files changed, 173 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3a9b7a1..22cb03a 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1233,13 +1233,63 @@ xfs_vm_releasepage(
>  	return try_to_free_buffers(page);
>  }
>  
> +/*
> + * For DAX we need a mapping buffer callback for unwritten extent conversion
> + * when page faults allocate blocks and then zero them.
> + */
> +#ifdef CONFIG_FS_DAX
> +static struct xfs_ioend *
> +xfs_dax_alloc_ioend(
> +	struct inode	*inode,
> +	xfs_off_t	offset,
> +	ssize_t		size)
> +{
> +	struct xfs_ioend *ioend;
> +
> +	ASSERT(IS_DAX(inode));
> +	ioend = xfs_alloc_ioend(inode, XFS_IO_UNWRITTEN);
> +	ioend->io_offset = offset;
> +	ioend->io_size = size;
> +	return ioend;
> +}
> +
> +void
> +xfs_get_blocks_dax_complete(
> +	struct buffer_head	*bh,
> +	int			uptodate)
> +{
> +	struct xfs_ioend	*ioend = bh->b_private;
> +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> +	int			error;
> +
> +	ASSERT(IS_DAX(ioend->io_inode));
> +
> +	/* if there was an error zeroing, then don't convert it */
> +	if (!uptodate)
> +		goto out_free;
> +
> +	error = xfs_iomap_write_unwritten(ip, ioend->io_offset, ioend->io_size);
> +	if (error)
> +		xfs_warn(ip->i_mount,
> +"%s: conversion failed, ino 0x%llx, offset 0x%llx, len 0x%lx, error %d\n",
> +			__func__, ip->i_ino, ioend->io_offset,
> +			ioend->io_size, error);
> +out_free:
> +	mempool_free(ioend, xfs_ioend_pool);
> +
> +}
> +#else
> +#define xfs_dax_alloc_ioend(i,o,s)	NULL
> +#endif
> +
>  STATIC int
>  __xfs_get_blocks(
>  	struct inode		*inode,
>  	sector_t		iblock,
>  	struct buffer_head	*bh_result,
>  	int			create,
> -	int			direct)
> +	bool			direct,
> +	bool			clear)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1304,6 +1354,7 @@ __xfs_get_blocks(
>  			if (error)
>  				return error;
>  			new = 1;
> +
>  		} else {
>  			/*
>  			 * Delalloc reservations do not require a transaction,
> @@ -1340,7 +1391,10 @@ __xfs_get_blocks(
>  		if (create || !ISUNWRITTEN(&imap))
>  			xfs_map_buffer(inode, bh_result, &imap, offset);
>  		if (create && ISUNWRITTEN(&imap)) {
> -			if (direct) {
> +			if (clear) {
> +				bh_result->b_private = xfs_dax_alloc_ioend(
> +							inode, offset, size);
> +			} else if (direct) {
>  				bh_result->b_private = inode;
>  				set_buffer_defer_completion(bh_result);
>  			}
> @@ -1425,7 +1479,7 @@ xfs_get_blocks(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 0);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
>  }
>  
>  STATIC int
> @@ -1435,7 +1489,17 @@ xfs_get_blocks_direct(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 1);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
> +}
> +
> +int
> +xfs_get_blocks_dax(
> +	struct inode		*inode,
> +	sector_t		iblock,
> +	struct buffer_head	*bh_result,
> +	int			create)
> +{
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index ac644e0..7c6fb3f 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -53,7 +53,12 @@ typedef struct xfs_ioend {
>  } xfs_ioend_t;
>  
>  extern const struct address_space_operations xfs_address_space_operations;
> -extern int xfs_get_blocks(struct inode *, sector_t, struct buffer_head *, int);
> +
> +int	xfs_get_blocks(struct inode *inode, sector_t offset,
> +		       struct buffer_head *map_bh, int create);
> +int	xfs_get_blocks_dax(struct inode *inode, sector_t offset,
> +			   struct buffer_head *map_bh, int create);
> +void	xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate);
>  
>  extern void xfs_count_page_state(struct page *, int *, int *);
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index bc0008f..4bfcba0 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -654,7 +654,7 @@ xfs_file_dio_aio_write(
>  					mp->m_rtdev_targp : mp->m_ddev_targp;
>  
>  	/* DIO must be aligned to device logical sector size */
> -	if ((pos | count) & target->bt_logical_sectormask)
> +	if (!IS_DAX(inode) && (pos | count) & target->bt_logical_sectormask)
>  		return -EINVAL;
>  
>  	/* "unaligned" here means not aligned to a filesystem block */
> @@ -724,8 +724,11 @@ xfs_file_dio_aio_write(
>  out:
>  	xfs_rw_iunlock(ip, iolock);
>  
> -	/* No fallback to buffered IO on errors for XFS. */
> -	ASSERT(ret < 0 || ret == count);
> +	/*
> +	 * No fallback to buffered IO on errors for XFS. DAX can result in
> +	 * partial writes, but direct IO will either complete fully or fail.
> +	 */
> +	ASSERT(ret < 0 || ret == count || IS_DAX(VFS_I(ip)));
>  	return ret;
>  }
>  
> @@ -810,7 +813,7 @@ xfs_file_write_iter(
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
>  		return -EIO;
>  
> -	if (unlikely(file->f_flags & O_DIRECT))
> +	if ((file->f_flags & O_DIRECT) || IS_DAX(inode))
>  		ret = xfs_file_dio_aio_write(iocb, from);
>  	else
>  		ret = xfs_file_buffered_aio_write(iocb, from);
> @@ -1031,17 +1034,6 @@ xfs_file_readdir(
>  	return xfs_readdir(ip, ctx, bufsize);
>  }
>  
> -STATIC int
> -xfs_file_mmap(
> -	struct file	*filp,
> -	struct vm_area_struct *vma)
> -{
> -	vma->vm_ops = &xfs_file_vm_ops;
> -
> -	file_accessed(filp);
> -	return 0;
> -}
> -
>  /*
>   * This type is designed to indicate the type of offset we would like
>   * to search from page cache for xfs_seek_hole_data().
> @@ -1466,6 +1458,71 @@ xfs_filemap_page_mkwrite(
>  	return error;
>  }
>  
> +static const struct vm_operations_struct xfs_file_vm_ops = {
> +	.fault		= xfs_filemap_fault,
> +	.map_pages	= filemap_map_pages,
> +	.page_mkwrite	= xfs_filemap_page_mkwrite,
> +};
> +
> +#ifdef CONFIG_FS_DAX
> +static int
> +xfs_filemap_dax_fault(
> +	struct vm_area_struct	*vma,
> +	struct vm_fault		*vmf)
> +{
> +	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> +	int			error;
> +
> +	trace_xfs_filemap_fault(ip);
> +
> +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> +	error = dax_fault(vma, vmf, xfs_get_blocks_dax,
> +			  xfs_get_blocks_dax_complete);
> +	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> +
> +	return error;
> +}
> +
> +static int
> +xfs_filemap_dax_page_mkwrite(
> +	struct vm_area_struct	*vma,
> +	struct vm_fault		*vmf)
> +{
> +	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> +	int			error;
> +
> +	trace_xfs_filemap_page_mkwrite(ip);
> +
> +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
  So I think the lock ordering of XFS_MMAPLOCK and freeze protection is
suspicious (and, now that I look, the same holds for normal write faults;
I didn't realize that when I was first reading your MMAPLOCK patches).
You take XFS_MMAPLOCK outside of freeze protection, whereas we usually
want freeze protection to be the outermost lock. In particular, in
xfs_file_fallocate() you take XFS_MMAPLOCK inside freeze protection, I
think.

So you'll need to do what ext4 needs to do: take freeze protection, take
fs-specific locks, and then call do_dax_fault(). Matthew has a patch to
actually export do_dax_fault() (as __dax_fault()) for filesystems.

Similarly, for xfs_filemap_page_mkwrite() you need to use
__block_page_mkwrite() as a callback (as ext4 does) and take freeze
protection early by hand.

								Honza

> +	error = dax_mkwrite(vma, vmf, xfs_get_blocks_dax,
> +			    xfs_get_blocks_dax_complete);
> +	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> +
> +	return error;
> +}
> +
> +static const struct vm_operations_struct xfs_file_dax_vm_ops = {
> +	.fault		= xfs_filemap_dax_fault,
> +	.page_mkwrite	= xfs_filemap_dax_page_mkwrite,
> +};
> +#else
> +#define xfs_file_dax_vm_ops	xfs_file_vm_ops
> +#endif /* CONFIG_FS_DAX */
> +
> +STATIC int
> +xfs_file_mmap(
> +	struct file	*filp,
> +	struct vm_area_struct *vma)
> +{
> +	file_accessed(filp);
> +	if (IS_DAX(file_inode(filp))) {
> +		vma->vm_ops = &xfs_file_dax_vm_ops;
> +		vma->vm_flags |= VM_MIXEDMAP;
> +	} else
> +		vma->vm_ops = &xfs_file_vm_ops;
> +	return 0;
> +}
> +
>  const struct file_operations xfs_file_operations = {
>  	.llseek		= xfs_file_llseek,
>  	.read		= new_sync_read,
> @@ -1497,8 +1554,21 @@ const struct file_operations xfs_dir_file_operations = {
>  	.fsync		= xfs_dir_fsync,
>  };
>  
> -static const struct vm_operations_struct xfs_file_vm_ops = {
> -	.fault		= xfs_filemap_fault,
> -	.map_pages	= filemap_map_pages,
> -	.page_mkwrite	= xfs_filemap_page_mkwrite,
> +#ifdef CONFIG_FS_DAX
> +const struct file_operations xfs_file_dax_operations = {
> +	.llseek		= xfs_file_llseek,
> +	.read		= new_sync_read,
> +	.write		= new_sync_write,
> +	.read_iter	= xfs_file_read_iter,
> +	.write_iter	= xfs_file_write_iter,
> +	.unlocked_ioctl	= xfs_file_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl	= xfs_file_compat_ioctl,
> +#endif
> +	.mmap		= xfs_file_mmap,
> +	.open		= xfs_file_open,
> +	.release	= xfs_file_release,
> +	.fsync		= xfs_file_fsync,
> +	.fallocate	= xfs_file_fallocate,
>  };
> +#endif /* CONFIG_FS_DAX */
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 8b9e688..9f38142 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1264,6 +1264,10 @@ xfs_setup_inode(
>  	case S_IFREG:
>  		inode->i_op = &xfs_inode_operations;
>  		inode->i_fop = &xfs_file_operations;
> +		if (IS_DAX(inode))
> +			inode->i_fop = &xfs_file_dax_operations;
> +		else
> +			inode->i_fop = &xfs_file_operations;
>  		inode->i_mapping->a_ops = &xfs_address_space_operations;
>  		break;
>  	case S_IFDIR:
> diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
> index a0f84ab..c08983e 100644
> --- a/fs/xfs/xfs_iops.h
> +++ b/fs/xfs/xfs_iops.h
> @@ -23,6 +23,12 @@ struct xfs_inode;
>  extern const struct file_operations xfs_file_operations;
>  extern const struct file_operations xfs_dir_file_operations;
>  
> +#ifdef CONFIG_FS_DAX
> +extern const struct file_operations xfs_file_dax_operations;
> +#else
> +#define xfs_file_dax_operations xfs_file_operations
> +#endif
> +
>  extern ssize_t xfs_vn_listxattr(struct dentry *, char *data, size_t size);
>  
>  /*
> -- 
> 2.0.0
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/6] xfs: add DAX file operations support
  2015-03-04 16:18     ` Jan Kara
@ 2015-03-04 22:00       ` Dave Chinner
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2015-03-04 22:00 UTC (permalink / raw)
  To: Jan Kara; +Cc: xfs, linux-fsdevel, willy

On Wed, Mar 04, 2015 at 05:18:48PM +0100, Jan Kara wrote:
> On Wed 04-03-15 10:30:24, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Add the initial support for DAX file operations to XFS. This
> > includes the necessary block allocation and mmap page fault hooks
> > for DAX to function.
> > 
> > Note that the current block allocation code abuses the mapping
> > buffer head to provide a completion callback for unwritten extent
> > allocation when DAX is clearing blocks. The DAX interface needs to
> > be changed to provide a callback similar to get_blocks for this
> > callback.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
.....
> > +static int
> > +xfs_filemap_dax_page_mkwrite(
> > +	struct vm_area_struct	*vma,
> > +	struct vm_fault		*vmf)
> > +{
> > +	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> > +	int			error;
> > +
> > +	trace_xfs_filemap_page_mkwrite(ip);
> > +
> > +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
>   So I think the lock ordering of XFS_MMAPLOCK and freezing protection is
> suspicious (and actually so is for normal write faults as I'm looking -
> didn't realize that when I was first reading your MMAPLOCK patches).
> Because you take XFS_MMAPLOCK outside of freeze protection however usually
> we want freeze protection to be the outermost lock - in particular in
> xfs_file_fallocate() you take XFS_MMAPLOCK inside freeze protection I
> think.

OK, so why isn't lockdep triggering on that? lockdep is aware of
inode locks and the freeze states, supposedly to pick up these exact
issues...

Oh, probably because the sb freeze order is write, pagefault,
transaction.

i.e. in the fallocate case, we do sb_start_write, then MMAP_LOCK. If a
freeze is in progress, we aren't going to freeze page faults until all
the writes have been frozen and have drained, so there isn't a lock order
dependency there. Same for any other mnt_want_write/sb_start_write
based modification.

Hence the fallocate path and anything that runs through setattr will
complete and release the mmap lock and then be prevented from taking
it again by the time sb_start_pagefault() can block with the mmap
lock held.  So there isn't actually a deadlock there because of the
way freeze works, and that's why lockdep is staying silent.

Still, I probably need to fix it so I'm not leaving a potential
landmine around.
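
To make the ordering argument above concrete, here is a minimal userspace
model in C. It is a sketch, not kernel code: sb_model, model_start(),
model_end() and freeze_advance() are hypothetical names, and the level
constants only mirror the write, pagefault, transaction order described
above. It assumes the freezer first refuses new entrants at a level and
only moves on to the next level once the existing holders have drained.

```c
#include <assert.h>
#include <stdbool.h>

/* Freeze levels in the order described above: write, pagefault, transaction. */
enum { UNFROZEN = 0, FREEZE_WRITE = 1, FREEZE_PAGEFAULT = 2, FREEZE_FS = 3 };

/* Hypothetical model of a superblock's freeze state; not a kernel structure. */
struct sb_model {
	int frozen;	/* highest level currently refused to new entrants */
	int holders[4];	/* active holders per level; index 0 is unused */
};

/* Models sb_start_write()/sb_start_pagefault(): false means "would block". */
static bool model_start(struct sb_model *sb, int level)
{
	if (sb->frozen >= level)
		return false;
	sb->holders[level]++;
	return true;
}

/* Models the matching sb_end_*() call for a holder at the given level. */
static void model_end(struct sb_model *sb, int level)
{
	assert(sb->holders[level] > 0);
	sb->holders[level]--;
}

/* Freezer: move to the next level only once the current one has drained. */
static bool freeze_advance(struct sb_model *sb)
{
	if (sb->frozen > UNFROZEN && sb->holders[sb->frozen] != 0)
		return false;	/* still draining the current level */
	if (sb->frozen >= FREEZE_FS)
		return false;	/* already fully frozen in this model */
	sb->frozen++;
	return true;
}
```

Walking the model: with one writer active, freeze_advance() reaches
FREEZE_WRITE but cannot reach FREEZE_PAGEFAULT, so a page fault calling
model_start(sb, FREEZE_PAGEFAULT) still succeeds. Only after the writer
calls model_end() can page faults be refused, which is why no writer can
be waiting on the mmap lock at that point.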

> So you'll need to do what ext4 needs to do - take freeze protection, take
> fs specific locks, and then call do_dax_fault(). Matthew has a patch to
> actually export do_dax_fault (as __dax_fault()) for filesystems.

Pointer to it? If none, I'll just write my own....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/6] xfs: add DAX file operations support
  2015-03-04 14:54         ` Boaz Harrosh
@ 2015-03-04 22:03           ` Dave Chinner
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2015-03-04 22:03 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: xfs, linux-fsdevel, jack, willy

On Wed, Mar 04, 2015 at 04:54:50PM +0200, Boaz Harrosh wrote:
> On 03/04/2015 03:01 PM, Dave Chinner wrote:
> > On Wed, Mar 04, 2015 at 12:09:40PM +0200, Boaz Harrosh wrote:
> <>
> > 
> > So, we definitely need splice to/from DAX enabled inodes to be
> > rejected. I'll have a look at that...
> > 
> 
> default_file_splice_read uses kernel_readv which I think might actually
> work. Do you know what xfstest(s) exercise splice?

We have a rudimentary one, and only because I discovered a while back
that none existed at all, i.e. splice is effectively untested by
xfstests. If you want to write some tests to exercise it, that'd be
great....
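
As a starting point for such a test, here is a minimal userspace sketch
in C that pushes file data through splice(2) into a pipe and checks what
comes out. It is not an xfstests test and does nothing DAX specific: the
helper name splice_roundtrip() and the /tmp path are assumptions made for
illustration, and the file would have to live on a DAX mount to exercise
the path discussed here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes to a temporary file, splice them from the file into a
 * pipe, read them back from the pipe and compare with the original data.
 * Returns 0 if the data round-trips intact, -1 on any failure.
 */
static int splice_roundtrip(const char *data, size_t len)
{
	char path[] = "/tmp/splice-sketch-XXXXXX";	/* assumed location */
	char buf[256];
	int pipefd[2];
	loff_t off = 0;
	ssize_t n;
	int fd, ret = -1;

	if (len == 0 || len > sizeof(buf))
		return -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);	/* the open fd keeps the file alive */

	if (pipe(pipefd) < 0) {
		close(fd);
		return -1;
	}

	if (write(fd, data, len) != (ssize_t)len)
		goto out;

	/* Explicit offset: splice from the start of the file into the pipe. */
	n = splice(fd, &off, pipefd[1], NULL, len, 0);
	if (n != (ssize_t)len)
		goto out;

	if (read(pipefd[0], buf, len) != (ssize_t)len)
		goto out;

	/* The data must match and the offset must have advanced by len. */
	if (memcmp(buf, data, len) == 0 && off == (loff_t)len)
		ret = 0;
out:
	close(pipefd[0]);
	close(pipefd[1]);
	close(fd);
	return ret;
}
```

Calling splice_roundtrip("dax splice test", 15) should return 0 on a
filesystem where splice reads work; a filesystem that rejects splice on
the inode would make the helper return -1 instead.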

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/6] dax: don't abuse get_block mapping for endio callbacks
  2015-03-04 15:54     ` Jan Kara
@ 2015-03-04 22:29       ` Dave Chinner
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2015-03-04 22:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: xfs, linux-fsdevel, willy

On Wed, Mar 04, 2015 at 04:54:08PM +0100, Jan Kara wrote:
> On Wed 04-03-15 10:30:22, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > @@ -269,7 +269,8 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
> >  }
> >  
> >  static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> > -			struct vm_area_struct *vma, struct vm_fault *vmf)
> > +			struct vm_area_struct *vma, struct vm_fault *vmf,
> > +			dax_iodone_t complete_unwritten)
> >  {
> >  	struct address_space *mapping = inode->i_mapping;
> >  	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
> > @@ -310,14 +311,14 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >   out:
> >  	i_mmap_unlock_read(mapping);
> >  
> > -	if (bh->b_end_io)
> > -		bh->b_end_io(bh, 1);
> > +	if (buffer_unwritten(bh))
> > +		complete_unwritten(bh, 1);
> >  
> >  	return error;
> >  }
>   So frankly I don't see a big point in passing completion callback into
> dax_insert_mapping() only to call the function at the end of it. We could
> as well call the completion function from do_dax_fault() where it would
> seem more natural to me. But I don't feel too strongly about this.

On further review, I think the code is incorrect as is, even without
this change - we shouldn't be running unwritten extent conversion
if the block zeroing failed. So this needs fixing anyway. I'll pull
the completion back to do_dax_fault(), where it will only be run if
there was no error inserting the mapping.
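
A minimal sketch of that control flow, as a userspace model in C rather
than the actual fs/dax.c change: model_bh, model_iodone_t,
model_insert_mapping() and model_do_dax_fault() are hypothetical stand-ins
for the buffer head, dax_iodone_t, dax_insert_mapping() and
do_dax_fault(). The only point it illustrates is that the completion
callback runs after the mapping insert and is skipped on error.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the mapping buffer head. */
struct model_bh {
	bool unwritten;		/* extent still marked unwritten */
	int completions;	/* how many times conversion ran */
};

/* Stand-in for the dax_iodone_t completion callback type. */
typedef void (*model_iodone_t)(struct model_bh *bh, int uptodate);

/* Models unwritten extent conversion on the filesystem side. */
static void model_complete_unwritten(struct model_bh *bh, int uptodate)
{
	(void)uptodate;
	bh->unwritten = false;
	bh->completions++;
}

/* Models block zeroing plus mapping insert; the caller injects the result. */
static int model_insert_mapping(struct model_bh *bh, int inject_error)
{
	(void)bh;
	return inject_error;
}

/* Completion is called here, and only when the insert succeeded. */
static int model_do_dax_fault(struct model_bh *bh, int inject_error,
			      model_iodone_t complete_unwritten)
{
	int error;

	assert(complete_unwritten);
	error = model_insert_mapping(bh, inject_error);
	if (error)
		return error;	/* leave the extent unwritten */

	if (bh->unwritten)
		complete_unwritten(bh, 1);
	return 0;
}
```

With an injected error the extent stays unwritten and the callback never
runs; without one, conversion runs exactly once.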

> Instead of the above I was also thinking about some way to pass information
> out of do_dax_fault() into filesystem so that it could just call completion
> handler itself but the completion callback is more standard interface I
> guess.

That seems unbalanced to me, as internal mapping state would need to
be leaked back out to the caller so they could run conversion. I
think it's cleaner to pass in the callback and leave all that
mapping state internal to do_dax_fault()....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/6] xfs: add DAX file operations support
  2015-03-04 22:00       ` Dave Chinner
@ 2015-03-05 11:05         ` Jan Kara
  -1 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2015-03-05 11:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, xfs, linux-fsdevel, willy

On Thu 05-03-15 09:00:05, Dave Chinner wrote:
> On Wed, Mar 04, 2015 at 05:18:48PM +0100, Jan Kara wrote:
> > On Wed 04-03-15 10:30:24, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Add the initial support for DAX file operations to XFS. This
> > > includes the necessary block allocation and mmap page fault hooks
> > > for DAX to function.
> > > 
> > > Note that the current block allocation code abuses the mapping
> > > buffer head to provide a completion callback for unwritten extent
> > > allocation when DAX is clearing blocks. The DAX interface needs to
> > > be changed to provide a callback similar to get_blocks for this
> > > callback.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> .....
> > > +static int
> > > +xfs_filemap_dax_page_mkwrite(
> > > +	struct vm_area_struct	*vma,
> > > +	struct vm_fault		*vmf)
> > > +{
> > > +	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> > > +	int			error;
> > > +
> > > +	trace_xfs_filemap_page_mkwrite(ip);
> > > +
> > > +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> >   So I think the lock ordering of XFS_MMAPLOCK and freezing protection is
> > suspicious (and actually so is for normal write faults as I'm looking -
> > didn't realize that when I was first reading your MMAPLOCK patches).
> > Because you take XFS_MMAPLOCK outside of freeze protection however usually
> > we want freeze protection to be the outermost lock - in particular in
> > xfs_file_fallocate() you take XFS_MMAPLOCK inside freeze protection I
> > think.
> 
> OK, so why isn't lockdep triggering on that? lockdep is aware of
> inode locks and the freeze states, supposedly to pick up these exact
> issues...
> 
> Oh, probably because the sb freeze order is write, pagefault,
> transaction.
> 
> i.e. In the fallocate case, we do sb_start_write, MMAP_LOCK. If we are in
> a freeze case, we aren't going to freeze page faults until we've
> frozen all the writes have drained, so there isn't a lock order
> dependency there. Same for any other mnt_want_write/sb-start_write
> based modification. 
> 
> Hence the fallocate path and anything that runs through setattr will
> complete and release the mmap lock and then be prevented from taking
> it again by the time sb_start_pagefault() can block with the mmap
> lock held.  So there isn't actually a deadlock there because of the
> way freeze works, and that's why lockdep is staying silent.
  Yeah, you're right, there isn't a deadlock possibility. After all, the lock
ranking of your MMAP_LOCK is currently the same as that of mmap_sem (and the
difficult lock ordering of that semaphore is the reason why we have a
special type of freeze protection for page faults).

> Still, I probably need to fix it so I'm not leaving a potential
> landmine around.
  I would find it easier to grasp. Yes.
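
The ordering being agreed on here (freeze protection outermost, then the
filesystem lock, then the fault handler) can be sketched as a small
userspace model in C. This is an illustration under stated assumptions,
not the eventual XFS code: every stub_*() function is a hypothetical
placeholder that only records that it was called, standing in for
sb_start_pagefault()/sb_end_pagefault(), the XFS_MMAPLOCK lock and unlock,
and the exported fault helper.

```c
#include <assert.h>
#include <string.h>

#define MAX_EVENTS 8

/* Records the order in which the stubs below were called. */
struct trace {
	const char *ev[MAX_EVENTS];
	int n;
};

static void record(struct trace *t, const char *what)
{
	assert(t->n < MAX_EVENTS);
	t->ev[t->n++] = what;
}

/* Hypothetical stubs: each one only logs that it ran. */
static void stub_start_pagefault(struct trace *t) { record(t, "freeze:start"); }
static void stub_end_pagefault(struct trace *t) { record(t, "freeze:end"); }
static void stub_mmaplock(struct trace *t) { record(t, "mmaplock:lock"); }
static void stub_mmapunlock(struct trace *t) { record(t, "mmaplock:unlock"); }

static int stub_dax_fault(struct trace *t, int result)
{
	record(t, "fault");
	return result;
}

/* Freeze protection is taken first and dropped last. */
static int model_dax_page_mkwrite(struct trace *t, int fault_result)
{
	int error;

	stub_start_pagefault(t);
	stub_mmaplock(t);
	error = stub_dax_fault(t, fault_result);
	stub_mmapunlock(t);
	stub_end_pagefault(t);

	return error;
}
```

The recorded trace makes the nesting explicit: the mmap lock is only ever
held inside freeze protection, which matches the order used on the
fallocate side.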

> > So you'll need to do what ext4 needs to do - take freeze protection, take
> > fs specific locks, and then call do_dax_fault(). Matthew has a patch to
> > actually export do_dax_fault (as __dax_fault()) for filesystems.
> 
> pointer to it? if none, I'll just write my own....
  http://permalink.gmane.org/gmane.comp.file-systems.ext4/47866

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/6] xfs: add DAX file operations support
  2015-03-05 11:05         ` Jan Kara
@ 2015-03-22 23:02           ` Dave Chinner
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2015-03-22 23:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: xfs, linux-fsdevel, willy

On Thu, Mar 05, 2015 at 12:05:04PM +0100, Jan Kara wrote:
> On Thu 05-03-15 09:00:05, Dave Chinner wrote:
> > On Wed, Mar 04, 2015 at 05:18:48PM +0100, Jan Kara wrote:
> > > On Wed 04-03-15 10:30:24, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Add the initial support for DAX file operations to XFS. This
> > > > includes the necessary block allocation and mmap page fault hooks
> > > > for DAX to function.
> > > > 
> > > > Note that the current block allocation code abuses the mapping
> > > > buffer head to provide a completion callback for unwritten extent
> > > > allocation when DAX is clearing blocks. The DAX interface needs to
> > > > be changed to provide a callback similar to get_blocks for this
> > > > callback.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > .....
> > > > +static int
> > > > +xfs_filemap_dax_page_mkwrite(
> > > > +	struct vm_area_struct	*vma,
> > > > +	struct vm_fault		*vmf)
> > > > +{
> > > > +	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> > > > +	int			error;
> > > > +
> > > > +	trace_xfs_filemap_page_mkwrite(ip);
> > > > +
> > > > +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> > >   So I think the lock ordering of XFS_MMAPLOCK and freezing protection is
> > > suspicious (and actually so is for normal write faults as I'm looking -
> > > didn't realize that when I was first reading your MMAPLOCK patches).
> > > Because you take XFS_MMAPLOCK outside of freeze protection however usually
> > > we want freeze protection to be the outermost lock - in particular in
> > > xfs_file_fallocate() you take XFS_MMAPLOCK inside freeze protection I
> > > think.
> > 
> > OK, so why isn't lockdep triggering on that? lockdep is aware of
> > inode locks and the freeze states, supposedly to pick up these exact
> > issues...
> > 
> > Oh, probably because the sb freeze order is write, pagefault,
> > transaction.
> > 
> > i.e. In the fallocate case, we do sb_start_write, MMAP_LOCK. If we are in
> > a freeze case, we aren't going to freeze page faults until we've
> > frozen all the writes have drained, so there isn't a lock order
> > dependency there. Same for any other mnt_want_write/sb-start_write
> > based modification. 
> > 
> > Hence the fallocate path and anything that runs through setattr will
> > complete and release the mmap lock and then be prevented from taking
> > it again by the time sb_start_pagefault() can block with the mmap
> > lock held.  So there isn't actually a deadlock there because of the
> > way freeze works, and that's why lockdep is staying silent.
>   Yeah, you're right there isn't a deadlock possibility. After all the lock
> ranking of your MMAP_LOCk is currently the same as of mmap_sem (and the
> difficult lock ordering of that semaphore has been the reason why we have
> special type of freeze protection for page faults).
> 
> > Still, I probably need to fix it so I'm not leaving a potential
> > landmine around.
>   I would find it easier to grasp. Yes.

Finally getting back to this. Fixed this, but...
> 
> > > So you'll need to do what ext4 needs to do - take freeze protection, take
> > > fs specific locks, and then call do_dax_fault(). Matthew has a patch to
> > > actually export do_dax_fault (as __dax_fault()) for filesystems.
> > 
> > pointer to it? if none, I'll just write my own....
>   http://permalink.gmane.org/gmane.comp.file-systems.ext4/47866

I can't find any followup to this patch. Is it in any tree anywhere?

Right now, I've just pulled the DAX infrastructure part of the patch
into my series and modified it to suit, because I've fixed the
unwritten extent conversion problem differently (i.e. the extra
callback) and ext4 needs a lot more help to fix its problems than
that patch provides.

I'll post the patches when I've at least smoke tested them....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/6] xfs: add DAX file operations support
  2015-03-04 22:03           ` Dave Chinner
@ 2015-03-24  4:27             ` Dave Chinner
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2015-03-24  4:27 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: xfs, linux-fsdevel, jack, willy

On Thu, Mar 05, 2015 at 09:03:48AM +1100, Dave Chinner wrote:
> On Wed, Mar 04, 2015 at 04:54:50PM +0200, Boaz Harrosh wrote:
> > On 03/04/2015 03:01 PM, Dave Chinner wrote:
> > > On Wed, Mar 04, 2015 at 12:09:40PM +0200, Boaz Harrosh wrote:
> > <>
> > > 
> > > So, we definitely need splice to/from DAX enabled inodes to be
> > > rejected. I'll have a look at that...
> > > 
> > 
> > default_file_splice_read uses kernel_readv which I think might actually
> > work. Do you know what xfstest(s) exercise splice?
> 
> We have a rudimentary one only because I discovered a while back
> none existed at all. i.e. splice is effectively untested by
> xfstests. If you want to write some tests to exercise it, that'd be
> great....

Turns out there's no great need to write splice tests for xfstests -
the current loopback device uses splice, and so all of the tests
that run on loopback are exercising the splice path through the
filesystem.

I found this out by disabling splice on dax altogether, and then finding out
that lots of tests failed badly, then narrowing it down to:

$ sudo mount -o dax /dev/ram0 /mnt/test
$ sudo mkfs.xfs -dfile,name=/mnt/test/foo1,size=1g
meta-data=/mnt/test/foo1         isize=512    agcount=4, agsize=65536 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o loop /mnt/test/foo1 /mnt/test/foo
mount: /dev/loop0: can't read superblock
$

because the splice read returned EINVAL rather than data. So, yes,
splice can be made to work with dax if we pass it through the paths
that aren't interacting directly with the page cache.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/6] xfs: add DAX file operations support
  2015-03-24  4:27             ` Dave Chinner
@ 2015-03-24  7:01               ` Christoph Hellwig
  -1 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2015-03-24  7:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Boaz Harrosh, xfs, linux-fsdevel, jack, willy

On Tue, Mar 24, 2015 at 03:27:27PM +1100, Dave Chinner wrote:
> Turns out there's no great need to write splice tests for xfstests -
> the current loopback device uses splice, and so all of the tests
> that run on loopback are exercising the splice path through the
> filesystem.

FYI, we're getting rid of the splice read (ab-)use in loop, so this
won't last forever.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/6] xfs: add DAX file operations support
  2015-03-24  4:27             ` Dave Chinner
@ 2015-03-24  8:13               ` Boaz Harrosh
  -1 siblings, 0 replies; 40+ messages in thread
From: Boaz Harrosh @ 2015-03-24  8:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, jack, willy

On 03/24/2015 06:27 AM, Dave Chinner wrote:
> On Thu, Mar 05, 2015 at 09:03:48AM +1100, Dave Chinner wrote:
>> On Wed, Mar 04, 2015 at 04:54:50PM +0200, Boaz Harrosh wrote:
>>> On 03/04/2015 03:01 PM, Dave Chinner wrote:
>>>> On Wed, Mar 04, 2015 at 12:09:40PM +0200, Boaz Harrosh wrote:
>>> <>
>>>>
>>>> So, we definitely need splice to/from DAX enabled inodes to be
>>>> rejected. I'll have a look at that...
>>>>
>>>
>>> default_file_splice_read uses kernel_readv which I think might actually
>>> work. Do you know what xfstest(s) exercise splice?
>>
>> We have a rudimentary one only because I discovered a while back
>> none existed at all. i.e. splice is effectively untested by
>> xfstests. If you want to write some tests to exercise it, that'd be
>> great....
> 
> Turns out there's no great need to write splice tests for xfstests -
> the current loopback device uses splice, and so all of the tests
> that run on loopback are exercising the splice path through the
> filesystem.
> 
> I found this out by disabling splice on dax altogether, and then finding out
> that lots of tests failed badly, then narrowing it down to:
> 
> $ sudo mount -o dax /dev/ram0 /mnt/test
> $ sudo mkfs.xfs -dfile,name=/mnt/test/foo1,size=1g
> meta-data=/mnt/test/foo1         isize=512    agcount=4, agsize=65536 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1
> data     =                       bsize=4096   blocks=262144, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=2560, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> $ sudo mount -o loop /mnt/test/foo1 /mnt/test/foo
> mount: /dev/loop0: can't read superblock
> $
> 
> because the splice read returned EINVAL rather than data. So, yes,
> splice can be made to work with dax if we pass it through the paths
> that aren't interacting directly with the page cache.
> 

Cool, so the current dax code actually does support splice by using
default_file_splice_read/write indirectly.

Therefore I think there is merit in keeping just the one
file_operations vector pointing to an internal function and
doing an if (IS_DAX()) default_file_splice_read/write()
at run time.

Because with the current code, if CONFIG_FS_DAX is enabled at compile
time, then regular non-DAX mounts on hard disks will also use the slow
default_file_splice_read instead of whatever better implementation
the FS provides.
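
To make the run-time switch concrete, here is a small userspace model in
C. It is a sketch under stated assumptions: struct model_inode, the enum
values and model_fs_splice_read() are hypothetical names invented for
illustration, and the enum only labels which kernel helper would be
chosen; no real kernel API is called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel inode; only the DAX flag is modelled. */
struct model_inode {
	bool is_dax;	/* models the result of IS_DAX(inode) */
};

/* Labels for the two splice read implementations discussed in the thread. */
enum splice_path {
	SPLICE_PATH_PAGECACHE,	/* models generic_file_splice_read() */
	SPLICE_PATH_DEFAULT,	/* models default_file_splice_read() */
};

/*
 * Single entry point that one file_operations vector would point at.
 * The choice is made per inode at run time, so non-DAX inodes keep the
 * page cache path even when DAX support is compiled in.
 */
static enum splice_path model_fs_splice_read(const struct model_inode *inode)
{
	assert(inode != NULL);

	if (inode->is_dax)
		return SPLICE_PATH_DEFAULT;
	return SPLICE_PATH_PAGECACHE;
}
```

The same shape would apply on the write side, with the DAX branch
selecting the default splice write helper.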

Do you think we should do the IS_DAX() switch in
generic_file_splice_read and iter_file_splice_write to
fix all the filesystems in one go, or point to an internal FS function
and do the switch there? Please advise.

I will send a patch that fixes up ext2 and ext4, to see how it
looks.

> Cheers,
> Dave.
> 

Thanks Dave
Boaz


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2015-03-24  8:13 UTC | newest]

Thread overview: 40+ messages
2015-03-03 23:30 [RFC PATCH 0/6] xfs: DAX support Dave Chinner
2015-03-03 23:30 ` Dave Chinner
2015-03-03 23:30 ` [PATCH 1/6] dax: don't abuse get_block mapping for endio callbacks Dave Chinner
2015-03-03 23:30   ` Dave Chinner
2015-03-04 15:54   ` Jan Kara
2015-03-04 15:54     ` Jan Kara
2015-03-04 22:29     ` Dave Chinner
2015-03-04 22:29       ` Dave Chinner
2015-03-03 23:30 ` [PATCH 2/6] xfs: add DAX block zeroing support Dave Chinner
2015-03-03 23:30   ` Dave Chinner
2015-03-03 23:30 ` [PATCH 3/6] xfs: add DAX file operations support Dave Chinner
2015-03-03 23:30   ` Dave Chinner
2015-03-04 10:09   ` Boaz Harrosh
2015-03-04 10:09     ` Boaz Harrosh
2015-03-04 13:01     ` Dave Chinner
2015-03-04 13:01       ` Dave Chinner
2015-03-04 14:54       ` Boaz Harrosh
2015-03-04 14:54         ` Boaz Harrosh
2015-03-04 22:03         ` Dave Chinner
2015-03-04 22:03           ` Dave Chinner
2015-03-24  4:27           ` Dave Chinner
2015-03-24  4:27             ` Dave Chinner
2015-03-24  7:01             ` Christoph Hellwig
2015-03-24  7:01               ` Christoph Hellwig
2015-03-24  8:13             ` Boaz Harrosh
2015-03-24  8:13               ` Boaz Harrosh
2015-03-04 16:18   ` Jan Kara
2015-03-04 16:18     ` Jan Kara
2015-03-04 22:00     ` Dave Chinner
2015-03-04 22:00       ` Dave Chinner
2015-03-05 11:05       ` Jan Kara
2015-03-05 11:05         ` Jan Kara
2015-03-22 23:02         ` Dave Chinner
2015-03-22 23:02           ` Dave Chinner
2015-03-03 23:30 ` [PATCH 4/6] xfs: add DAX truncate support Dave Chinner
2015-03-03 23:30   ` Dave Chinner
2015-03-03 23:30 ` [PATCH 5/6] xfs: add DAX IO path support Dave Chinner
2015-03-03 23:30   ` Dave Chinner
2015-03-03 23:30 ` [PATCH 6/6] xfs: add initial DAX support Dave Chinner
2015-03-03 23:30   ` Dave Chinner