All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/8 v2] xfs: DAX support
@ 2015-03-24 10:50 ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:50 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

Hi folks,

This is an updated version of the XFS DAX support patchset. It was
first posted here:

http://oss.sgi.com/pipermail/xfs/2015-March/040804.html

This version of the patchset sits on top of the current for-next
branch of the XFS dev tree.

The series has grown slightly as a result of the first round of
reviews. The initial patch is a new patch that reworks the new XFS
mmap lock to nest correctly inside freeze notifications, as noticed
and requested by Jan Kara.

The second patch has been updated to move unwritten extent
conversion back into do_dax_fault so the completion function does
not need ot be passed around any further.

The third patch is also a new patch, derived from Willy's patch to
allow ext4 to correctly nest locking and freeze operations during
page faults. I gutted that patch down to the bare infrastructure
that was necessary for XFS, and havent included any of the ext4
changes because the unwritten extent conversion problems are dealt
with by the previous patch.

The rest of the patch set is updating the XFS support to use the new
infrastructure and changed mmap locking order via __dax_fault. Most
of the changes fall in the file operations patch (patch 5). 

The main changes to patch 5 are that it no longer has separate
operations structures. I validated that splice has paths that don't
require page cache manipulations and switched XFS to use them (as
pointed out by Boaz) and that solved one of the duplicate operations
structures. All of the .fault and .page_mkwrite code were also made
common and that got rid of all the other ifdef CONFIG_DAX hackery
left in the code.

Overall, it's a much cleaner set of changes than in V1. it passes
xfstests without any significant regressions (the only problems are
to do with extent layouts differing from expected patterns) so I'm
happy that it won't immediately go feral and start eating babies for
lunch.

Comments, thoughts and flames are welcome, especially for the DAX
infrastructure changes.

-Dave.

Diffstat:

 fs/dax.c               |  30 ++++++++---
 fs/ext2/file.c         |   4 +-
 fs/ext4/file.c         |  16 +++++-
 fs/ext4/inode.c        |  21 +++-----
 fs/xfs/xfs_aops.c      | 108 +++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_aops.h      |   7 ++-
 fs/xfs/xfs_bmap_util.c |  23 +++++++--
 fs/xfs/xfs_file.c      | 143 +++++++++++++++++++++++++++++++--------------------
 fs/xfs/xfs_iops.c      |  30 ++++++-----
 fs/xfs/xfs_mount.h     |   2 +
 fs/xfs/xfs_super.c     |  25 ++++++++-
 include/linux/fs.h     |   9 +++-
 12 files changed, 303 insertions(+), 115 deletions(-)


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 0/8 v2] xfs: DAX support
@ 2015-03-24 10:50 ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:50 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

Hi folks,

This is an updated version of the XFS DAX support patchset. It was
first posted here:

http://oss.sgi.com/pipermail/xfs/2015-March/040804.html

This version of the patchset sits on top of the current for-next
branch of the XFS dev tree.

The series has grown slightly as a result of the first round of
reviews. The initial patch is a new patch that reworks the new XFS
mmap lock to nest correctly inside freeze notifications, as noticed
and requested by Jan Kara.

The second patch has been updated to move unwritten extent
conversion back into do_dax_fault so the completion function does
not need ot be passed around any further.

The third patch is also a new patch, derived from Willy's patch to
allow ext4 to correctly nest locking and freeze operations during
page faults. I gutted that patch down to the bare infrastructure
that was necessary for XFS, and havent included any of the ext4
changes because the unwritten extent conversion problems are dealt
with by the previous patch.

The rest of the patch set is updating the XFS support to use the new
infrastructure and changed mmap locking order via __dax_fault. Most
of the changes fall in the file operations patch (patch 5). 

The main changes to patch 5 are that it no longer has separate
operations structures. I validated that splice has paths that don't
require page cache manipulations and switched XFS to use them (as
pointed out by Boaz) and that solved one of the duplicate operations
structures. All of the .fault and .page_mkwrite code were also made
common and that got rid of all the other ifdef CONFIG_DAX hackery
left in the code.

Overall, it's a much cleaner set of changes than in V1. it passes
xfstests without any significant regressions (the only problems are
to do with extent layouts differing from expected patterns) so I'm
happy that it won't immediately go feral and start eating babies for
lunch.

Comments, thoughts and flames are welcome, especially for the DAX
infrastructure changes.

-Dave.

Diffstat:

 fs/dax.c               |  30 ++++++++---
 fs/ext2/file.c         |   4 +-
 fs/ext4/file.c         |  16 +++++-
 fs/ext4/inode.c        |  21 +++-----
 fs/xfs/xfs_aops.c      | 108 +++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_aops.h      |   7 ++-
 fs/xfs/xfs_bmap_util.c |  23 +++++++--
 fs/xfs/xfs_file.c      | 143 +++++++++++++++++++++++++++++++--------------------
 fs/xfs/xfs_iops.c      |  30 ++++++-----
 fs/xfs/xfs_mount.h     |   2 +
 fs/xfs/xfs_super.c     |  25 ++++++++-
 include/linux/fs.h     |   9 +++-
 12 files changed, 303 insertions(+), 115 deletions(-)

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 1/8] xfs: mmap lock needs to be inside freeze protection
  2015-03-24 10:50 ` Dave Chinner
  (?)
@ 2015-03-24 10:50 ` Dave Chinner
  2015-04-01 14:34     ` Jan Kara
  2015-04-06 17:48     ` Brian Foster
  -1 siblings, 2 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:50 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

Lock ordering for the new mmap lock needs to be:

mmap_sem
  sb_start_pagefault
    i_mmap_lock
      page lock
        <fault processsing>

Right now xfs_vm_page_mkwrite gets this the wrong way around,
While technically it cannot deadlock due to the current freeze
ordering, it's still a landmine that might explode if we change
anything in future. Hence we need to nest the locks correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index dc5f609..a4c882e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1449,15 +1449,20 @@ xfs_filemap_page_mkwrite(
 	struct vm_fault		*vmf)
 {
 	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
-	int			error;
+	int			ret;
 
 	trace_xfs_filemap_page_mkwrite(ip);
 
+	sb_start_pagefault(VFS_I(ip)->i_sb);
+	file_update_time(vma->vm_file);
 	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
-	error = block_page_mkwrite(vma, vmf, xfs_get_blocks);
+
+	ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
+
 	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+	sb_end_pagefault(VFS_I(ip)->i_sb);
 
-	return error;
+	return block_page_mkwrite_return(ret);
 }
 
 const struct file_operations xfs_file_operations = {
-- 
2.0.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 2/8] dax: don't abuse get_block mapping for endio callbacks
  2015-03-24 10:50 ` Dave Chinner
@ 2015-03-24 10:51   ` Dave Chinner
  -1 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

dax_fault() currently relies on the get_block callback to attach an
io completion callback to the mapping buffer head so that it can
run unwritten extent conversion after zeroing allocated blocks.

Instead of this hack, pass the conversion callback directly into
dax_fault() similar to the get_block callback. When the filesystem
allocates unwritten extents, it will set the buffer_unwritten()
flag, and hence the dax_fault code can call the completion function
in the contexts where it is necessary without overloading the
mapping buffer head.

Note: The changes to ext4 to use this interface are suspect at best.
In fact, the way ext4 did this end_io assignment in the first place
looks suspect because it only set a completion callback when there
wasn't already some other write() call taking place on the same
inode. The ext4 end_io code looks rather intricate and fragile with
all it's reference counting and passing to different contexts for
modification via inode private pointers that aren't protected by
locks...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dax.c           | 17 +++++++++++------
 fs/ext2/file.c     |  4 ++--
 fs/ext4/file.c     | 16 ++++++++++++++--
 fs/ext4/inode.c    | 21 +++++++--------------
 include/linux/fs.h |  6 ++++--
 5 files changed, 38 insertions(+), 26 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ed1619e..431ec2b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -310,14 +310,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
  out:
 	i_mmap_unlock_read(mapping);
 
-	if (bh->b_end_io)
-		bh->b_end_io(bh, 1);
-
 	return error;
 }
 
 static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
-			get_block_t get_block)
+			get_block_t get_block, dax_iodone_t complete_unwritten)
 {
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
@@ -418,7 +415,15 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		page_cache_release(page);
 	}
 
+	/*
+	 * If we successfully insert the new mapping over an unwritten extent,
+	 * we need to ensure we convert the unwritten extent. If there is an
+	 * error inserting the mapping, we leave the extent as unwritten to
+	 * prevent exposure of the stale underlying data to userspace.
+	 */
 	error = dax_insert_mapping(inode, &bh, vma, vmf);
+	if (!error && buffer_unwritten(&bh))
+		complete_unwritten(&bh, 1);
 
  out:
 	if (error == -ENOMEM)
@@ -446,7 +451,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
  * fault handler for DAX files.
  */
 int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
-			get_block_t get_block)
+	      get_block_t get_block, dax_iodone_t complete_unwritten)
 {
 	int result;
 	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
@@ -455,7 +460,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
 	}
-	result = do_dax_fault(vma, vmf, get_block);
+	result = do_dax_fault(vma, vmf, get_block, complete_unwritten);
 	if (vmf->flags & FAULT_FLAG_WRITE)
 		sb_end_pagefault(sb);
 
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index e317017..8da747a 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -28,12 +28,12 @@
 #ifdef CONFIG_FS_DAX
 static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_fault(vma, vmf, ext2_get_block);
+	return dax_fault(vma, vmf, ext2_get_block, NULL);
 }
 
 static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_mkwrite(vma, vmf, ext2_get_block);
+	return dax_mkwrite(vma, vmf, ext2_get_block, NULL);
 }
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 33a09da..f7dabb1 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -192,15 +192,27 @@ errout:
 }
 
 #ifdef CONFIG_FS_DAX
+static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
+{
+	struct inode *inode = bh->b_assoc_map->host;
+	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
+	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
+	int err;
+	if (!uptodate)
+		return;
+	WARN_ON(!buffer_unwritten(bh));
+	err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
+}
+
 static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_fault(vma, vmf, ext4_get_block);
+	return dax_fault(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
 					/* Is this the right get_block? */
 }
 
 static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_mkwrite(vma, vmf, ext4_get_block);
+	return dax_mkwrite(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
 }
 
 static const struct vm_operations_struct ext4_dax_vm_ops = {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5cb9a21..43433de 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -657,18 +657,6 @@ has_zeroout:
 	return retval;
 }
 
-static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
-{
-	struct inode *inode = bh->b_assoc_map->host;
-	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
-	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
-	int err;
-	if (!uptodate)
-		return;
-	WARN_ON(!buffer_unwritten(bh));
-	err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
-}
-
 /* Maximum number of blocks we map for direct IO at once. */
 #define DIO_MAX_BLOCKS 4096
 
@@ -706,10 +694,15 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 
 		map_bh(bh, inode->i_sb, map.m_pblk);
 		bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
-		if (IS_DAX(inode) && buffer_unwritten(bh) && !io_end) {
+		if (IS_DAX(inode) && buffer_unwritten(bh)) {
+			/*
+			 * dgc: I suspect unwritten conversion on ext4+DAX is
+			 * fundamentally broken here when there are concurrent
+			 * read/write in progress on this inode.
+			 */
+			WARN_ON_ONCE(io_end);
 			bh->b_assoc_map = inode->i_mapping;
 			bh->b_private = (void *)(unsigned long)iblock;
-			bh->b_end_io = ext4_end_io_unwritten;
 		}
 		if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
 			set_buffer_defer_completion(bh);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 937e280..82100ae 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -70,6 +70,7 @@ typedef int (get_block_t)(struct inode *inode, sector_t iblock,
 			struct buffer_head *bh_result, int create);
 typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 			ssize_t bytes, void *private);
+typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
 
 #define MAY_EXEC		0x00000001
 #define MAY_WRITE		0x00000002
@@ -2603,8 +2604,9 @@ ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
-#define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
+		dax_iodone_t);
+#define dax_mkwrite(vma, vmf, gb, iod)	dax_fault(vma, vmf, gb, iod)
 
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 2/8] dax: don't abuse get_block mapping for endio callbacks
@ 2015-03-24 10:51   ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

dax_fault() currently relies on the get_block callback to attach an
io completion callback to the mapping buffer head so that it can
run unwritten extent conversion after zeroing allocated blocks.

Instead of this hack, pass the conversion callback directly into
dax_fault() similar to the get_block callback. When the filesystem
allocates unwritten extents, it will set the buffer_unwritten()
flag, and hence the dax_fault code can call the completion function
in the contexts where it is necessary without overloading the
mapping buffer head.

Note: The changes to ext4 to use this interface are suspect at best.
In fact, the way ext4 did this end_io assignment in the first place
looks suspect because it only set a completion callback when there
wasn't already some other write() call taking place on the same
inode. The ext4 end_io code looks rather intricate and fragile with
all it's reference counting and passing to different contexts for
modification via inode private pointers that aren't protected by
locks...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dax.c           | 17 +++++++++++------
 fs/ext2/file.c     |  4 ++--
 fs/ext4/file.c     | 16 ++++++++++++++--
 fs/ext4/inode.c    | 21 +++++++--------------
 include/linux/fs.h |  6 ++++--
 5 files changed, 38 insertions(+), 26 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ed1619e..431ec2b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -310,14 +310,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
  out:
 	i_mmap_unlock_read(mapping);
 
-	if (bh->b_end_io)
-		bh->b_end_io(bh, 1);
-
 	return error;
 }
 
 static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
-			get_block_t get_block)
+			get_block_t get_block, dax_iodone_t complete_unwritten)
 {
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
@@ -418,7 +415,15 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		page_cache_release(page);
 	}
 
+	/*
+	 * If we successfully insert the new mapping over an unwritten extent,
+	 * we need to ensure we convert the unwritten extent. If there is an
+	 * error inserting the mapping, we leave the extent as unwritten to
+	 * prevent exposure of the stale underlying data to userspace.
+	 */
 	error = dax_insert_mapping(inode, &bh, vma, vmf);
+	if (!error && buffer_unwritten(&bh))
+		complete_unwritten(&bh, 1);
 
  out:
 	if (error == -ENOMEM)
@@ -446,7 +451,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
  * fault handler for DAX files.
  */
 int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
-			get_block_t get_block)
+	      get_block_t get_block, dax_iodone_t complete_unwritten)
 {
 	int result;
 	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
@@ -455,7 +460,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
 	}
-	result = do_dax_fault(vma, vmf, get_block);
+	result = do_dax_fault(vma, vmf, get_block, complete_unwritten);
 	if (vmf->flags & FAULT_FLAG_WRITE)
 		sb_end_pagefault(sb);
 
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index e317017..8da747a 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -28,12 +28,12 @@
 #ifdef CONFIG_FS_DAX
 static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_fault(vma, vmf, ext2_get_block);
+	return dax_fault(vma, vmf, ext2_get_block, NULL);
 }
 
 static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_mkwrite(vma, vmf, ext2_get_block);
+	return dax_mkwrite(vma, vmf, ext2_get_block, NULL);
 }
 
 static const struct vm_operations_struct ext2_dax_vm_ops = {
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 33a09da..f7dabb1 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -192,15 +192,27 @@ errout:
 }
 
 #ifdef CONFIG_FS_DAX
+static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
+{
+	struct inode *inode = bh->b_assoc_map->host;
+	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
+	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
+	int err;
+	if (!uptodate)
+		return;
+	WARN_ON(!buffer_unwritten(bh));
+	err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
+}
+
 static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_fault(vma, vmf, ext4_get_block);
+	return dax_fault(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
 					/* Is this the right get_block? */
 }
 
 static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	return dax_mkwrite(vma, vmf, ext4_get_block);
+	return dax_mkwrite(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
 }
 
 static const struct vm_operations_struct ext4_dax_vm_ops = {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5cb9a21..43433de 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -657,18 +657,6 @@ has_zeroout:
 	return retval;
 }
 
-static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
-{
-	struct inode *inode = bh->b_assoc_map->host;
-	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
-	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
-	int err;
-	if (!uptodate)
-		return;
-	WARN_ON(!buffer_unwritten(bh));
-	err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
-}
-
 /* Maximum number of blocks we map for direct IO at once. */
 #define DIO_MAX_BLOCKS 4096
 
@@ -706,10 +694,15 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 
 		map_bh(bh, inode->i_sb, map.m_pblk);
 		bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
-		if (IS_DAX(inode) && buffer_unwritten(bh) && !io_end) {
+		if (IS_DAX(inode) && buffer_unwritten(bh)) {
+			/*
+			 * dgc: I suspect unwritten conversion on ext4+DAX is
+			 * fundamentally broken here when there are concurrent
+			 * read/write in progress on this inode.
+			 */
+			WARN_ON_ONCE(io_end);
 			bh->b_assoc_map = inode->i_mapping;
 			bh->b_private = (void *)(unsigned long)iblock;
-			bh->b_end_io = ext4_end_io_unwritten;
 		}
 		if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
 			set_buffer_defer_completion(bh);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 937e280..82100ae 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -70,6 +70,7 @@ typedef int (get_block_t)(struct inode *inode, sector_t iblock,
 			struct buffer_head *bh_result, int create);
 typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 			ssize_t bytes, void *private);
+typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
 
 #define MAY_EXEC		0x00000001
 #define MAY_WRITE		0x00000002
@@ -2603,8 +2604,9 @@ ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 int dax_clear_blocks(struct inode *, sector_t block, long size);
 int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
-#define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
+		dax_iodone_t);
+#define dax_mkwrite(vma, vmf, gb, iod)	dax_fault(vma, vmf, gb, iod)
 
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
-- 
2.0.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 3/8] dax: expose __dax_fault for filesystems with locking constraints
  2015-03-24 10:50 ` Dave Chinner
@ 2015-03-24 10:51   ` Dave Chinner
  -1 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

Some filesystems cannot call dax_fault() directly because they have
different locking and/or allocation constraints in the page fault IO
path. To handle this, we need to follow the same model as the
generic block_page_mkwrite code, where the internals are exposed via
__block_page_mkwrite() so that filesystems can wrap the correct
locking and operations around the outside.

This is loosely based on a patch originally from Matthew Willcox.
Unlike the original patch, it does not change ext4 code, error
returns or unwritten extent conversion handling.  It also adds a
__dax_mkwrite() wrapper for .page_mkwrite implementations to do the
right thing, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dax.c           | 15 +++++++++++++--
 include/linux/fs.h |  5 ++++-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 431ec2b..0121f7d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -313,7 +313,17 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	return error;
 }
 
-static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+/**
+ * __dax_fault - handle a page fault on a DAX file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * fault handler for DAX files. __dax_fault() assumes the caller has done all
+ * the necessary locking for the page fault to proceed successfully.
+ */
+int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			get_block_t get_block, dax_iodone_t complete_unwritten)
 {
 	struct file *file = vma->vm_file;
@@ -440,6 +450,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	}
 	goto out;
 }
+EXPORT_SYMBOL(__dax_fault);
 
 /**
  * dax_fault - handle a page fault on a DAX file
@@ -460,7 +471,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
 	}
-	result = do_dax_fault(vma, vmf, get_block, complete_unwritten);
+	result = __dax_fault(vma, vmf, get_block, complete_unwritten);
 	if (vmf->flags & FAULT_FLAG_WRITE)
 		sb_end_pagefault(sb);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 82100ae..7e5a2d6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2606,7 +2606,10 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
 		dax_iodone_t);
-#define dax_mkwrite(vma, vmf, gb, iod)	dax_fault(vma, vmf, gb, iod)
+int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
+		dax_iodone_t);
+#define dax_mkwrite(vma, vmf, gb, iod)		dax_fault(vma, vmf, gb, iod)
+#define __dax_mkwrite(vma, vmf, gb, iod)	__dax_fault(vma, vmf, gb, iod)
 
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 3/8] dax: expose __dax_fault for filesystems with locking constraints
@ 2015-03-24 10:51   ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

Some filesystems cannot call dax_fault() directly because they have
different locking and/or allocation constraints in the page fault IO
path. To handle this, we need to follow the same model as the
generic block_page_mkwrite code, where the internals are exposed via
__block_page_mkwrite() so that filesystems can wrap the correct
locking and operations around the outside.

This is loosely based on a patch originally from Matthew Willcox.
Unlike the original patch, it does not change ext4 code, error
returns or unwritten extent conversion handling.  It also adds a
__dax_mkwrite() wrapper for .page_mkwrite implementations to do the
right thing, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dax.c           | 15 +++++++++++++--
 include/linux/fs.h |  5 ++++-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 431ec2b..0121f7d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -313,7 +313,17 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	return error;
 }
 
-static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+/**
+ * __dax_fault - handle a page fault on a DAX file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * fault handler for DAX files. __dax_fault() assumes the caller has done all
+ * the necessary locking for the page fault to proceed successfully.
+ */
+int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			get_block_t get_block, dax_iodone_t complete_unwritten)
 {
 	struct file *file = vma->vm_file;
@@ -440,6 +450,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	}
 	goto out;
 }
+EXPORT_SYMBOL(__dax_fault);
 
 /**
  * dax_fault - handle a page fault on a DAX file
@@ -460,7 +471,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		sb_start_pagefault(sb);
 		file_update_time(vma->vm_file);
 	}
-	result = do_dax_fault(vma, vmf, get_block, complete_unwritten);
+	result = __dax_fault(vma, vmf, get_block, complete_unwritten);
 	if (vmf->flags & FAULT_FLAG_WRITE)
 		sb_end_pagefault(sb);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 82100ae..7e5a2d6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2606,7 +2606,10 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
 		dax_iodone_t);
-#define dax_mkwrite(vma, vmf, gb, iod)	dax_fault(vma, vmf, gb, iod)
+int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
+		dax_iodone_t);
+#define dax_mkwrite(vma, vmf, gb, iod)		dax_fault(vma, vmf, gb, iod)
+#define __dax_mkwrite(vma, vmf, gb, iod)	__dax_fault(vma, vmf, gb, iod)
 
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
-- 
2.0.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 4/8] xfs: add DAX block zeroing support
  2015-03-24 10:50 ` Dave Chinner
@ 2015-03-24 10:51   ` Dave Chinner
  -1 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

Add initial support for DAX block zeroing operations to XFS. DAX
cannot use buffered IO through the page cache for zeroing, nor do we
need to issue IO for uncached block zeroing. In both cases, we can
simply call out to the dax block zeroing function.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 23 +++++++++++++++++++----
 fs/xfs/xfs_file.c      | 28 +++++++++++++++++-----------
 2 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 1bd5393..d1fe432 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1133,14 +1133,29 @@ xfs_zero_remaining_bytes(
 			break;
 		ASSERT(imap.br_blockcount >= 1);
 		ASSERT(imap.br_startoff == offset_fsb);
+		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
+
+		if (imap.br_startblock == HOLESTARTBLOCK ||
+		    imap.br_state == XFS_EXT_UNWRITTEN) {
+			/* skip the entire extent */
+			lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff +
+						      imap.br_blockcount) - 1;
+			continue;
+		}
+
 		lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff + 1) - 1;
 		if (lastoffset > endoff)
 			lastoffset = endoff;
-		if (imap.br_startblock == HOLESTARTBLOCK)
-			continue;
-		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
-		if (imap.br_state == XFS_EXT_UNWRITTEN)
+
+		/* DAX can just zero the backing device directly */
+		if (IS_DAX(VFS_I(ip))) {
+			error = dax_zero_page_range(VFS_I(ip), offset,
+						    lastoffset - offset + 1,
+						    xfs_get_blocks_dax);
+			if (error)
+				return error;
 			continue;
+		}
 
 		error = xfs_buf_read_uncached(XFS_IS_REALTIME_INODE(ip) ?
 				mp->m_rtdev_targp : mp->m_ddev_targp,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a4c882e..94713c2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -97,7 +97,8 @@ xfs_iozero(
 {
 	struct page		*page;
 	struct address_space	*mapping;
-	int			status;
+	int			status = 0;
+
 
 	mapping = VFS_I(ip)->i_mapping;
 	do {
@@ -109,20 +110,25 @@ xfs_iozero(
 		if (bytes > count)
 			bytes = count;
 
-		status = pagecache_write_begin(NULL, mapping, pos, bytes,
-					AOP_FLAG_UNINTERRUPTIBLE,
-					&page, &fsdata);
-		if (status)
-			break;
+		if (IS_DAX(VFS_I(ip)))
+			dax_zero_page_range(VFS_I(ip), pos, bytes,
+						   xfs_get_blocks_dax);
+		else {
+			status = pagecache_write_begin(NULL, mapping, pos, bytes,
+						AOP_FLAG_UNINTERRUPTIBLE,
+						&page, &fsdata);
+			if (status)
+				break;
 
-		zero_user(page, offset, bytes);
+			zero_user(page, offset, bytes);
 
-		status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
-					page, fsdata);
-		WARN_ON(status <= 0); /* can't return less than zero! */
+			status = pagecache_write_end(NULL, mapping, pos, bytes,
+						bytes, page, fsdata);
+			WARN_ON(status <= 0); /* can't return less than zero! */
+			status = 0;
+		}
 		pos += bytes;
 		count -= bytes;
-		status = 0;
 	} while (count);
 
 	return (-status);
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 4/8] xfs: add DAX block zeroing support
@ 2015-03-24 10:51   ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

Add initial support for DAX block zeroing operations to XFS. DAX
cannot use buffered IO through the page cache for zeroing, nor do we
need to issue IO for uncached block zeroing. In both cases, we can
simply call out to the dax block zeroing function.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 23 +++++++++++++++++++----
 fs/xfs/xfs_file.c      | 28 +++++++++++++++++-----------
 2 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 1bd5393..d1fe432 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1133,14 +1133,29 @@ xfs_zero_remaining_bytes(
 			break;
 		ASSERT(imap.br_blockcount >= 1);
 		ASSERT(imap.br_startoff == offset_fsb);
+		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
+
+		if (imap.br_startblock == HOLESTARTBLOCK ||
+		    imap.br_state == XFS_EXT_UNWRITTEN) {
+			/* skip the entire extent */
+			lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff +
+						      imap.br_blockcount) - 1;
+			continue;
+		}
+
 		lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff + 1) - 1;
 		if (lastoffset > endoff)
 			lastoffset = endoff;
-		if (imap.br_startblock == HOLESTARTBLOCK)
-			continue;
-		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
-		if (imap.br_state == XFS_EXT_UNWRITTEN)
+
+		/* DAX can just zero the backing device directly */
+		if (IS_DAX(VFS_I(ip))) {
+			error = dax_zero_page_range(VFS_I(ip), offset,
+						    lastoffset - offset + 1,
+						    xfs_get_blocks_dax);
+			if (error)
+				return error;
 			continue;
+		}
 
 		error = xfs_buf_read_uncached(XFS_IS_REALTIME_INODE(ip) ?
 				mp->m_rtdev_targp : mp->m_ddev_targp,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a4c882e..94713c2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -97,7 +97,8 @@ xfs_iozero(
 {
 	struct page		*page;
 	struct address_space	*mapping;
-	int			status;
+	int			status = 0;
+
 
 	mapping = VFS_I(ip)->i_mapping;
 	do {
@@ -109,20 +110,25 @@ xfs_iozero(
 		if (bytes > count)
 			bytes = count;
 
-		status = pagecache_write_begin(NULL, mapping, pos, bytes,
-					AOP_FLAG_UNINTERRUPTIBLE,
-					&page, &fsdata);
-		if (status)
-			break;
+		if (IS_DAX(VFS_I(ip)))
+			dax_zero_page_range(VFS_I(ip), pos, bytes,
+						   xfs_get_blocks_dax);
+		else {
+			status = pagecache_write_begin(NULL, mapping, pos, bytes,
+						AOP_FLAG_UNINTERRUPTIBLE,
+						&page, &fsdata);
+			if (status)
+				break;
 
-		zero_user(page, offset, bytes);
+			zero_user(page, offset, bytes);
 
-		status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
-					page, fsdata);
-		WARN_ON(status <= 0); /* can't return less than zero! */
+			status = pagecache_write_end(NULL, mapping, pos, bytes,
+						bytes, page, fsdata);
+			WARN_ON(status <= 0); /* can't return less than zero! */
+			status = 0;
+		}
 		pos += bytes;
 		count -= bytes;
-		status = 0;
 	} while (count);
 
 	return (-status);
-- 
2.0.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 5/8] xfs: add DAX file operations support
  2015-03-24 10:50 ` Dave Chinner
@ 2015-03-24 10:51   ` Dave Chinner
  -1 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

Add the initial support for DAX file operations to XFS. This
includes the necessary block allocation and mmap page fault hooks
for DAX to function.

Note: we specifically have to disable splice_read/write from
occurring because they are dependent on usingthe page cache for
correct operation. We have no page cache for DAX, so we need to
disable them completely on DAX inodes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c |  73 ++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_aops.h |   7 +++-
 fs/xfs/xfs_file.c | 116 ++++++++++++++++++++++++++++++++----------------------
 3 files changed, 143 insertions(+), 53 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a9b7a1..3fc5052 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1233,13 +1233,64 @@ xfs_vm_releasepage(
 	return try_to_free_buffers(page);
 }
 
+/*
+ * For DAX we need a mapping buffer callback for unwritten extent conversion
+ * when page faults allocation blocks and then zero them.
+ */
+#ifdef CONFIG_FS_DAX
+static struct xfs_ioend *
+xfs_dax_alloc_ioend(
+	struct inode	*inode,
+	xfs_off_t	offset,
+	ssize_t		size)
+{
+	struct xfs_ioend *ioend;
+
+	ASSERT(IS_DAX(inode));
+	ioend = xfs_alloc_ioend(inode, XFS_IO_UNWRITTEN);
+	ioend->io_offset = offset;
+	ioend->io_size = size;
+	return ioend;
+}
+
+void
+xfs_get_blocks_dax_complete(
+	struct buffer_head	*bh,
+	int			uptodate)
+{
+	struct xfs_ioend	*ioend = bh->b_private;
+	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
+	int			error;
+
+	ASSERT(IS_DAX(ioend->io_inode));
+
+	/* if there was an error zeroing, then don't convert it */
+	if (!uptodate)
+		goto out_free;
+
+	error = xfs_iomap_write_unwritten(ip, ioend->io_offset, ioend->io_size);
+	if (error)
+		xfs_warn(ip->i_mount,
+"%s: conversion failed, ino 0x%llx, offset 0x%llx, len 0x%lx, error %d\n",
+			__func__, ip->i_ino, ioend->io_offset,
+			ioend->io_size, error);
+out_free:
+	mempool_free(ioend, xfs_ioend_pool);
+
+}
+#else
+#define xfs_dax_alloc_ioend(i,o,s)	NULL
+void xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate) { }
+#endif
+
 STATIC int
 __xfs_get_blocks(
 	struct inode		*inode,
 	sector_t		iblock,
 	struct buffer_head	*bh_result,
 	int			create,
-	int			direct)
+	bool			direct,
+	bool			clear)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1304,6 +1355,7 @@ __xfs_get_blocks(
 			if (error)
 				return error;
 			new = 1;
+
 		} else {
 			/*
 			 * Delalloc reservations do not require a transaction,
@@ -1340,7 +1392,10 @@ __xfs_get_blocks(
 		if (create || !ISUNWRITTEN(&imap))
 			xfs_map_buffer(inode, bh_result, &imap, offset);
 		if (create && ISUNWRITTEN(&imap)) {
-			if (direct) {
+			if (clear) {
+				bh_result->b_private = xfs_dax_alloc_ioend(
+							inode, offset, size);
+			} else if (direct) {
 				bh_result->b_private = inode;
 				set_buffer_defer_completion(bh_result);
 			}
@@ -1425,7 +1480,7 @@ xfs_get_blocks(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, 0);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
 }
 
 STATIC int
@@ -1435,7 +1490,17 @@ xfs_get_blocks_direct(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, 1);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
+}
+
+int
+xfs_get_blocks_dax(
+	struct inode		*inode,
+	sector_t		iblock,
+	struct buffer_head	*bh_result,
+	int			create)
+{
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index ac644e0..7c6fb3f 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -53,7 +53,12 @@ typedef struct xfs_ioend {
 } xfs_ioend_t;
 
 extern const struct address_space_operations xfs_address_space_operations;
-extern int xfs_get_blocks(struct inode *, sector_t, struct buffer_head *, int);
+
+int	xfs_get_blocks(struct inode *inode, sector_t offset,
+		       struct buffer_head *map_bh, int create);
+int	xfs_get_blocks_dax(struct inode *inode, sector_t offset,
+			   struct buffer_head *map_bh, int create);
+void	xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate);
 
 extern void xfs_count_page_state(struct page *, int *, int *);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 94713c2..8017175 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -385,7 +385,11 @@ xfs_file_splice_read(
 
 	trace_xfs_file_splice_read(ip, count, *ppos, ioflags);
 
-	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
+	/* for dax, we need to avoid the page cache */
+	if (IS_DAX(VFS_I(ip)))
+		ret = default_file_splice_read(infilp, ppos, pipe, count, flags);
+	else
+		ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
 	if (ret > 0)
 		XFS_STATS_ADD(xs_read_bytes, ret);
 
@@ -654,7 +658,7 @@ xfs_file_dio_aio_write(
 					mp->m_rtdev_targp : mp->m_ddev_targp;
 
 	/* DIO must be aligned to device logical sector size */
-	if ((pos | count) & target->bt_logical_sectormask)
+	if (!IS_DAX(inode) && (pos | count) & target->bt_logical_sectormask)
 		return -EINVAL;
 
 	/* "unaligned" here means not aligned to a filesystem block */
@@ -724,8 +728,11 @@ xfs_file_dio_aio_write(
 out:
 	xfs_rw_iunlock(ip, iolock);
 
-	/* No fallback to buffered IO on errors for XFS. */
-	ASSERT(ret < 0 || ret == count);
+	/*
+	 * No fallback to buffered IO on errors for XFS. DAX can result in
+	 * partial writes, but direct IO will either complete fully or fail.
+	 */
+	ASSERT(ret < 0 || ret == count || IS_DAX(VFS_I(ip)));
 	return ret;
 }
 
@@ -810,7 +817,7 @@ xfs_file_write_iter(
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;
 
-	if (unlikely(file->f_flags & O_DIRECT))
+	if ((file->f_flags & O_DIRECT) || IS_DAX(inode))
 		ret = xfs_file_dio_aio_write(iocb, from);
 	else
 		ret = xfs_file_buffered_aio_write(iocb, from);
@@ -1031,17 +1038,6 @@ xfs_file_readdir(
 	return xfs_readdir(ip, ctx, bufsize);
 }
 
-STATIC int
-xfs_file_mmap(
-	struct file	*filp,
-	struct vm_area_struct *vma)
-{
-	vma->vm_ops = &xfs_file_vm_ops;
-
-	file_accessed(filp);
-	return 0;
-}
-
 /*
  * This type is designed to indicate the type of offset we would like
  * to search from page cache for xfs_seek_hole_data().
@@ -1422,26 +1418,11 @@ xfs_file_llseek(
  * ordering of:
  *
  * mmap_sem (MM)
- *   i_mmap_lock (XFS - truncate serialisation)
- *     page_lock (MM)
- *       i_lock (XFS - extent map serialisation)
+ *   sb_start_pagefault(vfs, freeze)
+ *     i_mmap_lock (XFS - truncate serialisation)
+ *       page_lock (MM)
+ *         i_lock (XFS - extent map serialisation)
  */
-STATIC int
-xfs_filemap_fault(
-	struct vm_area_struct	*vma,
-	struct vm_fault		*vmf)
-{
-	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
-	int			error;
-
-	trace_xfs_filemap_fault(ip);
-
-	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
-	error = filemap_fault(vma, vmf);
-	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
-
-	return error;
-}
 
 /*
  * mmap()d file has taken write protection fault and is being made writable. We
@@ -1454,21 +1435,66 @@ xfs_filemap_page_mkwrite(
 	struct vm_area_struct	*vma,
 	struct vm_fault		*vmf)
 {
-	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
+	struct inode		*inode = file_inode(vma->vm_file);
 	int			ret;
 
-	trace_xfs_filemap_page_mkwrite(ip);
+	trace_xfs_filemap_page_mkwrite(XFS_I(inode));
 
-	sb_start_pagefault(VFS_I(ip)->i_sb);
+	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
-	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+
+	if (IS_DAX(inode)) {
+		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax,
+				    xfs_get_blocks_dax_complete);
+	} else {
+		ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
+		ret = block_page_mkwrite_return(ret);
+	}
+
+	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	sb_end_pagefault(inode->i_sb);
+
+	return ret;
+}
+
+STATIC int
+xfs_filemap_fault(
+	struct vm_area_struct	*vma,
+	struct vm_fault		*vmf)
+{
+	struct xfs_inode	*ip = XFS_I(file_inode(vma->vm_file));
+	int			ret;
 
-	ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
+	trace_xfs_filemap_fault(ip);
+
+	/* DAX can shortcut the normal fault path on write faults! */
+	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(VFS_I(ip)))
+		return xfs_filemap_page_mkwrite(vma, vmf);
 
+	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+	ret = filemap_fault(vma, vmf);
 	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
-	sb_end_pagefault(VFS_I(ip)->i_sb);
 
-	return block_page_mkwrite_return(ret);
+	return ret;
+}
+
+static const struct vm_operations_struct xfs_file_vm_ops = {
+	.fault		= xfs_filemap_fault,
+	.map_pages	= filemap_map_pages,
+	.page_mkwrite	= xfs_filemap_page_mkwrite,
+};
+
+STATIC int
+xfs_file_mmap(
+	struct file	*filp,
+	struct vm_area_struct *vma)
+{
+	file_accessed(filp);
+	vma->vm_ops = &xfs_file_vm_ops;
+	if (IS_DAX(file_inode(filp)))
+		vma->vm_flags |= VM_MIXEDMAP;
+	return 0;
 }
 
 const struct file_operations xfs_file_operations = {
@@ -1501,9 +1527,3 @@ const struct file_operations xfs_dir_file_operations = {
 #endif
 	.fsync		= xfs_dir_fsync,
 };
-
-static const struct vm_operations_struct xfs_file_vm_ops = {
-	.fault		= xfs_filemap_fault,
-	.map_pages	= filemap_map_pages,
-	.page_mkwrite	= xfs_filemap_page_mkwrite,
-};
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 5/8] xfs: add DAX file operations support
@ 2015-03-24 10:51   ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

Add the initial support for DAX file operations to XFS. This
includes the necessary block allocation and mmap page fault hooks
for DAX to function.

Note: we specifically have to disable splice_read/write from
occurring because they are dependent on usingthe page cache for
correct operation. We have no page cache for DAX, so we need to
disable them completely on DAX inodes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c |  73 ++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_aops.h |   7 +++-
 fs/xfs/xfs_file.c | 116 ++++++++++++++++++++++++++++++++----------------------
 3 files changed, 143 insertions(+), 53 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a9b7a1..3fc5052 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1233,13 +1233,64 @@ xfs_vm_releasepage(
 	return try_to_free_buffers(page);
 }
 
+/*
+ * For DAX we need a mapping buffer callback for unwritten extent conversion
+ * when page faults allocation blocks and then zero them.
+ */
+#ifdef CONFIG_FS_DAX
+static struct xfs_ioend *
+xfs_dax_alloc_ioend(
+	struct inode	*inode,
+	xfs_off_t	offset,
+	ssize_t		size)
+{
+	struct xfs_ioend *ioend;
+
+	ASSERT(IS_DAX(inode));
+	ioend = xfs_alloc_ioend(inode, XFS_IO_UNWRITTEN);
+	ioend->io_offset = offset;
+	ioend->io_size = size;
+	return ioend;
+}
+
+void
+xfs_get_blocks_dax_complete(
+	struct buffer_head	*bh,
+	int			uptodate)
+{
+	struct xfs_ioend	*ioend = bh->b_private;
+	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
+	int			error;
+
+	ASSERT(IS_DAX(ioend->io_inode));
+
+	/* if there was an error zeroing, then don't convert it */
+	if (!uptodate)
+		goto out_free;
+
+	error = xfs_iomap_write_unwritten(ip, ioend->io_offset, ioend->io_size);
+	if (error)
+		xfs_warn(ip->i_mount,
+"%s: conversion failed, ino 0x%llx, offset 0x%llx, len 0x%lx, error %d\n",
+			__func__, ip->i_ino, ioend->io_offset,
+			ioend->io_size, error);
+out_free:
+	mempool_free(ioend, xfs_ioend_pool);
+
+}
+#else
+#define xfs_dax_alloc_ioend(i,o,s)	NULL
+void xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate) { }
+#endif
+
 STATIC int
 __xfs_get_blocks(
 	struct inode		*inode,
 	sector_t		iblock,
 	struct buffer_head	*bh_result,
 	int			create,
-	int			direct)
+	bool			direct,
+	bool			clear)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1304,6 +1355,7 @@ __xfs_get_blocks(
 			if (error)
 				return error;
 			new = 1;
+
 		} else {
 			/*
 			 * Delalloc reservations do not require a transaction,
@@ -1340,7 +1392,10 @@ __xfs_get_blocks(
 		if (create || !ISUNWRITTEN(&imap))
 			xfs_map_buffer(inode, bh_result, &imap, offset);
 		if (create && ISUNWRITTEN(&imap)) {
-			if (direct) {
+			if (clear) {
+				bh_result->b_private = xfs_dax_alloc_ioend(
+							inode, offset, size);
+			} else if (direct) {
 				bh_result->b_private = inode;
 				set_buffer_defer_completion(bh_result);
 			}
@@ -1425,7 +1480,7 @@ xfs_get_blocks(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, 0);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
 }
 
 STATIC int
@@ -1435,7 +1490,17 @@ xfs_get_blocks_direct(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __xfs_get_blocks(inode, iblock, bh_result, create, 1);
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
+}
+
+int
+xfs_get_blocks_dax(
+	struct inode		*inode,
+	sector_t		iblock,
+	struct buffer_head	*bh_result,
+	int			create)
+{
+	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index ac644e0..7c6fb3f 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -53,7 +53,12 @@ typedef struct xfs_ioend {
 } xfs_ioend_t;
 
 extern const struct address_space_operations xfs_address_space_operations;
-extern int xfs_get_blocks(struct inode *, sector_t, struct buffer_head *, int);
+
+int	xfs_get_blocks(struct inode *inode, sector_t offset,
+		       struct buffer_head *map_bh, int create);
+int	xfs_get_blocks_dax(struct inode *inode, sector_t offset,
+			   struct buffer_head *map_bh, int create);
+void	xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate);
 
 extern void xfs_count_page_state(struct page *, int *, int *);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 94713c2..8017175 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -385,7 +385,11 @@ xfs_file_splice_read(
 
 	trace_xfs_file_splice_read(ip, count, *ppos, ioflags);
 
-	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
+	/* for dax, we need to avoid the page cache */
+	if (IS_DAX(VFS_I(ip)))
+		ret = default_file_splice_read(infilp, ppos, pipe, count, flags);
+	else
+		ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
 	if (ret > 0)
 		XFS_STATS_ADD(xs_read_bytes, ret);
 
@@ -654,7 +658,7 @@ xfs_file_dio_aio_write(
 					mp->m_rtdev_targp : mp->m_ddev_targp;
 
 	/* DIO must be aligned to device logical sector size */
-	if ((pos | count) & target->bt_logical_sectormask)
+	if (!IS_DAX(inode) && (pos | count) & target->bt_logical_sectormask)
 		return -EINVAL;
 
 	/* "unaligned" here means not aligned to a filesystem block */
@@ -724,8 +728,11 @@ xfs_file_dio_aio_write(
 out:
 	xfs_rw_iunlock(ip, iolock);
 
-	/* No fallback to buffered IO on errors for XFS. */
-	ASSERT(ret < 0 || ret == count);
+	/*
+	 * No fallback to buffered IO on errors for XFS. DAX can result in
+	 * partial writes, but direct IO will either complete fully or fail.
+	 */
+	ASSERT(ret < 0 || ret == count || IS_DAX(VFS_I(ip)));
 	return ret;
 }
 
@@ -810,7 +817,7 @@ xfs_file_write_iter(
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;
 
-	if (unlikely(file->f_flags & O_DIRECT))
+	if ((file->f_flags & O_DIRECT) || IS_DAX(inode))
 		ret = xfs_file_dio_aio_write(iocb, from);
 	else
 		ret = xfs_file_buffered_aio_write(iocb, from);
@@ -1031,17 +1038,6 @@ xfs_file_readdir(
 	return xfs_readdir(ip, ctx, bufsize);
 }
 
-STATIC int
-xfs_file_mmap(
-	struct file	*filp,
-	struct vm_area_struct *vma)
-{
-	vma->vm_ops = &xfs_file_vm_ops;
-
-	file_accessed(filp);
-	return 0;
-}
-
 /*
  * This type is designed to indicate the type of offset we would like
  * to search from page cache for xfs_seek_hole_data().
@@ -1422,26 +1418,11 @@ xfs_file_llseek(
  * ordering of:
  *
  * mmap_sem (MM)
- *   i_mmap_lock (XFS - truncate serialisation)
- *     page_lock (MM)
- *       i_lock (XFS - extent map serialisation)
+ *   sb_start_pagefault(vfs, freeze)
+ *     i_mmap_lock (XFS - truncate serialisation)
+ *       page_lock (MM)
+ *         i_lock (XFS - extent map serialisation)
  */
-STATIC int
-xfs_filemap_fault(
-	struct vm_area_struct	*vma,
-	struct vm_fault		*vmf)
-{
-	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
-	int			error;
-
-	trace_xfs_filemap_fault(ip);
-
-	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
-	error = filemap_fault(vma, vmf);
-	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
-
-	return error;
-}
 
 /*
  * mmap()d file has taken write protection fault and is being made writable. We
@@ -1454,21 +1435,66 @@ xfs_filemap_page_mkwrite(
 	struct vm_area_struct	*vma,
 	struct vm_fault		*vmf)
 {
-	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
+	struct inode		*inode = file_inode(vma->vm_file);
 	int			ret;
 
-	trace_xfs_filemap_page_mkwrite(ip);
+	trace_xfs_filemap_page_mkwrite(XFS_I(inode));
 
-	sb_start_pagefault(VFS_I(ip)->i_sb);
+	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
-	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+
+	if (IS_DAX(inode)) {
+		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax,
+				    xfs_get_blocks_dax_complete);
+	} else {
+		ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
+		ret = block_page_mkwrite_return(ret);
+	}
+
+	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	sb_end_pagefault(inode->i_sb);
+
+	return ret;
+}
+
+STATIC int
+xfs_filemap_fault(
+	struct vm_area_struct	*vma,
+	struct vm_fault		*vmf)
+{
+	struct xfs_inode	*ip = XFS_I(file_inode(vma->vm_file));
+	int			ret;
 
-	ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
+	trace_xfs_filemap_fault(ip);
+
+	/* DAX can shortcut the normal fault path on write faults! */
+	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(VFS_I(ip)))
+		return xfs_filemap_page_mkwrite(vma, vmf);
 
+	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+	ret = filemap_fault(vma, vmf);
 	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
-	sb_end_pagefault(VFS_I(ip)->i_sb);
 
-	return block_page_mkwrite_return(ret);
+	return ret;
+}
+
+static const struct vm_operations_struct xfs_file_vm_ops = {
+	.fault		= xfs_filemap_fault,
+	.map_pages	= filemap_map_pages,
+	.page_mkwrite	= xfs_filemap_page_mkwrite,
+};
+
+STATIC int
+xfs_file_mmap(
+	struct file	*filp,
+	struct vm_area_struct *vma)
+{
+	file_accessed(filp);
+	vma->vm_ops = &xfs_file_vm_ops;
+	if (IS_DAX(file_inode(filp)))
+		vma->vm_flags |= VM_MIXEDMAP;
+	return 0;
 }
 
 const struct file_operations xfs_file_operations = {
@@ -1501,9 +1527,3 @@ const struct file_operations xfs_dir_file_operations = {
 #endif
 	.fsync		= xfs_dir_fsync,
 };
-
-static const struct vm_operations_struct xfs_file_vm_ops = {
-	.fault		= xfs_filemap_fault,
-	.map_pages	= filemap_map_pages,
-	.page_mkwrite	= xfs_filemap_page_mkwrite,
-};
-- 
2.0.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 6/8] xfs: add DAX truncate support
  2015-03-24 10:50 ` Dave Chinner
@ 2015-03-24 10:51   ` Dave Chinner
  -1 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

When we truncate a DAX file, we need to call through the DAX page
truncation path rather than through block_truncate_page() so that
mappings and block zeroing are all handled correctly. Otherwise,
truncate does not need to change.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8b9e688..9ca5352 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -851,7 +851,11 @@ xfs_setattr_size(
 	 * to hope that the caller sees ENOMEM and retries the truncate
 	 * operation.
 	 */
-	error = block_truncate_page(inode->i_mapping, newsize, xfs_get_blocks);
+	if (IS_DAX(inode))
+		error = dax_truncate_page(inode, newsize, xfs_get_blocks_dax);
+	else
+		error = block_truncate_page(inode->i_mapping, newsize,
+					    xfs_get_blocks);
 	if (error)
 		return error;
 	truncate_setsize(inode, newsize);
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 6/8] xfs: add DAX truncate support
@ 2015-03-24 10:51   ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

When we truncate a DAX file, we need to call through the DAX page
truncation path rather than through block_truncate_page() so that
mappings and block zeroing are all handled correctly. Otherwise,
truncate does not need to change.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8b9e688..9ca5352 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -851,7 +851,11 @@ xfs_setattr_size(
 	 * to hope that the caller sees ENOMEM and retries the truncate
 	 * operation.
 	 */
-	error = block_truncate_page(inode->i_mapping, newsize, xfs_get_blocks);
+	if (IS_DAX(inode))
+		error = dax_truncate_page(inode, newsize, xfs_get_blocks_dax);
+	else
+		error = block_truncate_page(inode->i_mapping, newsize,
+					    xfs_get_blocks);
 	if (error)
 		return error;
 	truncate_setsize(inode, newsize);
-- 
2.0.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 7/8] xfs: add DAX IO path support
  2015-03-24 10:50 ` Dave Chinner
@ 2015-03-24 10:51   ` Dave Chinner
  -1 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

DAX does not do buffered IO (can't buffer direct access!) and hence
all read/write IO is vectored through the direct IO path.  Hence we
need to add the DAX IO path callouts to the direct IO
infrastructure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3fc5052..97979e9 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1559,6 +1559,30 @@ xfs_end_io_direct_write(
 	}
 }
 
+static inline ssize_t
+xfs_vm_do_dio(
+	struct inode		*inode,
+	int			rw,
+	struct kiocb		*iocb,
+	struct iov_iter		*iter,
+	loff_t			offset,
+	void			(*endio)(struct kiocb	*iocb,
+					 loff_t		offset,
+					 ssize_t	size,
+					 void		*private),
+	int			flags)
+{
+	struct block_device	*bdev;
+
+	if (IS_DAX(inode))
+		return dax_do_io(rw, iocb, inode, iter, offset,
+				 xfs_get_blocks_direct, endio, 0);
+
+	bdev = xfs_find_bdev_for_inode(inode);
+	return  __blockdev_direct_IO(rw, iocb, inode, bdev, iter, offset,
+				     xfs_get_blocks_direct, endio, NULL, flags);
+}
+
 STATIC ssize_t
 xfs_vm_direct_IO(
 	int			rw,
@@ -1567,17 +1591,12 @@ xfs_vm_direct_IO(
 	loff_t			offset)
 {
 	struct inode		*inode = iocb->ki_filp->f_mapping->host;
-	struct block_device	*bdev = xfs_find_bdev_for_inode(inode);
 
 	if (rw & WRITE) {
-		return __blockdev_direct_IO(rw, iocb, inode, bdev, iter,
-					    offset, xfs_get_blocks_direct,
-					    xfs_end_io_direct_write, NULL,
-					    DIO_ASYNC_EXTEND);
+		return xfs_vm_do_dio(inode, rw, iocb, iter, offset,
+				     xfs_end_io_direct_write, DIO_ASYNC_EXTEND);
 	}
-	return __blockdev_direct_IO(rw, iocb, inode, bdev, iter,
-				    offset, xfs_get_blocks_direct,
-				    NULL, NULL, 0);
+	return xfs_vm_do_dio(inode, rw, iocb, iter, offset, NULL, 0);
 }
 
 /*
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 7/8] xfs: add DAX IO path support
@ 2015-03-24 10:51   ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

DAX does not do buffered IO (can't buffer direct access!) and hence
all read/write IO is vectored through the direct IO path.  Hence we
need to add the DAX IO path callouts to the direct IO
infrastructure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3fc5052..97979e9 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1559,6 +1559,30 @@ xfs_end_io_direct_write(
 	}
 }
 
+static inline ssize_t
+xfs_vm_do_dio(
+	struct inode		*inode,
+	int			rw,
+	struct kiocb		*iocb,
+	struct iov_iter		*iter,
+	loff_t			offset,
+	void			(*endio)(struct kiocb	*iocb,
+					 loff_t		offset,
+					 ssize_t	size,
+					 void		*private),
+	int			flags)
+{
+	struct block_device	*bdev;
+
+	if (IS_DAX(inode))
+		return dax_do_io(rw, iocb, inode, iter, offset,
+				 xfs_get_blocks_direct, endio, 0);
+
+	bdev = xfs_find_bdev_for_inode(inode);
+	return  __blockdev_direct_IO(rw, iocb, inode, bdev, iter, offset,
+				     xfs_get_blocks_direct, endio, NULL, flags);
+}
+
 STATIC ssize_t
 xfs_vm_direct_IO(
 	int			rw,
@@ -1567,17 +1591,12 @@ xfs_vm_direct_IO(
 	loff_t			offset)
 {
 	struct inode		*inode = iocb->ki_filp->f_mapping->host;
-	struct block_device	*bdev = xfs_find_bdev_for_inode(inode);
 
 	if (rw & WRITE) {
-		return __blockdev_direct_IO(rw, iocb, inode, bdev, iter,
-					    offset, xfs_get_blocks_direct,
-					    xfs_end_io_direct_write, NULL,
-					    DIO_ASYNC_EXTEND);
+		return xfs_vm_do_dio(inode, rw, iocb, iter, offset,
+				     xfs_end_io_direct_write, DIO_ASYNC_EXTEND);
 	}
-	return __blockdev_direct_IO(rw, iocb, inode, bdev, iter,
-				    offset, xfs_get_blocks_direct,
-				    NULL, NULL, 0);
+	return xfs_vm_do_dio(inode, rw, iocb, iter, offset, NULL, 0);
 }
 
 /*
-- 
2.0.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 8/8] xfs: add initial DAX support
  2015-03-24 10:50 ` Dave Chinner
@ 2015-03-24 10:51   ` Dave Chinner
  -1 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

Add initial DAX support to XFS. To do this we need a new mount
option to turn DAX on filesystem, and we need to propagate thi into
the inode flags whenever an inode is instantiated so that the
per-inode checks throughout the code Do The Right Thing.

There are still some things remaining to be done:

	- needs per-inode flags to mark inodes as DAX enabled, and
	  an inheritance flag to enable automatic filesystem
	  propagation of the property
	- fails occasionally with zero length writes instead of
	  ENOSPC errors, so error propagation inside/from the DAX
	  code need work
	- occasionally creates two extents rather than a single
	  larger extent like non-dax filesystems.
	- much more testing

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c  | 24 ++++++++++++------------
 fs/xfs/xfs_mount.h |  2 ++
 fs/xfs/xfs_super.c | 25 +++++++++++++++++++++++--
 3 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 9ca5352..695d857 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1195,22 +1195,22 @@ xfs_diflags_to_iflags(
 	struct inode		*inode,
 	struct xfs_inode	*ip)
 {
-	if (ip->i_d.di_flags & XFS_DIFLAG_IMMUTABLE)
+	uint16_t		flags = ip->i_d.di_flags;
+
+	inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
+			    S_NOATIME | S_DAX);
+
+	if (flags & XFS_DIFLAG_IMMUTABLE)
 		inode->i_flags |= S_IMMUTABLE;
-	else
-		inode->i_flags &= ~S_IMMUTABLE;
-	if (ip->i_d.di_flags & XFS_DIFLAG_APPEND)
+	if (flags & XFS_DIFLAG_APPEND)
 		inode->i_flags |= S_APPEND;
-	else
-		inode->i_flags &= ~S_APPEND;
-	if (ip->i_d.di_flags & XFS_DIFLAG_SYNC)
+	if (flags & XFS_DIFLAG_SYNC)
 		inode->i_flags |= S_SYNC;
-	else
-		inode->i_flags &= ~S_SYNC;
-	if (ip->i_d.di_flags & XFS_DIFLAG_NOATIME)
+	if (flags & XFS_DIFLAG_NOATIME)
 		inode->i_flags |= S_NOATIME;
-	else
-		inode->i_flags &= ~S_NOATIME;
+	/* XXX: Also needs an on-disk per inode flag! */
+	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
+		inode->i_flags |= S_DAX;
 }
 
 /*
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8c995a2..cd44e88 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -179,6 +179,8 @@ typedef struct xfs_mount {
 						   allocator */
 #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
 
+#define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
+
 
 /*
  * Default minimum read and write sizes.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 3ad0b17..0f26d7a 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -112,6 +112,8 @@ static struct xfs_kobj xfs_dbg_kobj;	/* global debug sysfs attrs */
 #define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
 #define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
 
+#define MNTOPT_DAX	"dax"		/* Enable direct access to bdev pages */
+
 /*
  * Table driven mount option parser.
  *
@@ -363,6 +365,10 @@ xfs_parseargs(
 			mp->m_flags |= XFS_MOUNT_DISCARD;
 		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
 			mp->m_flags &= ~XFS_MOUNT_DISCARD;
+#ifdef CONFIG_FS_DAX
+		} else if (!strcmp(this_char, MNTOPT_DAX)) {
+			mp->m_flags |= XFS_MOUNT_DAX;
+#endif
 		} else {
 			xfs_warn(mp, "unknown mount option [%s].", this_char);
 			return -EINVAL;
@@ -452,8 +458,8 @@ done:
 }
 
 struct proc_xfs_info {
-	int	flag;
-	char	*str;
+	uint64_t	flag;
+	char		*str;
 };
 
 STATIC int
@@ -474,6 +480,7 @@ xfs_showargs(
 		{ XFS_MOUNT_GRPID,		"," MNTOPT_GRPID },
 		{ XFS_MOUNT_DISCARD,		"," MNTOPT_DISCARD },
 		{ XFS_MOUNT_SMALL_INUMS,	"," MNTOPT_32BITINODE },
+		{ XFS_MOUNT_DAX,		"," MNTOPT_DAX },
 		{ 0, NULL }
 	};
 	static struct proc_xfs_info xfs_info_unset[] = {
@@ -1501,6 +1508,20 @@ xfs_fs_fill_super(
 	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
 		sb->s_flags |= MS_I_VERSION;
 
+	if (mp->m_flags & XFS_MOUNT_DAX) {
+		xfs_warn(mp,
+	"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
+		if (sb->s_blocksize != PAGE_SIZE) {
+			xfs_alert(mp,
+		"Filesystem block size invalid for DAX Turning DAX off.");
+			mp->m_flags &= ~XFS_MOUNT_DAX;
+		} else if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			xfs_alert(mp,
+		"Block device does not support DAX Turning DAX off.");
+			mp->m_flags &= ~XFS_MOUNT_DAX;
+		}
+	}
+
 	error = xfs_mountfs(mp);
 	if (error)
 		goto out_filestream_unmount;
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 8/8] xfs: add initial DAX support
@ 2015-03-24 10:51   ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 10:51 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, willy, jack

From: Dave Chinner <dchinner@redhat.com>

Add initial DAX support to XFS. To do this we need a new mount
option to turn DAX on filesystem, and we need to propagate thi into
the inode flags whenever an inode is instantiated so that the
per-inode checks throughout the code Do The Right Thing.

There are still some things remaining to be done:

	- needs per-inode flags to mark inodes as DAX enabled, and
	  an inheritance flag to enable automatic filesystem
	  propagation of the property
	- fails occasionally with zero length writes instead of
	  ENOSPC errors, so error propagation inside/from the DAX
	  code need work
	- occasionally creates two extents rather than a single
	  larger extent like non-dax filesystems.
	- much more testing

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c  | 24 ++++++++++++------------
 fs/xfs/xfs_mount.h |  2 ++
 fs/xfs/xfs_super.c | 25 +++++++++++++++++++++++--
 3 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 9ca5352..695d857 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1195,22 +1195,22 @@ xfs_diflags_to_iflags(
 	struct inode		*inode,
 	struct xfs_inode	*ip)
 {
-	if (ip->i_d.di_flags & XFS_DIFLAG_IMMUTABLE)
+	uint16_t		flags = ip->i_d.di_flags;
+
+	inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
+			    S_NOATIME | S_DAX);
+
+	if (flags & XFS_DIFLAG_IMMUTABLE)
 		inode->i_flags |= S_IMMUTABLE;
-	else
-		inode->i_flags &= ~S_IMMUTABLE;
-	if (ip->i_d.di_flags & XFS_DIFLAG_APPEND)
+	if (flags & XFS_DIFLAG_APPEND)
 		inode->i_flags |= S_APPEND;
-	else
-		inode->i_flags &= ~S_APPEND;
-	if (ip->i_d.di_flags & XFS_DIFLAG_SYNC)
+	if (flags & XFS_DIFLAG_SYNC)
 		inode->i_flags |= S_SYNC;
-	else
-		inode->i_flags &= ~S_SYNC;
-	if (ip->i_d.di_flags & XFS_DIFLAG_NOATIME)
+	if (flags & XFS_DIFLAG_NOATIME)
 		inode->i_flags |= S_NOATIME;
-	else
-		inode->i_flags &= ~S_NOATIME;
+	/* XXX: Also needs an on-disk per inode flag! */
+	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
+		inode->i_flags |= S_DAX;
 }
 
 /*
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8c995a2..cd44e88 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -179,6 +179,8 @@ typedef struct xfs_mount {
 						   allocator */
 #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
 
+#define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
+
 
 /*
  * Default minimum read and write sizes.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 3ad0b17..0f26d7a 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -112,6 +112,8 @@ static struct xfs_kobj xfs_dbg_kobj;	/* global debug sysfs attrs */
 #define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
 #define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
 
+#define MNTOPT_DAX	"dax"		/* Enable direct access to bdev pages */
+
 /*
  * Table driven mount option parser.
  *
@@ -363,6 +365,10 @@ xfs_parseargs(
 			mp->m_flags |= XFS_MOUNT_DISCARD;
 		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
 			mp->m_flags &= ~XFS_MOUNT_DISCARD;
+#ifdef CONFIG_FS_DAX
+		} else if (!strcmp(this_char, MNTOPT_DAX)) {
+			mp->m_flags |= XFS_MOUNT_DAX;
+#endif
 		} else {
 			xfs_warn(mp, "unknown mount option [%s].", this_char);
 			return -EINVAL;
@@ -452,8 +458,8 @@ done:
 }
 
 struct proc_xfs_info {
-	int	flag;
-	char	*str;
+	uint64_t	flag;
+	char		*str;
 };
 
 STATIC int
@@ -474,6 +480,7 @@ xfs_showargs(
 		{ XFS_MOUNT_GRPID,		"," MNTOPT_GRPID },
 		{ XFS_MOUNT_DISCARD,		"," MNTOPT_DISCARD },
 		{ XFS_MOUNT_SMALL_INUMS,	"," MNTOPT_32BITINODE },
+		{ XFS_MOUNT_DAX,		"," MNTOPT_DAX },
 		{ 0, NULL }
 	};
 	static struct proc_xfs_info xfs_info_unset[] = {
@@ -1501,6 +1508,20 @@ xfs_fs_fill_super(
 	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
 		sb->s_flags |= MS_I_VERSION;
 
+	if (mp->m_flags & XFS_MOUNT_DAX) {
+		xfs_warn(mp,
+	"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
+		if (sb->s_blocksize != PAGE_SIZE) {
+			xfs_alert(mp,
+		"Filesystem block size invalid for DAX Turning DAX off.");
+			mp->m_flags &= ~XFS_MOUNT_DAX;
+		} else if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			xfs_alert(mp,
+		"Block device does not support DAX Turning DAX off.");
+			mp->m_flags &= ~XFS_MOUNT_DAX;
+		}
+	}
+
 	error = xfs_mountfs(mp);
 	if (error)
 		goto out_filestream_unmount;
-- 
2.0.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
  2015-03-24 10:51   ` Dave Chinner
@ 2015-03-24 12:08     ` Boaz Harrosh
  -1 siblings, 0 replies; 57+ messages in thread
From: Boaz Harrosh @ 2015-03-24 12:08 UTC (permalink / raw)
  To: Dave Chinner, xfs; +Cc: linux-fsdevel, willy, jack

On 03/24/2015 12:51 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add the initial support for DAX file operations to XFS. This
> includes the necessary block allocation and mmap page fault hooks
> for DAX to function.
> 
> Note: we specifically have to disable splice_read/write from

It looks from below code that splice_read is perfectly supportable
through the call to default_file_splice_read() so you might
want to change the comment here.

Regarding splice_write:
It looks to me like you left the vector at iter_file_splice_write().
If I understand correctly I think you need to call default_file_splice_write()
that uses memcpy, just as it was at the previous patches when you left
the vector NULL. So I think you need the same switch for write
as you do below for read.

I'm doing the exact same code for ext4/2 right now and testing.
(Thanks for the loop pointer).

Otherwise I have stared at this very carefully and it looks good
Did not test yet.

Thanks
Boaz

> occurring because they are dependent on usingthe page cache for
> correct operation. We have no page cache for DAX, so we need to
> disable them completely on DAX inodes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c |  73 ++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_aops.h |   7 +++-
>  fs/xfs/xfs_file.c | 116 ++++++++++++++++++++++++++++++++----------------------
>  3 files changed, 143 insertions(+), 53 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3a9b7a1..3fc5052 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1233,13 +1233,64 @@ xfs_vm_releasepage(
>  	return try_to_free_buffers(page);
>  }
>  
> +/*
> + * For DAX we need a mapping buffer callback for unwritten extent conversion
> + * when page faults allocation blocks and then zero them.
> + */
> +#ifdef CONFIG_FS_DAX
> +static struct xfs_ioend *
> +xfs_dax_alloc_ioend(
> +	struct inode	*inode,
> +	xfs_off_t	offset,
> +	ssize_t		size)
> +{
> +	struct xfs_ioend *ioend;
> +
> +	ASSERT(IS_DAX(inode));
> +	ioend = xfs_alloc_ioend(inode, XFS_IO_UNWRITTEN);
> +	ioend->io_offset = offset;
> +	ioend->io_size = size;
> +	return ioend;
> +}
> +
> +void
> +xfs_get_blocks_dax_complete(
> +	struct buffer_head	*bh,
> +	int			uptodate)
> +{
> +	struct xfs_ioend	*ioend = bh->b_private;
> +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> +	int			error;
> +
> +	ASSERT(IS_DAX(ioend->io_inode));
> +
> +	/* if there was an error zeroing, then don't convert it */
> +	if (!uptodate)
> +		goto out_free;
> +
> +	error = xfs_iomap_write_unwritten(ip, ioend->io_offset, ioend->io_size);
> +	if (error)
> +		xfs_warn(ip->i_mount,
> +"%s: conversion failed, ino 0x%llx, offset 0x%llx, len 0x%lx, error %d\n",
> +			__func__, ip->i_ino, ioend->io_offset,
> +			ioend->io_size, error);
> +out_free:
> +	mempool_free(ioend, xfs_ioend_pool);
> +
> +}
> +#else
> +#define xfs_dax_alloc_ioend(i,o,s)	NULL
> +void xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate) { }
> +#endif
> +
>  STATIC int
>  __xfs_get_blocks(
>  	struct inode		*inode,
>  	sector_t		iblock,
>  	struct buffer_head	*bh_result,
>  	int			create,
> -	int			direct)
> +	bool			direct,
> +	bool			clear)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1304,6 +1355,7 @@ __xfs_get_blocks(
>  			if (error)
>  				return error;
>  			new = 1;
> +
>  		} else {
>  			/*
>  			 * Delalloc reservations do not require a transaction,
> @@ -1340,7 +1392,10 @@ __xfs_get_blocks(
>  		if (create || !ISUNWRITTEN(&imap))
>  			xfs_map_buffer(inode, bh_result, &imap, offset);
>  		if (create && ISUNWRITTEN(&imap)) {
> -			if (direct) {
> +			if (clear) {
> +				bh_result->b_private = xfs_dax_alloc_ioend(
> +							inode, offset, size);
> +			} else if (direct) {
>  				bh_result->b_private = inode;
>  				set_buffer_defer_completion(bh_result);
>  			}
> @@ -1425,7 +1480,7 @@ xfs_get_blocks(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 0);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
>  }
>  
>  STATIC int
> @@ -1435,7 +1490,17 @@ xfs_get_blocks_direct(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 1);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
> +}
> +
> +int
> +xfs_get_blocks_dax(
> +	struct inode		*inode,
> +	sector_t		iblock,
> +	struct buffer_head	*bh_result,
> +	int			create)
> +{
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index ac644e0..7c6fb3f 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -53,7 +53,12 @@ typedef struct xfs_ioend {
>  } xfs_ioend_t;
>  
>  extern const struct address_space_operations xfs_address_space_operations;
> -extern int xfs_get_blocks(struct inode *, sector_t, struct buffer_head *, int);
> +
> +int	xfs_get_blocks(struct inode *inode, sector_t offset,
> +		       struct buffer_head *map_bh, int create);
> +int	xfs_get_blocks_dax(struct inode *inode, sector_t offset,
> +			   struct buffer_head *map_bh, int create);
> +void	xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate);
>  
>  extern void xfs_count_page_state(struct page *, int *, int *);
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 94713c2..8017175 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -385,7 +385,11 @@ xfs_file_splice_read(
>  
>  	trace_xfs_file_splice_read(ip, count, *ppos, ioflags);
>  
> -	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
> +	/* for dax, we need to avoid the page cache */
> +	if (IS_DAX(VFS_I(ip)))
> +		ret = default_file_splice_read(infilp, ppos, pipe, count, flags);
> +	else
> +		ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
>  	if (ret > 0)
>  		XFS_STATS_ADD(xs_read_bytes, ret);
>  
> @@ -654,7 +658,7 @@ xfs_file_dio_aio_write(
>  					mp->m_rtdev_targp : mp->m_ddev_targp;
>  
>  	/* DIO must be aligned to device logical sector size */
> -	if ((pos | count) & target->bt_logical_sectormask)
> +	if (!IS_DAX(inode) && (pos | count) & target->bt_logical_sectormask)
>  		return -EINVAL;
>  
>  	/* "unaligned" here means not aligned to a filesystem block */
> @@ -724,8 +728,11 @@ xfs_file_dio_aio_write(
>  out:
>  	xfs_rw_iunlock(ip, iolock);
>  
> -	/* No fallback to buffered IO on errors for XFS. */
> -	ASSERT(ret < 0 || ret == count);
> +	/*
> +	 * No fallback to buffered IO on errors for XFS. DAX can result in
> +	 * partial writes, but direct IO will either complete fully or fail.
> +	 */
> +	ASSERT(ret < 0 || ret == count || IS_DAX(VFS_I(ip)));
>  	return ret;
>  }
>  
> @@ -810,7 +817,7 @@ xfs_file_write_iter(
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
>  		return -EIO;
>  
> -	if (unlikely(file->f_flags & O_DIRECT))
> +	if ((file->f_flags & O_DIRECT) || IS_DAX(inode))
>  		ret = xfs_file_dio_aio_write(iocb, from);
>  	else
>  		ret = xfs_file_buffered_aio_write(iocb, from);
> @@ -1031,17 +1038,6 @@ xfs_file_readdir(
>  	return xfs_readdir(ip, ctx, bufsize);
>  }
>  
> -STATIC int
> -xfs_file_mmap(
> -	struct file	*filp,
> -	struct vm_area_struct *vma)
> -{
> -	vma->vm_ops = &xfs_file_vm_ops;
> -
> -	file_accessed(filp);
> -	return 0;
> -}
> -
>  /*
>   * This type is designed to indicate the type of offset we would like
>   * to search from page cache for xfs_seek_hole_data().
> @@ -1422,26 +1418,11 @@ xfs_file_llseek(
>   * ordering of:
>   *
>   * mmap_sem (MM)
> - *   i_mmap_lock (XFS - truncate serialisation)
> - *     page_lock (MM)
> - *       i_lock (XFS - extent map serialisation)
> + *   sb_start_pagefault(vfs, freeze)
> + *     i_mmap_lock (XFS - truncate serialisation)
> + *       page_lock (MM)
> + *         i_lock (XFS - extent map serialisation)
>   */
> -STATIC int
> -xfs_filemap_fault(
> -	struct vm_area_struct	*vma,
> -	struct vm_fault		*vmf)
> -{
> -	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> -	int			error;
> -
> -	trace_xfs_filemap_fault(ip);
> -
> -	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> -	error = filemap_fault(vma, vmf);
> -	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> -
> -	return error;
> -}
>  
>  /*
>   * mmap()d file has taken write protection fault and is being made writable. We
> @@ -1454,21 +1435,66 @@ xfs_filemap_page_mkwrite(
>  	struct vm_area_struct	*vma,
>  	struct vm_fault		*vmf)
>  {
> -	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> +	struct inode		*inode = file_inode(vma->vm_file);
>  	int			ret;
>  
> -	trace_xfs_filemap_page_mkwrite(ip);
> +	trace_xfs_filemap_page_mkwrite(XFS_I(inode));
>  
> -	sb_start_pagefault(VFS_I(ip)->i_sb);
> +	sb_start_pagefault(inode->i_sb);
>  	file_update_time(vma->vm_file);
> -	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> +	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +
> +	if (IS_DAX(inode)) {
> +		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax,
> +				    xfs_get_blocks_dax_complete);
> +	} else {
> +		ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +		ret = block_page_mkwrite_return(ret);
> +	}
> +
> +	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +	sb_end_pagefault(inode->i_sb);
> +
> +	return ret;
> +}
> +
> +STATIC int
> +xfs_filemap_fault(
> +	struct vm_area_struct	*vma,
> +	struct vm_fault		*vmf)
> +{
> +	struct xfs_inode	*ip = XFS_I(file_inode(vma->vm_file));
> +	int			ret;
>  
> -	ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +	trace_xfs_filemap_fault(ip);
> +
> +	/* DAX can shortcut the normal fault path on write faults! */
> +	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(VFS_I(ip)))
> +		return xfs_filemap_page_mkwrite(vma, vmf);
>  
> +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> +	ret = filemap_fault(vma, vmf);
>  	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> -	sb_end_pagefault(VFS_I(ip)->i_sb);
>  
> -	return block_page_mkwrite_return(ret);
> +	return ret;
> +}
> +
> +static const struct vm_operations_struct xfs_file_vm_ops = {
> +	.fault		= xfs_filemap_fault,
> +	.map_pages	= filemap_map_pages,
> +	.page_mkwrite	= xfs_filemap_page_mkwrite,
> +};
> +
> +STATIC int
> +xfs_file_mmap(
> +	struct file	*filp,
> +	struct vm_area_struct *vma)
> +{
> +	file_accessed(filp);
> +	vma->vm_ops = &xfs_file_vm_ops;
> +	if (IS_DAX(file_inode(filp)))
> +		vma->vm_flags |= VM_MIXEDMAP;
> +	return 0;
>  }
>  
>  const struct file_operations xfs_file_operations = {
> @@ -1501,9 +1527,3 @@ const struct file_operations xfs_dir_file_operations = {
>  #endif
>  	.fsync		= xfs_dir_fsync,
>  };
> -
> -static const struct vm_operations_struct xfs_file_vm_ops = {
> -	.fault		= xfs_filemap_fault,
> -	.map_pages	= filemap_map_pages,
> -	.page_mkwrite	= xfs_filemap_page_mkwrite,
> -};
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
@ 2015-03-24 12:08     ` Boaz Harrosh
  0 siblings, 0 replies; 57+ messages in thread
From: Boaz Harrosh @ 2015-03-24 12:08 UTC (permalink / raw)
  To: Dave Chinner, xfs; +Cc: linux-fsdevel, willy, jack

On 03/24/2015 12:51 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add the initial support for DAX file operations to XFS. This
> includes the necessary block allocation and mmap page fault hooks
> for DAX to function.
> 
> Note: we specifically have to disable splice_read/write from

It looks from below code that splice_read is perfectly supportable
through the call to default_file_splice_read() so you might
want to change the comment here.

Regarding splice_write:
It looks to me like you left the vector at iter_file_splice_write().
If I understand correctly I think you need to call default_file_splice_write()
that uses memcpy, just as it was at the previous patches when you left
the vector NULL. So I think you need the same switch for write
as you do below for read.

I'm doing the exact same code for ext4/2 right now and testing.
(Thanks for the loop pointer).

Otherwise I have stared at this very carefully and it looks good
Did not test yet.

Thanks
Boaz

> occurring because they are dependent on usingthe page cache for
> correct operation. We have no page cache for DAX, so we need to
> disable them completely on DAX inodes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c |  73 ++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_aops.h |   7 +++-
>  fs/xfs/xfs_file.c | 116 ++++++++++++++++++++++++++++++++----------------------
>  3 files changed, 143 insertions(+), 53 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3a9b7a1..3fc5052 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1233,13 +1233,64 @@ xfs_vm_releasepage(
>  	return try_to_free_buffers(page);
>  }
>  
> +/*
> + * For DAX we need a mapping buffer callback for unwritten extent conversion
> + * when page faults allocation blocks and then zero them.
> + */
> +#ifdef CONFIG_FS_DAX
> +static struct xfs_ioend *
> +xfs_dax_alloc_ioend(
> +	struct inode	*inode,
> +	xfs_off_t	offset,
> +	ssize_t		size)
> +{
> +	struct xfs_ioend *ioend;
> +
> +	ASSERT(IS_DAX(inode));
> +	ioend = xfs_alloc_ioend(inode, XFS_IO_UNWRITTEN);
> +	ioend->io_offset = offset;
> +	ioend->io_size = size;
> +	return ioend;
> +}
> +
> +void
> +xfs_get_blocks_dax_complete(
> +	struct buffer_head	*bh,
> +	int			uptodate)
> +{
> +	struct xfs_ioend	*ioend = bh->b_private;
> +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> +	int			error;
> +
> +	ASSERT(IS_DAX(ioend->io_inode));
> +
> +	/* if there was an error zeroing, then don't convert it */
> +	if (!uptodate)
> +		goto out_free;
> +
> +	error = xfs_iomap_write_unwritten(ip, ioend->io_offset, ioend->io_size);
> +	if (error)
> +		xfs_warn(ip->i_mount,
> +"%s: conversion failed, ino 0x%llx, offset 0x%llx, len 0x%lx, error %d\n",
> +			__func__, ip->i_ino, ioend->io_offset,
> +			ioend->io_size, error);
> +out_free:
> +	mempool_free(ioend, xfs_ioend_pool);
> +
> +}
> +#else
> +#define xfs_dax_alloc_ioend(i,o,s)	NULL
> +void xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate) { }
> +#endif
> +
>  STATIC int
>  __xfs_get_blocks(
>  	struct inode		*inode,
>  	sector_t		iblock,
>  	struct buffer_head	*bh_result,
>  	int			create,
> -	int			direct)
> +	bool			direct,
> +	bool			clear)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1304,6 +1355,7 @@ __xfs_get_blocks(
>  			if (error)
>  				return error;
>  			new = 1;
> +
>  		} else {
>  			/*
>  			 * Delalloc reservations do not require a transaction,
> @@ -1340,7 +1392,10 @@ __xfs_get_blocks(
>  		if (create || !ISUNWRITTEN(&imap))
>  			xfs_map_buffer(inode, bh_result, &imap, offset);
>  		if (create && ISUNWRITTEN(&imap)) {
> -			if (direct) {
> +			if (clear) {
> +				bh_result->b_private = xfs_dax_alloc_ioend(
> +							inode, offset, size);
> +			} else if (direct) {
>  				bh_result->b_private = inode;
>  				set_buffer_defer_completion(bh_result);
>  			}
> @@ -1425,7 +1480,7 @@ xfs_get_blocks(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 0);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
>  }
>  
>  STATIC int
> @@ -1435,7 +1490,17 @@ xfs_get_blocks_direct(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 1);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
> +}
> +
> +int
> +xfs_get_blocks_dax(
> +	struct inode		*inode,
> +	sector_t		iblock,
> +	struct buffer_head	*bh_result,
> +	int			create)
> +{
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index ac644e0..7c6fb3f 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -53,7 +53,12 @@ typedef struct xfs_ioend {
>  } xfs_ioend_t;
>  
>  extern const struct address_space_operations xfs_address_space_operations;
> -extern int xfs_get_blocks(struct inode *, sector_t, struct buffer_head *, int);
> +
> +int	xfs_get_blocks(struct inode *inode, sector_t offset,
> +		       struct buffer_head *map_bh, int create);
> +int	xfs_get_blocks_dax(struct inode *inode, sector_t offset,
> +			   struct buffer_head *map_bh, int create);
> +void	xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate);
>  
>  extern void xfs_count_page_state(struct page *, int *, int *);
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 94713c2..8017175 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -385,7 +385,11 @@ xfs_file_splice_read(
>  
>  	trace_xfs_file_splice_read(ip, count, *ppos, ioflags);
>  
> -	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
> +	/* for dax, we need to avoid the page cache */
> +	if (IS_DAX(VFS_I(ip)))
> +		ret = default_file_splice_read(infilp, ppos, pipe, count, flags);
> +	else
> +		ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
>  	if (ret > 0)
>  		XFS_STATS_ADD(xs_read_bytes, ret);
>  
> @@ -654,7 +658,7 @@ xfs_file_dio_aio_write(
>  					mp->m_rtdev_targp : mp->m_ddev_targp;
>  
>  	/* DIO must be aligned to device logical sector size */
> -	if ((pos | count) & target->bt_logical_sectormask)
> +	if (!IS_DAX(inode) && (pos | count) & target->bt_logical_sectormask)
>  		return -EINVAL;
>  
>  	/* "unaligned" here means not aligned to a filesystem block */
> @@ -724,8 +728,11 @@ xfs_file_dio_aio_write(
>  out:
>  	xfs_rw_iunlock(ip, iolock);
>  
> -	/* No fallback to buffered IO on errors for XFS. */
> -	ASSERT(ret < 0 || ret == count);
> +	/*
> +	 * No fallback to buffered IO on errors for XFS. DAX can result in
> +	 * partial writes, but direct IO will either complete fully or fail.
> +	 */
> +	ASSERT(ret < 0 || ret == count || IS_DAX(VFS_I(ip)));
>  	return ret;
>  }
>  
> @@ -810,7 +817,7 @@ xfs_file_write_iter(
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
>  		return -EIO;
>  
> -	if (unlikely(file->f_flags & O_DIRECT))
> +	if ((file->f_flags & O_DIRECT) || IS_DAX(inode))
>  		ret = xfs_file_dio_aio_write(iocb, from);
>  	else
>  		ret = xfs_file_buffered_aio_write(iocb, from);
> @@ -1031,17 +1038,6 @@ xfs_file_readdir(
>  	return xfs_readdir(ip, ctx, bufsize);
>  }
>  
> -STATIC int
> -xfs_file_mmap(
> -	struct file	*filp,
> -	struct vm_area_struct *vma)
> -{
> -	vma->vm_ops = &xfs_file_vm_ops;
> -
> -	file_accessed(filp);
> -	return 0;
> -}
> -
>  /*
>   * This type is designed to indicate the type of offset we would like
>   * to search from page cache for xfs_seek_hole_data().
> @@ -1422,26 +1418,11 @@ xfs_file_llseek(
>   * ordering of:
>   *
>   * mmap_sem (MM)
> - *   i_mmap_lock (XFS - truncate serialisation)
> - *     page_lock (MM)
> - *       i_lock (XFS - extent map serialisation)
> + *   sb_start_pagefault(vfs, freeze)
> + *     i_mmap_lock (XFS - truncate serialisation)
> + *       page_lock (MM)
> + *         i_lock (XFS - extent map serialisation)
>   */
> -STATIC int
> -xfs_filemap_fault(
> -	struct vm_area_struct	*vma,
> -	struct vm_fault		*vmf)
> -{
> -	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> -	int			error;
> -
> -	trace_xfs_filemap_fault(ip);
> -
> -	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> -	error = filemap_fault(vma, vmf);
> -	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> -
> -	return error;
> -}
>  
>  /*
>   * mmap()d file has taken write protection fault and is being made writable. We
> @@ -1454,21 +1435,66 @@ xfs_filemap_page_mkwrite(
>  	struct vm_area_struct	*vma,
>  	struct vm_fault		*vmf)
>  {
> -	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> +	struct inode		*inode = file_inode(vma->vm_file);
>  	int			ret;
>  
> -	trace_xfs_filemap_page_mkwrite(ip);
> +	trace_xfs_filemap_page_mkwrite(XFS_I(inode));
>  
> -	sb_start_pagefault(VFS_I(ip)->i_sb);
> +	sb_start_pagefault(inode->i_sb);
>  	file_update_time(vma->vm_file);
> -	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> +	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +
> +	if (IS_DAX(inode)) {
> +		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax,
> +				    xfs_get_blocks_dax_complete);
> +	} else {
> +		ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +		ret = block_page_mkwrite_return(ret);
> +	}
> +
> +	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +	sb_end_pagefault(inode->i_sb);
> +
> +	return ret;
> +}
> +
> +STATIC int
> +xfs_filemap_fault(
> +	struct vm_area_struct	*vma,
> +	struct vm_fault		*vmf)
> +{
> +	struct xfs_inode	*ip = XFS_I(file_inode(vma->vm_file));
> +	int			ret;
>  
> -	ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +	trace_xfs_filemap_fault(ip);
> +
> +	/* DAX can shortcut the normal fault path on write faults! */
> +	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(VFS_I(ip)))
> +		return xfs_filemap_page_mkwrite(vma, vmf);
>  
> +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> +	ret = filemap_fault(vma, vmf);
>  	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> -	sb_end_pagefault(VFS_I(ip)->i_sb);
>  
> -	return block_page_mkwrite_return(ret);
> +	return ret;
> +}
> +
> +static const struct vm_operations_struct xfs_file_vm_ops = {
> +	.fault		= xfs_filemap_fault,
> +	.map_pages	= filemap_map_pages,
> +	.page_mkwrite	= xfs_filemap_page_mkwrite,
> +};
> +
> +STATIC int
> +xfs_file_mmap(
> +	struct file	*filp,
> +	struct vm_area_struct *vma)
> +{
> +	file_accessed(filp);
> +	vma->vm_ops = &xfs_file_vm_ops;
> +	if (IS_DAX(file_inode(filp)))
> +		vma->vm_flags |= VM_MIXEDMAP;
> +	return 0;
>  }
>  
>  const struct file_operations xfs_file_operations = {
> @@ -1501,9 +1527,3 @@ const struct file_operations xfs_dir_file_operations = {
>  #endif
>  	.fsync		= xfs_dir_fsync,
>  };
> -
> -static const struct vm_operations_struct xfs_file_vm_ops = {
> -	.fault		= xfs_filemap_fault,
> -	.map_pages	= filemap_map_pages,
> -	.page_mkwrite	= xfs_filemap_page_mkwrite,
> -};
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
  2015-03-24 12:08     ` Boaz Harrosh
@ 2015-03-24 12:24       ` Boaz Harrosh
  -1 siblings, 0 replies; 57+ messages in thread
From: Boaz Harrosh @ 2015-03-24 12:24 UTC (permalink / raw)
  To: Boaz Harrosh, Dave Chinner, xfs; +Cc: linux-fsdevel, willy, jack

On 03/24/2015 02:08 PM, Boaz Harrosh wrote:
> On 03/24/2015 12:51 PM, Dave Chinner wrote:
<>
> 
> Regarding splice_write:
> It looks to me like you left the vector at iter_file_splice_write().
> If I understand correctly I think you need to call default_file_splice_write()
> that uses memcpy, just as it was at the previous patches when you left
> the vector NULL. So I think you need the same switch for write
> as you do below for read.
> 

I take this back It looks like iter_file_splice_write() is perfectly usable
it relies on the source (Not us for write) being a page then kmaps and
calls one of f_op->write(file,..) or new_sync_write(file, ...) so
it looks like it all should work.

So the only bad thing is the comment it looks like splice should just
work also for dax.

Will be testing exact same thing for ext4 and report

Thanks
Boaz

> I'm doing the exact same code for ext4/2 right now and testing.
> (Thanks for the loop pointer).
> 
> Otherwise I have stared at this very carefully and it looks good
> Did not test yet.
> 
> Thanks
> Boaz
> 
<>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
@ 2015-03-24 12:24       ` Boaz Harrosh
  0 siblings, 0 replies; 57+ messages in thread
From: Boaz Harrosh @ 2015-03-24 12:24 UTC (permalink / raw)
  To: Boaz Harrosh, Dave Chinner, xfs; +Cc: linux-fsdevel, willy, jack

On 03/24/2015 02:08 PM, Boaz Harrosh wrote:
> On 03/24/2015 12:51 PM, Dave Chinner wrote:
<>
> 
> Regarding splice_write:
> It looks to me like you left the vector at iter_file_splice_write().
> If I understand correctly I think you need to call default_file_splice_write()
> that uses memcpy, just as it was at the previous patches when you left
> the vector NULL. So I think you need the same switch for write
> as you do below for read.
> 

I take this back It looks like iter_file_splice_write() is perfectly usable
it relies on the source (Not us for write) being a page then kmaps and
calls one of f_op->write(file,..) or new_sync_write(file, ...) so
it looks like it all should work.

So the only bad thing is the comment it looks like splice should just
work also for dax.

Will be testing exact same thing for ext4 and report

Thanks
Boaz

> I'm doing the exact same code for ext4/2 right now and testing.
> (Thanks for the loop pointer).
> 
> Otherwise I have stared at this very carefully and it looks good
> Did not test yet.
> 
> Thanks
> Boaz
> 
<>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/8] xfs: add initial DAX support
  2015-03-24 10:51   ` Dave Chinner
@ 2015-03-24 12:52     ` Boaz Harrosh
  -1 siblings, 0 replies; 57+ messages in thread
From: Boaz Harrosh @ 2015-03-24 12:52 UTC (permalink / raw)
  To: Dave Chinner, xfs; +Cc: linux-fsdevel, willy, jack

On 03/24/2015 12:51 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add initial DAX support to XFS. To do this we need a new mount
> option to turn DAX on filesystem, and we need to propagate thi into
> the inode flags whenever an inode is instantiated so that the
> per-inode checks throughout the code Do The Right Thing.
> 
> There are still some things remaining to be done:
> 
> 	- needs per-inode flags to mark inodes as DAX enabled, and
> 	  an inheritance flag to enable automatic filesystem
> 	  propagation of the property
> 	- fails occasionally with zero length writes instead of
> 	  ENOSPC errors, so error propagation inside/from the DAX
> 	  code need work
> 	- occasionally creates two extents rather than a single
> 	  larger extent like non-dax filesystems.
> 	- much more testing
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iops.c  | 24 ++++++++++++------------
>  fs/xfs/xfs_mount.h |  2 ++
>  fs/xfs/xfs_super.c | 25 +++++++++++++++++++++++--
>  3 files changed, 37 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 9ca5352..695d857 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1195,22 +1195,22 @@ xfs_diflags_to_iflags(
>  	struct inode		*inode,
>  	struct xfs_inode	*ip)
>  {
> -	if (ip->i_d.di_flags & XFS_DIFLAG_IMMUTABLE)
> +	uint16_t		flags = ip->i_d.di_flags;
> +
> +	inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
> +			    S_NOATIME | S_DAX);
> +
> +	if (flags & XFS_DIFLAG_IMMUTABLE)
>  		inode->i_flags |= S_IMMUTABLE;
> -	else
> -		inode->i_flags &= ~S_IMMUTABLE;
> -	if (ip->i_d.di_flags & XFS_DIFLAG_APPEND)
> +	if (flags & XFS_DIFLAG_APPEND)
>  		inode->i_flags |= S_APPEND;
> -	else
> -		inode->i_flags &= ~S_APPEND;
> -	if (ip->i_d.di_flags & XFS_DIFLAG_SYNC)
> +	if (flags & XFS_DIFLAG_SYNC)
>  		inode->i_flags |= S_SYNC;
> -	else
> -		inode->i_flags &= ~S_SYNC;
> -	if (ip->i_d.di_flags & XFS_DIFLAG_NOATIME)
> +	if (flags & XFS_DIFLAG_NOATIME)
>  		inode->i_flags |= S_NOATIME;
> -	else
> -		inode->i_flags &= ~S_NOATIME;
> +	/* XXX: Also needs an on-disk per inode flag! */
> +	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
> +		inode->i_flags |= S_DAX;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 8c995a2..cd44e88 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -179,6 +179,8 @@ typedef struct xfs_mount {
>  						   allocator */
>  #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
>  
> +#define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
> +
>  
>  /*
>   * Default minimum read and write sizes.
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 3ad0b17..0f26d7a 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -112,6 +112,8 @@ static struct xfs_kobj xfs_dbg_kobj;	/* global debug sysfs attrs */
>  #define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
>  #define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
>  
> +#define MNTOPT_DAX	"dax"		/* Enable direct access to bdev pages */
> +
>  /*
>   * Table driven mount option parser.
>   *
> @@ -363,6 +365,10 @@ xfs_parseargs(
>  			mp->m_flags |= XFS_MOUNT_DISCARD;
>  		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
>  			mp->m_flags &= ~XFS_MOUNT_DISCARD;
> +#ifdef CONFIG_FS_DAX
> +		} else if (!strcmp(this_char, MNTOPT_DAX)) {
> +			mp->m_flags |= XFS_MOUNT_DAX;

Hi

So what I see, (I might be wrong), is that once this flag is set here the
fs (At above xfs_diflags_to_iflags() ) will start serving DAX inodes.

This is a problem because the bdev passed in might not support direct_access
at all.

I think we might want a dax_supported(sb) and call somewhere at mount time.

> +#endif
>  		} else {
>  			xfs_warn(mp, "unknown mount option [%s].", this_char);
>  			return -EINVAL;
> @@ -452,8 +458,8 @@ done:
>  }
>  
>  struct proc_xfs_info {
> -	int	flag;
> -	char	*str;
> +	uint64_t	flag;
> +	char		*str;
>  };
>  
>  STATIC int
> @@ -474,6 +480,7 @@ xfs_showargs(
>  		{ XFS_MOUNT_GRPID,		"," MNTOPT_GRPID },
>  		{ XFS_MOUNT_DISCARD,		"," MNTOPT_DISCARD },
>  		{ XFS_MOUNT_SMALL_INUMS,	"," MNTOPT_32BITINODE },
> +		{ XFS_MOUNT_DAX,		"," MNTOPT_DAX },
>  		{ 0, NULL }
>  	};
>  	static struct proc_xfs_info xfs_info_unset[] = {
> @@ -1501,6 +1508,20 @@ xfs_fs_fill_super(
>  	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
>  		sb->s_flags |= MS_I_VERSION;
>  
> +	if (mp->m_flags & XFS_MOUNT_DAX) {
> +		xfs_warn(mp,
> +	"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
> +		if (sb->s_blocksize != PAGE_SIZE) {
> +			xfs_alert(mp,
> +		"Filesystem block size invalid for DAX Turning DAX off.");
> +			mp->m_flags &= ~XFS_MOUNT_DAX;
> +		} else if (!sb->s_bdev->bd_disk->fops->direct_access) {
> +			xfs_alert(mp,
> +		"Block device does not support DAX Turning DAX off.");
> +			mp->m_flags &= ~XFS_MOUNT_DAX;
> +		}
> +	}
> 

If we agree about the s_flags MS_MOUNT_DAX  then we can define a
	if (MNTOPT_DAX)
		dax_enable_if_supported(sb);

This will try a call to bdev_direct_access(sb->s_bdev, ...) and set the
flag if everything is OK, else will leave it off.

(I can do this patch if you want)

>  	error = xfs_mountfs(mp);
>  	if (error)
>  		goto out_filestream_unmount;
> 

Thanks
Boaz


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/8] xfs: add initial DAX support
@ 2015-03-24 12:52     ` Boaz Harrosh
  0 siblings, 0 replies; 57+ messages in thread
From: Boaz Harrosh @ 2015-03-24 12:52 UTC (permalink / raw)
  To: Dave Chinner, xfs; +Cc: linux-fsdevel, willy, jack

On 03/24/2015 12:51 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add initial DAX support to XFS. To do this we need a new mount
> option to turn DAX on filesystem, and we need to propagate thi into
> the inode flags whenever an inode is instantiated so that the
> per-inode checks throughout the code Do The Right Thing.
> 
> There are still some things remaining to be done:
> 
> 	- needs per-inode flags to mark inodes as DAX enabled, and
> 	  an inheritance flag to enable automatic filesystem
> 	  propagation of the property
> 	- fails occasionally with zero length writes instead of
> 	  ENOSPC errors, so error propagation inside/from the DAX
> 	  code need work
> 	- occasionally creates two extents rather than a single
> 	  larger extent like non-dax filesystems.
> 	- much more testing
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iops.c  | 24 ++++++++++++------------
>  fs/xfs/xfs_mount.h |  2 ++
>  fs/xfs/xfs_super.c | 25 +++++++++++++++++++++++--
>  3 files changed, 37 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 9ca5352..695d857 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1195,22 +1195,22 @@ xfs_diflags_to_iflags(
>  	struct inode		*inode,
>  	struct xfs_inode	*ip)
>  {
> -	if (ip->i_d.di_flags & XFS_DIFLAG_IMMUTABLE)
> +	uint16_t		flags = ip->i_d.di_flags;
> +
> +	inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
> +			    S_NOATIME | S_DAX);
> +
> +	if (flags & XFS_DIFLAG_IMMUTABLE)
>  		inode->i_flags |= S_IMMUTABLE;
> -	else
> -		inode->i_flags &= ~S_IMMUTABLE;
> -	if (ip->i_d.di_flags & XFS_DIFLAG_APPEND)
> +	if (flags & XFS_DIFLAG_APPEND)
>  		inode->i_flags |= S_APPEND;
> -	else
> -		inode->i_flags &= ~S_APPEND;
> -	if (ip->i_d.di_flags & XFS_DIFLAG_SYNC)
> +	if (flags & XFS_DIFLAG_SYNC)
>  		inode->i_flags |= S_SYNC;
> -	else
> -		inode->i_flags &= ~S_SYNC;
> -	if (ip->i_d.di_flags & XFS_DIFLAG_NOATIME)
> +	if (flags & XFS_DIFLAG_NOATIME)
>  		inode->i_flags |= S_NOATIME;
> -	else
> -		inode->i_flags &= ~S_NOATIME;
> +	/* XXX: Also needs an on-disk per inode flag! */
> +	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
> +		inode->i_flags |= S_DAX;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 8c995a2..cd44e88 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -179,6 +179,8 @@ typedef struct xfs_mount {
>  						   allocator */
>  #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
>  
> +#define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
> +
>  
>  /*
>   * Default minimum read and write sizes.
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 3ad0b17..0f26d7a 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -112,6 +112,8 @@ static struct xfs_kobj xfs_dbg_kobj;	/* global debug sysfs attrs */
>  #define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
>  #define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
>  
> +#define MNTOPT_DAX	"dax"		/* Enable direct access to bdev pages */
> +
>  /*
>   * Table driven mount option parser.
>   *
> @@ -363,6 +365,10 @@ xfs_parseargs(
>  			mp->m_flags |= XFS_MOUNT_DISCARD;
>  		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
>  			mp->m_flags &= ~XFS_MOUNT_DISCARD;
> +#ifdef CONFIG_FS_DAX
> +		} else if (!strcmp(this_char, MNTOPT_DAX)) {
> +			mp->m_flags |= XFS_MOUNT_DAX;

Hi

So what I see, (I might be wrong), is that once this flag is set here the
fs (At above xfs_diflags_to_iflags() ) will start serving DAX inodes.

This is a problem because the bdev passed in might not support direct_access
at all.

I think we might want a dax_supported(sb) and call somewhere at mount time.

> +#endif
>  		} else {
>  			xfs_warn(mp, "unknown mount option [%s].", this_char);
>  			return -EINVAL;
> @@ -452,8 +458,8 @@ done:
>  }
>  
>  struct proc_xfs_info {
> -	int	flag;
> -	char	*str;
> +	uint64_t	flag;
> +	char		*str;
>  };
>  
>  STATIC int
> @@ -474,6 +480,7 @@ xfs_showargs(
>  		{ XFS_MOUNT_GRPID,		"," MNTOPT_GRPID },
>  		{ XFS_MOUNT_DISCARD,		"," MNTOPT_DISCARD },
>  		{ XFS_MOUNT_SMALL_INUMS,	"," MNTOPT_32BITINODE },
> +		{ XFS_MOUNT_DAX,		"," MNTOPT_DAX },
>  		{ 0, NULL }
>  	};
>  	static struct proc_xfs_info xfs_info_unset[] = {
> @@ -1501,6 +1508,20 @@ xfs_fs_fill_super(
>  	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
>  		sb->s_flags |= MS_I_VERSION;
>  
> +	if (mp->m_flags & XFS_MOUNT_DAX) {
> +		xfs_warn(mp,
> +	"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
> +		if (sb->s_blocksize != PAGE_SIZE) {
> +			xfs_alert(mp,
> +		"Filesystem block size invalid for DAX Turning DAX off.");
> +			mp->m_flags &= ~XFS_MOUNT_DAX;
> +		} else if (!sb->s_bdev->bd_disk->fops->direct_access) {
> +			xfs_alert(mp,
> +		"Block device does not support DAX Turning DAX off.");
> +			mp->m_flags &= ~XFS_MOUNT_DAX;
> +		}
> +	}
> 

If we agree about the s_flags MS_MOUNT_DAX  then we can define a
	if (MNTOPT_DAX)
		dax_enable_if_supported(sb);

This will try a call to bdev_direct_access(sb->s_bdev, ...) and set the
flag if everything is OK, else will leave it off.

(I can do this patch if you want)

>  	error = xfs_mountfs(mp);
>  	if (error)
>  		goto out_filestream_unmount;
> 

Thanks
Boaz

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
  2015-03-24 12:08     ` Boaz Harrosh
@ 2015-03-24 21:17       ` Dave Chinner
  -1 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 21:17 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: xfs, linux-fsdevel, willy, jack

On Tue, Mar 24, 2015 at 02:08:36PM +0200, Boaz Harrosh wrote:
> On 03/24/2015 12:51 PM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Add the initial support for DAX file operations to XFS. This
> > includes the necessary block allocation and mmap page fault hooks
> > for DAX to function.
> > 
> > Note: we specifically have to disable splice_read/write from
> 
> It looks from below code that splice_read is perfectly supportable
> through the call to default_file_splice_read() so you might
> want to change the comment here.

Ah, I forgot to update the description.

> Regarding splice_write:
> It looks to me like you left the vector at iter_file_splice_write().
> If I understand correctly I think you need to call default_file_splice_write()
> that uses memcpy, just as it was at the previous patches when you left
> the vector NULL. So I think you need the same switch for write
> as you do below for read.

iter_file_splice_write() uses vfs_iter_write, which goes through
->write_iter, which we punt to the direct IO path for DAX.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
@ 2015-03-24 21:17       ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 21:17 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: linux-fsdevel, willy, jack, xfs

On Tue, Mar 24, 2015 at 02:08:36PM +0200, Boaz Harrosh wrote:
> On 03/24/2015 12:51 PM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Add the initial support for DAX file operations to XFS. This
> > includes the necessary block allocation and mmap page fault hooks
> > for DAX to function.
> > 
> > Note: we specifically have to disable splice_read/write from
> 
> It looks from below code that splice_read is perfectly supportable
> through the call to default_file_splice_read() so you might
> want to change the comment here.

Ah, I forgot to update the description.

> Regarding splice_write:
> It looks to me like you left the vector at iter_file_splice_write().
> If I understand correctly I think you need to call default_file_splice_write()
> that uses memcpy, just as it was at the previous patches when you left
> the vector NULL. So I think you need the same switch for write
> as you do below for read.

iter_file_splice_write() uses vfs_iter_write, which goes through
->write_iter, which we punt to the direct IO path for DAX.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/8] xfs: add initial DAX support
  2015-03-24 12:52     ` Boaz Harrosh
@ 2015-03-24 21:25       ` Dave Chinner
  -1 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 21:25 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: xfs, linux-fsdevel, willy, jack

On Tue, Mar 24, 2015 at 02:52:48PM +0200, Boaz Harrosh wrote:
> On 03/24/2015 12:51 PM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > @@ -363,6 +365,10 @@ xfs_parseargs(
> >  			mp->m_flags |= XFS_MOUNT_DISCARD;
> >  		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
> >  			mp->m_flags &= ~XFS_MOUNT_DISCARD;
> > +#ifdef CONFIG_FS_DAX
> > +		} else if (!strcmp(this_char, MNTOPT_DAX)) {
> > +			mp->m_flags |= XFS_MOUNT_DAX;
> 
> Hi
> 
> So what I see, (I might be wrong), is that once this flag is set here the
> fs (At above xfs_diflags_to_iflags() ) will start serving DAX inodes.

No, it won't, because...

> > @@ -1501,6 +1508,20 @@ xfs_fs_fill_super(
> >  	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
> >  		sb->s_flags |= MS_I_VERSION;
> >  
> > +	if (mp->m_flags & XFS_MOUNT_DAX) {
> > +		xfs_warn(mp,
> > +	"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
> > +		if (sb->s_blocksize != PAGE_SIZE) {
> > +			xfs_alert(mp,
> > +		"Filesystem block size invalid for DAX Turning DAX off.");
> > +			mp->m_flags &= ~XFS_MOUNT_DAX;
> > +		} else if (!sb->s_bdev->bd_disk->fops->direct_access) {
> > +			xfs_alert(mp,
> > +		"Block device does not support DAX Turning DAX off.");
> > +			mp->m_flags &= ~XFS_MOUNT_DAX;
> > +		}
> > +	}

We run these tests and then clear the XFS_MOUNT_DAX flag later.

> If we agree about the s_flags MS_MOUNT_DAX  then we can define a
> 	if (MNTOPT_DAX)
> 		dax_enable_if_supported(sb);

No, I don't see any reason for a mount flag for this, because we
do not want to be stuck relying on a mount option forever. Once
This code has been shaken out, I fully intend DAX to be turned on
automatically for any device that supports, and for it to be turned
on and off on a per-inode basis through on-disk inode flags. i.e.
the mount option is really a temporary solution and I don't want to
code in any dependencies on DAX being mount wide existing.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/8] xfs: add initial DAX support
@ 2015-03-24 21:25       ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-03-24 21:25 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: linux-fsdevel, willy, jack, xfs

On Tue, Mar 24, 2015 at 02:52:48PM +0200, Boaz Harrosh wrote:
> On 03/24/2015 12:51 PM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > @@ -363,6 +365,10 @@ xfs_parseargs(
> >  			mp->m_flags |= XFS_MOUNT_DISCARD;
> >  		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
> >  			mp->m_flags &= ~XFS_MOUNT_DISCARD;
> > +#ifdef CONFIG_FS_DAX
> > +		} else if (!strcmp(this_char, MNTOPT_DAX)) {
> > +			mp->m_flags |= XFS_MOUNT_DAX;
> 
> Hi
> 
> So what I see, (I might be wrong), is that once this flag is set here the
> fs (At above xfs_diflags_to_iflags() ) will start serving DAX inodes.

No, it won't, because...

> > @@ -1501,6 +1508,20 @@ xfs_fs_fill_super(
> >  	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
> >  		sb->s_flags |= MS_I_VERSION;
> >  
> > +	if (mp->m_flags & XFS_MOUNT_DAX) {
> > +		xfs_warn(mp,
> > +	"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
> > +		if (sb->s_blocksize != PAGE_SIZE) {
> > +			xfs_alert(mp,
> > +		"Filesystem block size invalid for DAX Turning DAX off.");
> > +			mp->m_flags &= ~XFS_MOUNT_DAX;
> > +		} else if (!sb->s_bdev->bd_disk->fops->direct_access) {
> > +			xfs_alert(mp,
> > +		"Block device does not support DAX Turning DAX off.");
> > +			mp->m_flags &= ~XFS_MOUNT_DAX;
> > +		}
> > +	}

We run these tests and then clear the XFS_MOUNT_DAX flag later.

> If we agree about the s_flags MS_MOUNT_DAX  then we can define a
> 	if (MNTOPT_DAX)
> 		dax_enable_if_supported(sb);

No, I don't see any reason for a mount flag for this, because we
do not want to be stuck relying on a mount option forever. Once
This code has been shaken out, I fully intend DAX to be turned on
automatically for any device that supports, and for it to be turned
on and off on a per-inode basis through on-disk inode flags. i.e.
the mount option is really a temporary solution and I don't want to
code in any dependencies on DAX being mount wide existing.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
  2015-03-24 21:17       ` Dave Chinner
@ 2015-03-25  8:47         ` Boaz Harrosh
  -1 siblings, 0 replies; 57+ messages in thread
From: Boaz Harrosh @ 2015-03-25  8:47 UTC (permalink / raw)
  To: Dave Chinner, Boaz Harrosh; +Cc: xfs, linux-fsdevel, willy, jack

On 03/24/2015 11:17 PM, Dave Chinner wrote:
> On Tue, Mar 24, 2015 at 02:08:36PM +0200, Boaz Harrosh wrote:
<>
> 
> iter_file_splice_write() uses vfs_iter_write, which goes through
> ->write_iter, which we punt to the direct IO path for DAX.
> 

Cool real cool. What I thought too. Will send a patch to fix
and simplify ext2/4 ASAP. Its ready only still testing ...

> Cheers,
> Dave.
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
@ 2015-03-25  8:47         ` Boaz Harrosh
  0 siblings, 0 replies; 57+ messages in thread
From: Boaz Harrosh @ 2015-03-25  8:47 UTC (permalink / raw)
  To: Dave Chinner, Boaz Harrosh; +Cc: linux-fsdevel, willy, jack, xfs

On 03/24/2015 11:17 PM, Dave Chinner wrote:
> On Tue, Mar 24, 2015 at 02:08:36PM +0200, Boaz Harrosh wrote:
<>
> 
> iter_file_splice_write() uses vfs_iter_write, which goes through
> ->write_iter, which we punt to the direct IO path for DAX.
> 

Cool real cool. What I thought too. Will send a patch to fix
and simplify ext2/4 ASAP. Its ready only still testing ...

> Cheers,
> Dave.
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/8] xfs: add initial DAX support
  2015-03-24 21:25       ` Dave Chinner
@ 2015-03-25  9:14         ` Boaz Harrosh
  -1 siblings, 0 replies; 57+ messages in thread
From: Boaz Harrosh @ 2015-03-25  9:14 UTC (permalink / raw)
  To: Dave Chinner, Boaz Harrosh; +Cc: xfs, linux-fsdevel, willy, jack

On 03/24/2015 11:25 PM, Dave Chinner wrote:
<>
> 
> No, I don't see any reason for a mount flag for this, because we
> do not want to be stuck relying on a mount option forever. Once
> This code has been shaken out, I fully intend DAX to be turned on
> automatically for any device that supports, and for it to be turned
> on and off on a per-inode basis through on-disk inode flags. i.e.
> the mount option is really a temporary solution and I don't want to
> code in any dependencies on DAX being mount wide existing.
> 

OK I dropped this flag. I will just check for the direct_access
vector. I'll need a wrapper that defines to nothing in the
!CONFIG_FS_DAX case

> Cheers,
> Dave.
> 

Thanks for your help
Boaz


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/8] xfs: add initial DAX support
@ 2015-03-25  9:14         ` Boaz Harrosh
  0 siblings, 0 replies; 57+ messages in thread
From: Boaz Harrosh @ 2015-03-25  9:14 UTC (permalink / raw)
  To: Dave Chinner, Boaz Harrosh; +Cc: linux-fsdevel, willy, jack, xfs

On 03/24/2015 11:25 PM, Dave Chinner wrote:
<>
> 
> No, I don't see any reason for a mount flag for this, because we
> do not want to be stuck relying on a mount option forever. Once
> This code has been shaken out, I fully intend DAX to be turned on
> automatically for any device that supports, and for it to be turned
> on and off on a per-inode basis through on-disk inode flags. i.e.
> the mount option is really a temporary solution and I don't want to
> code in any dependencies on DAX being mount wide existing.
> 

OK I dropped this flag. I will just check for the direct_access
vector. I'll need a wrapper that defines to nothing in the
!CONFIG_FS_DAX case

> Cheers,
> Dave.
> 

Thanks for your help
Boaz

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 1/8] xfs: mmap lock needs to be inside freeze protection
  2015-03-24 10:50 ` [PATCH 1/8] xfs: mmap lock needs to be inside freeze protection Dave Chinner
@ 2015-04-01 14:34     ` Jan Kara
  2015-04-06 17:48     ` Brian Foster
  1 sibling, 0 replies; 57+ messages in thread
From: Jan Kara @ 2015-04-01 14:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, willy, jack

On Tue 24-03-15 21:50:59, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Lock ordering for the new mmap lock needs to be:
> 
> mmap_sem
>   sb_start_pagefault
>     i_mmap_lock
>       page lock
>         <fault processsing>
> 
> Right now xfs_vm_page_mkwrite gets this the wrong way around,
> While technically it cannot deadlock due to the current freeze
> ordering, it's still a landmine that might explode if we change
> anything in future. Hence we need to nest the locks correctly.
  Looks good to me. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_file.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index dc5f609..a4c882e 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1449,15 +1449,20 @@ xfs_filemap_page_mkwrite(
>  	struct vm_fault		*vmf)
>  {
>  	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> -	int			error;
> +	int			ret;
>  
>  	trace_xfs_filemap_page_mkwrite(ip);
>  
> +	sb_start_pagefault(VFS_I(ip)->i_sb);
> +	file_update_time(vma->vm_file);
>  	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> -	error = block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +
> +	ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +
>  	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> +	sb_end_pagefault(VFS_I(ip)->i_sb);
>  
> -	return error;
> +	return block_page_mkwrite_return(ret);
>  }
>  
>  const struct file_operations xfs_file_operations = {
> -- 
> 2.0.0
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 1/8] xfs: mmap lock needs to be inside freeze protection
@ 2015-04-01 14:34     ` Jan Kara
  0 siblings, 0 replies; 57+ messages in thread
From: Jan Kara @ 2015-04-01 14:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, willy, jack, xfs

On Tue 24-03-15 21:50:59, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Lock ordering for the new mmap lock needs to be:
> 
> mmap_sem
>   sb_start_pagefault
>     i_mmap_lock
>       page lock
>         <fault processsing>
> 
> Right now xfs_vm_page_mkwrite gets this the wrong way around,
> While technically it cannot deadlock due to the current freeze
> ordering, it's still a landmine that might explode if we change
> anything in future. Hence we need to nest the locks correctly.
  Looks good to me. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_file.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index dc5f609..a4c882e 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1449,15 +1449,20 @@ xfs_filemap_page_mkwrite(
>  	struct vm_fault		*vmf)
>  {
>  	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> -	int			error;
> +	int			ret;
>  
>  	trace_xfs_filemap_page_mkwrite(ip);
>  
> +	sb_start_pagefault(VFS_I(ip)->i_sb);
> +	file_update_time(vma->vm_file);
>  	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> -	error = block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +
> +	ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +
>  	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> +	sb_end_pagefault(VFS_I(ip)->i_sb);
>  
> -	return error;
> +	return block_page_mkwrite_return(ret);
>  }
>  
>  const struct file_operations xfs_file_operations = {
> -- 
> 2.0.0
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/8] dax: don't abuse get_block mapping for endio callbacks
  2015-03-24 10:51   ` Dave Chinner
@ 2015-04-01 14:53     ` Jan Kara
  -1 siblings, 0 replies; 57+ messages in thread
From: Jan Kara @ 2015-04-01 14:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, willy, jack

On Tue 24-03-15 21:51:00, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> dax_fault() currently relies on the get_block callback to attach an
> io completion callback to the mapping buffer head so that it can
> run unwritten extent conversion after zeroing allocated blocks.
> 
> Instead of this hack, pass the conversion callback directly into
> dax_fault() similar to the get_block callback. When the filesystem
> allocates unwritten extents, it will set the buffer_unwritten()
> flag, and hence the dax_fault code can call the completion function
> in the contexts where it is necessary without overloading the
> mapping buffer head.
> 
> Note: The changes to ext4 to use this interface are suspect at best.
> In fact, the way ext4 did this end_io assignment in the first place
> looks suspect because it only set a completion callback when there
> wasn't already some other write() call taking place on the same
> inode. The ext4 end_io code looks rather intricate and fragile with
> all it's reference counting and passing to different contexts for
> modification via inode private pointers that aren't protected by
> locks...
  Yeah, the io_end handling is currently buggy when you try to do more than
one write in parallel (normally we don't allow that and seriealize
everything behind i_mutex). That needs fixing but here what you did looks
good enough for this patch set. You have my

Acked-by: Jan Kara <jack@suse.cz>

								Honza

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/dax.c           | 17 +++++++++++------
>  fs/ext2/file.c     |  4 ++--
>  fs/ext4/file.c     | 16 ++++++++++++++--
>  fs/ext4/inode.c    | 21 +++++++--------------
>  include/linux/fs.h |  6 ++++--
>  5 files changed, 38 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ed1619e..431ec2b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -310,14 +310,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>   out:
>  	i_mmap_unlock_read(mapping);
>  
> -	if (bh->b_end_io)
> -		bh->b_end_io(bh, 1);
> -
>  	return error;
>  }
>  
>  static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> -			get_block_t get_block)
> +			get_block_t get_block, dax_iodone_t complete_unwritten)
>  {
>  	struct file *file = vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
> @@ -418,7 +415,15 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		page_cache_release(page);
>  	}
>  
> +	/*
> +	 * If we successfully insert the new mapping over an unwritten extent,
> +	 * we need to ensure we convert the unwritten extent. If there is an
> +	 * error inserting the mapping, we leave the extent as unwritten to
> +	 * prevent exposure of the stale underlying data to userspace.
> +	 */
>  	error = dax_insert_mapping(inode, &bh, vma, vmf);
> +	if (!error && buffer_unwritten(&bh))
> +		complete_unwritten(&bh, 1);
>  
>   out:
>  	if (error == -ENOMEM)
> @@ -446,7 +451,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>   * fault handler for DAX files.
>   */
>  int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> -			get_block_t get_block)
> +	      get_block_t get_block, dax_iodone_t complete_unwritten)
>  {
>  	int result;
>  	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> @@ -455,7 +460,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		sb_start_pagefault(sb);
>  		file_update_time(vma->vm_file);
>  	}
> -	result = do_dax_fault(vma, vmf, get_block);
> +	result = do_dax_fault(vma, vmf, get_block, complete_unwritten);
>  	if (vmf->flags & FAULT_FLAG_WRITE)
>  		sb_end_pagefault(sb);
>  
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index e317017..8da747a 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -28,12 +28,12 @@
>  #ifdef CONFIG_FS_DAX
>  static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_fault(vma, vmf, ext2_get_block);
> +	return dax_fault(vma, vmf, ext2_get_block, NULL);
>  }
>  
>  static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_mkwrite(vma, vmf, ext2_get_block);
> +	return dax_mkwrite(vma, vmf, ext2_get_block, NULL);
>  }
>  
>  static const struct vm_operations_struct ext2_dax_vm_ops = {
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 33a09da..f7dabb1 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -192,15 +192,27 @@ errout:
>  }
>  
>  #ifdef CONFIG_FS_DAX
> +static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
> +{
> +	struct inode *inode = bh->b_assoc_map->host;
> +	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
> +	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
> +	int err;
> +	if (!uptodate)
> +		return;
> +	WARN_ON(!buffer_unwritten(bh));
> +	err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
> +}
> +
>  static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_fault(vma, vmf, ext4_get_block);
> +	return dax_fault(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
>  					/* Is this the right get_block? */
>  }
>  
>  static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_mkwrite(vma, vmf, ext4_get_block);
> +	return dax_mkwrite(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
>  }
>  
>  static const struct vm_operations_struct ext4_dax_vm_ops = {
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5cb9a21..43433de 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -657,18 +657,6 @@ has_zeroout:
>  	return retval;
>  }
>  
> -static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
> -{
> -	struct inode *inode = bh->b_assoc_map->host;
> -	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
> -	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
> -	int err;
> -	if (!uptodate)
> -		return;
> -	WARN_ON(!buffer_unwritten(bh));
> -	err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
> -}
> -
>  /* Maximum number of blocks we map for direct IO at once. */
>  #define DIO_MAX_BLOCKS 4096
>  
> @@ -706,10 +694,15 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
>  
>  		map_bh(bh, inode->i_sb, map.m_pblk);
>  		bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
> -		if (IS_DAX(inode) && buffer_unwritten(bh) && !io_end) {
> +		if (IS_DAX(inode) && buffer_unwritten(bh)) {
> +			/*
> +			 * dgc: I suspect unwritten conversion on ext4+DAX is
> +			 * fundamentally broken here when there are concurrent
> +			 * read/write in progress on this inode.
> +			 */
> +			WARN_ON_ONCE(io_end);
>  			bh->b_assoc_map = inode->i_mapping;
>  			bh->b_private = (void *)(unsigned long)iblock;
> -			bh->b_end_io = ext4_end_io_unwritten;
>  		}
>  		if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
>  			set_buffer_defer_completion(bh);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 937e280..82100ae 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -70,6 +70,7 @@ typedef int (get_block_t)(struct inode *inode, sector_t iblock,
>  			struct buffer_head *bh_result, int create);
>  typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  			ssize_t bytes, void *private);
> +typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
>  
>  #define MAY_EXEC		0x00000001
>  #define MAY_WRITE		0x00000002
> @@ -2603,8 +2604,9 @@ ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
>  int dax_clear_blocks(struct inode *, sector_t block, long size);
>  int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
> -int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
> -#define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
> +int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
> +		dax_iodone_t);
> +#define dax_mkwrite(vma, vmf, gb, iod)	dax_fault(vma, vmf, gb, iod)
>  
>  #ifdef CONFIG_BLOCK
>  typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
> -- 
> 2.0.0
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/8] dax: don't abuse get_block mapping for endio callbacks
@ 2015-04-01 14:53     ` Jan Kara
  0 siblings, 0 replies; 57+ messages in thread
From: Jan Kara @ 2015-04-01 14:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, willy, jack, xfs

On Tue 24-03-15 21:51:00, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> dax_fault() currently relies on the get_block callback to attach an
> io completion callback to the mapping buffer head so that it can
> run unwritten extent conversion after zeroing allocated blocks.
> 
> Instead of this hack, pass the conversion callback directly into
> dax_fault() similar to the get_block callback. When the filesystem
> allocates unwritten extents, it will set the buffer_unwritten()
> flag, and hence the dax_fault code can call the completion function
> in the contexts where it is necessary without overloading the
> mapping buffer head.
> 
> Note: The changes to ext4 to use this interface are suspect at best.
> In fact, the way ext4 did this end_io assignment in the first place
> looks suspect because it only set a completion callback when there
> wasn't already some other write() call taking place on the same
> inode. The ext4 end_io code looks rather intricate and fragile with
> all it's reference counting and passing to different contexts for
> modification via inode private pointers that aren't protected by
> locks...
  Yeah, the io_end handling is currently buggy when you try to do more than
one write in parallel (normally we don't allow that and seriealize
everything behind i_mutex). That needs fixing but here what you did looks
good enough for this patch set. You have my

Acked-by: Jan Kara <jack@suse.cz>

								Honza

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/dax.c           | 17 +++++++++++------
>  fs/ext2/file.c     |  4 ++--
>  fs/ext4/file.c     | 16 ++++++++++++++--
>  fs/ext4/inode.c    | 21 +++++++--------------
>  include/linux/fs.h |  6 ++++--
>  5 files changed, 38 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ed1619e..431ec2b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -310,14 +310,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>   out:
>  	i_mmap_unlock_read(mapping);
>  
> -	if (bh->b_end_io)
> -		bh->b_end_io(bh, 1);
> -
>  	return error;
>  }
>  
>  static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> -			get_block_t get_block)
> +			get_block_t get_block, dax_iodone_t complete_unwritten)
>  {
>  	struct file *file = vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
> @@ -418,7 +415,15 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		page_cache_release(page);
>  	}
>  
> +	/*
> +	 * If we successfully insert the new mapping over an unwritten extent,
> +	 * we need to ensure we convert the unwritten extent. If there is an
> +	 * error inserting the mapping, we leave the extent as unwritten to
> +	 * prevent exposure of the stale underlying data to userspace.
> +	 */
>  	error = dax_insert_mapping(inode, &bh, vma, vmf);
> +	if (!error && buffer_unwritten(&bh))
> +		complete_unwritten(&bh, 1);
>  
>   out:
>  	if (error == -ENOMEM)
> @@ -446,7 +451,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>   * fault handler for DAX files.
>   */
>  int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> -			get_block_t get_block)
> +	      get_block_t get_block, dax_iodone_t complete_unwritten)
>  {
>  	int result;
>  	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> @@ -455,7 +460,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		sb_start_pagefault(sb);
>  		file_update_time(vma->vm_file);
>  	}
> -	result = do_dax_fault(vma, vmf, get_block);
> +	result = do_dax_fault(vma, vmf, get_block, complete_unwritten);
>  	if (vmf->flags & FAULT_FLAG_WRITE)
>  		sb_end_pagefault(sb);
>  
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index e317017..8da747a 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -28,12 +28,12 @@
>  #ifdef CONFIG_FS_DAX
>  static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_fault(vma, vmf, ext2_get_block);
> +	return dax_fault(vma, vmf, ext2_get_block, NULL);
>  }
>  
>  static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_mkwrite(vma, vmf, ext2_get_block);
> +	return dax_mkwrite(vma, vmf, ext2_get_block, NULL);
>  }
>  
>  static const struct vm_operations_struct ext2_dax_vm_ops = {
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 33a09da..f7dabb1 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -192,15 +192,27 @@ errout:
>  }
>  
>  #ifdef CONFIG_FS_DAX
> +static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
> +{
> +	struct inode *inode = bh->b_assoc_map->host;
> +	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
> +	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
> +	int err;
> +	if (!uptodate)
> +		return;
> +	WARN_ON(!buffer_unwritten(bh));
> +	err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
> +}
> +
>  static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_fault(vma, vmf, ext4_get_block);
> +	return dax_fault(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
>  					/* Is this the right get_block? */
>  }
>  
>  static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	return dax_mkwrite(vma, vmf, ext4_get_block);
> +	return dax_mkwrite(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
>  }
>  
>  static const struct vm_operations_struct ext4_dax_vm_ops = {
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5cb9a21..43433de 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -657,18 +657,6 @@ has_zeroout:
>  	return retval;
>  }
>  
> -static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
> -{
> -	struct inode *inode = bh->b_assoc_map->host;
> -	/* XXX: breaks on 32-bit > 16GB. Is that even supported? */
> -	loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
> -	int err;
> -	if (!uptodate)
> -		return;
> -	WARN_ON(!buffer_unwritten(bh));
> -	err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
> -}
> -
>  /* Maximum number of blocks we map for direct IO at once. */
>  #define DIO_MAX_BLOCKS 4096
>  
> @@ -706,10 +694,15 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
>  
>  		map_bh(bh, inode->i_sb, map.m_pblk);
>  		bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
> -		if (IS_DAX(inode) && buffer_unwritten(bh) && !io_end) {
> +		if (IS_DAX(inode) && buffer_unwritten(bh)) {
> +			/*
> +			 * dgc: I suspect unwritten conversion on ext4+DAX is
> +			 * fundamentally broken here when there are concurrent
> +			 * read/write in progress on this inode.
> +			 */
> +			WARN_ON_ONCE(io_end);
>  			bh->b_assoc_map = inode->i_mapping;
>  			bh->b_private = (void *)(unsigned long)iblock;
> -			bh->b_end_io = ext4_end_io_unwritten;
>  		}
>  		if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
>  			set_buffer_defer_completion(bh);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 937e280..82100ae 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -70,6 +70,7 @@ typedef int (get_block_t)(struct inode *inode, sector_t iblock,
>  			struct buffer_head *bh_result, int create);
>  typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  			ssize_t bytes, void *private);
> +typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
>  
>  #define MAY_EXEC		0x00000001
>  #define MAY_WRITE		0x00000002
> @@ -2603,8 +2604,9 @@ ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
>  int dax_clear_blocks(struct inode *, sector_t block, long size);
>  int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
> -int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
> -#define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
> +int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
> +		dax_iodone_t);
> +#define dax_mkwrite(vma, vmf, gb, iod)	dax_fault(vma, vmf, gb, iod)
>  
>  #ifdef CONFIG_BLOCK
>  typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
> -- 
> 2.0.0
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 3/8] dax: expose __dax_fault for filesystems with locking constraints
  2015-03-24 10:51   ` Dave Chinner
@ 2015-04-01 15:07     ` Jan Kara
  -1 siblings, 0 replies; 57+ messages in thread
From: Jan Kara @ 2015-04-01 15:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, willy, jack

On Tue 24-03-15 21:51:01, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Some filesystems cannot call dax_fault() directly because they have
> different locking and/or allocation constraints in the page fault IO
> path. To handle this, we need to follow the same model as the
> generic block_page_mkwrite code, where the internals are exposed via
> __block_page_mkwrite() so that filesystems can wrap the correct
> locking and operations around the outside.
> 
> This is loosely based on a patch originally from Matthew Willcox.
> Unlike the original patch, it does not change ext4 code, error
> returns or unwritten extent conversion handling.  It also adds a
> __dax_mkwrite() wrapper for .page_mkwrite implementations to do the
> right thing, too.
  We will need a normal error return from __dax_mkwrite() for proper ENOSPC
handling in ext4. You could do this when touching that code here if you
feel like that but if not, I can do that as a separate patch.

Anyway, feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/dax.c           | 15 +++++++++++++--
>  include/linux/fs.h |  5 ++++-
>  2 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 431ec2b..0121f7d 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -313,7 +313,17 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	return error;
>  }
>  
> -static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> +/**
> + * __dax_fault - handle a page fault on a DAX file
> + * @vma: The virtual memory area where the fault occurred
> + * @vmf: The description of the fault
> + * @get_block: The filesystem method used to translate file offsets to blocks
> + *
> + * When a page fault occurs, filesystems may call this helper in their
> + * fault handler for DAX files. __dax_fault() assumes the caller has done all
> + * the necessary locking for the page fault to proceed successfully.
> + */
> +int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  			get_block_t get_block, dax_iodone_t complete_unwritten)
>  {
>  	struct file *file = vma->vm_file;
> @@ -440,6 +450,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	}
>  	goto out;
>  }
> +EXPORT_SYMBOL(__dax_fault);
>  
>  /**
>   * dax_fault - handle a page fault on a DAX file
> @@ -460,7 +471,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		sb_start_pagefault(sb);
>  		file_update_time(vma->vm_file);
>  	}
> -	result = do_dax_fault(vma, vmf, get_block, complete_unwritten);
> +	result = __dax_fault(vma, vmf, get_block, complete_unwritten);
>  	if (vmf->flags & FAULT_FLAG_WRITE)
>  		sb_end_pagefault(sb);
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 82100ae..7e5a2d6 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2606,7 +2606,10 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
>  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
>  		dax_iodone_t);
> -#define dax_mkwrite(vma, vmf, gb, iod)	dax_fault(vma, vmf, gb, iod)
> +int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
> +		dax_iodone_t);
> +#define dax_mkwrite(vma, vmf, gb, iod)		dax_fault(vma, vmf, gb, iod)
> +#define __dax_mkwrite(vma, vmf, gb, iod)	__dax_fault(vma, vmf, gb, iod)
>  
>  #ifdef CONFIG_BLOCK
>  typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
> -- 
> 2.0.0
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 3/8] dax: expose __dax_fault for filesystems with locking constraints
@ 2015-04-01 15:07     ` Jan Kara
  0 siblings, 0 replies; 57+ messages in thread
From: Jan Kara @ 2015-04-01 15:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, willy, jack, xfs

On Tue 24-03-15 21:51:01, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Some filesystems cannot call dax_fault() directly because they have
> different locking and/or allocation constraints in the page fault IO
> path. To handle this, we need to follow the same model as the
> generic block_page_mkwrite code, where the internals are exposed via
> __block_page_mkwrite() so that filesystems can wrap the correct
> locking and operations around the outside.
> 
> This is loosely based on a patch originally from Matthew Willcox.
> Unlike the original patch, it does not change ext4 code, error
> returns or unwritten extent conversion handling.  It also adds a
> __dax_mkwrite() wrapper for .page_mkwrite implementations to do the
> right thing, too.
  We will need a normal error return from __dax_mkwrite() for proper ENOSPC
handling in ext4. You could do this when touching that code here if you
feel like that but if not, I can do that as a separate patch.

Anyway, feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/dax.c           | 15 +++++++++++++--
>  include/linux/fs.h |  5 ++++-
>  2 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 431ec2b..0121f7d 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -313,7 +313,17 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	return error;
>  }
>  
> -static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> +/**
> + * __dax_fault - handle a page fault on a DAX file
> + * @vma: The virtual memory area where the fault occurred
> + * @vmf: The description of the fault
> + * @get_block: The filesystem method used to translate file offsets to blocks
> + *
> + * When a page fault occurs, filesystems may call this helper in their
> + * fault handler for DAX files. __dax_fault() assumes the caller has done all
> + * the necessary locking for the page fault to proceed successfully.
> + */
> +int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  			get_block_t get_block, dax_iodone_t complete_unwritten)
>  {
>  	struct file *file = vma->vm_file;
> @@ -440,6 +450,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	}
>  	goto out;
>  }
> +EXPORT_SYMBOL(__dax_fault);
>  
>  /**
>   * dax_fault - handle a page fault on a DAX file
> @@ -460,7 +471,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		sb_start_pagefault(sb);
>  		file_update_time(vma->vm_file);
>  	}
> -	result = do_dax_fault(vma, vmf, get_block, complete_unwritten);
> +	result = __dax_fault(vma, vmf, get_block, complete_unwritten);
>  	if (vmf->flags & FAULT_FLAG_WRITE)
>  		sb_end_pagefault(sb);
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 82100ae..7e5a2d6 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2606,7 +2606,10 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
>  int dax_truncate_page(struct inode *, loff_t from, get_block_t);
>  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
>  		dax_iodone_t);
> -#define dax_mkwrite(vma, vmf, gb, iod)	dax_fault(vma, vmf, gb, iod)
> +int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
> +		dax_iodone_t);
> +#define dax_mkwrite(vma, vmf, gb, iod)		dax_fault(vma, vmf, gb, iod)
> +#define __dax_mkwrite(vma, vmf, gb, iod)	__dax_fault(vma, vmf, gb, iod)
>  
>  #ifdef CONFIG_BLOCK
>  typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
> -- 
> 2.0.0
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 1/8] xfs: mmap lock needs to be inside freeze protection
  2015-03-24 10:50 ` [PATCH 1/8] xfs: mmap lock needs to be inside freeze protection Dave Chinner
@ 2015-04-06 17:48     ` Brian Foster
  2015-04-06 17:48     ` Brian Foster
  1 sibling, 0 replies; 57+ messages in thread
From: Brian Foster @ 2015-04-06 17:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, willy, jack

On Tue, Mar 24, 2015 at 09:50:59PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Lock ordering for the new mmap lock needs to be:
> 
> mmap_sem
>   sb_start_pagefault
>     i_mmap_lock
>       page lock
>         <fault processsing>
> 
> Right now xfs_vm_page_mkwrite gets this the wrong way around,
> While technically it cannot deadlock due to the current freeze
> ordering, it's still a landmine that might explode if we change
> anything in future. Hence we need to nest the locks correctly.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index dc5f609..a4c882e 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1449,15 +1449,20 @@ xfs_filemap_page_mkwrite(
>  	struct vm_fault		*vmf)
>  {
>  	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> -	int			error;
> +	int			ret;
>  
>  	trace_xfs_filemap_page_mkwrite(ip);
>  
> +	sb_start_pagefault(VFS_I(ip)->i_sb);
> +	file_update_time(vma->vm_file);
>  	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> -	error = block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +
> +	ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +
>  	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> +	sb_end_pagefault(VFS_I(ip)->i_sb);
>  
> -	return error;
> +	return block_page_mkwrite_return(ret);
>  }
>  
>  const struct file_operations xfs_file_operations = {
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 1/8] xfs: mmap lock needs to be inside freeze protection
@ 2015-04-06 17:48     ` Brian Foster
  0 siblings, 0 replies; 57+ messages in thread
From: Brian Foster @ 2015-04-06 17:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, willy, jack, xfs

On Tue, Mar 24, 2015 at 09:50:59PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Lock ordering for the new mmap lock needs to be:
> 
> mmap_sem
>   sb_start_pagefault
>     i_mmap_lock
>       page lock
>         <fault processsing>
> 
> Right now xfs_vm_page_mkwrite gets this the wrong way around,
> While technically it cannot deadlock due to the current freeze
> ordering, it's still a landmine that might explode if we change
> anything in future. Hence we need to nest the locks correctly.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index dc5f609..a4c882e 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1449,15 +1449,20 @@ xfs_filemap_page_mkwrite(
>  	struct vm_fault		*vmf)
>  {
>  	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> -	int			error;
> +	int			ret;
>  
>  	trace_xfs_filemap_page_mkwrite(ip);
>  
> +	sb_start_pagefault(VFS_I(ip)->i_sb);
> +	file_update_time(vma->vm_file);
>  	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> -	error = block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +
> +	ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +
>  	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> +	sb_end_pagefault(VFS_I(ip)->i_sb);
>  
> -	return error;
> +	return block_page_mkwrite_return(ret);
>  }
>  
>  const struct file_operations xfs_file_operations = {
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 4/8] xfs: add DAX block zeroing support
  2015-03-24 10:51   ` Dave Chinner
@ 2015-04-06 17:48     ` Brian Foster
  -1 siblings, 0 replies; 57+ messages in thread
From: Brian Foster @ 2015-04-06 17:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, willy, jack

On Tue, Mar 24, 2015 at 09:51:02PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add initial support for DAX block zeroing operations to XFS. DAX
> cannot use buffered IO through the page cache for zeroing, nor do we
> need to issue IO for uncached block zeroing. In both cases, we can
> simply call out to the dax block zeroing function.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 23 +++++++++++++++++++----
>  fs/xfs/xfs_file.c      | 28 +++++++++++++++++-----------
>  2 files changed, 36 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 1bd5393..d1fe432 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1133,14 +1133,29 @@ xfs_zero_remaining_bytes(
>  			break;
>  		ASSERT(imap.br_blockcount >= 1);
>  		ASSERT(imap.br_startoff == offset_fsb);
> +		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> +
> +		if (imap.br_startblock == HOLESTARTBLOCK ||
> +		    imap.br_state == XFS_EXT_UNWRITTEN) {
> +			/* skip the entire extent */
> +			lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff +
> +						      imap.br_blockcount) - 1;
> +			continue;
> +		}
> +
>  		lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff + 1) - 1;
>  		if (lastoffset > endoff)
>  			lastoffset = endoff;
> -		if (imap.br_startblock == HOLESTARTBLOCK)
> -			continue;
> -		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> -		if (imap.br_state == XFS_EXT_UNWRITTEN)
> +
> +		/* DAX can just zero the backing device directly */
> +		if (IS_DAX(VFS_I(ip))) {
> +			error = dax_zero_page_range(VFS_I(ip), offset,
> +						    lastoffset - offset + 1,
> +						    xfs_get_blocks_dax);
> +			if (error)
> +				return error;
>  			continue;
> +		}
>  
>  		error = xfs_buf_read_uncached(XFS_IS_REALTIME_INODE(ip) ?
>  				mp->m_rtdev_targp : mp->m_ddev_targp,
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index a4c882e..94713c2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -97,7 +97,8 @@ xfs_iozero(
>  {
>  	struct page		*page;
>  	struct address_space	*mapping;
> -	int			status;
> +	int			status = 0;
> +
>  
>  	mapping = VFS_I(ip)->i_mapping;
>  	do {
> @@ -109,20 +110,25 @@ xfs_iozero(
>  		if (bytes > count)
>  			bytes = count;
>  
> -		status = pagecache_write_begin(NULL, mapping, pos, bytes,
> -					AOP_FLAG_UNINTERRUPTIBLE,
> -					&page, &fsdata);
> -		if (status)
> -			break;
> +		if (IS_DAX(VFS_I(ip)))
> +			dax_zero_page_range(VFS_I(ip), pos, bytes,
> +						   xfs_get_blocks_dax);

xfs_get_blocks_dax() isn't defined yet. We should also probably error
check here, yes?

A nit... If we have to update this patch, it would be nice to update the
comment above the function to adjust expectations with regard to the
suggestion that this always allocates blocks. If I follow the dax
codepath correctly, dax_zero_page_range() is a noop over holes or
unwritten blocks (not that it seems to matter for current callers).

> +		else {
> +			status = pagecache_write_begin(NULL, mapping, pos, bytes,
> +						AOP_FLAG_UNINTERRUPTIBLE,
> +						&page, &fsdata);
> +			if (status)
> +				break;
>  
> -		zero_user(page, offset, bytes);
> +			zero_user(page, offset, bytes);
>  
> -		status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
> -					page, fsdata);
> -		WARN_ON(status <= 0); /* can't return less than zero! */
> +			status = pagecache_write_end(NULL, mapping, pos, bytes,
> +						bytes, page, fsdata);
> +			WARN_ON(status <= 0); /* can't return less than zero! */
> +			status = 0;
> +		}
>  		pos += bytes;
>  		count -= bytes;
> -		status = 0;
>  	} while (count);
>  
>  	return (-status);

FWIW, that looks like a potential positive return code path
(write_begin)...

Brian

> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 4/8] xfs: add DAX block zeroing support
@ 2015-04-06 17:48     ` Brian Foster
  0 siblings, 0 replies; 57+ messages in thread
From: Brian Foster @ 2015-04-06 17:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, willy, jack, xfs

On Tue, Mar 24, 2015 at 09:51:02PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add initial support for DAX block zeroing operations to XFS. DAX
> cannot use buffered IO through the page cache for zeroing, nor do we
> need to issue IO for uncached block zeroing. In both cases, we can
> simply call out to the dax block zeroing function.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 23 +++++++++++++++++++----
>  fs/xfs/xfs_file.c      | 28 +++++++++++++++++-----------
>  2 files changed, 36 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 1bd5393..d1fe432 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1133,14 +1133,29 @@ xfs_zero_remaining_bytes(
>  			break;
>  		ASSERT(imap.br_blockcount >= 1);
>  		ASSERT(imap.br_startoff == offset_fsb);
> +		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> +
> +		if (imap.br_startblock == HOLESTARTBLOCK ||
> +		    imap.br_state == XFS_EXT_UNWRITTEN) {
> +			/* skip the entire extent */
> +			lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff +
> +						      imap.br_blockcount) - 1;
> +			continue;
> +		}
> +
>  		lastoffset = XFS_FSB_TO_B(mp, imap.br_startoff + 1) - 1;
>  		if (lastoffset > endoff)
>  			lastoffset = endoff;
> -		if (imap.br_startblock == HOLESTARTBLOCK)
> -			continue;
> -		ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
> -		if (imap.br_state == XFS_EXT_UNWRITTEN)
> +
> +		/* DAX can just zero the backing device directly */
> +		if (IS_DAX(VFS_I(ip))) {
> +			error = dax_zero_page_range(VFS_I(ip), offset,
> +						    lastoffset - offset + 1,
> +						    xfs_get_blocks_dax);
> +			if (error)
> +				return error;
>  			continue;
> +		}
>  
>  		error = xfs_buf_read_uncached(XFS_IS_REALTIME_INODE(ip) ?
>  				mp->m_rtdev_targp : mp->m_ddev_targp,
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index a4c882e..94713c2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -97,7 +97,8 @@ xfs_iozero(
>  {
>  	struct page		*page;
>  	struct address_space	*mapping;
> -	int			status;
> +	int			status = 0;
> +
>  
>  	mapping = VFS_I(ip)->i_mapping;
>  	do {
> @@ -109,20 +110,25 @@ xfs_iozero(
>  		if (bytes > count)
>  			bytes = count;
>  
> -		status = pagecache_write_begin(NULL, mapping, pos, bytes,
> -					AOP_FLAG_UNINTERRUPTIBLE,
> -					&page, &fsdata);
> -		if (status)
> -			break;
> +		if (IS_DAX(VFS_I(ip)))
> +			dax_zero_page_range(VFS_I(ip), pos, bytes,
> +						   xfs_get_blocks_dax);

xfs_get_blocks_dax() isn't defined yet. We should also probably error
check here, yes?

A nit... If we have to update this patch, it would be nice to update the
comment above the function to adjust expectations with regard to the
suggestion that this always allocates blocks. If I follow the dax
codepath correctly, dax_zero_page_range() is a noop over holes or
unwritten blocks (not that it seems to matter for current callers).

> +		else {
> +			status = pagecache_write_begin(NULL, mapping, pos, bytes,
> +						AOP_FLAG_UNINTERRUPTIBLE,
> +						&page, &fsdata);
> +			if (status)
> +				break;
>  
> -		zero_user(page, offset, bytes);
> +			zero_user(page, offset, bytes);
>  
> -		status = pagecache_write_end(NULL, mapping, pos, bytes, bytes,
> -					page, fsdata);
> -		WARN_ON(status <= 0); /* can't return less than zero! */
> +			status = pagecache_write_end(NULL, mapping, pos, bytes,
> +						bytes, page, fsdata);
> +			WARN_ON(status <= 0); /* can't return less than zero! */
> +			status = 0;
> +		}
>  		pos += bytes;
>  		count -= bytes;
> -		status = 0;
>  	} while (count);
>  
>  	return (-status);

FWIW, that looks like a potential positive return code path
(write_begin)...

Brian

> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
  2015-03-24 10:51   ` Dave Chinner
@ 2015-04-06 17:49     ` Brian Foster
  -1 siblings, 0 replies; 57+ messages in thread
From: Brian Foster @ 2015-04-06 17:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, willy, jack

On Tue, Mar 24, 2015 at 09:51:03PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add the initial support for DAX file operations to XFS. This
> includes the necessary block allocation and mmap page fault hooks
> for DAX to function.
> 
> Note: we specifically have to disable splice_read/write from
> occurring because they are dependent on usingthe page cache for
> correct operation. We have no page cache for DAX, so we need to
> disable them completely on DAX inodes.
> 

Looks like Boaz already pointed out this required an update wrt to
splice...

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c |  73 ++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_aops.h |   7 +++-
>  fs/xfs/xfs_file.c | 116 ++++++++++++++++++++++++++++++++----------------------
>  3 files changed, 143 insertions(+), 53 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3a9b7a1..3fc5052 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1233,13 +1233,64 @@ xfs_vm_releasepage(
>  	return try_to_free_buffers(page);
>  }
>  
> +/*
> + * For DAX we need a mapping buffer callback for unwritten extent conversion
> + * when page faults allocation blocks and then zero them.

s/allocation/allocate/

> + */
> +#ifdef CONFIG_FS_DAX
> +static struct xfs_ioend *
> +xfs_dax_alloc_ioend(
> +	struct inode	*inode,
> +	xfs_off_t	offset,
> +	ssize_t		size)
> +{
> +	struct xfs_ioend *ioend;
> +
> +	ASSERT(IS_DAX(inode));
> +	ioend = xfs_alloc_ioend(inode, XFS_IO_UNWRITTEN);
> +	ioend->io_offset = offset;
> +	ioend->io_size = size;
> +	return ioend;
> +}
> +
> +void
> +xfs_get_blocks_dax_complete(
> +	struct buffer_head	*bh,
> +	int			uptodate)
> +{
> +	struct xfs_ioend	*ioend = bh->b_private;
> +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> +	int			error;
> +
> +	ASSERT(IS_DAX(ioend->io_inode));
> +
> +	/* if there was an error zeroing, then don't convert it */
> +	if (!uptodate)
> +		goto out_free;
> +

Hmm, the error handling seems a bit off here. I'm new to the mmap paths
so I could easily be missing something. Anyways, this uptodate val is
hardcoded to 1 down in __dax_mkwrite(). This function is only called on
!error, however, which seems to make this error handling superfluous. If
I am following that correctly, who is going to free the ioend if an
error does occur?

Brian

> +	error = xfs_iomap_write_unwritten(ip, ioend->io_offset, ioend->io_size);
> +	if (error)
> +		xfs_warn(ip->i_mount,
> +"%s: conversion failed, ino 0x%llx, offset 0x%llx, len 0x%lx, error %d\n",
> +			__func__, ip->i_ino, ioend->io_offset,
> +			ioend->io_size, error);
> +out_free:
> +	mempool_free(ioend, xfs_ioend_pool);
> +
> +}
> +#else
> +#define xfs_dax_alloc_ioend(i,o,s)	NULL
> +void xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate) { }
> +#endif
> +
>  STATIC int
>  __xfs_get_blocks(
>  	struct inode		*inode,
>  	sector_t		iblock,
>  	struct buffer_head	*bh_result,
>  	int			create,
> -	int			direct)
> +	bool			direct,
> +	bool			clear)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1304,6 +1355,7 @@ __xfs_get_blocks(
>  			if (error)
>  				return error;
>  			new = 1;
> +
>  		} else {
>  			/*
>  			 * Delalloc reservations do not require a transaction,
> @@ -1340,7 +1392,10 @@ __xfs_get_blocks(
>  		if (create || !ISUNWRITTEN(&imap))
>  			xfs_map_buffer(inode, bh_result, &imap, offset);
>  		if (create && ISUNWRITTEN(&imap)) {
> -			if (direct) {
> +			if (clear) {
> +				bh_result->b_private = xfs_dax_alloc_ioend(
> +							inode, offset, size);
> +			} else if (direct) {
>  				bh_result->b_private = inode;
>  				set_buffer_defer_completion(bh_result);
>  			}
> @@ -1425,7 +1480,7 @@ xfs_get_blocks(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 0);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
>  }
>  
>  STATIC int
> @@ -1435,7 +1490,17 @@ xfs_get_blocks_direct(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 1);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
> +}
> +
> +int
> +xfs_get_blocks_dax(
> +	struct inode		*inode,
> +	sector_t		iblock,
> +	struct buffer_head	*bh_result,
> +	int			create)
> +{
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index ac644e0..7c6fb3f 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -53,7 +53,12 @@ typedef struct xfs_ioend {
>  } xfs_ioend_t;
>  
>  extern const struct address_space_operations xfs_address_space_operations;
> -extern int xfs_get_blocks(struct inode *, sector_t, struct buffer_head *, int);
> +
> +int	xfs_get_blocks(struct inode *inode, sector_t offset,
> +		       struct buffer_head *map_bh, int create);
> +int	xfs_get_blocks_dax(struct inode *inode, sector_t offset,
> +			   struct buffer_head *map_bh, int create);
> +void	xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate);
>  
>  extern void xfs_count_page_state(struct page *, int *, int *);
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 94713c2..8017175 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -385,7 +385,11 @@ xfs_file_splice_read(
>  
>  	trace_xfs_file_splice_read(ip, count, *ppos, ioflags);
>  
> -	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
> +	/* for dax, we need to avoid the page cache */
> +	if (IS_DAX(VFS_I(ip)))
> +		ret = default_file_splice_read(infilp, ppos, pipe, count, flags);
> +	else
> +		ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
>  	if (ret > 0)
>  		XFS_STATS_ADD(xs_read_bytes, ret);
>  
> @@ -654,7 +658,7 @@ xfs_file_dio_aio_write(
>  					mp->m_rtdev_targp : mp->m_ddev_targp;
>  
>  	/* DIO must be aligned to device logical sector size */
> -	if ((pos | count) & target->bt_logical_sectormask)
> +	if (!IS_DAX(inode) && (pos | count) & target->bt_logical_sectormask)
>  		return -EINVAL;
>  
>  	/* "unaligned" here means not aligned to a filesystem block */
> @@ -724,8 +728,11 @@ xfs_file_dio_aio_write(
>  out:
>  	xfs_rw_iunlock(ip, iolock);
>  
> -	/* No fallback to buffered IO on errors for XFS. */
> -	ASSERT(ret < 0 || ret == count);
> +	/*
> +	 * No fallback to buffered IO on errors for XFS. DAX can result in
> +	 * partial writes, but direct IO will either complete fully or fail.
> +	 */
> +	ASSERT(ret < 0 || ret == count || IS_DAX(VFS_I(ip)));
>  	return ret;
>  }
>  
> @@ -810,7 +817,7 @@ xfs_file_write_iter(
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
>  		return -EIO;
>  
> -	if (unlikely(file->f_flags & O_DIRECT))
> +	if ((file->f_flags & O_DIRECT) || IS_DAX(inode))
>  		ret = xfs_file_dio_aio_write(iocb, from);
>  	else
>  		ret = xfs_file_buffered_aio_write(iocb, from);
> @@ -1031,17 +1038,6 @@ xfs_file_readdir(
>  	return xfs_readdir(ip, ctx, bufsize);
>  }
>  
> -STATIC int
> -xfs_file_mmap(
> -	struct file	*filp,
> -	struct vm_area_struct *vma)
> -{
> -	vma->vm_ops = &xfs_file_vm_ops;
> -
> -	file_accessed(filp);
> -	return 0;
> -}
> -
>  /*
>   * This type is designed to indicate the type of offset we would like
>   * to search from page cache for xfs_seek_hole_data().
> @@ -1422,26 +1418,11 @@ xfs_file_llseek(
>   * ordering of:
>   *
>   * mmap_sem (MM)
> - *   i_mmap_lock (XFS - truncate serialisation)
> - *     page_lock (MM)
> - *       i_lock (XFS - extent map serialisation)
> + *   sb_start_pagefault(vfs, freeze)
> + *     i_mmap_lock (XFS - truncate serialisation)
> + *       page_lock (MM)
> + *         i_lock (XFS - extent map serialisation)
>   */
> -STATIC int
> -xfs_filemap_fault(
> -	struct vm_area_struct	*vma,
> -	struct vm_fault		*vmf)
> -{
> -	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> -	int			error;
> -
> -	trace_xfs_filemap_fault(ip);
> -
> -	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> -	error = filemap_fault(vma, vmf);
> -	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> -
> -	return error;
> -}
>  
>  /*
>   * mmap()d file has taken write protection fault and is being made writable. We
> @@ -1454,21 +1435,66 @@ xfs_filemap_page_mkwrite(
>  	struct vm_area_struct	*vma,
>  	struct vm_fault		*vmf)
>  {
> -	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> +	struct inode		*inode = file_inode(vma->vm_file);
>  	int			ret;
>  
> -	trace_xfs_filemap_page_mkwrite(ip);
> +	trace_xfs_filemap_page_mkwrite(XFS_I(inode));
>  
> -	sb_start_pagefault(VFS_I(ip)->i_sb);
> +	sb_start_pagefault(inode->i_sb);
>  	file_update_time(vma->vm_file);
> -	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> +	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +
> +	if (IS_DAX(inode)) {
> +		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax,
> +				    xfs_get_blocks_dax_complete);
> +	} else {
> +		ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +		ret = block_page_mkwrite_return(ret);
> +	}
> +
> +	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +	sb_end_pagefault(inode->i_sb);
> +
> +	return ret;
> +}
> +
> +STATIC int
> +xfs_filemap_fault(
> +	struct vm_area_struct	*vma,
> +	struct vm_fault		*vmf)
> +{
> +	struct xfs_inode	*ip = XFS_I(file_inode(vma->vm_file));
> +	int			ret;
>  
> -	ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +	trace_xfs_filemap_fault(ip);
> +
> +	/* DAX can shortcut the normal fault path on write faults! */
> +	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(VFS_I(ip)))
> +		return xfs_filemap_page_mkwrite(vma, vmf);
>  
> +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> +	ret = filemap_fault(vma, vmf);
>  	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> -	sb_end_pagefault(VFS_I(ip)->i_sb);
>  
> -	return block_page_mkwrite_return(ret);
> +	return ret;
> +}
> +
> +static const struct vm_operations_struct xfs_file_vm_ops = {
> +	.fault		= xfs_filemap_fault,
> +	.map_pages	= filemap_map_pages,
> +	.page_mkwrite	= xfs_filemap_page_mkwrite,
> +};
> +
> +STATIC int
> +xfs_file_mmap(
> +	struct file	*filp,
> +	struct vm_area_struct *vma)
> +{
> +	file_accessed(filp);
> +	vma->vm_ops = &xfs_file_vm_ops;
> +	if (IS_DAX(file_inode(filp)))
> +		vma->vm_flags |= VM_MIXEDMAP;
> +	return 0;
>  }
>  
>  const struct file_operations xfs_file_operations = {
> @@ -1501,9 +1527,3 @@ const struct file_operations xfs_dir_file_operations = {
>  #endif
>  	.fsync		= xfs_dir_fsync,
>  };
> -
> -static const struct vm_operations_struct xfs_file_vm_ops = {
> -	.fault		= xfs_filemap_fault,
> -	.map_pages	= filemap_map_pages,
> -	.page_mkwrite	= xfs_filemap_page_mkwrite,
> -};
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
@ 2015-04-06 17:49     ` Brian Foster
  0 siblings, 0 replies; 57+ messages in thread
From: Brian Foster @ 2015-04-06 17:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, willy, jack, xfs

On Tue, Mar 24, 2015 at 09:51:03PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add the initial support for DAX file operations to XFS. This
> includes the necessary block allocation and mmap page fault hooks
> for DAX to function.
> 
> Note: we specifically have to disable splice_read/write from
> occurring because they are dependent on usingthe page cache for
> correct operation. We have no page cache for DAX, so we need to
> disable them completely on DAX inodes.
> 

Looks like Boaz already pointed out this required an update wrt to
splice...

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c |  73 ++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_aops.h |   7 +++-
>  fs/xfs/xfs_file.c | 116 ++++++++++++++++++++++++++++++++----------------------
>  3 files changed, 143 insertions(+), 53 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3a9b7a1..3fc5052 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1233,13 +1233,64 @@ xfs_vm_releasepage(
>  	return try_to_free_buffers(page);
>  }
>  
> +/*
> + * For DAX we need a mapping buffer callback for unwritten extent conversion
> + * when page faults allocation blocks and then zero them.

s/allocation/allocate/

> + */
> +#ifdef CONFIG_FS_DAX
> +static struct xfs_ioend *
> +xfs_dax_alloc_ioend(
> +	struct inode	*inode,
> +	xfs_off_t	offset,
> +	ssize_t		size)
> +{
> +	struct xfs_ioend *ioend;
> +
> +	ASSERT(IS_DAX(inode));
> +	ioend = xfs_alloc_ioend(inode, XFS_IO_UNWRITTEN);
> +	ioend->io_offset = offset;
> +	ioend->io_size = size;
> +	return ioend;
> +}
> +
> +void
> +xfs_get_blocks_dax_complete(
> +	struct buffer_head	*bh,
> +	int			uptodate)
> +{
> +	struct xfs_ioend	*ioend = bh->b_private;
> +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> +	int			error;
> +
> +	ASSERT(IS_DAX(ioend->io_inode));
> +
> +	/* if there was an error zeroing, then don't convert it */
> +	if (!uptodate)
> +		goto out_free;
> +

Hmm, the error handling seems a bit off here. I'm new to the mmap paths
so I could easily be missing something. Anyways, this uptodate val is
hardcoded to 1 down in __dax_mkwrite(). This function is only called on
!error, however, which seems to make this error handling superfluous. If
I am following that correctly, who is going to free the ioend if an
error does occur?

Brian

> +	error = xfs_iomap_write_unwritten(ip, ioend->io_offset, ioend->io_size);
> +	if (error)
> +		xfs_warn(ip->i_mount,
> +"%s: conversion failed, ino 0x%llx, offset 0x%llx, len 0x%lx, error %d\n",
> +			__func__, ip->i_ino, ioend->io_offset,
> +			ioend->io_size, error);
> +out_free:
> +	mempool_free(ioend, xfs_ioend_pool);
> +
> +}
> +#else
> +#define xfs_dax_alloc_ioend(i,o,s)	NULL
> +void xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate) { }
> +#endif
> +
>  STATIC int
>  __xfs_get_blocks(
>  	struct inode		*inode,
>  	sector_t		iblock,
>  	struct buffer_head	*bh_result,
>  	int			create,
> -	int			direct)
> +	bool			direct,
> +	bool			clear)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1304,6 +1355,7 @@ __xfs_get_blocks(
>  			if (error)
>  				return error;
>  			new = 1;
> +
>  		} else {
>  			/*
>  			 * Delalloc reservations do not require a transaction,
> @@ -1340,7 +1392,10 @@ __xfs_get_blocks(
>  		if (create || !ISUNWRITTEN(&imap))
>  			xfs_map_buffer(inode, bh_result, &imap, offset);
>  		if (create && ISUNWRITTEN(&imap)) {
> -			if (direct) {
> +			if (clear) {
> +				bh_result->b_private = xfs_dax_alloc_ioend(
> +							inode, offset, size);
> +			} else if (direct) {
>  				bh_result->b_private = inode;
>  				set_buffer_defer_completion(bh_result);
>  			}
> @@ -1425,7 +1480,7 @@ xfs_get_blocks(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 0);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
>  }
>  
>  STATIC int
> @@ -1435,7 +1490,17 @@ xfs_get_blocks_direct(
>  	struct buffer_head	*bh_result,
>  	int			create)
>  {
> -	return __xfs_get_blocks(inode, iblock, bh_result, create, 1);
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
> +}
> +
> +int
> +xfs_get_blocks_dax(
> +	struct inode		*inode,
> +	sector_t		iblock,
> +	struct buffer_head	*bh_result,
> +	int			create)
> +{
> +	return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index ac644e0..7c6fb3f 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -53,7 +53,12 @@ typedef struct xfs_ioend {
>  } xfs_ioend_t;
>  
>  extern const struct address_space_operations xfs_address_space_operations;
> -extern int xfs_get_blocks(struct inode *, sector_t, struct buffer_head *, int);
> +
> +int	xfs_get_blocks(struct inode *inode, sector_t offset,
> +		       struct buffer_head *map_bh, int create);
> +int	xfs_get_blocks_dax(struct inode *inode, sector_t offset,
> +			   struct buffer_head *map_bh, int create);
> +void	xfs_get_blocks_dax_complete(struct buffer_head *bh, int uptodate);
>  
>  extern void xfs_count_page_state(struct page *, int *, int *);
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 94713c2..8017175 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -385,7 +385,11 @@ xfs_file_splice_read(
>  
>  	trace_xfs_file_splice_read(ip, count, *ppos, ioflags);
>  
> -	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
> +	/* for dax, we need to avoid the page cache */
> +	if (IS_DAX(VFS_I(ip)))
> +		ret = default_file_splice_read(infilp, ppos, pipe, count, flags);
> +	else
> +		ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
>  	if (ret > 0)
>  		XFS_STATS_ADD(xs_read_bytes, ret);
>  
> @@ -654,7 +658,7 @@ xfs_file_dio_aio_write(
>  					mp->m_rtdev_targp : mp->m_ddev_targp;
>  
>  	/* DIO must be aligned to device logical sector size */
> -	if ((pos | count) & target->bt_logical_sectormask)
> +	if (!IS_DAX(inode) && (pos | count) & target->bt_logical_sectormask)
>  		return -EINVAL;
>  
>  	/* "unaligned" here means not aligned to a filesystem block */
> @@ -724,8 +728,11 @@ xfs_file_dio_aio_write(
>  out:
>  	xfs_rw_iunlock(ip, iolock);
>  
> -	/* No fallback to buffered IO on errors for XFS. */
> -	ASSERT(ret < 0 || ret == count);
> +	/*
> +	 * No fallback to buffered IO on errors for XFS. DAX can result in
> +	 * partial writes, but direct IO will either complete fully or fail.
> +	 */
> +	ASSERT(ret < 0 || ret == count || IS_DAX(VFS_I(ip)));
>  	return ret;
>  }
>  
> @@ -810,7 +817,7 @@ xfs_file_write_iter(
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
>  		return -EIO;
>  
> -	if (unlikely(file->f_flags & O_DIRECT))
> +	if ((file->f_flags & O_DIRECT) || IS_DAX(inode))
>  		ret = xfs_file_dio_aio_write(iocb, from);
>  	else
>  		ret = xfs_file_buffered_aio_write(iocb, from);
> @@ -1031,17 +1038,6 @@ xfs_file_readdir(
>  	return xfs_readdir(ip, ctx, bufsize);
>  }
>  
> -STATIC int
> -xfs_file_mmap(
> -	struct file	*filp,
> -	struct vm_area_struct *vma)
> -{
> -	vma->vm_ops = &xfs_file_vm_ops;
> -
> -	file_accessed(filp);
> -	return 0;
> -}
> -
>  /*
>   * This type is designed to indicate the type of offset we would like
>   * to search from page cache for xfs_seek_hole_data().
> @@ -1422,26 +1418,11 @@ xfs_file_llseek(
>   * ordering of:
>   *
>   * mmap_sem (MM)
> - *   i_mmap_lock (XFS - truncate serialisation)
> - *     page_lock (MM)
> - *       i_lock (XFS - extent map serialisation)
> + *   sb_start_pagefault(vfs, freeze)
> + *     i_mmap_lock (XFS - truncate serialisation)
> + *       page_lock (MM)
> + *         i_lock (XFS - extent map serialisation)
>   */
> -STATIC int
> -xfs_filemap_fault(
> -	struct vm_area_struct	*vma,
> -	struct vm_fault		*vmf)
> -{
> -	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> -	int			error;
> -
> -	trace_xfs_filemap_fault(ip);
> -
> -	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> -	error = filemap_fault(vma, vmf);
> -	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> -
> -	return error;
> -}
>  
>  /*
>   * mmap()d file has taken write protection fault and is being made writable. We
> @@ -1454,21 +1435,66 @@ xfs_filemap_page_mkwrite(
>  	struct vm_area_struct	*vma,
>  	struct vm_fault		*vmf)
>  {
> -	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
> +	struct inode		*inode = file_inode(vma->vm_file);
>  	int			ret;
>  
> -	trace_xfs_filemap_page_mkwrite(ip);
> +	trace_xfs_filemap_page_mkwrite(XFS_I(inode));
>  
> -	sb_start_pagefault(VFS_I(ip)->i_sb);
> +	sb_start_pagefault(inode->i_sb);
>  	file_update_time(vma->vm_file);
> -	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> +	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +
> +	if (IS_DAX(inode)) {
> +		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax,
> +				    xfs_get_blocks_dax_complete);
> +	} else {
> +		ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +		ret = block_page_mkwrite_return(ret);
> +	}
> +
> +	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +	sb_end_pagefault(inode->i_sb);
> +
> +	return ret;
> +}
> +
> +STATIC int
> +xfs_filemap_fault(
> +	struct vm_area_struct	*vma,
> +	struct vm_fault		*vmf)
> +{
> +	struct xfs_inode	*ip = XFS_I(file_inode(vma->vm_file));
> +	int			ret;
>  
> -	ret = __block_page_mkwrite(vma, vmf, xfs_get_blocks);
> +	trace_xfs_filemap_fault(ip);
> +
> +	/* DAX can shortcut the normal fault path on write faults! */
> +	if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(VFS_I(ip)))
> +		return xfs_filemap_page_mkwrite(vma, vmf);
>  
> +	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
> +	ret = filemap_fault(vma, vmf);
>  	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
> -	sb_end_pagefault(VFS_I(ip)->i_sb);
>  
> -	return block_page_mkwrite_return(ret);
> +	return ret;
> +}
> +
> +static const struct vm_operations_struct xfs_file_vm_ops = {
> +	.fault		= xfs_filemap_fault,
> +	.map_pages	= filemap_map_pages,
> +	.page_mkwrite	= xfs_filemap_page_mkwrite,
> +};
> +
> +STATIC int
> +xfs_file_mmap(
> +	struct file	*filp,
> +	struct vm_area_struct *vma)
> +{
> +	file_accessed(filp);
> +	vma->vm_ops = &xfs_file_vm_ops;
> +	if (IS_DAX(file_inode(filp)))
> +		vma->vm_flags |= VM_MIXEDMAP;
> +	return 0;
>  }
>  
>  const struct file_operations xfs_file_operations = {
> @@ -1501,9 +1527,3 @@ const struct file_operations xfs_dir_file_operations = {
>  #endif
>  	.fsync		= xfs_dir_fsync,
>  };
> -
> -static const struct vm_operations_struct xfs_file_vm_ops = {
> -	.fault		= xfs_filemap_fault,
> -	.map_pages	= filemap_map_pages,
> -	.page_mkwrite	= xfs_filemap_page_mkwrite,
> -};
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/8] xfs: add DAX truncate support
  2015-03-24 10:51   ` Dave Chinner
@ 2015-04-06 17:49     ` Brian Foster
  -1 siblings, 0 replies; 57+ messages in thread
From: Brian Foster @ 2015-04-06 17:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, willy, jack

On Tue, Mar 24, 2015 at 09:51:04PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we truncate a DAX file, we need to call through the DAX page
> truncation path rather than through block_truncate_page() so that
> mappings and block zeroing are all handled correctly. Otherwise,
> truncate does not need to change.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_iops.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 8b9e688..9ca5352 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -851,7 +851,11 @@ xfs_setattr_size(
>  	 * to hope that the caller sees ENOMEM and retries the truncate
>  	 * operation.
>  	 */
> -	error = block_truncate_page(inode->i_mapping, newsize, xfs_get_blocks);
> +	if (IS_DAX(inode))
> +		error = dax_truncate_page(inode, newsize, xfs_get_blocks_dax);
> +	else
> +		error = block_truncate_page(inode->i_mapping, newsize,
> +					    xfs_get_blocks);
>  	if (error)
>  		return error;
>  	truncate_setsize(inode, newsize);
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 6/8] xfs: add DAX truncate support
@ 2015-04-06 17:49     ` Brian Foster
  0 siblings, 0 replies; 57+ messages in thread
From: Brian Foster @ 2015-04-06 17:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, willy, jack, xfs

On Tue, Mar 24, 2015 at 09:51:04PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we truncate a DAX file, we need to call through the DAX page
> truncation path rather than through block_truncate_page() so that
> mappings and block zeroing are all handled correctly. Otherwise,
> truncate does not need to change.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_iops.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 8b9e688..9ca5352 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -851,7 +851,11 @@ xfs_setattr_size(
>  	 * to hope that the caller sees ENOMEM and retries the truncate
>  	 * operation.
>  	 */
> -	error = block_truncate_page(inode->i_mapping, newsize, xfs_get_blocks);
> +	if (IS_DAX(inode))
> +		error = dax_truncate_page(inode, newsize, xfs_get_blocks_dax);
> +	else
> +		error = block_truncate_page(inode->i_mapping, newsize,
> +					    xfs_get_blocks);
>  	if (error)
>  		return error;
>  	truncate_setsize(inode, newsize);
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 7/8] xfs: add DAX IO path support
  2015-03-24 10:51   ` Dave Chinner
@ 2015-04-06 17:49     ` Brian Foster
  -1 siblings, 0 replies; 57+ messages in thread
From: Brian Foster @ 2015-04-06 17:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, willy, jack

On Tue, Mar 24, 2015 at 09:51:05PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> DAX does not do buffered IO (can't buffer direct access!) and hence
> all read/write IO is vectored through the direct IO path.  Hence we
> need to add the DAX IO path callouts to the direct IO
> infrastructure.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 35 +++++++++++++++++++++++++++--------
>  1 file changed, 27 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3fc5052..97979e9 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1559,6 +1559,30 @@ xfs_end_io_direct_write(
>  	}
>  }
>  
> +static inline ssize_t
> +xfs_vm_do_dio(
> +	struct inode		*inode,
> +	int			rw,
> +	struct kiocb		*iocb,
> +	struct iov_iter		*iter,
> +	loff_t			offset,
> +	void			(*endio)(struct kiocb	*iocb,
> +					 loff_t		offset,
> +					 ssize_t	size,
> +					 void		*private),
> +	int			flags)
> +{
> +	struct block_device	*bdev;
> +
> +	if (IS_DAX(inode))
> +		return dax_do_io(rw, iocb, inode, iter, offset,
> +				 xfs_get_blocks_direct, endio, 0);
> +

I assume this is supposed to be get_blocks_direct and not
get_blocks_dax, based on the I/O codepath. The naming is starting to get
a little confusing though. xfs_get_blocks_dax() implies to me that it's
for any DAX I/O, but we only appear to use it internally for
truncate/zeroing/mmap and such. Alas, I can't think of a better name atm
and the code seems Ok to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

... but a comment somewhere around here and/or at the
xfs_get_blocks_dax() function would be helpful.

Brian

> +	bdev = xfs_find_bdev_for_inode(inode);
> +	return  __blockdev_direct_IO(rw, iocb, inode, bdev, iter, offset,
> +				     xfs_get_blocks_direct, endio, NULL, flags);
> +}
> +
>  STATIC ssize_t
>  xfs_vm_direct_IO(
>  	int			rw,
> @@ -1567,17 +1591,12 @@ xfs_vm_direct_IO(
>  	loff_t			offset)
>  {
>  	struct inode		*inode = iocb->ki_filp->f_mapping->host;
> -	struct block_device	*bdev = xfs_find_bdev_for_inode(inode);
>  
>  	if (rw & WRITE) {
> -		return __blockdev_direct_IO(rw, iocb, inode, bdev, iter,
> -					    offset, xfs_get_blocks_direct,
> -					    xfs_end_io_direct_write, NULL,
> -					    DIO_ASYNC_EXTEND);
> +		return xfs_vm_do_dio(inode, rw, iocb, iter, offset,
> +				     xfs_end_io_direct_write, DIO_ASYNC_EXTEND);
>  	}
> -	return __blockdev_direct_IO(rw, iocb, inode, bdev, iter,
> -				    offset, xfs_get_blocks_direct,
> -				    NULL, NULL, 0);
> +	return xfs_vm_do_dio(inode, rw, iocb, iter, offset, NULL, 0);
>  }
>  
>  /*
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 7/8] xfs: add DAX IO path support
@ 2015-04-06 17:49     ` Brian Foster
  0 siblings, 0 replies; 57+ messages in thread
From: Brian Foster @ 2015-04-06 17:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, willy, jack, xfs

On Tue, Mar 24, 2015 at 09:51:05PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> DAX does not do buffered IO (can't buffer direct access!) and hence
> all read/write IO is vectored through the direct IO path.  Hence we
> need to add the DAX IO path callouts to the direct IO
> infrastructure.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 35 +++++++++++++++++++++++++++--------
>  1 file changed, 27 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3fc5052..97979e9 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1559,6 +1559,30 @@ xfs_end_io_direct_write(
>  	}
>  }
>  
> +static inline ssize_t
> +xfs_vm_do_dio(
> +	struct inode		*inode,
> +	int			rw,
> +	struct kiocb		*iocb,
> +	struct iov_iter		*iter,
> +	loff_t			offset,
> +	void			(*endio)(struct kiocb	*iocb,
> +					 loff_t		offset,
> +					 ssize_t	size,
> +					 void		*private),
> +	int			flags)
> +{
> +	struct block_device	*bdev;
> +
> +	if (IS_DAX(inode))
> +		return dax_do_io(rw, iocb, inode, iter, offset,
> +				 xfs_get_blocks_direct, endio, 0);
> +

I assume this is supposed to be get_blocks_direct and not
get_blocks_dax, based on the I/O codepath. The naming is starting to get
a little confusing though. xfs_get_blocks_dax() implies to me that it's
for any DAX I/O, but we only appear to use it internally for
truncate/zeroing/mmap and such. Alas, I can't think of a better name atm
and the code seems Ok to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

... but a comment somewhere around here and/or at the
xfs_get_blocks_dax() function would be helpful.

Brian

> +	bdev = xfs_find_bdev_for_inode(inode);
> +	return  __blockdev_direct_IO(rw, iocb, inode, bdev, iter, offset,
> +				     xfs_get_blocks_direct, endio, NULL, flags);
> +}
> +
>  STATIC ssize_t
>  xfs_vm_direct_IO(
>  	int			rw,
> @@ -1567,17 +1591,12 @@ xfs_vm_direct_IO(
>  	loff_t			offset)
>  {
>  	struct inode		*inode = iocb->ki_filp->f_mapping->host;
> -	struct block_device	*bdev = xfs_find_bdev_for_inode(inode);
>  
>  	if (rw & WRITE) {
> -		return __blockdev_direct_IO(rw, iocb, inode, bdev, iter,
> -					    offset, xfs_get_blocks_direct,
> -					    xfs_end_io_direct_write, NULL,
> -					    DIO_ASYNC_EXTEND);
> +		return xfs_vm_do_dio(inode, rw, iocb, iter, offset,
> +				     xfs_end_io_direct_write, DIO_ASYNC_EXTEND);
>  	}
> -	return __blockdev_direct_IO(rw, iocb, inode, bdev, iter,
> -				    offset, xfs_get_blocks_direct,
> -				    NULL, NULL, 0);
> +	return xfs_vm_do_dio(inode, rw, iocb, iter, offset, NULL, 0);
>  }
>  
>  /*
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/8] xfs: add initial DAX support
  2015-03-24 10:51   ` Dave Chinner
@ 2015-04-06 19:00     ` Brian Foster
  -1 siblings, 0 replies; 57+ messages in thread
From: Brian Foster @ 2015-04-06 19:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, willy, jack

On Tue, Mar 24, 2015 at 09:51:06PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add initial DAX support to XFS. To do this we need a new mount
> option to turn DAX on filesystem, and we need to propagate thi into
> the inode flags whenever an inode is instantiated so that the
> per-inode checks throughout the code Do The Right Thing.
> 
> There are still some things remaining to be done:
> 
> 	- needs per-inode flags to mark inodes as DAX enabled, and
> 	  an inheritance flag to enable automatic filesystem
> 	  propagation of the property
> 	- fails occasionally with zero length writes instead of
> 	  ENOSPC errors, so error propagation inside/from the DAX
> 	  code need work
> 	- occasionally creates two extents rather than a single
> 	  larger extent like non-dax filesystems.
> 	- much more testing
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iops.c  | 24 ++++++++++++------------
>  fs/xfs/xfs_mount.h |  2 ++
>  fs/xfs/xfs_super.c | 25 +++++++++++++++++++++++--
>  3 files changed, 37 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 9ca5352..695d857 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1195,22 +1195,22 @@ xfs_diflags_to_iflags(
>  	struct inode		*inode,
>  	struct xfs_inode	*ip)
>  {
> -	if (ip->i_d.di_flags & XFS_DIFLAG_IMMUTABLE)
> +	uint16_t		flags = ip->i_d.di_flags;
> +
> +	inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
> +			    S_NOATIME | S_DAX);
> +
> +	if (flags & XFS_DIFLAG_IMMUTABLE)
>  		inode->i_flags |= S_IMMUTABLE;
> -	else
> -		inode->i_flags &= ~S_IMMUTABLE;
> -	if (ip->i_d.di_flags & XFS_DIFLAG_APPEND)
> +	if (flags & XFS_DIFLAG_APPEND)
>  		inode->i_flags |= S_APPEND;
> -	else
> -		inode->i_flags &= ~S_APPEND;
> -	if (ip->i_d.di_flags & XFS_DIFLAG_SYNC)
> +	if (flags & XFS_DIFLAG_SYNC)
>  		inode->i_flags |= S_SYNC;
> -	else
> -		inode->i_flags &= ~S_SYNC;
> -	if (ip->i_d.di_flags & XFS_DIFLAG_NOATIME)
> +	if (flags & XFS_DIFLAG_NOATIME)
>  		inode->i_flags |= S_NOATIME;
> -	else
> -		inode->i_flags &= ~S_NOATIME;
> +	/* XXX: Also needs an on-disk per inode flag! */
> +	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
> +		inode->i_flags |= S_DAX;

This is a temporary hack until we have some kind of inode flag
inheritance mechanism as mentioned in the commit log, correct?

>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 8c995a2..cd44e88 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -179,6 +179,8 @@ typedef struct xfs_mount {
>  						   allocator */
>  #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
>  
> +#define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
> +
>  
>  /*
>   * Default minimum read and write sizes.
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 3ad0b17..0f26d7a 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -112,6 +112,8 @@ static struct xfs_kobj xfs_dbg_kobj;	/* global debug sysfs attrs */
>  #define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
>  #define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
>  
> +#define MNTOPT_DAX	"dax"		/* Enable direct access to bdev pages */
> +
>  /*
>   * Table driven mount option parser.
>   *
> @@ -363,6 +365,10 @@ xfs_parseargs(
>  			mp->m_flags |= XFS_MOUNT_DISCARD;
>  		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
>  			mp->m_flags &= ~XFS_MOUNT_DISCARD;
> +#ifdef CONFIG_FS_DAX
> +		} else if (!strcmp(this_char, MNTOPT_DAX)) {
> +			mp->m_flags |= XFS_MOUNT_DAX;
> +#endif

Something like what we do for !CONFIG_XFS_QUOTA just a few lines below
might be a bit nicer. Then we can have a slightly more useful error
message.

>  		} else {
>  			xfs_warn(mp, "unknown mount option [%s].", this_char);
>  			return -EINVAL;
> @@ -452,8 +458,8 @@ done:
>  }
>  
>  struct proc_xfs_info {
> -	int	flag;
> -	char	*str;
> +	uint64_t	flag;
> +	char		*str;
>  };
>  
>  STATIC int
> @@ -474,6 +480,7 @@ xfs_showargs(
>  		{ XFS_MOUNT_GRPID,		"," MNTOPT_GRPID },
>  		{ XFS_MOUNT_DISCARD,		"," MNTOPT_DISCARD },
>  		{ XFS_MOUNT_SMALL_INUMS,	"," MNTOPT_32BITINODE },
> +		{ XFS_MOUNT_DAX,		"," MNTOPT_DAX },
>  		{ 0, NULL }
>  	};
>  	static struct proc_xfs_info xfs_info_unset[] = {
> @@ -1501,6 +1508,20 @@ xfs_fs_fill_super(
>  	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
>  		sb->s_flags |= MS_I_VERSION;
>  
> +	if (mp->m_flags & XFS_MOUNT_DAX) {
> +		xfs_warn(mp,
> +	"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
> +		if (sb->s_blocksize != PAGE_SIZE) {
> +			xfs_alert(mp,
> +		"Filesystem block size invalid for DAX Turning DAX off.");
> +			mp->m_flags &= ~XFS_MOUNT_DAX;
> +		} else if (!sb->s_bdev->bd_disk->fops->direct_access) {
> +			xfs_alert(mp,
> +		"Block device does not support DAX Turning DAX off.");
> +			mp->m_flags &= ~XFS_MOUNT_DAX;
> +		}

Missing period in both error messages above. ;)

Brian

> +	}
> +
>  	error = xfs_mountfs(mp);
>  	if (error)
>  		goto out_filestream_unmount;
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/8] xfs: add initial DAX support
@ 2015-04-06 19:00     ` Brian Foster
  0 siblings, 0 replies; 57+ messages in thread
From: Brian Foster @ 2015-04-06 19:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, willy, jack, xfs

On Tue, Mar 24, 2015 at 09:51:06PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add initial DAX support to XFS. To do this we need a new mount
> option to turn DAX on filesystem, and we need to propagate thi into
> the inode flags whenever an inode is instantiated so that the
> per-inode checks throughout the code Do The Right Thing.
> 
> There are still some things remaining to be done:
> 
> 	- needs per-inode flags to mark inodes as DAX enabled, and
> 	  an inheritance flag to enable automatic filesystem
> 	  propagation of the property
> 	- fails occasionally with zero length writes instead of
> 	  ENOSPC errors, so error propagation inside/from the DAX
> 	  code need work
> 	- occasionally creates two extents rather than a single
> 	  larger extent like non-dax filesystems.
> 	- much more testing
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iops.c  | 24 ++++++++++++------------
>  fs/xfs/xfs_mount.h |  2 ++
>  fs/xfs/xfs_super.c | 25 +++++++++++++++++++++++--
>  3 files changed, 37 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 9ca5352..695d857 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1195,22 +1195,22 @@ xfs_diflags_to_iflags(
>  	struct inode		*inode,
>  	struct xfs_inode	*ip)
>  {
> -	if (ip->i_d.di_flags & XFS_DIFLAG_IMMUTABLE)
> +	uint16_t		flags = ip->i_d.di_flags;
> +
> +	inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
> +			    S_NOATIME | S_DAX);
> +
> +	if (flags & XFS_DIFLAG_IMMUTABLE)
>  		inode->i_flags |= S_IMMUTABLE;
> -	else
> -		inode->i_flags &= ~S_IMMUTABLE;
> -	if (ip->i_d.di_flags & XFS_DIFLAG_APPEND)
> +	if (flags & XFS_DIFLAG_APPEND)
>  		inode->i_flags |= S_APPEND;
> -	else
> -		inode->i_flags &= ~S_APPEND;
> -	if (ip->i_d.di_flags & XFS_DIFLAG_SYNC)
> +	if (flags & XFS_DIFLAG_SYNC)
>  		inode->i_flags |= S_SYNC;
> -	else
> -		inode->i_flags &= ~S_SYNC;
> -	if (ip->i_d.di_flags & XFS_DIFLAG_NOATIME)
> +	if (flags & XFS_DIFLAG_NOATIME)
>  		inode->i_flags |= S_NOATIME;
> -	else
> -		inode->i_flags &= ~S_NOATIME;
> +	/* XXX: Also needs an on-disk per inode flag! */
> +	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
> +		inode->i_flags |= S_DAX;

This is a temporary hack until we have some kind of inode flag
inheritance mechanism as mentioned in the commit log, correct?

>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 8c995a2..cd44e88 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -179,6 +179,8 @@ typedef struct xfs_mount {
>  						   allocator */
>  #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
>  
> +#define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
> +
>  
>  /*
>   * Default minimum read and write sizes.
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 3ad0b17..0f26d7a 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -112,6 +112,8 @@ static struct xfs_kobj xfs_dbg_kobj;	/* global debug sysfs attrs */
>  #define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
>  #define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
>  
> +#define MNTOPT_DAX	"dax"		/* Enable direct access to bdev pages */
> +
>  /*
>   * Table driven mount option parser.
>   *
> @@ -363,6 +365,10 @@ xfs_parseargs(
>  			mp->m_flags |= XFS_MOUNT_DISCARD;
>  		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
>  			mp->m_flags &= ~XFS_MOUNT_DISCARD;
> +#ifdef CONFIG_FS_DAX
> +		} else if (!strcmp(this_char, MNTOPT_DAX)) {
> +			mp->m_flags |= XFS_MOUNT_DAX;
> +#endif

Something like what we do for !CONFIG_XFS_QUOTA just a few lines below
might be a bit nicer. Then we can have a slightly more useful error
message.

>  		} else {
>  			xfs_warn(mp, "unknown mount option [%s].", this_char);
>  			return -EINVAL;
> @@ -452,8 +458,8 @@ done:
>  }
>  
>  struct proc_xfs_info {
> -	int	flag;
> -	char	*str;
> +	uint64_t	flag;
> +	char		*str;
>  };
>  
>  STATIC int
> @@ -474,6 +480,7 @@ xfs_showargs(
>  		{ XFS_MOUNT_GRPID,		"," MNTOPT_GRPID },
>  		{ XFS_MOUNT_DISCARD,		"," MNTOPT_DISCARD },
>  		{ XFS_MOUNT_SMALL_INUMS,	"," MNTOPT_32BITINODE },
> +		{ XFS_MOUNT_DAX,		"," MNTOPT_DAX },
>  		{ 0, NULL }
>  	};
>  	static struct proc_xfs_info xfs_info_unset[] = {
> @@ -1501,6 +1508,20 @@ xfs_fs_fill_super(
>  	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
>  		sb->s_flags |= MS_I_VERSION;
>  
> +	if (mp->m_flags & XFS_MOUNT_DAX) {
> +		xfs_warn(mp,
> +	"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
> +		if (sb->s_blocksize != PAGE_SIZE) {
> +			xfs_alert(mp,
> +		"Filesystem block size invalid for DAX Turning DAX off.");
> +			mp->m_flags &= ~XFS_MOUNT_DAX;
> +		} else if (!sb->s_bdev->bd_disk->fops->direct_access) {
> +			xfs_alert(mp,
> +		"Block device does not support DAX Turning DAX off.");
> +			mp->m_flags &= ~XFS_MOUNT_DAX;
> +		}

Missing period in both error messages above. ;)

Brian

> +	}
> +
>  	error = xfs_mountfs(mp);
>  	if (error)
>  		goto out_filestream_unmount;
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
  2015-04-06 17:49     ` Brian Foster
@ 2015-04-16  8:29       ` Dave Chinner
  -1 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-04-16  8:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs, linux-fsdevel, willy, jack

On Mon, Apr 06, 2015 at 01:49:00PM -0400, Brian Foster wrote:
> On Tue, Mar 24, 2015 at 09:51:03PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Add the initial support for DAX file operations to XFS. This
> > includes the necessary block allocation and mmap page fault hooks
> > for DAX to function.
> > 
> > Note: we specifically have to disable splice_read/write from
> > occurring because they are dependent on usingthe page cache for
> > correct operation. We have no page cache for DAX, so we need to
> > disable them completely on DAX inodes.
> > 
> 
> Looks like Boaz already pointed out this required an update wrt to
> splice...
> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c |  73 ++++++++++++++++++++++++++++++++--
> >  fs/xfs/xfs_aops.h |   7 +++-
> >  fs/xfs/xfs_file.c | 116 ++++++++++++++++++++++++++++++++----------------------
> >  3 files changed, 143 insertions(+), 53 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 3a9b7a1..3fc5052 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1233,13 +1233,64 @@ xfs_vm_releasepage(
> >  	return try_to_free_buffers(page);
> >  }
> >  
> > +/*
> > + * For DAX we need a mapping buffer callback for unwritten extent conversion
> > + * when page faults allocation blocks and then zero them.
> 
> s/allocation/allocate/
> 
> > + */
> > +#ifdef CONFIG_FS_DAX
> > +static struct xfs_ioend *
> > +xfs_dax_alloc_ioend(
> > +	struct inode	*inode,
> > +	xfs_off_t	offset,
> > +	ssize_t		size)
> > +{
> > +	struct xfs_ioend *ioend;
> > +
> > +	ASSERT(IS_DAX(inode));
> > +	ioend = xfs_alloc_ioend(inode, XFS_IO_UNWRITTEN);
> > +	ioend->io_offset = offset;
> > +	ioend->io_size = size;
> > +	return ioend;
> > +}
> > +
> > +void
> > +xfs_get_blocks_dax_complete(
> > +	struct buffer_head	*bh,
> > +	int			uptodate)
> > +{
> > +	struct xfs_ioend	*ioend = bh->b_private;
> > +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> > +	int			error;
> > +
> > +	ASSERT(IS_DAX(ioend->io_inode));
> > +
> > +	/* if there was an error zeroing, then don't convert it */
> > +	if (!uptodate)
> > +		goto out_free;
> > +
> 
> Hmm, the error handling seems a bit off here. I'm new to the mmap paths
> so I could easily be missing something. Anyways, this uptodate val is
> hardcoded to 1 down in __dax_mkwrite(). This function is only called on
> !error, however, which seems to make this error handling superfluous. If
> I am following that correctly, who is going to free the ioend if an
> error does occur?

Right, the dax code needs fixing to unconditionally call the end IO
callback. I'll add that patch into the start of the series.

As it is, this patch needs significant rework after the DIO
write completion path rework. It greatly simplifies this because we
now have an ioend being allocated in __xfs_get_blocks....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
@ 2015-04-16  8:29       ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-04-16  8:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, willy, jack, xfs

On Mon, Apr 06, 2015 at 01:49:00PM -0400, Brian Foster wrote:
> On Tue, Mar 24, 2015 at 09:51:03PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Add the initial support for DAX file operations to XFS. This
> > includes the necessary block allocation and mmap page fault hooks
> > for DAX to function.
> > 
> > Note: we specifically have to disable splice_read/write from
> > occurring because they are dependent on usingthe page cache for
> > correct operation. We have no page cache for DAX, so we need to
> > disable them completely on DAX inodes.
> > 
> 
> Looks like Boaz already pointed out this required an update wrt to
> splice...
> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c |  73 ++++++++++++++++++++++++++++++++--
> >  fs/xfs/xfs_aops.h |   7 +++-
> >  fs/xfs/xfs_file.c | 116 ++++++++++++++++++++++++++++++++----------------------
> >  3 files changed, 143 insertions(+), 53 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 3a9b7a1..3fc5052 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1233,13 +1233,64 @@ xfs_vm_releasepage(
> >  	return try_to_free_buffers(page);
> >  }
> >  
> > +/*
> > + * For DAX we need a mapping buffer callback for unwritten extent conversion
> > + * when page faults allocation blocks and then zero them.
> 
> s/allocation/allocate/
> 
> > + */
> > +#ifdef CONFIG_FS_DAX
> > +static struct xfs_ioend *
> > +xfs_dax_alloc_ioend(
> > +	struct inode	*inode,
> > +	xfs_off_t	offset,
> > +	ssize_t		size)
> > +{
> > +	struct xfs_ioend *ioend;
> > +
> > +	ASSERT(IS_DAX(inode));
> > +	ioend = xfs_alloc_ioend(inode, XFS_IO_UNWRITTEN);
> > +	ioend->io_offset = offset;
> > +	ioend->io_size = size;
> > +	return ioend;
> > +}
> > +
> > +void
> > +xfs_get_blocks_dax_complete(
> > +	struct buffer_head	*bh,
> > +	int			uptodate)
> > +{
> > +	struct xfs_ioend	*ioend = bh->b_private;
> > +	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
> > +	int			error;
> > +
> > +	ASSERT(IS_DAX(ioend->io_inode));
> > +
> > +	/* if there was an error zeroing, then don't convert it */
> > +	if (!uptodate)
> > +		goto out_free;
> > +
> 
> Hmm, the error handling seems a bit off here. I'm new to the mmap paths
> so I could easily be missing something. Anyways, this uptodate val is
> hardcoded to 1 down in __dax_mkwrite(). This function is only called on
> !error, however, which seems to make this error handling superfluous. If
> I am following that correctly, who is going to free the ioend if an
> error does occur?

Right, the dax code needs fixing to unconditionally call the end IO
callback. I'll add that patch into the start of the series.

As it is, this patch needs significant rework after the DIO
write completion path rework. It greatly simplifies this because we
now have an ioend being allocated in __xfs_get_blocks....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 7/8] xfs: add DAX IO path support
  2015-04-06 17:49     ` Brian Foster
@ 2015-04-16  8:54       ` Dave Chinner
  -1 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-04-16  8:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs, linux-fsdevel, willy, jack

On Mon, Apr 06, 2015 at 01:49:14PM -0400, Brian Foster wrote:
> On Tue, Mar 24, 2015 at 09:51:05PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > DAX does not do buffered IO (can't buffer direct access!) and hence
> > all read/write IO is vectored through the direct IO path.  Hence we
> > need to add the DAX IO path callouts to the direct IO
> > infrastructure.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c | 35 +++++++++++++++++++++++++++--------
> >  1 file changed, 27 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 3fc5052..97979e9 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1559,6 +1559,30 @@ xfs_end_io_direct_write(
> >  	}
> >  }
> >  
> > +static inline ssize_t
> > +xfs_vm_do_dio(
> > +	struct inode		*inode,
> > +	int			rw,
> > +	struct kiocb		*iocb,
> > +	struct iov_iter		*iter,
> > +	loff_t			offset,
> > +	void			(*endio)(struct kiocb	*iocb,
> > +					 loff_t		offset,
> > +					 ssize_t	size,
> > +					 void		*private),
> > +	int			flags)
> > +{
> > +	struct block_device	*bdev;
> > +
> > +	if (IS_DAX(inode))
> > +		return dax_do_io(rw, iocb, inode, iter, offset,
> > +				 xfs_get_blocks_direct, endio, 0);
> > +
> 
> I assume this is supposed to be get_blocks_direct and not
> get_blocks_dax, based on the I/O codepath. The naming is starting to get
> a little confusing though. xfs_get_blocks_dax() implies to me that it's
> for any DAX I/O, but we only appear to use it internally for
> truncate/zeroing/mmap and such. Alas, I can't think of a better name atm
> and the code seems Ok to me:

xfs_get_blocks_dax is now gone as there is no difference between
the block mapping of DAX and direct IO....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 7/8] xfs: add DAX IO path support
@ 2015-04-16  8:54       ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-04-16  8:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, willy, jack, xfs

On Mon, Apr 06, 2015 at 01:49:14PM -0400, Brian Foster wrote:
> On Tue, Mar 24, 2015 at 09:51:05PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > DAX does not do buffered IO (can't buffer direct access!) and hence
> > all read/write IO is vectored through the direct IO path.  Hence we
> > need to add the DAX IO path callouts to the direct IO
> > infrastructure.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c | 35 +++++++++++++++++++++++++++--------
> >  1 file changed, 27 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 3fc5052..97979e9 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1559,6 +1559,30 @@ xfs_end_io_direct_write(
> >  	}
> >  }
> >  
> > +static inline ssize_t
> > +xfs_vm_do_dio(
> > +	struct inode		*inode,
> > +	int			rw,
> > +	struct kiocb		*iocb,
> > +	struct iov_iter		*iter,
> > +	loff_t			offset,
> > +	void			(*endio)(struct kiocb	*iocb,
> > +					 loff_t		offset,
> > +					 ssize_t	size,
> > +					 void		*private),
> > +	int			flags)
> > +{
> > +	struct block_device	*bdev;
> > +
> > +	if (IS_DAX(inode))
> > +		return dax_do_io(rw, iocb, inode, iter, offset,
> > +				 xfs_get_blocks_direct, endio, 0);
> > +
> 
> I assume this is supposed to be get_blocks_direct and not
> get_blocks_dax, based on the I/O codepath. The naming is starting to get
> a little confusing though. xfs_get_blocks_dax() implies to me that it's
> for any DAX I/O, but we only appear to use it internally for
> truncate/zeroing/mmap and such. Alas, I can't think of a better name atm
> and the code seems Ok to me:

xfs_get_blocks_dax is now gone as there is no difference between
the block mapping of DAX and direct IO....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
  2015-03-24 10:51   ` Dave Chinner
@ 2015-04-16  9:33     ` Boaz Harrosh
  -1 siblings, 0 replies; 57+ messages in thread
From: Boaz Harrosh @ 2015-04-16  9:33 UTC (permalink / raw)
  To: Dave Chinner, xfs; +Cc: linux-fsdevel, willy, jack

On 03/24/2015 12:51 PM, Dave Chinner wrote:
<>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 94713c2..8017175 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -385,7 +385,11 @@ xfs_file_splice_read(
>  
>  	trace_xfs_file_splice_read(ip, count, *ppos, ioflags);
>  
> -	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
> +	/* for dax, we need to avoid the page cache */
> +	if (IS_DAX(VFS_I(ip)))
> +		ret = default_file_splice_read(infilp, ppos, pipe, count, flags);
> +	else
> +		ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);

Dave hi

Linus has accepted this patch:
  [be64f884be] dax: unify ext2/4_{dax,}_file_operations

Which adds the same exact if(IS_DAX)) to generic_file_splice_read for use
by ext2/4. (It made things easier for both ext2/4)

But also this code is just fine of course

Thanks
Boaz

>  	if (ret > 0)
>  		XFS_STATS_ADD(xs_read_bytes, ret);
>  
<>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
@ 2015-04-16  9:33     ` Boaz Harrosh
  0 siblings, 0 replies; 57+ messages in thread
From: Boaz Harrosh @ 2015-04-16  9:33 UTC (permalink / raw)
  To: Dave Chinner, xfs; +Cc: linux-fsdevel, willy, jack

On 03/24/2015 12:51 PM, Dave Chinner wrote:
<>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 94713c2..8017175 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -385,7 +385,11 @@ xfs_file_splice_read(
>  
>  	trace_xfs_file_splice_read(ip, count, *ppos, ioflags);
>  
> -	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
> +	/* for dax, we need to avoid the page cache */
> +	if (IS_DAX(VFS_I(ip)))
> +		ret = default_file_splice_read(infilp, ppos, pipe, count, flags);
> +	else
> +		ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);

Dave hi

Linus has accepted this patch:
  [be64f884be] dax: unify ext2/4_{dax,}_file_operations

Which adds the same exact if(IS_DAX)) to generic_file_splice_read for use
by ext2/4. (It made things easier for both ext2/4)

But also this code is just fine of course

Thanks
Boaz

>  	if (ret > 0)
>  		XFS_STATS_ADD(xs_read_bytes, ret);
>  
<>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
  2015-04-16  9:33     ` Boaz Harrosh
@ 2015-04-16 11:47       ` Dave Chinner
  -1 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-04-16 11:47 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: xfs, linux-fsdevel, willy, jack

On Thu, Apr 16, 2015 at 12:33:26PM +0300, Boaz Harrosh wrote:
> On 03/24/2015 12:51 PM, Dave Chinner wrote:
> <>
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 94713c2..8017175 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -385,7 +385,11 @@ xfs_file_splice_read(
> >  
> >  	trace_xfs_file_splice_read(ip, count, *ppos, ioflags);
> >  
> > -	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
> > +	/* for dax, we need to avoid the page cache */
> > +	if (IS_DAX(VFS_I(ip)))
> > +		ret = default_file_splice_read(infilp, ppos, pipe, count, flags);
> > +	else
> > +		ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
> 
> Dave hi
> 
> Linus has accepted this patch:
>   [be64f884be] dax: unify ext2/4_{dax,}_file_operations
> 
> Which adds the same exact if(IS_DAX)) to generic_file_splice_read for use
> by ext2/4. (It made things easier for both ext2/4)
> 
> But also this code is just fine of course

I'll leave it like this for the moment - I've missed the 4.1 merge
window for this code so I'll fix it up when 4.1-rc1 is out and I can
merge it pretty much straight away for the 4.2 cycle....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xfs: add DAX file operations support
@ 2015-04-16 11:47       ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2015-04-16 11:47 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: linux-fsdevel, willy, jack, xfs

On Thu, Apr 16, 2015 at 12:33:26PM +0300, Boaz Harrosh wrote:
> On 03/24/2015 12:51 PM, Dave Chinner wrote:
> <>
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 94713c2..8017175 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -385,7 +385,11 @@ xfs_file_splice_read(
> >  
> >  	trace_xfs_file_splice_read(ip, count, *ppos, ioflags);
> >  
> > -	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
> > +	/* for dax, we need to avoid the page cache */
> > +	if (IS_DAX(VFS_I(ip)))
> > +		ret = default_file_splice_read(infilp, ppos, pipe, count, flags);
> > +	else
> > +		ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
> 
> Dave hi
> 
> Linus has accepted this patch:
>   [be64f884be] dax: unify ext2/4_{dax,}_file_operations
> 
> Which adds the same exact if(IS_DAX)) to generic_file_splice_read for use
> by ext2/4. (It made things easier for both ext2/4)
> 
> But also this code is just fine of course

I'll leave it like this for the moment - I've missed the 4.1 merge
window for this code so I'll fix it up when 4.1-rc1 is out and I can
merge it pretty much straight away for the 4.2 cycle....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2015-04-16 11:48 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-24 10:50 [PATCH 0/8 v2] xfs: DAX support Dave Chinner
2015-03-24 10:50 ` Dave Chinner
2015-03-24 10:50 ` [PATCH 1/8] xfs: mmap lock needs to be inside freeze protection Dave Chinner
2015-04-01 14:34   ` Jan Kara
2015-04-01 14:34     ` Jan Kara
2015-04-06 17:48   ` Brian Foster
2015-04-06 17:48     ` Brian Foster
2015-03-24 10:51 ` [PATCH 2/8] dax: don't abuse get_block mapping for endio callbacks Dave Chinner
2015-03-24 10:51   ` Dave Chinner
2015-04-01 14:53   ` Jan Kara
2015-04-01 14:53     ` Jan Kara
2015-03-24 10:51 ` [PATCH 3/8] dax: expose __dax_fault for filesystems with locking constraints Dave Chinner
2015-03-24 10:51   ` Dave Chinner
2015-04-01 15:07   ` Jan Kara
2015-04-01 15:07     ` Jan Kara
2015-03-24 10:51 ` [PATCH 4/8] xfs: add DAX block zeroing support Dave Chinner
2015-03-24 10:51   ` Dave Chinner
2015-04-06 17:48   ` Brian Foster
2015-04-06 17:48     ` Brian Foster
2015-03-24 10:51 ` [PATCH 5/8] xfs: add DAX file operations support Dave Chinner
2015-03-24 10:51   ` Dave Chinner
2015-03-24 12:08   ` Boaz Harrosh
2015-03-24 12:08     ` Boaz Harrosh
2015-03-24 12:24     ` Boaz Harrosh
2015-03-24 12:24       ` Boaz Harrosh
2015-03-24 21:17     ` Dave Chinner
2015-03-24 21:17       ` Dave Chinner
2015-03-25  8:47       ` Boaz Harrosh
2015-03-25  8:47         ` Boaz Harrosh
2015-04-06 17:49   ` Brian Foster
2015-04-06 17:49     ` Brian Foster
2015-04-16  8:29     ` Dave Chinner
2015-04-16  8:29       ` Dave Chinner
2015-04-16  9:33   ` Boaz Harrosh
2015-04-16  9:33     ` Boaz Harrosh
2015-04-16 11:47     ` Dave Chinner
2015-04-16 11:47       ` Dave Chinner
2015-03-24 10:51 ` [PATCH 6/8] xfs: add DAX truncate support Dave Chinner
2015-03-24 10:51   ` Dave Chinner
2015-04-06 17:49   ` Brian Foster
2015-04-06 17:49     ` Brian Foster
2015-03-24 10:51 ` [PATCH 7/8] xfs: add DAX IO path support Dave Chinner
2015-03-24 10:51   ` Dave Chinner
2015-04-06 17:49   ` Brian Foster
2015-04-06 17:49     ` Brian Foster
2015-04-16  8:54     ` Dave Chinner
2015-04-16  8:54       ` Dave Chinner
2015-03-24 10:51 ` [PATCH 8/8] xfs: add initial DAX support Dave Chinner
2015-03-24 10:51   ` Dave Chinner
2015-03-24 12:52   ` Boaz Harrosh
2015-03-24 12:52     ` Boaz Harrosh
2015-03-24 21:25     ` Dave Chinner
2015-03-24 21:25       ` Dave Chinner
2015-03-25  9:14       ` Boaz Harrosh
2015-03-25  9:14         ` Boaz Harrosh
2015-04-06 19:00   ` Brian Foster
2015-04-06 19:00     ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.