Linux-BTRFS Archive on lore.kernel.org
 help / Atom feed
* [PATCH 00/10] btrfs: Support for DAX devices
@ 2018-12-05 12:28 Goldwyn Rodrigues
  2018-12-05 12:28 ` [PATCH 01/10] btrfs: create a mount option for dax Goldwyn Rodrigues
                   ` (13 more replies)
  0 siblings, 14 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2018-12-05 12:28 UTC (permalink / raw)
  To: linux-btrfs

This is a support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on dax as well.

Before I present this to the FS folks I wanted to run this through the
btrfs. Even though I wish, I cannot get it correct the first time
around :/.. Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved because btrfs kernel module does not get back control after
an mmap() writes. Any ideas are appreciated, or we would have to set
nodatasum when dax is enabled.

2. Currently, a user can continue writing on "old" extents of an mmaped file
after a snapshot has been created. How can we enforce writes to be directed
to new extents after snapshots have been created? Do we keep a list of
all mmap()s, and re-mmap them after a snapshot?

Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
command line parameter.


[PATCH 01/10] btrfs: create a mount option for dax
[PATCH 02/10] btrfs: basic dax read
[PATCH 03/10] btrfs: dax: read zeros from holes
[PATCH 04/10] Rename __endio_write_update_ordered() to
[PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
[PATCH 06/10] btrfs: dax write support
[PATCH 07/10] dax: export functions for use with btrfs
[PATCH 08/10] btrfs: dax add read mmap path
[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
[PATCH 10/10] btrfs: dax mmap write

 fs/btrfs/Makefile   |    1 
 fs/btrfs/ctree.h    |   17 ++
 fs/btrfs/dax.c      |  303 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/file.c     |   29 ++++
 fs/btrfs/inode.c    |   54 +++++----
 fs/btrfs/ioctl.c    |    5 
 fs/btrfs/super.c    |   15 ++
 fs/dax.c            |   35 ++++--
 include/linux/dax.h |   16 ++
 9 files changed, 430 insertions(+), 45 deletions(-)


-- 
Goldwyn


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 01/10] btrfs: create a mount option for dax
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
@ 2018-12-05 12:28 ` Goldwyn Rodrigues
  2018-12-05 12:42   ` Johannes Thumshirn
  2018-12-05 12:43   ` Nikolay Borisov
  2018-12-05 12:28 ` [PATCH 02/10] btrfs: basic dax read Goldwyn Rodrigues
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2018-12-05 12:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Also, set the inode->i_flags to S_DAX

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/ioctl.c |  5 ++++-
 fs/btrfs/super.c | 15 +++++++++++++++
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 68f322f600a0..5cc470fa6a40 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
+#define BTRFS_MOUNT_DAX			(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 802a628e9f7d..e9146c157816 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
 	if (binode->flags & BTRFS_INODE_DIRSYNC)
 		new_fl |= S_DIRSYNC;
 
+	if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && S_ISREG(inode->i_mode))
+		new_fl |= S_DAX;
+
 	set_mask_bits(&inode->i_flags,
-		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
+		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | S_DAX,
 		      new_fl);
 }
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 645fc81e2a94..035263b61cf5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -326,6 +326,7 @@ enum {
 	Opt_treelog, Opt_notreelog,
 	Opt_usebackuproot,
 	Opt_user_subvol_rm_allowed,
+	Opt_dax,
 
 	/* Deprecated options */
 	Opt_alloc_start,
@@ -393,6 +394,7 @@ static const match_table_t tokens = {
 	{Opt_notreelog, "notreelog"},
 	{Opt_usebackuproot, "usebackuproot"},
 	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+	{Opt_dax, "dax"},
 
 	/* Deprecated options */
 	{Opt_alloc_start, "alloc_start=%s"},
@@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		case Opt_user_subvol_rm_allowed:
 			btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
 			break;
+#ifdef CONFIG_FS_DAX
+		case Opt_dax:
+			if (btrfs_super_num_devices(info->super_copy) > 1) {
+				btrfs_info(info,
+					   "dax not supported for multi-device btrfs partition\n");
+				ret = -EOPNOTSUPP;
+				goto out;
+			}
+			btrfs_set_opt(info->mount_opt, DAX);
+			break;
+#endif
 		case Opt_enospc_debug:
 			btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
 			break;
@@ -1329,6 +1342,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",clear_cache");
 	if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
 		seq_puts(seq, ",user_subvol_rm_allowed");
+	if (btrfs_test_opt(info, DAX))
+		seq_puts(seq, ",dax");
 	if (btrfs_test_opt(info, ENOSPC_DEBUG))
 		seq_puts(seq, ",enospc_debug");
 	if (btrfs_test_opt(info, AUTO_DEFRAG))
-- 
2.16.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 02/10] btrfs: basic dax read
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
  2018-12-05 12:28 ` [PATCH 01/10] btrfs: create a mount option for dax Goldwyn Rodrigues
@ 2018-12-05 12:28 ` Goldwyn Rodrigues
  2018-12-05 13:11   ` Nikolay Borisov
  2018-12-05 13:22   ` Johannes Thumshirn
  2018-12-05 12:28 ` [PATCH 03/10] btrfs: dax: read zeros from holes Goldwyn Rodrigues
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2018-12-05 12:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/Makefile |  1 +
 fs/btrfs/ctree.h  |  5 ++++
 fs/btrfs/dax.c    | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/file.c   | 13 ++++++++++-
 4 files changed, 86 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/dax.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index ca693dd554e9..1fa77b875ae9 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o
 
+btrfs-$(CONFIG_FS_DAX) += dax.o
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5cc470fa6a40..038d64ecebe5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3685,6 +3685,11 @@ int btrfs_reada_wait(void *handle);
 void btrfs_reada_detach(void *handle);
 int btree_readahead_hook(struct extent_buffer *eb, int err);
 
+#ifdef CONFIG_FS_DAX
+/* dax.c */
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+#endif /* CONFIG_FS_DAX */
+
 static inline int is_fstree(u64 rootid)
 {
 	if (rootid == BTRFS_FS_TREE_OBJECTID ||
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
new file mode 100644
index 000000000000..d614bf73bf8e
--- /dev/null
+++ b/fs/btrfs/dax.c
@@ -0,0 +1,68 @@
+#include <linux/dax.h>
+#include <linux/uio.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+
+static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
+		u64 len, struct iov_iter *iter)
+{
+        struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev);
+        ssize_t map_len;
+        pgoff_t blk_pg;
+        void *kaddr;
+        sector_t blk_start;
+        unsigned offset = pos & (PAGE_SIZE - 1);
+
+        len = min(len + offset, em->len - (pos - em->start));
+        len = ALIGN(len, PAGE_SIZE);
+        blk_start = (get_start_sect(em->bdev) << 9) + (em->block_start + (pos - em->start));
+        blk_pg = blk_start - offset;
+        map_len = dax_direct_access(dax_dev, PHYS_PFN(blk_pg), PHYS_PFN(len), &kaddr, NULL);
+        map_len = PFN_PHYS(map_len);
+        kaddr += offset;
+        map_len -= offset;
+        if (map_len > len)
+                map_len = len;
+        if (iov_iter_rw(iter) == WRITE)
+                return dax_copy_from_iter(dax_dev, blk_pg, kaddr, map_len, iter);
+        else
+                return dax_copy_to_iter(dax_dev, blk_pg, kaddr, map_len, iter);
+}
+
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
+{
+        size_t ret = 0, done = 0, count = iov_iter_count(to);
+        struct extent_map *em;
+        u64 pos = iocb->ki_pos;
+        u64 end = pos + count;
+        struct inode *inode = file_inode(iocb->ki_filp);
+
+        if (!count)
+                return 0;
+
+        end = i_size_read(inode) < end ? i_size_read(inode) : end;
+
+        while (pos < end) {
+                u64 len = end - pos;
+
+                em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, len, 0);
+                if (IS_ERR(em)) {
+                        if (!ret)
+                                ret = PTR_ERR(em);
+                        goto out;
+                }
+
+                BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);
+
+                ret = em_dax_rw(inode, em, pos, len, to);
+                if (ret < 0)
+                        goto out;
+                pos += ret;
+                done += ret;
+        }
+
+out:
+        iocb->ki_pos += done;
+        return done ? done : ret;
+}
+
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 58e93bce3036..ef6ed93f44d1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3308,9 +3308,20 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 	return generic_file_open(inode, filp);
 }
 
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+#ifdef CONFIG_FS_DAX
+	if (IS_DAX(inode))
+		return btrfs_file_dax_read(iocb, to);
+#endif
+	return generic_file_read_iter(iocb, to);
+}
+
 const struct file_operations btrfs_file_operations = {
 	.llseek		= btrfs_file_llseek,
-	.read_iter      = generic_file_read_iter,
+	.read_iter      = btrfs_file_read_iter,
 	.splice_read	= generic_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.mmap		= btrfs_file_mmap,
-- 
2.16.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 03/10] btrfs: dax: read zeros from holes
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
  2018-12-05 12:28 ` [PATCH 01/10] btrfs: create a mount option for dax Goldwyn Rodrigues
  2018-12-05 12:28 ` [PATCH 02/10] btrfs: basic dax read Goldwyn Rodrigues
@ 2018-12-05 12:28 ` Goldwyn Rodrigues
  2018-12-05 13:26   ` Nikolay Borisov
  2018-12-05 12:28 ` [PATCH 04/10] Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 30+ messages in thread
From: Goldwyn Rodrigues @ 2018-12-05 12:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/dax.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index d614bf73bf8e..5a297674adec 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -54,7 +54,12 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
 
                 BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);
 
-                ret = em_dax_rw(inode, em, pos, len, to);
+		if (em->block_start == EXTENT_MAP_HOLE) {
+			u64 zero_len = min(em->len - (em->start - pos), len);
+			ret = iov_iter_zero(zero_len, to);
+		} else {
+			ret = em_dax_rw(inode, em, pos, len, to);
+		}
                 if (ret < 0)
                         goto out;
                 pos += ret;
-- 
2.16.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 04/10] Rename __endio_write_update_ordered() to btrfs_update_ordered_extent()
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
                   ` (2 preceding siblings ...)
  2018-12-05 12:28 ` [PATCH 03/10] btrfs: dax: read zeros from holes Goldwyn Rodrigues
@ 2018-12-05 12:28 ` Goldwyn Rodrigues
  2018-12-05 13:35   ` Nikolay Borisov
  2018-12-05 12:28 ` [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 30+ messages in thread
From: Goldwyn Rodrigues @ 2018-12-05 12:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Since we will be using it in another part of the code, use a
better name to declare it non-static

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  7 +++++--
 fs/btrfs/inode.c | 14 +++++---------
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 038d64ecebe5..5144d28216b0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3170,8 +3170,11 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location,
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 			 struct btrfs_root *root, int *was_new);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
-				    struct page *page, size_t pg_offset,
-				    u64 start, u64 end, int create);
+		struct page *page, size_t pg_offset,
+		u64 start, u64 end, int create);
+void btrfs_update_ordered_extent(struct inode *inode,
+		const u64 offset, const u64 bytes,
+		const bool uptodate);
 int btrfs_update_inode(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root,
 			      struct inode *inode);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9ea4c6f0352f..96e9fe9e4150 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -97,10 +97,6 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
 				       u64 ram_bytes, int compress_type,
 				       int type);
 
-static void __endio_write_update_ordered(struct inode *inode,
-					 const u64 offset, const u64 bytes,
-					 const bool uptodate);
-
 /*
  * Cleanup all submitted ordered extents in specified range to handle errors
  * from the fill_dellaloc() callback.
@@ -130,7 +126,7 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
 		ClearPagePrivate2(page);
 		put_page(page);
 	}
-	return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
+	return btrfs_update_ordered_extent(inode, offset + PAGE_SIZE,
 					    bytes - PAGE_SIZE, false);
 }
 
@@ -8059,7 +8055,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
 	bio_put(bio);
 }
 
-static void __endio_write_update_ordered(struct inode *inode,
+void btrfs_update_ordered_extent(struct inode *inode,
 					 const u64 offset, const u64 bytes,
 					 const bool uptodate)
 {
@@ -8112,7 +8108,7 @@ static void btrfs_endio_direct_write(struct bio *bio)
 	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio *dio_bio = dip->dio_bio;
 
-	__endio_write_update_ordered(dip->inode, dip->logical_offset,
+	btrfs_update_ordered_extent(dip->inode, dip->logical_offset,
 				     dip->bytes, !bio->bi_status);
 
 	kfree(dip);
@@ -8432,7 +8428,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		bio = NULL;
 	} else {
 		if (write)
-			__endio_write_update_ordered(inode,
+			btrfs_update_ordered_extent(inode,
 						file_offset,
 						dio_bio->bi_iter.bi_size,
 						false);
@@ -8572,7 +8568,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 			 */
 			if (dio_data.unsubmitted_oe_range_start <
 			    dio_data.unsubmitted_oe_range_end)
-				__endio_write_update_ordered(inode,
+				btrfs_update_ordered_extent(inode,
 					dio_data.unsubmitted_oe_range_start,
 					dio_data.unsubmitted_oe_range_end -
 					dio_data.unsubmitted_oe_range_start,
-- 
2.16.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write()
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
                   ` (3 preceding siblings ...)
  2018-12-05 12:28 ` [PATCH 04/10] Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
@ 2018-12-05 12:28 ` Goldwyn Rodrigues
  2018-12-05 12:28 ` [PATCH 06/10] btrfs: dax write support Goldwyn Rodrigues
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2018-12-05 12:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This makes btrfs_get_extent_map_write() independent of Direct
I/O code.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 40 +++++++++++++++++++++++++++-------------
 2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5144d28216b0..a0d296b0d826 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3169,6 +3169,8 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location,
 			      struct btrfs_path *path);
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 			 struct btrfs_root *root, int *was_new);
+int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
+		struct inode *inode, u64 start, u64 len);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 		struct page *page, size_t pg_offset,
 		u64 start, u64 end, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 96e9fe9e4150..4671cd9165c1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7485,11 +7485,10 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em,
 	return 0;
 }
 
-static int btrfs_get_blocks_direct_write(struct extent_map **map,
-					 struct buffer_head *bh_result,
-					 struct inode *inode,
-					 struct btrfs_dio_data *dio_data,
-					 u64 start, u64 len)
+int btrfs_get_extent_map_write(struct extent_map **map,
+		struct buffer_head *bh,
+		struct inode *inode,
+		u64 start, u64 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em = *map;
@@ -7543,22 +7542,38 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 			 */
 			btrfs_free_reserved_data_space_noquota(inode, start,
 							       len);
-			goto skip_cow;
+			/* skip COW */
+			goto out;
 		}
 	}
 
 	/* this will cow the extent */
-	len = bh_result->b_size;
+	if (bh)
+		len = bh->b_size;
 	free_extent_map(em);
 	*map = em = btrfs_new_extent_direct(inode, start, len);
-	if (IS_ERR(em)) {
-		ret = PTR_ERR(em);
-		goto out;
-	}
+	if (IS_ERR(em))
+		return PTR_ERR(em);
+out:
+	return ret;
+}
 
+static int btrfs_get_blocks_direct_write(struct extent_map **map,
+					 struct buffer_head *bh_result,
+					 struct inode *inode,
+					 struct btrfs_dio_data *dio_data,
+					 u64 start, u64 len)
+{
+	int ret = 0;
+	struct extent_map *em;
+
+	ret = btrfs_get_extent_map_write(map, bh_result, inode,
+			start, len);
+	if (ret < 0)
+		return ret;
+	em = *map;
 	len = min(len, em->len - (start - em->start));
 
-skip_cow:
 	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
 		inode->i_blkbits;
 	bh_result->b_size = len;
@@ -7579,7 +7594,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	dio_data->reserve -= len;
 	dio_data->unsubmitted_oe_range_end = start + len;
 	current->journal_info = dio_data;
-out:
 	return ret;
 }
 
-- 
2.16.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 06/10] btrfs: dax write support
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
                   ` (4 preceding siblings ...)
  2018-12-05 12:28 ` [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
@ 2018-12-05 12:28 ` Goldwyn Rodrigues
  2018-12-05 13:56   ` Johannes Thumshirn
  2018-12-05 12:28 ` [PATCH 07/10] dax: export functions for use with btrfs Goldwyn Rodrigues
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 30+ messages in thread
From: Goldwyn Rodrigues @ 2018-12-05 12:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This is a combination of direct and buffered I/O. Similarties
with direct I/O is that it needs to allocate space before
writing. Similarities with buffered is when the data is not
page-aligned, it needs to copy parts of the previous extents. In
order to accomplish that, keep a references of the first and last
extent (if required) and then perform allocations. If the "pos"
or "end" is not aligned, copy the data from first and last extent
respectively.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |   1 +
 fs/btrfs/dax.c   | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/file.c  |   4 +-
 3 files changed, 125 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a0d296b0d826..d91ff283a966 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3693,6 +3693,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
 #ifdef CONFIG_FS_DAX
 /* dax.c */
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
 #endif /* CONFIG_FS_DAX */
 
 static inline int is_fstree(u64 rootid)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 5a297674adec..4000259a426c 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -2,6 +2,7 @@
 #include <linux/uio.h>
 #include "ctree.h"
 #include "btrfs_inode.h"
+#include "extent_io.h"
 
 static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
 		u64 len, struct iov_iter *iter)
@@ -71,3 +72,123 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
         return done ? done : ret;
 }
 
+static int copy_extent_page(struct extent_map *em, void *daddr, u64 pos)
+{
+        struct dax_device *dax_dev;
+	void *saddr;
+	sector_t start;
+	size_t len;
+
+	if (em->block_start == EXTENT_MAP_HOLE) {
+		memset(daddr, 0, PAGE_SIZE);
+	} else {
+		dax_dev = fs_dax_get_by_bdev(em->bdev);
+		start = (get_start_sect(em->bdev) << 9) + (em->block_start + (pos - em->start));
+		len = dax_direct_access(dax_dev, PHYS_PFN(start), 1, &saddr, NULL);
+		memcpy(daddr, saddr, PAGE_SIZE);
+	}
+	free_extent_map(em);
+
+	return 0;
+}
+
+
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	ssize_t ret, done = 0, count = iov_iter_count(from);
+        struct inode *inode = file_inode(iocb->ki_filp);
+	u64 pos = iocb->ki_pos;
+	u64 start = round_down(pos, PAGE_SIZE);
+	u64 end = round_up(pos + count, PAGE_SIZE);
+	struct extent_state *cached_state = NULL;
+	struct extent_changeset *data_reserved = NULL;
+	struct extent_map *first = NULL, *last = NULL;
+
+	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, end - start);
+	if (ret < 0)
+		return ret;
+
+	/* Grab a reference of the first extent to copy data */
+	if (start < pos) {
+		first = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, end - start, 0);
+		if (IS_ERR(first)) {
+			ret = PTR_ERR(first);
+			goto out2;
+		}
+	}
+
+	/* Grab a reference of the last extent to copy data */
+	if (pos + count < end) {
+		last = btrfs_get_extent(BTRFS_I(inode), NULL, 0, end - PAGE_SIZE, PAGE_SIZE, 0);
+		if (IS_ERR(last)) {
+			ret = PTR_ERR(last);
+			goto out2;
+		}
+	}
+
+	lock_extent_bits(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
+	while (done < count) {
+		struct extent_map *em;
+		struct dax_device *dax_dev;
+		int offset = pos & (PAGE_SIZE - 1);
+		u64 estart = round_down(pos, PAGE_SIZE);
+		u64 elen = end - estart;
+		size_t len = count - done;
+		sector_t dstart;
+		void *daddr;
+		ssize_t maplen;
+
+		/* Read the current extent */
+                em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, estart, elen, 0);
+		if (IS_ERR(em)) {
+			ret = PTR_ERR(em);
+			goto out;
+		}
+
+		/* Get a new extent */
+		ret = btrfs_get_extent_map_write(&em, NULL, inode, estart, elen);
+		if (ret < 0)
+			goto out;
+
+		dax_dev = fs_dax_get_by_bdev(em->bdev);
+		/* Calculate start address start of destination extent */
+		dstart = (get_start_sect(em->bdev) << 9) + em->block_start;
+		maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart),
+				PHYS_PFN(em->len), &daddr, NULL);
+
+		/* Copy front of extent page */
+		if (offset)
+			ret = copy_extent_page(first, daddr, estart);
+
+		/* Copy end of extent page */
+		if ((pos + len > estart + PAGE_SIZE) && (pos + len < em->start + em->len))
+			ret = copy_extent_page(last, daddr + em->len - PAGE_SIZE, em->start + em->len - PAGE_SIZE);
+
+		/* Copy the data from the iter */
+		maplen = PFN_PHYS(maplen);
+		maplen -= offset;
+		ret = dax_copy_from_iter(dax_dev, dstart, daddr + offset, maplen, from);
+		if (ret < 0)
+			goto out;
+		pos += ret;
+		done += ret;
+	}
+out:
+	unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
+	if (done) {
+		btrfs_update_ordered_extent(inode, start,
+				end - start, true);
+		iocb->ki_pos += done;
+		if (iocb->ki_pos > i_size_read(inode))
+			i_size_write(inode, iocb->ki_pos);
+	}
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), count, false);
+out2:
+	if (count - done > 0)
+		btrfs_delalloc_release_space(inode, data_reserved, pos,
+				count - done, true);
+	extent_changeset_free(data_reserved);
+        return done ? done : ret;
+
+}
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ef6ed93f44d1..29a3b12e6660 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1964,7 +1964,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (sync)
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
-	if (iocb->ki_flags & IOCB_DIRECT) {
+	if (IS_DAX(inode)) {
+		num_written = btrfs_file_dax_write(iocb, from);
+	} else if (iocb->ki_flags & IOCB_DIRECT) {
 		num_written = __btrfs_direct_write(iocb, from);
 	} else {
 		num_written = btrfs_buffered_write(iocb, from);
-- 
2.16.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 07/10] dax: export functions for use with btrfs
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
                   ` (5 preceding siblings ...)
  2018-12-05 12:28 ` [PATCH 06/10] btrfs: dax write support Goldwyn Rodrigues
@ 2018-12-05 12:28 ` Goldwyn Rodrigues
  2018-12-05 13:59   ` Johannes Thumshirn
  2018-12-05 14:52   ` Christoph Hellwig
  2018-12-05 12:28 ` [PATCH 08/10] btrfs: dax add read mmap path Goldwyn Rodrigues
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2018-12-05 12:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

These functions are required for btrfs dax support.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/dax.c            | 35 ++++++++++++++++++++++++-----------
 include/linux/dax.h | 16 ++++++++++++++++
 2 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9bcce89ea18e..4578640af631 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -244,7 +244,7 @@ static void put_unlocked_entry(struct xa_state *xas, void *entry)
  * dropped the xa_lock, so we know the xa_state is stale and must be reset
  * before use.
  */
-static void dax_unlock_entry(struct xa_state *xas, void *entry)
+void dax_unlock_entry(struct xa_state *xas, void *entry)
 {
 	void *old;
 
@@ -256,6 +256,7 @@ static void dax_unlock_entry(struct xa_state *xas, void *entry)
 	BUG_ON(!dax_is_locked(old));
 	dax_wake_entry(xas, entry, false);
 }
+EXPORT_SYMBOL(dax_unlock_entry);
 
 /*
  * Return: The entry stored at this location before it was locked.
@@ -448,7 +449,7 @@ void dax_unlock_mapping_entry(struct page *page)
  * a VM_FAULT code, encoded as an xarray internal entry.  The ERR_PTR values
  * overlap with xarray value entries.
  */
-static void *grab_mapping_entry(struct xa_state *xas,
+void *grab_mapping_entry(struct xa_state *xas,
 		struct address_space *mapping, unsigned long size_flag)
 {
 	unsigned long index = xas->xa_index;
@@ -531,6 +532,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
 	xas_unlock_irq(xas);
 	return xa_mk_internal(VM_FAULT_FALLBACK);
 }
+EXPORT_SYMBOL(grab_mapping_entry);
 
 /**
  * dax_layout_busy_page - find first pinned page in @mapping
@@ -654,7 +656,7 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 	return __dax_invalidate_entry(mapping, index, false);
 }
 
-static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
+int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
 		sector_t sector, size_t size, struct page *to,
 		unsigned long vaddr)
 {
@@ -679,6 +681,7 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
 	dax_read_unlock(id);
 	return 0;
 }
+EXPORT_SYMBOL(copy_user_dax);
 
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
@@ -687,7 +690,7 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
  * already in the tree, we will skip the insertion and just dirty the PMD as
  * appropriate.
  */
-static void *dax_insert_entry(struct xa_state *xas,
+void *dax_insert_entry(struct xa_state *xas,
 		struct address_space *mapping, struct vm_fault *vmf,
 		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
 {
@@ -736,6 +739,7 @@ static void *dax_insert_entry(struct xa_state *xas,
 	xas_unlock_irq(xas);
 	return entry;
 }
+EXPORT_SYMBOL(dax_insert_entry);
 
 static inline
 unsigned long pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
@@ -962,19 +966,18 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
 	return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
 }
 
-static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
-			 pfn_t *pfnp)
+int dax_pfn(struct dax_device *dax_dev, struct block_device *bdev,
+	    const sector_t sector, size_t size, pfn_t *pfnp)
 {
-	const sector_t sector = dax_iomap_sector(iomap, pos);
 	pgoff_t pgoff;
 	int id, rc;
 	long length;
 
-	rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff);
+	rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
 	if (rc)
 		return rc;
 	id = dax_read_lock();
-	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
+	length = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
 				   NULL, pfnp);
 	if (length < 0) {
 		rc = length;
@@ -993,6 +996,14 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
 	dax_read_unlock(id);
 	return rc;
 }
+EXPORT_SYMBOL(dax_pfn);
+
+static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
+			 pfn_t *pfnp)
+{
+	const sector_t sector = dax_iomap_sector(iomap, pos);
+	return dax_pfn(iomap->dax_dev, iomap->bdev, sector, size, pfnp);
+}
 
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
@@ -1001,7 +1012,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
  * If this page is ever written to we will re-fault and change the mapping to
  * point to real DAX storage instead.
  */
-static vm_fault_t dax_load_hole(struct xa_state *xas,
+vm_fault_t dax_load_hole(struct xa_state *xas,
 		struct address_space *mapping, void **entry,
 		struct vm_fault *vmf)
 {
@@ -1017,6 +1028,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 	trace_dax_load_hole(inode, vmf, ret);
 	return ret;
 }
+EXPORT_SYMBOL(dax_load_hole);
 
 static bool dax_range_is_aligned(struct block_device *bdev,
 				 unsigned int offset, unsigned int length)
@@ -1195,7 +1207,7 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 }
 EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
-static vm_fault_t dax_fault_return(int error)
+vm_fault_t dax_fault_return(int error)
 {
 	if (error == 0)
 		return VM_FAULT_NOPAGE;
@@ -1203,6 +1215,7 @@ static vm_fault_t dax_fault_return(int error)
 		return VM_FAULT_OOM;
 	return VM_FAULT_SIGBUS;
 }
+EXPORT_SYMBOL(dax_fault_return);
 
 /*
  * MAP_SYNC on a dax mapping guarantees dirty metadata is
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 450b28db9533..72f0a1e85ab5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -155,6 +155,22 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
+int dax_pfn(struct dax_device *dax_dev, struct block_device *bdev,
+	    const sector_t sector, size_t size, pfn_t *pfn);
+void *dax_insert_entry(struct xa_state *xas,
+	               struct address_space *mapping, struct vm_fault *vmf,
+	               void *entry, pfn_t pfn, unsigned long flags, bool dirty);
+void *grab_mapping_entry(struct xa_state *xas,
+	                 struct address_space *mapping,
+			 unsigned long size_flag);
+void dax_unlock_entry(struct xa_state *xas, void *entry);
+vm_fault_t dax_load_hole(struct xa_state *xas,
+			 struct address_space *mapping, void **entry,
+			 struct vm_fault *vmf);
+vm_fault_t dax_fault_return(int error);
+int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
+		sector_t sector, size_t size, struct page *to,
+		unsigned long vaddr);
 
 #ifdef CONFIG_FS_DAX
 int __dax_zero_page_range(struct block_device *bdev,
-- 
2.16.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 08/10] btrfs: dax add read mmap path
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
                   ` (6 preceding siblings ...)
  2018-12-05 12:28 ` [PATCH 07/10] dax: export functions for use with btrfs Goldwyn Rodrigues
@ 2018-12-05 12:28 ` Goldwyn Rodrigues
  2018-12-05 12:28 ` [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared Goldwyn Rodrigues
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2018-12-05 12:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/dax.c   | 43 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/file.c  | 12 +++++++++++-
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d91ff283a966..33648121ca52 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3694,6 +3694,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
 /* dax.c */
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
 ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
+vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
 #endif /* CONFIG_FS_DAX */
 
 static inline int is_fstree(u64 rootid)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 4000259a426c..88017f8799d1 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -190,5 +190,48 @@ ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
 				count - done, true);
 	extent_changeset_free(data_reserved);
         return done ? done : ret;
+}
+
+/* As copied from dax_iomap_pte_fault() */
+vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
+{
+	pfn_t pfn;
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
+	struct inode *inode = mapping->host;
+	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
+	void *entry = NULL;
+	vm_fault_t ret = 0;
+
+	if (pos > i_size_read(inode)) {
+		ret = VM_FAULT_SIGBUS;
+		goto out;
+	}
 
+	entry = grab_mapping_entry(&xas, mapping, 0);
+	if (IS_ERR(entry)) {
+		ret = dax_fault_return(PTR_ERR(entry));
+		goto out;
+	}
+
+	if (!vmf->cow_page) {
+        	sector_t sector;
+		struct extent_map *em;
+                em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, PAGE_SIZE, 0);
+		if (em->block_start == EXTENT_MAP_HOLE) {
+			ret = dax_load_hole(&xas, mapping, entry, vmf);
+			goto out;
+		}
+	        sector = ((get_start_sect(em->bdev) << 9) +
+			  (em->block_start + (pos - em->start))) >> 9;
+		ret = dax_pfn(fs_dax_get_by_bdev(em->bdev), em->bdev, sector, PAGE_SIZE, &pfn);
+		if (ret)
+			goto out;
+		dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, false);
+		ret = vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+	}
+out:
+	if (entry)
+		dax_unlock_entry(&xas, entry);
+	return ret;
 }
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 29a3b12e6660..38b494686fb2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2227,8 +2227,18 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	return ret > 0 ? -EIO : ret;
 }
 
+static vm_fault_t btrfs_fault(struct vm_fault *vmf)
+{
+	struct inode *inode = vmf->vma->vm_file->f_mapping->host;
+#ifdef CONFIG_FS_DAX
+	if (IS_DAX(inode))
+		return btrfs_dax_fault(vmf);
+#endif
+	return filemap_fault(vmf);
+}
+
 static const struct vm_operations_struct btrfs_file_vm_ops = {
-	.fault		= filemap_fault,
+	.fault		= btrfs_fault,
 	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= btrfs_page_mkwrite,
 };
-- 
2.16.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
                   ` (7 preceding siblings ...)
  2018-12-05 12:28 ` [PATCH 08/10] btrfs: dax add read mmap path Goldwyn Rodrigues
@ 2018-12-05 12:28 ` Goldwyn Rodrigues
  2018-12-05 12:28 ` [PATCH 10/10] btrfs: dax mmap write Goldwyn Rodrigues
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2018-12-05 12:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/dax.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 88017f8799d1..6d68d39cc5da 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -198,10 +198,13 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
 	pfn_t pfn;
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
+	unsigned long vaddr = vmf->address;
 	struct inode *inode = mapping->host;
 	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
 	void *entry = NULL;
 	vm_fault_t ret = 0;
+	struct extent_map *em;
+	struct dax_device *dax_dev;
 
 	if (pos > i_size_read(inode)) {
 		ret = VM_FAULT_SIGBUS;
@@ -214,21 +217,33 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
 		goto out;
 	}
 
-	if (!vmf->cow_page) {
+        em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, PAGE_SIZE, 0);
+	if (em->block_start != EXTENT_MAP_HOLE)
+		dax_dev = fs_dax_get_by_bdev(em->bdev);
+
+	if (vmf->cow_page) {
+		sector_t sector;
+		if (em->block_start == EXTENT_MAP_HOLE) {
+			clear_user_highpage(vmf->cow_page, vaddr);
+			goto out;
+		}
+        	sector = (get_start_sect(em->bdev) << 9) + (em->block_start + (pos - em->start));
+		sector >>= 9;
+		ret = copy_user_dax(em->bdev, dax_dev, sector, PAGE_SIZE, vmf->cow_page, vaddr);
+		goto out;
+	} else {
         	sector_t sector;
-		struct extent_map *em;
-                em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, PAGE_SIZE, 0);
 		if (em->block_start == EXTENT_MAP_HOLE) {
 			ret = dax_load_hole(&xas, mapping, entry, vmf);
 			goto out;
 		}
 	        sector = ((get_start_sect(em->bdev) << 9) +
 			  (em->block_start + (pos - em->start))) >> 9;
-		ret = dax_pfn(fs_dax_get_by_bdev(em->bdev), em->bdev, sector, PAGE_SIZE, &pfn);
+		ret = dax_pfn(dax_dev, em->bdev, sector, PAGE_SIZE, &pfn);
 		if (ret)
 			goto out;
 		dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, false);
-		ret = vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+		ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
 	}
 out:
 	if (entry)
-- 
2.16.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 10/10] btrfs: dax mmap write
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
                   ` (8 preceding siblings ...)
  2018-12-05 12:28 ` [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared Goldwyn Rodrigues
@ 2018-12-05 12:28 ` Goldwyn Rodrigues
  2018-12-05 13:03 ` [PATCH 00/10] btrfs: Support for DAX devices Qu Wenruo
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2018-12-05 12:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Create a page size extent and copy the contents of the original
extent into the new one, and present to user space as the page
to write.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/dax.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 6d68d39cc5da..4634917877f3 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -231,6 +231,45 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
 		sector >>= 9;
 		ret = copy_user_dax(em->bdev, dax_dev, sector, PAGE_SIZE, vmf->cow_page, vaddr);
 		goto out;
+	} else if (vmf->flags & FAULT_FLAG_WRITE) {
+		pfn_t pfn;
+		struct extent_map *orig = em;
+		void *daddr;
+		sector_t dstart;
+		size_t maplen;
+		struct extent_changeset *data_reserved = NULL;
+		struct extent_state *cached_state = NULL;
+
+		ret = btrfs_delalloc_reserve_space(inode, &data_reserved, pos, PAGE_SIZE);
+		if (ret < 0)
+			return ret;
+		refcount_inc(&orig->refs);
+		lock_extent_bits(&BTRFS_I(inode)->io_tree, pos, pos + PAGE_SIZE, &cached_state);
+		/* Create an extent of page size */
+		ret = btrfs_get_extent_map_write(&em, NULL, inode, pos,
+				PAGE_SIZE);
+		if (ret < 0) {
+			free_extent_map(orig);
+			btrfs_delalloc_release_space(inode, data_reserved, pos,
+					PAGE_SIZE, true);
+			goto out;
+		}
+
+		dax_dev = fs_dax_get_by_bdev(em->bdev);
+		/* Calculate start address of destination extent */
+		dstart = (get_start_sect(em->bdev) << 9) + em->block_start;
+		maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart),
+				1, &daddr, &pfn);
+
+		/* Copy the original contents into new destination */
+		copy_extent_page(orig, daddr, pos);
+		btrfs_update_ordered_extent(inode, pos, PAGE_SIZE, true);
+		dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, false);
+		ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
+		free_extent_map(orig);
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree, pos, pos + PAGE_SIZE, &cached_state);
+		extent_changeset_free(data_reserved);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, false);
 	} else {
         	sector_t sector;
 		if (em->block_start == EXTENT_MAP_HOLE) {
-- 
2.16.4


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 01/10] btrfs: create a mount option for dax
  2018-12-05 12:28 ` [PATCH 01/10] btrfs: create a mount option for dax Goldwyn Rodrigues
@ 2018-12-05 12:42   ` Johannes Thumshirn
  2018-12-05 12:43   ` Nikolay Borisov
  1 sibling, 0 replies; 30+ messages in thread
From: Johannes Thumshirn @ 2018-12-05 12:42 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs; +Cc: Goldwyn Rodrigues

On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Also, set the inode->i_flags to S_DAX
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/btrfs/ctree.h |  1 +
>  fs/btrfs/ioctl.c |  5 ++++-
>  fs/btrfs/super.c | 15 +++++++++++++++
>  3 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 68f322f600a0..5cc470fa6a40 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
>  #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
>  #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
> +#define BTRFS_MOUNT_DAX			(1 << 29)

Just as a heads up, this will collide with the patch called '[RFC PATCH
02/17] btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE' from Su Yue.

[...]

> +#ifdef CONFIG_FS_DAX
> +		case Opt_dax:
> +			if (btrfs_super_num_devices(info->super_copy) > 1) {
> +				btrfs_info(info,
> +					   "dax not supported for multi-device btrfs partition\n");
> +				ret = -EOPNOTSUPP;
> +				goto out;
> +			}
> +			btrfs_set_opt(info->mount_opt, DAX);
> +			break;
> +#endif

Can you please explain why we can't enable DAX on a multi device FS in
the changelog? It's (for me at least) not obvious.

Thanks,
	Johannes
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 01/10] btrfs: create a mount option for dax
  2018-12-05 12:28 ` [PATCH 01/10] btrfs: create a mount option for dax Goldwyn Rodrigues
  2018-12-05 12:42   ` Johannes Thumshirn
@ 2018-12-05 12:43   ` Nikolay Borisov
  2018-12-05 14:59     ` Adam Borowski
  1 sibling, 1 reply; 30+ messages in thread
From: Nikolay Borisov @ 2018-12-05 12:43 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs; +Cc: Goldwyn Rodrigues



On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Also, set the inode->i_flags to S_DAX
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

One question below though .

> ---
>  fs/btrfs/ctree.h |  1 +
>  fs/btrfs/ioctl.c |  5 ++++-
>  fs/btrfs/super.c | 15 +++++++++++++++
>  3 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 68f322f600a0..5cc470fa6a40 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
>  #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
>  #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
> +#define BTRFS_MOUNT_DAX			(1 << 29)
>  
>  #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
>  #define BTRFS_DEFAULT_MAX_INLINE	(2048)
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 802a628e9f7d..e9146c157816 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
>  	if (binode->flags & BTRFS_INODE_DIRSYNC)
>  		new_fl |= S_DIRSYNC;
>  
> +	if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && S_ISREG(inode->i_mode))
> +		new_fl |= S_DAX;
> +
>  	set_mask_bits(&inode->i_flags,
> -		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
> +		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | S_DAX,
>  		      new_fl);
>  }
>  
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 645fc81e2a94..035263b61cf5 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -326,6 +326,7 @@ enum {
>  	Opt_treelog, Opt_notreelog,
>  	Opt_usebackuproot,
>  	Opt_user_subvol_rm_allowed,
> +	Opt_dax,
>  
>  	/* Deprecated options */
>  	Opt_alloc_start,
> @@ -393,6 +394,7 @@ static const match_table_t tokens = {
>  	{Opt_notreelog, "notreelog"},
>  	{Opt_usebackuproot, "usebackuproot"},
>  	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
> +	{Opt_dax, "dax"},
>  
>  	/* Deprecated options */
>  	{Opt_alloc_start, "alloc_start=%s"},
> @@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>  		case Opt_user_subvol_rm_allowed:
>  			btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
>  			break;
> +#ifdef CONFIG_FS_DAX
> +		case Opt_dax:
> +			if (btrfs_super_num_devices(info->super_copy) > 1) {
> +				btrfs_info(info,
> +					   "dax not supported for multi-device btrfs partition\n");

What prevents supporting dax for multiple devices so long as all devices
are dax?

<snip>

> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/10] btrfs: Support for DAX devices
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
                   ` (9 preceding siblings ...)
  2018-12-05 12:28 ` [PATCH 10/10] btrfs: dax mmap write Goldwyn Rodrigues
@ 2018-12-05 13:03 ` Qu Wenruo
  2018-12-05 21:36   ` Jeff Mahoney
  2018-12-05 13:57 ` Adam Borowski
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 30+ messages in thread
From: Qu Wenruo @ 2018-12-05 13:03 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 2524 bytes --]



On 2018/12/5 下午8:28, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

I'm not familar with DAX, so it's completely possible I'm talking like
an idiot.

If btrfs_page_mkwrite() can't provide enough control, then I have a
crazy idea.

Forcing page fault for every mmap() read/write (completely disable page
cache like DIO).
So that we could get some control since we're informed to read the page
and do some hacks there.

Thanks,
Qu
> 
> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?
> 
> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.
> 
> 
> [PATCH 01/10] btrfs: create a mount option for dax
> [PATCH 02/10] btrfs: basic dax read
> [PATCH 03/10] btrfs: dax: read zeros from holes
> [PATCH 04/10] Rename __endio_write_update_ordered() to
> [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
> [PATCH 06/10] btrfs: dax write support
> [PATCH 07/10] dax: export functions for use with btrfs
> [PATCH 08/10] btrfs: dax add read mmap path
> [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
> [PATCH 10/10] btrfs: dax mmap write
> 
>  fs/btrfs/Makefile   |    1 
>  fs/btrfs/ctree.h    |   17 ++
>  fs/btrfs/dax.c      |  303 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/btrfs/file.c     |   29 ++++
>  fs/btrfs/inode.c    |   54 +++++----
>  fs/btrfs/ioctl.c    |    5 
>  fs/btrfs/super.c    |   15 ++
>  fs/dax.c            |   35 ++++--
>  include/linux/dax.h |   16 ++
>  9 files changed, 430 insertions(+), 45 deletions(-)
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/10] btrfs: basic dax read
  2018-12-05 12:28 ` [PATCH 02/10] btrfs: basic dax read Goldwyn Rodrigues
@ 2018-12-05 13:11   ` Nikolay Borisov
  2018-12-05 13:22   ` Johannes Thumshirn
  1 sibling, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2018-12-05 13:11 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs



On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/btrfs/Makefile |  1 +
>  fs/btrfs/ctree.h  |  5 ++++
>  fs/btrfs/dax.c    | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/file.c   | 13 ++++++++++-
>  4 files changed, 86 insertions(+), 1 deletion(-)
>  create mode 100644 fs/btrfs/dax.c
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index ca693dd554e9..1fa77b875ae9 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>  	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>  	   uuid-tree.o props.o free-space-tree.o tree-checker.o
>  
> +btrfs-$(CONFIG_FS_DAX) += dax.o
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>  btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 5cc470fa6a40..038d64ecebe5 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3685,6 +3685,11 @@ int btrfs_reada_wait(void *handle);
>  void btrfs_reada_detach(void *handle);
>  int btree_readahead_hook(struct extent_buffer *eb, int err);
>  
> +#ifdef CONFIG_FS_DAX
> +/* dax.c */
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> +#endif /* CONFIG_FS_DAX */
> +
>  static inline int is_fstree(u64 rootid)
>  {
>  	if (rootid == BTRFS_FS_TREE_OBJECTID ||
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> new file mode 100644
> index 000000000000..d614bf73bf8e
> --- /dev/null
> +++ b/fs/btrfs/dax.c
> @@ -0,0 +1,68 @@
> +#include <linux/dax.h>
> +#include <linux/uio.h>
> +#include "ctree.h"
> +#include "btrfs_inode.h"
> +
> +static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
> +		u64 len, struct iov_iter *iter)
> +{
> +        struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev);
> +        ssize_t map_len;
> +        pgoff_t blk_pg;
> +        void *kaddr;
> +        sector_t blk_start;
> +        unsigned offset = pos & (PAGE_SIZE - 1);

offset = offset_in_page(pos)

> +
> +        len = min(len + offset, em->len - (pos - em->start));
> +        len = ALIGN(len, PAGE_SIZE);

len = PAGE_ALIGN(len);

> +        blk_start = (get_start_sect(em->bdev) << 9) + (em->block_start + (pos - em->start));
> +        blk_pg = blk_start - offset;
> +        map_len = dax_direct_access(dax_dev, PHYS_PFN(blk_pg), PHYS_PFN(len), &kaddr, NULL);
> +        map_len = PFN_PHYS(map_len)> +        kaddr += offset;
> +        map_len -= offset;
> +        if (map_len > len)
> +                map_len = len;

map_len = min(map_len, len);

> +        if (iov_iter_rw(iter) == WRITE)
> +                return dax_copy_from_iter(dax_dev, blk_pg, kaddr, map_len, iter);
> +        else
> +                return dax_copy_to_iter(dax_dev, blk_pg, kaddr, map_len, iter);

Have you looked at the implementation of dax_iomap_actor where they have
pretty similar code. In case of either of those returning 0 they set ret
to EFAULT, should the same be done in btrfs_file_dax_read? IMO it will
be good of you can follow dax_iomap_actor's logic as much as possible
since this code has been used for quite some time and is deemed robust.

> +}
> +
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
> +{
> +        size_t ret = 0, done = 0, count = iov_iter_count(to);
> +        struct extent_map *em;
> +        u64 pos = iocb->ki_pos;
> +        u64 end = pos + count;
> +        struct inode *inode = file_inode(iocb->ki_filp);
> +
> +        if (!count)
> +                return 0;
> +
> +        end = i_size_read(inode) < end ? i_size_read(inode) : end;

end = min(i_size_read(inode), end)

> +
> +        while (pos < end) {
> +                u64 len = end - pos;
> +
> +                em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, len, 0);
> +                if (IS_ERR(em)) {
> +                        if (!ret)
> +                                ret = PTR_ERR(em);
> +                        goto out;
> +                }
> +
> +                BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);

I think this can never trigger, because EXTENT_FLAG_FS_MAPPING is set
for extents that map chunk and those are housed in the chunk tree at
fs_info->mapping_tree. Since the write call back is only ever called for
file inodes I'd say this BUG_ON can be eliminated. Did you manage to
trigger it during development?


> +
> +                ret = em_dax_rw(inode, em, pos, len, to);
> +                if (ret < 0)
> +                        goto out;
> +                pos += ret;
> +                done += ret;
> +        }
> +
> +out:
> +        iocb->ki_pos += done;
> +        return done ? done : ret;
> +}
> +
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 58e93bce3036..ef6ed93f44d1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3308,9 +3308,20 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
>  	return generic_file_open(inode, filp);
>  }
>  
> +static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +#ifdef CONFIG_FS_DAX
> +	if (IS_DAX(inode))
> +		return btrfs_file_dax_read(iocb, to);
> +#endif
> +	return generic_file_read_iter(iocb, to);
> +}
> +
>  const struct file_operations btrfs_file_operations = {
>  	.llseek		= btrfs_file_llseek,
> -	.read_iter      = generic_file_read_iter,
> +	.read_iter      = btrfs_file_read_iter,
>  	.splice_read	= generic_file_splice_read,
>  	.write_iter	= btrfs_file_write_iter,
>  	.mmap		= btrfs_file_mmap,
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/10] btrfs: basic dax read
  2018-12-05 12:28 ` [PATCH 02/10] btrfs: basic dax read Goldwyn Rodrigues
  2018-12-05 13:11   ` Nikolay Borisov
@ 2018-12-05 13:22   ` Johannes Thumshirn
  1 sibling, 0 replies; 30+ messages in thread
From: Johannes Thumshirn @ 2018-12-05 13:22 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs; +Cc: Goldwyn Rodrigues

On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Can you explain why we can't use th dax_iomap_rw() interface like XFS or
EXT4?

[...]

> +static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
> +		u64 len, struct iov_iter *iter)
> +{
> +        struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev);
> +        ssize_t map_len;
> +        pgoff_t blk_pg;
> +        void *kaddr;
> +        sector_t blk_start;
> +        unsigned offset = pos & (PAGE_SIZE - 1);

Nit: unsigned offset = offset_in_page(pos);




-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 03/10] btrfs: dax: read zeros from holes
  2018-12-05 12:28 ` [PATCH 03/10] btrfs: dax: read zeros from holes Goldwyn Rodrigues
@ 2018-12-05 13:26   ` Nikolay Borisov
  0 siblings, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2018-12-05 13:26 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs; +Cc: Goldwyn Rodrigues



On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/btrfs/dax.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index d614bf73bf8e..5a297674adec 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -54,7 +54,12 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)

nit: I think it's better if you rename the iterator variable to "iter".

>  
>                  BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);
>  
> -                ret = em_dax_rw(inode, em, pos, len, to);
> +		if (em->block_start == EXTENT_MAP_HOLE) {
> +			u64 zero_len = min(em->len - (em->start - pos), len);

Shouldn't this be em->len - (pos - em->start) since this gives the
remaining length of the extent? Isn't pos guaranteed to be >= em->start ?

> +			ret = iov_iter_zero(zero_len, to);
> +		} else {
> +			ret = em_dax_rw(inode, em, pos, len, to);
> +		}
>                  if (ret < 0)
>                          goto out;
>                  pos += ret;
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 04/10] Rename __endio_write_update_ordered() to btrfs_update_ordered_extent()
  2018-12-05 12:28 ` [PATCH 04/10] Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
@ 2018-12-05 13:35   ` Nikolay Borisov
  0 siblings, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2018-12-05 13:35 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs; +Cc: Goldwyn Rodrigues



On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Since we will be using it in another part of the code, use a
> better name to declare it non-static
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/btrfs/ctree.h |  7 +++++--
>  fs/btrfs/inode.c | 14 +++++---------
>  2 files changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 038d64ecebe5..5144d28216b0 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3170,8 +3170,11 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location,
>  struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
>  			 struct btrfs_root *root, int *was_new);
>  struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> -				    struct page *page, size_t pg_offset,
> -				    u64 start, u64 end, int create);
> +		struct page *page, size_t pg_offset,
> +		u64 start, u64 end, int create);
> +void btrfs_update_ordered_extent(struct inode *inode,
> +		const u64 offset, const u64 bytes,
> +		const bool uptodate);
>  int btrfs_update_inode(struct btrfs_trans_handle *trans,
>  			      struct btrfs_root *root,
>  			      struct inode *inode);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9ea4c6f0352f..96e9fe9e4150 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -97,10 +97,6 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
>  				       u64 ram_bytes, int compress_type,
>  				       int type);
>  
> -static void __endio_write_update_ordered(struct inode *inode,
> -					 const u64 offset, const u64 bytes,
> -					 const bool uptodate);
> -
>  /*
>   * Cleanup all submitted ordered extents in specified range to handle errors
>   * from the fill_dellaloc() callback.
> @@ -130,7 +126,7 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>  		ClearPagePrivate2(page);
>  		put_page(page);
>  	}
> -	return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
> +	return btrfs_update_ordered_extent(inode, offset + PAGE_SIZE,
>  					    bytes - PAGE_SIZE, false);
>  }
>  
> @@ -8059,7 +8055,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
>  	bio_put(bio);
>  }
>  
> -static void __endio_write_update_ordered(struct inode *inode,
> +void btrfs_update_ordered_extent(struct inode *inode,
>  					 const u64 offset, const u64 bytes,
>  					 const bool uptodate)

Since you are exporting the function I'd suggest to use the occasion to
introduce proper kernel-doc. THe primary help will be if you document
the context under which the function is called - when writes are
finished and it's used to update respective portion of the ordered extent.

>  {
> @@ -8112,7 +8108,7 @@ static void btrfs_endio_direct_write(struct bio *bio)
>  	struct btrfs_dio_private *dip = bio->bi_private;
>  	struct bio *dio_bio = dip->dio_bio;
>  
> -	__endio_write_update_ordered(dip->inode, dip->logical_offset,
> +	btrfs_update_ordered_extent(dip->inode, dip->logical_offset,
>  				     dip->bytes, !bio->bi_status);
>  
>  	kfree(dip);
> @@ -8432,7 +8428,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
>  		bio = NULL;
>  	} else {
>  		if (write)
> -			__endio_write_update_ordered(inode,
> +			btrfs_update_ordered_extent(inode,
>  						file_offset,
>  						dio_bio->bi_iter.bi_size,
>  						false);
> @@ -8572,7 +8568,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  			 */
>  			if (dio_data.unsubmitted_oe_range_start <
>  			    dio_data.unsubmitted_oe_range_end)
> -				__endio_write_update_ordered(inode,
> +				btrfs_update_ordered_extent(inode,
>  					dio_data.unsubmitted_oe_range_start,
>  					dio_data.unsubmitted_oe_range_end -
>  					dio_data.unsubmitted_oe_range_start,
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 06/10] btrfs: dax write support
  2018-12-05 12:28 ` [PATCH 06/10] btrfs: dax write support Goldwyn Rodrigues
@ 2018-12-05 13:56   ` Johannes Thumshirn
  0 siblings, 0 replies; 30+ messages in thread
From: Johannes Thumshirn @ 2018-12-05 13:56 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs; +Cc: Goldwyn Rodrigues

On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
[...]

> +static int copy_extent_page(struct extent_map *em, void *daddr, u64 pos)
> +{
> +        struct dax_device *dax_dev;

^ space instead of tabs?

> +	void *saddr;
> +	sector_t start;
> +	size_t len;
> +
> +	if (em->block_start == EXTENT_MAP_HOLE) {
> +		memset(daddr, 0, PAGE_SIZE);
> +	} else {
> +		dax_dev = fs_dax_get_by_bdev(em->bdev);
> +		start = (get_start_sect(em->bdev) << 9) + (em->block_start + (pos - em->start));
> +		len = dax_direct_access(dax_dev, PHYS_PFN(start), 1, &saddr, NULL);
> +		memcpy(daddr, saddr, PAGE_SIZE);
> +	}
> +	free_extent_map(em);
> +
> +	return 0;
> +}
> +
> +

copy_extent_page() always returns 0, why not make it void?
Plus a nit: double newline.

> +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	ssize_t ret, done = 0, count = iov_iter_count(from);
> +        struct inode *inode = file_inode(iocb->ki_filp);
^ again spaces vs tabs.

> +	u64 pos = iocb->ki_pos;
> +	u64 start = round_down(pos, PAGE_SIZE);
> +	u64 end = round_up(pos + count, PAGE_SIZE);
> +	struct extent_state *cached_state = NULL;
> +	struct extent_changeset *data_reserved = NULL;
> +	struct extent_map *first = NULL, *last = NULL;
> +
> +	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, end - start);
> +	if (ret < 0)
> +		return ret;
> +
> +	/* Grab a reference of the first extent to copy data */
> +	if (start < pos) {
> +		first = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, end - start, 0);
> +		if (IS_ERR(first)) {
> +			ret = PTR_ERR(first);
> +			goto out2;
> +		}
> +	}

You're using 'end - start' at least twice here, maybe you could move
'len' out of the loop and use it for btrfs_delalloc_reserve_space() and
btrfs_get_extent() as well.

> +
> +	/* Grab a reference of the last extent to copy data */
> +	if (pos + count < end) {
> +		last = btrfs_get_extent(BTRFS_I(inode), NULL, 0, end - PAGE_SIZE, PAGE_SIZE, 0);
> +		if (IS_ERR(last)) {
> +			ret = PTR_ERR(last);
> +			goto out2;
> +		}
> +	}
> +
> +	lock_extent_bits(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
> +	while (done < count) {
> +		struct extent_map *em;
> +		struct dax_device *dax_dev;
> +		int offset = pos & (PAGE_SIZE - 1);
> +		u64 estart = round_down(pos, PAGE_SIZE);
> +		u64 elen = end - estart;
> +		size_t len = count - done;
> +		sector_t dstart;
> +		void *daddr;
> +		ssize_t maplen;
> +
> +		/* Read the current extent */
> +                em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, estart, elen, 0);

Space again.

> +		if (IS_ERR(em)) {
> +			ret = PTR_ERR(em);
> +			goto out;
> +		}
> +
> +		/* Get a new extent */
> +		ret = btrfs_get_extent_map_write(&em, NULL, inode, estart, elen);
> +		if (ret < 0)
> +			goto out;
> +
> +		dax_dev = fs_dax_get_by_bdev(em->bdev);
> +		/* Calculate start address start of destination extent */
> +		dstart = (get_start_sect(em->bdev) << 9) + em->block_start;
> +		maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart),
> +				PHYS_PFN(em->len), &daddr, NULL);
> +
> +		/* Copy front of extent page */
> +		if (offset)
> +			ret = copy_extent_page(first, daddr, estart);
> +
> +		/* Copy end of extent page */
> +		if ((pos + len > estart + PAGE_SIZE) && (pos + len < em->start + em->len))
> +			ret = copy_extent_page(last, daddr + em->len - PAGE_SIZE, em->start + em->len - PAGE_SIZE);
> +
> +		/* Copy the data from the iter */
> +		maplen = PFN_PHYS(maplen);
> +		maplen -= offset;
> +		ret = dax_copy_from_iter(dax_dev, dstart, daddr + offset, maplen, from);
> +		if (ret < 0)
> +			goto out;
> +		pos += ret;
> +		done += ret;
> +	}
> +out:

out_unlock?

> +	unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
> +	if (done) {
> +		btrfs_update_ordered_extent(inode, start,
> +				end - start, true);
> +		iocb->ki_pos += done;
> +		if (iocb->ki_pos > i_size_read(inode))
> +			i_size_write(inode, iocb->ki_pos);
> +	}
> +
> +	btrfs_delalloc_release_extents(BTRFS_I(inode), count, false);
> +out2:

out?

> +	if (count - done > 0)
> +		btrfs_delalloc_release_space(inode, data_reserved, pos,
> +				count - done, true);
> +	extent_changeset_free(data_reserved);
> +        return done ? done : ret;
> +
> +}
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index ef6ed93f44d1..29a3b12e6660 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1964,7 +1964,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>  	if (sync)
>  		atomic_inc(&BTRFS_I(inode)->sync_writers);
>  
> -	if (iocb->ki_flags & IOCB_DIRECT) {
> +	if (IS_DAX(inode)) {
> +		num_written = btrfs_file_dax_write(iocb, from);
> +	} else if (iocb->ki_flags & IOCB_DIRECT) {
>  		num_written = __btrfs_direct_write(iocb, from);
>  	} else {
>  		num_written = btrfs_buffered_write(iocb, from);
> 


-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/10] btrfs: Support for DAX devices
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
                   ` (10 preceding siblings ...)
  2018-12-05 13:03 ` [PATCH 00/10] btrfs: Support for DAX devices Qu Wenruo
@ 2018-12-05 13:57 ` Adam Borowski
  2018-12-05 21:37 ` Jeff Mahoney
  2018-12-06 10:07 ` Johannes Thumshirn
  13 siblings, 0 replies; 30+ messages in thread
From: Adam Borowski @ 2018-12-05 13:57 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs

On Wed, Dec 05, 2018 at 06:28:25AM -0600, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs.

Yay!

> I understand there have been previous attempts at it.  However, I wanted
> to make sure copy-on-write (COW) works on dax as well.

btrfs' usual use of CoW and DAX are thoroughly in conflict.

The very point of DAX is to have writes not go through the kernel, you
mmap the file then do all writes right to the pmem, flushing when needed
(without hitting the kernel) and having the processor+memory persist what
you wrote.

CoW via page faults are fine -- pmem is closer to memory than disk, and this
means the kernel will ask the filesystem for an extent to place the new page
in, copy the contents and let the process play with it.  But real btrfs CoW
would mean we'd need to page fault on ᴇᴠᴇʀʏ ꜱɪɴɢʟᴇ ᴡʀɪᴛᴇ.

Delaying CoW until the next commit doesn't help -- you'd need to store the
dirty page in DRAM then write it, which goes against the whole concept of
DAX.

Only way I see would be to CoW once then pretend the page is nodatacow until
the next commit, when we checksum it, add to the metadata trees, and mark
for CoWing on the next write.  Lots of complexity, and you still need to
copy the whole thing every commit (so no gain).

Ie, we're in nodatacow land.  CoW for metadata is fine.

> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

Per the above, it sounds like nodatacow (ie, "cow once") would be needed.

> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?

Same as for any other memory that's shared: when a new instance of sharing
is added (a snapshot/reflink in our case), you deny writes, causing a page
fault on the next attempt.  "pmem" is named "ᴘersistent ᴍᴇᴍory" for a
reason...

> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.

Might be more useful to use a bigger piece of the "disk" than 2G, it's not
in the danger area though.

Also note that it's utterly pointless to use any RAID modes; multi-dev
single is fine, DUP counts as RAID here.
* RAID0 is already done better in hardware (interleave)
* RAID1 would require hardware support, replication isn't easy
* RAID5/6 

What would make sense, is disabling dax for any files that are not marked as
nodatacow.  This way, unrelated files can still use checksums or
compression, while only files meant as a pmempool or otherwise by a
pmem-aware program would have dax writes (you can still give read-only pages
that CoW to DRAM).  This way we can have write dax for only a subset of
files, and full set of btrfs features for the rest.  Write dax is dangerous
for programs that have no specific support: the vast majority of
database-like programs rely on page-level atomicity while pmem gives you
cacheline/word atomicity only; torn writes mean data loss.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄⠀⠀⠀⠀ to the city of his birth to die.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 07/10] dax: export functions for use with btrfs
  2018-12-05 12:28 ` [PATCH 07/10] dax: export functions for use with btrfs Goldwyn Rodrigues
@ 2018-12-05 13:59   ` Johannes Thumshirn
  2018-12-05 14:52   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Johannes Thumshirn @ 2018-12-05 13:59 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs; +Cc: Goldwyn Rodrigues

On 05/12/2018 13:28, Goldwyn Rodrigues wrote:

[...]

> -static void *grab_mapping_entry(struct xa_state *xas,
> +void *grab_mapping_entry(struct xa_state *xas,
>  		struct address_space *mapping, unsigned long size_flag)
>  {
>  	unsigned long index = xas->xa_index;
> @@ -531,6 +532,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
>  	xas_unlock_irq(xas);
>  	return xa_mk_internal(VM_FAULT_FALLBACK);
>  }
> +EXPORT_SYMBOL(grab_mapping_entry);

dax_grab_mapping_entry() please.



-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 07/10] dax: export functions for use with btrfs
  2018-12-05 12:28 ` [PATCH 07/10] dax: export functions for use with btrfs Goldwyn Rodrigues
  2018-12-05 13:59   ` Johannes Thumshirn
@ 2018-12-05 14:52   ` Christoph Hellwig
  2018-12-06 11:46     ` Goldwyn Rodrigues
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2018-12-05 14:52 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, Goldwyn Rodrigues

If you want to export these at all they have to be EXPORT_SYMBOL_GPL.

But I'd really like to avoid seeing another duplicate DAX I/O path.
Please try to adopt the existing iomap-based infrastructure for your
needs first.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 01/10] btrfs: create a mount option for dax
  2018-12-05 12:43   ` Nikolay Borisov
@ 2018-12-05 14:59     ` Adam Borowski
  0 siblings, 0 replies; 30+ messages in thread
From: Adam Borowski @ 2018-12-05 14:59 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Goldwyn Rodrigues, linux-btrfs

On Wed, Dec 05, 2018 at 02:43:03PM +0200, Nikolay Borisov wrote:
> One question below though .
> 
> > +++ b/fs/btrfs/super.c
> > @@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> >  		case Opt_user_subvol_rm_allowed:
> >  			btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
> >  			break;
> > +#ifdef CONFIG_FS_DAX
> > +		case Opt_dax:
> > +			if (btrfs_super_num_devices(info->super_copy) > 1) {
> > +				btrfs_info(info,
> > +					   "dax not supported for multi-device btrfs partition\n");
> 
> What prevents supporting dax for multiple devices so long as all devices
> are dax?

As I mentioned in a separate mail, most profiles are either redundant
(RAID0), require hardware support (RAID1, DUP) or are impossible (RAID5,
RAID6).

But, "single" profile multi-device would be useful and actually provide
something other dax-supporting filesystems don't have: combining multiple
devices into one logical piece.

On the other hand, DUP profiles need to be banned.  In particular, the
filesystem you mount might have existing DUP block groups.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄⠀⠀⠀⠀ to the city of his birth to die.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/10] btrfs: Support for DAX devices
  2018-12-05 13:03 ` [PATCH 00/10] btrfs: Support for DAX devices Qu Wenruo
@ 2018-12-05 21:36   ` Jeff Mahoney
  0 siblings, 0 replies; 30+ messages in thread
From: Jeff Mahoney @ 2018-12-05 21:36 UTC (permalink / raw)
  To: Qu Wenruo, Goldwyn Rodrigues, linux-btrfs

On 12/5/18 8:03 AM, Qu Wenruo wrote:
> 
> 
> On 2018/12/5 下午8:28, Goldwyn Rodrigues wrote:
>> This is a support for DAX in btrfs. I understand there have been
>> previous attempts at it. However, I wanted to make sure copy-on-write
>> (COW) works on dax as well.
>>
>> Before I present this to the FS folks I wanted to run this through the
>> btrfs. Even though I wish, I cannot get it correct the first time
>> around :/.. Here are some questions for which I need suggestions:
>>
>> Questions:
>> 1. I have been unable to do checksumming for DAX devices. While
>> checksumming can be done for reads and writes, it is a problem when mmap
>> is involved because btrfs kernel module does not get back control after
>> an mmap() writes. Any ideas are appreciated, or we would have to set
>> nodatasum when dax is enabled.
> 
> I'm not familar with DAX, so it's completely possible I'm talking like
> an idiot.

The general idea is:

1) there is no page cache involved. read() and write() are like direct 
i/o writes in concept.  The user buffer is written directly (via what is 
essentially a specialized memcpy) to the NVDIMM.
2) for mmap, once the mapping is established and mapped, the file system 
is not involved.  The application writes directly to the memory as it 
would a normal mmap, except it's persistent.  All that's required to 
ensure persistence is a CPU cache flush.  The only way the file system 
is involved again is if some operation has occurred to reset the WP bit.

> If btrfs_page_mkwrite() can't provide enough control, then I have a
> crazy idea.

It can't, because it is only invoked on the page fault path and we want 
to try to limit those as much as possible.

> Forcing page fault for every mmap() read/write (completely disable page
> cache like DIO).
> So that we could get some control since we're informed to read the page
> and do some hacks there.
There's no way to force a page fault for every mmap read/write.  Even if 
there was, we wouldn't want that.  No user would turn that on when they 
can just make similar guarantees in their app (which are typically apps 
that do this already) and not pay any performance penalty.   The idea 
with DAX mmap is that the file system manages the namespace, space 
allocation, and permissions.  Otherwise we stay out of the way.

-Jeff
-- 
Jeff Mahoney
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/10] btrfs: Support for DAX devices
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
                   ` (11 preceding siblings ...)
  2018-12-05 13:57 ` Adam Borowski
@ 2018-12-05 21:37 ` Jeff Mahoney
  2018-12-06  7:40   ` Robert White
  2018-12-06 10:07 ` Johannes Thumshirn
  13 siblings, 1 reply; 30+ messages in thread
From: Jeff Mahoney @ 2018-12-05 21:37 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs

On 12/5/18 7:28 AM, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

Yep.  It has to be nodatasum, at least within the confines of datasum 
today.  DAX mmap writes are essentially in the same situation as with 
direct i/o when another thread modifies the buffer being submitted. 
Except rather than it being a race, it happens every time.  An 
alternative here could be to add the ability to mark a crc as unreliable 
and then go back and update them once the last DAX mmap reference is 
dropped on a range.  There's no reason to make this a requirement of the 
initial implementation, though.

> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?

It's the second question that's the hard part.  As Adam describes later, 
setting each pfn read-only will ensure page faults cause the remapping.

The high level idea that Jan Kara and I came up with in our conversation 
at Labs conf is pretty expensive.  We'd need to set a flag that pauses 
new page faults, set the WP bit on affected ranges, do the snapshot, 
commit, clear the flag, and wake up the waiting threads.  Neither of us 
had any concrete idea of how well that would perform and it still 
depends on finding a good way to resolve all open mmap ranges on a 
subvolume.  Perhaps using the address_space->private_list anchored on 
each root would work.

-Jeff

> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.
> 
> 
> [PATCH 01/10] btrfs: create a mount option for dax
> [PATCH 02/10] btrfs: basic dax read
> [PATCH 03/10] btrfs: dax: read zeros from holes
> [PATCH 04/10] Rename __endio_write_update_ordered() to
> [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
> [PATCH 06/10] btrfs: dax write support
> [PATCH 07/10] dax: export functions for use with btrfs
> [PATCH 08/10] btrfs: dax add read mmap path
> [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
> [PATCH 10/10] btrfs: dax mmap write
> 
>   fs/btrfs/Makefile   |    1
>   fs/btrfs/ctree.h    |   17 ++
>   fs/btrfs/dax.c      |  303 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>   fs/btrfs/file.c     |   29 ++++
>   fs/btrfs/inode.c    |   54 +++++----
>   fs/btrfs/ioctl.c    |    5
>   fs/btrfs/super.c    |   15 ++
>   fs/dax.c            |   35 ++++--
>   include/linux/dax.h |   16 ++
>   9 files changed, 430 insertions(+), 45 deletions(-)
> 
> 

-- 
Jeff Mahoney
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/10] btrfs: Support for DAX devices
  2018-12-05 21:37 ` Jeff Mahoney
@ 2018-12-06  7:40   ` Robert White
  0 siblings, 0 replies; 30+ messages in thread
From: Robert White @ 2018-12-06  7:40 UTC (permalink / raw)
  To: Jeff Mahoney, Goldwyn Rodrigues, linux-btrfs

On 12/5/18 9:37 PM, Jeff Mahoney wrote:
> The high level idea that Jan Kara and I came up with in our conversation 
> at Labs conf is pretty expensive.  We'd need to set a flag that pauses 
> new page faults, set the WP bit on affected ranges, do the snapshot, 
> commit, clear the flag, and wake up the waiting threads.  Neither of us 
> had any concrete idea of how well that would perform and it still 
> depends on finding a good way to resolve all open mmap ranges on a 
> subvolume.  Perhaps using the address_space->private_list anchored on 
> each root would work.

This is a potentially wild idea, so "grain of salt" and all that. I may 
misuse the exact wording.

So the essential problem of DAX is basically the opposite of 
data-deduplication. Instead of merging two duplicate data regions, you 
want to mark regions as at-risk while keeping the original content 
intact if there are snapshots in conflict.

So suppose you _require_ data checksums and data mode of "dup" or mirror 
or one of the other fault tolerant layouts.

By definition any block that gets written with content that it didn't 
have before will now have a bad checksum.

If the inode is flagged for direct IO that's an indication that the 
block has been updated.

At this point you really just need to do the opposite of deduplication, 
as in find/recover the original contents and assign/leave assigned those 
to the old/other snapshots, then compute the new checksum on the 
"original block" and assign it to the active subvolume.

So when a region is mapped for direct IO, and it's refcount is greater 
than one, and you get to a sync or close event, you "recover" the old 
contents into a new location and assign those to "all the other users". 
Now that original storage region has only one user, so on sync or close 
you fix its checksums on the cheap.

Instead of the new data being a small rock sitting over a large rug to 
make a lump, the new data is like a rock being slid under the rug to 
make a lump.

So the first write to an extent creates a burdensome copy to retain the 
old contents, but second and subsequent writes to the same extent only 
have the cost of an _eventual_ checksum of the original block list.

Maybe If the data isn't already duplicated then the write mapping or the 
DAX open or the setting of the S_DUP flag could force the file into an 
extent block that _is_ duplicated.

The mental leap required is that the new blocks don't need to belong to 
the new state being created. The new blocks can be associated to the 
snapshots since data copy is idempotent.

The side note is that it only ever matters if the usage count is greater 
than one, so at worst taking a snapshot, which is already a _little_ 
racy anyway, would/could trigger a semi-lightweight copy of any S_DAX files:

If S_DAX :
   If checksum invalid :
     copy data as-is and checksum, store in snapshot
   else : look for duplicate checksum
     if duplicate found :
       assign that extent to the snapshot
     else :
       If file opened for writing and has any mmaps for write :
         copy extent and assign to new snapshot.
       else :
         increment usage count and assign current block to snapshot

Anyway, I only know enough of the internals to be dangerous.

Since the real goal of mmap is speed during actual update, this idea is 
basically about amortizing the copy costs into the task of maintaining 
the snapshots instead of leaving them in the immediate hands of the 
time-critical updater.

The flush, unmmap, or close by the user, or a system-wide sync event, 
are also good points to expense the bookeeping time.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/10] btrfs: Support for DAX devices
  2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
                   ` (12 preceding siblings ...)
  2018-12-05 21:37 ` Jeff Mahoney
@ 2018-12-06 10:07 ` Johannes Thumshirn
  2018-12-06 11:47   ` Goldwyn Rodrigues
  13 siblings, 1 reply; 30+ messages in thread
From: Johannes Thumshirn @ 2018-12-06 10:07 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs

On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:

Hi Goldwyn,

I've thrown your patches (from your git tree) onto one of my pmem test
machines with this pmem config:

mayhem:~/:[0]# ndctl list
[
  {
    "dev":"namespace1.0",
    "mode":"fsdax",
    "map":"dev",
    "size":792721358848,
    "uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
    "raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
    "sector_size":512,
    "blockdev":"pmem1",
    "numa_node":1
  },
  {
    "dev":"namespace0.0",
    "mode":"fsdax",
    "map":"dev",
    "size":792721358848,
    "uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
    "raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
    "sector_size":4096,
    "blockdev":"pmem0",
    "numa_node":0
  }
]

Unfortunately I hit a btrfs_panic() with btrfs/002.
export TEST_DEV=/dev/pmem0
export SCRATCH_DEV=/dev/pmem1
export MOUNT_OPTIONS="-o dax"
./check
[...]
[  178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43
[  178.357044] BTRFS info (device pmem0): disk space caching is enabled
[  178.357047] BTRFS info (device pmem0): has skinny extents
[  178.360042] BTRFS info (device pmem0): enabling ssd optimizations
[  178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8
devid 1 transid 3 /dev/pmem1
[  178.505717] BTRFS info (device pmem1): disk space caching is enabled
[  178.513593] BTRFS info (device pmem1): has skinny extents
[  178.520384] BTRFS info (device pmem1): flagging fs with big metadata
feature
[  178.530997] BTRFS info (device pmem1): enabling ssd optimizations
[  178.538331] BTRFS info (device pmem1): creating UUID tree
[  178.587200] BTRFS critical (device pmem1): panic in
ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0
(errno=-17 Object already exists)
[  178.603129] ------------[ cut here ]------------
[  178.608667] kernel BUG at fs/btrfs/ordered-data.c:57!
[  178.614333] invalid opcode: 0000 [#1] SMP PTI
[  178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G
      E     4.20.0-rc5-default-btrfs-dax #920
[  178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS
SE5C620.86B.0D.01.0010.072020182008 07/20/2018
[  178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs]
[  178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba
39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06
00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0
[  178.667019] RSP: 0018:ffffa3e3674c7ba8 EFLAGS: 00010096
[  178.672684] RAX: 000000000000008f RBX: ffff9770c2ac5748 RCX:
0000000000000000
[  178.680254] RDX: ffff97711f9dee80 RSI: ffff97711f9d6868 RDI:
ffff97711f9d6868
[  178.687831] RBP: ffff97711d523000 R08: 0000000000000000 R09:
000000000000065a
[  178.695411] R10: 00000000000003ff R11: 0000000000000001 R12:
ffff97710d66da70
[  178.702993] R13: ffff9770c2ac5600 R14: 0000000000000000 R15:
ffff97710d66d9c0
[  178.710573] FS:  00007fe11ef90700(0000) GS:ffff97711f9c0000(0000)
knlGS:0000000000000000
[  178.719122] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  178.725380] CR2: 000000000156a000 CR3: 000000eb30dfc006 CR4:
00000000007606e0
[  178.732999] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  178.740574] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  178.748147] PKRU: 55555554
[  178.751297] Call Trace:
[  178.754230]  btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs]
[  178.760269]  btrfs_create_dio_extent+0x79/0xe0 [btrfs]
[  178.765930]  btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs]
[  178.771959]  btrfs_file_dax_write+0x1f8/0x4f0 [btrfs]
[  178.777508]  ? current_time+0x3f/0x70
[  178.781672]  btrfs_file_write_iter+0x384/0x580 [btrfs]
[  178.787265]  ? pipe_read+0x243/0x2a0
[  178.791298]  __vfs_write+0xee/0x170
[  178.795241]  vfs_write+0xad/0x1a0
[  178.799008]  ? vfs_read+0x111/0x130
[  178.802949]  ksys_write+0x42/0x90
[  178.806712]  do_syscall_64+0x5b/0x180
[  178.810829]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  178.816334] RIP: 0033:0x7fe11eabb3d0
[  178.820364] Code: 73 01 c3 48 8b 0d b8 ea 2b 00 f7 d8 64 89 01 48 83
c8 ff c3 66 0f 1f 44 00 00 83 3d b9 43 2c 00 00 75 10 b8 01 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 2e 90 01 00 48 89 04 24
[  178.840052] RSP: 002b:00007ffec969d978 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[  178.848100] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007fe11eabb3d0
[  178.855715] RDX: 0000000000000400 RSI: 000000000156a000 RDI:
0000000000000001
[  178.863326] RBP: 0000000000000400 R08: 0000000000000003 R09:
00007fe11ed7a698
[  178.870928] R10: 0000000010a8b550 R11: 0000000000000246 R12:
000000000156a000
[  178.878529] R13: 0000000000000000 R14: 000000000156a000 R15:
00007ffec969e9f1
[  178.886177] Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E)
nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) fscache(E) devlink(E)
ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E)
iptable_filter(E) ip_tables(E) x_tables(E) rpcrdma(E) sunrpc(E)
rdma_ucm(E) ib_uverbs(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E)
intel_rapl(E) libiscsi(E) af_packet(E) scsi_transport_iscsi(E)
skx_edac(E) configfs(E) x86_pkg_temp_thermal(E) intel_powerclamp(E)
iscsi_ibft(E) coretemp(E) iscsi_boot_sysfs(E) ipmi_ssif(E) kvm(E) msr(E)
i40iw(E) ib_core(E) ext4(E) nls_iso8859_1(E) nls_cp437(E) crc16(E)
mbcache(E) vfat(E) irqbypass(E) crc32_pclmul(E) ghash_clmulni_intel(E)
jbd2(E) joydev(E) fat(E) i40e(E) aesni_intel(E) iTCO_wdt(E) ptp(E)
aes_x86_64(E) iTCO_vendor_support(E) mei_me(E) crypto_simd(E) ipmi_si(E)
pps_core(E) lpc_ich(E) ioatdma(E) dax_pmem(E) ipmi_devintf(E) nd_pmem(E)
cryptd(E) glue_helper(E) pcspkr(E) mfd_core(E) i2c_i801(E) device_dax(E)
ipmi_msghandler(E) mei(E) nd_btt(E) dca(E)
[  178.886201]  pcc_cpufreq(E) acpi_pad(E) btrfs(E) libcrc32c(E) xor(E)
zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) hid_generic(E)
usbhid(E) sd_mod(E) sr_mod(E) cdrom(E) ast(E) i2c_algo_bit(E)
drm_kms_helper(E) syscopyarea(E) ahci(E) sysfillrect(E) xhci_pci(E)
sysimgblt(E) fb_sys_fops(E) libahci(E) xhci_hcd(E) ttm(E)
crc32c_intel(E) drm(E) libata(E) usbcore(E) wmi(E) nfit(E) libnvdimm(E)
button(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E)
scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E)



-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 07/10] dax: export functions for use with btrfs
  2018-12-05 14:52   ` Christoph Hellwig
@ 2018-12-06 11:46     ` Goldwyn Rodrigues
  2018-12-12  8:07       ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Goldwyn Rodrigues @ 2018-12-06 11:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-btrfs

On  6:52 05/12, Christoph Hellwig wrote:
> If you want to export these at all they have to be EXPORT_SYMBOL_GPL.
> 

Understood.

> But I'd really like to avoid seeing another duplicate DAX I/O path.
> Please try to adopt the existing iomap-based infrastructure for your
> needs first.

This is not worth with btrfs. With non-page aligned I/O on btrfs, we
need to copy the first/last page of the extents for CoW. So, we
would end up using the exported functions anyways. Believe me, I have
spent some time getting btrfs iomap compatible before giving up. The
problems are btrfs needs to carry a lot of information across
iomap_begin and iomap_end. While the added private variable helps in
this, it also needs hooks in bio_submit() functions for crc calculations
during direct writes.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/10] btrfs: Support for DAX devices
  2018-12-06 10:07 ` Johannes Thumshirn
@ 2018-12-06 11:47   ` Goldwyn Rodrigues
  0 siblings, 0 replies; 30+ messages in thread
From: Goldwyn Rodrigues @ 2018-12-06 11:47 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On 11:07 06/12, Johannes Thumshirn wrote:
> On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> > This is a support for DAX in btrfs. I understand there have been
> > previous attempts at it. However, I wanted to make sure copy-on-write
> > (COW) works on dax as well.
> > 
> > Before I present this to the FS folks I wanted to run this through the
> > btrfs. Even though I wish, I cannot get it correct the first time
> > around :/.. Here are some questions for which I need suggestions:
> 
> Hi Goldwyn,
> 
> I've thrown your patches (from your git tree) onto one of my pmem test
> machines with this pmem config:

Thanks. I will check on this. Ordered extents have been a pain to deal
with for me (though mainly because of my incorrect usage)

> 
> mayhem:~/:[0]# ndctl list
> [
>   {
>     "dev":"namespace1.0",
>     "mode":"fsdax",
>     "map":"dev",
>     "size":792721358848,
>     "uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
>     "raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
>     "sector_size":512,
>     "blockdev":"pmem1",
>     "numa_node":1
>   },
>   {
>     "dev":"namespace0.0",
>     "mode":"fsdax",
>     "map":"dev",
>     "size":792721358848,
>     "uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
>     "raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
>     "sector_size":4096,
>     "blockdev":"pmem0",
>     "numa_node":0
>   }
> ]
> 
> Unfortunately I hit a btrfs_panic() with btrfs/002.
> export TEST_DEV=/dev/pmem0
> export SCRATCH_DEV=/dev/pmem1
> export MOUNT_OPTIONS="-o dax"
> ./check
> [...]
> [  178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43
> [  178.357044] BTRFS info (device pmem0): disk space caching is enabled
> [  178.357047] BTRFS info (device pmem0): has skinny extents
> [  178.360042] BTRFS info (device pmem0): enabling ssd optimizations
> [  178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8
> devid 1 transid 3 /dev/pmem1
> [  178.505717] BTRFS info (device pmem1): disk space caching is enabled
> [  178.513593] BTRFS info (device pmem1): has skinny extents
> [  178.520384] BTRFS info (device pmem1): flagging fs with big metadata
> feature
> [  178.530997] BTRFS info (device pmem1): enabling ssd optimizations
> [  178.538331] BTRFS info (device pmem1): creating UUID tree
> [  178.587200] BTRFS critical (device pmem1): panic in
> ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0
> (errno=-17 Object already exists)
> [  178.603129] ------------[ cut here ]------------
> [  178.608667] kernel BUG at fs/btrfs/ordered-data.c:57!
> [  178.614333] invalid opcode: 0000 [#1] SMP PTI
> [  178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G
>       E     4.20.0-rc5-default-btrfs-dax #920
> [  178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS
> SE5C620.86B.0D.01.0010.072020182008 07/20/2018
> [  178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs]
> [  178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba
> 39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06
> 00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0
> [  178.667019] RSP: 0018:ffffa3e3674c7ba8 EFLAGS: 00010096
> [  178.672684] RAX: 000000000000008f RBX: ffff9770c2ac5748 RCX:
> 0000000000000000
> [  178.680254] RDX: ffff97711f9dee80 RSI: ffff97711f9d6868 RDI:
> ffff97711f9d6868
> [  178.687831] RBP: ffff97711d523000 R08: 0000000000000000 R09:
> 000000000000065a
> [  178.695411] R10: 00000000000003ff R11: 0000000000000001 R12:
> ffff97710d66da70
> [  178.702993] R13: ffff9770c2ac5600 R14: 0000000000000000 R15:
> ffff97710d66d9c0
> [  178.710573] FS:  00007fe11ef90700(0000) GS:ffff97711f9c0000(0000)
> knlGS:0000000000000000
> [  178.719122] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  178.725380] CR2: 000000000156a000 CR3: 000000eb30dfc006 CR4:
> 00000000007606e0
> [  178.732999] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [  178.740574] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [  178.748147] PKRU: 55555554
> [  178.751297] Call Trace:
> [  178.754230]  btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs]
> [  178.760269]  btrfs_create_dio_extent+0x79/0xe0 [btrfs]
> [  178.765930]  btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs]
> [  178.771959]  btrfs_file_dax_write+0x1f8/0x4f0 [btrfs]
> [  178.777508]  ? current_time+0x3f/0x70
> [  178.781672]  btrfs_file_write_iter+0x384/0x580 [btrfs]
> [  178.787265]  ? pipe_read+0x243/0x2a0
> [  178.791298]  __vfs_write+0xee/0x170
> [  178.795241]  vfs_write+0xad/0x1a0
> [  178.799008]  ? vfs_read+0x111/0x130
> [  178.802949]  ksys_write+0x42/0x90
> [  178.806712]  do_syscall_64+0x5b/0x180
> [  178.810829]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  178.816334] RIP: 0033:0x7fe11eabb3d0
> [  178.820364] Code: 73 01 c3 48 8b 0d b8 ea 2b 00 f7 d8 64 89 01 48 83
> c8 ff c3 66 0f 1f 44 00 00 83 3d b9 43 2c 00 00 75 10 b8 01 00 00 00 0f
> 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 2e 90 01 00 48 89 04 24
> [  178.840052] RSP: 002b:00007ffec969d978 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000001
> [  178.848100] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007fe11eabb3d0
> [  178.855715] RDX: 0000000000000400 RSI: 000000000156a000 RDI:
> 0000000000000001
> [  178.863326] RBP: 0000000000000400 R08: 0000000000000003 R09:
> 00007fe11ed7a698
> [  178.870928] R10: 0000000010a8b550 R11: 0000000000000246 R12:
> 000000000156a000
> [  178.878529] R13: 0000000000000000 R14: 000000000156a000 R15:
> 00007ffec969e9f1
> [  178.886177] Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E)
> nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) fscache(E) devlink(E)
> ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E)
> iptable_filter(E) ip_tables(E) x_tables(E) rpcrdma(E) sunrpc(E)
> rdma_ucm(E) ib_uverbs(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E)
> intel_rapl(E) libiscsi(E) af_packet(E) scsi_transport_iscsi(E)
> skx_edac(E) configfs(E) x86_pkg_temp_thermal(E) intel_powerclamp(E)
> iscsi_ibft(E) coretemp(E) iscsi_boot_sysfs(E) ipmi_ssif(E) kvm(E) msr(E)
> i40iw(E) ib_core(E) ext4(E) nls_iso8859_1(E) nls_cp437(E) crc16(E)
> mbcache(E) vfat(E) irqbypass(E) crc32_pclmul(E) ghash_clmulni_intel(E)
> jbd2(E) joydev(E) fat(E) i40e(E) aesni_intel(E) iTCO_wdt(E) ptp(E)
> aes_x86_64(E) iTCO_vendor_support(E) mei_me(E) crypto_simd(E) ipmi_si(E)
> pps_core(E) lpc_ich(E) ioatdma(E) dax_pmem(E) ipmi_devintf(E) nd_pmem(E)
> cryptd(E) glue_helper(E) pcspkr(E) mfd_core(E) i2c_i801(E) device_dax(E)
> ipmi_msghandler(E) mei(E) nd_btt(E) dca(E)
> [  178.886201]  pcc_cpufreq(E) acpi_pad(E) btrfs(E) libcrc32c(E) xor(E)
> zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) hid_generic(E)
> usbhid(E) sd_mod(E) sr_mod(E) cdrom(E) ast(E) i2c_algo_bit(E)
> drm_kms_helper(E) syscopyarea(E) ahci(E) sysfillrect(E) xhci_pci(E)
> sysimgblt(E) fb_sys_fops(E) libahci(E) xhci_hcd(E) ttm(E)
> crc32c_intel(E) drm(E) libata(E) usbcore(E) wmi(E) nfit(E) libnvdimm(E)
> button(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E)
> scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E)
> 
> 
> 
> -- 
> Johannes Thumshirn                            SUSE Labs Filesystems
> jthumshirn@suse.de                                +49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 07/10] dax: export functions for use with btrfs
  2018-12-06 11:46     ` Goldwyn Rodrigues
@ 2018-12-12  8:07       ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2018-12-12  8:07 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: Christoph Hellwig, linux-btrfs

On Thu, Dec 06, 2018 at 05:46:01AM -0600, Goldwyn Rodrigues wrote:
> This is not worth with btrfs. With non-page aligned I/O on btrfs, we
> need to copy the first/last page of the extents for CoW.

We'll need to do the same for XFS with reflink support anyway.  So
I'd rather handle it in common code if we can.

> So, we
> would end up using the exported functions anyways. Believe me, I have
> spent some time getting btrfs iomap compatible before giving up. The
> problems are btrfs needs to carry a lot of information across
> iomap_begin and iomap_end. While the added private variable helps in
> this, it also needs hooks in bio_submit() functions for crc calculations
> during direct writes.

Although we'd need to find a smarter way to handle this vs adding
more indirect calls everywhere.


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, back to index

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-05 12:28 [PATCH 00/10] btrfs: Support for DAX devices Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 01/10] btrfs: create a mount option for dax Goldwyn Rodrigues
2018-12-05 12:42   ` Johannes Thumshirn
2018-12-05 12:43   ` Nikolay Borisov
2018-12-05 14:59     ` Adam Borowski
2018-12-05 12:28 ` [PATCH 02/10] btrfs: basic dax read Goldwyn Rodrigues
2018-12-05 13:11   ` Nikolay Borisov
2018-12-05 13:22   ` Johannes Thumshirn
2018-12-05 12:28 ` [PATCH 03/10] btrfs: dax: read zeros from holes Goldwyn Rodrigues
2018-12-05 13:26   ` Nikolay Borisov
2018-12-05 12:28 ` [PATCH 04/10] Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
2018-12-05 13:35   ` Nikolay Borisov
2018-12-05 12:28 ` [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 06/10] btrfs: dax write support Goldwyn Rodrigues
2018-12-05 13:56   ` Johannes Thumshirn
2018-12-05 12:28 ` [PATCH 07/10] dax: export functions for use with btrfs Goldwyn Rodrigues
2018-12-05 13:59   ` Johannes Thumshirn
2018-12-05 14:52   ` Christoph Hellwig
2018-12-06 11:46     ` Goldwyn Rodrigues
2018-12-12  8:07       ` Christoph Hellwig
2018-12-05 12:28 ` [PATCH 08/10] btrfs: dax add read mmap path Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared Goldwyn Rodrigues
2018-12-05 12:28 ` [PATCH 10/10] btrfs: dax mmap write Goldwyn Rodrigues
2018-12-05 13:03 ` [PATCH 00/10] btrfs: Support for DAX devices Qu Wenruo
2018-12-05 21:36   ` Jeff Mahoney
2018-12-05 13:57 ` Adam Borowski
2018-12-05 21:37 ` Jeff Mahoney
2018-12-06  7:40   ` Robert White
2018-12-06 10:07 ` Johannes Thumshirn
2018-12-06 11:47   ` Goldwyn Rodrigues

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable: git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org linux-btrfs@archiver.kernel.org
	public-inbox-index linux-btrfs


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/ public-inbox