linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 01/15] btrfs: create a mount option for dax
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
@ 2019-03-26 19:02 ` Goldwyn Rodrigues
  2019-03-26 19:10   ` Matthew Wilcox
                     ` (2 more replies)
  2019-03-26 19:02 ` [PATCH 02/15] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
                   ` (14 subsequent siblings)
  15 siblings, 3 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This sets S_DAX in inode->i_flags, which can be used with
IS_DAX().

The dax option is restricted to non multi-device mounts.
dax interacts with the device directly instead of using bio, so
all bio-hooks which we use for multi-device cannot be performed
here. While regular read/writes could be manipulated with
RAID0/1, mmap() is still an issue.

Auto-setting free space tree, because dealing with free space
inode (specifically readpages) is a nightmare.
Auto-setting nodatasum because we don't get callback for writing
checksums after mmap()s.

Store the dax_device in fs_info which will be used in iomap code.
Question: Since we have only one dax device, I thought fs_info is
the best place. However, should it moved to btrfs_device?

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/disk-io.c |  4 ++++
 fs/btrfs/ioctl.c   |  5 ++++-
 fs/btrfs/super.c   | 26 ++++++++++++++++++++++++++
 4 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b3642367a595..8ca1c0d120f4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1067,6 +1067,7 @@ struct btrfs_fs_info {
 	u32 metadata_ratio;
 
 	void *bdev_holder;
+	struct dax_device *dax_dev;
 
 	/* private scrub information */
 	struct mutex scrub_lock;
@@ -1442,6 +1443,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
+#define BTRFS_MOUNT_DAX			(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6fe9197f6ee4..2bbb63b2fcff 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -16,6 +16,7 @@
 #include <linux/uuid.h>
 #include <linux/semaphore.h>
 #include <linux/error-injection.h>
+#include <linux/dax.h>
 #include <linux/crc32c.h>
 #include <linux/sched/mm.h>
 #include <asm/unaligned.h>
@@ -2805,6 +2806,8 @@ int open_ctree(struct super_block *sb,
 		goto fail_alloc;
 	}
 
+	fs_info->dax_dev = fs_dax_get_by_bdev(fs_devices->latest_bdev);
+
 	/*
 	 * We want to check superblock checksum, the type is stored inside.
 	 * Pass the whole disk block of size BTRFS_SUPER_INFO_SIZE (4k).
@@ -4043,6 +4046,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 #endif
 
 	btrfs_close_devices(fs_info->fs_devices);
+	fs_put_dax(fs_info->dax_dev);
 	btrfs_mapping_tree_free(&fs_info->mapping_tree);
 
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ec2d8919e7fb..e66426e7692d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
 	if (binode->flags & BTRFS_INODE_DIRSYNC)
 		new_fl |= S_DIRSYNC;
 
+	if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && S_ISREG(inode->i_mode))
+		new_fl |= S_DAX;
+
 	set_mask_bits(&inode->i_flags,
-		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
+		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | S_DAX,
 		      new_fl);
 }
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 120e4340792a..2d448b9d6004 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -326,6 +326,7 @@ enum {
 	Opt_treelog, Opt_notreelog,
 	Opt_usebackuproot,
 	Opt_user_subvol_rm_allowed,
+	Opt_dax,
 
 	/* Deprecated options */
 	Opt_alloc_start,
@@ -393,6 +394,7 @@ static const match_table_t tokens = {
 	{Opt_notreelog, "notreelog"},
 	{Opt_usebackuproot, "usebackuproot"},
 	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+	{Opt_dax, "dax"},
 
 	/* Deprecated options */
 	{Opt_alloc_start, "alloc_start=%s"},
@@ -745,6 +747,28 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		case Opt_user_subvol_rm_allowed:
 			btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
 			break;
+		case Opt_dax:
+#ifdef CONFIG_FS_DAX
+			if (btrfs_super_num_devices(info->super_copy) > 1) {
+				btrfs_info(info,
+					   "dax not supported for multi-device btrfs partition\n");
+				ret = -EOPNOTSUPP;
+				goto out;
+			}
+			btrfs_set_opt(info->mount_opt, DAX);
+			btrfs_warn(info, "DAX enabled. Warning: EXPERIMENTAL, use at your own risk\n");
+			btrfs_set_and_info(info, NODATASUM,
+					   "auto-setting nodatasum (dax)");
+			btrfs_clear_opt(info->mount_opt, SPACE_CACHE);
+			btrfs_set_and_info(info, FREE_SPACE_TREE,
+					"auto-setting free space tree (dax)");
+			break;
+#else
+			btrfs_err(info,
+				  "DAX option not supported\n");
+			ret = -EINVAL;
+			goto out;
+#endif
 		case Opt_enospc_debug:
 			btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
 			break;
@@ -1335,6 +1359,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",clear_cache");
 	if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
 		seq_puts(seq, ",user_subvol_rm_allowed");
+	if (btrfs_test_opt(info, DAX))
+		seq_puts(seq, ",dax");
 	if (btrfs_test_opt(info, ENOSPC_DEBUG))
 		seq_puts(seq, ",enospc_debug");
 	if (btrfs_test_opt(info, AUTO_DEFRAG))
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 02/15] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write()
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
  2019-03-26 19:02 ` [PATCH 01/15] btrfs: create a mount option for dax Goldwyn Rodrigues
@ 2019-03-26 19:02 ` Goldwyn Rodrigues
  2019-03-26 19:02 ` [PATCH 03/15] btrfs: basic dax read Goldwyn Rodrigues
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This makes btrfs_get_extent_map_write() independent of Direct
I/O code.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 40 +++++++++++++++++++++++++++-------------
 2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8ca1c0d120f4..9512f49262dd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3277,6 +3277,8 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location,
 			      struct btrfs_path *path);
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 			 struct btrfs_root *root, int *was_new);
+int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
+		struct inode *inode, u64 start, u64 len);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 				    struct page *page, size_t pg_offset,
 				    u64 start, u64 end, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 82fdda8ff5ab..80184d0c3b52 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7496,11 +7496,10 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em,
 	return 0;
 }
 
-static int btrfs_get_blocks_direct_write(struct extent_map **map,
-					 struct buffer_head *bh_result,
-					 struct inode *inode,
-					 struct btrfs_dio_data *dio_data,
-					 u64 start, u64 len)
+int btrfs_get_extent_map_write(struct extent_map **map,
+		struct buffer_head *bh,
+		struct inode *inode,
+		u64 start, u64 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em = *map;
@@ -7554,22 +7553,38 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 			 */
 			btrfs_free_reserved_data_space_noquota(inode, start,
 							       len);
-			goto skip_cow;
+			/* skip COW */
+			goto out;
 		}
 	}
 
 	/* this will cow the extent */
-	len = bh_result->b_size;
+	if (bh)
+		len = bh->b_size;
 	free_extent_map(em);
 	*map = em = btrfs_new_extent_direct(inode, start, len);
-	if (IS_ERR(em)) {
-		ret = PTR_ERR(em);
-		goto out;
-	}
+	if (IS_ERR(em))
+		return PTR_ERR(em);
+out:
+	return ret;
+}
 
+static int btrfs_get_blocks_direct_write(struct extent_map **map,
+					 struct buffer_head *bh_result,
+					 struct inode *inode,
+					 struct btrfs_dio_data *dio_data,
+					 u64 start, u64 len)
+{
+	int ret = 0;
+	struct extent_map *em;
+
+	ret = btrfs_get_extent_map_write(map, bh_result, inode,
+			start, len);
+	if (ret < 0)
+		return ret;
+	em = *map;
 	len = min(len, em->len - (start - em->start));
 
-skip_cow:
 	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
 		inode->i_blkbits;
 	bh_result->b_size = len;
@@ -7590,7 +7605,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	dio_data->reserve -= len;
 	dio_data->unsubmitted_oe_range_end = start + len;
 	current->journal_info = dio_data;
-out:
 	return ret;
 }
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 03/15] btrfs: basic dax read
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
  2019-03-26 19:02 ` [PATCH 01/15] btrfs: create a mount option for dax Goldwyn Rodrigues
  2019-03-26 19:02 ` [PATCH 02/15] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
@ 2019-03-26 19:02 ` Goldwyn Rodrigues
  2019-03-26 19:02 ` [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write Goldwyn Rodrigues
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Perform a basic read using iomap support. The btrfs_iomap_begin()
finds the extent at the position and fills the iomap data
structure with the values.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/Makefile |  1 +
 fs/btrfs/ctree.h  |  5 +++++
 fs/btrfs/dax.c    | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/file.c   | 12 +++++++++++-
 4 files changed, 66 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/dax.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index ca693dd554e9..1fa77b875ae9 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o
 
+btrfs-$(CONFIG_FS_DAX) += dax.o
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9512f49262dd..b7bbe5130a3b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3795,6 +3795,11 @@ int btrfs_reada_wait(void *handle);
 void btrfs_reada_detach(void *handle);
 int btree_readahead_hook(struct extent_buffer *eb, int err);
 
+#ifdef CONFIG_FS_DAX
+/* dax.c */
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+#endif /* CONFIG_FS_DAX */
+
 static inline int is_fstree(u64 rootid)
 {
 	if (rootid == BTRFS_FS_TREE_OBJECTID ||
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
new file mode 100644
index 000000000000..bf3d46b0acb6
--- /dev/null
+++ b/fs/btrfs/dax.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * DAX support for BTRFS
+ *
+ * Copyright (c) 2019  SUSE Linux
+ * Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
+ */
+
+#ifdef CONFIG_FS_DAX
+#include <linux/dax.h>
+#include <linux/iomap.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+
+static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
+		loff_t length, unsigned flags, struct iomap *iomap)
+{
+	struct extent_map *em;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
+	if (em->block_start == EXTENT_MAP_HOLE) {
+		iomap->type = IOMAP_HOLE;
+		return 0;
+	}
+	iomap->type = IOMAP_MAPPED;
+	iomap->bdev = em->bdev;
+	iomap->dax_dev = fs_info->dax_dev;
+	iomap->offset = em->start;
+	iomap->length = em->len;
+	iomap->addr = em->block_start;
+	return 0;
+}
+
+static const struct iomap_ops btrfs_iomap_ops = {
+	.iomap_begin		= btrfs_iomap_begin,
+};
+
+ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	ssize_t ret;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	inode_lock_shared(inode);
+	ret = dax_iomap_rw(iocb, to, &btrfs_iomap_ops);
+	inode_unlock_shared(inode);
+
+	return ret;
+}
+#endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 34fe8a58b0e9..b620f4e718b2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3288,9 +3288,19 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 	return generic_file_open(inode, filp);
 }
 
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+#ifdef CONFIG_FS_DAX
+	struct inode *inode = file_inode(iocb->ki_filp);
+	if (IS_DAX(file_inode(iocb->ki_filp)))
+		return btrfs_file_dax_read(iocb, to);
+#endif
+	return generic_file_read_iter(iocb, to);
+}
+
 const struct file_operations btrfs_file_operations = {
 	.llseek		= btrfs_file_llseek,
-	.read_iter      = generic_file_read_iter,
+	.read_iter      = btrfs_file_read_iter,
 	.splice_read	= generic_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.mmap		= btrfs_file_mmap,
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
                   ` (2 preceding siblings ...)
  2019-03-26 19:02 ` [PATCH 03/15] btrfs: basic dax read Goldwyn Rodrigues
@ 2019-03-26 19:02 ` Goldwyn Rodrigues
  2019-03-27 17:54   ` Darrick J. Wong
  2019-04-01  4:38   ` Dave Chinner
  2019-03-26 19:02 ` [PATCH 05/15] btrfs: return whether extent is nocow or not Goldwyn Rodrigues
                   ` (11 subsequent siblings)
  15 siblings, 2 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

The IOMAP_F_COW is a flag to notify dax that it needs to copy
the data from iomap->cow_addr to iomap->addr, if the start/end
of I/O are not page aligned.

This also introduces dax_to_dax_copy() which performs a copy
from one part of the device to another, to a maximum of one page.

Question: Using iomap.cow_addr == 0 means the CoW is to be copied
(or memset) from a hole. Would this be better handled through a flag?

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/dax.c              | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/iomap.h |  3 +++
 2 files changed, 39 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index ca0671d55aa6..e254535dd830 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1051,6 +1051,28 @@ static bool dax_range_is_aligned(struct block_device *bdev,
 	return true;
 }
 
+static void dax_to_dax_copy(struct iomap *iomap, loff_t pos, void *daddr,
+			    size_t len)
+{
+	loff_t blk_start, blk_pg;
+	void *saddr;
+	ssize_t map_len;
+
+	/* A zero address is a hole. */
+	if (iomap->cow_addr == 0) {
+		memset(daddr, 0, len);
+		return;
+	}
+
+	blk_start = iomap->cow_addr + pos - iomap->cow_pos;
+	blk_pg = round_down(blk_start, PAGE_SIZE);
+
+	map_len = dax_direct_access(iomap->dax_dev, PHYS_PFN(blk_pg), PAGE_SIZE,
+			&saddr, NULL);
+	saddr += blk_start - blk_pg;
+	memcpy(daddr, saddr, len);
+}
+
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int size)
@@ -1143,6 +1165,20 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			break;
 		}
 
+		if (iomap->flags & IOMAP_F_COW) {
+			loff_t pg_end = round_up(end, PAGE_SIZE);
+			/*
+			 * Copy the first part of the page
+			 * Note: we pass offset as length
+			 */
+			if (offset)
+				dax_to_dax_copy(iomap, pos - offset, kaddr, offset);
+
+			/* Copy the last part of the range */
+			if (end < pg_end)
+				dax_to_dax_copy(iomap, end, kaddr + offset + length, pg_end - end);
+		}
+
 		map_len = PFN_PHYS(map_len);
 		kaddr += offset;
 		map_len -= offset;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0fefb5455bda..391785de1428 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -35,6 +35,7 @@ struct vm_fault;
 #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
 #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
 #define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
+#define IOMAP_F_COW		0x08	/* cow before write */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests:
@@ -59,6 +60,8 @@ struct iomap {
 	u64			length;	/* length of mapping, bytes */
 	u16			type;	/* type of mapping */
 	u16			flags;	/* flags for mapping */
+	u64			cow_addr; /* read address to perform CoW */
+	loff_t			cow_pos; /* file offset of cow_addr */
 	struct block_device	*bdev;	/* block device for I/O */
 	struct dax_device	*dax_dev; /* dax_dev for dax operations */
 	void			*inline_data;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 05/15] btrfs: return whether extent is nocow or not
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
                   ` (3 preceding siblings ...)
  2019-03-26 19:02 ` [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write Goldwyn Rodrigues
@ 2019-03-26 19:02 ` Goldwyn Rodrigues
  2019-03-31 18:42   ` Nikolay Borisov
  2019-03-26 19:02 ` [PATCH 06/15] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

We require this to set the IOMAP_F_COW flag in
iomap structure, in the later patches.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h | 2 +-
 fs/btrfs/inode.c | 9 +++++++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b7bbe5130a3b..2c49d3c46170 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3278,7 +3278,7 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location,
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 			 struct btrfs_root *root, int *was_new);
 int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
-		struct inode *inode, u64 start, u64 len);
+		struct inode *inode, u64 start, u64 len, int *nocow);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 				    struct page *page, size_t pg_offset,
 				    u64 start, u64 end, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 80184d0c3b52..c8702e0b5e66 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7499,12 +7499,15 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em,
 int btrfs_get_extent_map_write(struct extent_map **map,
 		struct buffer_head *bh,
 		struct inode *inode,
-		u64 start, u64 len)
+		u64 start, u64 len, int *nocow)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em = *map;
 	int ret = 0;
 
+	if (nocow)
+		*nocow = 0;
+
 	/*
 	 * We don't allocate a new extent in the following cases
 	 *
@@ -7553,6 +7556,8 @@ int btrfs_get_extent_map_write(struct extent_map **map,
 			 */
 			btrfs_free_reserved_data_space_noquota(inode, start,
 							       len);
+			if (nocow)
+				*nocow = 1;
 			/* skip COW */
 			goto out;
 		}
@@ -7579,7 +7584,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	struct extent_map *em;
 
 	ret = btrfs_get_extent_map_write(map, bh_result, inode,
-			start, len);
+			start, len, NULL);
 	if (ret < 0)
 		return ret;
 	em = *map;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 06/15] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent()
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
                   ` (4 preceding siblings ...)
  2019-03-26 19:02 ` [PATCH 05/15] btrfs: return whether extent is nocow or not Goldwyn Rodrigues
@ 2019-03-26 19:02 ` Goldwyn Rodrigues
  2019-03-26 19:02 ` [PATCH 07/15] btrfs: add dax write support Goldwyn Rodrigues
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Since we will be using it in another part of the code, use a
better name to declare it non-static

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  7 +++++--
 fs/btrfs/inode.c | 14 +++++---------
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2c49d3c46170..a3543a4a063d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3280,8 +3280,11 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
 		struct inode *inode, u64 start, u64 len, int *nocow);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
-				    struct page *page, size_t pg_offset,
-				    u64 start, u64 end, int create);
+		struct page *page, size_t pg_offset,
+		u64 start, u64 end, int create);
+void btrfs_update_ordered_extent(struct inode *inode,
+		const u64 offset, const u64 bytes,
+		const bool uptodate);
 int btrfs_update_inode(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root,
 			      struct inode *inode);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c8702e0b5e66..f721fc1e3f7f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -98,10 +98,6 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
 				       u64 ram_bytes, int compress_type,
 				       int type);
 
-static void __endio_write_update_ordered(struct inode *inode,
-					 const u64 offset, const u64 bytes,
-					 const bool uptodate);
-
 /*
  * Cleanup all submitted ordered extents in specified range to handle errors
  * from the btrfs_run_delalloc_range() callback.
@@ -142,7 +138,7 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
 		bytes -= PAGE_SIZE;
 	}
 
-	return __endio_write_update_ordered(inode, offset, bytes, false);
+	return btrfs_update_ordered_extent(inode, offset, bytes, false);
 }
 
 static int btrfs_dirty_inode(struct inode *inode);
@@ -8085,7 +8081,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
 	bio_put(bio);
 }
 
-static void __endio_write_update_ordered(struct inode *inode,
+void btrfs_update_ordered_extent(struct inode *inode,
 					 const u64 offset, const u64 bytes,
 					 const bool uptodate)
 {
@@ -8138,7 +8134,7 @@ static void btrfs_endio_direct_write(struct bio *bio)
 	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio *dio_bio = dip->dio_bio;
 
-	__endio_write_update_ordered(dip->inode, dip->logical_offset,
+	btrfs_update_ordered_extent(dip->inode, dip->logical_offset,
 				     dip->bytes, !bio->bi_status);
 
 	kfree(dip);
@@ -8457,7 +8453,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		bio = NULL;
 	} else {
 		if (write)
-			__endio_write_update_ordered(inode,
+			btrfs_update_ordered_extent(inode,
 						file_offset,
 						dio_bio->bi_iter.bi_size,
 						false);
@@ -8597,7 +8593,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 			 */
 			if (dio_data.unsubmitted_oe_range_start <
 			    dio_data.unsubmitted_oe_range_end)
-				__endio_write_update_ordered(inode,
+				btrfs_update_ordered_extent(inode,
 					dio_data.unsubmitted_oe_range_start,
 					dio_data.unsubmitted_oe_range_end -
 					dio_data.unsubmitted_oe_range_start,
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 07/15] btrfs: add dax write support
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
                   ` (5 preceding siblings ...)
  2019-03-26 19:02 ` [PATCH 06/15] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
@ 2019-03-26 19:02 ` Goldwyn Rodrigues
  2019-03-28 14:53   ` Darrick J. Wong
  2019-03-26 19:02 ` [PATCH 08/15] dax: add dax_iomap_cow to copy a mmap page before writing Goldwyn Rodrigues
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

IOMAP_F_COW allows to inform the dax code, to first perform
a copy which are not page-aligned before performing the write.

A new struct btrfs_iomap is passed from iomap_begin() to
iomap_end(), which contains all the accounting and locking information
for CoW based writes.

For writing to a hole, iomap->cow_addr is set to zero. Would this
be better handled by a flag or can a valid filesystem block be at
offset zero of the device?

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |   6 +++
 fs/btrfs/dax.c   | 119 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/file.c  |   4 +-
 3 files changed, 124 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a3543a4a063d..3bcd2a4959c1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3801,6 +3801,12 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
 #ifdef CONFIG_FS_DAX
 /* dax.c */
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
+#else
+static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	return 0;
+}
 #endif /* CONFIG_FS_DAX */
 
 static inline int is_fstree(u64 rootid)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index bf3d46b0acb6..49619fe3f94f 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -9,30 +9,124 @@
 #ifdef CONFIG_FS_DAX
 #include <linux/dax.h>
 #include <linux/iomap.h>
+#include <linux/uio.h>
 #include "ctree.h"
 #include "btrfs_inode.h"
 
+struct btrfs_iomap {
+	u64 start;
+	u64 end;
+	int nocow;
+	struct extent_changeset *data_reserved;
+	struct extent_state *cached_state;
+};
+
 static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
 		loff_t length, unsigned flags, struct iomap *iomap)
 {
 	struct extent_map *em;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
 	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
+
+	if (flags & IOMAP_WRITE) {
+		int ret = 0, nocow;
+		struct extent_map *map = em;
+		struct btrfs_iomap *bi;
+
+		bi = kzalloc(sizeof(struct btrfs_iomap), GFP_NOFS);
+		if (!bi)
+			return -ENOMEM;
+
+		bi->start = round_down(pos, PAGE_SIZE);
+		bi->end = round_up(pos + length, PAGE_SIZE);
+
+		iomap->private = bi;
+
+		/* Wait for existing ordered extents in range to finish */
+		btrfs_wait_ordered_range(inode, bi->start, bi->end - bi->start);
+
+		lock_extent_bits(&BTRFS_I(inode)->io_tree, bi->start, bi->end, &bi->cached_state);
+
+		ret = btrfs_delalloc_reserve_space(inode, &bi->data_reserved,
+				bi->start, bi->end - bi->start);
+		if (ret) {
+			unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
+					&bi->cached_state);
+			kfree(bi);
+			return ret;
+		}
+
+		refcount_inc(&map->refs);
+		ret = btrfs_get_extent_map_write(&em, NULL,
+				inode, bi->start, bi->end - bi->start, &nocow);
+		if (ret) {
+			unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
+					&bi->cached_state);
+			btrfs_delalloc_release_space(inode,
+					bi->data_reserved, bi->start,
+					bi->end - bi->start, true);
+			extent_changeset_free(bi->data_reserved);
+			kfree(bi);
+			return ret;
+		}
+		if (!nocow) {
+			iomap->flags |= IOMAP_F_COW;
+			if (map->block_start != EXTENT_MAP_HOLE) {
+				iomap->cow_addr = map->block_start;
+				iomap->cow_pos = map->start;
+			}
+		} else {
+			bi->nocow = 1;
+		}
+		free_extent_map(map);
+	}
+
+	iomap->offset = em->start;
+	iomap->length = em->len;
+	iomap->bdev = em->bdev;
+	iomap->dax_dev = fs_info->dax_dev;
+
 	if (em->block_start == EXTENT_MAP_HOLE) {
 		iomap->type = IOMAP_HOLE;
 		return 0;
 	}
+
 	iomap->type = IOMAP_MAPPED;
-	iomap->bdev = em->bdev;
-	iomap->dax_dev = fs_info->dax_dev;
-	iomap->offset = em->start;
-	iomap->length = em->len;
 	iomap->addr = em->block_start;
 	return 0;
 }
 
+static int btrfs_iomap_end(struct inode *inode, loff_t pos,
+		loff_t length, ssize_t written, unsigned flags,
+		struct iomap *iomap)
+{
+	struct btrfs_iomap *bi = iomap->private;
+	u64 wend;
+
+	if (!bi)
+		return 0;
+
+	unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
+			&bi->cached_state);
+
+	wend = round_up(pos + written, PAGE_SIZE);
+	if (wend < bi->end) {
+		btrfs_delalloc_release_space(inode,
+				bi->data_reserved, wend,
+				bi->end - wend, true);
+	}
+
+	btrfs_update_ordered_extent(inode, bi->start, wend - bi->start, true);
+	btrfs_delalloc_release_extents(BTRFS_I(inode), wend - bi->start, false);
+	extent_changeset_free(bi->data_reserved);
+	kfree(bi);
+	return 0;
+}
+
 static const struct iomap_ops btrfs_iomap_ops = {
 	.iomap_begin		= btrfs_iomap_begin,
+	.iomap_end		= btrfs_iomap_end,
 };
 
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
@@ -46,4 +140,21 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
 
 	return ret;
 }
+
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *iter)
+{
+	ssize_t ret = 0;
+	u64 pos = iocb->ki_pos;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	ret = dax_iomap_rw(iocb, iter, &btrfs_iomap_ops);
+
+	if (ret > 0) {
+		pos += ret;
+		if (pos > i_size_read(inode))
+			i_size_write(inode, pos);
+		iocb->ki_pos = pos;
+	}
+	return ret;
+}
 #endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b620f4e718b2..3b320d0ab495 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1964,7 +1964,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (sync)
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
-	if (iocb->ki_flags & IOCB_DIRECT) {
+	if (IS_DAX(inode)) {
+		num_written = btrfs_file_dax_write(iocb, from);
+	} else if (iocb->ki_flags & IOCB_DIRECT) {
 		num_written = __btrfs_direct_write(iocb, from);
 	} else {
 		num_written = btrfs_buffered_write(iocb, from);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 08/15] dax: add dax_iomap_cow to copy a mmap page before writing
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
                   ` (6 preceding siblings ...)
  2019-03-26 19:02 ` [PATCH 07/15] btrfs: add dax write support Goldwyn Rodrigues
@ 2019-03-26 19:02 ` Goldwyn Rodrigues
  2019-03-28 15:41   ` Darrick J. Wong
  2019-03-26 19:02 ` [PATCH 09/15] btrfs: add dax mmap support Goldwyn Rodrigues
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

dax_iomap_cow copies a page before presenting for mmap.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/dax.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index e254535dd830..21ee3df6f02c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1269,6 +1269,33 @@ static bool dax_fault_is_synchronous(unsigned long flags,
 		&& (iomap->flags & IOMAP_F_DIRTY);
 }
 
+static int dax_iomap_cow(struct iomap *iomap, loff_t pos, pfn_t *pfn)
+{
+	void *daddr;
+	pgoff_t pgoff;
+	long rc;
+	int id;
+	sector_t sector;
+
+	pos = round_down(pos, PAGE_SIZE);
+
+	sector = round_down(iomap->addr + iomap->offset - pos, PAGE_SIZE) >> 9;
+	rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
+	if (rc)
+		return rc;
+
+	id = dax_read_lock();
+	rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &daddr, pfn);
+	if (rc < 0)
+		goto out;
+
+	dax_to_dax_copy(iomap, pos, daddr, PAGE_SIZE);
+
+out:
+	dax_read_unlock(id);
+	return rc;
+}
+
 static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 			       int *iomap_errp, const struct iomap_ops *ops)
 {
@@ -1372,7 +1399,11 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
 			major = VM_FAULT_MAJOR;
 		}
-		error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
+
+		if (iomap.flags & IOMAP_F_COW)
+			error = dax_iomap_cow(&iomap, pos, &pfn);
+		else
+			error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
 		if (error < 0)
 			goto error_finish_iomap;
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 09/15] btrfs: add dax mmap support
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
                   ` (7 preceding siblings ...)
  2019-03-26 19:02 ` [PATCH 08/15] dax: add dax_iomap_cow to copy a mmap page before writing Goldwyn Rodrigues
@ 2019-03-26 19:02 ` Goldwyn Rodrigues
  2019-03-28 15:45   ` Darrick J. Wong
  2019-03-26 19:02 ` [PATCH 10/15] btrfs: Add dax specific address_space_operations Goldwyn Rodrigues
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Add a new vm_operations struct btrfs_dax_vm_ops
specifically for dax files.

Since we will be removing(nulling) readpages/writepages for dax
return ENOEXEC only for non-dax files.

dax_insert_entry() looks ugly. Do you think we should break it
into dax_insert_cow_entry() and dax_insert_entry()?

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/dax.c   | 11 +++++++++++
 fs/btrfs/file.c  | 18 ++++++++++++++++--
 fs/dax.c         | 17 ++++++++++-------
 4 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3bcd2a4959c1..0e5060933bde 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3802,6 +3802,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
 /* dax.c */
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
 ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
+vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
 #else
 static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
 {
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 49619fe3f94f..927f962d1e88 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -157,4 +157,15 @@ ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *iter)
 	}
 	return ret;
 }
+
+vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
+{
+	vm_fault_t ret;
+	pfn_t pfn;
+	ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &pfn, NULL, &btrfs_iomap_ops);
+	if (ret & VM_FAULT_NEEDDSYNC)
+		ret = dax_finish_sync_fault(vmf, PE_SIZE_PTE, pfn);
+
+	return ret;
+}
 #endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 3b320d0ab495..196c8f37ff9d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2214,15 +2214,29 @@ static const struct vm_operations_struct btrfs_file_vm_ops = {
 	.page_mkwrite	= btrfs_page_mkwrite,
 };
 
+#ifdef CONFIG_FS_DAX
+static const struct vm_operations_struct btrfs_dax_vm_ops = {
+	.fault          = btrfs_dax_fault,
+	.page_mkwrite   = btrfs_dax_fault,
+	.pfn_mkwrite    = btrfs_dax_fault,
+};
+#else
+#define btrfs_dax_vm_ops btrfs_file_vm_ops
+#endif
+
 static int btrfs_file_mmap(struct file	*filp, struct vm_area_struct *vma)
 {
 	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = file_inode(filp);
 
-	if (!mapping->a_ops->readpage)
+	if (!IS_DAX(inode) && !mapping->a_ops->readpage)
 		return -ENOEXEC;
 
 	file_accessed(filp);
-	vma->vm_ops = &btrfs_file_vm_ops;
+	if (IS_DAX(inode))
+		vma->vm_ops = &btrfs_dax_vm_ops;
+	else
+		vma->vm_ops = &btrfs_file_vm_ops;
 
 	return 0;
 }
diff --git a/fs/dax.c b/fs/dax.c
index 21ee3df6f02c..41061da42771 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -708,14 +708,15 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
  */
 static void *dax_insert_entry(struct xa_state *xas,
 		struct address_space *mapping, struct vm_fault *vmf,
-		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+		void *entry, pfn_t pfn, unsigned long flags, bool dirty,
+		bool cow)
 {
 	void *new_entry = dax_make_entry(pfn, flags);
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
 		unsigned long index = xas->xa_index;
 		/* we are replacing a zero page with block mapping */
 		if (dax_is_pmd_entry(entry))
@@ -732,7 +733,7 @@ static void *dax_insert_entry(struct xa_state *xas,
 		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
 	}
 
-	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
 		 * Only swap our new entry into the page cache if the current
 		 * entry is a zero page or an empty entry.  If a normal PTE or
@@ -1031,7 +1032,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 	vm_fault_t ret;
 
 	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_ZERO_PAGE, false);
+			DAX_ZERO_PAGE, false, false);
 
 	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
 	trace_dax_load_hole(inode, vmf, ret);
@@ -1408,7 +1409,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 			goto error_finish_iomap;
 
 		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
-						 0, write && !sync);
+						 0, write && !sync,
+						(iomap.flags & IOMAP_F_COW) != 0);
 
 		/*
 		 * If we are doing synchronous page fault and inode needs fsync,
@@ -1487,7 +1489,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 
 	pfn = page_to_pfn_t(zero_page);
 	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_PMD | DAX_ZERO_PAGE, false);
+			DAX_PMD | DAX_ZERO_PAGE, false, false);
 
 	ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
 	if (!pmd_none(*(vmf->pmd))) {
@@ -1610,7 +1612,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 			goto finish_iomap;
 
 		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
-						DAX_PMD, write && !sync);
+						DAX_PMD, write && !sync,
+						false);
 
 		/*
 		 * If we are doing synchronous page fault and inode needs fsync,
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 10/15] btrfs: Add dax specific address_space_operations
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
                   ` (8 preceding siblings ...)
  2019-03-26 19:02 ` [PATCH 09/15] btrfs: add dax mmap support Goldwyn Rodrigues
@ 2019-03-26 19:02 ` Goldwyn Rodrigues
  2019-03-26 19:02 ` [PATCH 11/15] fs: dedup file range to use a compare function Goldwyn Rodrigues
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/inode.c | 34 +++++++++++++++++++++++++++++++---
 1 file changed, 31 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f721fc1e3f7f..21780ea14e5a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -28,6 +28,7 @@
 #include <linux/magic.h>
 #include <linux/iversion.h>
 #include <linux/swap.h>
+#include <linux/dax.h>
 #include <asm/unaligned.h>
 #include "ctree.h"
 #include "disk-io.h"
@@ -65,6 +66,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations;
 static const struct inode_operations btrfs_special_inode_operations;
 static const struct inode_operations btrfs_file_inode_operations;
 static const struct address_space_operations btrfs_aops;
+static const struct address_space_operations btrfs_dax_aops;
 static const struct file_operations btrfs_dir_file_operations;
 static const struct extent_io_ops btrfs_extent_io_ops;
 
@@ -3757,7 +3759,10 @@ static int btrfs_read_locked_inode(struct inode *inode,
 
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFREG:
-		inode->i_mapping->a_ops = &btrfs_aops;
+		if (btrfs_test_opt(fs_info, DAX))
+			inode->i_mapping->a_ops = &btrfs_dax_aops;
+		else
+			inode->i_mapping->a_ops = &btrfs_aops;
 		BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
 		inode->i_fop = &btrfs_file_operations;
 		inode->i_op = &btrfs_file_inode_operations;
@@ -3778,6 +3783,7 @@ static int btrfs_read_locked_inode(struct inode *inode,
 	}
 
 	btrfs_sync_inode_flags_to_i_flags(inode);
+
 	return 0;
 }
 
@@ -6538,7 +6544,10 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry,
 	*/
 	inode->i_fop = &btrfs_file_operations;
 	inode->i_op = &btrfs_file_inode_operations;
-	inode->i_mapping->a_ops = &btrfs_aops;
+	if (IS_DAX(inode) && S_ISREG(mode))
+		inode->i_mapping->a_ops = &btrfs_dax_aops;
+	else
+		inode->i_mapping->a_ops = &btrfs_aops;
 
 	err = btrfs_init_inode_security(trans, inode, dir, &dentry->d_name);
 	if (err)
@@ -8665,6 +8674,15 @@ static int btrfs_writepages(struct address_space *mapping,
 	return extent_writepages(mapping, wbc);
 }
 
+static int btrfs_dax_writepages(struct address_space *mapping,
+			    struct writeback_control *wbc)
+{
+	struct inode *inode = mapping->host;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	return dax_writeback_mapping_range(mapping, fs_info->fs_devices->latest_bdev,
+			wbc);
+}
+
 static int
 btrfs_readpages(struct file *file, struct address_space *mapping,
 		struct list_head *pages, unsigned nr_pages)
@@ -10436,7 +10454,10 @@ static int btrfs_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 	inode->i_fop = &btrfs_file_operations;
 	inode->i_op = &btrfs_file_inode_operations;
 
-	inode->i_mapping->a_ops = &btrfs_aops;
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &btrfs_dax_aops;
+	else
+		inode->i_mapping->a_ops = &btrfs_aops;
 	BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
 
 	ret = btrfs_init_inode_security(trans, inode, dir, NULL);
@@ -10892,6 +10913,13 @@ static const struct address_space_operations btrfs_aops = {
 	.swap_deactivate = btrfs_swap_deactivate,
 };
 
+static const struct address_space_operations btrfs_dax_aops = {
+	.writepages             = btrfs_dax_writepages,
+	.direct_IO              = noop_direct_IO,
+	.set_page_dirty         = noop_set_page_dirty,
+	.invalidatepage         = noop_invalidatepage,
+};
+
 static const struct inode_operations btrfs_file_inode_operations = {
 	.getattr	= btrfs_getattr,
 	.setattr	= btrfs_setattr,
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 11/15] fs: dedup file range to use a compare function
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
                   ` (9 preceding siblings ...)
  2019-03-26 19:02 ` [PATCH 10/15] btrfs: Add dax specific address_space_operations Goldwyn Rodrigues
@ 2019-03-26 19:02 ` Goldwyn Rodrigues
  2019-03-28 17:04   ` Darrick J. Wong
  2019-03-26 19:02 ` [PATCH 12/15] btrfs: trace functions for btrfs_iomap_begin/end Goldwyn Rodrigues
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

With dax we cannot deal with readpage() etc. So, we create a
funciton callback to perform the file data comparison and pass
it to generic_remap_file_range_prep() so it can use iomap-based
functions.

This may not be the best way to solve this. Suggestions welcome.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h     |  3 +++
 fs/btrfs/dax.c       |  7 +++++++
 fs/btrfs/ioctl.c     | 13 +++++++++++-
 fs/dax.c             | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/file.c      |  2 +-
 fs/read_write.c      |  9 +++++---
 fs/xfs/xfs_reflink.c |  2 +-
 include/linux/dax.h  |  2 ++
 include/linux/fs.h   |  4 +++-
 9 files changed, 93 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0e5060933bde..750f9c70fabe 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3803,6 +3803,9 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
 ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
 vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
+int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
+		struct inode *dest, loff_t destoff, loff_t len,
+		bool *is_same);
 #else
 static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
 {
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 927f962d1e88..9488cae0f8b4 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -168,4 +168,11 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
 
 	return ret;
 }
+
+int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
+		struct inode *dest, loff_t destoff, loff_t len,
+		bool *is_same)
+{
+	return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, &btrfs_iomap_ops);
+}
 #endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e66426e7692d..2e5137b01561 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3990,8 +3990,19 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	if (ret < 0)
 		goto out_unlock;
 
+#ifdef CONFIG_FS_DAX
+	if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out)))
+		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
+				pos_out, len, remap_flags,
+				btrfs_dax_file_range_compare);
+	else
+		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
+				pos_out, len, remap_flags, NULL);
+#else
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-					    len, remap_flags);
+						    len, remap_flags, NULL);
+#endif
+
 	if (ret < 0 || *len == 0)
 		goto out_unlock;
 
diff --git a/fs/dax.c b/fs/dax.c
index 41061da42771..18998c5ee27a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1775,3 +1775,61 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 	return dax_insert_pfn_mkwrite(vmf, pfn, order);
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+
+int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
+		loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops)
+{
+	void *saddr, *daddr;
+	struct iomap s_iomap = {0};
+	struct iomap d_iomap = {0};
+	loff_t dstart, sstart;
+	bool same = true;
+	loff_t cmp_len, l;
+	int id, ret = 0;
+
+	id = dax_read_lock();
+	while (len) {
+		ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap);
+		if (ret < 0) {
+			if (ops->iomap_end)
+				ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
+			return ret;
+		}
+		cmp_len = len;
+		if (cmp_len > s_iomap.offset + s_iomap.length - srcoff)
+			cmp_len = s_iomap.offset + s_iomap.length - srcoff;
+
+		ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &d_iomap);
+		if (ret < 0) {
+			if (ops->iomap_end) {
+				ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
+				ops->iomap_end(dest, destoff, len, ret, 0, &d_iomap);
+			}
+			return ret;
+		}
+		if (cmp_len > d_iomap.offset + d_iomap.length - destoff)
+			cmp_len = d_iomap.offset + d_iomap.length - destoff;
+
+
+		sstart = (get_start_sect(s_iomap.bdev) << 9) + s_iomap.addr + (srcoff - s_iomap.offset);
+		l = dax_direct_access(s_iomap.dax_dev, PHYS_PFN(sstart), PHYS_PFN(cmp_len), &saddr, NULL);
+		dstart = (get_start_sect(d_iomap.bdev) << 9) + d_iomap.addr + (destoff - d_iomap.offset);
+		l = dax_direct_access(d_iomap.dax_dev, PHYS_PFN(dstart), PHYS_PFN(cmp_len), &daddr, NULL);
+		same = !memcmp(saddr, daddr, cmp_len);
+		if (!same)
+			break;
+		len -= cmp_len;
+		srcoff += cmp_len;
+		destoff += cmp_len;
+
+		if (ops->iomap_end) {
+			ret = ops->iomap_end(src, srcoff, len, 0, 0, &s_iomap);
+			ret = ops->iomap_end(dest, destoff, len, 0, 0, &d_iomap);
+		}
+	}
+	dax_read_unlock(id);
+	*is_same = same;
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dax_file_range_compare);
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index d640c5f8a85d..6bf3e8fbb016 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2558,7 +2558,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
 		goto out_unlock;
 
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			&len, remap_flags);
+			&len, remap_flags, NULL);
 	if (ret < 0 || len == 0)
 		goto out_unlock;
 
diff --git a/fs/read_write.c b/fs/read_write.c
index 177ccc3d405a..da521a221213 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1855,7 +1855,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
  */
 int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
-				  loff_t *len, unsigned int remap_flags)
+				  loff_t *len, unsigned int remap_flags, compare_range_t compare)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
@@ -1914,8 +1914,11 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	 */
 	if (remap_flags & REMAP_FILE_DEDUP) {
 		bool		is_same = false;
-
-		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
+		if (compare)
+			ret = compare(inode_in, pos_in,
+				inode_out, pos_out, *len, &is_same);
+		else
+			ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
 				inode_out, pos_out, *len, &is_same);
 		if (ret)
 			return ret;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 680ae7662a78..8907c7aa3f19 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1350,7 +1350,7 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			len, remap_flags);
+			len, remap_flags, NULL);
 	if (ret < 0 || *len == 0)
 		goto out_unlock;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0dd316a74a29..a11bc7b1f526 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -157,6 +157,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
+int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
+                loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops);
 
 #ifdef CONFIG_FS_DAX
 int __dax_zero_page_range(struct block_device *bdev,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8b42df09b04c..22fe4324b22e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1880,10 +1880,12 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+typedef int (compare_range_t)(struct inode *src, loff_t srcpos, struct inode *dest,
+		loff_t destpos, loff_t len, bool *is_same);
 extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 					 struct file *file_out, loff_t pos_out,
 					 loff_t *count,
-					 unsigned int remap_flags);
+					 unsigned int remap_flags, compare_range_t cmp);
 extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
 				  loff_t len, unsigned int remap_flags);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 12/15] btrfs: trace functions for btrfs_iomap_begin/end
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
                   ` (10 preceding siblings ...)
  2019-03-26 19:02 ` [PATCH 11/15] fs: dedup file range to use a compare function Goldwyn Rodrigues
@ 2019-03-26 19:02 ` Goldwyn Rodrigues
  2019-03-26 19:02 ` [PATCH 13/15] btrfs: handle dax page zeroing Goldwyn Rodrigues
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This is for debug purposes only and can be skipped.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/dax.c               |  3 +++
 include/trace/events/btrfs.h | 56 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 9488cae0f8b4..7900b5773829 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -27,6 +27,8 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
 	struct extent_map *em;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 
+	trace_btrfs_iomap_begin(inode, pos, length, flags);
+
 	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
 
 	if (flags & IOMAP_WRITE) {
@@ -103,6 +105,7 @@ static int btrfs_iomap_end(struct inode *inode, loff_t pos,
 {
 	struct btrfs_iomap *bi = iomap->private;
 	u64 wend;
+	trace_btrfs_iomap_end(inode, pos, length, written, flags);
 
 	if (!bi)
 		return 0;
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index ab1cc33adbac..8779e5789a7c 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1850,6 +1850,62 @@ DEFINE_EVENT(btrfs__block_group, btrfs_skip_unused_block_group,
 	TP_ARGS(bg_cache)
 );
 
+TRACE_EVENT(btrfs_iomap_begin,
+
+	TP_PROTO(const struct inode *inode, loff_t pos, loff_t length, int flags),
+
+	TP_ARGS(inode, pos, length, flags),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	ino		)
+		__field(	u64,	pos		)
+		__field(	u64,	length		)
+		__field(	int,    flags		)
+	),
+
+	TP_fast_assign_btrfs(btrfs_sb(inode->i_sb),
+		__entry->ino		= btrfs_ino(BTRFS_I(inode));
+		__entry->pos		= pos;
+		__entry->length		= length;
+		__entry->flags		= flags;
+	),
+
+	TP_printk_btrfs("ino=%llu pos=%llu len=%llu flags=0x%x",
+		  __entry->ino,
+		  __entry->pos,
+		  __entry->length,
+		  __entry->flags)
+);
+
+TRACE_EVENT(btrfs_iomap_end,
+
+	TP_PROTO(const struct inode *inode, loff_t pos, loff_t length, loff_t written, int flags),
+
+	TP_ARGS(inode, pos, length, written, flags),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	ino		)
+		__field(	u64,	pos		)
+		__field(	u64,	length		)
+		__field(	u64,	written		)
+		__field(	int,    flags		)
+	),
+
+	TP_fast_assign_btrfs(btrfs_sb(inode->i_sb),
+		__entry->ino		= btrfs_ino(BTRFS_I(inode));
+		__entry->pos		= pos;
+		__entry->length		= length;
+		__entry->written		= written;
+		__entry->flags		= flags;
+	),
+
+	TP_printk_btrfs("ino=%llu pos=%llu len=%llu written=%llu flags=0x%x",
+		  __entry->ino,
+		  __entry->pos,
+		  __entry->length,
+		  __entry->written,
+		  __entry->flags)
+);
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 13/15] btrfs: handle dax page zeroing
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
                   ` (11 preceding siblings ...)
  2019-03-26 19:02 ` [PATCH 12/15] btrfs: trace functions for btrfs_iomap_begin/end Goldwyn Rodrigues
@ 2019-03-26 19:02 ` Goldwyn Rodrigues
  2019-03-26 19:03 ` [PATCH 14/15] btrfs: Disable dax-based defrag and send Goldwyn Rodrigues
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs_dax_zero_block() zeros part of the page, either from the
front or the regular rest of the block.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h      |  1 +
 fs/btrfs/dax.c        | 29 +++++++++++++++++++++++++++--
 fs/btrfs/inode.c      |  4 ++++
 fs/dax.c              | 17 ++++++++++++-----
 fs/iomap.c            |  9 +--------
 include/linux/dax.h   | 11 +++++------
 include/linux/iomap.h |  6 ++++++
 7 files changed, 56 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 750f9c70fabe..21068dc4a95a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3806,6 +3806,7 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
 int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
 		struct inode *dest, loff_t destoff, loff_t len,
 		bool *is_same);
+int btrfs_dax_zero_block(struct inode *inode, loff_t from, loff_t len, bool front);
 #else
 static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
 {
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 7900b5773829..d73945d50b88 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -31,7 +31,7 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
 
 	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
 
-	if (flags & IOMAP_WRITE) {
+	if (flags & (IOMAP_WRITE | IOMAP_ZERO)) {
 		int ret = 0, nocow;
 		struct extent_map *map = em;
 		struct btrfs_iomap *bi;
@@ -89,7 +89,8 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
 	iomap->bdev = em->bdev;
 	iomap->dax_dev = fs_info->dax_dev;
 
-	if (em->block_start == EXTENT_MAP_HOLE) {
+	if (em->block_start == EXTENT_MAP_HOLE ||
+			em->flags == EXTENT_FLAG_FILLING) {
 		iomap->type = IOMAP_HOLE;
 		return 0;
 	}
@@ -178,4 +179,28 @@ int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
 {
 	return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, &btrfs_iomap_ops);
 }
+
+/*
+ * zero a part of the page only. This should CoW (via iomap_begin) if required
+ */
+int btrfs_dax_zero_block(struct inode *inode, loff_t from, loff_t len, bool front)
+{
+	loff_t start = round_down(from, PAGE_SIZE);
+	loff_t end = round_up(from, PAGE_SIZE);
+	loff_t offset = from;
+	int ret = 0;
+
+	if (front) {
+		len = from - start;
+		offset = start;
+	} else	{
+		if (!len)
+			len = end - from;
+	}
+
+	if (len)
+		ret = iomap_zero_range(inode, offset, len, NULL, &btrfs_iomap_ops);
+
+	return (ret < 0) ? ret : 0;
+}
 #endif /* CONFIG_FS_DAX */
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 21780ea14e5a..5350e5f23728 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4833,6 +4833,10 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 	    (!len || IS_ALIGNED(len, blocksize)))
 		goto out;
 
+#ifdef CONFIG_FS_DAX
+	if (IS_DAX(inode))
+		return btrfs_dax_zero_block(inode, from, len, front);
+#endif
 	block_start = round_down(from, blocksize);
 	block_end = block_start + blocksize - 1;
 
diff --git a/fs/dax.c b/fs/dax.c
index 18998c5ee27a..93146142bb00 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1068,17 +1068,21 @@ static void dax_to_dax_copy(struct iomap *iomap, loff_t pos, void *daddr,
 	blk_start = iomap->cow_addr + pos - iomap->cow_pos;
 	blk_pg = round_down(blk_start, PAGE_SIZE);
 
-	map_len = dax_direct_access(iomap->dax_dev, PHYS_PFN(blk_pg), PAGE_SIZE,
+	map_len = dax_direct_access(iomap->dax_dev, PHYS_PFN(blk_pg), 1,
 			&saddr, NULL);
 	saddr += blk_start - blk_pg;
 	memcpy(daddr, saddr, len);
 }
 
-int __dax_zero_page_range(struct block_device *bdev,
-		struct dax_device *dax_dev, sector_t sector,
-		unsigned int offset, unsigned int size)
+int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
+			  unsigned int offset, unsigned int size)
 {
-	if (dax_range_is_aligned(bdev, offset, size)) {
+	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
+	struct block_device *bdev = iomap->bdev;
+	struct dax_device *dax_dev = iomap->dax_dev;
+
+	if (!(iomap->flags & IOMAP_F_COW) &&
+	    dax_range_is_aligned(bdev, offset, size)) {
 		sector_t start_sector = sector + (offset >> 9);
 
 		return blkdev_issue_zeroout(bdev, start_sector,
@@ -1098,6 +1102,9 @@ int __dax_zero_page_range(struct block_device *bdev,
 			dax_read_unlock(id);
 			return rc;
 		}
+		if (iomap->flags & IOMAP_F_COW)
+			dax_to_dax_copy(iomap, pos & PAGE_MASK,
+					kaddr, PAGE_SIZE);
 		memset(kaddr + offset, 0, size);
 		dax_flush(dax_dev, kaddr + offset, size);
 		dax_read_unlock(id);
diff --git a/fs/iomap.c b/fs/iomap.c
index abdd18e404f8..90698c854883 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -98,12 +98,6 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	return written ? written : ret;
 }
 
-static sector_t
-iomap_sector(struct iomap *iomap, loff_t pos)
-{
-	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
-}
-
 static struct iomap_page *
 iomap_page_create(struct inode *inode, struct page *page)
 {
@@ -990,8 +984,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
 static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
 		struct iomap *iomap)
 {
-	return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
-			iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
+	return __dax_zero_page_range(iomap, pos, offset, bytes);
 }
 
 static loff_t
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a11bc7b1f526..892c478d7073 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -9,6 +9,7 @@
 
 typedef unsigned long dax_entry_t;
 
+struct iomap;
 struct iomap_ops;
 struct dax_device;
 struct dax_operations {
@@ -161,13 +162,11 @@ int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
                 loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops);
 
 #ifdef CONFIG_FS_DAX
-int __dax_zero_page_range(struct block_device *bdev,
-		struct dax_device *dax_dev, sector_t sector,
-		unsigned int offset, unsigned int length);
+int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
+		unsigned int offset, unsigned int size);
 #else
-static inline int __dax_zero_page_range(struct block_device *bdev,
-		struct dax_device *dax_dev, sector_t sector,
-		unsigned int offset, unsigned int length)
+static inline int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
+		unsigned int offset, unsigned int size)
 {
 	return -ENXIO;
 }
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 391785de1428..e5a1b2a1962d 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -7,6 +7,7 @@
 #include <linux/mm.h>
 #include <linux/types.h>
 #include <linux/mm_types.h>
+#include <linux/blkdev.h>
 
 struct address_space;
 struct fiemap_extent_info;
@@ -122,6 +123,11 @@ static inline struct iomap_page *to_iomap_page(struct page *page)
 	return NULL;
 }
 
+static inline sector_t iomap_sector(struct iomap *iomap, loff_t pos)
+{
+	        return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
+}
+
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
 int iomap_readpage(struct page *page, const struct iomap_ops *ops);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 14/15] btrfs: Disable dax-based defrag and send
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
                   ` (12 preceding siblings ...)
  2019-03-26 19:02 ` [PATCH 13/15] btrfs: handle dax page zeroing Goldwyn Rodrigues
@ 2019-03-26 19:03 ` Goldwyn Rodrigues
  2019-03-26 19:03 ` [PATCH 15/15] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues
  2019-03-26 19:09 ` [PATCH v2 00/15] btrfs dax support Goldwyn Rodrigues
  15 siblings, 0 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This is temporary, and a TODO.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ioctl.c | 13 +++++++++++++
 fs/btrfs/send.c  |  4 ++++
 2 files changed, 17 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2e5137b01561..f532a8df2026 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2980,6 +2980,12 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
 			goto out;
 		}
 
+		if (IS_DAX(inode)) {
+			btrfs_warn(root->fs_info, "File defrag is not supported with DAX");
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
+
 		if (argp) {
 			if (copy_from_user(range, argp,
 					   sizeof(*range))) {
@@ -4647,6 +4653,10 @@ static long btrfs_ioctl_balance(struct file *file, void __user *arg)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	/* send can be on a directory, so check super block instead */
+	if (btrfs_test_opt(fs_info, DAX))
+		return -EOPNOTSUPP;
+
 	ret = mnt_want_write_file(file);
 	if (ret)
 		return ret;
@@ -5499,6 +5509,9 @@ static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
 	struct btrfs_ioctl_send_args *arg;
 	int ret;
 
+	if (IS_DAX(file_inode(file)))
+		return -EOPNOTSUPP;
+
 	if (compat) {
 #if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
 		struct btrfs_ioctl_send_args_32 args32;
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 7ea2d6b1f170..9679fd54db86 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -6609,6 +6609,10 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 	int sort_clone_roots = 0;
 	int index;
 
+	/* send can be on a directory, so check super block instead */
+	if (btrfs_test_opt(fs_info, DAX))
+		return -EOPNOTSUPP;
+
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 15/15] btrfs: Writeprotect mmap pages on snapshot
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
                   ` (13 preceding siblings ...)
  2019-03-26 19:03 ` [PATCH 14/15] btrfs: Disable dax-based defrag and send Goldwyn Rodrigues
@ 2019-03-26 19:03 ` Goldwyn Rodrigues
  2019-03-28 15:48   ` Darrick J. Wong
  2019-03-26 19:09 ` [PATCH v2 00/15] btrfs dax support Goldwyn Rodrigues
  15 siblings, 1 reply; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Inorder to make sure mmap'd files don't change after snapshot,
writeprotect the mmap pages on snapshot. This is done by performing
a data writeback on the pages (which simply mark the pages are
wrprotected). This way if the user process tries to access the memory
we will get another fault and we can perform a CoW.

In order to accomplish this, we tag all CoW pages as
PAGECACHE_TAG_TOWRITE, and add the mmapd inode in delalloc_inodes.
During snapshot, it starts writeback of all delalloc'd inodes and
here we perform a data writeback. We don't want to keep the inodes
in delalloc_inodes until it umount (WARN_ON), so we remove it
during inode evictions.

This looks hackish. Other alternatives could be to create another
list for mmap'd files or rename delalloc_inodes to writeback_inodes.
Suggestions?

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  3 ++-
 fs/btrfs/dax.c   |  7 +++++++
 fs/btrfs/inode.c | 13 ++++++++++++-
 fs/dax.c         |  3 +++
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 21068dc4a95a..68a63d93556a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3252,7 +3252,8 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *new_root,
 			     struct btrfs_root *parent_root,
 			     u64 new_dirid);
- void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
+void btrfs_add_delalloc_inodes(struct btrfs_root *root, struct inode *inode);
+void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
 			       unsigned *bits);
 void btrfs_clear_delalloc_extent(struct inode *inode,
 				 struct extent_state *state, unsigned *bits);
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index d73945d50b88..bcb961242c74 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -166,10 +166,17 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
 {
 	vm_fault_t ret;
 	pfn_t pfn;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct btrfs_inode *binode = BTRFS_I(inode);
 	ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &pfn, NULL, &btrfs_iomap_ops);
 	if (ret & VM_FAULT_NEEDDSYNC)
 		ret = dax_finish_sync_fault(vmf, PE_SIZE_PTE, pfn);
 
+	/* Insert into delalloc so we get writeback calls on snapshots */
+	if (vmf->flags & FAULT_FLAG_WRITE &&
+			!test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
+		btrfs_add_delalloc_inodes(binode->root, inode);
+
 	return ret;
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5350e5f23728..3b72c1c96b34 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1713,7 +1713,7 @@ void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new,
 	spin_unlock(&BTRFS_I(inode)->lock);
 }
 
-static void btrfs_add_delalloc_inodes(struct btrfs_root *root,
+void btrfs_add_delalloc_inodes(struct btrfs_root *root,
 				      struct inode *inode)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -5358,12 +5358,17 @@ void btrfs_evict_inode(struct inode *inode)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_trans_handle *trans;
+	struct btrfs_inode *binode = BTRFS_I(inode);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_block_rsv *rsv;
 	int ret;
 
 	trace_btrfs_inode_evict(inode);
 
+	if (IS_DAX(inode)
+	    && test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
+		btrfs_del_delalloc_inode(root, binode);
+
 	if (!root) {
 		clear_inode(inode);
 		return;
@@ -8683,6 +8688,10 @@ static int btrfs_dax_writepages(struct address_space *mapping,
 {
 	struct inode *inode = mapping->host;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_inode *binode = BTRFS_I(inode);
+	if ((wbc->sync_mode == WB_SYNC_ALL) &&
+	    test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
+		btrfs_del_delalloc_inode(binode->root, binode);
 	return dax_writeback_mapping_range(mapping, fs_info->fs_devices->latest_bdev,
 			wbc);
 }
@@ -9981,6 +9990,8 @@ static void btrfs_run_delalloc_work(struct btrfs_work *work)
 	delalloc_work = container_of(work, struct btrfs_delalloc_work,
 				     work);
 	inode = delalloc_work->inode;
+	if (IS_DAX(inode))
+		filemap_fdatawrite(inode->i_mapping);
 	filemap_flush(inode->i_mapping);
 	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 				&BTRFS_I(inode)->runtime_flags))
diff --git a/fs/dax.c b/fs/dax.c
index 93146142bb00..c42e9cb486ef 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -753,6 +753,9 @@ static void *dax_insert_entry(struct xa_state *xas,
 	if (dirty)
 		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
 
+	if (cow)
+		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
 	xas_unlock_irq(xas);
 	return entry;
 }
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 00/15] btrfs dax support
       [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
                   ` (14 preceding siblings ...)
  2019-03-26 19:03 ` [PATCH 15/15] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues
@ 2019-03-26 19:09 ` Goldwyn Rodrigues
  2019-03-27 20:14   ` Adam Borowski
  15 siblings, 1 reply; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-26 19:09 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel

Sorry, messed up the subject the first time.

This patch set adds support for dax on the BTRFS filesystem.
In order to support for CoW for btrfs, there were changes which had to be
made to the dax handling. The important one is copying blocks into the
same dax device before using them.

I have some doubts: I have put them in patch headers of the individual
patches.

Git: https://github.com/goldwynr/linux/tree/btrfs-dax

Changes since V1:
- use iomap instead of redoing everything in btrfs
- support for mmap writeprotecting on snapshotting


 fs/btrfs/Makefile            |    1 
 fs/btrfs/ctree.h             |   32 +++++-
 fs/btrfs/dax.c               |  225 +++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/disk-io.c           |    4 
 fs/btrfs/file.c              |   34 +++++-
 fs/btrfs/inode.c             |  114 ++++++++++++++++-----
 fs/btrfs/ioctl.c             |   31 +++++
 fs/btrfs/send.c              |    4 
 fs/btrfs/super.c             |   26 ++++
 fs/dax.c                     |  164 ++++++++++++++++++++++++++++---
 fs/iomap.c                   |    9 -
 fs/ocfs2/file.c              |    2 
 fs/read_write.c              |    9 +
 fs/xfs/xfs_reflink.c         |    2 
 include/linux/dax.h          |   13 +-
 include/linux/fs.h           |    4 
 include/linux/iomap.h        |    9 +
 include/trace/events/btrfs.h |   56 ++++++++++
 18 files changed, 662 insertions(+), 77 deletions(-)

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/15] btrfs: create a mount option for dax
  2019-03-26 19:02 ` [PATCH 01/15] btrfs: create a mount option for dax Goldwyn Rodrigues
@ 2019-03-26 19:10   ` Matthew Wilcox
  2019-03-27 11:00     ` Goldwyn Rodrigues
  2019-03-27 17:38     ` Adam Borowski
  2019-03-28 14:49   ` David Sterba
  2019-03-28 17:28   ` David Sterba
  2 siblings, 2 replies; 48+ messages in thread
From: Matthew Wilcox @ 2019-03-26 19:10 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel, Goldwyn Rodrigues

On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> This sets S_DAX in inode->i_flags, which can be used with
> IS_DAX().
> 
> The dax option is restricted to non multi-device mounts.
> dax interacts with the device directly instead of using bio, so
> all bio-hooks which we use for multi-device cannot be performed
> here. While regular read/writes could be manipulated with
> RAID0/1, mmap() is still an issue.
> 
> Auto-setting free space tree, because dealing with free space
> inode (specifically readpages) is a nightmare.
> Auto-setting nodatasum because we don't get callback for writing
> checksums after mmap()s.

Congratulations on getting the bear to dance.  But why?

To me, the point of btrfs is all the cool stuff it does with built-in
checksumming and snapshots and RAID and so on.  DAX doesn't let you do
any of that, so why would somebody want to use btrfs to manage DAX?


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/15] btrfs: create a mount option for dax
  2019-03-26 19:10   ` Matthew Wilcox
@ 2019-03-27 11:00     ` Goldwyn Rodrigues
  2019-03-27 12:00       ` Matthew Wilcox
  2019-03-27 17:38     ` Adam Borowski
  1 sibling, 1 reply; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-27 11:00 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-btrfs, linux-fsdevel

On 12:10 26/03, Matthew Wilcox wrote:
> On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > This sets S_DAX in inode->i_flags, which can be used with
> > IS_DAX().
> > 
> > The dax option is restricted to non multi-device mounts.
> > dax interacts with the device directly instead of using bio, so
> > all bio-hooks which we use for multi-device cannot be performed
> > here. While regular read/writes could be manipulated with
> > RAID0/1, mmap() is still an issue.
> > 
> > Auto-setting free space tree, because dealing with free space
> > inode (specifically readpages) is a nightmare.
> > Auto-setting nodatasum because we don't get callback for writing
> > checksums after mmap()s.
> 
> Congratulations on getting the bear to dance.  But why?
> 

Why not ? ;)

> To me, the point of btrfs is all the cool stuff it does with built-in
> checksumming and snapshots and RAID and so on.  DAX doesn't let you do
> any of that, so why would somebody want to use btrfs to manage DAX?
> 

There are users who are asking for advantages of dax on btrfs.

I have tried to make it work with snapshots in this series.
Checksumming should be possible, but would require some more hacks. I am
looking into it.
multi-device is an issue for mmap() and I don't think we can work around
it.

I agree there is a price to pay to use dax, but I am sure the users
would know about that.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/15] btrfs: create a mount option for dax
  2019-03-27 11:00     ` Goldwyn Rodrigues
@ 2019-03-27 12:00       ` Matthew Wilcox
  2019-03-27 12:26         ` Goldwyn Rodrigues
  2019-03-27 23:31         ` Goldwyn Rodrigues
  0 siblings, 2 replies; 48+ messages in thread
From: Matthew Wilcox @ 2019-03-27 12:00 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel

On Wed, Mar 27, 2019 at 06:00:52AM -0500, Goldwyn Rodrigues wrote:
> On 12:10 26/03, Matthew Wilcox wrote:
> > On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > > This sets S_DAX in inode->i_flags, which can be used with
> > > IS_DAX().
> > > 
> > > The dax option is restricted to non multi-device mounts.
> > > dax interacts with the device directly instead of using bio, so
> > > all bio-hooks which we use for multi-device cannot be performed
> > > here. While regular read/writes could be manipulated with
> > > RAID0/1, mmap() is still an issue.
> > > 
> > > Auto-setting free space tree, because dealing with free space
> > > inode (specifically readpages) is a nightmare.
> > > Auto-setting nodatasum because we don't get callback for writing
> > > checksums after mmap()s.
> > 
> > Congratulations on getting the bear to dance.  But why?
> 
> Why not ? ;)

 18 files changed, 662 insertions(+), 77 deletions(-)

I want to know what advantage we're getting for that.

> > To me, the point of btrfs is all the cool stuff it does with built-in
> > checksumming and snapshots and RAID and so on.  DAX doesn't let you do
> > any of that, so why would somebody want to use btrfs to manage DAX?
> 
> There are users who are asking for advantages of dax on btrfs.

There are also people asking for perpetual motion machines.  Do these
users understand the tradeoffs?

> I have tried to make it work with snapshots in this series.
> Checksumming should be possible, but would require some more hacks. I am
> looking into it.
> multi-device is an issue for mmap() and I don't think we can work around
> it.
> 
> I agree there is a price to pay to use dax, but I am sure the users
> would know about that.

I really doubt it, to be honest.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/15] btrfs: create a mount option for dax
  2019-03-27 12:00       ` Matthew Wilcox
@ 2019-03-27 12:26         ` Goldwyn Rodrigues
  2019-03-27 23:31         ` Goldwyn Rodrigues
  1 sibling, 0 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-27 12:26 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-btrfs, linux-fsdevel

On  5:00 27/03, Matthew Wilcox wrote:
> On Wed, Mar 27, 2019 at 06:00:52AM -0500, Goldwyn Rodrigues wrote:
> > On 12:10 26/03, Matthew Wilcox wrote:
> > > On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > > > This sets S_DAX in inode->i_flags, which can be used with
> > > > IS_DAX().
> > > > 
> > > > The dax option is restricted to non multi-device mounts.
> > > > dax interacts with the device directly instead of using bio, so
> > > > all bio-hooks which we use for multi-device cannot be performed
> > > > here. While regular read/writes could be manipulated with
> > > > RAID0/1, mmap() is still an issue.
> > > > 
> > > > Auto-setting free space tree, because dealing with free space
> > > > inode (specifically readpages) is a nightmare.
> > > > Auto-setting nodatasum because we don't get callback for writing
> > > > checksums after mmap()s.
> > > 
> > > Congratulations on getting the bear to dance.  But why?
> > 
> > Why not ? ;)
> 
>  18 files changed, 662 insertions(+), 77 deletions(-)
> 
> I want to know what advantage we're getting for that.

Direct device mmap'd files, faster access, lower pagecache usage...
all the advantages you would get from a dax device.

> 
> > > To me, the point of btrfs is all the cool stuff it does with built-in
> > > checksumming and snapshots and RAID and so on.  DAX doesn't let you do
> > > any of that, so why would somebody want to use btrfs to manage DAX?
> > 
> > There are users who are asking for advantages of dax on btrfs.
> 
> There are also people asking for perpetual motion machines.  Do these
> users understand the tradeoffs?

The only major feature they would not be able to use is
multi-device, which will be relayed at mount time.

> 
> > I have tried to make it work with snapshots in this series.
> > Checksumming should be possible, but would require some more hacks. I am
> > looking into it.
> > multi-device is an issue for mmap() and I don't think we can work around
> > it.
> > 
> > I agree there is a price to pay to use dax, but I am sure the users
> > would know about that.
> 
> I really doubt it, to be honest.

Care to elaborate?

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/15] btrfs: create a mount option for dax
  2019-03-26 19:10   ` Matthew Wilcox
  2019-03-27 11:00     ` Goldwyn Rodrigues
@ 2019-03-27 17:38     ` Adam Borowski
  1 sibling, 0 replies; 48+ messages in thread
From: Adam Borowski @ 2019-03-27 17:38 UTC (permalink / raw)
  To: Matthew Wilcox, Goldwyn Rodrigues, linux-btrfs, linux-fsdevel

On Tue, Mar 26, 2019 at 12:10:01PM -0700, Matthew Wilcox wrote:
> On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > The dax option is restricted to non multi-device mounts.
> > dax interacts with the device directly instead of using bio, so
> > all bio-hooks which we use for multi-device cannot be performed
> > here. While regular read/writes could be manipulated with
> > RAID0/1, mmap() is still an issue.
> > 
> > Auto-setting free space tree, because dealing with free space
> > inode (specifically readpages) is a nightmare.
> > Auto-setting nodatasum because we don't get callback for writing
> > checksums after mmap()s.
> 
> Congratulations on getting the bear to dance.  But why?
> 
> To me, the point of btrfs is all the cool stuff it does with built-in
> checksumming and snapshots and RAID and so on.  DAX doesn't let you do
> any of that, so why would somebody want to use btrfs to manage DAX?

If I read this correctly (I merely glanced at it), this patchset _does_
provide the full snapshot functionality.  This is something other
filesystems don't allow: ext4 has no CoW at all, and IIRC on XFS reflinks
and DAX are mutually exclusive.

Obviously, the usual btrfs way of CoWing every write would remove all
(write) upsides of DAX, thus NOCOW (ie, CoW once) is the way to go: a page
fault should happen no more than once per page per snapshot.


On the other hand, checksumming seems useless to me.  Data corruption can
happen either in transit or at rest.  For at rest, disks already have their
own checksums -- and [NV]DIMMs have ECC.  On the other hand, the majority of
the time when someone seeks help on the btrfs mailing list, it turns out to
be a matter of bad RAM, bad motherboard or bad cabling.  This doesn't apply
to pmem.  The usual path is:

   CPU
    |<--->memory
    |
  SATA controller
    |
    (SATA cable)
    |
  disk

The data goes to memory (very unlikely to to remain in the cache before
getting checksummed), then has to travel all the way down.  On the other
hand, the path on pmem is:

  CPU
   |---->memory

So the data written by userspace goes to memory... and that's it.


As for multi-device, at least single block groups would be very nice (to
have a filesystem than spans regions) and easyish to implement, while RAID0
might spoil hugepage fun but may still be straightforward.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Did ya know that typing "test -j8" instead of "ctest -j8"
⢿⡄⠘⠷⠚⠋⠀ will make your testsuite pass much faster, and fix bugs?
⠈⠳⣄⠀⠀⠀⠀

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write
  2019-03-26 19:02 ` [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write Goldwyn Rodrigues
@ 2019-03-27 17:54   ` Darrick J. Wong
  2019-03-27 18:58     ` Goldwyn Rodrigues
  2019-04-01  4:38   ` Dave Chinner
  1 sibling, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2019-03-27 17:54 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel, Goldwyn Rodrigues

On Tue, Mar 26, 2019 at 02:02:50PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> The IOMAP_F_COW is a flag to notify dax that it needs to copy
> the data from iomap->cow_addr to iomap->addr, if the start/end
> of I/O are not page aligned.
> 
> This also introduces dax_to_dax_copy() which performs a copy
> from one part of the device to another, to a maximum of one page.
> 
> Question: Using iomap.cow_addr == 0 means the CoW is to be copied
> (or memset) from a hole. Would this be better handled through a flag?
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/dax.c              | 36 ++++++++++++++++++++++++++++++++++++
>  include/linux/iomap.h |  3 +++
>  2 files changed, 39 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ca0671d55aa6..e254535dd830 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1051,6 +1051,28 @@ static bool dax_range_is_aligned(struct block_device *bdev,
>  	return true;
>  }
>  
> +static void dax_to_dax_copy(struct iomap *iomap, loff_t pos, void *daddr,
> +			    size_t len)
> +{
> +	loff_t blk_start, blk_pg;
> +	void *saddr;
> +	ssize_t map_len;
> +
> +	/* A zero address is a hole. */
> +	if (iomap->cow_addr == 0) {
> +		memset(daddr, 0, len);
> +		return;
> +	}
> +
> +	blk_start = iomap->cow_addr + pos - iomap->cow_pos;
> +	blk_pg = round_down(blk_start, PAGE_SIZE);
> +
> +	map_len = dax_direct_access(iomap->dax_dev, PHYS_PFN(blk_pg), PAGE_SIZE,
> +			&saddr, NULL);
> +	saddr += blk_start - blk_pg;
> +	memcpy(daddr, saddr, len);
> +}
> +
>  int __dax_zero_page_range(struct block_device *bdev,
>  		struct dax_device *dax_dev, sector_t sector,
>  		unsigned int offset, unsigned int size)
> @@ -1143,6 +1165,20 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  			break;
>  		}
>  
> +		if (iomap->flags & IOMAP_F_COW) {
> +			loff_t pg_end = round_up(end, PAGE_SIZE);
> +			/*
> +			 * Copy the first part of the page
> +			 * Note: we pass offset as length
> +			 */
> +			if (offset)
> +				dax_to_dax_copy(iomap, pos - offset, kaddr, offset);
> +
> +			/* Copy the last part of the range */
> +			if (end < pg_end)
> +				dax_to_dax_copy(iomap, end, kaddr + offset + length, pg_end - end);
> +		}
> +
>  		map_len = PFN_PHYS(map_len);
>  		kaddr += offset;
>  		map_len -= offset;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 0fefb5455bda..391785de1428 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h

Probably a good idea to cc the iomap maintainers on this (entire
patchset)... <cough>

--D

> @@ -35,6 +35,7 @@ struct vm_fault;
>  #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
>  #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
>  #define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
> +#define IOMAP_F_COW		0x08	/* cow before write */
>  
>  /*
>   * Flags that only need to be reported for IOMAP_REPORT requests:
> @@ -59,6 +60,8 @@ struct iomap {
>  	u64			length;	/* length of mapping, bytes */
>  	u16			type;	/* type of mapping */
>  	u16			flags;	/* flags for mapping */
> +	u64			cow_addr; /* read address to perform CoW */
> +	loff_t			cow_pos; /* file offset of cow_addr */
>  	struct block_device	*bdev;	/* block device for I/O */
>  	struct dax_device	*dax_dev; /* dax_dev for dax operations */
>  	void			*inline_data;
> -- 
> 2.16.4
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write
  2019-03-27 17:54   ` Darrick J. Wong
@ 2019-03-27 18:58     ` Goldwyn Rodrigues
  2019-03-28 14:45       ` Darrick J. Wong
  0 siblings, 1 reply; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-27 18:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-btrfs, linux-fsdevel

On 10:54 27/03, Darrick J. Wong wrote:
> On Tue, Mar 26, 2019 at 02:02:50PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > The IOMAP_F_COW is a flag to notify dax that it needs to copy
> > the data from iomap->cow_addr to iomap->addr, if the start/end
> > of I/O are not page aligned.
> > 
> > This also introduces dax_to_dax_copy() which performs a copy
> > from one part of the device to another, to a maximum of one page.
> > 
> > Question: Using iomap.cow_addr == 0 means the CoW is to be copied
> > (or memset) from a hole. Would this be better handled through a flag?
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/dax.c              | 36 ++++++++++++++++++++++++++++++++++++
> >  include/linux/iomap.h |  3 +++
> >  2 files changed, 39 insertions(+)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index ca0671d55aa6..e254535dd830 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1051,6 +1051,28 @@ static bool dax_range_is_aligned(struct block_device *bdev,
> >  	return true;
> >  }
> >  
> > +static void dax_to_dax_copy(struct iomap *iomap, loff_t pos, void *daddr,
> > +			    size_t len)
> > +{
> > +	loff_t blk_start, blk_pg;
> > +	void *saddr;
> > +	ssize_t map_len;
> > +
> > +	/* A zero address is a hole. */
> > +	if (iomap->cow_addr == 0) {
> > +		memset(daddr, 0, len);
> > +		return;
> > +	}
> > +
> > +	blk_start = iomap->cow_addr + pos - iomap->cow_pos;
> > +	blk_pg = round_down(blk_start, PAGE_SIZE);
> > +
> > +	map_len = dax_direct_access(iomap->dax_dev, PHYS_PFN(blk_pg), PAGE_SIZE,
> > +			&saddr, NULL);
> > +	saddr += blk_start - blk_pg;
> > +	memcpy(daddr, saddr, len);
> > +}
> > +
> >  int __dax_zero_page_range(struct block_device *bdev,
> >  		struct dax_device *dax_dev, sector_t sector,
> >  		unsigned int offset, unsigned int size)
> > @@ -1143,6 +1165,20 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> >  			break;
> >  		}
> >  
> > +		if (iomap->flags & IOMAP_F_COW) {
> > +			loff_t pg_end = round_up(end, PAGE_SIZE);
> > +			/*
> > +			 * Copy the first part of the page
> > +			 * Note: we pass offset as length
> > +			 */
> > +			if (offset)
> > +				dax_to_dax_copy(iomap, pos - offset, kaddr, offset);
> > +
> > +			/* Copy the last part of the range */
> > +			if (end < pg_end)
> > +				dax_to_dax_copy(iomap, end, kaddr + offset + length, pg_end - end);
> > +		}
> > +
> >  		map_len = PFN_PHYS(map_len);
> >  		kaddr += offset;
> >  		map_len -= offset;
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 0fefb5455bda..391785de1428 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> 
> Probably a good idea to cc the iomap maintainers on this (entire
> patchset)... <cough>

Yup. Will include you and Christoph in the future iterations. This will
not be the last iteration for this patchset.

Looks like fsdevel and btrfs was too narrow a list ;)

> 
> --D
> 
> > @@ -35,6 +35,7 @@ struct vm_fault;
> >  #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
> >  #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
> >  #define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
> > +#define IOMAP_F_COW		0x08	/* cow before write */
> >  
> >  /*
> >   * Flags that only need to be reported for IOMAP_REPORT requests:
> > @@ -59,6 +60,8 @@ struct iomap {
> >  	u64			length;	/* length of mapping, bytes */
> >  	u16			type;	/* type of mapping */
> >  	u16			flags;	/* flags for mapping */
> > +	u64			cow_addr; /* read address to perform CoW */
> > +	loff_t			cow_pos; /* file offset of cow_addr */
> >  	struct block_device	*bdev;	/* block device for I/O */
> >  	struct dax_device	*dax_dev; /* dax_dev for dax operations */
> >  	void			*inline_data;
> > -- 
> > 2.16.4
> > 
> 

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 00/15] btrfs dax support
  2019-03-26 19:09 ` [PATCH v2 00/15] btrfs dax support Goldwyn Rodrigues
@ 2019-03-27 20:14   ` Adam Borowski
  2019-03-27 23:26     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 48+ messages in thread
From: Adam Borowski @ 2019-03-27 20:14 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel, marcin.slusarz

On Tue, Mar 26, 2019 at 02:09:08PM -0500, Goldwyn Rodrigues wrote:
> This patch set adds support for dax on the BTRFS filesystem.

This patchset doesn't seem to support MAP_SYNC, which is the usual way to
use (and detect) DAX.  Basically, it requests for page faults to be
synchronous -- ie, when a page fault returns, the mapping points to actual
memory rather than to some buffer that'll be written back to the destination
at some point in the future.

Also, not really understanding these parts of the kernel, I can't tell if
the snapshots are atomic.  Ie, while the kernel walks over pages to set
mprotect flags, the process does two writes:
   RRRRRRRRRRRRRRRRRRRWWWWWWWWWWWWWWWWWWWWWW (R=ro W=rw)
        A                       B
The write at A causes a page fault, which clones the page, CoWing it and
letting the write into only one of the replicas.  After this, write to B
happens before the mprotect, thus goes into both replicas -- and despite
the process having issued proper memory barriers, the other replica has
B but not A.  To fix this, earlier page faults can't get finalized until
all mprotects are in place.  (I'm writing this as a query rather than a
problem report -- I'm an ignoramus here.)


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Did ya know that typing "test -j8" instead of "ctest -j8"
⢿⡄⠘⠷⠚⠋⠀ will make your testsuite pass much faster, and fix bugs?
⠈⠳⣄⠀⠀⠀⠀

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 00/15] btrfs dax support
  2019-03-27 20:14   ` Adam Borowski
@ 2019-03-27 23:26     ` Goldwyn Rodrigues
  2019-03-28 10:24       ` [PATCH] btrfs: allow MAP_SYNC mmap Adam Borowski
  0 siblings, 1 reply; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-27 23:26 UTC (permalink / raw)
  To: Adam Borowski; +Cc: linux-btrfs, linux-fsdevel, marcin.slusarz

On 21:14 27/03, Adam Borowski wrote:
> On Tue, Mar 26, 2019 at 02:09:08PM -0500, Goldwyn Rodrigues wrote:
> > This patch set adds support for dax on the BTRFS filesystem.
> 
> This patchset doesn't seem to support MAP_SYNC, which is the usual way to
> use (and detect) DAX.  Basically, it requests for page faults to be
> synchronous -- ie, when a page fault returns, the mapping points to actual
> memory rather than to some buffer that'll be written back to the destination
> at some point in the future.

The translation (in different flags/returns) goes as follows
MAP_SYNC -> VM_SYNC -> VM_NEEDDSYNC.
So, when dax_iomap_fault() returns, it is handled through
dax_finish_sync_fault(). This is how all filesystems are doing it currently.
Refer patch 09/15.

> 
> Also, not really understanding these parts of the kernel, I can't tell if
> the snapshots are atomic.  Ie, while the kernel walks over pages to set
> mprotect flags, the process does two writes:
>    RRRRRRRRRRRRRRRRRRRWWWWWWWWWWWWWWWWWWWWWW (R=ro W=rw)
>         A                       B
> The write at A causes a page fault, which clones the page, CoWing it and
> letting the write into only one of the replicas.  After this, write to B
> happens before the mprotect, thus goes into both replicas -- and despite
> the process having issued proper memory barriers, the other replica has
> B but not A.  To fix this, earlier page faults can't get finalized until
> all mprotects are in place.  (I'm writing this as a query rather than a
> problem report -- I'm an ignoramus here.)

When you initiate a snapshot, btrfs forces everything to CoW until
snapshot finishes. This guarantees all new allocations
are Cow, even if the extent is set to nocow. During this time, all
"writebacks" happen. We don't have writebacks in DAX, but we take this
opportunity to wrprotect the mmap'd pages.
For more details, refer to patch 15/15 in the series.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/15] btrfs: create a mount option for dax
  2019-03-27 12:00       ` Matthew Wilcox
  2019-03-27 12:26         ` Goldwyn Rodrigues
@ 2019-03-27 23:31         ` Goldwyn Rodrigues
  1 sibling, 0 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-03-27 23:31 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-btrfs, linux-fsdevel

On  5:00 27/03, Matthew Wilcox wrote:
> On Wed, Mar 27, 2019 at 06:00:52AM -0500, Goldwyn Rodrigues wrote:
> > On 12:10 26/03, Matthew Wilcox wrote:
> > > On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > > > This sets S_DAX in inode->i_flags, which can be used with
> > > > IS_DAX().
> > > > 
> > > > The dax option is restricted to non multi-device mounts.
> > > > dax interacts with the device directly instead of using bio, so
> > > > all bio-hooks which we use for multi-device cannot be performed
> > > > here. While regular read/writes could be manipulated with
> > > > RAID0/1, mmap() is still an issue.
> > > > 
> > > > Auto-setting free space tree, because dealing with free space
> > > > inode (specifically readpages) is a nightmare.
> > > > Auto-setting nodatasum because we don't get callback for writing
> > > > checksums after mmap()s.
> > > 
> > > Congratulations on getting the bear to dance.  But why?
> > 
> > Why not ? ;)
> 
>  18 files changed, 662 insertions(+), 77 deletions(-)
> 
> I want to know what advantage we're getting for that.

So the prime advantages are btrfs filesystem snapshots on the dax device.
Leveraging btrfs features, reflinks works just as well.
(In addition to all the advantages dax has to offer)



-- 
Goldwyn

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH] btrfs: allow MAP_SYNC mmap
  2019-03-27 23:26     ` Goldwyn Rodrigues
@ 2019-03-28 10:24       ` Adam Borowski
  2019-03-28 10:42         ` Adam Borowski
  2019-04-01 20:08         ` Goldwyn Rodrigues
  0 siblings, 2 replies; 48+ messages in thread
From: Adam Borowski @ 2019-03-28 10:24 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs, linux-fsdevel, marcin.slusarz
  Cc: Adam Borowski

Used by userspace to detect DAX.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
---
 fs/btrfs/file.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 196c8f37ff9d..8efdb040bc1d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -16,6 +16,7 @@
 #include <linux/btrfs.h>
 #include <linux/uio.h>
 #include <linux/iversion.h>
+#include <linux/mman.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -3320,6 +3321,7 @@ const struct file_operations btrfs_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.mmap		= btrfs_file_mmap,
+	.mmap_supported_flags = MAP_SYNC,
 	.open		= btrfs_file_open,
 	.release	= btrfs_release_file,
 	.fsync		= btrfs_sync_file,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH] btrfs: allow MAP_SYNC mmap
  2019-03-28 10:24       ` [PATCH] btrfs: allow MAP_SYNC mmap Adam Borowski
@ 2019-03-28 10:42         ` Adam Borowski
  2019-04-01 20:08         ` Goldwyn Rodrigues
  1 sibling, 0 replies; 48+ messages in thread
From: Adam Borowski @ 2019-03-28 10:42 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs, linux-fsdevel, marcin.slusarz

[kdave: like the rest of btrfs+DAX patchset, this is WIP of course]

> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 196c8f37ff9d..8efdb040bc1d 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -16,6 +16,7 @@
> +#include <linux/mman.h>

> +	.mmap_supported_flags = MAP_SYNC,

With this, the userspace at least thinks that DAX works.  Whether it's
actually crash-safe, will require a lot of review that's outside of my
areas of knowledge.

Most of the PMDK's test suite passes, but at least pmemspoil test fails.
There's also (not sure if in pmemspoil or earlier):

[  652.053976] RIP: 0010:btrfs_free_reserved_data_space_noquota+0xe6/0xf0
[  652.060683] Code: b8 00 48 8b 45 00 48 85 c0 75 d7 41 c6 45 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 c1 48 f7 d9 48 39 ca 0f 83 6a ff ff ff <0f> 0b 31 c0 e9 64 ff ff ff 90 41 55 49 89 fd 41 54 49 89 f4 55 53
[  652.079767] RSP: 0000:ffff99450807fc38 EFLAGS: 00010287
[  652.085078] RAX: fffffffffffff000 RBX: 0000000000300000 RCX: 0000000000001000
[  652.092330] RDX: 0000000000000000 RSI: 0000000000300000 RDI: ffff89965030a000
[  652.099583] RBP: 0000000000300000 R08: ffff99450807fc28 R09: ffffffffae6bc01a
[  652.106836] R10: ffffc76c9f5f8600 R11: 0000000000000011 R12: 0000000000300fff
[  652.114089] R13: ffff89965030a000 R14: ffff89964a82d000 R15: 0000000000000002
[  652.121342] FS:  00007fee0adcc800(0000) GS:ffff899655bc0000(0000) knlGS:0000000000000000
[  652.129566] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  652.135405] CR2: 00007fee0a700000 CR3: 00000007742d6006 CR4: 00000000003606e0
[  652.142657] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  652.149822] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  652.157149] Call Trace:
[  652.159550]  btrfs_free_reserved_data_space+0x46/0x60
[  652.164755]  btrfs_iomap_end+0xf4/0x150
[  652.168654]  dax_iomap_pte_fault.isra.41+0x201/0x8f0
[  652.173712]  btrfs_dax_fault+0x3c/0xa0
[  652.177523]  __do_fault+0x2f/0x90
[  652.180891]  __handle_mm_fault+0x9fa/0xed0
[  652.185056]  __do_page_fault+0x242/0x4c0
[  652.188955]  ? page_fault+0x8/0x30
[  652.192484]  page_fault+0x1e/0x30
[  652.195854] RIP: 0033:0x7fee0b340efc
[  652.199400] Code: 9d 48 81 fa 80 00 00 00 77 19 c5 fe 7f 07 c5 fe 7f 47 20 c5 fe 7f 44 17 e0 c5 fe 7f 44 17 c0 c5 f8 77 c3 48 8d 8f 80 00 00 00 <c5> fe 7f 07 48 83 e1 80 c5 fe 7f 44 17 e0 c5 fe 7f 47 20 c5 fe 7f
[  652.218563] RSP: 002b:00007ffcd02d58e8 EFLAGS: 00010202
[  652.223786] RAX: 00007fee0a700000 RBX: 00007fee0a6fffc0 RCX: 00007fee0a700080
[  652.231026] RDX: 0000000000000800 RSI: 0000000000000000 RDI: 00007fee0a700000
[  652.238353] RBP: 00007fee0a401b38 R08: 000000000000000f R09: 0000000000000001
[  652.245606] R10: 000000000000000d R11: 00007fee0b360bf0 R12: 0000000000000800
[  652.252865] R13: 00007fee0b4a6eec R14: 00007fee0a700000 R15: 0000000000000180
[  652.260119] ---[ end trace b6545baf6cf711c6 ]---


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Did ya know that typing "test -j8" instead of "ctest -j8"
⢿⡄⠘⠷⠚⠋⠀ will make your testsuite pass much faster, and fix bugs?
⠈⠳⣄⠀⠀⠀⠀

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write
  2019-03-27 18:58     ` Goldwyn Rodrigues
@ 2019-03-28 14:45       ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2019-03-28 14:45 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel

On Wed, Mar 27, 2019 at 01:58:53PM -0500, Goldwyn Rodrigues wrote:
> On 10:54 27/03, Darrick J. Wong wrote:
> > On Tue, Mar 26, 2019 at 02:02:50PM -0500, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > 
> > > The IOMAP_F_COW is a flag to notify dax that it needs to copy
> > > the data from iomap->cow_addr to iomap->addr, if the start/end
> > > of I/O are not page aligned.
> > > 
> > > This also introduces dax_to_dax_copy() which performs a copy
> > > from one part of the device to another, to a maximum of one page.
> > > 
> > > Question: Using iomap.cow_addr == 0 means the CoW is to be copied
> > > (or memset) from a hole. Would this be better handled through a flag?
> > > 
> > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > ---
> > >  fs/dax.c              | 36 ++++++++++++++++++++++++++++++++++++
> > >  include/linux/iomap.h |  3 +++
> > >  2 files changed, 39 insertions(+)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index ca0671d55aa6..e254535dd830 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -1051,6 +1051,28 @@ static bool dax_range_is_aligned(struct block_device *bdev,
> > >  	return true;
> > >  }
> > >  
> > > +static void dax_to_dax_copy(struct iomap *iomap, loff_t pos, void *daddr,
> > > +			    size_t len)

Hmm... is this a dax copy-in function?  I think what's going on here is
that we have a request to copy into the file @len bytes at offset @pos;
@daddr is the pmemory address of file offset @pos, and the @iomap is
supposed to have a cow address (or nothing).

(Would love a comment here, even if it is a static helper.)

> > > +{
> > > +	loff_t blk_start, blk_pg;
> > > +	void *saddr;
> > > +	ssize_t map_len;
> > > +
> > > +	/* A zero address is a hole. */
> > > +	if (iomap->cow_addr == 0) {
> > > +		memset(daddr, 0, len);
> > > +		return;
> > > +	}
> > > +
> > > +	blk_start = iomap->cow_addr + pos - iomap->cow_pos;
> > > +	blk_pg = round_down(blk_start, PAGE_SIZE);
> > > +
> > > +	map_len = dax_direct_access(iomap->dax_dev, PHYS_PFN(blk_pg), PAGE_SIZE,
> > > +			&saddr, NULL);

/me wonders if we're supposed to do something with map_len here?

> > > +	saddr += blk_start - blk_pg;
> > > +	memcpy(daddr, saddr, len);
> > > +}
> > > +
> > >  int __dax_zero_page_range(struct block_device *bdev,
> > >  		struct dax_device *dax_dev, sector_t sector,
> > >  		unsigned int offset, unsigned int size)
> > > @@ -1143,6 +1165,20 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > >  			break;
> > >  		}
> > >  
> > > +		if (iomap->flags & IOMAP_F_COW) {
> > > +			loff_t pg_end = round_up(end, PAGE_SIZE);
> > > +			/*
> > > +			 * Copy the first part of the page
> > > +			 * Note: we pass offset as length
> > > +			 */
> > > +			if (offset)
> > > +				dax_to_dax_copy(iomap, pos - offset, kaddr, offset);
> > > +
> > > +			/* Copy the last part of the range */
> > > +			if (end < pg_end)
> > > +				dax_to_dax_copy(iomap, end, kaddr + offset + length, pg_end - end);
> > > +		}
> > > +
> > >  		map_len = PFN_PHYS(map_len);
> > >  		kaddr += offset;
> > >  		map_len -= offset;
> > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > index 0fefb5455bda..391785de1428 100644
> > > --- a/include/linux/iomap.h
> > > +++ b/include/linux/iomap.h
> > 
> > Probably a good idea to cc the iomap maintainers on this (entire
> > patchset)... <cough>
> 
> Yup. Will include you and Christoph in the future iterations. This will
> not be the last iteration for this patchset.
> 
> Looks like fsdevel and btrfs was too narrow a list ;)

Not necessarily, I /did/ see this on both lists.  Just a friendly
reminder that iomap has maintainers now and is no longer adrift in the
VFS. :)

> > 
> > --D
> > 
> > > @@ -35,6 +35,7 @@ struct vm_fault;
> > >  #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
> > >  #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
> > >  #define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
> > > +#define IOMAP_F_COW		0x08	/* cow before write */

This could use some more elaboration, since I couldn't figure out with
certainty how this mechanism is supposed to work...

> > >  
> > >  /*
> > >   * Flags that only need to be reported for IOMAP_REPORT requests:
> > > @@ -59,6 +60,8 @@ struct iomap {
> > >  	u64			length;	/* length of mapping, bytes */
> > >  	u16			type;	/* type of mapping */
> > >  	u16			flags;	/* flags for mapping */
> > > +	u64			cow_addr; /* read address to perform CoW */

Oh, I see, IOMAP_F_COW means @cow_addr points to the copy source
address, not the destination space we just allocated.

/*
 * A copy-on-write operation is necessary to maintain data integrity.
 * Write actors must copy the portion of the file that will not be
 * overwritten by the write from @cow_addr to @addr.
 */
#define IOMAP_F_COW		0x08

> > > +	loff_t			cow_pos; /* file offset of cow_addr */

Why wouldn't this be named @cow_offset?

And why wouldn't it be the same as @offset?

--D

> > >  	struct block_device	*bdev;	/* block device for I/O */
> > >  	struct dax_device	*dax_dev; /* dax_dev for dax operations */
> > >  	void			*inline_data;
> > > -- 
> > > 2.16.4
> > > 
> > 
> 
> -- 
> Goldwyn

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/15] btrfs: create a mount option for dax
  2019-03-26 19:02 ` [PATCH 01/15] btrfs: create a mount option for dax Goldwyn Rodrigues
  2019-03-26 19:10   ` Matthew Wilcox
@ 2019-03-28 14:49   ` David Sterba
  2019-03-28 17:28   ` David Sterba
  2 siblings, 0 replies; 48+ messages in thread
From: David Sterba @ 2019-03-28 14:49 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel, Goldwyn Rodrigues

On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> This sets S_DAX in inode->i_flags, which can be used with
> IS_DAX().

I haven't followed the dax developments recently, IIRC the mount option
'dax' was meant to be a temporary solution. I don't know if this has
stuck, or is discouraged to be added to new code. The replacement is the
per-inode xflag.

> The dax option is restricted to non multi-device mounts.
> dax interacts with the device directly instead of using bio, so
> all bio-hooks which we use for multi-device cannot be performed
> here. While regular read/writes could be manipulated with
> RAID0/1, mmap() is still an issue.
> 
> Auto-setting free space tree, because dealing with free space
> inode (specifically readpages) is a nightmare.
> Auto-setting nodatasum because we don't get callback for writing
> checksums after mmap()s.

As discussed before, mandating free-space-tree for dax is ok and even
more that it does not complicate the space cache v1 code.

> Store the dax_device in fs_info which will be used in iomap code.
> Question: Since we have only one dax device, I thought fs_info is
> the best place. However, should it moved to btrfs_device?

At this point it's probably ok to store it in fs_info as the
multi-device support does not exist. If this changes, then the device is
the right place.

As Adam mentioned, support for the 'single' profile would be nice. It's
the simplest profile that over multiple devices only splits the logical
address space (of btrfs) into more chunks. However I don't know if the
dax side does require something that makes this impossible to coexist.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 07/15] btrfs: add dax write support
  2019-03-26 19:02 ` [PATCH 07/15] btrfs: add dax write support Goldwyn Rodrigues
@ 2019-03-28 14:53   ` Darrick J. Wong
  2019-04-01 20:39     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2019-03-28 14:53 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel, Goldwyn Rodrigues

On Tue, Mar 26, 2019 at 02:02:53PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> IOMAP_F_COW allows to inform the dax code, to first perform
> a copy which are not page-aligned before performing the write.
> 
> A new struct btrfs_iomap is passed from iomap_begin() to
> iomap_end(), which contains all the accounting and locking information
> for CoW based writes.
> 
> For writing to a hole, iomap->cow_addr is set to zero. Would this
> be better handled by a flag or can a valid filesystem block be at
> offset zero of the device?
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/btrfs/ctree.h |   6 +++
>  fs/btrfs/dax.c   | 119 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/btrfs/file.c  |   4 +-
>  3 files changed, 124 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index a3543a4a063d..3bcd2a4959c1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3801,6 +3801,12 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
>  #ifdef CONFIG_FS_DAX
>  /* dax.c */
>  ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
> +#else
> +static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	return 0;
> +}
>  #endif /* CONFIG_FS_DAX */
>  
>  static inline int is_fstree(u64 rootid)
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index bf3d46b0acb6..49619fe3f94f 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -9,30 +9,124 @@
>  #ifdef CONFIG_FS_DAX
>  #include <linux/dax.h>
>  #include <linux/iomap.h>
> +#include <linux/uio.h>
>  #include "ctree.h"
>  #include "btrfs_inode.h"
>  
> +struct btrfs_iomap {
> +	u64 start;
> +	u64 end;
> +	int nocow;
> +	struct extent_changeset *data_reserved;
> +	struct extent_state *cached_state;
> +};
> +
>  static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
>  		loff_t length, unsigned flags, struct iomap *iomap)
>  {
>  	struct extent_map *em;
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +
>  	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
> +
> +	if (flags & IOMAP_WRITE) {
> +		int ret = 0, nocow;
> +		struct extent_map *map = em;
> +		struct btrfs_iomap *bi;

Please consider breaking this up into a separate helper before the
btrfs_iomap_begin function becomes long and hard to read like the xfs
one did.  :)

(Granted people also seem to dislike scrolling back and forth...)

> +
> +		bi = kzalloc(sizeof(struct btrfs_iomap), GFP_NOFS);
> +		if (!bi)
> +			return -ENOMEM;
> +
> +		bi->start = round_down(pos, PAGE_SIZE);
> +		bi->end = round_up(pos + length, PAGE_SIZE);
> +
> +		iomap->private = bi;
> +
> +		/* Wait for existing ordered extents in range to finish */
> +		btrfs_wait_ordered_range(inode, bi->start, bi->end - bi->start);
> +
> +		lock_extent_bits(&BTRFS_I(inode)->io_tree, bi->start, bi->end, &bi->cached_state);
> +
> +		ret = btrfs_delalloc_reserve_space(inode, &bi->data_reserved,
> +				bi->start, bi->end - bi->start);
> +		if (ret) {
> +			unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
> +					&bi->cached_state);
> +			kfree(bi);
> +			return ret;
> +		}
> +
> +		refcount_inc(&map->refs);
> +		ret = btrfs_get_extent_map_write(&em, NULL,
> +				inode, bi->start, bi->end - bi->start, &nocow);
> +		if (ret) {
> +			unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
> +					&bi->cached_state);
> +			btrfs_delalloc_release_space(inode,
> +					bi->data_reserved, bi->start,
> +					bi->end - bi->start, true);
> +			extent_changeset_free(bi->data_reserved);
> +			kfree(bi);
> +			return ret;
> +		}
> +		if (!nocow) {
> +			iomap->flags |= IOMAP_F_COW;
> +			if (map->block_start != EXTENT_MAP_HOLE) {
> +				iomap->cow_addr = map->block_start;
> +				iomap->cow_pos = map->start;

Oh, I see, cow_pos exists because the extent we're copying from and the
extent we're copying into are not necessarily going to be positioned at
the same file offset and (I guess) it's possible that the source range
could be partially sparse given the destination range?

Hmm, no, the previous patch doesn't account for that; it only seems to
know how to handle @cow_pos < @offset.  In that case, why not trim the
cow_* map to @offset?

--D

> +			}
> +		} else {
> +			bi->nocow = 1;
> +		}
> +		free_extent_map(map);
> +	}
> +
> +	iomap->offset = em->start;
> +	iomap->length = em->len;
> +	iomap->bdev = em->bdev;
> +	iomap->dax_dev = fs_info->dax_dev;
> +
>  	if (em->block_start == EXTENT_MAP_HOLE) {
>  		iomap->type = IOMAP_HOLE;
>  		return 0;
>  	}
> +
>  	iomap->type = IOMAP_MAPPED;
> -	iomap->bdev = em->bdev;
> -	iomap->dax_dev = fs_info->dax_dev;
> -	iomap->offset = em->start;
> -	iomap->length = em->len;
>  	iomap->addr = em->block_start;
>  	return 0;
>  }
>  
> +static int btrfs_iomap_end(struct inode *inode, loff_t pos,
> +		loff_t length, ssize_t written, unsigned flags,
> +		struct iomap *iomap)
> +{
> +	struct btrfs_iomap *bi = iomap->private;
> +	u64 wend;
> +
> +	if (!bi)
> +		return 0;
> +
> +	unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
> +			&bi->cached_state);
> +
> +	wend = round_up(pos + written, PAGE_SIZE);
> +	if (wend < bi->end) {
> +		btrfs_delalloc_release_space(inode,
> +				bi->data_reserved, wend,
> +				bi->end - wend, true);
> +	}
> +
> +	btrfs_update_ordered_extent(inode, bi->start, wend - bi->start, true);
> +	btrfs_delalloc_release_extents(BTRFS_I(inode), wend - bi->start, false);
> +	extent_changeset_free(bi->data_reserved);
> +	kfree(bi);
> +	return 0;
> +}
> +
>  static const struct iomap_ops btrfs_iomap_ops = {
>  	.iomap_begin		= btrfs_iomap_begin,
> +	.iomap_end		= btrfs_iomap_end,
>  };
>  
>  ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
> @@ -46,4 +140,21 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
>  
>  	return ret;
>  }
> +
> +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *iter)
> +{
> +	ssize_t ret = 0;
> +	u64 pos = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	ret = dax_iomap_rw(iocb, iter, &btrfs_iomap_ops);
> +
> +	if (ret > 0) {
> +		pos += ret;
> +		if (pos > i_size_read(inode))
> +			i_size_write(inode, pos);
> +		iocb->ki_pos = pos;
> +	}
> +	return ret;
> +}
>  #endif /* CONFIG_FS_DAX */
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index b620f4e718b2..3b320d0ab495 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1964,7 +1964,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>  	if (sync)
>  		atomic_inc(&BTRFS_I(inode)->sync_writers);
>  
> -	if (iocb->ki_flags & IOCB_DIRECT) {
> +	if (IS_DAX(inode)) {
> +		num_written = btrfs_file_dax_write(iocb, from);
> +	} else if (iocb->ki_flags & IOCB_DIRECT) {
>  		num_written = __btrfs_direct_write(iocb, from);
>  	} else {
>  		num_written = btrfs_buffered_write(iocb, from);
> -- 
> 2.16.4
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 08/15] dax: add dax_iomap_cow to copy a mmap page before writing
  2019-03-26 19:02 ` [PATCH 08/15] dax: add dax_iomap_cow to copy a mmap page before writing Goldwyn Rodrigues
@ 2019-03-28 15:41   ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2019-03-28 15:41 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel, Goldwyn Rodrigues

On Tue, Mar 26, 2019 at 02:02:54PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> dax_iomap_cow copies a page before presenting for mmap.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/dax.c | 33 ++++++++++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index e254535dd830..21ee3df6f02c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1269,6 +1269,33 @@ static bool dax_fault_is_synchronous(unsigned long flags,
>  		&& (iomap->flags & IOMAP_F_DIRTY);
>  }
>  
> +static int dax_iomap_cow(struct iomap *iomap, loff_t pos, pfn_t *pfn)
> +{
> +	void *daddr;
> +	pgoff_t pgoff;
> +	long rc;
> +	int id;
> +	sector_t sector;
> +
> +	pos = round_down(pos, PAGE_SIZE);
> +
> +	sector = round_down(iomap->addr + iomap->offset - pos, PAGE_SIZE) >> 9;
> +	rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
> +	if (rc)
> +		return rc;
> +
> +	id = dax_read_lock();
> +	rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &daddr, pfn);
> +	if (rc < 0)
> +		goto out;
> +
> +	dax_to_dax_copy(iomap, pos, daddr, PAGE_SIZE);
> +
> +out:
> +	dax_read_unlock(id);
> +	return rc;
> +}
> +
>  static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  			       int *iomap_errp, const struct iomap_ops *ops)
>  {
> @@ -1372,7 +1399,11 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
>  			major = VM_FAULT_MAJOR;
>  		}
> -		error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
> +
> +		if (iomap.flags & IOMAP_F_COW)
> +			error = dax_iomap_cow(&iomap, pos, &pfn);

Why wouldn't dax_iomap_cow also apply to dax_iomap_pmd_fault?

--D

> +		else
> +			error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
>  		if (error < 0)
>  			goto error_finish_iomap;
>  
> -- 
> 2.16.4
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 09/15] btrfs: add dax mmap support
  2019-03-26 19:02 ` [PATCH 09/15] btrfs: add dax mmap support Goldwyn Rodrigues
@ 2019-03-28 15:45   ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2019-03-28 15:45 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel, Goldwyn Rodrigues

On Tue, Mar 26, 2019 at 02:02:55PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Add a new vm_operations struct btrfs_dax_vm_ops
> specifically for dax files.
> 
> Since we will be removing(nulling) readpages/writepages for dax
> return ENOEXEC only for non-dax files.
> 
> dax_insert_entry() looks ugly. Do you think we should break it
> into dax_insert_cow_entry() and dax_insert_entry()?

I would (or replace the two bools with flags), but people seem not to
like my stylistic choices. :)

> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/btrfs/ctree.h |  1 +
>  fs/btrfs/dax.c   | 11 +++++++++++
>  fs/btrfs/file.c  | 18 ++++++++++++++++--
>  fs/dax.c         | 17 ++++++++++-------
>  4 files changed, 38 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 3bcd2a4959c1..0e5060933bde 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3802,6 +3802,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
>  /* dax.c */
>  ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
>  ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
> +vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
>  #else
>  static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
>  {
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index 49619fe3f94f..927f962d1e88 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -157,4 +157,15 @@ ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *iter)
>  	}
>  	return ret;
>  }
> +
> +vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
> +{
> +	vm_fault_t ret;
> +	pfn_t pfn;
> +	ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &pfn, NULL, &btrfs_iomap_ops);
> +	if (ret & VM_FAULT_NEEDDSYNC)
> +		ret = dax_finish_sync_fault(vmf, PE_SIZE_PTE, pfn);
> +
> +	return ret;
> +}
>  #endif /* CONFIG_FS_DAX */
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 3b320d0ab495..196c8f37ff9d 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2214,15 +2214,29 @@ static const struct vm_operations_struct btrfs_file_vm_ops = {
>  	.page_mkwrite	= btrfs_page_mkwrite,
>  };
>  
> +#ifdef CONFIG_FS_DAX
> +static const struct vm_operations_struct btrfs_dax_vm_ops = {
> +	.fault          = btrfs_dax_fault,
> +	.page_mkwrite   = btrfs_dax_fault,
> +	.pfn_mkwrite    = btrfs_dax_fault,
> +};
> +#else
> +#define btrfs_dax_vm_ops btrfs_file_vm_ops
> +#endif
> +
>  static int btrfs_file_mmap(struct file	*filp, struct vm_area_struct *vma)
>  {
>  	struct address_space *mapping = filp->f_mapping;
> +	struct inode *inode = file_inode(filp);
>  
> -	if (!mapping->a_ops->readpage)
> +	if (!IS_DAX(inode) && !mapping->a_ops->readpage)
>  		return -ENOEXEC;
>  
>  	file_accessed(filp);
> -	vma->vm_ops = &btrfs_file_vm_ops;
> +	if (IS_DAX(inode))
> +		vma->vm_ops = &btrfs_dax_vm_ops;
> +	else
> +		vma->vm_ops = &btrfs_file_vm_ops;
>  
>  	return 0;
>  }
> diff --git a/fs/dax.c b/fs/dax.c
> index 21ee3df6f02c..41061da42771 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c

Whoah, waitaminute, I thought this was a "twiddle stuff inside btrfs
only" patch...

> @@ -708,14 +708,15 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
>   */
>  static void *dax_insert_entry(struct xa_state *xas,
>  		struct address_space *mapping, struct vm_fault *vmf,
> -		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> +		void *entry, pfn_t pfn, unsigned long flags, bool dirty,
> +		bool cow)
>  {
>  	void *new_entry = dax_make_entry(pfn, flags);
>  
>  	if (dirty)
>  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> -	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
> +	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
>  		unsigned long index = xas->xa_index;
>  		/* we are replacing a zero page with block mapping */
>  		if (dax_is_pmd_entry(entry))
> @@ -732,7 +733,7 @@ static void *dax_insert_entry(struct xa_state *xas,
>  		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
>  	}
>  
> -	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> +	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
>  		/*
>  		 * Only swap our new entry into the page cache if the current
>  		 * entry is a zero page or an empty entry.  If a normal PTE or
> @@ -1031,7 +1032,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>  	vm_fault_t ret;
>  
>  	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> -			DAX_ZERO_PAGE, false);
> +			DAX_ZERO_PAGE, false, false);
>  
>  	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
>  	trace_dax_load_hole(inode, vmf, ret);
> @@ -1408,7 +1409,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  			goto error_finish_iomap;
>  
>  		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
> -						 0, write && !sync);
> +						 0, write && !sync,
> +						(iomap.flags & IOMAP_F_COW) != 0);

Assuming you stick with bool cow, you don't need the != 0 test.

>  
>  		/*
>  		 * If we are doing synchronous page fault and inode needs fsync,
> @@ -1487,7 +1489,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  
>  	pfn = page_to_pfn_t(zero_page);
>  	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> -			DAX_PMD | DAX_ZERO_PAGE, false);
> +			DAX_PMD | DAX_ZERO_PAGE, false, false);
>  
>  	ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
>  	if (!pmd_none(*(vmf->pmd))) {
> @@ -1610,7 +1612,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  			goto finish_iomap;
>  
>  		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
> -						DAX_PMD, write && !sync);
> +						DAX_PMD, write && !sync,
> +						false);

Why don't PMD faults support COW?

--D

>  
>  		/*
>  		 * If we are doing synchronous page fault and inode needs fsync,
> -- 
> 2.16.4
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 15/15] btrfs: Writeprotect mmap pages on snapshot
  2019-03-26 19:03 ` [PATCH 15/15] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues
@ 2019-03-28 15:48   ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2019-03-28 15:48 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel, Goldwyn Rodrigues

On Tue, Mar 26, 2019 at 02:03:01PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Inorder to make sure mmap'd files don't change after snapshot,
> writeprotect the mmap pages on snapshot. This is done by performing
> a data writeback on the pages (which simply mark the pages are
> wrprotected). This way if the user process tries to access the memory
> we will get another fault and we can perform a CoW.
> 
> In order to accomplish this, we tag all CoW pages as
> PAGECACHE_TAG_TOWRITE, and add the mmapd inode in delalloc_inodes.
> During snapshot, it starts writeback of all delalloc'd inodes and
> here we perform a data writeback. We don't want to keep the inodes
> in delalloc_inodes until it umount (WARN_ON), so we remove it
> during inode evictions.
> 
> This looks hackish. Other alternatives could be to create another
> list for mmap'd files or rename delalloc_inodes to writeback_inodes.
> Suggestions?

Is there a function to invalidate only the writable mappings in the
pagecache?  Could there be?

> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/btrfs/ctree.h |  3 ++-
>  fs/btrfs/dax.c   |  7 +++++++
>  fs/btrfs/inode.c | 13 ++++++++++++-
>  fs/dax.c         |  3 +++
>  4 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 21068dc4a95a..68a63d93556a 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3252,7 +3252,8 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
>  			     struct btrfs_root *new_root,
>  			     struct btrfs_root *parent_root,
>  			     u64 new_dirid);
> - void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
> +void btrfs_add_delalloc_inodes(struct btrfs_root *root, struct inode *inode);
> +void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
>  			       unsigned *bits);
>  void btrfs_clear_delalloc_extent(struct inode *inode,
>  				 struct extent_state *state, unsigned *bits);
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index d73945d50b88..bcb961242c74 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -166,10 +166,17 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
>  {
>  	vm_fault_t ret;
>  	pfn_t pfn;
> +	struct inode *inode = file_inode(vmf->vma->vm_file);
> +	struct btrfs_inode *binode = BTRFS_I(inode);
>  	ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &pfn, NULL, &btrfs_iomap_ops);
>  	if (ret & VM_FAULT_NEEDDSYNC)
>  		ret = dax_finish_sync_fault(vmf, PE_SIZE_PTE, pfn);
>  
> +	/* Insert into delalloc so we get writeback calls on snapshots */
> +	if (vmf->flags & FAULT_FLAG_WRITE &&
> +			!test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
> +		btrfs_add_delalloc_inodes(binode->root, inode);
> +
>  	return ret;
>  }
>  
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5350e5f23728..3b72c1c96b34 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1713,7 +1713,7 @@ void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new,
>  	spin_unlock(&BTRFS_I(inode)->lock);
>  }
>  
> -static void btrfs_add_delalloc_inodes(struct btrfs_root *root,
> +void btrfs_add_delalloc_inodes(struct btrfs_root *root,
>  				      struct inode *inode)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> @@ -5358,12 +5358,17 @@ void btrfs_evict_inode(struct inode *inode)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct btrfs_trans_handle *trans;
> +	struct btrfs_inode *binode = BTRFS_I(inode);
>  	struct btrfs_root *root = BTRFS_I(inode)->root;
>  	struct btrfs_block_rsv *rsv;
>  	int ret;
>  
>  	trace_btrfs_inode_evict(inode);
>  
> +	if (IS_DAX(inode)
> +	    && test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
> +		btrfs_del_delalloc_inode(root, binode);
> +
>  	if (!root) {
>  		clear_inode(inode);
>  		return;
> @@ -8683,6 +8688,10 @@ static int btrfs_dax_writepages(struct address_space *mapping,
>  {
>  	struct inode *inode = mapping->host;
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct btrfs_inode *binode = BTRFS_I(inode);
> +	if ((wbc->sync_mode == WB_SYNC_ALL) &&
> +	    test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags))
> +		btrfs_del_delalloc_inode(binode->root, binode);
>  	return dax_writeback_mapping_range(mapping, fs_info->fs_devices->latest_bdev,
>  			wbc);
>  }
> @@ -9981,6 +9990,8 @@ static void btrfs_run_delalloc_work(struct btrfs_work *work)
>  	delalloc_work = container_of(work, struct btrfs_delalloc_work,
>  				     work);
>  	inode = delalloc_work->inode;
> +	if (IS_DAX(inode))
> +		filemap_fdatawrite(inode->i_mapping);
>  	filemap_flush(inode->i_mapping);
>  	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>  				&BTRFS_I(inode)->runtime_flags))
> diff --git a/fs/dax.c b/fs/dax.c
> index 93146142bb00..c42e9cb486ef 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -753,6 +753,9 @@ static void *dax_insert_entry(struct xa_state *xas,
>  	if (dirty)
>  		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>  
> +	if (cow)
> +		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
> +

Was this supposed to go with the dax_insert_entry changes (patch 9)?

--D

>  	xas_unlock_irq(xas);
>  	return entry;
>  }
> -- 
> 2.16.4
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/15] fs: dedup file range to use a compare function
  2019-03-26 19:02 ` [PATCH 11/15] fs: dedup file range to use a compare function Goldwyn Rodrigues
@ 2019-03-28 17:04   ` Darrick J. Wong
  2019-04-01 20:36     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2019-03-28 17:04 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel, Goldwyn Rodrigues

On Tue, Mar 26, 2019 at 02:02:57PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> With dax we cannot deal with readpage() etc. So, we create a
> funciton callback to perform the file data comparison and pass
> it to generic_remap_file_range_prep() so it can use iomap-based
> functions.

So it occurs to me -- is there an intrinsic requirement that all files
sharing blocks must be in DAX mode or not-DAX mode?  Or in other words,
if I run cp --reflink a b, then it cannot be the case that IS_DAX(a) !=
IS_DAX(b), right?

Will there be calamitous consequences if dax and page cache both point
to the same piece of storage?  I would imagine so...

> This may not be the best way to solve this. Suggestions welcome.

Agree.  There's a fair amount of code duplication between the dax and
pagecache compare functions, considering that the loop structure for
both cases is:

while (len) {
	/* Figure out minimum comparison length */

	/* Map source file range into memory */

	/* Map dest file range into memory */

	/* Compare memory */

	/* Release dest file memory */

	/* Release source file memory */

	/* Update counters or break out */
}

The current vfs_dedupe_file_range_compare could use some improvements,
at least for filesystems where we actually support iomap.  In
particular, if you try to dedupe into or out of a hole, it'll flood the
page cache with zero pages, which is pretty wasteful if we could've
found out that it's actually a hole.

> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/btrfs/ctree.h     |  3 +++
>  fs/btrfs/dax.c       |  7 +++++++
>  fs/btrfs/ioctl.c     | 13 +++++++++++-
>  fs/dax.c             | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/ocfs2/file.c      |  2 +-
>  fs/read_write.c      |  9 +++++---
>  fs/xfs/xfs_reflink.c |  2 +-
>  include/linux/dax.h  |  2 ++
>  include/linux/fs.h   |  4 +++-
>  9 files changed, 93 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0e5060933bde..750f9c70fabe 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3803,6 +3803,9 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
>  ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
>  ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
>  vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
> +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> +		struct inode *dest, loff_t destoff, loff_t len,
> +		bool *is_same);
>  #else
>  static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
>  {
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index 927f962d1e88..9488cae0f8b4 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -168,4 +168,11 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
>  
>  	return ret;
>  }
> +
> +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> +		struct inode *dest, loff_t destoff, loff_t len,
> +		bool *is_same)
> +{
> +	return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, &btrfs_iomap_ops);
> +}
>  #endif /* CONFIG_FS_DAX */
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index e66426e7692d..2e5137b01561 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3990,8 +3990,19 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>  	if (ret < 0)
>  		goto out_unlock;
>  
> +#ifdef CONFIG_FS_DAX
> +	if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out)))
> +		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
> +				pos_out, len, remap_flags,
> +				btrfs_dax_file_range_compare);
> +	else
> +		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
> +				pos_out, len, remap_flags, NULL);
> +#else
>  	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> -					    len, remap_flags);
> +						    len, remap_flags, NULL);
> +#endif
> +
>  	if (ret < 0 || *len == 0)
>  		goto out_unlock;
>  
> diff --git a/fs/dax.c b/fs/dax.c
> index 41061da42771..18998c5ee27a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1775,3 +1775,61 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>  	return dax_insert_pfn_mkwrite(vmf, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +
> +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
> +		loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops)

Comments for this function assume this is a vfs generic function, not a
dax specific thing...

> +{
> +	void *saddr, *daddr;
> +	struct iomap s_iomap = {0};
> +	struct iomap d_iomap = {0};
> +	loff_t dstart, sstart;
> +	bool same = true;
> +	loff_t cmp_len, l;
> +	int id, ret = 0;
> +
> +	id = dax_read_lock();
> +	while (len) {
> +		ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap);
> +		if (ret < 0) {
> +			if (ops->iomap_end)
> +				ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
> +			return ret;
> +		}

Shouldn't we be checking that the iomap actually points to written
storage?  As opposed to hole/delalloc/unwritten?

Bonus: not-written extents (hole/unwritten) could be optimized a bit,
though I don't know if it's worth it for something that's probably a
corner case.

> +		cmp_len = len;
> +		if (cmp_len > s_iomap.offset + s_iomap.length - srcoff)
> +			cmp_len = s_iomap.offset + s_iomap.length - srcoff;
> +
> +		ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &d_iomap);
> +		if (ret < 0) {
> +			if (ops->iomap_end) {
> +				ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
> +				ops->iomap_end(dest, destoff, len, ret, 0, &d_iomap);
> +			}
> +			return ret;
> +		}
> +		if (cmp_len > d_iomap.offset + d_iomap.length - destoff)
> +			cmp_len = d_iomap.offset + d_iomap.length - destoff;
> +

If you wanted to make this a generic function, it would be kinda nice if
we could switch between grabbing page cache for non-DAX files and
dax_direct_access for DAX files.

(Actually, no, that's kind of ugly too.)

Perhaps iomap.c needs to grow a function to iterate ranges of two files
and call an actor function on both sets?  And we can change the actor to
be the pagecache-grabbing one or the dax-access one depending on the DAX
mode.

Though I guess that leaves poor old ocfs2 in the dust since it doesn't
support iomap.  I guess it wouldn't be too hard to adapt it to have an
iomap ops for reads only.

> +
> +		sstart = (get_start_sect(s_iomap.bdev) << 9) + s_iomap.addr + (srcoff - s_iomap.offset);
> +		l = dax_direct_access(s_iomap.dax_dev, PHYS_PFN(sstart), PHYS_PFN(cmp_len), &saddr, NULL);
> +		dstart = (get_start_sect(d_iomap.bdev) << 9) + d_iomap.addr + (destoff - d_iomap.offset);
> +		l = dax_direct_access(d_iomap.dax_dev, PHYS_PFN(dstart), PHYS_PFN(cmp_len), &daddr, NULL);
> +		same = !memcmp(saddr, daddr, cmp_len);
> +		if (!same)
> +			break;
> +		len -= cmp_len;
> +		srcoff += cmp_len;
> +		destoff += cmp_len;
> +
> +		if (ops->iomap_end) {
> +			ret = ops->iomap_end(src, srcoff, len, 0, 0, &s_iomap);
> +			ret = ops->iomap_end(dest, destoff, len, 0, 0, &d_iomap);
> +		}
> +	}
> +	dax_read_unlock(id);
> +	*is_same = same;
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_file_range_compare);
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index d640c5f8a85d..6bf3e8fbb016 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -2558,7 +2558,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
>  		goto out_unlock;
>  
>  	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> -			&len, remap_flags);
> +			&len, remap_flags, NULL);
>  	if (ret < 0 || len == 0)
>  		goto out_unlock;
>  
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 177ccc3d405a..da521a221213 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1855,7 +1855,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>   */
>  int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>  				  struct file *file_out, loff_t pos_out,
> -				  loff_t *len, unsigned int remap_flags)
> +				  loff_t *len, unsigned int remap_flags, compare_range_t compare)

Line wrapping...

>  {
>  	struct inode *inode_in = file_inode(file_in);
>  	struct inode *inode_out = file_inode(file_out);
> @@ -1914,8 +1914,11 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>  	 */
>  	if (remap_flags & REMAP_FILE_DEDUP) {
>  		bool		is_same = false;
> -
> -		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
> +		if (compare)
> +			ret = compare(inode_in, pos_in,
> +				inode_out, pos_out, *len, &is_same);
> +		else
> +			ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
>  				inode_out, pos_out, *len, &is_same);

Make the callers pass in vfs_dedupe_file_range_compare instead of NULL,
and avoid this if clause stuff.

--D

>  		if (ret)
>  			return ret;
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 680ae7662a78..8907c7aa3f19 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1350,7 +1350,7 @@ xfs_reflink_remap_prep(
>  		goto out_unlock;
>  
>  	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> -			len, remap_flags);
> +			len, remap_flags, NULL);
>  	if (ret < 0 || *len == 0)
>  		goto out_unlock;
>  
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 0dd316a74a29..a11bc7b1f526 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -157,6 +157,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
>  				      pgoff_t index);
> +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
> +                loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops);
>  
>  #ifdef CONFIG_FS_DAX
>  int __dax_zero_page_range(struct block_device *bdev,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 8b42df09b04c..22fe4324b22e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1880,10 +1880,12 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
>  		unsigned long, loff_t *, rwf_t);
>  extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
>  				   loff_t, size_t, unsigned int);
> +typedef int (compare_range_t)(struct inode *src, loff_t srcpos, struct inode *dest,
> +		loff_t destpos, loff_t len, bool *is_same);
>  extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>  					 struct file *file_out, loff_t pos_out,
>  					 loff_t *count,
> -					 unsigned int remap_flags);
> +					 unsigned int remap_flags, compare_range_t cmp);
>  extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
>  				  struct file *file_out, loff_t pos_out,
>  				  loff_t len, unsigned int remap_flags);
> -- 
> 2.16.4
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/15] btrfs: create a mount option for dax
  2019-03-26 19:02 ` [PATCH 01/15] btrfs: create a mount option for dax Goldwyn Rodrigues
  2019-03-26 19:10   ` Matthew Wilcox
  2019-03-28 14:49   ` David Sterba
@ 2019-03-28 17:28   ` David Sterba
  2019-03-28 17:57     ` Darrick J. Wong
  2019-04-01 20:43     ` Goldwyn Rodrigues
  2 siblings, 2 replies; 48+ messages in thread
From: David Sterba @ 2019-03-28 17:28 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel, Goldwyn Rodrigues

On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> This sets S_DAX in inode->i_flags, which can be used with
> IS_DAX().
> 
> The dax option is restricted to non multi-device mounts.
> dax interacts with the device directly instead of using bio, so
> all bio-hooks which we use for multi-device cannot be performed
> here. While regular read/writes could be manipulated with
> RAID0/1, mmap() is still an issue.

I'm looking for other features that would not work with dax. Disabling
multiple devices strikes out device add, device delete and device
replace.

To be verified:

- balance - at least profile changes must be forbidden

- defrag

- compression - as it needs COW, it won't work on the nodatacow files,
  thus seems to be ok

- resize/grow - this could work without changes, no block relocation is
  needed, only new structures created

- resize/shrink - depends on balance, the block groups from the removed
  area need to be relocated

- swapfiles - I'm sure somebody will try that; I haven't seen any
  IS_DAX checks in the swapfile activation code


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/15] btrfs: create a mount option for dax
  2019-03-28 17:28   ` David Sterba
@ 2019-03-28 17:57     ` Darrick J. Wong
  2019-04-01 20:43     ` Goldwyn Rodrigues
  1 sibling, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2019-03-28 17:57 UTC (permalink / raw)
  To: dsterba, Goldwyn Rodrigues, linux-btrfs, linux-fsdevel,
	Goldwyn Rodrigues

On Thu, Mar 28, 2019 at 06:28:25PM +0100, David Sterba wrote:
> On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > This sets S_DAX in inode->i_flags, which can be used with
> > IS_DAX().
> > 
> > The dax option is restricted to non multi-device mounts.
> > dax interacts with the device directly instead of using bio, so
> > all bio-hooks which we use for multi-device cannot be performed
> > here. While regular read/writes could be manipulated with
> > RAID0/1, mmap() is still an issue.
> 
> I'm looking for other features that would not work with dax. Disabling
> multiple devices strikes out device add, device delete and device
> replace.
> 
> To be verified:
> 
> - balance - at least profile changes must be forbidden
> 
> - defrag
> 
> - compression - as it needs COW, it won't work on the nodatacow files,
>   thus seems to be ok
> 
> - resize/grow - this could work without changes, no block relocation is
>   needed, only new structures created
> 
> - resize/shrink - depends on balance, the block groups from the removed
>   area need to be relocated
> 
> - swapfiles - I'm sure somebody will try that; I haven't seen any
>   IS_DAX checks in the swapfile activation code

I've never tried dax swapfiles, but so long as you can issue bios to the
underlying bdev they ought to work, right?  Even if the swap code could
be streamlined by memcpy in the dax case.

(Granted I have no idea what sort of bio-hooks magic btrfs does, maybe
it really doesn't work there...)

--D

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 05/15] btrfs: return whether extent is nocow or not
  2019-03-26 19:02 ` [PATCH 05/15] btrfs: return whether extent is nocow or not Goldwyn Rodrigues
@ 2019-03-31 18:42   ` Nikolay Borisov
  0 siblings, 0 replies; 48+ messages in thread
From: Nikolay Borisov @ 2019-03-31 18:42 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs; +Cc: linux-fsdevel, Goldwyn Rodrigues



On 26.03.19 г. 21:02 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> We require this to set the IOMAP_F_COW flag in
> iomap structure, in the later patches.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/btrfs/ctree.h | 2 +-
>  fs/btrfs/inode.c | 9 +++++++--
>  2 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index b7bbe5130a3b..2c49d3c46170 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3278,7 +3278,7 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location,
>  struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
>  			 struct btrfs_root *root, int *was_new);
>  int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
> -		struct inode *inode, u64 start, u64 len);
> +		struct inode *inode, u64 start, u64 len, int *nocow);

If this is going to be used as a boolean flag, then define is as boolean *
>  struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>  				    struct page *page, size_t pg_offset,
>  				    u64 start, u64 end, int create);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 80184d0c3b52..c8702e0b5e66 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7499,12 +7499,15 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em,
>  int btrfs_get_extent_map_write(struct extent_map **map,
>  		struct buffer_head *bh,
>  		struct inode *inode,
> -		u64 start, u64 len)
> +		u64 start, u64 len, int *nocow)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct extent_map *em = *map;
>  	int ret = 0;
>  
> +	if (nocow)
> +		*nocow = 0;
> +
>  	/*
>  	 * We don't allocate a new extent in the following cases
>  	 *
> @@ -7553,6 +7556,8 @@ int btrfs_get_extent_map_write(struct extent_map **map,
>  			 */
>  			btrfs_free_reserved_data_space_noquota(inode, start,
>  							       len);
> +			if (nocow)
> +				*nocow = 1;
>  			/* skip COW */
>  			goto out;
>  		}
> @@ -7579,7 +7584,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>  	struct extent_map *em;
>  
>  	ret = btrfs_get_extent_map_write(map, bh_result, inode,
> -			start, len);
> +			start, len, NULL);
>  	if (ret < 0)
>  		return ret;
>  	em = *map;
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write
  2019-03-26 19:02 ` [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write Goldwyn Rodrigues
  2019-03-27 17:54   ` Darrick J. Wong
@ 2019-04-01  4:38   ` Dave Chinner
  2019-04-01 21:41     ` Goldwyn Rodrigues
  2019-04-07  7:26     ` Christoph Hellwig
  1 sibling, 2 replies; 48+ messages in thread
From: Dave Chinner @ 2019-04-01  4:38 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel, Goldwyn Rodrigues

On Tue, Mar 26, 2019 at 02:02:50PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> The IOMAP_F_COW is a flag to notify dax that it needs to copy
> the data from iomap->cow_addr to iomap->addr, if the start/end
> of I/O are not page aligned.

I see what you are trying to do here, but this is kinda gross.

> This also introduces dax_to_dax_copy() which performs a copy
> from one part of the device to another, to a maximum of one page.
> 
> Question: Using iomap.cow_addr == 0 means the CoW is to be copied
> (or memset) from a hole. Would this be better handled through a flag?

That's what all these checks in the iomap code do:

	if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN)

Oh, wait, you're trying to map two ranges in a single iomap and then
infer state from information that got chucked away.... IOWs, you're
doing it wrong - iomap algorithms are driven by how we manipulate
iomaps to do data operations efficiently, not how we copy data page
by page.

IOWs, what we really should have here is two iomaps - a source
and a destination iomap. The source is a read mapping of the
current address (where we are going to copy the data from), the
destination is the post-cow allocation mapping (where the data
goes).

Now you just copy the data from one map to the other iterating
source mappings until the necessary range of the destination has
been covered.  And you can check if the source map is IOMAP_HOLE or
IOMAP_UNWRITTEN and hence optimise the copy (i.e. zero the new
allocation) before copying in the new data.

Even better, only do the source mapping if the write isn't
page/filesystem block aligned, and hence only do the slow path
copying if a source mapping exists....

> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 0fefb5455bda..391785de1428 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -35,6 +35,7 @@ struct vm_fault;
>  #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
>  #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
>  #define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
> +#define IOMAP_F_COW		0x08	/* cow before write */

"Copy on write before write"? :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] btrfs: allow MAP_SYNC mmap
  2019-03-28 10:24       ` [PATCH] btrfs: allow MAP_SYNC mmap Adam Borowski
  2019-03-28 10:42         ` Adam Borowski
@ 2019-04-01 20:08         ` Goldwyn Rodrigues
  1 sibling, 0 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-01 20:08 UTC (permalink / raw)
  To: Adam Borowski; +Cc: linux-btrfs, linux-fsdevel, marcin.slusarz

On 11:24 28/03, Adam Borowski wrote:
> Used by userspace to detect DAX.

Thanks. I will add it to the series.

> 
> Signed-off-by: Adam Borowski <kilobyte@angband.pl>
> ---
>  fs/btrfs/file.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 196c8f37ff9d..8efdb040bc1d 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -16,6 +16,7 @@
>  #include <linux/btrfs.h>
>  #include <linux/uio.h>
>  #include <linux/iversion.h>
> +#include <linux/mman.h>
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -3320,6 +3321,7 @@ const struct file_operations btrfs_file_operations = {
>  	.splice_read	= generic_file_splice_read,
>  	.write_iter	= btrfs_file_write_iter,
>  	.mmap		= btrfs_file_mmap,
> +	.mmap_supported_flags = MAP_SYNC,
>  	.open		= btrfs_file_open,
>  	.release	= btrfs_release_file,
>  	.fsync		= btrfs_sync_file,
> -- 
> 2.20.1
> 

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/15] fs: dedup file range to use a compare function
  2019-03-28 17:04   ` Darrick J. Wong
@ 2019-04-01 20:36     ` Goldwyn Rodrigues
  0 siblings, 0 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-01 20:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-btrfs, linux-fsdevel

On 10:04 28/03, Darrick J. Wong wrote:
> On Tue, Mar 26, 2019 at 02:02:57PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > With dax we cannot deal with readpage() etc. So, we create a
> > funciton callback to perform the file data comparison and pass
> > it to generic_remap_file_range_prep() so it can use iomap-based
> > functions.
> 
> So it occurs to me -- is there an intrinsic requirement that all files
> sharing blocks must be in DAX mode or not-DAX mode?  Or in other words,
> if I run cp --reflink a b, then it cannot be the case that IS_DAX(a) !=
> IS_DAX(b), right?
> 
> Will there be calamitous consequences if dax and page cache both point
> to the same piece of storage?  I would imagine so...

Yes, data corruptions!

> 
> > This may not be the best way to solve this. Suggestions welcome.
> 
> Agree.  There's a fair amount of code duplication between the dax and
> pagecache compare functions, considering that the loop structure for
> both cases is:
> 
> while (len) {
> 	/* Figure out minimum comparison length */
> 
> 	/* Map source file range into memory */
> 
> 	/* Map dest file range into memory */
> 
> 	/* Compare memory */
> 
> 	/* Release dest file memory */
> 
> 	/* Release source file memory */
> 
> 	/* Update counters or break out */
> }
> 
> The current vfs_dedupe_file_range_compare could use some improvements,
> at least for filesystems where we actually support iomap.  In
> particular, if you try to dedupe into or out of a hole, it'll flood the
> page cache with zero pages, which is pretty wasteful if we could've
> found out that it's actually a hole.
> 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/btrfs/ctree.h     |  3 +++
> >  fs/btrfs/dax.c       |  7 +++++++
> >  fs/btrfs/ioctl.c     | 13 +++++++++++-
> >  fs/dax.c             | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/ocfs2/file.c      |  2 +-
> >  fs/read_write.c      |  9 +++++---
> >  fs/xfs/xfs_reflink.c |  2 +-
> >  include/linux/dax.h  |  2 ++
> >  include/linux/fs.h   |  4 +++-
> >  9 files changed, 93 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 0e5060933bde..750f9c70fabe 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -3803,6 +3803,9 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
> >  ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> >  ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
> >  vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
> > +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> > +		struct inode *dest, loff_t destoff, loff_t len,
> > +		bool *is_same);
> >  #else
> >  static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
> >  {
> > diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> > index 927f962d1e88..9488cae0f8b4 100644
> > --- a/fs/btrfs/dax.c
> > +++ b/fs/btrfs/dax.c
> > @@ -168,4 +168,11 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
> >  
> >  	return ret;
> >  }
> > +
> > +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> > +		struct inode *dest, loff_t destoff, loff_t len,
> > +		bool *is_same)
> > +{
> > +	return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, &btrfs_iomap_ops);
> > +}
> >  #endif /* CONFIG_FS_DAX */
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index e66426e7692d..2e5137b01561 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -3990,8 +3990,19 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> >  	if (ret < 0)
> >  		goto out_unlock;
> >  
> > +#ifdef CONFIG_FS_DAX
> > +	if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out)))
> > +		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
> > +				pos_out, len, remap_flags,
> > +				btrfs_dax_file_range_compare);
> > +	else
> > +		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
> > +				pos_out, len, remap_flags, NULL);
> > +#else
> >  	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> > -					    len, remap_flags);
> > +						    len, remap_flags, NULL);
> > +#endif
> > +
> >  	if (ret < 0 || *len == 0)
> >  		goto out_unlock;
> >  
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 41061da42771..18998c5ee27a 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1775,3 +1775,61 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> >  	return dax_insert_pfn_mkwrite(vmf, pfn, order);
> >  }
> >  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> > +
> > +
> > +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
> > +		loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops)
> 
> Comments for this function assume this is a vfs generic function, not a
> dax specific thing...
> 
> > +{
> > +	void *saddr, *daddr;
> > +	struct iomap s_iomap = {0};
> > +	struct iomap d_iomap = {0};
> > +	loff_t dstart, sstart;
> > +	bool same = true;
> > +	loff_t cmp_len, l;
> > +	int id, ret = 0;
> > +
> > +	id = dax_read_lock();
> > +	while (len) {
> > +		ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap);
> > +		if (ret < 0) {
> > +			if (ops->iomap_end)
> > +				ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
> > +			return ret;
> > +		}
> 
> Shouldn't we be checking that the iomap actually points to written
> storage?  As opposed to hole/delalloc/unwritten?
> 
> Bonus: not-written extents (hole/unwritten) could be optimized a bit,
> though I don't know if it's worth it for something that's probably a
> corner case.

Yes, I think this needs to be optimized. However, from the VFS API end.
As you mention above, it fills the pagecache unnecessarily. I feel file
comparison is best handled by individual filesystems as opposed to
the VFS.

> 
> > +		cmp_len = len;
> > +		if (cmp_len > s_iomap.offset + s_iomap.length - srcoff)
> > +			cmp_len = s_iomap.offset + s_iomap.length - srcoff;
> > +
> > +		ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &d_iomap);
> > +		if (ret < 0) {
> > +			if (ops->iomap_end) {
> > +				ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap);
> > +				ops->iomap_end(dest, destoff, len, ret, 0, &d_iomap);
> > +			}
> > +			return ret;
> > +		}
> > +		if (cmp_len > d_iomap.offset + d_iomap.length - destoff)
> > +			cmp_len = d_iomap.offset + d_iomap.length - destoff;
> > +
> 
> If you wanted to make this a generic function, it would be kinda nice if
> we could switch between grabbing page cache for non-DAX files and
> dax_direct_access for DAX files.
> 
> (Actually, no, that's kind of ugly too.)
> 
> Perhaps iomap.c needs to grow a function to iterate ranges of two files
> and call an actor function on both sets?  And we can change the actor to
> be the pagecache-grabbing one or the dax-access one depending on the DAX
> mode.
> 

Yes, that would work well.

> Though I guess that leaves poor old ocfs2 in the dust since it doesn't
> support iomap.  I guess it wouldn't be too hard to adapt it to have an
> iomap ops for reads only.
> 
> > +
> > +		sstart = (get_start_sect(s_iomap.bdev) << 9) + s_iomap.addr + (srcoff - s_iomap.offset);
> > +		l = dax_direct_access(s_iomap.dax_dev, PHYS_PFN(sstart), PHYS_PFN(cmp_len), &saddr, NULL);
> > +		dstart = (get_start_sect(d_iomap.bdev) << 9) + d_iomap.addr + (destoff - d_iomap.offset);
> > +		l = dax_direct_access(d_iomap.dax_dev, PHYS_PFN(dstart), PHYS_PFN(cmp_len), &daddr, NULL);
> > +		same = !memcmp(saddr, daddr, cmp_len);
> > +		if (!same)
> > +			break;
> > +		len -= cmp_len;
> > +		srcoff += cmp_len;
> > +		destoff += cmp_len;
> > +
> > +		if (ops->iomap_end) {
> > +			ret = ops->iomap_end(src, srcoff, len, 0, 0, &s_iomap);
> > +			ret = ops->iomap_end(dest, destoff, len, 0, 0, &d_iomap);
> > +		}
> > +	}
> > +	dax_read_unlock(id);
> > +	*is_same = same;
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_file_range_compare);
> > diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> > index d640c5f8a85d..6bf3e8fbb016 100644
> > --- a/fs/ocfs2/file.c
> > +++ b/fs/ocfs2/file.c
> > @@ -2558,7 +2558,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
> >  		goto out_unlock;
> >  
> >  	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> > -			&len, remap_flags);
> > +			&len, remap_flags, NULL);
> >  	if (ret < 0 || len == 0)
> >  		goto out_unlock;
> >  
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 177ccc3d405a..da521a221213 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1855,7 +1855,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> >   */
> >  int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> >  				  struct file *file_out, loff_t pos_out,
> > -				  loff_t *len, unsigned int remap_flags)
> > +				  loff_t *len, unsigned int remap_flags, compare_range_t compare)
> 
> Line wrapping...
> 
> >  {
> >  	struct inode *inode_in = file_inode(file_in);
> >  	struct inode *inode_out = file_inode(file_out);
> > @@ -1914,8 +1914,11 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> >  	 */
> >  	if (remap_flags & REMAP_FILE_DEDUP) {
> >  		bool		is_same = false;
> > -
> > -		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
> > +		if (compare)
> > +			ret = compare(inode_in, pos_in,
> > +				inode_out, pos_out, *len, &is_same);
> > +		else
> > +			ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
> >  				inode_out, pos_out, *len, &is_same);
> 
> Make the callers pass in vfs_dedupe_file_range_compare instead of NULL,
> and avoid this if clause stuff.

Got it. Thanks.

> 
> --D
> 
> >  		if (ret)
> >  			return ret;
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 680ae7662a78..8907c7aa3f19 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1350,7 +1350,7 @@ xfs_reflink_remap_prep(
> >  		goto out_unlock;
> >  
> >  	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> > -			len, remap_flags);
> > +			len, remap_flags, NULL);
> >  	if (ret < 0 || *len == 0)
> >  		goto out_unlock;
> >  
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 0dd316a74a29..a11bc7b1f526 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -157,6 +157,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> >  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> >  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> >  				      pgoff_t index);
> > +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
> > +                loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops);
> >  
> >  #ifdef CONFIG_FS_DAX
> >  int __dax_zero_page_range(struct block_device *bdev,
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 8b42df09b04c..22fe4324b22e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1880,10 +1880,12 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
> >  		unsigned long, loff_t *, rwf_t);
> >  extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
> >  				   loff_t, size_t, unsigned int);
> > +typedef int (compare_range_t)(struct inode *src, loff_t srcpos, struct inode *dest,
> > +		loff_t destpos, loff_t len, bool *is_same);
> >  extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> >  					 struct file *file_out, loff_t pos_out,
> >  					 loff_t *count,
> > -					 unsigned int remap_flags);
> > +					 unsigned int remap_flags, compare_range_t cmp);
> >  extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
> >  				  struct file *file_out, loff_t pos_out,
> >  				  loff_t len, unsigned int remap_flags);
> > -- 
> > 2.16.4
> > 
> 

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 07/15] btrfs: add dax write support
  2019-03-28 14:53   ` Darrick J. Wong
@ 2019-04-01 20:39     ` Goldwyn Rodrigues
  0 siblings, 0 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-01 20:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-btrfs, linux-fsdevel

On  7:53 28/03, Darrick J. Wong wrote:
> On Tue, Mar 26, 2019 at 02:02:53PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > IOMAP_F_COW allows to inform the dax code, to first perform
> > a copy which are not page-aligned before performing the write.
> > 
> > A new struct btrfs_iomap is passed from iomap_begin() to
> > iomap_end(), which contains all the accounting and locking information
> > for CoW based writes.
> > 
> > For writing to a hole, iomap->cow_addr is set to zero. Would this
> > be better handled by a flag or can a valid filesystem block be at
> > offset zero of the device?
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/btrfs/ctree.h |   6 +++
> >  fs/btrfs/dax.c   | 119 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  fs/btrfs/file.c  |   4 +-
> >  3 files changed, 124 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index a3543a4a063d..3bcd2a4959c1 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -3801,6 +3801,12 @@ int btree_readahead_hook(struct extent_buffer *eb, int err);
> >  #ifdef CONFIG_FS_DAX
> >  /* dax.c */
> >  ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> > +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
> > +#else
> > +static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +	return 0;
> > +}
> >  #endif /* CONFIG_FS_DAX */
> >  
> >  static inline int is_fstree(u64 rootid)
> > diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> > index bf3d46b0acb6..49619fe3f94f 100644
> > --- a/fs/btrfs/dax.c
> > +++ b/fs/btrfs/dax.c
> > @@ -9,30 +9,124 @@
> >  #ifdef CONFIG_FS_DAX
> >  #include <linux/dax.h>
> >  #include <linux/iomap.h>
> > +#include <linux/uio.h>
> >  #include "ctree.h"
> >  #include "btrfs_inode.h"
> >  
> > +struct btrfs_iomap {
> > +	u64 start;
> > +	u64 end;
> > +	int nocow;
> > +	struct extent_changeset *data_reserved;
> > +	struct extent_state *cached_state;
> > +};
> > +
> >  static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
> >  		loff_t length, unsigned flags, struct iomap *iomap)
> >  {
> >  	struct extent_map *em;
> >  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +
> >  	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
> > +
> > +	if (flags & IOMAP_WRITE) {
> > +		int ret = 0, nocow;
> > +		struct extent_map *map = em;
> > +		struct btrfs_iomap *bi;
> 
> Please consider breaking this up into a separate helper before the
> btrfs_iomap_begin function becomes long and hard to read like the xfs
> one did.  :)
> 
> (Granted people also seem to dislike scrolling back and forth...)
> 
> > +
> > +		bi = kzalloc(sizeof(struct btrfs_iomap), GFP_NOFS);
> > +		if (!bi)
> > +			return -ENOMEM;
> > +
> > +		bi->start = round_down(pos, PAGE_SIZE);
> > +		bi->end = round_up(pos + length, PAGE_SIZE);
> > +
> > +		iomap->private = bi;
> > +
> > +		/* Wait for existing ordered extents in range to finish */
> > +		btrfs_wait_ordered_range(inode, bi->start, bi->end - bi->start);
> > +
> > +		lock_extent_bits(&BTRFS_I(inode)->io_tree, bi->start, bi->end, &bi->cached_state);
> > +
> > +		ret = btrfs_delalloc_reserve_space(inode, &bi->data_reserved,
> > +				bi->start, bi->end - bi->start);
> > +		if (ret) {
> > +			unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
> > +					&bi->cached_state);
> > +			kfree(bi);
> > +			return ret;
> > +		}
> > +
> > +		refcount_inc(&map->refs);
> > +		ret = btrfs_get_extent_map_write(&em, NULL,
> > +				inode, bi->start, bi->end - bi->start, &nocow);
> > +		if (ret) {
> > +			unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
> > +					&bi->cached_state);
> > +			btrfs_delalloc_release_space(inode,
> > +					bi->data_reserved, bi->start,
> > +					bi->end - bi->start, true);
> > +			extent_changeset_free(bi->data_reserved);
> > +			kfree(bi);
> > +			return ret;
> > +		}
> > +		if (!nocow) {
> > +			iomap->flags |= IOMAP_F_COW;
> > +			if (map->block_start != EXTENT_MAP_HOLE) {
> > +				iomap->cow_addr = map->block_start;
> > +				iomap->cow_pos = map->start;
> 
> Oh, I see, cow_pos exists because the extent we're copying from and the
> extent we're copying into are not necessarily going to be positioned at
> the same file offset and (I guess) it's possible that the source range
> could be partially sparse given the destination range?

No, there is no sparse range here.

> 
> Hmm, no, the previous patch doesn't account for that; it only seems to
> know how to handle @cow_pos < @offset.  In that case, why not trim the
> cow_* map to @offset?

Yes, that however would put the calculation responsibility on the filesystem
as opposed to the code. I am fine with either ways though, and it would
eliminate the need of cow_offset. However, for CoW cow_offset will be
calculated as round_down(offset, PAGE_SIZE) which seems reasonable.

> 
> --D
> 
> > +			}
> > +		} else {
> > +			bi->nocow = 1;
> > +		}
> > +		free_extent_map(map);
> > +	}
> > +
> > +	iomap->offset = em->start;
> > +	iomap->length = em->len;
> > +	iomap->bdev = em->bdev;
> > +	iomap->dax_dev = fs_info->dax_dev;
> > +
> >  	if (em->block_start == EXTENT_MAP_HOLE) {
> >  		iomap->type = IOMAP_HOLE;
> >  		return 0;
> >  	}
> > +
> >  	iomap->type = IOMAP_MAPPED;
> > -	iomap->bdev = em->bdev;
> > -	iomap->dax_dev = fs_info->dax_dev;
> > -	iomap->offset = em->start;
> > -	iomap->length = em->len;
> >  	iomap->addr = em->block_start;
> >  	return 0;
> >  }
> >  
> > +static int btrfs_iomap_end(struct inode *inode, loff_t pos,
> > +		loff_t length, ssize_t written, unsigned flags,
> > +		struct iomap *iomap)
> > +{
> > +	struct btrfs_iomap *bi = iomap->private;
> > +	u64 wend;
> > +
> > +	if (!bi)
> > +		return 0;
> > +
> > +	unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
> > +			&bi->cached_state);
> > +
> > +	wend = round_up(pos + written, PAGE_SIZE);
> > +	if (wend < bi->end) {
> > +		btrfs_delalloc_release_space(inode,
> > +				bi->data_reserved, wend,
> > +				bi->end - wend, true);
> > +	}
> > +
> > +	btrfs_update_ordered_extent(inode, bi->start, wend - bi->start, true);
> > +	btrfs_delalloc_release_extents(BTRFS_I(inode), wend - bi->start, false);
> > +	extent_changeset_free(bi->data_reserved);
> > +	kfree(bi);
> > +	return 0;
> > +}
> > +
> >  static const struct iomap_ops btrfs_iomap_ops = {
> >  	.iomap_begin		= btrfs_iomap_begin,
> > +	.iomap_end		= btrfs_iomap_end,
> >  };
> >  
> >  ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
> > @@ -46,4 +140,21 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
> >  
> >  	return ret;
> >  }
> > +
> > +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *iter)
> > +{
> > +	ssize_t ret = 0;
> > +	u64 pos = iocb->ki_pos;
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > +	ret = dax_iomap_rw(iocb, iter, &btrfs_iomap_ops);
> > +
> > +	if (ret > 0) {
> > +		pos += ret;
> > +		if (pos > i_size_read(inode))
> > +			i_size_write(inode, pos);
> > +		iocb->ki_pos = pos;
> > +	}
> > +	return ret;
> > +}
> >  #endif /* CONFIG_FS_DAX */
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index b620f4e718b2..3b320d0ab495 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1964,7 +1964,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >  	if (sync)
> >  		atomic_inc(&BTRFS_I(inode)->sync_writers);
> >  
> > -	if (iocb->ki_flags & IOCB_DIRECT) {
> > +	if (IS_DAX(inode)) {
> > +		num_written = btrfs_file_dax_write(iocb, from);
> > +	} else if (iocb->ki_flags & IOCB_DIRECT) {
> >  		num_written = __btrfs_direct_write(iocb, from);
> >  	} else {
> >  		num_written = btrfs_buffered_write(iocb, from);
> > -- 
> > 2.16.4
> > 
> 

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/15] btrfs: create a mount option for dax
  2019-03-28 17:28   ` David Sterba
  2019-03-28 17:57     ` Darrick J. Wong
@ 2019-04-01 20:43     ` Goldwyn Rodrigues
  1 sibling, 0 replies; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-01 20:43 UTC (permalink / raw)
  To: dsterba, linux-btrfs, linux-fsdevel

On 18:28 28/03, David Sterba wrote:
> On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > This sets S_DAX in inode->i_flags, which can be used with
> > IS_DAX().
> > 
> > The dax option is restricted to non multi-device mounts.
> > dax interacts with the device directly instead of using bio, so
> > all bio-hooks which we use for multi-device cannot be performed
> > here. While regular read/writes could be manipulated with
> > RAID0/1, mmap() is still an issue.
> 
> I'm looking for other features that would not work with dax. Disabling
> multiple devices strikes out device add, device delete and device
> replace.

I am glad you brought this up.
> 
> To be verified:
> 
> - balance - at least profile changes must be forbidden
> 
> - defrag

Disabled for now with EOPNOTSUPP primarily because of (null) readpages()
in dax.

> 
> - compression - as it needs COW, it won't work on the nodatacow files,
>   thus seems to be ok

Compression would require some sort of processing before writes. However,
users are writing directly to disk. So no, this can't be supported either.

> 
> - resize/grow - this could work without changes, no block relocation is
>   needed, only new structures created
> 
> - resize/shrink - depends on balance, the block groups from the removed
>   area need to be relocated
> 
> - swapfiles - I'm sure somebody will try that; I haven't seen any
>   IS_DAX checks in the swapfile activation code

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write
  2019-04-01  4:38   ` Dave Chinner
@ 2019-04-01 21:41     ` Goldwyn Rodrigues
  2019-04-01 23:06       ` Dave Chinner
  2019-04-07  7:26     ` Christoph Hellwig
  1 sibling, 1 reply; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-01 21:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-btrfs, linux-fsdevel

On 15:38 01/04, Dave Chinner wrote:
> On Tue, Mar 26, 2019 at 02:02:50PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > The IOMAP_F_COW is a flag to notify dax that it needs to copy
> > the data from iomap->cow_addr to iomap->addr, if the start/end
> > of I/O are not page aligned.
> 
> I see what you are trying to do here, but this is kinda gross.
> 
> > This also introduces dax_to_dax_copy() which performs a copy
> > from one part of the device to another, to a maximum of one page.
> > 
> > Question: Using iomap.cow_addr == 0 means the CoW is to be copied
> > (or memset) from a hole. Would this be better handled through a flag?
> 
> That's what all these checks in the iomap code do:
> 

This is using iomap->flags not type.

> 	if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN)
> 
> Oh, wait, you're trying to map two ranges in a single iomap and then
> infer state from information that got chucked away.... IOWs, you're
> doing it wrong - iomap algorithms are driven by how we manipulate
> iomaps to do data operations efficiently, not how we copy data page
> by page.
> 
> IOWs, what we really should have here is two iomaps - a source
> and a destination iomap. The source is a read mapping of the
> current address (where we are going to copy the data from), the
> destination is the post-cow allocation mapping (where the data
> goes).
> 
> Now you just copy the data from one map to the other iterating
> source mappings until the necessary range of the destination has
> been covered.  And you can check if the source map is IOMAP_HOLE or
> IOMAP_UNWRITTEN and hence optimise the copy (i.e. zero the new
> allocation) before copying in the new data.

Won't that be inefficient? With CoW we only need to write the first
and last block. Again, that is not required if the offset/end
offset is block aligned. After that, it falls back to the
regular write path of performing dax_copy_to_iter().
We don't deal with IOMAP_UNWRITTEN in dax, though other
filesystems in the future may use CoW without dax.

Besides, what you are suggesting will not fit well with the
current iomap iterator code and would require another function
altogether. It is better suited for like file comparison range:
patch [11/16]. This would end up complicating/bloating the code.

> 
> Even better, only do the source mapping if the write isn't
> page/filesystem block aligned, and hence only do the slow path
> copying if a source mapping exists....
> 

The code already does this.
After Darrick's suggestion, we can even do away with cow_pos, so
only the read address of cow_addr will exist.

> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 0fefb5455bda..391785de1428 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -35,6 +35,7 @@ struct vm_fault;
> >  #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
> >  #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
> >  #define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
> > +#define IOMAP_F_COW		0x08	/* cow before write */
> 
> "Copy on write before write"? :)

Yes, I could use a better comment. Thanks for pointing.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write
  2019-04-01 21:41     ` Goldwyn Rodrigues
@ 2019-04-01 23:06       ` Dave Chinner
  2019-04-03  1:56         ` Goldwyn Rodrigues
  0 siblings, 1 reply; 48+ messages in thread
From: Dave Chinner @ 2019-04-01 23:06 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel

On Mon, Apr 01, 2019 at 04:41:02PM -0500, Goldwyn Rodrigues wrote:
> On 15:38 01/04, Dave Chinner wrote:
> > On Tue, Mar 26, 2019 at 02:02:50PM -0500, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > 
> > > The IOMAP_F_COW is a flag to notify dax that it needs to copy
> > > the data from iomap->cow_addr to iomap->addr, if the start/end
> > > of I/O are not page aligned.
> > 
> > I see what you are trying to do here, but this is kinda gross.
> > 
> > > This also introduces dax_to_dax_copy() which performs a copy
> > > from one part of the device to another, to a maximum of one page.
> > > 
> > > Question: Using iomap.cow_addr == 0 means the CoW is to be copied
> > > (or memset) from a hole. Would this be better handled through a flag?
> > 
> > That's what all these checks in the iomap code do:
> > 
> 
> This is using iomap->flags not type.

Yes, I know. The fact that you tell me this (when it was obvious)
indicates to me that you didn't understand what I was saying.

i.e. the gross hack here is that this patch is trying to define a
new iomap type - both behaviourally and iomap content - via adding
a modifier flag rather than defining a new iomap->type. That's the
gross hack, and everything stems from that.

i.e. the "bloating" of the struct iomap is caused because the flag
modifier (IOMAP_F_COW) can't use parts of the iomap that are defined
for specific iomap types. e.g. IOMAP_INLINE type uses ->inline_data,
and so it can't be re-used by a iomap flag modifier such as
IOMAP_F_COW.

However, if we define a new type for this "need multiple mappings"
iomap rather than a flag, we don't need any new fields in the struct
iomap because we can use what already exists in the iomap.

> > 	if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN)
> > 
> > Oh, wait, you're trying to map two ranges in a single iomap and then
> > infer state from information that got chucked away.... IOWs, you're
> > doing it wrong - iomap algorithms are driven by how we manipulate
> > iomaps to do data operations efficiently, not how we copy data page
> > by page.
> > 
> > IOWs, what we really should have here is two iomaps - a source
> > and a destination iomap. The source is a read mapping of the
> > current address (where we are going to copy the data from), the
> > destination is the post-cow allocation mapping (where the data
> > goes).
> > 
> > Now you just copy the data from one map to the other iterating
> > source mappings until the necessary range of the destination has
> > been covered.  And you can check if the source map is IOMAP_HOLE or
> > IOMAP_UNWRITTEN and hence optimise the copy (i.e. zero the new
> > allocation) before copying in the new data.
> 
> Won't that be inefficient? With CoW we only need to write the first
> and last block.

You're assuming that partial data overwrites are the only case where
this dax-to-dax copy of a file range is required. That assumption is
false.

i.e. FALLOC_FL_UNSHARE_RANGE for DAX requires iterating over the
entire source range and copying all the contents to the newly
allocated destination range.  The partial block copying is just a
short version of this limited to a single block.

Sure, btrfs doesn't support FALLOC_FL_UNSHARE_RANGE, but if you're
going to be adding support for reflink to DAX, the infrastructure
needs to provide support for performing FALLOC_FL_UNSHARE_RANGE
to break extent sharing efficiently.

> Again, that is not required if the offset/end
> offset is block aligned. After that, it falls back to the
> regular write path of performing dax_copy_to_iter().
> We don't deal with IOMAP_UNWRITTEN in dax,

Yes we do. fallocate() can lay down unwritten extents, and we can
both read and write to them. See, for example, dax_iomap_actor()
called from dax_iomap_rw() via iomap_apply() - it does not call
dax_copy_to_iter() for reads if the range is IOMAP_HOLE or
IOMAP_UNWRITTEN.

> though other
> filesystems in the future may use CoW without dax.

That makes no sense. :/

> Besides, what you are suggesting will not fit well with the
> current iomap iterator code and would require another function
> altogether.

I'm not concerned about that - I would much prefer we do things
cleanly and properly rather than make expedient hacks for
questionable benefit that we'll have to completely rework or remove
the moment we implement DAX+reflink support in XFS.

> After Darrick's suggestion, we can even do away with cow_pos, so
> only the read address of cow_addr will exist.

As I mentioned earlier, even that is not necessary.

This is DAX - the iomap API and mapping functions can already return
pointers to inline data, and DAX can effectively be considered
inline data for the purposes of reading data.

As I said, the problem here is you are trying to use flags to define
a new type of iomap operation requires two mappings rather than one.
IMO, we should be defining an IOMAP_DAX_COW /type/ and then define
it to contain and behave as follows:

	- new destination region for data to be copied into is the
	  same setup as IOMAP_MAPPED
	- existing shared data that may be needed for reading is
	  mapped direct to device address by ->iomap_begin into
	  iomap->inline_data
	- if the iomap infrastructure needs to copy original source
	  data into destination, it copies directly from the memory
	  address in iomap->inline_data into the directly mapped DAX
	  desitination via memcpy().

This covers both the partial write COW case you are concerned with
here, and the full-extent range copy case that
FALLOC_FL_UNSHARE_RANGE requires.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write
  2019-04-01 23:06       ` Dave Chinner
@ 2019-04-03  1:56         ` Goldwyn Rodrigues
  2019-04-03  3:20           ` Dave Chinner
  0 siblings, 1 reply; 48+ messages in thread
From: Goldwyn Rodrigues @ 2019-04-03  1:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-btrfs, linux-fsdevel

On 10:06 02/04, Dave Chinner wrote:
> On Mon, Apr 01, 2019 at 04:41:02PM -0500, Goldwyn Rodrigues wrote:
> > On 15:38 01/04, Dave Chinner wrote:
> > > On Tue, Mar 26, 2019 at 02:02:50PM -0500, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > 
> > > > The IOMAP_F_COW is a flag to notify dax that it needs to copy
> > > > the data from iomap->cow_addr to iomap->addr, if the start/end
> > > > of I/O are not page aligned.
> > > 
> > > I see what you are trying to do here, but this is kinda gross.
> > > 
> > > > This also introduces dax_to_dax_copy() which performs a copy
> > > > from one part of the device to another, to a maximum of one page.
> > > > 
> > > > Question: Using iomap.cow_addr == 0 means the CoW is to be copied
> > > > (or memset) from a hole. Would this be better handled through a flag?
> > > 
> > > That's what all these checks in the iomap code do:
> > > 
> > 
> > This is using iomap->flags not type.
> 
> Yes, I know. The fact that you tell me this (when it was obvious)
> indicates to me that you didn't understand what I was saying.
> 
> i.e. the gross hack here is that this patch is trying to define a
> new iomap type - both behaviourally and iomap content - via adding
> a modifier flag rather than defining a new iomap->type. That's the
> gross hack, and everything stems from that.
> 
> i.e. the "bloating" of the struct iomap is caused because the flag
> modifier (IOMAP_F_COW) can't use parts of the iomap that are defined
> for specific iomap types. e.g. IOMAP_INLINE type uses ->inline_data,
> and so it can't be re-used by a iomap flag modifier such as
> IOMAP_F_COW.
> 
> However, if we define a new type for this "need multiple mappings"
> iomap rather than a flag, we don't need any new fields in the struct
> iomap because we can use what already exists in the iomap.
> 
> > > 	if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN)
> > > 
> > > Oh, wait, you're trying to map two ranges in a single iomap and then
> > > infer state from information that got chucked away.... IOWs, you're
> > > doing it wrong - iomap algorithms are driven by how we manipulate
> > > iomaps to do data operations efficiently, not how we copy data page
> > > by page.
> > > 
> > > IOWs, what we really should have here is two iomaps - a source
> > > and a destination iomap. The source is a read mapping of the
> > > current address (where we are going to copy the data from), the
> > > destination is the post-cow allocation mapping (where the data
> > > goes).
> > > 
> > > Now you just copy the data from one map to the other iterating
> > > source mappings until the necessary range of the destination has
> > > been covered.  And you can check if the source map is IOMAP_HOLE or
> > > IOMAP_UNWRITTEN and hence optimise the copy (i.e. zero the new
> > > allocation) before copying in the new data.
> > 
> > Won't that be inefficient? With CoW we only need to write the first
> > and last block.
> 
> You're assuming that partial data overwrites are the only case where
> this dax-to-dax copy of a file range is required. That assumption is
> false.
> 
> i.e. FALLOC_FL_UNSHARE_RANGE for DAX requires iterating over the
> entire source range and copying all the contents to the newly
> allocated destination range.  The partial block copying is just a
> short version of this limited to a single block.
> 
> Sure, btrfs doesn't support FALLOC_FL_UNSHARE_RANGE, but if you're
> going to be adding support for reflink to DAX, the infrastructure
> needs to provide support for performing FALLOC_FL_UNSHARE_RANGE
> to break extent sharing efficiently.
> 
> > Again, that is not required if the offset/end
> > offset is block aligned. After that, it falls back to the
> > regular write path of performing dax_copy_to_iter().
> > We don't deal with IOMAP_UNWRITTEN in dax,
> 
> Yes we do. fallocate() can lay down unwritten extents, and we can
> both read and write to them. See, for example, dax_iomap_actor()
> called from dax_iomap_rw() via iomap_apply() - it does not call
> dax_copy_to_iter() for reads if the range is IOMAP_HOLE or
> IOMAP_UNWRITTEN.
> 
> > though other
> > filesystems in the future may use CoW without dax.
> 
> That makes no sense. :/
> 
> > Besides, what you are suggesting will not fit well with the
> > current iomap iterator code and would require another function
> > altogether.
> 
> I'm not concerned about that - I would much prefer we do things
> cleanly and properly rather than make expedient hacks for
> questionable benefit that we'll have to completely rework or remove
> the moment we implement DAX+reflink support in XFS.
> 
> > After Darrick's suggestion, we can even do away with cow_pos, so
> > only the read address of cow_addr will exist.
> 
> As I mentioned earlier, even that is not necessary.
> 
> This is DAX - the iomap API and mapping functions can already return
> pointers to inline data, and DAX can effectively be considered
> inline data for the purposes of reading data.
> 
> As I said, the problem here is you are trying to use flags to define
> a new type of iomap operation requires two mappings rather than one.
> IMO, we should be defining an IOMAP_DAX_COW /type/ and then define
> it to contain and behave as follows:
> 
> 	- new destination region for data to be copied into is the
> 	  same setup as IOMAP_MAPPED
> 	- existing shared data that may be needed for reading is
> 	  mapped direct to device address by ->iomap_begin into
> 	  iomap->inline_data
> 	- if the iomap infrastructure needs to copy original source
> 	  data into destination, it copies directly from the memory
> 	  address in iomap->inline_data into the directly mapped DAX
> 	  desitination via memcpy().
> 
> This covers both the partial write COW case you are concerned with
> here, and the full-extent range copy case that
> FALLOC_FL_UNSHARE_RANGE requires.
> 

Ok, understood. However, we may have to differentiate between a CoW
with FALLOC_FL_UNSHARE_RANGE because CoW will only (conditionally)
copy the first and the last block wheras FALLOC_FL_UNSHARE_RANGE
will copy the whole range. So, should it be another type?

Also, this would require iomap_begin to translate block->dax address
which I suppose can be a function in the dax code.


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write
  2019-04-03  1:56         ` Goldwyn Rodrigues
@ 2019-04-03  3:20           ` Dave Chinner
  0 siblings, 0 replies; 48+ messages in thread
From: Dave Chinner @ 2019-04-03  3:20 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel

On Tue, Apr 02, 2019 at 08:56:31PM -0500, Goldwyn Rodrigues wrote:
> On 10:06 02/04, Dave Chinner wrote:
> > On Mon, Apr 01, 2019 at 04:41:02PM -0500, Goldwyn Rodrigues wrote:
> > > After Darrick's suggestion, we can even do away with cow_pos, so
> > > only the read address of cow_addr will exist.
> > 
> > As I mentioned earlier, even that is not necessary.
> > 
> > This is DAX - the iomap API and mapping functions can already return
> > pointers to inline data, and DAX can effectively be considered
> > inline data for the purposes of reading data.
> > 
> > As I said, the problem here is you are trying to use flags to define
> > a new type of iomap operation requires two mappings rather than one.
> > IMO, we should be defining an IOMAP_DAX_COW /type/ and then define
> > it to contain and behave as follows:
> > 
> > 	- new destination region for data to be copied into is the
> > 	  same setup as IOMAP_MAPPED
> > 	- existing shared data that may be needed for reading is
> > 	  mapped direct to device address by ->iomap_begin into
> > 	  iomap->inline_data
> > 	- if the iomap infrastructure needs to copy original source
> > 	  data into destination, it copies directly from the memory
> > 	  address in iomap->inline_data into the directly mapped DAX
> > 	  desitination via memcpy().
> > 
> > This covers both the partial write COW case you are concerned with
> > here, and the full-extent range copy case that
> > FALLOC_FL_UNSHARE_RANGE requires.
> > 
> 
> Ok, understood. However, we may have to differentiate between a CoW
> with FALLOC_FL_UNSHARE_RANGE because CoW will only (conditionally)
> copy the first and the last block wheras FALLOC_FL_UNSHARE_RANGE
> will copy the whole range. So, should it be another type?

I don't see why that would be necessary. The calling contexts are
completely different, but both know exactly what they have to copy
from the iomap's inline data. They fact they copy different parts of
data from the iomap doesn't mean we need different types of iomaps
for them...

> Also, this would require iomap_begin to translate block->dax address
> which I suppose can be a function in the dax code.

bdev_dax_pgoff() + dax_direct_access() should already provide what
we need here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write
  2019-04-01  4:38   ` Dave Chinner
  2019-04-01 21:41     ` Goldwyn Rodrigues
@ 2019-04-07  7:26     ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2019-04-07  7:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Goldwyn Rodrigues, linux-btrfs, linux-fsdevel, Goldwyn Rodrigues

On Mon, Apr 01, 2019 at 03:38:52PM +1100, Dave Chinner wrote:
> On Tue, Mar 26, 2019 at 02:02:50PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > The IOMAP_F_COW is a flag to notify dax that it needs to copy
> > the data from iomap->cow_addr to iomap->addr, if the start/end
> > of I/O are not page aligned.
> 
> I see what you are trying to do here, but this is kinda gross.

Even more gross than our current hack where for buffered I/O
we return the previous address but still allocate delalloc blocks
for the COW? ;-)

This whole area is a mine field..

> Oh, wait, you're trying to map two ranges in a single iomap and then
> infer state from information that got chucked away.... IOWs, you're
> doing it wrong - iomap algorithms are driven by how we manipulate
> iomaps to do data operations efficiently, not how we copy data page
> by page.
> 
> IOWs, what we really should have here is two iomaps - a source
> and a destination iomap. The source is a read mapping of the
> current address (where we are going to copy the data from), the
> destination is the post-cow allocation mapping (where the data
> goes).
> 
> Now you just copy the data from one map to the other iterating
> source mappings until the necessary range of the destination has
> been covered.  And you can check if the source map is IOMAP_HOLE or
> IOMAP_UNWRITTEN and hence optimise the copy (i.e. zero the new
> allocation) before copying in the new data.
> 
> Even better, only do the source mapping if the write isn't
> page/filesystem block aligned, and hence only do the slow path
> copying if a source mapping exists....

I looked into the do two maps instead of the mentioned above existing
gross hack when doing the buffer_head removal for iomap.  The major
issue with it is that we need a way for the second recursive mapping
to not take the inode-wide locks.  I still think it is worth it
(or move to per-fork or range locks) as it defintively is the much
cleaner way to handle it.  I just didn't bother to go there yet
as I wanted the buffer_head removal and thus kept the existing gross
hack for now.  But moving to a two mapping scheme for read-modify-write
ops would be a major improvement for sure.

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2019-04-07  7:28 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20190326190301.32365-1-rgoldwyn@suse.de>
2019-03-26 19:02 ` [PATCH 01/15] btrfs: create a mount option for dax Goldwyn Rodrigues
2019-03-26 19:10   ` Matthew Wilcox
2019-03-27 11:00     ` Goldwyn Rodrigues
2019-03-27 12:00       ` Matthew Wilcox
2019-03-27 12:26         ` Goldwyn Rodrigues
2019-03-27 23:31         ` Goldwyn Rodrigues
2019-03-27 17:38     ` Adam Borowski
2019-03-28 14:49   ` David Sterba
2019-03-28 17:28   ` David Sterba
2019-03-28 17:57     ` Darrick J. Wong
2019-04-01 20:43     ` Goldwyn Rodrigues
2019-03-26 19:02 ` [PATCH 02/15] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
2019-03-26 19:02 ` [PATCH 03/15] btrfs: basic dax read Goldwyn Rodrigues
2019-03-26 19:02 ` [PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write Goldwyn Rodrigues
2019-03-27 17:54   ` Darrick J. Wong
2019-03-27 18:58     ` Goldwyn Rodrigues
2019-03-28 14:45       ` Darrick J. Wong
2019-04-01  4:38   ` Dave Chinner
2019-04-01 21:41     ` Goldwyn Rodrigues
2019-04-01 23:06       ` Dave Chinner
2019-04-03  1:56         ` Goldwyn Rodrigues
2019-04-03  3:20           ` Dave Chinner
2019-04-07  7:26     ` Christoph Hellwig
2019-03-26 19:02 ` [PATCH 05/15] btrfs: return whether extent is nocow or not Goldwyn Rodrigues
2019-03-31 18:42   ` Nikolay Borisov
2019-03-26 19:02 ` [PATCH 06/15] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
2019-03-26 19:02 ` [PATCH 07/15] btrfs: add dax write support Goldwyn Rodrigues
2019-03-28 14:53   ` Darrick J. Wong
2019-04-01 20:39     ` Goldwyn Rodrigues
2019-03-26 19:02 ` [PATCH 08/15] dax: add dax_iomap_cow to copy a mmap page before writing Goldwyn Rodrigues
2019-03-28 15:41   ` Darrick J. Wong
2019-03-26 19:02 ` [PATCH 09/15] btrfs: add dax mmap support Goldwyn Rodrigues
2019-03-28 15:45   ` Darrick J. Wong
2019-03-26 19:02 ` [PATCH 10/15] btrfs: Add dax specific address_space_operations Goldwyn Rodrigues
2019-03-26 19:02 ` [PATCH 11/15] fs: dedup file range to use a compare function Goldwyn Rodrigues
2019-03-28 17:04   ` Darrick J. Wong
2019-04-01 20:36     ` Goldwyn Rodrigues
2019-03-26 19:02 ` [PATCH 12/15] btrfs: trace functions for btrfs_iomap_begin/end Goldwyn Rodrigues
2019-03-26 19:02 ` [PATCH 13/15] btrfs: handle dax page zeroing Goldwyn Rodrigues
2019-03-26 19:03 ` [PATCH 14/15] btrfs: Disable dax-based defrag and send Goldwyn Rodrigues
2019-03-26 19:03 ` [PATCH 15/15] btrfs: Writeprotect mmap pages on snapshot Goldwyn Rodrigues
2019-03-28 15:48   ` Darrick J. Wong
2019-03-26 19:09 ` [PATCH v2 00/15] btrfs dax support Goldwyn Rodrigues
2019-03-27 20:14   ` Adam Borowski
2019-03-27 23:26     ` Goldwyn Rodrigues
2019-03-28 10:24       ` [PATCH] btrfs: allow MAP_SYNC mmap Adam Borowski
2019-03-28 10:42         ` Adam Borowski
2019-04-01 20:08         ` Goldwyn Rodrigues

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).