* [PATCH 0/6] Btrfs iomap
@ 2019-06-21 19:28 Goldwyn Rodrigues
  2019-06-21 19:28 ` [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O Goldwyn Rodrigues
                   ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-21 19:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, david

This is an effort to use iomap with btrfs. It moves most of the
responsibility for page handling during writes into the iomap code,
which reduces the amount of btrfs code. For CoW support, changes are
needed in the iomap code to make sure we copy in the existing data
before the write. This is in line with the discussion we had while
adding dax support to btrfs.
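
With these patches the buffered write path looks roughly like this
(simplified call chain based on the patches below, error handling
omitted):

  iomap_file_buffered_write(iocb, iter, ops)
    iomap_apply(inode, pos, len, IOMAP_WRITE, ops, iter, iomap_write_actor)
      ops->iomap_begin(inode, pos, len, flags, &iomap, &srcmap)
      iomap_write_actor(inode, pos, len, iter, &iomap, &srcmap)
        iomap_write_begin(inode, pos, bytes, flags, &page, iomap, srcmap)
          /* for IOMAP_COW the old data is read from srcmap before the
             new data is copied into the page */
          __iomap_write_begin(inode, pos, len, page, srcmap)
      ops->iomap_end(inode, pos, len, written, flags, &iomap)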

Efforts on adding dax support have been put on hold until the MM
experts can come up with a way of performing multiple mappings to a
single page (primarily the TODO before dax_associate_entry()). While
we wait on that, we can add support for buffered writes in btrfs.

I have managed to add iomap-based Direct I/O support as well, but it
still needs to be refined with respect to locking. If you are
interested in that effort (built on top of this one), it is available
at [1].

[1] https://github.com/goldwynr/linux/tree/btrfs-iomap


 fs/btrfs/Makefile     |    2 
 fs/btrfs/ctree.h      |    1 
 fs/btrfs/file.c       |  468 --------------------------------------------------
 fs/btrfs/iomap.c      |  430 +++++++++++++++++++++++++++++++++++++++++++++
 fs/dax.c              |    8 
 fs/ext2/inode.c       |    2 
 fs/ext4/inode.c       |    2 
 fs/gfs2/bmap.c        |    3 
 fs/internal.h         |    2 
 fs/iomap.c            |   45 ++--
 fs/xfs/xfs_iomap.c    |    9 
 include/linux/iomap.h |    7 
 12 files changed, 479 insertions(+), 500 deletions(-)




* [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-06-21 19:28 [PATCH 0/6] Btrfs iomap Goldwyn Rodrigues
@ 2019-06-21 19:28 ` Goldwyn Rodrigues
  2019-06-22  0:46   ` Darrick J. Wong
  2019-06-24  7:07   ` Christoph Hellwig
  2019-06-21 19:28 ` [PATCH 2/6] iomap: Read page from srcmap for IOMAP_COW Goldwyn Rodrigues
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-21 19:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, david, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Introduce a new type, IOMAP_COW, which means the data at the offset
must be read from a srcmap and copied before performing the write at
the offset.

The srcmap is used to identify where the read is to be performed
from. It is passed to ->iomap_begin(), which is expected to fill in
the details for reading, typically set with type IOMAP_READ.
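
For illustration, a filesystem's ->iomap_begin could ask for a
read-modify-write like this (example_iomap_begin() and the disk
address are made up, this is not part of the series):

static int example_iomap_begin(struct inode *inode, loff_t pos,
		loff_t length, unsigned flags, struct iomap *iomap,
		struct iomap *srcmap)
{
	/* Destination of the write: newly allocated/delalloc blocks. */
	iomap->offset = round_down(pos, PAGE_SIZE);
	iomap->length = round_up(pos + length, PAGE_SIZE) - iomap->offset;
	iomap->type = IOMAP_COW;
	iomap->addr = IOMAP_NULL_ADDR;

	/* Source the generic code reads the existing data from. */
	srcmap->offset = iomap->offset;
	srcmap->length = iomap->length;
	srcmap->type = IOMAP_MAPPED;
	srcmap->addr = 100 * PAGE_SIZE;		/* made-up disk address */

	return 0;
}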

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/dax.c              |  8 +++++---
 fs/ext2/inode.c       |  2 +-
 fs/ext4/inode.c       |  2 +-
 fs/gfs2/bmap.c        |  3 ++-
 fs/internal.h         |  2 +-
 fs/iomap.c            | 31 ++++++++++++++++---------------
 fs/xfs/xfs_iomap.c    |  9 ++++++---
 include/linux/iomap.h |  4 +++-
 8 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 2e48c7ebb973..80b9e2599223 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1078,7 +1078,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, struct iomap *srcmap)
 {
 	struct block_device *bdev = iomap->bdev;
 	struct dax_device *dax_dev = iomap->dax_dev;
@@ -1236,6 +1236,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	unsigned long vaddr = vmf->address;
 	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
 	struct iomap iomap = { 0 };
+	struct iomap srcmap = { 0 };
 	unsigned flags = IOMAP_FAULT;
 	int error, major = 0;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -1280,7 +1281,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	 * the file system block size to be equal the page size, which means
 	 * that we never have to deal with more than a single extent here.
 	 */
-	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
+	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap, &srcmap);
 	if (iomap_errp)
 		*iomap_errp = error;
 	if (error) {
@@ -1460,6 +1461,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	struct inode *inode = mapping->host;
 	vm_fault_t result = VM_FAULT_FALLBACK;
 	struct iomap iomap = { 0 };
+	struct iomap srcmap = { 0 };
 	pgoff_t max_pgoff;
 	void *entry;
 	loff_t pos;
@@ -1534,7 +1536,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	 * to look up our filesystem block.
 	 */
 	pos = (loff_t)xas.xa_index << PAGE_SHIFT;
-	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
+	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap, &srcmap);
 	if (error)
 		goto unlock_entry;
 
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index e474127dd255..f081f11980ad 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -801,7 +801,7 @@ int ext2_get_block(struct inode *inode, sector_t iblock,
 
 #ifdef CONFIG_FS_DAX
 static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
-		unsigned flags, struct iomap *iomap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	unsigned int blkbits = inode->i_blkbits;
 	unsigned long first_block = offset >> blkbits;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c7f77c643008..a8017e0c302b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3437,7 +3437,7 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
 }
 
 static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
-			    unsigned flags, struct iomap *iomap)
+			    unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	unsigned int blkbits = inode->i_blkbits;
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 93ea1d529aa3..affa0c4305b7 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1124,7 +1124,8 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 }
 
 static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
-			    unsigned flags, struct iomap *iomap)
+			    unsigned flags, struct iomap *iomap,
+			    struct iomap *srcmap)
 {
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct metapath mp = { .mp_aheight = 1, };
diff --git a/fs/internal.h b/fs/internal.h
index a48ef81be37d..79e495d86165 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -188,7 +188,7 @@ extern int do_vfs_ioctl(struct file *file, unsigned int fd, unsigned int cmd,
  * iomap support:
  */
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
-		void *data, struct iomap *iomap);
+		void *data, struct iomap *iomap, struct iomap *srcmap);
 
 loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 		unsigned flags, const struct iomap_ops *ops, void *data,
diff --git a/fs/iomap.c b/fs/iomap.c
index 23ef63fd1669..6648957af268 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -41,6 +41,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 		const struct iomap_ops *ops, void *data, iomap_actor_t actor)
 {
 	struct iomap iomap = { 0 };
+	struct iomap srcmap = { 0 };
 	loff_t written = 0, ret;
 
 	/*
@@ -55,7 +56,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	 * expose transient stale data. If the reserve fails, we can safely
 	 * back out at this point as there is nothing to undo.
 	 */
-	ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
+	ret = ops->iomap_begin(inode, pos, length, flags, &iomap, &srcmap);
 	if (ret)
 		return ret;
 	if (WARN_ON(iomap.offset > pos))
@@ -75,7 +76,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	 * we can do the copy-in page by page without having to worry about
 	 * failures exposing transient data.
 	 */
-	written = actor(inode, pos, length, data, &iomap);
+	written = actor(inode, pos, length, data, &iomap, &srcmap);
 
 	/*
 	 * Now the data has been copied, commit the range we've copied.  This
@@ -282,7 +283,7 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
 
 static loff_t
 iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
 	struct page *page = ctx->cur_page;
@@ -424,7 +425,7 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
 
 static loff_t
 iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap)
+		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
 	loff_t done, ret;
@@ -444,7 +445,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 			ctx->cur_page_in_bio = false;
 		}
 		ret = iomap_readpage_actor(inode, pos + done, length - done,
-				ctx, iomap);
+				ctx, iomap, srcmap);
 	}
 
 	return done;
@@ -796,7 +797,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iov_iter *i = data;
 	long status = 0;
@@ -913,7 +914,7 @@ __iomap_read_page(struct inode *inode, loff_t offset)
 
 static loff_t
 iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, struct iomap *srcmap)
 {
 	long status = 0;
 	ssize_t written = 0;
@@ -1002,7 +1003,7 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
 
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-		void *data, struct iomap *iomap)
+		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	bool *did_zero = data;
 	loff_t written = 0;
@@ -1071,7 +1072,7 @@ EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
 static loff_t
 iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap)
+		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct page *page = data;
 	int ret;
@@ -1169,7 +1170,7 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
 
 static loff_t
 iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, struct iomap *srcmap)
 {
 	struct fiemap_ctx *ctx = data;
 	loff_t ret = length;
@@ -1343,7 +1344,7 @@ page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
 
 static loff_t
 iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap)
+		      void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	switch (iomap->type) {
 	case IOMAP_UNWRITTEN:
@@ -1389,7 +1390,7 @@ EXPORT_SYMBOL_GPL(iomap_seek_hole);
 
 static loff_t
 iomap_seek_data_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap)
+		      void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	switch (iomap->type) {
 	case IOMAP_HOLE:
@@ -1790,7 +1791,7 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 
 static loff_t
 iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap)
+		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_dio *dio = data;
 
@@ -2057,7 +2058,7 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi)
  * distinction between written and unwritten extents.
  */
 static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
-		loff_t count, void *data, struct iomap *iomap)
+		loff_t count, void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_swapfile_info *isi = data;
 	int error;
@@ -2161,7 +2162,7 @@ EXPORT_SYMBOL_GPL(iomap_swapfile_activate);
 
 static loff_t
 iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap)
+		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	sector_t *bno = data, addr;
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 63d323916bba..6116a75f03da 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -925,7 +925,8 @@ xfs_file_iomap_begin(
 	loff_t			offset,
 	loff_t			length,
 	unsigned		flags,
-	struct iomap		*iomap)
+	struct iomap		*iomap,
+	struct iomap		*srcmap)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1148,7 +1149,8 @@ xfs_seek_iomap_begin(
 	loff_t			offset,
 	loff_t			length,
 	unsigned		flags,
-	struct iomap		*iomap)
+	struct iomap		*iomap,
+	struct iomap		*srcmap)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1234,7 +1236,8 @@ xfs_xattr_iomap_begin(
 	loff_t			offset,
 	loff_t			length,
 	unsigned		flags,
-	struct iomap		*iomap)
+	struct iomap		*iomap,
+	struct iomap		*srcmap)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 2103b94cb1bf..f49767c7fd83 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -25,6 +25,7 @@ struct vm_fault;
 #define IOMAP_MAPPED	0x03	/* blocks allocated at @addr */
 #define IOMAP_UNWRITTEN	0x04	/* blocks allocated at @addr in unwritten state */
 #define IOMAP_INLINE	0x05	/* data inline in the inode */
+#define IOMAP_COW	0x06	/* copy data from srcmap before writing */
 
 /*
  * Flags for all iomap mappings:
@@ -102,7 +103,8 @@ struct iomap_ops {
 	 * The actual length is returned in iomap->length.
 	 */
 	int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
-			unsigned flags, struct iomap *iomap);
+			unsigned flags, struct iomap *iomap,
+			struct iomap *srcmap);
 
 	/*
 	 * Commit and/or unreserve space previous allocated using iomap_begin.
-- 
2.16.4



* [PATCH 2/6] iomap: Read page from srcmap for IOMAP_COW
  2019-06-21 19:28 [PATCH 0/6] Btrfs iomap Goldwyn Rodrigues
  2019-06-21 19:28 ` [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O Goldwyn Rodrigues
@ 2019-06-21 19:28 ` Goldwyn Rodrigues
  2019-06-22  0:41   ` Darrick J. Wong
  2019-06-21 19:28 ` [PATCH 3/6] iomap: Check iblocksize before transforming page->private Goldwyn Rodrigues
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-21 19:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, david, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/iomap.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 6648957af268..8a7b20e432ef 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -655,7 +655,7 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
 
 static int
 iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
-		struct page **pagep, struct iomap *iomap)
+		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
 {
 	const struct iomap_page_ops *page_ops = iomap->page_ops;
 	pgoff_t index = pos >> PAGE_SHIFT;
@@ -681,6 +681,8 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 
 	if (iomap->type == IOMAP_INLINE)
 		iomap_read_inline_data(inode, page, iomap);
+	else if (iomap->type == IOMAP_COW)
+		status = __iomap_write_begin(inode, pos, len, page, srcmap);
 	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, iomap);
 	else
@@ -833,7 +835,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		}
 
 		status = iomap_write_begin(inode, pos, bytes, flags, &page,
-				iomap);
+				iomap, srcmap);
 		if (unlikely(status))
 			break;
 
@@ -932,7 +934,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			return PTR_ERR(rpage);
 
 		status = iomap_write_begin(inode, pos, bytes,
-					   AOP_FLAG_NOFS, &page, iomap);
+					   AOP_FLAG_NOFS, &page, iomap, srcmap);
 		put_page(rpage);
 		if (unlikely(status))
 			return status;
@@ -978,13 +980,13 @@ iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
 EXPORT_SYMBOL_GPL(iomap_file_dirty);
 
 static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
-		unsigned bytes, struct iomap *iomap)
+		unsigned bytes, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct page *page;
 	int status;
 
 	status = iomap_write_begin(inode, pos, bytes, AOP_FLAG_NOFS, &page,
-				   iomap);
+				   iomap, srcmap);
 	if (status)
 		return status;
 
@@ -1022,7 +1024,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		if (IS_DAX(inode))
 			status = iomap_dax_zero(pos, offset, bytes, iomap);
 		else
-			status = iomap_zero(inode, pos, offset, bytes, iomap);
+			status = iomap_zero(inode, pos, offset, bytes, iomap, srcmap);
 		if (status < 0)
 			return status;
 
-- 
2.16.4



* [PATCH 3/6] iomap: Check iblocksize before transforming page->private
  2019-06-21 19:28 [PATCH 0/6] Btrfs iomap Goldwyn Rodrigues
  2019-06-21 19:28 ` [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O Goldwyn Rodrigues
  2019-06-21 19:28 ` [PATCH 2/6] iomap: Read page from srcmap for IOMAP_COW Goldwyn Rodrigues
@ 2019-06-21 19:28 ` Goldwyn Rodrigues
  2019-06-22  0:21   ` Darrick J. Wong
  2019-06-24  7:05   ` Christoph Hellwig
  2019-06-21 19:28 ` [PATCH 4/6] btrfs: Add a simple buffered iomap write Goldwyn Rodrigues
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-21 19:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, david, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs also uses page->private, to store an extent_buffer pointer.
Make the check stricter to make sure page->private really holds an
iomap_page (iop) by checking that i_blocksize() < PAGE_SIZE.
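
To make the ambiguity concrete, a simplified sketch (not the exact
helpers) of the two users of page->private that the check has to tell
apart:

/* btrfs metadata pages (blocksize == PAGE_SIZE): page->private holds
 * an extent_buffer pointer. */
SetPagePrivate(page);
set_page_private(page, (unsigned long)eb);

/* iomap sub-page state (blocksize < PAGE_SIZE): page->private holds a
 * struct iomap_page pointer. */
SetPagePrivate(page);
set_page_private(page, (unsigned long)iop);

/* So to_iomap_page() may only interpret page->private as an iomap_page
 * when i_blocksize(inode) < PAGE_SIZE. */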

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 include/linux/iomap.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f49767c7fd83..6511124e58b6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -128,7 +128,8 @@ struct iomap_page {
 
 static inline struct iomap_page *to_iomap_page(struct page *page)
 {
-	if (page_has_private(page))
+	if (i_blocksize(page->mapping->host) < PAGE_SIZE &&
+			page_has_private(page))
 		return (struct iomap_page *)page_private(page);
 	return NULL;
 }
-- 
2.16.4



* [PATCH 4/6] btrfs: Add a simple buffered iomap write
  2019-06-21 19:28 [PATCH 0/6] Btrfs iomap Goldwyn Rodrigues
                   ` (2 preceding siblings ...)
  2019-06-21 19:28 ` [PATCH 3/6] iomap: Check iblocksize before transforming page->private Goldwyn Rodrigues
@ 2019-06-21 19:28 ` Goldwyn Rodrigues
  2019-06-21 19:28 ` [PATCH 5/6] btrfs: Add CoW in iomap based writes Goldwyn Rodrigues
  2019-06-21 19:28 ` [PATCH 6/6] btrfs: remove buffered write code made unnecessary Goldwyn Rodrigues
  5 siblings, 0 replies; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-21 19:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, david, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Introduce a new structure, btrfs_iomap, private to fs/btrfs/iomap.c,
which carries information from iomap_begin() to iomap_end(). It
primarily contains reservation and extent locking information.
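
Stripped down to just that handoff (illustrative skeleton only, the
actual reservation and locking work is in the patch below):

static int btrfs_buffered_iomap_begin(struct inode *inode, loff_t pos,
		loff_t length, unsigned flags, struct iomap *iomap,
		struct iomap *srcmap)
{
	struct btrfs_iomap *bi = kzalloc(sizeof(*bi), GFP_NOFS);

	if (!bi)
		return -ENOMEM;
	/* reserve data + metadata space, lock the extent range ... */
	iomap->private = bi;	/* carried by the core to ->iomap_end() */
	return 0;
}

static int btrfs_buffered_iomap_end(struct inode *inode, loff_t pos,
		loff_t length, ssize_t written, unsigned flags,
		struct iomap *iomap)
{
	struct btrfs_iomap *bi = iomap->private;

	/* release unused reservations, unlock the extent range ... */
	kfree(bi);
	return 0;
}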

This is a long patch. Most of the code is "inspired" by
fs/btrfs/file.c. To keep its size down, all removals are done in the
following patches.

Patch inclusion / coding style question -
From a code-readability PoV, I prefer lots of small functions over one
big function. I would like to break this down further, but since it is
a 1-1 mapping with file.c, I have left it as is. Would you prefer the
split into smaller functions in this same patch, or as a separate
patch (re)modifying this code?

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/Makefile |   2 +-
 fs/btrfs/ctree.h  |   1 +
 fs/btrfs/file.c   |   4 +-
 fs/btrfs/iomap.c  | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 393 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/iomap.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index ca693dd554e9..bfc1954204cf 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-	   uuid-tree.o props.o free-space-tree.o tree-checker.o
+	   uuid-tree.o props.o free-space-tree.o tree-checker.o iomap.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0a61dff27f57..80d37dfff873 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3348,6 +3348,7 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 loff_t btrfs_remap_file_range(struct file *file_in, loff_t pos_in,
 			      struct file *file_out, loff_t pos_out,
 			      loff_t len, unsigned int remap_flags);
+size_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 89f5be2bfb43..fc3032d8b573 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1839,7 +1839,7 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		return written;
 
 	pos = iocb->ki_pos;
-	written_buffered = btrfs_buffered_write(iocb, from);
+	written_buffered = btrfs_buffered_iomap_write(iocb, from);
 	if (written_buffered < 0) {
 		err = written_buffered;
 		goto out;
@@ -1976,7 +1976,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		num_written = __btrfs_direct_write(iocb, from);
 	} else {
-		num_written = btrfs_buffered_write(iocb, from);
+		num_written = btrfs_buffered_iomap_write(iocb, from);
 		if (num_written > 0)
 			iocb->ki_pos = pos + num_written;
 		if (clean_page)
diff --git a/fs/btrfs/iomap.c b/fs/btrfs/iomap.c
new file mode 100644
index 000000000000..0435b179d461
--- /dev/null
+++ b/fs/btrfs/iomap.c
@@ -0,0 +1,389 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * iomap support for BTRFS
+ *
+ * Copyright (c) 2019  SUSE Linux
+ * Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
+ */
+
+#include <linux/iomap.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+#include "volumes.h"
+#include "disk-io.h"
+
+struct btrfs_iomap {
+	u64 start;
+	u64 end;
+	bool nocow;
+	int extents_locked;
+	ssize_t reserved_bytes;
+	struct extent_changeset *data_reserved;
+	struct extent_state *cached_state;
+};
+
+
+/*
+ * This function locks the extent and properly waits for data=ordered extents
+ * to finish before allowing the pages to be modified if need.
+ *
+ * The return value:
+ * 1 - the extent is locked
+ * 0 - the extent is not locked, and everything is OK
+ * -EAGAIN - need re-prepare the pages
+ * the other < 0 number - Something wrong happens
+ */
+static noinline int
+lock_and_cleanup_extent(struct btrfs_inode *inode, loff_t pos,
+			 size_t write_bytes,
+			 u64 *lockstart, u64 *lockend,
+			 struct extent_state **cached_state)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	u64 start_pos;
+	u64 last_pos;
+	int ret = 0;
+
+	start_pos = round_down(pos, fs_info->sectorsize);
+	last_pos = start_pos
+		+ round_up(pos + write_bytes - start_pos,
+			   fs_info->sectorsize) - 1;
+
+	if (start_pos < inode->vfs_inode.i_size) {
+		struct btrfs_ordered_extent *ordered;
+
+		lock_extent_bits(&inode->io_tree, start_pos, last_pos,
+				cached_state);
+		ordered = btrfs_lookup_ordered_range(inode, start_pos,
+						     last_pos - start_pos + 1);
+		if (ordered &&
+		    ordered->file_offset + ordered->len > start_pos &&
+		    ordered->file_offset <= last_pos) {
+			unlock_extent_cached(&inode->io_tree, start_pos,
+					last_pos, cached_state);
+			btrfs_start_ordered_extent(&inode->vfs_inode,
+					ordered, 1);
+			btrfs_put_ordered_extent(ordered);
+			return -EAGAIN;
+		}
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+
+		*lockstart = start_pos;
+		*lockend = last_pos;
+		ret = 1;
+	}
+
+	return ret;
+}
+
+static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos,
+				    size_t *write_bytes)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct btrfs_root *root = inode->root;
+	struct btrfs_ordered_extent *ordered;
+	u64 lockstart, lockend;
+	u64 num_bytes;
+	int ret;
+
+	ret = btrfs_start_write_no_snapshotting(root);
+	if (!ret)
+		return -ENOSPC;
+
+	lockstart = round_down(pos, fs_info->sectorsize);
+	lockend = round_up(pos + *write_bytes,
+			   fs_info->sectorsize) - 1;
+
+	while (1) {
+		lock_extent(&inode->io_tree, lockstart, lockend);
+		ordered = btrfs_lookup_ordered_range(inode, lockstart,
+						     lockend - lockstart + 1);
+		if (!ordered) {
+			break;
+		}
+		unlock_extent(&inode->io_tree, lockstart, lockend);
+		btrfs_start_ordered_extent(&inode->vfs_inode, ordered, 1);
+		btrfs_put_ordered_extent(ordered);
+	}
+
+	num_bytes = lockend - lockstart + 1;
+	ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
+			NULL, NULL, NULL);
+	if (ret <= 0) {
+		ret = 0;
+		btrfs_end_write_no_snapshotting(root);
+	} else {
+		*write_bytes = min_t(size_t, *write_bytes ,
+				     num_bytes - pos + lockstart);
+	}
+
+	unlock_extent(&inode->io_tree, lockstart, lockend);
+
+	return ret;
+}
+
+static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
+					 const u64 start,
+					 const u64 len,
+					 struct extent_state **cached_state)
+{
+	u64 search_start = start;
+	const u64 end = start + len - 1;
+
+	while (search_start < end) {
+		const u64 search_len = end - search_start + 1;
+		struct extent_map *em;
+		u64 em_len;
+		int ret = 0;
+
+		em = btrfs_get_extent(inode, NULL, 0, search_start,
+				      search_len, 0);
+		if (IS_ERR(em))
+			return PTR_ERR(em);
+
+		if (em->block_start != EXTENT_MAP_HOLE)
+			goto next;
+
+		em_len = em->len;
+		if (em->start < search_start)
+			em_len -= search_start - em->start;
+		if (em_len > search_len)
+			em_len = search_len;
+
+		ret = set_extent_bit(&inode->io_tree, search_start,
+				     search_start + em_len - 1,
+				     EXTENT_DELALLOC_NEW,
+				     NULL, cached_state, GFP_NOFS);
+next:
+		search_start = extent_map_end(em);
+		free_extent_map(em);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static int btrfs_buffered_page_prepare(struct inode *inode, loff_t pos,
+		unsigned len, struct iomap *iomap)
+{
+	//wait_on_page_writeback(page);
+	//set_page_extent_mapped(page);
+	return 0;
+}
+
+static void btrfs_buffered_page_done(struct inode *inode, loff_t pos,
+		unsigned copied, struct page *page,
+		struct iomap *iomap)
+{
+	SetPageUptodate(page);
+	ClearPageChecked(page);
+	set_page_dirty(page);
+	get_page(page);
+}
+
+
+static const struct iomap_page_ops btrfs_buffered_page_ops = {
+	.page_prepare = btrfs_buffered_page_prepare,
+	.page_done = btrfs_buffered_page_done,
+};
+
+
+static int btrfs_buffered_iomap_begin(struct inode *inode, loff_t pos,
+		loff_t length, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
+{
+	int ret;
+	size_t write_bytes = length;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	size_t sector_offset = pos & (fs_info->sectorsize - 1);
+	struct btrfs_iomap *bi;
+
+	bi = kzalloc(sizeof(struct btrfs_iomap), GFP_NOFS);
+	if (!bi)
+		return -ENOMEM;
+
+	bi->reserved_bytes = round_up(write_bytes + sector_offset,
+			fs_info->sectorsize);
+
+	/* Reserve data space */
+	ret = btrfs_check_data_free_space(inode, &bi->data_reserved, pos,
+			write_bytes);
+	if (ret < 0) {
+		/*
+		 * Space allocation failed. Let's check if we can
+		 * continue I/O without allocations
+		 */
+		if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+						BTRFS_INODE_PREALLOC)) &&
+				check_can_nocow(BTRFS_I(inode), pos,
+					&write_bytes) > 0) {
+			bi->nocow = true;
+			/*
+			 * our prealloc extent may be smaller than
+			 * write_bytes, so scale down.
+			 */
+			bi->reserved_bytes = round_up(write_bytes +
+					sector_offset,
+					fs_info->sectorsize);
+		} else {
+			goto error;
+		}
+	}
+
+	WARN_ON(bi->reserved_bytes == 0);
+
+	/* We have the data space allocated, reserve the metadata now */
+	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
+			bi->reserved_bytes);
+	if (ret) {
+		struct btrfs_root *root = BTRFS_I(inode)->root;
+		if (!bi->nocow)
+			btrfs_free_reserved_data_space(inode,
+					bi->data_reserved, pos,
+					write_bytes);
+		else
+			btrfs_end_write_no_snapshotting(root);
+		goto error;
+	}
+
+	do {
+		ret = lock_and_cleanup_extent(
+				BTRFS_I(inode), pos, write_bytes, &bi->start,
+				&bi->end, &bi->cached_state);
+	} while (ret == -EAGAIN);
+
+	if (ret < 0) {
+		btrfs_delalloc_release_extents(BTRFS_I(inode),
+					       bi->reserved_bytes, true);
+		goto release;
+	} else {
+		bi->extents_locked = ret;
+	}
+	iomap->private = bi;
+	iomap->length = round_up(write_bytes, fs_info->sectorsize);
+	iomap->offset = round_down(pos, fs_info->sectorsize);
+	iomap->addr = IOMAP_NULL_ADDR;
+	iomap->type = IOMAP_DELALLOC;
+	iomap->bdev = fs_info->fs_devices->latest_bdev;
+	iomap->page_ops = &btrfs_buffered_page_ops;
+	return 0;
+release:
+	if (bi->extents_locked)
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start,
+				bi->end, &bi->cached_state);
+	if (bi->nocow) {
+		struct btrfs_root *root = BTRFS_I(inode)->root;
+		btrfs_end_write_no_snapshotting(root);
+		btrfs_delalloc_release_metadata(BTRFS_I(inode),
+				bi->reserved_bytes, true);
+	} else {
+		btrfs_delalloc_release_space(inode, bi->data_reserved,
+				round_down(pos, fs_info->sectorsize),
+				bi->reserved_bytes, true);
+	}
+	extent_changeset_free(bi->data_reserved);
+
+error:
+	kfree(bi);
+	return ret;
+}
+
+static int btrfs_buffered_iomap_end(struct inode *inode, loff_t pos,
+		loff_t length, ssize_t written, unsigned flags,
+		struct iomap *iomap)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_iomap *bi = iomap->private;
+	ssize_t release_bytes = round_down(bi->reserved_bytes - written,
+			1 << fs_info->sb->s_blocksize_bits);
+	unsigned int extra_bits = 0;
+	u64 start_pos = pos & ~((u64) fs_info->sectorsize - 1);
+	u64 num_bytes = round_up(written + pos - start_pos,
+			fs_info->sectorsize);
+	u64 end_of_last_block = start_pos + num_bytes - 1;
+	int ret = 0;
+
+	if (release_bytes > 0) {
+		if (bi->nocow) {
+			btrfs_delalloc_release_metadata(BTRFS_I(inode),
+					release_bytes, true);
+		} else {
+			u64 __pos = round_down(pos + written, fs_info->sectorsize);
+			btrfs_delalloc_release_space(inode, bi->data_reserved,
+					__pos, release_bytes, true);
+		}
+	}
+
+	/*
+	 * The pages may have already been dirty, clear out old accounting so
+	 * we can set things up properly
+	 */
+	clear_extent_bit(&BTRFS_I(inode)->io_tree, start_pos, end_of_last_block,
+			EXTENT_DIRTY | EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
+			EXTENT_DEFRAG, 0, 0, &bi->cached_state);
+
+	if (!btrfs_is_free_space_inode(BTRFS_I(inode))) {
+		if (start_pos >= i_size_read(inode) &&
+		    !(BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC)) {
+			/*
+			 * There can't be any extents following eof in this case
+			 * so just set the delalloc new bit for the range
+			 * directly.
+			 */
+			extra_bits |= EXTENT_DELALLOC_NEW;
+		} else {
+			ret = btrfs_find_new_delalloc_bytes(BTRFS_I(inode),
+					start_pos, num_bytes,
+					&bi->cached_state);
+			if (ret)
+				goto unlock;
+		}
+	}
+
+	ret = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
+			extra_bits, &bi->cached_state, 0);
+unlock:
+	if (bi->extents_locked)
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree,
+				bi->start, bi->end, &bi->cached_state);
+
+	if (bi->nocow) {
+		struct btrfs_root *root = BTRFS_I(inode)->root;
+		btrfs_end_write_no_snapshotting(root);
+		if (written > 0) {
+			u64 start = round_down(pos, fs_info->sectorsize);
+			u64 end = round_up(pos + written, fs_info->sectorsize) - 1;
+			set_extent_bit(&BTRFS_I(inode)->io_tree, start, end,
+					EXTENT_NORESERVE, NULL, NULL, GFP_NOFS);
+		}
+
+	}
+	btrfs_delalloc_release_extents(BTRFS_I(inode), bi->reserved_bytes,
+			true);
+
+	if (written < fs_info->nodesize)
+		btrfs_btree_balance_dirty(fs_info);
+
+	extent_changeset_free(bi->data_reserved);
+	kfree(bi);
+	return ret;
+}
+
+static const struct iomap_ops btrfs_buffered_iomap_ops = {
+	.iomap_begin            = btrfs_buffered_iomap_begin,
+	.iomap_end              = btrfs_buffered_iomap_end,
+};
+
+size_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	ssize_t written;
+	struct inode *inode = file_inode(iocb->ki_filp);
+	written = iomap_file_buffered_write(iocb, from, &btrfs_buffered_iomap_ops);
+	if (written > 0)
+		iocb->ki_pos += written;
+	if (iocb->ki_pos > i_size_read(inode))
+		i_size_write(inode, iocb->ki_pos);
+	return written;
+}
+
-- 
2.16.4



* [PATCH 5/6] btrfs: Add CoW in iomap based writes
  2019-06-21 19:28 [PATCH 0/6] Btrfs iomap Goldwyn Rodrigues
                   ` (3 preceding siblings ...)
  2019-06-21 19:28 ` [PATCH 4/6] btrfs: Add a simple buffered iomap write Goldwyn Rodrigues
@ 2019-06-21 19:28 ` Goldwyn Rodrigues
  2019-06-21 19:28 ` [PATCH 6/6] btrfs: remove buffered write code made unnecessary Goldwyn Rodrigues
  5 siblings, 0 replies; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-21 19:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, david, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Set iomap->type to IOMAP_COW and fill in the source map (srcmap)
when the I/O is not page aligned.
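
A concrete example of the alignment check (numbers invented,
PAGE_SIZE == 4096):

/*
 *   pos = 5000, write_bytes = 100   => end = 5100
 *   pos & (PAGE_SIZE - 1) = 904, end & (PAGE_SIZE - 1) = 1004
 *   -> partial pages at both edges: iomap->type = IOMAP_COW and the
 *      existing extent is looked up into the srcmap for the read side
 *
 *   pos = 8192, write_bytes = 4096  => end = 12288
 *   -> both masks are zero, whole pages are overwritten: the write
 *      stays IOMAP_DELALLOC and no srcmap is needed
 */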

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/fs/btrfs/iomap.c b/fs/btrfs/iomap.c
index 0435b179d461..f4133cf953b4 100644
--- a/fs/btrfs/iomap.c
+++ b/fs/btrfs/iomap.c
@@ -172,6 +172,35 @@ static int btrfs_buffered_page_prepare(struct inode *inode, loff_t pos,
 	return 0;
 }
 
+/*
+ * get_iomap: Get the block map and fill the iomap structure
+ * @pos: file position
+ * @length: I/O length
+ * @iomap: The iomap structure to fill
+ */
+
+static int get_iomap(struct inode *inode, loff_t pos, loff_t length,
+		struct iomap *iomap)
+{
+	struct extent_map *em;
+	iomap->addr = IOMAP_NULL_ADDR;
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
+	if (IS_ERR(em))
+		return PTR_ERR(em);
+	/* XXX Do we need to check for em->flags here? */
+	if (em->block_start == EXTENT_MAP_HOLE) {
+		iomap->type = IOMAP_HOLE;
+	} else {
+		iomap->addr = em->block_start + (pos - em->start);
+		iomap->type = IOMAP_MAPPED;
+	}
+	iomap->offset = em->start;
+	iomap->bdev = em->bdev;
+	iomap->length = em->len;
+	free_extent_map(em);
+	return 0;
+}
+
 static void btrfs_buffered_page_done(struct inode *inode, loff_t pos,
 		unsigned copied, struct page *page,
 		struct iomap *iomap)
@@ -196,6 +225,7 @@ static int btrfs_buffered_iomap_begin(struct inode *inode, loff_t pos,
 	int ret;
 	size_t write_bytes = length;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	size_t end;
 	size_t sector_offset = pos & (fs_info->sectorsize - 1);
 	struct btrfs_iomap *bi;
 
@@ -263,6 +293,17 @@ static int btrfs_buffered_iomap_begin(struct inode *inode, loff_t pos,
 	iomap->private = bi;
 	iomap->length = round_up(write_bytes, fs_info->sectorsize);
 	iomap->offset = round_down(pos, fs_info->sectorsize);
+	end = pos + write_bytes;
+	/* Set IOMAP_COW if start/end is not page aligned */
+	if (((pos & (PAGE_SIZE - 1)) || (end & (PAGE_SIZE - 1)))) {
+		iomap->type = IOMAP_COW;
+		ret = get_iomap(inode, pos, length, srcmap);
+		if (ret < 0)
+			goto release;
+	} else {
+		iomap->type = IOMAP_DELALLOC;
+	}
+
 	iomap->addr = IOMAP_NULL_ADDR;
 	iomap->type = IOMAP_DELALLOC;
 	iomap->bdev = fs_info->fs_devices->latest_bdev;
-- 
2.16.4



* [PATCH 6/6] btrfs: remove buffered write code made unnecessary
  2019-06-21 19:28 [PATCH 0/6] Btrfs iomap Goldwyn Rodrigues
                   ` (4 preceding siblings ...)
  2019-06-21 19:28 ` [PATCH 5/6] btrfs: Add CoW in iomap based writes Goldwyn Rodrigues
@ 2019-06-21 19:28 ` Goldwyn Rodrigues
  5 siblings, 0 replies; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-21 19:28 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, david, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This is better done in a separate patch to keep the main patch short(er).

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/file.c | 464 --------------------------------------------------------
 1 file changed, 464 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fc3032d8b573..61b5512ef035 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -389,79 +389,6 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
 	return 0;
 }
 
-/* simple helper to fault in pages and copy.  This should go away
- * and be replaced with calls into generic code.
- */
-static noinline int btrfs_copy_from_user(loff_t pos, size_t write_bytes,
-					 struct page **prepared_pages,
-					 struct iov_iter *i)
-{
-	size_t copied = 0;
-	size_t total_copied = 0;
-	int pg = 0;
-	int offset = offset_in_page(pos);
-
-	while (write_bytes > 0) {
-		size_t count = min_t(size_t,
-				     PAGE_SIZE - offset, write_bytes);
-		struct page *page = prepared_pages[pg];
-		/*
-		 * Copy data from userspace to the current page
-		 */
-		copied = iov_iter_copy_from_user_atomic(page, i, offset, count);
-
-		/* Flush processor's dcache for this page */
-		flush_dcache_page(page);
-
-		/*
-		 * if we get a partial write, we can end up with
-		 * partially up to date pages.  These add
-		 * a lot of complexity, so make sure they don't
-		 * happen by forcing this copy to be retried.
-		 *
-		 * The rest of the btrfs_file_write code will fall
-		 * back to page at a time copies after we return 0.
-		 */
-		if (!PageUptodate(page) && copied < count)
-			copied = 0;
-
-		iov_iter_advance(i, copied);
-		write_bytes -= copied;
-		total_copied += copied;
-
-		/* Return to btrfs_file_write_iter to fault page */
-		if (unlikely(copied == 0))
-			break;
-
-		if (copied < PAGE_SIZE - offset) {
-			offset += copied;
-		} else {
-			pg++;
-			offset = 0;
-		}
-	}
-	return total_copied;
-}
-
-/*
- * unlocks pages after btrfs_file_write is done with them
- */
-static void btrfs_drop_pages(struct page **pages, size_t num_pages)
-{
-	size_t i;
-	for (i = 0; i < num_pages; i++) {
-		/* page checked is some magic around finding pages that
-		 * have been modified without going through btrfs_set_page_dirty
-		 * clear it here. There should be no need to mark the pages
-		 * accessed as prepare_pages should have marked them accessed
-		 * in prepare_pages via find_or_create_page()
-		 */
-		ClearPageChecked(pages[i]);
-		unlock_page(pages[i]);
-		put_page(pages[i]);
-	}
-}
-
 static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
 					 const u64 start,
 					 const u64 len,
@@ -1386,165 +1313,6 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
-/*
- * on error we return an unlocked page and the error value
- * on success we return a locked page and 0
- */
-static int prepare_uptodate_page(struct inode *inode,
-				 struct page *page, u64 pos,
-				 bool force_uptodate)
-{
-	int ret = 0;
-
-	if (((pos & (PAGE_SIZE - 1)) || force_uptodate) &&
-	    !PageUptodate(page)) {
-		ret = btrfs_readpage(NULL, page);
-		if (ret)
-			return ret;
-		lock_page(page);
-		if (!PageUptodate(page)) {
-			unlock_page(page);
-			return -EIO;
-		}
-		if (page->mapping != inode->i_mapping) {
-			unlock_page(page);
-			return -EAGAIN;
-		}
-	}
-	return 0;
-}
-
-/*
- * this just gets pages into the page cache and locks them down.
- */
-static noinline int prepare_pages(struct inode *inode, struct page **pages,
-				  size_t num_pages, loff_t pos,
-				  size_t write_bytes, bool force_uptodate)
-{
-	int i;
-	unsigned long index = pos >> PAGE_SHIFT;
-	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
-	int err = 0;
-	int faili;
-
-	for (i = 0; i < num_pages; i++) {
-again:
-		pages[i] = find_or_create_page(inode->i_mapping, index + i,
-					       mask | __GFP_WRITE);
-		if (!pages[i]) {
-			faili = i - 1;
-			err = -ENOMEM;
-			goto fail;
-		}
-
-		if (i == 0)
-			err = prepare_uptodate_page(inode, pages[i], pos,
-						    force_uptodate);
-		if (!err && i == num_pages - 1)
-			err = prepare_uptodate_page(inode, pages[i],
-						    pos + write_bytes, false);
-		if (err) {
-			put_page(pages[i]);
-			if (err == -EAGAIN) {
-				err = 0;
-				goto again;
-			}
-			faili = i - 1;
-			goto fail;
-		}
-		wait_on_page_writeback(pages[i]);
-	}
-
-	return 0;
-fail:
-	while (faili >= 0) {
-		unlock_page(pages[faili]);
-		put_page(pages[faili]);
-		faili--;
-	}
-	return err;
-
-}
-
-/*
- * This function locks the extent and properly waits for data=ordered extents
- * to finish before allowing the pages to be modified if need.
- *
- * The return value:
- * 1 - the extent is locked
- * 0 - the extent is not locked, and everything is OK
- * -EAGAIN - need re-prepare the pages
- * the other < 0 number - Something wrong happens
- */
-static noinline int
-lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
-				size_t num_pages, loff_t pos,
-				size_t write_bytes,
-				u64 *lockstart, u64 *lockend,
-				struct extent_state **cached_state)
-{
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	u64 start_pos;
-	u64 last_pos;
-	int i;
-	int ret = 0;
-
-	start_pos = round_down(pos, fs_info->sectorsize);
-	last_pos = start_pos
-		+ round_up(pos + write_bytes - start_pos,
-			   fs_info->sectorsize) - 1;
-
-	if (start_pos < inode->vfs_inode.i_size) {
-		struct btrfs_ordered_extent *ordered;
-
-		lock_extent_bits(&inode->io_tree, start_pos, last_pos,
-				cached_state);
-		ordered = btrfs_lookup_ordered_range(inode, start_pos,
-						     last_pos - start_pos + 1);
-		if (ordered &&
-		    ordered->file_offset + ordered->len > start_pos &&
-		    ordered->file_offset <= last_pos) {
-			unlock_extent_cached(&inode->io_tree, start_pos,
-					last_pos, cached_state);
-			for (i = 0; i < num_pages; i++) {
-				unlock_page(pages[i]);
-				put_page(pages[i]);
-			}
-			btrfs_start_ordered_extent(&inode->vfs_inode,
-					ordered, 1);
-			btrfs_put_ordered_extent(ordered);
-			return -EAGAIN;
-		}
-		if (ordered)
-			btrfs_put_ordered_extent(ordered);
-
-		*lockstart = start_pos;
-		*lockend = last_pos;
-		ret = 1;
-	}
-
-	/*
-	 * It's possible the pages are dirty right now, but we don't want
-	 * to clean them yet because copy_from_user may catch a page fault
-	 * and we might have to fall back to one page at a time.  If that
-	 * happens, we'll unlock these pages and we'd have a window where
-	 * reclaim could sneak in and drop the once-dirty page on the floor
-	 * without writing it.
-	 *
-	 * We have the pages locked and the extent range locked, so there's
-	 * no way someone can start IO on any dirty pages in this range.
-	 *
-	 * We'll call btrfs_dirty_pages() later on, and that will flip around
-	 * delalloc bits and dirty the pages as required.
-	 */
-	for (i = 0; i < num_pages; i++) {
-		set_page_extent_mapped(pages[i]);
-		WARN_ON(!PageLocked(pages[i]));
-	}
-
-	return ret;
-}
-
 static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos,
 				    size_t *write_bytes)
 {
@@ -1591,238 +1359,6 @@ static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos,
 	return ret;
 }
 
-static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
-					       struct iov_iter *i)
-{
-	struct file *file = iocb->ki_filp;
-	loff_t pos = iocb->ki_pos;
-	struct inode *inode = file_inode(file);
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct page **pages = NULL;
-	struct extent_state *cached_state = NULL;
-	struct extent_changeset *data_reserved = NULL;
-	u64 release_bytes = 0;
-	u64 lockstart;
-	u64 lockend;
-	size_t num_written = 0;
-	int nrptrs;
-	int ret = 0;
-	bool only_release_metadata = false;
-	bool force_page_uptodate = false;
-
-	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
-			PAGE_SIZE / (sizeof(struct page *)));
-	nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
-	nrptrs = max(nrptrs, 8);
-	pages = kmalloc_array(nrptrs, sizeof(struct page *), GFP_KERNEL);
-	if (!pages)
-		return -ENOMEM;
-
-	while (iov_iter_count(i) > 0) {
-		size_t offset = offset_in_page(pos);
-		size_t sector_offset;
-		size_t write_bytes = min(iov_iter_count(i),
-					 nrptrs * (size_t)PAGE_SIZE -
-					 offset);
-		size_t num_pages = DIV_ROUND_UP(write_bytes + offset,
-						PAGE_SIZE);
-		size_t reserve_bytes;
-		size_t dirty_pages;
-		size_t copied;
-		size_t dirty_sectors;
-		size_t num_sectors;
-		int extents_locked;
-
-		WARN_ON(num_pages > nrptrs);
-
-		/*
-		 * Fault pages before locking them in prepare_pages
-		 * to avoid recursive lock
-		 */
-		if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
-			ret = -EFAULT;
-			break;
-		}
-
-		sector_offset = pos & (fs_info->sectorsize - 1);
-		reserve_bytes = round_up(write_bytes + sector_offset,
-				fs_info->sectorsize);
-
-		extent_changeset_release(data_reserved);
-		ret = btrfs_check_data_free_space(inode, &data_reserved, pos,
-						  write_bytes);
-		if (ret < 0) {
-			if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
-						      BTRFS_INODE_PREALLOC)) &&
-			    check_can_nocow(BTRFS_I(inode), pos,
-					&write_bytes) > 0) {
-				/*
-				 * For nodata cow case, no need to reserve
-				 * data space.
-				 */
-				only_release_metadata = true;
-				/*
-				 * our prealloc extent may be smaller than
-				 * write_bytes, so scale down.
-				 */
-				num_pages = DIV_ROUND_UP(write_bytes + offset,
-							 PAGE_SIZE);
-				reserve_bytes = round_up(write_bytes +
-							 sector_offset,
-							 fs_info->sectorsize);
-			} else {
-				break;
-			}
-		}
-
-		WARN_ON(reserve_bytes == 0);
-		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-				reserve_bytes);
-		if (ret) {
-			if (!only_release_metadata)
-				btrfs_free_reserved_data_space(inode,
-						data_reserved, pos,
-						write_bytes);
-			else
-				btrfs_end_write_no_snapshotting(root);
-			break;
-		}
-
-		release_bytes = reserve_bytes;
-again:
-		/*
-		 * This is going to setup the pages array with the number of
-		 * pages we want, so we don't really need to worry about the
-		 * contents of pages from loop to loop
-		 */
-		ret = prepare_pages(inode, pages, num_pages,
-				    pos, write_bytes,
-				    force_page_uptodate);
-		if (ret) {
-			btrfs_delalloc_release_extents(BTRFS_I(inode),
-						       reserve_bytes, true);
-			break;
-		}
-
-		extents_locked = lock_and_cleanup_extent_if_need(
-				BTRFS_I(inode), pages,
-				num_pages, pos, write_bytes, &lockstart,
-				&lockend, &cached_state);
-		if (extents_locked < 0) {
-			if (extents_locked == -EAGAIN)
-				goto again;
-			btrfs_delalloc_release_extents(BTRFS_I(inode),
-						       reserve_bytes, true);
-			ret = extents_locked;
-			break;
-		}
-
-		copied = btrfs_copy_from_user(pos, write_bytes, pages, i);
-
-		num_sectors = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
-		dirty_sectors = round_up(copied + sector_offset,
-					fs_info->sectorsize);
-		dirty_sectors = BTRFS_BYTES_TO_BLKS(fs_info, dirty_sectors);
-
-		/*
-		 * if we have trouble faulting in the pages, fall
-		 * back to one page at a time
-		 */
-		if (copied < write_bytes)
-			nrptrs = 1;
-
-		if (copied == 0) {
-			force_page_uptodate = true;
-			dirty_sectors = 0;
-			dirty_pages = 0;
-		} else {
-			force_page_uptodate = false;
-			dirty_pages = DIV_ROUND_UP(copied + offset,
-						   PAGE_SIZE);
-		}
-
-		if (num_sectors > dirty_sectors) {
-			/* release everything except the sectors we dirtied */
-			release_bytes -= dirty_sectors <<
-						fs_info->sb->s_blocksize_bits;
-			if (only_release_metadata) {
-				btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							release_bytes, true);
-			} else {
-				u64 __pos;
-
-				__pos = round_down(pos,
-						   fs_info->sectorsize) +
-					(dirty_pages << PAGE_SHIFT);
-				btrfs_delalloc_release_space(inode,
-						data_reserved, __pos,
-						release_bytes, true);
-			}
-		}
-
-		release_bytes = round_up(copied + sector_offset,
-					fs_info->sectorsize);
-
-		if (copied > 0)
-			ret = btrfs_dirty_pages(inode, pages, dirty_pages,
-						pos, copied, &cached_state);
-		if (extents_locked)
-			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
-					     lockstart, lockend, &cached_state);
-		btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes,
-					       true);
-		if (ret) {
-			btrfs_drop_pages(pages, num_pages);
-			break;
-		}
-
-		release_bytes = 0;
-		if (only_release_metadata)
-			btrfs_end_write_no_snapshotting(root);
-
-		if (only_release_metadata && copied > 0) {
-			lockstart = round_down(pos,
-					       fs_info->sectorsize);
-			lockend = round_up(pos + copied,
-					   fs_info->sectorsize) - 1;
-
-			set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
-				       lockend, EXTENT_NORESERVE, NULL,
-				       NULL, GFP_NOFS);
-			only_release_metadata = false;
-		}
-
-		btrfs_drop_pages(pages, num_pages);
-
-		cond_resched();
-
-		balance_dirty_pages_ratelimited(inode->i_mapping);
-		if (dirty_pages < (fs_info->nodesize >> PAGE_SHIFT) + 1)
-			btrfs_btree_balance_dirty(fs_info);
-
-		pos += copied;
-		num_written += copied;
-	}
-
-	kfree(pages);
-
-	if (release_bytes) {
-		if (only_release_metadata) {
-			btrfs_end_write_no_snapshotting(root);
-			btrfs_delalloc_release_metadata(BTRFS_I(inode),
-					release_bytes, true);
-		} else {
-			btrfs_delalloc_release_space(inode, data_reserved,
-					round_down(pos, fs_info->sectorsize),
-					release_bytes, true);
-		}
-	}
-
-	extent_changeset_free(data_reserved);
-	return num_written ? num_written : ret;
-}
-
 static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
-- 
2.16.4



* Re: [PATCH 3/6] iomap: Check iblocksize before transforming page->private
  2019-06-21 19:28 ` [PATCH 3/6] iomap: Check iblocksize before transforming page->private Goldwyn Rodrigues
@ 2019-06-22  0:21   ` Darrick J. Wong
  2019-06-25 19:22     ` Goldwyn Rodrigues
  2019-06-24  7:05   ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2019-06-22  0:21 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, linux-fsdevel, hch, david, Goldwyn Rodrigues

On Fri, Jun 21, 2019 at 02:28:25PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> btrfs also uses page->private, to store an extent_buffer pointer.
> Make the check stricter to make sure page->private really holds an
> iomap_page (iop) by checking that i_blocksize() < PAGE_SIZE.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

/me wonders what will happen when btrfs decides to support blocksize !=
pagesize... will we have to add a pointer to struct iomap_page so that
btrfs can continue to associate an extent_buffer with a page?
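
(Something like this, purely hypothetical:)

struct iomap_page {
	/* ... existing sub-page state ... */
	void		*fs_private;	/* e.g. a btrfs extent_buffer */
};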

--D

> ---
>  include/linux/iomap.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index f49767c7fd83..6511124e58b6 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -128,7 +128,8 @@ struct iomap_page {
>  
>  static inline struct iomap_page *to_iomap_page(struct page *page)
>  {
> -	if (page_has_private(page))
> +	if (i_blocksize(page->mapping->host) < PAGE_SIZE &&
> +			page_has_private(page))
>  		return (struct iomap_page *)page_private(page);
>  	return NULL;
>  }
> -- 
> 2.16.4
> 


* Re: [PATCH 2/6] iomap: Read page from srcmap for IOMAP_COW
  2019-06-21 19:28 ` [PATCH 2/6] iomap: Read page from srcmap for IOMAP_COW Goldwyn Rodrigues
@ 2019-06-22  0:41   ` Darrick J. Wong
  0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2019-06-22  0:41 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, linux-fsdevel, hch, david, Goldwyn Rodrigues

On Fri, Jun 21, 2019 at 02:28:24PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Commit message needed here...

> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/iomap.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 6648957af268..8a7b20e432ef 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -655,7 +655,7 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
>  
>  static int
>  iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
> -		struct page **pagep, struct iomap *iomap)
> +		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	const struct iomap_page_ops *page_ops = iomap->page_ops;
>  	pgoff_t index = pos >> PAGE_SHIFT;
> @@ -681,6 +681,8 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  
>  	if (iomap->type == IOMAP_INLINE)
>  		iomap_read_inline_data(inode, page, iomap);
> +	else if (iomap->type == IOMAP_COW)
> +		status = __iomap_write_begin(inode, pos, len, page, srcmap);

Pardon my stream of consciousness while I try to reason about this
change...

Hmm.  For writes to the page cache (which aren't necessarily aligned to
a page granularity), this part of the iomap code has used whatever iomap
the fs provides to read in whatever page contents are needed from disk
so that we can do a read-modify-write through the page cache.

For XFS this means that we (almost*) always report data fork extents in
response to a write query, even if the write would be COW, because we
know that we won't need the cow fork mapping until writeback.  This has
led to the sort of funny situation where an (IOMAP_WRITE|IOMAP_DIRECT)
request will return the COW fork extent, but an (IOMAP_WRITE) request
returns the data fork extent.

(* "almost", because we will sometimes return a cow fork extent if the
data fork is a hole and the file has extent size hints enabled.  We're
safe from reading in stale disk contents because cow fork extents do not
achieve written status until writeback completes, and the page stays
locked so we can't write and writeback it simultaneously)

I /think/ this finally enables us to fix that weird quirk of the xfs
iomap methods, because now we always report the write address and we
always report the read address if the actor is supposed to do a
read-modify-write.  It's the actor's responsibilty to sort that out,
not the ->iomap_begin function's.
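
Roughly, hand-waving over all the details (the names below are
placeholders, not actual xfs code), the write-side ->iomap_begin could
then always do something like:

	if (write_needs_cow) {
		*srcmap = data_fork_mapping;	/* where RMW reads come from */
		*iomap = cow_fork_mapping;	/* where the new data goes */
	} else {
		*iomap = data_fork_mapping;	/* plain overwrite in place */
	}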

--D


>  	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>  		status = __block_write_begin_int(page, pos, len, NULL, iomap);
>  	else
> @@ -833,7 +835,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		}
>  
>  		status = iomap_write_begin(inode, pos, bytes, flags, &page,
> -				iomap);
> +				iomap, srcmap);
>  		if (unlikely(status))
>  			break;
>  
> @@ -932,7 +934,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  			return PTR_ERR(rpage);
>  
>  		status = iomap_write_begin(inode, pos, bytes,
> -					   AOP_FLAG_NOFS, &page, iomap);
> +					   AOP_FLAG_NOFS, &page, iomap, srcmap);
>  		put_page(rpage);
>  		if (unlikely(status))
>  			return status;
> @@ -978,13 +980,13 @@ iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
>  EXPORT_SYMBOL_GPL(iomap_file_dirty);
>  
>  static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
> -		unsigned bytes, struct iomap *iomap)
> +		unsigned bytes, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct page *page;
>  	int status;
>  
>  	status = iomap_write_begin(inode, pos, bytes, AOP_FLAG_NOFS, &page,
> -				   iomap);
> +				   iomap, srcmap);
>  	if (status)
>  		return status;
>  
> @@ -1022,7 +1024,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
>  		if (IS_DAX(inode))
>  			status = iomap_dax_zero(pos, offset, bytes, iomap);
>  		else
> -			status = iomap_zero(inode, pos, offset, bytes, iomap);
> +			status = iomap_zero(inode, pos, offset, bytes, iomap, srcmap);
>  		if (status < 0)
>  			return status;
>  
> -- 
> 2.16.4
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-06-21 19:28 ` [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O Goldwyn Rodrigues
@ 2019-06-22  0:46   ` Darrick J. Wong
  2019-06-25 19:17     ` Goldwyn Rodrigues
  2019-06-26  6:21     ` Christoph Hellwig
  2019-06-24  7:07   ` Christoph Hellwig
  1 sibling, 2 replies; 27+ messages in thread
From: Darrick J. Wong @ 2019-06-22  0:46 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, linux-fsdevel, hch, david, Goldwyn Rodrigues

On Fri, Jun 21, 2019 at 02:28:23PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Introduces a new type IOMAP_COW, which means the data at offset
> must be read from a srcmap and copied before performing the
> write on the offset.
> 
> The srcmap is used to identify where the read is to be performed
> from. This is passed to iomap->begin(), which is supposed to
> put in the details for reading, typically set with type IOMAP_READ.

What is IOMAP_READ?

> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/dax.c              |  8 +++++---
>  fs/ext2/inode.c       |  2 +-
>  fs/ext4/inode.c       |  2 +-
>  fs/gfs2/bmap.c        |  3 ++-
>  fs/internal.h         |  2 +-
>  fs/iomap.c            | 31 ++++++++++++++++---------------
>  fs/xfs/xfs_iomap.c    |  9 ++++++---
>  include/linux/iomap.h |  4 +++-
>  8 files changed, 35 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 2e48c7ebb973..80b9e2599223 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1078,7 +1078,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap)
> +		struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct block_device *bdev = iomap->bdev;
>  	struct dax_device *dax_dev = iomap->dax_dev;
> @@ -1236,6 +1236,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	unsigned long vaddr = vmf->address;
>  	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
>  	struct iomap iomap = { 0 };
> +	struct iomap srcmap = { 0 };
>  	unsigned flags = IOMAP_FAULT;
>  	int error, major = 0;
>  	bool write = vmf->flags & FAULT_FLAG_WRITE;
> @@ -1280,7 +1281,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	 * the file system block size to be equal the page size, which means
>  	 * that we never have to deal with more than a single extent here.
>  	 */
> -	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
> +	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap, &srcmap);
>  	if (iomap_errp)
>  		*iomap_errp = error;
>  	if (error) {
> @@ -1460,6 +1461,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	struct inode *inode = mapping->host;
>  	vm_fault_t result = VM_FAULT_FALLBACK;
>  	struct iomap iomap = { 0 };
> +	struct iomap srcmap = { 0 };
>  	pgoff_t max_pgoff;
>  	void *entry;
>  	loff_t pos;
> @@ -1534,7 +1536,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	 * to look up our filesystem block.
>  	 */
>  	pos = (loff_t)xas.xa_index << PAGE_SHIFT;
> -	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
> +	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap, &srcmap);

Line too long?

Also, I guess the DAX and directio write paths will just WARN_ON_ONCE if
someone feeds them an IOMAP_COW type iomap?

Ah, right, I guess the only filesystems that use iomap directio and
iomap dax don't support COW. :)
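
(Something like this at the top of dax_iomap_actor() and iomap_dio_actor()
would be a cheap safety net; just a sketch, not tested:)

	/* these paths don't know how to do the COW copy themselves */
	if (WARN_ON_ONCE(iomap->type == IOMAP_COW))
		return -EIO;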

--D

>  	if (error)
>  		goto unlock_entry;
>  
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index e474127dd255..f081f11980ad 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -801,7 +801,7 @@ int ext2_get_block(struct inode *inode, sector_t iblock,
>  
>  #ifdef CONFIG_FS_DAX
>  static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> -		unsigned flags, struct iomap *iomap)
> +		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	unsigned int blkbits = inode->i_blkbits;
>  	unsigned long first_block = offset >> blkbits;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c7f77c643008..a8017e0c302b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3437,7 +3437,7 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
>  }
>  
>  static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> -			    unsigned flags, struct iomap *iomap)
> +			    unsigned flags, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>  	unsigned int blkbits = inode->i_blkbits;
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index 93ea1d529aa3..affa0c4305b7 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -1124,7 +1124,8 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
>  }
>  
>  static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
> -			    unsigned flags, struct iomap *iomap)
> +			    unsigned flags, struct iomap *iomap,
> +			    struct iomap *srcmap)
>  {
>  	struct gfs2_inode *ip = GFS2_I(inode);
>  	struct metapath mp = { .mp_aheight = 1, };
> diff --git a/fs/internal.h b/fs/internal.h
> index a48ef81be37d..79e495d86165 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -188,7 +188,7 @@ extern int do_vfs_ioctl(struct file *file, unsigned int fd, unsigned int cmd,
>   * iomap support:
>   */
>  typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
> -		void *data, struct iomap *iomap);
> +		void *data, struct iomap *iomap, struct iomap *srcmap);
>  
>  loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
>  		unsigned flags, const struct iomap_ops *ops, void *data,
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 23ef63fd1669..6648957af268 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -41,6 +41,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  		const struct iomap_ops *ops, void *data, iomap_actor_t actor)
>  {
>  	struct iomap iomap = { 0 };
> +	struct iomap srcmap = { 0 };
>  	loff_t written = 0, ret;
>  
>  	/*
> @@ -55,7 +56,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  	 * expose transient stale data. If the reserve fails, we can safely
>  	 * back out at this point as there is nothing to undo.
>  	 */
> -	ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
> +	ret = ops->iomap_begin(inode, pos, length, flags, &iomap, &srcmap);
>  	if (ret)
>  		return ret;
>  	if (WARN_ON(iomap.offset > pos))
> @@ -75,7 +76,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  	 * we can do the copy-in page by page without having to worry about
>  	 * failures exposing transient data.
>  	 */
> -	written = actor(inode, pos, length, data, &iomap);
> +	written = actor(inode, pos, length, data, &iomap, &srcmap);
>  
>  	/*
>  	 * Now the data has been copied, commit the range we've copied.  This
> @@ -282,7 +283,7 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
>  
>  static loff_t
>  iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap)
> +		struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct iomap_readpage_ctx *ctx = data;
>  	struct page *page = ctx->cur_page;
> @@ -424,7 +425,7 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
>  
>  static loff_t
>  iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> -		void *data, struct iomap *iomap)
> +		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct iomap_readpage_ctx *ctx = data;
>  	loff_t done, ret;
> @@ -444,7 +445,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  			ctx->cur_page_in_bio = false;
>  		}
>  		ret = iomap_readpage_actor(inode, pos + done, length - done,
> -				ctx, iomap);
> +				ctx, iomap, srcmap);
>  	}
>  
>  	return done;
> @@ -796,7 +797,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
>  
>  static loff_t
>  iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap)
> +		struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct iov_iter *i = data;
>  	long status = 0;
> @@ -913,7 +914,7 @@ __iomap_read_page(struct inode *inode, loff_t offset)
>  
>  static loff_t
>  iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap)
> +		struct iomap *iomap, struct iomap *srcmap)
>  {
>  	long status = 0;
>  	ssize_t written = 0;
> @@ -1002,7 +1003,7 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
>  
>  static loff_t
>  iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
> -		void *data, struct iomap *iomap)
> +		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	bool *did_zero = data;
>  	loff_t written = 0;
> @@ -1071,7 +1072,7 @@ EXPORT_SYMBOL_GPL(iomap_truncate_page);
>  
>  static loff_t
>  iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
> -		void *data, struct iomap *iomap)
> +		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct page *page = data;
>  	int ret;
> @@ -1169,7 +1170,7 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
>  
>  static loff_t
>  iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap)
> +		struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct fiemap_ctx *ctx = data;
>  	loff_t ret = length;
> @@ -1343,7 +1344,7 @@ page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
>  
>  static loff_t
>  iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length,
> -		      void *data, struct iomap *iomap)
> +		      void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	switch (iomap->type) {
>  	case IOMAP_UNWRITTEN:
> @@ -1389,7 +1390,7 @@ EXPORT_SYMBOL_GPL(iomap_seek_hole);
>  
>  static loff_t
>  iomap_seek_data_actor(struct inode *inode, loff_t offset, loff_t length,
> -		      void *data, struct iomap *iomap)
> +		      void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	switch (iomap->type) {
>  	case IOMAP_HOLE:
> @@ -1790,7 +1791,7 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>  static loff_t
>  iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
> -		void *data, struct iomap *iomap)
> +		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct iomap_dio *dio = data;
>  
> @@ -2057,7 +2058,7 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi)
>   * distinction between written and unwritten extents.
>   */
>  static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
> -		loff_t count, void *data, struct iomap *iomap)
> +		loff_t count, void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct iomap_swapfile_info *isi = data;
>  	int error;
> @@ -2161,7 +2162,7 @@ EXPORT_SYMBOL_GPL(iomap_swapfile_activate);
>  
>  static loff_t
>  iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
> -		void *data, struct iomap *iomap)
> +		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	sector_t *bno = data, addr;
>  
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 63d323916bba..6116a75f03da 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -925,7 +925,8 @@ xfs_file_iomap_begin(
>  	loff_t			offset,
>  	loff_t			length,
>  	unsigned		flags,
> -	struct iomap		*iomap)
> +	struct iomap		*iomap,
> +	struct iomap		*srcmap)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1148,7 +1149,8 @@ xfs_seek_iomap_begin(
>  	loff_t			offset,
>  	loff_t			length,
>  	unsigned		flags,
> -	struct iomap		*iomap)
> +	struct iomap		*iomap,
> +	struct iomap		*srcmap)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1234,7 +1236,8 @@ xfs_xattr_iomap_begin(
>  	loff_t			offset,
>  	loff_t			length,
>  	unsigned		flags,
> -	struct iomap		*iomap)
> +	struct iomap		*iomap,
> +	struct iomap		*srcmap)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 2103b94cb1bf..f49767c7fd83 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -25,6 +25,7 @@ struct vm_fault;
>  #define IOMAP_MAPPED	0x03	/* blocks allocated at @addr */
>  #define IOMAP_UNWRITTEN	0x04	/* blocks allocated at @addr in unwritten state */
>  #define IOMAP_INLINE	0x05	/* data inline in the inode */
> +#define IOMAP_COW	0x06	/* copy data from srcmap before writing */
>  
>  /*
>   * Flags for all iomap mappings:
> @@ -102,7 +103,8 @@ struct iomap_ops {
>  	 * The actual length is returned in iomap->length.
>  	 */
>  	int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
> -			unsigned flags, struct iomap *iomap);
> +			unsigned flags, struct iomap *iomap,
> +			struct iomap *srcmap);
>  
>  	/*
>  	 * Commit and/or unreserve space previous allocated using iomap_begin.
> -- 
> 2.16.4
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] iomap: Check iblocksize before transforming page->private
  2019-06-21 19:28 ` [PATCH 3/6] iomap: Check iblocksize before transforming page->private Goldwyn Rodrigues
  2019-06-22  0:21   ` Darrick J. Wong
@ 2019-06-24  7:05   ` Christoph Hellwig
  2019-06-25 18:56     ` Goldwyn Rodrigues
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-06-24  7:05 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, linux-fsdevel, hch, darrick.wong, david, Goldwyn Rodrigues

On Fri, Jun 21, 2019 at 02:28:25PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> btrfs uses page->private as well to store extent_buffer. Make
> the check stricter to make sure we are using page->private for iop by
> comparing iblocksize < PAGE_SIZE.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

If btrfs uses page->private itself and also uses functions that call
to_iomap_page we have a major problem, as we now have a usage conflict.

How do you end up here?  

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-06-21 19:28 ` [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O Goldwyn Rodrigues
  2019-06-22  0:46   ` Darrick J. Wong
@ 2019-06-24  7:07   ` Christoph Hellwig
  2019-06-25 19:14     ` Goldwyn Rodrigues
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-06-24  7:07 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, linux-fsdevel, hch, darrick.wong, david, Goldwyn Rodrigues

xfs will need to be updated to fill in the additional iomap for the
COW case.  Has this series been tested on xfs?

I can't say I'm a huge fan of this two iomaps in one method call
approach.  I always though two separate iomap iterations would be nicer,
but compared to that even the older hack with just the additional
src_addr seems a little better.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] iomap: Check iblocksize before transforming page->private
  2019-06-24  7:05   ` Christoph Hellwig
@ 2019-06-25 18:56     ` Goldwyn Rodrigues
  2019-06-25 20:04       ` Filipe Manana
  2019-06-26  6:16       ` Christoph Hellwig
  0 siblings, 2 replies; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-25 18:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-btrfs, linux-fsdevel, darrick.wong, david

On  9:05 24/06, Christoph Hellwig wrote:
> On Fri, Jun 21, 2019 at 02:28:25PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > btrfs uses page->private as well to store extent_buffer. Make
> > the check stricter to make sure we are using page->private for iop by
> > comparing iblocksize < PAGE_SIZE.
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> If btrfs uses page->private itself and also uses functions that call
> to_iomap_page we have a major problem, as we now have a usage conflict.
> 
> How do you end up here?  
> 

Btrfs uses page->private to identify which extent_buffer a page belongs to.
So a read fills in page->private, and when you then try to write to the
page, iomap will assume page->private is an iomap_page pointer.

I don't think we can move extent_buffer out of page->private for btrfs.
Any other ideas?
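
(For reference, the stricter check in patch 3 boils down to roughly the
following; this is only a sketch, and it assumes we can pass the inode
down to the helper:)

	static inline struct iomap_page *to_iomap_page(struct inode *inode,
			struct page *page)
	{
		/*
		 * Only interpret page->private as an iomap_page when we
		 * actually track sub-page block state; otherwise the fs
		 * (btrfs) may have stored its own pointer there, such as
		 * an extent_buffer.
		 */
		if (i_blocksize(inode) < PAGE_SIZE && page_has_private(page))
			return (struct iomap_page *)page_private(page);
		return NULL;
	}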

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-06-24  7:07   ` Christoph Hellwig
@ 2019-06-25 19:14     ` Goldwyn Rodrigues
  2019-06-26  1:36       ` Shiyang Ruan
                         ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-25 19:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-btrfs, linux-fsdevel, darrick.wong, david

On  9:07 24/06, Christoph Hellwig wrote:
> xfs will need to be updated to fill in the additional iomap for the
> COW case.  Has this series been tested on xfs?
> 

No, I have not tested this, or made xfs set IOMAP_COW. I will try to do
that in the next iteration.

> I can't say I'm a huge fan of this two iomaps in one method call
> approach.  I always though two separate iomap iterations would be nicer,
> but compared to that even the older hack with just the additional
> src_addr seems a little better.

I am just expanding on your idea of using multiple iterations for the CoW case
in the hope that we can come up with a good design:

1. iomap_file_buffered_write calls iomap_apply with IOMAP_WRITE flag.
   which calls iomap_begin() for the respective filesystem.
2. btrfs_iomap_begin() sets up iomap->type as IOMAP_COW and fills iomap
   struct with read addr information.
3. iomap_apply() conditionally for IOMAP_COW calls do_cow(new function)
   and calls ops->iomap_begin() with flag IOMAP_COW_READ_DONE(new flag).
4. btrfs_iomap_begin() fills up iomap structure with write information.

Step 3 seems out of place because iomap_apply should be iomap.type agnostic.
Right?
Should we be adding another flag IOMAP_COW_DONE, just to figure out that
this is the "real" write for iomap_begin to fill iomap?

If this is not how you imagined, could you elaborate on the dual iteration
sequence?
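
In (rough, untested) code, steps 1-4 would make iomap_apply() look
something like the following, ignoring the ->iomap_end pairing and error
unwinding; do_cow() and IOMAP_COW_READ_DONE are the new, not-yet-existing
pieces:

	ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
	if (ret)
		return ret;

	if (iomap.type == IOMAP_COW) {
		/* copy the old data in from the read mapping first... */
		ret = do_cow(inode, pos, length, &iomap);
		if (ret)
			return ret;
		/* ...then ask the fs again for the write destination */
		ret = ops->iomap_begin(inode, pos, length,
				flags | IOMAP_COW_READ_DONE, &iomap);
		if (ret)
			return ret;
	}

	written = actor(inode, pos, length, data, &iomap);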


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-06-22  0:46   ` Darrick J. Wong
@ 2019-06-25 19:17     ` Goldwyn Rodrigues
  2019-06-26  6:21     ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-25 19:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-btrfs, linux-fsdevel, hch, david

On 17:46 21/06, Darrick J. Wong wrote:
> On Fri, Jun 21, 2019 at 02:28:23PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > Introduces a new type IOMAP_COW, which means the data at offset
> > must be read from a srcmap and copied before performing the
> > write on the offset.
> > 
> > The srcmap is used to identify where the read is to be performed
> > from. This is passed to iomap->begin(), which is supposed to
> > put in the details for reading, typically set with type IOMAP_READ.
> 
> What is IOMAP_READ ?

Bad comment. I should have written IOMAP_MAPPED or IOMAP_HOLE.
> 
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/dax.c              |  8 +++++---
> >  fs/ext2/inode.c       |  2 +-
> >  fs/ext4/inode.c       |  2 +-
> >  fs/gfs2/bmap.c        |  3 ++-
> >  fs/internal.h         |  2 +-
> >  fs/iomap.c            | 31 ++++++++++++++++---------------
> >  fs/xfs/xfs_iomap.c    |  9 ++++++---
> >  include/linux/iomap.h |  4 +++-
> >  8 files changed, 35 insertions(+), 26 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 2e48c7ebb973..80b9e2599223 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1078,7 +1078,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
> >  
> >  static loff_t
> >  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > -		struct iomap *iomap)
> > +		struct iomap *iomap, struct iomap *srcmap)
> >  {
> >  	struct block_device *bdev = iomap->bdev;
> >  	struct dax_device *dax_dev = iomap->dax_dev;
> > @@ -1236,6 +1236,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> >  	unsigned long vaddr = vmf->address;
> >  	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
> >  	struct iomap iomap = { 0 };
> > +	struct iomap srcmap = { 0 };
> >  	unsigned flags = IOMAP_FAULT;
> >  	int error, major = 0;
> >  	bool write = vmf->flags & FAULT_FLAG_WRITE;
> > @@ -1280,7 +1281,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> >  	 * the file system block size to be equal the page size, which means
> >  	 * that we never have to deal with more than a single extent here.
> >  	 */
> > -	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
> > +	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap, &srcmap);
> >  	if (iomap_errp)
> >  		*iomap_errp = error;
> >  	if (error) {
> > @@ -1460,6 +1461,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
> >  	struct inode *inode = mapping->host;
> >  	vm_fault_t result = VM_FAULT_FALLBACK;
> >  	struct iomap iomap = { 0 };
> > +	struct iomap srcmap = { 0 };
> >  	pgoff_t max_pgoff;
> >  	void *entry;
> >  	loff_t pos;
> > @@ -1534,7 +1536,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
> >  	 * to look up our filesystem block.
> >  	 */
> >  	pos = (loff_t)xas.xa_index << PAGE_SHIFT;
> > -	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
> > +	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap, &srcmap);
> 
> Line too long?
> 
> Also, I guess the DAX and directio write paths will just WARN_ON_ONCE if
> someone feeds them an IOMAP_COW type iomap?
> 
> Ah, right, I guess the only filesystems that use iomap directio and
> iomap dax don't support COW. :)

Well, I am planning on adding direct I/O to btrfs using iomap, but am stuck
with some bugs. So, I will add the WARN_ON().

> 
> --D
> 
-- 
Goldwyn

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] iomap: Check iblocksize before transforming page->private
  2019-06-22  0:21   ` Darrick J. Wong
@ 2019-06-25 19:22     ` Goldwyn Rodrigues
  0 siblings, 0 replies; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-25 19:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-btrfs, linux-fsdevel, hch, david

On 17:21 21/06, Darrick J. Wong wrote:
> On Fri, Jun 21, 2019 at 02:28:25PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > btrfs uses page->private as well to store extent_buffer. Make
> > the check stricter to make sure we are using page->private for iop by
> > comparing iblocksize < PAGE_SIZE.
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> /me wonders what will happen when btrfs decides to support blocksize !=
> pagesize... will we have to add a pointer to struct iomap_page so that
> btrfs can continue to associate an extent_buffer with a page?
> 

Yes, I did not think about that. It would be nasty to put wrappers
to get to the right pointer.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] iomap: Check iblocksize before transforming page->private
  2019-06-25 18:56     ` Goldwyn Rodrigues
@ 2019-06-25 20:04       ` Filipe Manana
  2019-06-26  3:03         ` Goldwyn Rodrigues
  2019-06-26  6:16       ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: Filipe Manana @ 2019-06-25 20:04 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Christoph Hellwig, linux-btrfs, linux-fsdevel, Darrick J. Wong,
	Dave Chinner

On Tue, Jun 25, 2019 at 8:58 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
> On  9:05 24/06, Christoph Hellwig wrote:
> > On Fri, Jun 21, 2019 at 02:28:25PM -0500, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > >
> > > btrfs uses page->private as well to store extent_buffer. Make
> > > the check stricter to make sure we are using page->private for iop by
> > > comparing iblocksize < PAGE_SIZE.
> > >
> > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >
> > If btrfs uses page->private itself and also uses functions that call
> > to_iomap_page we have a major problem, as we now have a usage conflict.
> >
> > How do you end up here?
> >
>
> Btrfs uses page->private to identify which extent_buffer it belongs to.
> So, if you read, it fills the page->private. Then you try to write to
> it, iomap will assume it to be iomap_page pointer.
>
> I don't think we can move extent_buffer out of page->private for btrfs.
> Any other ideas?

The extent buffer is only used for pages belonging to the btree inode (i.e.
pages that correspond to a btree node/leaf).
Haven't looked at this patchset in detail, but you can't do buffered
writes or direct IO against the btree inode, can you?
So for file inodes, this problem doesn't exist.

Thanks.

>
> --
> Goldwyn



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-06-25 19:14     ` Goldwyn Rodrigues
@ 2019-06-26  1:36       ` Shiyang Ruan
  2019-06-26  6:39       ` Christoph Hellwig
  2019-06-26 18:00       ` Darrick J. Wong
  2 siblings, 0 replies; 27+ messages in thread
From: Shiyang Ruan @ 2019-06-26  1:36 UTC (permalink / raw)
  To: Goldwyn Rodrigues, Christoph Hellwig
  Cc: linux-btrfs, linux-fsdevel, darrick.wong, david



On 6/26/19 3:14 AM, Goldwyn Rodrigues wrote:
> On  9:07 24/06, Christoph Hellwig wrote:
>> xfs will need to be updated to fill in the additional iomap for the
>> COW case.  Has this series been tested on xfs?
>>
> 
> No, I have not tested this, or make xfs set IOMAP_COW. I will try to do
> it in the next iteration.

Hi Goldwyn,

I'm willing to help you with this test on XFS.  If you need any help, 
please feel free to tell me. :)


-- 
Thanks,
Shiyang Ruan.
> 
>> I can't say I'm a huge fan of this two iomaps in one method call
>> approach.  I always though two separate iomap iterations would be nicer,
>> but compared to that even the older hack with just the additional
>> src_addr seems a little better.
> 
> I am just expanding on your idea of using multiple iterations for the Cow case
> in the hope we can come out of a good design:
> 
> 1. iomap_file_buffered_write calls iomap_apply with IOMAP_WRITE flag.
>     which calls iomap_begin() for the respective filesystem.
> 2. btrfs_iomap_begin() sets up iomap->type as IOMAP_COW and fills iomap
>     struct with read addr information.
> 3. iomap_apply() conditionally for IOMAP_COW calls do_cow(new function)
>     and calls ops->iomap_begin() with flag IOMAP_COW_READ_DONE(new flag).
> 4. btrfs_iomap_begin() fills up iomap structure with write information.
> 
> Step 3 seems out of place because iomap_apply should be iomap.type agnostic.
> Right?
> Should we be adding another flag IOMAP_COW_DONE, just to figure out that
> this is the "real" write for iomap_begin to fill iomap?
> 
> If this is not how you imagined, could you elaborate on the dual iteration
> sequence?
> 
> 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] iomap: Check iblocksize before transforming page->private
  2019-06-25 20:04       ` Filipe Manana
@ 2019-06-26  3:03         ` Goldwyn Rodrigues
  2019-06-26  6:42           ` Nikolay Borisov
  0 siblings, 1 reply; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-26  3:03 UTC (permalink / raw)
  To: Filipe Manana
  Cc: Christoph Hellwig, linux-btrfs, linux-fsdevel, Darrick J. Wong,
	Dave Chinner

On 21:04 25/06, Filipe Manana wrote:
> On Tue, Jun 25, 2019 at 8:58 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> >
> > On  9:05 24/06, Christoph Hellwig wrote:
> > > On Fri, Jun 21, 2019 at 02:28:25PM -0500, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > >
> > > > btrfs uses page->private as well to store extent_buffer. Make
> > > > the check stricter to make sure we are using page->private for iop by
> > > > comparing iblocksize < PAGE_SIZE.
> > > >
> > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > >
> > > If btrfs uses page->private itself and also uses functions that call
> > > to_iomap_page we have a major problem, as we now have a usage conflict.
> > >
> > > How do you end up here?
> > >
> >
> > Btrfs uses page->private to identify which extent_buffer it belongs to.
> > So, if you read, it fills the page->private. Then you try to write to
> > it, iomap will assume it to be iomap_page pointer.
> >
> > I don't think we can move extent_buffer out of page->private for btrfs.
> > Any other ideas?
> 
> The extent buffer is only for pages belonging to the btree inode (i.e.
> pages that correspond to a btree node/lead).
> Haven't looked in detail to this patchset, but you can't do buffered
> writes or direct IO against the btree inode, can you?
> So for file inodes, this problem doesn't exist.

Why do we call set_page_extent_mapped(page) in lock_and_cleanup_extent_if_needed() or __do_readpage()?

I must admit, the crash backtraces that I saw had page->private set to
EXTENT_PAGE_PRIVATE rather than to an extent_buffer pointer. Does that mean
calling this function is not necessary in these codepaths?

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] iomap: Check iblocksize before transforming page->private
  2019-06-25 18:56     ` Goldwyn Rodrigues
  2019-06-25 20:04       ` Filipe Manana
@ 2019-06-26  6:16       ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-06-26  6:16 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Christoph Hellwig, linux-btrfs, linux-fsdevel, darrick.wong, david

On Tue, Jun 25, 2019 at 01:56:59PM -0500, Goldwyn Rodrigues wrote:
> Btrfs uses page->private to identify which extent_buffer it belongs to.
> So, if you read, it fills the page->private. Then you try to write to
> it, iomap will assume it to be iomap_page pointer.

Yes, and that is going to run into problems sooner or later, that is
if you want to support sub-page size block sizes in btrfs, which I
thought is work in progress, or if you ever want to write through iomap.

> I don't think we can move extent_buffer out of page->private for btrfs.
> Any other ideas?

I think you'll have to.  That being said I don't see why you'd need
data in page->private for pages potentially being read in a setup
where blocksize == PAGESIZE anyway.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-06-22  0:46   ` Darrick J. Wong
  2019-06-25 19:17     ` Goldwyn Rodrigues
@ 2019-06-26  6:21     ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2019-06-26  6:21 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Goldwyn Rodrigues, linux-btrfs, linux-fsdevel, hch, david,
	Goldwyn Rodrigues

On Fri, Jun 21, 2019 at 05:46:24PM -0700, Darrick J. Wong wrote:
> On Fri, Jun 21, 2019 at 02:28:23PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > Introduces a new type IOMAP_COW, which means the data at offset
> > must be read from a srcmap and copied before performing the
> > write on the offset.
> > 
> > The srcmap is used to identify where the read is to be performed
> > from. This is passed to iomap->begin(), which is supposed to
> > put in the details for reading, typically set with type IOMAP_READ.
> 
> What is IOMAP_READ ?

The lack of flags.  Which reminds me that our IOMAP_* types have
pretty much gotten out of hand in how we use some flags that really
are different types vs others that are modifiers.  We'll need to clean
this up a bit eventually.

> 
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/dax.c              |  8 +++++---
> >  fs/ext2/inode.c       |  2 +-
> >  fs/ext4/inode.c       |  2 +-
> >  fs/gfs2/bmap.c        |  3 ++-
> >  fs/internal.h         |  2 +-
> >  fs/iomap.c            | 31 ++++++++++++++++---------------
> >  fs/xfs/xfs_iomap.c    |  9 ++++++---
> >  include/linux/iomap.h |  4 +++-
> >  8 files changed, 35 insertions(+), 26 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 2e48c7ebb973..80b9e2599223 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1078,7 +1078,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
> >  
> >  static loff_t
> >  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > -		struct iomap *iomap)
> > +		struct iomap *iomap, struct iomap *srcmap)
> >  {
> >  	struct block_device *bdev = iomap->bdev;
> >  	struct dax_device *dax_dev = iomap->dax_dev;
> > @@ -1236,6 +1236,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> >  	unsigned long vaddr = vmf->address;
> >  	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
> >  	struct iomap iomap = { 0 };
> > +	struct iomap srcmap = { 0 };
> >  	unsigned flags = IOMAP_FAULT;
> >  	int error, major = 0;
> >  	bool write = vmf->flags & FAULT_FLAG_WRITE;
> > @@ -1280,7 +1281,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> >  	 * the file system block size to be equal the page size, which means
> >  	 * that we never have to deal with more than a single extent here.
> >  	 */
> > -	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
> > +	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap, &srcmap);
> >  	if (iomap_errp)
> >  		*iomap_errp = error;
> >  	if (error) {
> > @@ -1460,6 +1461,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
> >  	struct inode *inode = mapping->host;
> >  	vm_fault_t result = VM_FAULT_FALLBACK;
> >  	struct iomap iomap = { 0 };
> > +	struct iomap srcmap = { 0 };
> >  	pgoff_t max_pgoff;
> >  	void *entry;
> >  	loff_t pos;
> > @@ -1534,7 +1536,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
> >  	 * to look up our filesystem block.
> >  	 */
> >  	pos = (loff_t)xas.xa_index << PAGE_SHIFT;
> > -	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
> > +	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap, &srcmap);
> 
> Line too long?
> 
> Also, I guess the DAX and directio write paths will just WARN_ON_ONCE if
> someone feeds them an IOMAP_COW type iomap?
> 
> Ah, right, I guess the only filesystems that use iomap directio and
> iomap dax don't support COW. :)

?  XFS does iomap-based COW for direct I/O.  But we don't use IOMAP_COW
yet with this series as far as I can tell.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-06-25 19:14     ` Goldwyn Rodrigues
  2019-06-26  1:36       ` Shiyang Ruan
@ 2019-06-26  6:39       ` Christoph Hellwig
  2019-06-26 16:10         ` Goldwyn Rodrigues
  2019-06-26 18:00       ` Darrick J. Wong
  2 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2019-06-26  6:39 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Christoph Hellwig, linux-btrfs, linux-fsdevel, darrick.wong, david

On Tue, Jun 25, 2019 at 02:14:42PM -0500, Goldwyn Rodrigues wrote:
> > I can't say I'm a huge fan of this two iomaps in one method call
> > approach.  I always though two separate iomap iterations would be nicer,
> > but compared to that even the older hack with just the additional
> > src_addr seems a little better.
> 
> I am just expanding on your idea of using multiple iterations for the Cow case
> in the hope we can come out of a good design:
> 
> 1. iomap_file_buffered_write calls iomap_apply with IOMAP_WRITE flag.
>    which calls iomap_begin() for the respective filesystem.
> 2. btrfs_iomap_begin() sets up iomap->type as IOMAP_COW and fills iomap
>    struct with read addr information.
> 3. iomap_apply() conditionally for IOMAP_COW calls do_cow(new function)
>    and calls ops->iomap_begin() with flag IOMAP_COW_READ_DONE(new flag).
> 4. btrfs_iomap_begin() fills up iomap structure with write information.
> 
> Step 3 seems out of place because iomap_apply should be iomap.type agnostic.
> Right?
> Should we be adding another flag IOMAP_COW_DONE, just to figure out that
> this is the "real" write for iomap_begin to fill iomap?
> 
> If this is not how you imagined, could you elaborate on the dual iteration
> sequence?

Here are my thoughts from dealing with this a while ago, all
XFS based of course.

If iomap_file_buffered_write is called on a page that is inside a COW
extent we have the following options:

 a) the page is uptodate or entirely overwritten.  We can just allocate
    new COW blocks and return them, and we are done
 b) the page is not/partially uptodate and not entirely overwritten.

The latter case is the interesting one.  My thought was that iff the
IOMAP_F_SHARED flag is set, __iomap_write_begin / iomap_read_page_sync
will then have to retrieve the source information in some form.

My original plan was to just do a nested iomap_apply call, which would
need a special nested flag to not duplicate any locking the file
system might be holding between ->iomap_begin and ->iomap_end.

The upside here is that there is no additional overhead for the non-COW
path and the architecture looks relatively clean.  The downside is that
at least for XFS we usually have to look up the source information
anyway before allocating the COW destination extent, so we'd have to
cache that information somewhere or redo it, which would be rather
pointless.  At that point the idea of a srcaddr in the iomap becomes
interesting again - while it looks a little ugly from the architectural
POV it actually ends up having very practical benefits.
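
(To show what the srcaddr variant would look like on the consumer side,
very roughly and completely untested; srcaddr here is the hypothetical
extra field in struct iomap:)

	/* in iomap_read_page_sync() / __iomap_write_begin(), roughly: */
	sector_t sector;

	if (iomap->flags & IOMAP_F_SHARED)
		/* the old data lives at the shared source, not the new blocks */
		sector = (iomap->srcaddr + block_start - iomap->offset)
				>> SECTOR_SHIFT;
	else
		sector = iomap_sector(iomap, block_start);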
> 
> 
> -- 
> Goldwyn
---end quoted text---

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] iomap: Check iblocksize before transforming page->private
  2019-06-26  3:03         ` Goldwyn Rodrigues
@ 2019-06-26  6:42           ` Nikolay Borisov
  0 siblings, 0 replies; 27+ messages in thread
From: Nikolay Borisov @ 2019-06-26  6:42 UTC (permalink / raw)
  To: Goldwyn Rodrigues, Filipe Manana
  Cc: Christoph Hellwig, linux-btrfs, linux-fsdevel, Darrick J. Wong,
	Dave Chinner, Chris Mason



On 26.06.19 г. 6:03 ч., Goldwyn Rodrigues wrote:
> On 21:04 25/06, Filipe Manana wrote:
>> On Tue, Jun 25, 2019 at 8:58 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>>>
>>> On  9:05 24/06, Christoph Hellwig wrote:
>>>> On Fri, Jun 21, 2019 at 02:28:25PM -0500, Goldwyn Rodrigues wrote:
>>>>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>>>>
>>>>> btrfs uses page->private as well to store extent_buffer. Make
>>>>> the check stricter to make sure we are using page->private for iop by
>>>>> comparing iblocksize < PAGE_SIZE.
>>>>>
>>>>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>>>
>>>> If btrfs uses page->private itself and also uses functions that call
>>>> to_iomap_page we have a major problem, as we now have a usage conflict.
>>>>
>>>> How do you end up here?
>>>>
>>>
>>> Btrfs uses page->private to identify which extent_buffer it belongs to.
>>> So, if you read, it fills the page->private. Then you try to write to
>>> it, iomap will assume it to be iomap_page pointer.
>>>
>>> I don't think we can move extent_buffer out of page->private for btrfs.
>>> Any other ideas?
>>
>> The extent buffer is only for pages belonging to the btree inode (i.e.
>> pages that correspond to a btree node/lead).
>> Haven't looked in detail to this patchset, but you can't do buffered
>> writes or direct IO against the btree inode, can you?
>> So for file inodes, this problem doesn't exist.
> 
> Why do we call set_page_extent_mapped(page) in lock_and_cleanup_extent_if_needed() or __do_readpage()?
> 
> I must admit, the backtrace crashes that I saw had the page->private set to
> EXTENT_PAGE_PRIVATE rather than the extent_buffer pointer. Does that mean
> calling this function is not necessary in these codepaths?

On a quick look it seems PagePrivate is not used for ordinary data
pages, e.g. set_page_extent_mapped seems to be a leftover from old code.
No one is actually checking the EXTENT_PAGE_PRIVATE flag. For data pages
what seems to be important is PagePrivate2 for the fixup worker. IMO
set_page_extent_mapped should be removed, as well as references to
page->private in the context of ordinary data pages.

Chris, any ideas what set_page_extent_mapped was used for?

> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-06-26  6:39       ` Christoph Hellwig
@ 2019-06-26 16:10         ` Goldwyn Rodrigues
  2019-06-26 17:34           ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-26 16:10 UTC (permalink / raw)
  To: Christoph Hellwig, darrick.wong, david; +Cc: linux-btrfs, linux-fsdevel

On  8:39 26/06, Christoph Hellwig wrote:
> On Tue, Jun 25, 2019 at 02:14:42PM -0500, Goldwyn Rodrigues wrote:
> > > I can't say I'm a huge fan of this two iomaps in one method call
> > > approach.  I always though two separate iomap iterations would be nicer,
> > > but compared to that even the older hack with just the additional
> > > src_addr seems a little better.
> > 
> > I am just expanding on your idea of using multiple iterations for the Cow case
> > in the hope we can come out of a good design:
> > 
> > 1. iomap_file_buffered_write calls iomap_apply with IOMAP_WRITE flag.
> >    which calls iomap_begin() for the respective filesystem.
> > 2. btrfs_iomap_begin() sets up iomap->type as IOMAP_COW and fills iomap
> >    struct with read addr information.
> > 3. iomap_apply() conditionally for IOMAP_COW calls do_cow(new function)
> >    and calls ops->iomap_begin() with flag IOMAP_COW_READ_DONE(new flag).
> > 4. btrfs_iomap_begin() fills up iomap structure with write information.
> > 
> > Step 3 seems out of place because iomap_apply should be iomap.type agnostic.
> > Right?
> > Should we be adding another flag IOMAP_COW_DONE, just to figure out that
> > this is the "real" write for iomap_begin to fill iomap?
> > 
> > If this is not how you imagined, could you elaborate on the dual iteration
> > sequence?
> 
> Here are my thoughts from dealing with this from a while ago, all
> XFS based of course.
> 
> If iomap_file_buffered_write is called on a page that is inside a COW
> extent we have the following options:
> 
>  a) the page is updatodate or entirely overwritten.  We cn just allocate
>     new COW blocks and return them, and we are done
>  b) the page is not/partially uptodate and not entirely overwritten.
> 
> The latter case is the interesting one.  My thought was that iff the
> IOMAP_F_SHARED flag is set __iomap_write_begin / iomap_read_page_sync
> will then have to retreive the source information in some form.
> 
> My original plan was to just do a nested iomap_apply call, which would
> need a special nested flag to not duplicate any locking the file
> system might be holding between ->iomap_begin and ->iomap_end.
> 
> The upside here is that there is no additional overhead for the non-COW
> path and the architecture looks relatively clean.  The downside is that
> at least for XFS we usually have to look up the source information
> anyway before allocating the COW destination extent, so we'd have to
> cache that information somewhere or redo it, which would be rather
> pointless.  At that point the idea of a srcaddr in the iomap becomes
> interesting again - while it looks a little ugly from the architectural
> POV it actually ends up having very practical benefits.


So, do we move back to the design of adding an extra srcaddr field?
Honestly, I find the design of an extra srcaddr field in iomap better and
simpler than passing an additional srcmap iomap or doing multiple iterations.

Also, should we add another iomap type IOMAP_COW, or (re)use the flag
IOMAP_F_SHARED during writes? IOW iomap type vs iomap flag.

Dave/Darrick, what are your thoughts?

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-06-26 16:10         ` Goldwyn Rodrigues
@ 2019-06-26 17:34           ` Darrick J. Wong
  0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2019-06-26 17:34 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: Christoph Hellwig, david, linux-btrfs, linux-fsdevel

On Wed, Jun 26, 2019 at 11:10:17AM -0500, Goldwyn Rodrigues wrote:
> On  8:39 26/06, Christoph Hellwig wrote:
> > On Tue, Jun 25, 2019 at 02:14:42PM -0500, Goldwyn Rodrigues wrote:
> > > > I can't say I'm a huge fan of this two iomaps in one method call
> > > > approach.  I always though two separate iomap iterations would be nicer,
> > > > but compared to that even the older hack with just the additional
> > > > src_addr seems a little better.
> > > 
> > > I am just expanding on your idea of using multiple iterations for the Cow case
> > > in the hope we can come out of a good design:
> > > 
> > > 1. iomap_file_buffered_write calls iomap_apply with IOMAP_WRITE flag.
> > >    which calls iomap_begin() for the respective filesystem.
> > > 2. btrfs_iomap_begin() sets up iomap->type as IOMAP_COW and fills iomap
> > >    struct with read addr information.
> > > 3. iomap_apply() conditionally for IOMAP_COW calls do_cow(new function)
> > >    and calls ops->iomap_begin() with flag IOMAP_COW_READ_DONE(new flag).
> > > 4. btrfs_iomap_begin() fills up iomap structure with write information.
> > > 
> > > Step 3 seems out of place because iomap_apply should be iomap.type agnostic.
> > > Right?
> > > Should we be adding another flag IOMAP_COW_DONE, just to figure out that
> > > this is the "real" write for iomap_begin to fill iomap?
> > > 
> > > If this is not how you imagined, could you elaborate on the dual iteration
> > > sequence?
> > 
> > Here are my thoughts from dealing with this from a while ago, all
> > XFS based of course.
> > 
> > If iomap_file_buffered_write is called on a page that is inside a COW
> > extent we have the following options:
> > 
> >  a) the page is uptodate or entirely overwritten.  We can just allocate
> >     new COW blocks and return them, and we are done
> >  b) the page is not/partially uptodate and not entirely overwritten.
> > 
> > The latter case is the interesting one.  My thought was that iff the
> > IOMAP_F_SHARED flag is set __iomap_write_begin / iomap_read_page_sync
> > will then have to retrieve the source information in some form.
> > 
> > My original plan was to just do a nested iomap_apply call, which would
> > need a special nested flag to not duplicate any locking the file
> > system might be holding between ->iomap_begin and ->iomap_end.
> > 
> > The upside here is that there is no additional overhead for the non-COW
> > path and the architecture looks relatively clean.  The downside is that
> > at least for XFS we usually have to look up the source information
> > anyway before allocating the COW destination extent, so we'd have to
> > cache that information somewhere or redo it, which would be rather
> > pointless.  At that point the idea of a srcaddr in the iomap becomes
> > interesting again - while it looks a little ugly from the architectural
> > POV it actually ends up having very practical benefits.

I think it's less complicated to pass both mappings out in a single
->iomap_begin call rather than have this dance where the fs tells iomap
it needs a read mapping and then iomap calls back into the fs for that
read mapping with a special "don't take locks" flag.

For XFS specifically this means we can serve both mappings with a single
ILOCK cycle.

> So, do we move back to the design of adding an extra srcaddr field?

TLDR: Please no.

> Honestly, I find the design of using an extra srcaddr field in iomap better
> and simpler than passing an additional srcmap iomap or doing multiple iterations.

Putting my long-range planning hat on, the usage story (from the fs'
perspective) here is:

"iomap wants to know how a file write should map to a disk write.  If
we're doing a straight overwrite of disk blocks then I should send back
the relevant mapping.  Sometimes I might need the write to go to a
totally different location than where the data is currently stored, so I
need to send back a second mapping."

Because iomap is now a general-purpose API, we need to think about the
read mapping for a moment:

 - For all disk-based filesystems we'll want the source address for the
   read mapping.

 - For filesystems that support "inline data" (which really just means
   the fs maintains its own buffers to file data) we'll also need the
   inline_data pointer.

 - For filesystems that support multiple devices (like btrfs) we'll also
   need a pointer to a block_device because we could be writing to a
   different device than the one that stores the data.  The prime
   example I can think of is reading data from disk A in some RAID
   stripe and writing to disk B in a different RAID stripe to solve the
   RAID5 hole... but you could just be lazy-migrating file data to less
   full or newer drives or whatever.

 - If we ever encounter a filesystem that supports multiple dax devices
   then we'll need a pointer to the dax_device too.  (That would be
   btrfs, since I thought your larger goal was to enable dax there...)

 - We're probably going to need the ability to pass flags for the read
   mapping at some point or another, so we need that too.

From this, you can see that we need half the fields in the existing
struct iomap, and that's how I arrived at the idea of passing to
iomap_begin pointers to two iomaps instead of bolting these five fields
into struct iomap.
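
To make the trade-off concrete, bolting the source information onto
struct iomap would mean roughly the following extra fields (a
hypothetical sketch of the rejected alternative, with made-up field
names):

/*
 * Hypothetical: the five pieces of source information listed above,
 * duplicated inside struct iomap instead of passing a second iomap.
 */
struct iomap {
        /* ... existing fields describing the write (destination) mapping ... */
        u64                     src_addr;        /* disk offset to read from */
        void                    *src_inline_data; /* fs-owned buffer holding the data */
        struct block_device     *src_bdev;       /* device holding the data */
        struct dax_device       *src_dax_dev;    /* dax device holding the data */
        u16                     src_flags;       /* flags for the read mapping */
};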

In XFS parlance (where the data fork stores mappings for on-disk data
and the cow fork stores mappings for copy-on-write staging extents),
this means that our iomap_begin
data paths remain fairly straightforward:

	xfs_bmapi_read(ip, offset, XFS_DATA_FORK, &imap...);
	xfs_bmapi_read(ip, offset, XFS_COW_FORK, &cmap...);

	xfs_bmbt_to_iomap(ip, iomap, &imap...);
	iomap->type = IOMAP_COW;
	xfs_bmbt_to_iomap(ip, srcmap, &cmap...);

(It's more complicated than that, but that's approximately what we do
now.)

> Also, should we add another iomap type IOMAP_COW, or (re)use the flag
> IOMAP_F_SHARED during writes? IOW iomap type vs iomap flag.

I think they need to remain distinct types because the IOMAP_F_SHARED
flag means that the storage is shared amongst multiple owners (which
iomap_fiemap reports to userspace), whereas the IOMAP_COW type means
that RMW is required for unaligned writes.

btrfs has a strong tendency to require copy writes, but that doesn't
mean the extent is necessarily shared.

> Dave/Darrick, what are your thoughts?

I liked where the first 3 patches of this series were heading. :)

--D

> 
> -- 
> Goldwyn

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-06-25 19:14     ` Goldwyn Rodrigues
  2019-06-26  1:36       ` Shiyang Ruan
  2019-06-26  6:39       ` Christoph Hellwig
@ 2019-06-26 18:00       ` Darrick J. Wong
  2019-06-26 18:42         ` Goldwyn Rodrigues
  2 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2019-06-26 18:00 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: Christoph Hellwig, linux-btrfs, linux-fsdevel, david

On Tue, Jun 25, 2019 at 02:14:42PM -0500, Goldwyn Rodrigues wrote:
> On  9:07 24/06, Christoph Hellwig wrote:
> > xfs will need to be updated to fill in the additional iomap for the
> > COW case.  Has this series been tested on xfs?
> > 
> 
> No, I have not tested this, or make xfs set IOMAP_COW. I will try to do
> it in the next iteration.

AFAICT even if you did absolutely nothing XFS would continue to work
properly, because iomap_write_begin doesn't actually care whether it's
going to be a COW write: the only IO it does from the mapping is to
read in the non-uptodate parts of the page if the write offset/len
aren't page-aligned.
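
A compressed sketch of that behaviour (the helper name is illustrative,
not the actual fs/iomap.c code): whether the write path reads anything
at all is decided from page state and alignment alone, regardless of
whether the eventual write is COW.

/* Assumes a power-of-two block size. */
static bool write_needs_read(bool page_uptodate, loff_t pos,
                             unsigned int len, unsigned int blocksize)
{
        if (page_uptodate)
                return false;   /* data is already cached, nothing to read */
        if ((pos | len) & (blocksize - 1))
                return true;    /* partially covered blocks: read-modify-write */
        return false;           /* whole aligned blocks are simply overwritten */
}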

> > I can't say I'm a huge fan of this two iomaps in one method call
> > approach.  I always thought two separate iomap iterations would be nicer,
> > but compared to that even the older hack with just the additional
> > src_addr seems a little better.
> 
> I am just expanding on your idea of using multiple iterations for the CoW case
> in the hope we can come up with a good design:
> 
> 1. iomap_file_buffered_write calls iomap_apply with IOMAP_WRITE flag.
>    which calls iomap_begin() for the respective filesystem.
> 2. btrfs_iomap_begin() sets up iomap->type as IOMAP_COW and fills iomap
>    struct with read addr information.
> 3. iomap_apply() conditionally for IOMAP_COW calls do_cow(new function)
>    and calls ops->iomap_begin() with flag IOMAP_COW_READ_DONE(new flag).

Unless I'm misreading this, you don't need a do_cow() or
IOMAP_COW_READ_DONE because the page state tracks that for you:

iomap_write_begin calls ->iomap_begin to learn from where it should read
data if the write is not aligned to a page and the page isn't uptodate.
If it's IOMAP_COW then we learn from *srcmap instead of *iomap.

(The write actor then dirties the page)

fsync() or whatever

The mm calls ->writepage.  The filesystem grabs the new COW mapping,
constructs a bio with the new mapping and dirty pages, and submits the
bio.  pagesize >= blocksize so we're always writing full blocks.

The writeback bio completes and calls ->bio_endio, which is the
filesystem's trigger to make the mapping changes permanent, update
ondisk file size, etc.

For direct writes that are not block-aligned, we just bounce the write
to the page cache...

...so it's only dax_iomap_rw where we're going to have to do the COW
ourselves.  That's simple -- map both addresses, copy the regions before
offset and after offset+len, then proceed with writing whatever
userspace sent us.  No need for the iomap code itself to get involved.
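
A minimal sketch of that copy-around, assuming a single block and
directly addressable (already mapped) source and destination buffers;
the helper name and signature are made up for illustration, and a real
dax path would likely use its own access primitives rather than a plain
memcpy:

/*
 * Copy the parts of the block that the write does not touch from the
 * old (source) location into the freshly allocated destination, before
 * the new user data is laid down over [off, off + len).
 */
static void dax_cow_copy_edges(void *dst, const void *src,
                               size_t blocksize, size_t off, size_t len)
{
        if (off)                                /* bytes before the write range */
                memcpy(dst, src, off);
        if (off + len < blocksize)              /* bytes after the write range */
                memcpy((char *)dst + off + len,
                       (const char *)src + off + len,
                       blocksize - off - len);
}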

> 4. btrfs_iomap_begin() fills up iomap structure with write information.
> 
> Step 3 seems out of place because iomap_apply should be iomap.type agnostic.
> Right?
> Should we be adding another flag IOMAP_COW_DONE, just to figure out that
> this is the "real" write for iomap_begin to fill iomap?
> 
> If this is not how you imagined, could you elaborate on the dual iteration
> sequence?

--D

> 
> 
> -- 
> Goldwyn

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-06-26 18:00       ` Darrick J. Wong
@ 2019-06-26 18:42         ` Goldwyn Rodrigues
  0 siblings, 0 replies; 27+ messages in thread
From: Goldwyn Rodrigues @ 2019-06-26 18:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-btrfs, linux-fsdevel, david

On 11:00 26/06, Darrick J. Wong wrote:
> On Tue, Jun 25, 2019 at 02:14:42PM -0500, Goldwyn Rodrigues wrote:
> > On  9:07 24/06, Christoph Hellwig wrote:
> > > xfs will need to be updated to fill in the additional iomap for the
> > > COW case.  Has this series been tested on xfs?
> > > 
> > 
> > No, I have not tested this, or make xfs set IOMAP_COW. I will try to do
> > it in the next iteration.
> 
> AFAICT even if you did absolutely nothing XFS would continue to work
> properly, because iomap_write_begin doesn't actually care whether it's
> going to be a COW write: the only IO it does from the mapping is to
> read in the non-uptodate parts of the page if the write offset/len
> aren't page-aligned.
> 
> > > I can't say I'm a huge fan of this two iomaps in one method call
> > > approach.  I always thought two separate iomap iterations would be nicer,
> > > but compared to that even the older hack with just the additional
> > > src_addr seems a little better.
> > 
> > I am just expanding on your idea of using multiple iterations for the CoW case
> > in the hope we can come up with a good design:
> > 
> > 1. iomap_file_buffered_write calls iomap_apply with IOMAP_WRITE flag.
> >    which calls iomap_begin() for the respective filesystem.
> > 2. btrfs_iomap_begin() sets up iomap->type as IOMAP_COW and fills iomap
> >    struct with read addr information.
> > 3. iomap_apply() conditionally for IOMAP_COW calls do_cow(new function)
> >    and calls ops->iomap_begin() with flag IOMAP_COW_READ_DONE(new flag).
> 
> Unless I'm misreading this, you don't need a do_cow() or
> IOMAP_COW_READ_DONE because the page state tracks that for you:
> 
> iomap_write_begin calls ->iomap_begin to learn from where it should read
> data if the write is not aligned to a page and the page isn't uptodate.
> If it's IOMAP_COW then we learn from *srcmap instead of *iomap.
> 

We were discussing the single iomap structure (without srcmap), but
with two iterations, the first to get the read information and the second
to get the write information for CoW.

> (The write actor then dirties the page)
> 
> fsync() or whatever
> 
> The mm calls ->writepage.  The filesystem grabs the new COW mapping,
> constructs a bio with the new mapping and dirty pages, and submits the
> bio.  pagesize >= blocksize so we're always writing full blocks.
> 
> The writeback bio completes and calls ->bio_endio, which is the
> filesystem's trigger to make the mapping changes permanent, update
> ondisk file size, etc.
> 
> For direct writes that are not block-aligned, we just bounce the write
> to the page cache...
> 
> ...so it's only dax_iomap_rw where we're going to have to do the COW
> ourselves.  That's simple -- map both addresses, copy the regions before
> offset and after offset+len, then proceed with writing whatever
> userspace sent us.  No need for the iomap code itself to get involved.

Yes, this is how the latest patch series works, with two iomap
structures passed as iomap_begin() parameters.
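
For reference, the consumer side of that is small; an illustrative
helper (the name and the IOMAP_COW type come from this discussion, not
from mainline) could look like:

/*
 * With two mappings coming back from ->iomap_begin, the generic
 * read-modify-write path reads from the srcmap for a COW write and
 * from the write mapping itself otherwise.
 */
static const struct iomap *
iomap_rmw_source(const struct iomap *iomap, const struct iomap *srcmap)
{
        return iomap->type == IOMAP_COW ? srcmap : iomap;
}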

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2019-06-26 18:42 UTC | newest]

Thread overview: 27+ messages
2019-06-21 19:28 [PATCH 0/6] Btrfs iomap Goldwyn Rodrigues
2019-06-21 19:28 ` [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O Goldwyn Rodrigues
2019-06-22  0:46   ` Darrick J. Wong
2019-06-25 19:17     ` Goldwyn Rodrigues
2019-06-26  6:21     ` Christoph Hellwig
2019-06-24  7:07   ` Christoph Hellwig
2019-06-25 19:14     ` Goldwyn Rodrigues
2019-06-26  1:36       ` Shiyang Ruan
2019-06-26  6:39       ` Christoph Hellwig
2019-06-26 16:10         ` Goldwyn Rodrigues
2019-06-26 17:34           ` Darrick J. Wong
2019-06-26 18:00       ` Darrick J. Wong
2019-06-26 18:42         ` Goldwyn Rodrigues
2019-06-21 19:28 ` [PATCH 2/6] iomap: Read page from srcmap for IOMAP_COW Goldwyn Rodrigues
2019-06-22  0:41   ` Darrick J. Wong
2019-06-21 19:28 ` [PATCH 3/6] iomap: Check iblocksize before transforming page->private Goldwyn Rodrigues
2019-06-22  0:21   ` Darrick J. Wong
2019-06-25 19:22     ` Goldwyn Rodrigues
2019-06-24  7:05   ` Christoph Hellwig
2019-06-25 18:56     ` Goldwyn Rodrigues
2019-06-25 20:04       ` Filipe Manana
2019-06-26  3:03         ` Goldwyn Rodrigues
2019-06-26  6:42           ` Nikolay Borisov
2019-06-26  6:16       ` Christoph Hellwig
2019-06-21 19:28 ` [PATCH 4/6] btrfs: Add a simple buffered iomap write Goldwyn Rodrigues
2019-06-21 19:28 ` [PATCH 5/6] btrfs: Add CoW in iomap based writes Goldwyn Rodrigues
2019-06-21 19:28 ` [PATCH 6/6] btrfs: remove buffered write code made unnecessary Goldwyn Rodrigues
