Linux-BTRFS Archive on lore.kernel.org
* [PATCH v2 0/13] Btrfs iomap
@ 2019-08-02 22:00 Goldwyn Rodrigues
  2019-08-02 22:00 ` [PATCH 01/13] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O Goldwyn Rodrigues
                   ` (12 more replies)
  0 siblings, 13 replies; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst

This is an effort to use iomap for btrfs. It moves most of the
responsibility for page handling during writes into the iomap code,
which reduces the amount of btrfs-specific code. For CoW support,
changes are needed in the iomap code to make sure we perform a copy
before the write. This is in line with the discussion we had while
adding dax support to btrfs.

Efforts on adding dax support have been put on hold until MM experts can
come up with a way of performing multiple mappings to a single page
(primarily the TODO before dax_associate_entry()). While we are waiting
on that, we can add support for buffered writes in btrfs.

[1] https://github.com/goldwynr/linux/tree/btrfs-iomap

-- 
Goldwyn

Changes since v1
- Added Direct I/O support
- Remove PagePrivate from btrfs pages for regular files

 fs/btrfs/Makefile           |    2 
 fs/btrfs/compression.c      |    1 
 fs/btrfs/ctree.h            |   15 -
 fs/btrfs/extent_io.c        |   13 
 fs/btrfs/extent_io.h        |    2 
 fs/btrfs/file.c             |  520 --------------------------------------
 fs/btrfs/free-space-cache.c |    1 
 fs/btrfs/inode.c            |  170 +++---------
 fs/btrfs/ioctl.c            |    4 
 fs/btrfs/iomap.c            |  600 +++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/relocation.c       |    2 
 fs/dax.c                    |    8 
 fs/ext2/inode.c             |    2 
 fs/ext4/inode.c             |    2 
 fs/gfs2/bmap.c              |    3 
 fs/iomap/apply.c            |    5 
 fs/iomap/buffered-io.c      |   28 +-
 fs/iomap/direct-io.c        |   18 -
 fs/iomap/fiemap.c           |    4 
 fs/iomap/seek.c             |    4 
 fs/iomap/swapfile.c         |    3 
 fs/xfs/xfs_iomap.c          |    9 
 include/linux/iomap.h       |    7 
 23 files changed, 727 insertions(+), 696 deletions(-)



^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 01/13] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
@ 2019-08-02 22:00 ` Goldwyn Rodrigues
  2019-08-03  0:39   ` Darrick J. Wong
  2019-08-05  0:06   ` Dave Chinner
  2019-08-02 22:00 ` [PATCH 02/13] iomap: Read page from srcmap for IOMAP_COW Goldwyn Rodrigues
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Introduce a new mapping type, IOMAP_COW, which means the data at the
offset must be read from a srcmap and copied before the write at the
offset is performed.

The srcmap identifies where the read is to be performed from. It is
passed to the respective filesystem's ->iomap_begin(), which is
expected to fill in the details for reading before the copy for CoW
is performed.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/dax.c               |  8 +++++---
 fs/ext2/inode.c        |  2 +-
 fs/ext4/inode.c        |  2 +-
 fs/gfs2/bmap.c         |  3 ++-
 fs/iomap/apply.c       |  5 +++--
 fs/iomap/buffered-io.c | 14 +++++++-------
 fs/iomap/direct-io.c   |  2 +-
 fs/iomap/fiemap.c      |  4 ++--
 fs/iomap/seek.c        |  4 ++--
 fs/iomap/swapfile.c    |  3 ++-
 fs/xfs/xfs_iomap.c     |  9 ++++++---
 include/linux/iomap.h  |  6 ++++--
 12 files changed, 36 insertions(+), 26 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index a237141d8787..b21d9a9cde2b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1090,7 +1090,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, struct iomap *srcmap)
 {
 	struct block_device *bdev = iomap->bdev;
 	struct dax_device *dax_dev = iomap->dax_dev;
@@ -1248,6 +1248,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	unsigned long vaddr = vmf->address;
 	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
 	struct iomap iomap = { 0 };
+	struct iomap srcmap = { 0 };
 	unsigned flags = IOMAP_FAULT;
 	int error, major = 0;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -1292,7 +1293,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	 * the file system block size to be equal the page size, which means
 	 * that we never have to deal with more than a single extent here.
 	 */
-	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
+	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap, &srcmap);
 	if (iomap_errp)
 		*iomap_errp = error;
 	if (error) {
@@ -1472,6 +1473,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	struct inode *inode = mapping->host;
 	vm_fault_t result = VM_FAULT_FALLBACK;
 	struct iomap iomap = { 0 };
+	struct iomap srcmap = { 0 };
 	pgoff_t max_pgoff;
 	void *entry;
 	loff_t pos;
@@ -1546,7 +1548,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	 * to look up our filesystem block.
 	 */
 	pos = (loff_t)xas.xa_index << PAGE_SHIFT;
-	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
+	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap, &srcmap);
 	if (error)
 		goto unlock_entry;
 
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 7004ce581a32..467c13ff6b40 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -801,7 +801,7 @@ int ext2_get_block(struct inode *inode, sector_t iblock,
 
 #ifdef CONFIG_FS_DAX
 static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
-		unsigned flags, struct iomap *iomap)
+		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	unsigned int blkbits = inode->i_blkbits;
 	unsigned long first_block = offset >> blkbits;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 420fe3deed39..918f94eff799 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3453,7 +3453,7 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
 }
 
 static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
-			    unsigned flags, struct iomap *iomap)
+			    unsigned flags, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	unsigned int blkbits = inode->i_blkbits;
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 79581b9bdebb..0bf8e8fa82bd 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1123,7 +1123,8 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 }
 
 static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
-			    unsigned flags, struct iomap *iomap)
+			    unsigned flags, struct iomap *iomap,
+			    struct iomap *srcmap)
 {
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct metapath mp = { .mp_aheight = 1, };
diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 54c02aecf3cd..6cdb362fff36 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -24,6 +24,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 		const struct iomap_ops *ops, void *data, iomap_actor_t actor)
 {
 	struct iomap iomap = { 0 };
+	struct iomap srcmap = { 0 };
 	loff_t written = 0, ret;
 
 	/*
@@ -38,7 +39,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	 * expose transient stale data. If the reserve fails, we can safely
 	 * back out at this point as there is nothing to undo.
 	 */
-	ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
+	ret = ops->iomap_begin(inode, pos, length, flags, &iomap, &srcmap);
 	if (ret)
 		return ret;
 	if (WARN_ON(iomap.offset > pos))
@@ -58,7 +59,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	 * we can do the copy-in page by page without having to worry about
 	 * failures exposing transient data.
 	 */
-	written = actor(inode, pos, length, data, &iomap);
+	written = actor(inode, pos, length, data, &iomap, &srcmap);
 
 	/*
 	 * Now the data has been copied, commit the range we've copied.  This
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e25901ae3ff4..f27756c0b31c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -205,7 +205,7 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
 
 static loff_t
 iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
 	struct page *page = ctx->cur_page;
@@ -351,7 +351,7 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
 
 static loff_t
 iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap)
+		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
 	loff_t done, ret;
@@ -371,7 +371,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 			ctx->cur_page_in_bio = false;
 		}
 		ret = iomap_readpage_actor(inode, pos + done, length - done,
-				ctx, iomap);
+				ctx, iomap, srcmap);
 	}
 
 	return done;
@@ -736,7 +736,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 
 static loff_t
 iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iov_iter *i = data;
 	long status = 0;
@@ -853,7 +853,7 @@ __iomap_read_page(struct inode *inode, loff_t offset)
 
 static loff_t
 iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, struct iomap *srcmap)
 {
 	long status = 0;
 	ssize_t written = 0;
@@ -942,7 +942,7 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
 
 static loff_t
 iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-		void *data, struct iomap *iomap)
+		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	bool *did_zero = data;
 	loff_t written = 0;
@@ -1011,7 +1011,7 @@ EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
 static loff_t
 iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap)
+		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct page *page = data;
 	int ret;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 10517cea9682..5279029c7a3c 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -362,7 +362,7 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
 
 static loff_t
 iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap)
+		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_dio *dio = data;
 
diff --git a/fs/iomap/fiemap.c b/fs/iomap/fiemap.c
index f26fdd36e383..690ef2d7c6c8 100644
--- a/fs/iomap/fiemap.c
+++ b/fs/iomap/fiemap.c
@@ -44,7 +44,7 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
 
 static loff_t
 iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
-		struct iomap *iomap)
+		struct iomap *iomap, struct iomap *srcmap)
 {
 	struct fiemap_ctx *ctx = data;
 	loff_t ret = length;
@@ -111,7 +111,7 @@ EXPORT_SYMBOL_GPL(iomap_fiemap);
 
 static loff_t
 iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
-		void *data, struct iomap *iomap)
+		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	sector_t *bno = data, addr;
 
diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
index c04bad4b2b43..89f61d93c0bc 100644
--- a/fs/iomap/seek.c
+++ b/fs/iomap/seek.c
@@ -119,7 +119,7 @@ page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
 
 static loff_t
 iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap)
+		      void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	switch (iomap->type) {
 	case IOMAP_UNWRITTEN:
@@ -165,7 +165,7 @@ EXPORT_SYMBOL_GPL(iomap_seek_hole);
 
 static loff_t
 iomap_seek_data_actor(struct inode *inode, loff_t offset, loff_t length,
-		      void *data, struct iomap *iomap)
+		      void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	switch (iomap->type) {
 	case IOMAP_HOLE:
diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index 152a230f668d..a648dbf6991e 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -76,7 +76,8 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi)
  * distinction between written and unwritten extents.
  */
 static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
-		loff_t count, void *data, struct iomap *iomap)
+		loff_t count, void *data, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct iomap_swapfile_info *isi = data;
 	int error;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 3a4310d7cb59..8321733c16c3 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -922,7 +922,8 @@ xfs_file_iomap_begin(
 	loff_t			offset,
 	loff_t			length,
 	unsigned		flags,
-	struct iomap		*iomap)
+	struct iomap		*iomap,
+	struct iomap		*srcmap)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1145,7 +1146,8 @@ xfs_seek_iomap_begin(
 	loff_t			offset,
 	loff_t			length,
 	unsigned		flags,
-	struct iomap		*iomap)
+	struct iomap		*iomap,
+	struct iomap		*srcmap)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
@@ -1231,7 +1233,8 @@ xfs_xattr_iomap_begin(
 	loff_t			offset,
 	loff_t			length,
 	unsigned		flags,
-	struct iomap		*iomap)
+	struct iomap		*iomap,
+	struct iomap		*srcmap)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index bc499ceae392..5b2055e8ca8a 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -26,6 +26,7 @@ struct vm_fault;
 #define IOMAP_MAPPED	0x03	/* blocks allocated at @addr */
 #define IOMAP_UNWRITTEN	0x04	/* blocks allocated at @addr in unwritten state */
 #define IOMAP_INLINE	0x05	/* data inline in the inode */
+#define IOMAP_COW	0x06	/* copy data from srcmap before writing */
 
 /*
  * Flags for all iomap mappings:
@@ -110,7 +111,8 @@ struct iomap_ops {
 	 * The actual length is returned in iomap->length.
 	 */
 	int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
-			unsigned flags, struct iomap *iomap);
+			unsigned flags, struct iomap *iomap,
+			struct iomap *srcmap);
 
 	/*
 	 * Commit and/or unreserve space previous allocated using iomap_begin.
@@ -126,7 +128,7 @@ struct iomap_ops {
  * Main iomap iterator function.
  */
 typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
-		void *data, struct iomap *iomap);
+		void *data, struct iomap *iomap, struct iomap *srcmap);
 
 loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 		unsigned flags, const struct iomap_ops *ops, void *data,
-- 
2.16.4



* [PATCH 02/13] iomap: Read page from srcmap for IOMAP_COW
  2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
  2019-08-02 22:00 ` [PATCH 01/13] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O Goldwyn Rodrigues
@ 2019-08-02 22:00 ` Goldwyn Rodrigues
  2019-08-03  0:23   ` Darrick J. Wong
  2019-08-04 23:52   ` Dave Chinner
  2019-08-02 22:00 ` [PATCH 03/13] btrfs: Eliminate PagePrivate for btrfs data pages Goldwyn Rodrigues
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

In the case of an IOMAP_COW mapping, read the page from the srcmap
before performing a write on the page.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/iomap/buffered-io.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f27756c0b31c..a96cc26eec92 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -581,7 +581,7 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
 
 static int
 iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
-		struct page **pagep, struct iomap *iomap)
+		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
 {
 	const struct iomap_page_ops *page_ops = iomap->page_ops;
 	pgoff_t index = pos >> PAGE_SHIFT;
@@ -607,6 +607,8 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 
 	if (iomap->type == IOMAP_INLINE)
 		iomap_read_inline_data(inode, page, iomap);
+	else if (iomap->type == IOMAP_COW)
+		status = __iomap_write_begin(inode, pos, len, page, srcmap);
 	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
 		status = __block_write_begin_int(page, pos, len, NULL, iomap);
 	else
@@ -772,7 +774,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		}
 
 		status = iomap_write_begin(inode, pos, bytes, flags, &page,
-				iomap);
+				iomap, srcmap);
 		if (unlikely(status))
 			break;
 
@@ -871,7 +873,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			return PTR_ERR(rpage);
 
 		status = iomap_write_begin(inode, pos, bytes,
-					   AOP_FLAG_NOFS, &page, iomap);
+					   AOP_FLAG_NOFS, &page, iomap, srcmap);
 		put_page(rpage);
 		if (unlikely(status))
 			return status;
@@ -917,13 +919,13 @@ iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
 EXPORT_SYMBOL_GPL(iomap_file_dirty);
 
 static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
-		unsigned bytes, struct iomap *iomap)
+		unsigned bytes, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct page *page;
 	int status;
 
 	status = iomap_write_begin(inode, pos, bytes, AOP_FLAG_NOFS, &page,
-				   iomap);
+				   iomap, srcmap);
 	if (status)
 		return status;
 
@@ -961,7 +963,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		if (IS_DAX(inode))
 			status = iomap_dax_zero(pos, offset, bytes, iomap);
 		else
-			status = iomap_zero(inode, pos, offset, bytes, iomap);
+			status = iomap_zero(inode, pos, offset, bytes, iomap, srcmap);
 		if (status < 0)
 			return status;
 
-- 
2.16.4



* [PATCH 03/13] btrfs: Eliminate PagePrivate for btrfs data pages
  2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
  2019-08-02 22:00 ` [PATCH 01/13] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O Goldwyn Rodrigues
  2019-08-02 22:00 ` [PATCH 02/13] iomap: Read page from srcmap for IOMAP_COW Goldwyn Rodrigues
@ 2019-08-02 22:00 ` Goldwyn Rodrigues
  2019-08-02 22:00 ` [PATCH 04/13] btrfs: Add a simple buffered iomap write Goldwyn Rodrigues
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

While most of the code works after simply eliminating the page's
private field and the related code, there is a problem when cloning:
the extent code assumes the data is uptodate. Clear the EXTENT_UPTODATE
flag for the extent so that the next time the file is read, the data is
forced to be read from disk as opposed to the pagecache.

This patch is required to make sure we don't conflict with iomap's
use of page->private.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/compression.c      |  1 -
 fs/btrfs/extent_io.c        | 13 -------------
 fs/btrfs/extent_io.h        |  2 --
 fs/btrfs/file.c             |  1 -
 fs/btrfs/free-space-cache.c |  1 -
 fs/btrfs/inode.c            | 15 +--------------
 fs/btrfs/ioctl.c            |  4 ++--
 fs/btrfs/relocation.c       |  2 --
 8 files changed, 3 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 60c47b417a4b..fe41fa3d2999 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -481,7 +481,6 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 		 * for these bytes in the file.  But, we have to make
 		 * sure they map to this compressed extent on disk.
 		 */
-		set_page_extent_mapped(page);
 		lock_extent(tree, last_offset, end);
 		read_lock(&em_tree->lock);
 		em = lookup_extent_mapping(em_tree, last_offset,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1ff438fd5bc2..27233fb6660c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3005,15 +3005,6 @@ static void attach_extent_buffer_page(struct extent_buffer *eb,
 	}
 }
 
-void set_page_extent_mapped(struct page *page)
-{
-	if (!PagePrivate(page)) {
-		SetPagePrivate(page);
-		get_page(page);
-		set_page_private(page, EXTENT_PAGE_PRIVATE);
-	}
-}
-
 static struct extent_map *
 __get_extent_map(struct inode *inode, struct page *page, size_t pg_offset,
 		 u64 start, u64 len, get_extent_t *get_extent,
@@ -3074,8 +3065,6 @@ static int __do_readpage(struct extent_io_tree *tree,
 	size_t blocksize = inode->i_sb->s_blocksize;
 	unsigned long this_bio_flag = 0;
 
-	set_page_extent_mapped(page);
-
 	if (!PageUptodate(page)) {
 		if (cleancache_get_page(page) == 0) {
 			BUG_ON(blocksize != PAGE_SIZE);
@@ -3589,8 +3578,6 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 
 	pg_offset = 0;
 
-	set_page_extent_mapped(page);
-
 	if (!epd->extent_locked) {
 		ret = writepage_delalloc(inode, page, wbc, start, &nr_written);
 		if (ret == 1)
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 401423b16976..8082774371b5 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -416,8 +416,6 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
 		     unsigned nr_pages);
 int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		__u64 start, __u64 len);
-void set_page_extent_mapped(struct page *page);
-
 struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 					  u64 start);
 struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 58a18ed11546..4466a09f2d98 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1539,7 +1539,6 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
 	 * delalloc bits and dirty the pages as required.
 	 */
 	for (i = 0; i < num_pages; i++) {
-		set_page_extent_mapped(pages[i]);
 		WARN_ON(!PageLocked(pages[i]));
 	}
 
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 062be9dde4c6..9a0c519bd6d4 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -395,7 +395,6 @@ static int io_ctl_prepare_pages(struct btrfs_io_ctl *io_ctl, struct inode *inode
 
 	for (i = 0; i < io_ctl->num_pages; i++) {
 		clear_page_dirty_for_io(io_ctl->pages[i]);
-		set_page_extent_mapped(io_ctl->pages[i]);
 	}
 
 	return 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ee582a36653d..258bacefdf5f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4932,7 +4932,6 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 	wait_on_page_writeback(page);
 
 	lock_extent_bits(io_tree, block_start, block_end, &cached_state);
-	set_page_extent_mapped(page);
 
 	ordered = btrfs_lookup_ordered_extent(inode, block_start);
 	if (ordered) {
@@ -8754,13 +8753,7 @@ btrfs_readpages(struct file *file, struct address_space *mapping,
 
 static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
 {
-	int ret = try_release_extent_mapping(page, gfp_flags);
-	if (ret == 1) {
-		ClearPagePrivate(page);
-		set_page_private(page, 0);
-		put_page(page);
-	}
-	return ret;
+	return try_release_extent_mapping(page, gfp_flags);
 }
 
 static int btrfs_releasepage(struct page *page, gfp_t gfp_flags)
@@ -8878,11 +8871,6 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	}
 
 	ClearPageChecked(page);
-	if (PagePrivate(page)) {
-		ClearPagePrivate(page);
-		set_page_private(page, 0);
-		put_page(page);
-	}
 }
 
 /*
@@ -8961,7 +8949,6 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 	wait_on_page_writeback(page);
 
 	lock_extent_bits(io_tree, page_start, page_end, &cached_state);
-	set_page_extent_mapped(page);
 
 	/*
 	 * we can't set the delalloc bits if there are pending ordered
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 818f7ec8bb0e..861617e3d0c9 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1355,7 +1355,6 @@ static int cluster_pages_for_defrag(struct inode *inode,
 	for (i = 0; i < i_done; i++) {
 		clear_page_dirty_for_io(pages[i]);
 		ClearPageChecked(pages[i]);
-		set_page_extent_mapped(pages[i]);
 		set_page_dirty(pages[i]);
 		unlock_page(pages[i]);
 		put_page(pages[i]);
@@ -3550,6 +3549,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 	int ret;
 	const u64 len = olen_aligned;
 	u64 last_dest_end = destoff;
+	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 
 	ret = -ENOMEM;
 	buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
@@ -3864,6 +3864,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 						destoff, olen, no_time_update);
 	}
 
+	clear_extent_uptodate(tree, destoff, destoff+olen, NULL);
 out:
 	btrfs_free_path(path);
 	kvfree(buf);
@@ -3935,7 +3936,6 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 	truncate_inode_pages_range(&inode->i_data,
 				round_down(destoff, PAGE_SIZE),
 				round_up(destoff + len, PAGE_SIZE) - 1);
-
 	return ret;
 }
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 7f219851fa23..612988b7eb27 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3300,8 +3300,6 @@ static int relocate_file_extent_cluster(struct inode *inode,
 
 		lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
 
-		set_page_extent_mapped(page);
-
 		if (nr < cluster->nr &&
 		    page_start + offset == cluster->boundary[nr]) {
 			set_extent_bits(&BTRFS_I(inode)->io_tree,
-- 
2.16.4



* [PATCH 04/13] btrfs: Add a simple buffered iomap write
  2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
                   ` (2 preceding siblings ...)
  2019-08-02 22:00 ` [PATCH 03/13] btrfs: Eliminate PagePrivate for btrfs data pages Goldwyn Rodrigues
@ 2019-08-02 22:00 ` Goldwyn Rodrigues
  2019-08-05  0:11   ` Dave Chinner
  2019-08-02 22:00 ` [PATCH 05/13] btrfs: Add CoW in iomap based writes Goldwyn Rodrigues
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Introduce a new btrfs_iomap structure, which carries filesystem state
between the iomap_begin() and iomap_end() calls. This includes
information about reservations and extent locking.

This is a long patch. Most of the code is "inspired" by
fs/btrfs/file.c. To keep its size down, all removals are deferred to
the following patches.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/Makefile |   2 +-
 fs/btrfs/ctree.h  |   1 +
 fs/btrfs/file.c   |   4 +-
 fs/btrfs/iomap.c  | 381 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 385 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/iomap.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 76a843198bcb..f88e696b0698 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -11,7 +11,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
-	   block-rsv.o delalloc-space.o
+	   block-rsv.o delalloc-space.o iomap.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 299e11e6c554..7a4ff524dc77 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3247,6 +3247,7 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 loff_t btrfs_remap_file_range(struct file *file_in, loff_t pos_in,
 			      struct file *file_out, loff_t pos_out,
 			      loff_t len, unsigned int remap_flags);
+size_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4466a09f2d98..0707db04d3cc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1829,7 +1829,7 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		return written;
 
 	pos = iocb->ki_pos;
-	written_buffered = btrfs_buffered_write(iocb, from);
+	written_buffered = btrfs_buffered_iomap_write(iocb, from);
 	if (written_buffered < 0) {
 		err = written_buffered;
 		goto out;
@@ -1966,7 +1966,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		num_written = __btrfs_direct_write(iocb, from);
 	} else {
-		num_written = btrfs_buffered_write(iocb, from);
+		num_written = btrfs_buffered_iomap_write(iocb, from);
 		if (num_written > 0)
 			iocb->ki_pos = pos + num_written;
 		if (clean_page)
diff --git a/fs/btrfs/iomap.c b/fs/btrfs/iomap.c
new file mode 100644
index 000000000000..9eb5e7b7603a
--- /dev/null
+++ b/fs/btrfs/iomap.c
@@ -0,0 +1,381 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * iomap support for BTRFS
+ *
+ * Copyright (c) 2019  SUSE Linux
+ * Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
+ */
+
+#include <linux/iomap.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+#include "volumes.h"
+#include "disk-io.h"
+#include "delalloc-space.h"
+
+struct btrfs_iomap {
+	u64 start;
+	u64 end;
+	bool nocow;
+	int extents_locked;
+	ssize_t reserved_bytes;
+	struct extent_changeset *data_reserved;
+	struct extent_state *cached_state;
+};
+
+
+/*
+ * This function locks the extent and properly waits for data=ordered extents
+ * to finish before allowing the pages to be modified if need.
+ *
+ * The return value:
+ * 1 - the extent is locked
+ * 0 - the extent is not locked, and everything is OK
+ * -EAGAIN - need re-prepare the pages
+ * the other < 0 number - Something wrong happens
+ */
+static noinline int
+lock_and_cleanup_extent(struct btrfs_inode *inode, loff_t pos,
+			 size_t write_bytes,
+			 u64 *lockstart, u64 *lockend,
+			 struct extent_state **cached_state)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	u64 start_pos;
+	u64 last_pos;
+	int ret = 0;
+
+	start_pos = round_down(pos, fs_info->sectorsize);
+	last_pos = start_pos
+		+ round_up(pos + write_bytes - start_pos,
+			   fs_info->sectorsize) - 1;
+
+	if (start_pos < inode->vfs_inode.i_size) {
+		struct btrfs_ordered_extent *ordered;
+
+		lock_extent_bits(&inode->io_tree, start_pos, last_pos,
+				cached_state);
+		ordered = btrfs_lookup_ordered_range(inode, start_pos,
+						     last_pos - start_pos + 1);
+		if (ordered &&
+		    ordered->file_offset + ordered->len > start_pos &&
+		    ordered->file_offset <= last_pos) {
+			unlock_extent_cached(&inode->io_tree, start_pos,
+					last_pos, cached_state);
+			btrfs_start_ordered_extent(&inode->vfs_inode,
+					ordered, 1);
+			btrfs_put_ordered_extent(ordered);
+			return -EAGAIN;
+		}
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+
+		*lockstart = start_pos;
+		*lockend = last_pos;
+		ret = 1;
+	}
+
+	return ret;
+}
+
+static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos,
+				    size_t *write_bytes)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct btrfs_root *root = inode->root;
+	struct btrfs_ordered_extent *ordered;
+	u64 lockstart, lockend;
+	u64 num_bytes;
+	int ret;
+
+	ret = btrfs_start_write_no_snapshotting(root);
+	if (!ret)
+		return -ENOSPC;
+
+	lockstart = round_down(pos, fs_info->sectorsize);
+	lockend = round_up(pos + *write_bytes,
+			   fs_info->sectorsize) - 1;
+
+	while (1) {
+		lock_extent(&inode->io_tree, lockstart, lockend);
+		ordered = btrfs_lookup_ordered_range(inode, lockstart,
+						     lockend - lockstart + 1);
+		if (!ordered) {
+			break;
+		}
+		unlock_extent(&inode->io_tree, lockstart, lockend);
+		btrfs_start_ordered_extent(&inode->vfs_inode, ordered, 1);
+		btrfs_put_ordered_extent(ordered);
+	}
+
+	num_bytes = lockend - lockstart + 1;
+	ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
+			NULL, NULL, NULL);
+	if (ret <= 0) {
+		ret = 0;
+		btrfs_end_write_no_snapshotting(root);
+	} else {
+		*write_bytes = min_t(size_t, *write_bytes,
+				     num_bytes - pos + lockstart);
+	}
+
+	unlock_extent(&inode->io_tree, lockstart, lockend);
+
+	return ret;
+}
+
+static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
+					 const u64 start,
+					 const u64 len,
+					 struct extent_state **cached_state)
+{
+	u64 search_start = start;
+	const u64 end = start + len - 1;
+
+	while (search_start < end) {
+		const u64 search_len = end - search_start + 1;
+		struct extent_map *em;
+		u64 em_len;
+		int ret = 0;
+
+		em = btrfs_get_extent(inode, NULL, 0, search_start,
+				      search_len, 0);
+		if (IS_ERR(em))
+			return PTR_ERR(em);
+
+		if (em->block_start != EXTENT_MAP_HOLE)
+			goto next;
+
+		em_len = em->len;
+		if (em->start < search_start)
+			em_len -= search_start - em->start;
+		if (em_len > search_len)
+			em_len = search_len;
+
+		ret = set_extent_bit(&inode->io_tree, search_start,
+				     search_start + em_len - 1,
+				     EXTENT_DELALLOC_NEW,
+				     NULL, cached_state, GFP_NOFS);
+next:
+		search_start = extent_map_end(em);
+		free_extent_map(em);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static void btrfs_buffered_page_done(struct inode *inode, loff_t pos,
+		unsigned copied, struct page *page,
+		struct iomap *iomap)
+{
+	SetPageUptodate(page);
+	ClearPageChecked(page);
+	set_page_dirty(page);
+	get_page(page);
+}
+
+
+static const struct iomap_page_ops btrfs_buffered_page_ops = {
+	.page_done = btrfs_buffered_page_done,
+};
+
+
+static int btrfs_buffered_iomap_begin(struct inode *inode, loff_t pos,
+		loff_t length, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
+{
+	int ret;
+	size_t write_bytes = length;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	size_t sector_offset = pos & (fs_info->sectorsize - 1);
+	struct btrfs_iomap *bi;
+
+	bi = kzalloc(sizeof(struct btrfs_iomap), GFP_NOFS);
+	if (!bi)
+		return -ENOMEM;
+
+	bi->reserved_bytes = round_up(write_bytes + sector_offset,
+			fs_info->sectorsize);
+
+	/* Reserve data space */
+	ret = btrfs_check_data_free_space(inode, &bi->data_reserved, pos,
+			write_bytes);
+	if (ret < 0) {
+		/*
+		 * Space allocation failed. Let's check if we can
+		 * continue I/O without allocations
+		 */
+		if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+						BTRFS_INODE_PREALLOC)) &&
+				check_can_nocow(BTRFS_I(inode), pos,
+					&write_bytes) > 0) {
+			bi->nocow = true;
+			/*
+			 * our prealloc extent may be smaller than
+			 * write_bytes, so scale down.
+			 */
+			bi->reserved_bytes = round_up(write_bytes +
+					sector_offset,
+					fs_info->sectorsize);
+		} else {
+			goto error;
+		}
+	}
+
+	WARN_ON(bi->reserved_bytes == 0);
+
+	/* We have the data space allocated, reserve the metadata now */
+	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
+			bi->reserved_bytes);
+	if (ret) {
+		struct btrfs_root *root = BTRFS_I(inode)->root;
+		if (!bi->nocow)
+			btrfs_free_reserved_data_space(inode,
+					bi->data_reserved, pos,
+					write_bytes);
+		else
+			btrfs_end_write_no_snapshotting(root);
+		goto error;
+	}
+
+	do {
+		ret = lock_and_cleanup_extent(
+				BTRFS_I(inode), pos, write_bytes, &bi->start,
+				&bi->end, &bi->cached_state);
+	} while (ret == -EAGAIN);
+
+	if (ret < 0) {
+		btrfs_delalloc_release_extents(BTRFS_I(inode),
+					       bi->reserved_bytes, true);
+		goto release;
+	} else {
+		bi->extents_locked = ret;
+	}
+	iomap->private = bi;
+	iomap->length = round_up(write_bytes, fs_info->sectorsize);
+	iomap->offset = round_down(pos, fs_info->sectorsize);
+	iomap->addr = IOMAP_NULL_ADDR;
+	iomap->type = IOMAP_DELALLOC;
+	iomap->bdev = fs_info->fs_devices->latest_bdev;
+	iomap->page_ops = &btrfs_buffered_page_ops;
+	return 0;
+release:
+	if (bi->extents_locked)
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start,
+				bi->end, &bi->cached_state);
+	if (bi->nocow) {
+		struct btrfs_root *root = BTRFS_I(inode)->root;
+		btrfs_end_write_no_snapshotting(root);
+		btrfs_delalloc_release_metadata(BTRFS_I(inode),
+				bi->reserved_bytes, true);
+	} else {
+		btrfs_delalloc_release_space(inode, bi->data_reserved,
+				round_down(pos, fs_info->sectorsize),
+				bi->reserved_bytes, true);
+	}
+	extent_changeset_free(bi->data_reserved);
+
+error:
+	kfree(bi);
+	return ret;
+}
+
+static int btrfs_buffered_iomap_end(struct inode *inode, loff_t pos,
+		loff_t length, ssize_t written, unsigned flags,
+		struct iomap *iomap)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_iomap *bi = iomap->private;
+	ssize_t release_bytes = round_down(bi->reserved_bytes - written,
+			1 << fs_info->sb->s_blocksize_bits);
+	unsigned int extra_bits = 0;
+	u64 start_pos = pos & ~((u64) fs_info->sectorsize - 1);
+	u64 num_bytes = round_up(written + pos - start_pos,
+			fs_info->sectorsize);
+	u64 end_of_last_block = start_pos + num_bytes - 1;
+	int ret = 0;
+
+	if (release_bytes > 0) {
+		if (bi->nocow) {
+			btrfs_delalloc_release_metadata(BTRFS_I(inode),
+					release_bytes, true);
+		} else {
+			u64 __pos = round_down(pos + written, fs_info->sectorsize);
+			btrfs_delalloc_release_space(inode, bi->data_reserved,
+					__pos, release_bytes, true);
+		}
+	}
+
+	/*
+	 * The pages may have already been dirty, clear out old accounting so
+	 * we can set things up properly
+	 */
+	clear_extent_bit(&BTRFS_I(inode)->io_tree, start_pos, end_of_last_block,
+			EXTENT_DIRTY | EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
+			EXTENT_DEFRAG, 0, 0, &bi->cached_state);
+
+	if (!btrfs_is_free_space_inode(BTRFS_I(inode))) {
+		if (start_pos >= i_size_read(inode) &&
+		    !(BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC)) {
+			/*
+			 * There can't be any extents following eof in this case
+			 * so just set the delalloc new bit for the range
+			 * directly.
+			 */
+			extra_bits |= EXTENT_DELALLOC_NEW;
+		} else {
+			ret = btrfs_find_new_delalloc_bytes(BTRFS_I(inode),
+					start_pos, num_bytes,
+					&bi->cached_state);
+			if (ret)
+				goto unlock;
+		}
+	}
+
+	ret = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
+			extra_bits, &bi->cached_state, 0);
+unlock:
+	if (bi->extents_locked)
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree,
+				bi->start, bi->end, &bi->cached_state);
+
+	if (bi->nocow) {
+		struct btrfs_root *root = BTRFS_I(inode)->root;
+		btrfs_end_write_no_snapshotting(root);
+		if (written > 0) {
+			u64 start = round_down(pos, fs_info->sectorsize);
+			u64 end = round_up(pos + written, fs_info->sectorsize) - 1;
+			set_extent_bit(&BTRFS_I(inode)->io_tree, start, end,
+					EXTENT_NORESERVE, NULL, NULL, GFP_NOFS);
+		}
+
+	}
+	btrfs_delalloc_release_extents(BTRFS_I(inode), bi->reserved_bytes,
+			true);
+
+	if (written < fs_info->nodesize)
+		btrfs_btree_balance_dirty(fs_info);
+
+	extent_changeset_free(bi->data_reserved);
+	kfree(bi);
+	return ret;
+}
+
+static const struct iomap_ops btrfs_buffered_iomap_ops = {
+	.iomap_begin            = btrfs_buffered_iomap_begin,
+	.iomap_end              = btrfs_buffered_iomap_end,
+};
+
+ssize_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	ssize_t written;
+	struct inode *inode = file_inode(iocb->ki_filp);
+	written = iomap_file_buffered_write(iocb, from, &btrfs_buffered_iomap_ops);
+	if (written > 0)
+		iocb->ki_pos += written;
+	if (iocb->ki_pos > i_size_read(inode))
+		i_size_write(inode, iocb->ki_pos);
+	return written;
+}
+
-- 
2.16.4


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 05/13] btrfs: Add CoW in iomap based writes
  2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
                   ` (3 preceding siblings ...)
  2019-08-02 22:00 ` [PATCH 04/13] btrfs: Add a simple buffered iomap write Goldwyn Rodrigues
@ 2019-08-02 22:00 ` Goldwyn Rodrigues
  2019-08-05  0:13   ` Dave Chinner
  2019-08-02 22:00 ` [PATCH 06/13] btrfs: remove buffered write code made unnecessary Goldwyn Rodrigues
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Set iomap->type to IOMAP_COW and fill in the source map (srcmap) when
the I/O is not page-aligned, so iomap can read the unaligned head and
tail of the range from the source extent before the write.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/fs/btrfs/iomap.c b/fs/btrfs/iomap.c
index 9eb5e7b7603a..879038e2f1a0 100644
--- a/fs/btrfs/iomap.c
+++ b/fs/btrfs/iomap.c
@@ -165,6 +165,35 @@ static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
 	return 0;
 }
 
+/*
+ * get_iomap: Get the block map and fill the iomap structure
+ * @pos: file position
+ * @length: I/O length
+ * @iomap: The iomap structure to fill
+ */
+
+static int get_iomap(struct inode *inode, loff_t pos, loff_t length,
+		struct iomap *iomap)
+{
+	struct extent_map *em;
+	iomap->addr = IOMAP_NULL_ADDR;
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
+	if (IS_ERR(em))
+		return PTR_ERR(em);
+	/* XXX Do we need to check for em->flags here? */
+	if (em->block_start == EXTENT_MAP_HOLE) {
+		iomap->type = IOMAP_HOLE;
+	} else {
+		iomap->addr = em->block_start;
+		iomap->type = IOMAP_MAPPED;
+	}
+	iomap->offset = em->start;
+	iomap->bdev = em->bdev;
+	iomap->length = em->len;
+	free_extent_map(em);
+	return 0;
+}
+
 static void btrfs_buffered_page_done(struct inode *inode, loff_t pos,
 		unsigned copied, struct page *page,
 		struct iomap *iomap)
@@ -188,6 +217,7 @@ static int btrfs_buffered_iomap_begin(struct inode *inode, loff_t pos,
 	int ret;
 	size_t write_bytes = length;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	loff_t end;
 	size_t sector_offset = pos & (fs_info->sectorsize - 1);
 	struct btrfs_iomap *bi;
 
@@ -255,6 +285,16 @@ static int btrfs_buffered_iomap_begin(struct inode *inode, loff_t pos,
 	iomap->private = bi;
 	iomap->length = round_up(write_bytes, fs_info->sectorsize);
 	iomap->offset = round_down(pos, fs_info->sectorsize);
+	end = pos + write_bytes;
+	/* Set IOMAP_COW if start/end is not page aligned */
+	if ((pos & (PAGE_SIZE - 1)) || (end & (PAGE_SIZE - 1))) {
+		iomap->type = IOMAP_COW;
+		ret = get_iomap(inode, pos, length, srcmap);
+		if (ret < 0)
+			goto release;
+	} else {
+		iomap->type = IOMAP_DELALLOC;
+	}
+
 	iomap->addr = IOMAP_NULL_ADDR;
-	iomap->type = IOMAP_DELALLOC;
 	iomap->bdev = fs_info->fs_devices->latest_bdev;
-- 
2.16.4



* [PATCH 06/13] btrfs: remove buffered write code made unnecessary
  2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
                   ` (4 preceding siblings ...)
  2019-08-02 22:00 ` [PATCH 05/13] btrfs: Add CoW in iomap based writes Goldwyn Rodrigues
@ 2019-08-02 22:00 ` Goldwyn Rodrigues
  2019-08-02 22:00 ` [PATCH 07/13] btrfs: basic direct read operation Goldwyn Rodrigues
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Remove the buffered write code made unnecessary by the switch to iomap.
This is done in a separate patch to keep the main patch shorter.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/file.c | 463 --------------------------------------------------------
 1 file changed, 463 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0707db04d3cc..f7087e28ac08 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -390,79 +390,6 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
 	return 0;
 }
 
-/* simple helper to fault in pages and copy.  This should go away
- * and be replaced with calls into generic code.
- */
-static noinline int btrfs_copy_from_user(loff_t pos, size_t write_bytes,
-					 struct page **prepared_pages,
-					 struct iov_iter *i)
-{
-	size_t copied = 0;
-	size_t total_copied = 0;
-	int pg = 0;
-	int offset = offset_in_page(pos);
-
-	while (write_bytes > 0) {
-		size_t count = min_t(size_t,
-				     PAGE_SIZE - offset, write_bytes);
-		struct page *page = prepared_pages[pg];
-		/*
-		 * Copy data from userspace to the current page
-		 */
-		copied = iov_iter_copy_from_user_atomic(page, i, offset, count);
-
-		/* Flush processor's dcache for this page */
-		flush_dcache_page(page);
-
-		/*
-		 * if we get a partial write, we can end up with
-		 * partially up to date pages.  These add
-		 * a lot of complexity, so make sure they don't
-		 * happen by forcing this copy to be retried.
-		 *
-		 * The rest of the btrfs_file_write code will fall
-		 * back to page at a time copies after we return 0.
-		 */
-		if (!PageUptodate(page) && copied < count)
-			copied = 0;
-
-		iov_iter_advance(i, copied);
-		write_bytes -= copied;
-		total_copied += copied;
-
-		/* Return to btrfs_file_write_iter to fault page */
-		if (unlikely(copied == 0))
-			break;
-
-		if (copied < PAGE_SIZE - offset) {
-			offset += copied;
-		} else {
-			pg++;
-			offset = 0;
-		}
-	}
-	return total_copied;
-}
-
-/*
- * unlocks pages after btrfs_file_write is done with them
- */
-static void btrfs_drop_pages(struct page **pages, size_t num_pages)
-{
-	size_t i;
-	for (i = 0; i < num_pages; i++) {
-		/* page checked is some magic around finding pages that
-		 * have been modified without going through btrfs_set_page_dirty
-		 * clear it here. There should be no need to mark the pages
-		 * accessed as prepare_pages should have marked them accessed
-		 * in prepare_pages via find_or_create_page()
-		 */
-		ClearPageChecked(pages[i]);
-		unlock_page(pages[i]);
-		put_page(pages[i]);
-	}
-}
-
 static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
 					 const u64 start,
 					 const u64 len,
@@ -1387,164 +1314,6 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
-/*
- * on error we return an unlocked page and the error value
- * on success we return a locked page and 0
- */
-static int prepare_uptodate_page(struct inode *inode,
-				 struct page *page, u64 pos,
-				 bool force_uptodate)
-{
-	int ret = 0;
-
-	if (((pos & (PAGE_SIZE - 1)) || force_uptodate) &&
-	    !PageUptodate(page)) {
-		ret = btrfs_readpage(NULL, page);
-		if (ret)
-			return ret;
-		lock_page(page);
-		if (!PageUptodate(page)) {
-			unlock_page(page);
-			return -EIO;
-		}
-		if (page->mapping != inode->i_mapping) {
-			unlock_page(page);
-			return -EAGAIN;
-		}
-	}
-	return 0;
-}
-
-/*
- * this just gets pages into the page cache and locks them down.
- */
-static noinline int prepare_pages(struct inode *inode, struct page **pages,
-				  size_t num_pages, loff_t pos,
-				  size_t write_bytes, bool force_uptodate)
-{
-	int i;
-	unsigned long index = pos >> PAGE_SHIFT;
-	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
-	int err = 0;
-	int faili;
-
-	for (i = 0; i < num_pages; i++) {
-again:
-		pages[i] = find_or_create_page(inode->i_mapping, index + i,
-					       mask | __GFP_WRITE);
-		if (!pages[i]) {
-			faili = i - 1;
-			err = -ENOMEM;
-			goto fail;
-		}
-
-		if (i == 0)
-			err = prepare_uptodate_page(inode, pages[i], pos,
-						    force_uptodate);
-		if (!err && i == num_pages - 1)
-			err = prepare_uptodate_page(inode, pages[i],
-						    pos + write_bytes, false);
-		if (err) {
-			put_page(pages[i]);
-			if (err == -EAGAIN) {
-				err = 0;
-				goto again;
-			}
-			faili = i - 1;
-			goto fail;
-		}
-		wait_on_page_writeback(pages[i]);
-	}
-
-	return 0;
-fail:
-	while (faili >= 0) {
-		unlock_page(pages[faili]);
-		put_page(pages[faili]);
-		faili--;
-	}
-	return err;
-
-}
-
-/*
- * This function locks the extent and properly waits for data=ordered extents
- * to finish before allowing the pages to be modified if need.
- *
- * The return value:
- * 1 - the extent is locked
- * 0 - the extent is not locked, and everything is OK
- * -EAGAIN - need re-prepare the pages
- * the other < 0 number - Something wrong happens
- */
-static noinline int
-lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
-				size_t num_pages, loff_t pos,
-				size_t write_bytes,
-				u64 *lockstart, u64 *lockend,
-				struct extent_state **cached_state)
-{
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	u64 start_pos;
-	u64 last_pos;
-	int i;
-	int ret = 0;
-
-	start_pos = round_down(pos, fs_info->sectorsize);
-	last_pos = start_pos
-		+ round_up(pos + write_bytes - start_pos,
-			   fs_info->sectorsize) - 1;
-
-	if (start_pos < inode->vfs_inode.i_size) {
-		struct btrfs_ordered_extent *ordered;
-
-		lock_extent_bits(&inode->io_tree, start_pos, last_pos,
-				cached_state);
-		ordered = btrfs_lookup_ordered_range(inode, start_pos,
-						     last_pos - start_pos + 1);
-		if (ordered &&
-		    ordered->file_offset + ordered->len > start_pos &&
-		    ordered->file_offset <= last_pos) {
-			unlock_extent_cached(&inode->io_tree, start_pos,
-					last_pos, cached_state);
-			for (i = 0; i < num_pages; i++) {
-				unlock_page(pages[i]);
-				put_page(pages[i]);
-			}
-			btrfs_start_ordered_extent(&inode->vfs_inode,
-					ordered, 1);
-			btrfs_put_ordered_extent(ordered);
-			return -EAGAIN;
-		}
-		if (ordered)
-			btrfs_put_ordered_extent(ordered);
-
-		*lockstart = start_pos;
-		*lockend = last_pos;
-		ret = 1;
-	}
-
-	/*
-	 * It's possible the pages are dirty right now, but we don't want
-	 * to clean them yet because copy_from_user may catch a page fault
-	 * and we might have to fall back to one page at a time.  If that
-	 * happens, we'll unlock these pages and we'd have a window where
-	 * reclaim could sneak in and drop the once-dirty page on the floor
-	 * without writing it.
-	 *
-	 * We have the pages locked and the extent range locked, so there's
-	 * no way someone can start IO on any dirty pages in this range.
-	 *
-	 * We'll call btrfs_dirty_pages() later on, and that will flip around
-	 * delalloc bits and dirty the pages as required.
-	 */
-	for (i = 0; i < num_pages; i++) {
-		WARN_ON(!PageLocked(pages[i]));
-	}
-
-	return ret;
-}
-
 static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos,
 				    size_t *write_bytes)
 {
@@ -1581,238 +1350,6 @@ static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos,
 	return ret;
 }
 
-static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
-					       struct iov_iter *i)
-{
-	struct file *file = iocb->ki_filp;
-	loff_t pos = iocb->ki_pos;
-	struct inode *inode = file_inode(file);
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct page **pages = NULL;
-	struct extent_state *cached_state = NULL;
-	struct extent_changeset *data_reserved = NULL;
-	u64 release_bytes = 0;
-	u64 lockstart;
-	u64 lockend;
-	size_t num_written = 0;
-	int nrptrs;
-	int ret = 0;
-	bool only_release_metadata = false;
-	bool force_page_uptodate = false;
-
-	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
-			PAGE_SIZE / (sizeof(struct page *)));
-	nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
-	nrptrs = max(nrptrs, 8);
-	pages = kmalloc_array(nrptrs, sizeof(struct page *), GFP_KERNEL);
-	if (!pages)
-		return -ENOMEM;
-
-	while (iov_iter_count(i) > 0) {
-		size_t offset = offset_in_page(pos);
-		size_t sector_offset;
-		size_t write_bytes = min(iov_iter_count(i),
-					 nrptrs * (size_t)PAGE_SIZE -
-					 offset);
-		size_t num_pages = DIV_ROUND_UP(write_bytes + offset,
-						PAGE_SIZE);
-		size_t reserve_bytes;
-		size_t dirty_pages;
-		size_t copied;
-		size_t dirty_sectors;
-		size_t num_sectors;
-		int extents_locked;
-
-		WARN_ON(num_pages > nrptrs);
-
-		/*
-		 * Fault pages before locking them in prepare_pages
-		 * to avoid recursive lock
-		 */
-		if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
-			ret = -EFAULT;
-			break;
-		}
-
-		sector_offset = pos & (fs_info->sectorsize - 1);
-		reserve_bytes = round_up(write_bytes + sector_offset,
-				fs_info->sectorsize);
-
-		extent_changeset_release(data_reserved);
-		ret = btrfs_check_data_free_space(inode, &data_reserved, pos,
-						  write_bytes);
-		if (ret < 0) {
-			if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
-						      BTRFS_INODE_PREALLOC)) &&
-			    check_can_nocow(BTRFS_I(inode), pos,
-					&write_bytes) > 0) {
-				/*
-				 * For nodata cow case, no need to reserve
-				 * data space.
-				 */
-				only_release_metadata = true;
-				/*
-				 * our prealloc extent may be smaller than
-				 * write_bytes, so scale down.
-				 */
-				num_pages = DIV_ROUND_UP(write_bytes + offset,
-							 PAGE_SIZE);
-				reserve_bytes = round_up(write_bytes +
-							 sector_offset,
-							 fs_info->sectorsize);
-			} else {
-				break;
-			}
-		}
-
-		WARN_ON(reserve_bytes == 0);
-		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
-				reserve_bytes);
-		if (ret) {
-			if (!only_release_metadata)
-				btrfs_free_reserved_data_space(inode,
-						data_reserved, pos,
-						write_bytes);
-			else
-				btrfs_end_write_no_snapshotting(root);
-			break;
-		}
-
-		release_bytes = reserve_bytes;
-again:
-		/*
-		 * This is going to setup the pages array with the number of
-		 * pages we want, so we don't really need to worry about the
-		 * contents of pages from loop to loop
-		 */
-		ret = prepare_pages(inode, pages, num_pages,
-				    pos, write_bytes,
-				    force_page_uptodate);
-		if (ret) {
-			btrfs_delalloc_release_extents(BTRFS_I(inode),
-						       reserve_bytes, true);
-			break;
-		}
-
-		extents_locked = lock_and_cleanup_extent_if_need(
-				BTRFS_I(inode), pages,
-				num_pages, pos, write_bytes, &lockstart,
-				&lockend, &cached_state);
-		if (extents_locked < 0) {
-			if (extents_locked == -EAGAIN)
-				goto again;
-			btrfs_delalloc_release_extents(BTRFS_I(inode),
-						       reserve_bytes, true);
-			ret = extents_locked;
-			break;
-		}
-
-		copied = btrfs_copy_from_user(pos, write_bytes, pages, i);
-
-		num_sectors = BTRFS_BYTES_TO_BLKS(fs_info, reserve_bytes);
-		dirty_sectors = round_up(copied + sector_offset,
-					fs_info->sectorsize);
-		dirty_sectors = BTRFS_BYTES_TO_BLKS(fs_info, dirty_sectors);
-
-		/*
-		 * if we have trouble faulting in the pages, fall
-		 * back to one page at a time
-		 */
-		if (copied < write_bytes)
-			nrptrs = 1;
-
-		if (copied == 0) {
-			force_page_uptodate = true;
-			dirty_sectors = 0;
-			dirty_pages = 0;
-		} else {
-			force_page_uptodate = false;
-			dirty_pages = DIV_ROUND_UP(copied + offset,
-						   PAGE_SIZE);
-		}
-
-		if (num_sectors > dirty_sectors) {
-			/* release everything except the sectors we dirtied */
-			release_bytes -= dirty_sectors <<
-						fs_info->sb->s_blocksize_bits;
-			if (only_release_metadata) {
-				btrfs_delalloc_release_metadata(BTRFS_I(inode),
-							release_bytes, true);
-			} else {
-				u64 __pos;
-
-				__pos = round_down(pos,
-						   fs_info->sectorsize) +
-					(dirty_pages << PAGE_SHIFT);
-				btrfs_delalloc_release_space(inode,
-						data_reserved, __pos,
-						release_bytes, true);
-			}
-		}
-
-		release_bytes = round_up(copied + sector_offset,
-					fs_info->sectorsize);
-
-		if (copied > 0)
-			ret = btrfs_dirty_pages(inode, pages, dirty_pages,
-						pos, copied, &cached_state);
-		if (extents_locked)
-			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
-					     lockstart, lockend, &cached_state);
-		btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes,
-					       true);
-		if (ret) {
-			btrfs_drop_pages(pages, num_pages);
-			break;
-		}
-
-		release_bytes = 0;
-		if (only_release_metadata)
-			btrfs_end_write_no_snapshotting(root);
-
-		if (only_release_metadata && copied > 0) {
-			lockstart = round_down(pos,
-					       fs_info->sectorsize);
-			lockend = round_up(pos + copied,
-					   fs_info->sectorsize) - 1;
-
-			set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
-				       lockend, EXTENT_NORESERVE, NULL,
-				       NULL, GFP_NOFS);
-			only_release_metadata = false;
-		}
-
-		btrfs_drop_pages(pages, num_pages);
-
-		cond_resched();
-
-		balance_dirty_pages_ratelimited(inode->i_mapping);
-		if (dirty_pages < (fs_info->nodesize >> PAGE_SHIFT) + 1)
-			btrfs_btree_balance_dirty(fs_info);
-
-		pos += copied;
-		num_written += copied;
-	}
-
-	kfree(pages);
-
-	if (release_bytes) {
-		if (only_release_metadata) {
-			btrfs_end_write_no_snapshotting(root);
-			btrfs_delalloc_release_metadata(BTRFS_I(inode),
-					release_bytes, true);
-		} else {
-			btrfs_delalloc_release_space(inode, data_reserved,
-					round_down(pos, fs_info->sectorsize),
-					release_bytes, true);
-		}
-	}
-
-	extent_changeset_free(data_reserved);
-	return num_written ? num_written : ret;
-}
-
 static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
-- 
2.16.4



* [PATCH 07/13] btrfs: basic direct read operation
  2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
                   ` (5 preceding siblings ...)
  2019-08-02 22:00 ` [PATCH 06/13] btrfs: remove buffered write code made unnecessary Goldwyn Rodrigues
@ 2019-08-02 22:00 ` Goldwyn Rodrigues
  2019-08-12 12:32   ` RITESH HARJANI
  2019-08-02 22:00 ` [PATCH 08/13] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Add btrfs_dio_iomap_ops with an iomap_begin() function. In order to
accommodate dio reads, add a new function btrfs_file_read_iter()
which calls btrfs_dio_iomap_read() for DIO reads and falls back to
generic_file_read_iter() otherwise.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/file.c  | 10 +++++++++-
 fs/btrfs/iomap.c | 20 ++++++++++++++++++++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7a4ff524dc77..9eca2d576dd1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3247,7 +3247,9 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 loff_t btrfs_remap_file_range(struct file *file_in, loff_t pos_in,
 			      struct file *file_out, loff_t pos_out,
 			      loff_t len, unsigned int remap_flags);
+/* iomap.c */
 ssize_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from);
+ssize_t btrfs_dio_iomap_read(struct kiocb *iocb, struct iov_iter *to);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f7087e28ac08..997eb152a35a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2839,9 +2839,17 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 	return generic_file_open(inode, filp);
 }
 
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	if (iocb->ki_flags & IOCB_DIRECT)
+		return btrfs_dio_iomap_read(iocb, to);
+
+	return generic_file_read_iter(iocb, to);
+}
+
 const struct file_operations btrfs_file_operations = {
 	.llseek		= btrfs_file_llseek,
-	.read_iter      = generic_file_read_iter,
+	.read_iter      = btrfs_file_read_iter,
 	.splice_read	= generic_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.mmap		= btrfs_file_mmap,
diff --git a/fs/btrfs/iomap.c b/fs/btrfs/iomap.c
index 879038e2f1a0..36df606fc028 100644
--- a/fs/btrfs/iomap.c
+++ b/fs/btrfs/iomap.c
@@ -420,3 +420,23 @@ size_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from)
 	return written;
 }
 
+static int btrfs_dio_iomap_begin(struct inode *inode, loff_t pos,
+		loff_t length, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
+{
+	return get_iomap(inode, pos, length, iomap);
+}
+
+static const struct iomap_ops btrfs_dio_iomap_ops = {
+	.iomap_begin            = btrfs_dio_iomap_begin,
+};
+
+ssize_t btrfs_dio_iomap_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+	inode_lock_shared(inode);
+	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, NULL);
+	inode_unlock_shared(inode);
+	return ret;
+}
-- 
2.16.4



* [PATCH 08/13] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write()
  2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
                   ` (6 preceding siblings ...)
  2019-08-02 22:00 ` [PATCH 07/13] btrfs: basic direct read operation Goldwyn Rodrigues
@ 2019-08-02 22:00 ` Goldwyn Rodrigues
  2019-08-02 22:00 ` [PATCH 09/13] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This makes btrfs_get_extent_map_write() independent of Direct
I/O code.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 40 +++++++++++++++++++++++++++-------------
 2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9eca2d576dd1..66232cbc2414 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3172,6 +3172,8 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location,
 			      struct btrfs_path *path);
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 			 struct btrfs_root *root, int *was_new);
+int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
+		struct inode *inode, u64 start, u64 len);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 				    struct page *page, size_t pg_offset,
 				    u64 start, u64 end, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 258bacefdf5f..24895793fd91 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7592,11 +7592,10 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em,
 	return 0;
 }
 
-static int btrfs_get_blocks_direct_write(struct extent_map **map,
-					 struct buffer_head *bh_result,
-					 struct inode *inode,
-					 struct btrfs_dio_data *dio_data,
-					 u64 start, u64 len)
+int btrfs_get_extent_map_write(struct extent_map **map,
+		struct buffer_head *bh,
+		struct inode *inode,
+		u64 start, u64 len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em = *map;
@@ -7650,22 +7649,38 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 			 */
 			btrfs_free_reserved_data_space_noquota(inode, start,
 							       len);
-			goto skip_cow;
+			/* skip COW */
+			goto out;
 		}
 	}
 
 	/* this will cow the extent */
-	len = bh_result->b_size;
+	if (bh)
+		len = bh->b_size;
 	free_extent_map(em);
 	*map = em = btrfs_new_extent_direct(inode, start, len);
-	if (IS_ERR(em)) {
-		ret = PTR_ERR(em);
-		goto out;
-	}
+	if (IS_ERR(em))
+		return PTR_ERR(em);
+out:
+	return ret;
+}
 
+static int btrfs_get_blocks_direct_write(struct extent_map **map,
+					 struct buffer_head *bh_result,
+					 struct inode *inode,
+					 struct btrfs_dio_data *dio_data,
+					 u64 start, u64 len)
+{
+	int ret;
+	struct extent_map *em;
+
+	ret = btrfs_get_extent_map_write(map, bh_result, inode,
+			start, len);
+	if (ret < 0)
+		return ret;
+	em = *map;
 	len = min(len, em->len - (start - em->start));
 
-skip_cow:
 	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
 		inode->i_blkbits;
 	bh_result->b_size = len;
@@ -7686,7 +7701,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	dio_data->reserve -= len;
 	dio_data->unsubmitted_oe_range_end = start + len;
 	current->journal_info = dio_data;
-out:
 	return ret;
 }
 
-- 
2.16.4


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 09/13] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent()
  2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
                   ` (7 preceding siblings ...)
  2019-08-02 22:00 ` [PATCH 08/13] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
@ 2019-08-02 22:00 ` Goldwyn Rodrigues
  2019-08-02 22:00 ` [PATCH 10/13] iomap: use a function pointer for dio submits Goldwyn Rodrigues
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Since this function will be called from another file, give it a
better name and declare it non-static.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  7 +++++--
 fs/btrfs/inode.c | 14 +++++---------
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 66232cbc2414..b8b19647b43e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3175,8 +3175,11 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh,
 		struct inode *inode, u64 start, u64 len);
 struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
-				    struct page *page, size_t pg_offset,
-				    u64 start, u64 end, int create);
+		struct page *page, size_t pg_offset,
+		u64 start, u64 end, int create);
+void btrfs_update_ordered_extent(struct inode *inode,
+		const u64 offset, const u64 bytes,
+		const bool uptodate);
 int btrfs_update_inode(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root,
 			      struct inode *inode);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 24895793fd91..d415534ce733 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -89,10 +89,6 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
 				       u64 ram_bytes, int compress_type,
 				       int type);
 
-static void __endio_write_update_ordered(struct inode *inode,
-					 const u64 offset, const u64 bytes,
-					 const bool uptodate);
-
 /*
  * Cleanup all submitted ordered extents in specified range to handle errors
  * from the btrfs_run_delalloc_range() callback.
@@ -133,7 +129,7 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
 		bytes -= PAGE_SIZE;
 	}
 
-	return __endio_write_update_ordered(inode, offset, bytes, false);
+	return btrfs_update_ordered_extent(inode, offset, bytes, false);
 }
 
 static int btrfs_dirty_inode(struct inode *inode);
@@ -8176,7 +8172,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
 	bio_put(bio);
 }
 
-static void __endio_write_update_ordered(struct inode *inode,
+void btrfs_update_ordered_extent(struct inode *inode,
 					 const u64 offset, const u64 bytes,
 					 const bool uptodate)
 {
@@ -8229,7 +8225,7 @@ static void btrfs_endio_direct_write(struct bio *bio)
 	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio *dio_bio = dip->dio_bio;
 
-	__endio_write_update_ordered(dip->inode, dip->logical_offset,
+	btrfs_update_ordered_extent(dip->inode, dip->logical_offset,
 				     dip->bytes, !bio->bi_status);
 
 	kfree(dip);
@@ -8546,7 +8542,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		bio = NULL;
 	} else {
 		if (write)
-			__endio_write_update_ordered(inode,
+			btrfs_update_ordered_extent(inode,
 						file_offset,
 						dio_bio->bi_iter.bi_size,
 						false);
@@ -8686,7 +8682,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 			 */
 			if (dio_data.unsubmitted_oe_range_start <
 			    dio_data.unsubmitted_oe_range_end)
-				__endio_write_update_ordered(inode,
+				btrfs_update_ordered_extent(inode,
 					dio_data.unsubmitted_oe_range_start,
 					dio_data.unsubmitted_oe_range_end -
 					dio_data.unsubmitted_oe_range_start,
-- 
2.16.4



* [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
                   ` (8 preceding siblings ...)
  2019-08-02 22:00 ` [PATCH 09/13] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
@ 2019-08-02 22:00 ` Goldwyn Rodrigues
  2019-08-03  0:21   ` Darrick J. Wong
  2019-08-04 23:43   ` Dave Chinner
  2019-08-02 22:00 ` [PATCH 11/13] btrfs: Use iomap_dio_rw for performing direct I/O writes Goldwyn Rodrigues
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This helps filesystems perform tasks on the bio while
submitting it for I/O. Since btrfs requires the position
we are working on, pass pos to iomap_dio_submit_bio().

The correct place for submit_io() is not page_ops. Would it
be better to rename the structure to something like iomap_io_ops,
or to put the pointer directly in struct iomap?

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/iomap/direct-io.c  | 16 +++++++++++-----
 include/linux/iomap.h |  1 +
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 5279029c7a3c..a802e66bf11f 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
 EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
 
 static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
-		struct bio *bio)
+		struct bio *bio, loff_t pos)
 {
 	atomic_inc(&dio->ref);
 
@@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
 		bio_set_polled(bio, dio->iocb);
 
 	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
-	dio->submit.cookie = submit_bio(bio);
+	if (iomap->page_ops && iomap->page_ops->submit_io) {
+		iomap->page_ops->submit_io(bio, file_inode(dio->iocb->ki_filp),
+				pos);
+		dio->submit.cookie = BLK_QC_T_NONE;
+	} else {
+		dio->submit.cookie = submit_bio(bio);
+	}
 }
 
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
@@ -195,7 +201,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	get_page(page);
 	__bio_add_page(bio, page, len, 0);
 	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
-	iomap_dio_submit_bio(dio, iomap, bio);
+	iomap_dio_submit_bio(dio, iomap, bio, pos);
 }
 
 static loff_t
@@ -301,11 +307,11 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		iov_iter_advance(dio->submit.iter, n);
 
 		dio->size += n;
-		pos += n;
 		copied += n;
 
 		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
-		iomap_dio_submit_bio(dio, iomap, bio);
+		iomap_dio_submit_bio(dio, iomap, bio, pos);
+		pos += n;
 	} while (nr_pages);
 
 	/*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 5b2055e8ca8a..6617e4b6fb6d 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -92,6 +92,7 @@ struct iomap_page_ops {
 			struct iomap *iomap);
 	void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
 			struct page *page, struct iomap *iomap);
+	dio_submit_t 		*submit_io;
 };
 
 /*
-- 
2.16.4



* [PATCH 11/13] btrfs: Use iomap_dio_rw for performing direct I/O writes
  2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
                   ` (9 preceding siblings ...)
  2019-08-02 22:00 ` [PATCH 10/13] iomap: use a function pointer for dio submits Goldwyn Rodrigues
@ 2019-08-02 22:00 ` Goldwyn Rodrigues
  2019-08-02 22:00 ` [PATCH 12/13] btrfs: Remove btrfs_dio_data and __btrfs_direct_write Goldwyn Rodrigues
  2019-08-02 22:00 ` [PATCH 13/13] btrfs: update inode size during bio completion Goldwyn Rodrigues
  12 siblings, 0 replies; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Since btrfs Direct I/O needs to perform operations on the bio before
submission, use the submit_io hook, which operates on the bio, to
perform checksum calculations etc.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |   3 ++
 fs/btrfs/file.c  |   2 +-
 fs/btrfs/inode.c |  14 +++--
 fs/btrfs/iomap.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 165 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b8b19647b43e..3b7a6ddceed6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3206,6 +3206,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
 extern const struct dentry_operations btrfs_dentry_operations;
+void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
+		loff_t file_offset);
 
 /* ioctl.c */
 long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
@@ -3255,6 +3257,7 @@ loff_t btrfs_remap_file_range(struct file *file_in, loff_t pos_in,
 /* iomap.c */
 size_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from);
 ssize_t btrfs_dio_iomap_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t btrfs_dio_iomap_write(struct kiocb *iocb, struct iov_iter *from);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 997eb152a35a..faa5ad89469f 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1501,7 +1501,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
-		num_written = __btrfs_direct_write(iocb, from);
+		num_written = btrfs_dio_iomap_write(iocb, from);
 	} else {
 		num_written = btrfs_buffered_iomap_write(iocb, from);
 		if (num_written > 0)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d415534ce733..323d72858c9c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8167,9 +8167,8 @@ static void btrfs_endio_direct_read(struct bio *bio)
 	kfree(dip);
 
 	dio_bio->bi_status = err;
-	dio_end_io(dio_bio);
+	bio_endio(dio_bio);
 	btrfs_io_bio_free_csum(io_bio);
-	bio_put(bio);
 }
 
 void btrfs_update_ordered_extent(struct inode *inode,
@@ -8231,8 +8230,7 @@ static void btrfs_endio_direct_write(struct bio *bio)
 	kfree(dip);
 
 	dio_bio->bi_status = bio->bi_status;
-	dio_end_io(dio_bio);
-	bio_put(bio);
+	bio_endio(dio_bio);
 }
 
 static blk_status_t btrfs_submit_bio_start_direct_io(void *private_data,
@@ -8464,8 +8462,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 	return 0;
 }
 
-static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
-				loff_t file_offset)
+void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
+			 loff_t file_offset)
 {
 	struct btrfs_dio_private *dip = NULL;
 	struct bio *bio = NULL;
@@ -8536,7 +8534,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		/*
 		 * The end io callbacks free our dip, do the final put on bio
 		 * and all the cleanup and final put for dio_bio (through
-		 * dio_end_io()).
+		 * end_io()).
 		 */
 		dip = NULL;
 		bio = NULL;
@@ -8555,7 +8553,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		 * Releases and cleans up our dio_bio, no need to bio_put()
 		 * nor bio_endio()/bio_io_error() against dio_bio.
 		 */
-		dio_end_io(dio_bio);
+		bio_endio(dio_bio);
 	}
 	if (bio)
 		bio_put(bio);
diff --git a/fs/btrfs/iomap.c b/fs/btrfs/iomap.c
index 36df606fc028..329954c8cb88 100644
--- a/fs/btrfs/iomap.c
+++ b/fs/btrfs/iomap.c
@@ -7,6 +7,7 @@
  */
 
 #include <linux/iomap.h>
+#include <linux/uio.h>
 #include "ctree.h"
 #include "btrfs_inode.h"
 #include "volumes.h"
@@ -420,15 +421,127 @@ size_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from)
 	return written;
 }
 
+static const struct iomap_page_ops btrfs_dio_iomap_page_ops = {
+	.submit_io = btrfs_submit_direct,
+};
+
+static struct btrfs_iomap *btrfs_iomap_init(struct inode *inode,
+		struct extent_map **em,
+		loff_t pos, loff_t length)
+{
+	int ret = 0;
+	struct extent_map *map = *em;
+	struct btrfs_iomap *bi;
+	u64 num_bytes;
+
+	bi = kzalloc(sizeof(struct btrfs_iomap), GFP_NOFS);
+	if (!bi)
+		return ERR_PTR(-ENOMEM);
+
+	bi->start = round_down(pos, PAGE_SIZE);
+	bi->end = PAGE_ALIGN(pos + length) - 1;
+	num_bytes = bi->end - bi->start + 1;
+
+	/* Wait for existing ordered extents in range to finish */
+	btrfs_wait_ordered_range(inode, bi->start, num_bytes);
+
+	lock_extent_bits(&BTRFS_I(inode)->io_tree, bi->start, bi->end, &bi->cached_state);
+
+	ret = btrfs_delalloc_reserve_space(inode, &bi->data_reserved,
+			bi->start, num_bytes);
+	if (ret) {
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
+				&bi->cached_state);
+		kfree(bi);
+		return ERR_PTR(ret);
+	}
+
+	refcount_inc(&map->refs);
+	ret = btrfs_get_extent_map_write(em, NULL,
+			inode, bi->start, num_bytes);
+	if (ret) {
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
+				&bi->cached_state);
+		btrfs_delalloc_release_space(inode,
+				bi->data_reserved, bi->start,
+				num_bytes, true);
+		extent_changeset_free(bi->data_reserved);
+		kfree(bi);
+		return ERR_PTR(ret);
+	}
+	free_extent_map(map);
+	return bi;
+}
+
 static int btrfs_dio_iomap_begin(struct inode *inode, loff_t pos,
 		loff_t length, unsigned flags, struct iomap *iomap,
 		struct iomap *srcmap)
 {
-	return get_iomap(inode, pos, length, iomap);
+	struct extent_map *em;
+	struct btrfs_iomap *bi = NULL;
+
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
+
+	if (flags & IOMAP_WRITE) {
+		srcmap->offset = em->start;
+		srcmap->length = em->len;
+		srcmap->bdev = em->bdev;
+		if (em->block_start == EXTENT_MAP_HOLE) {
+			srcmap->type = IOMAP_HOLE;
+		} else {
+			srcmap->type = IOMAP_MAPPED;
+			srcmap->addr = em->block_start;
+		}
+		bi = btrfs_iomap_init(inode, &em, pos, length);
+		if (IS_ERR(bi))
+			return PTR_ERR(bi);
+	}
+
+	iomap->offset = em->start;
+	iomap->length = em->len;
+	iomap->bdev = em->bdev;
+
+	if (em->block_start == EXTENT_MAP_HOLE) {
+		iomap->type = IOMAP_HOLE;
+	} else {
+		iomap->type = IOMAP_MAPPED;
+		iomap->addr = em->block_start;
+	}
+	iomap->private = bi;
+	iomap->page_ops = &btrfs_dio_iomap_page_ops;
+	return 0;
+}
+
+static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos,
+		loff_t length, ssize_t written, unsigned flags,
+		struct iomap *iomap)
+{
+	struct btrfs_iomap *bi = iomap->private;
+	u64 wend;
+	loff_t release_bytes;
+
+	if (!bi)
+		return 0;
+
+	unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end,
+			&bi->cached_state);
+
+	wend = PAGE_ALIGN(pos + written);
+	release_bytes = wend - bi->end - 1;
+	if (release_bytes > 0)
+		btrfs_delalloc_release_space(inode,
+				bi->data_reserved, wend,
+				release_bytes, true);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), wend - bi->start, false);
+	extent_changeset_free(bi->data_reserved);
+	kfree(bi);
+	return 0;
 }
 
 static const struct iomap_ops btrfs_dio_iomap_ops = {
 	.iomap_begin            = btrfs_dio_iomap_begin,
+	.iomap_end              = btrfs_dio_iomap_end,
 };
 
 ssize_t btrfs_dio_iomap_read(struct kiocb *iocb, struct iov_iter *to)
@@ -440,3 +553,42 @@ ssize_t btrfs_dio_iomap_read(struct kiocb *iocb, struct iov_iter *to)
 	inode_unlock_shared(inode);
 	return ret;
 }
+
+ssize_t btrfs_dio_iomap_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	ssize_t written, written_buffered;
+	loff_t pos, endbyte;
+	int err;
+
+	written = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, NULL);
+	if (written < 0 || !iov_iter_count(from))
+		return written;
+
+	pos = iocb->ki_pos;
+	written_buffered = btrfs_buffered_iomap_write(iocb, from);
+	if (written_buffered < 0) {
+		err = written_buffered;
+		goto out;
+	}
+	/*
+	 * Ensure all data is persisted. We want the next direct IO read to be
+	 * able to read what was just written.
+	 */
+	endbyte = pos + written_buffered - 1;
+	err = btrfs_fdatawrite_range(inode, pos, endbyte);
+	if (err)
+		goto out;
+	err = filemap_fdatawait_range(inode->i_mapping, pos, endbyte);
+	if (err)
+		goto out;
+	written += written_buffered;
+	iocb->ki_pos = pos + written_buffered;
+	invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
+			endbyte >> PAGE_SHIFT);
+out:
+	if (written > 0 && iocb->ki_pos > i_size_read(inode))
+		i_size_write(inode, iocb->ki_pos);
+	return written ? written : err;
+}
-- 
2.16.4



* [PATCH 12/13] btrfs: Remove btrfs_dio_data and __btrfs_direct_write
  2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
                   ` (10 preceding siblings ...)
  2019-08-02 22:00 ` [PATCH 11/13] btrfs: Use iomap_dio_rw for performing direct I/O writes Goldwyn Rodrigues
@ 2019-08-02 22:00 ` Goldwyn Rodrigues
  2019-08-02 22:00 ` [PATCH 13/13] btrfs: update inode size during bio completion Goldwyn Rodrigues
  12 siblings, 0 replies; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

btrfs_dio_data is unnecessary since we are now storing all
information in btrfs_iomap.

Advantage: We don't abuse current->journal_info anymore :)
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/file.c  | 40 ----------------------------
 fs/btrfs/inode.c | 81 ++------------------------------------------------------
 2 files changed, 2 insertions(+), 119 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index faa5ad89469f..90a5fa387986 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1350,46 +1350,6 @@ static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos,
 	return ret;
 }
 
-static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
-{
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file_inode(file);
-	loff_t pos;
-	ssize_t written;
-	ssize_t written_buffered;
-	loff_t endbyte;
-	int err;
-
-	written = generic_file_direct_write(iocb, from);
-
-	if (written < 0 || !iov_iter_count(from))
-		return written;
-
-	pos = iocb->ki_pos;
-	written_buffered = btrfs_buffered_iomap_write(iocb, from);
-	if (written_buffered < 0) {
-		err = written_buffered;
-		goto out;
-	}
-	/*
-	 * Ensure all data is persisted. We want the next direct IO read to be
-	 * able to read what was just written.
-	 */
-	endbyte = pos + written_buffered - 1;
-	err = btrfs_fdatawrite_range(inode, pos, endbyte);
-	if (err)
-		goto out;
-	err = filemap_fdatawait_range(inode->i_mapping, pos, endbyte);
-	if (err)
-		goto out;
-	written += written_buffered;
-	iocb->ki_pos = pos + written_buffered;
-	invalidate_mapping_pages(file->f_mapping, pos >> PAGE_SHIFT,
-				 endbyte >> PAGE_SHIFT);
-out:
-	return written ? written : err;
-}
-
 static void update_time_for_write(struct inode *inode)
 {
 	struct timespec64 now;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 323d72858c9c..87fbe73ca2e4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -54,13 +54,6 @@ struct btrfs_iget_args {
 	struct btrfs_root *root;
 };
 
-struct btrfs_dio_data {
-	u64 reserve;
-	u64 unsubmitted_oe_range_start;
-	u64 unsubmitted_oe_range_end;
-	int overwrite;
-};
-
 static const struct inode_operations btrfs_dir_inode_operations;
 static const struct inode_operations btrfs_symlink_inode_operations;
 static const struct inode_operations btrfs_dir_ro_inode_operations;
@@ -7664,7 +7657,6 @@ int btrfs_get_extent_map_write(struct extent_map **map,
 static int btrfs_get_blocks_direct_write(struct extent_map **map,
 					 struct buffer_head *bh_result,
 					 struct inode *inode,
-					 struct btrfs_dio_data *dio_data,
 					 u64 start, u64 len)
 {
 	int ret;
@@ -7686,17 +7678,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	if (!test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
 		set_buffer_new(bh_result);
 
-	/*
-	 * Need to update the i_size under the extent lock so buffered
-	 * readers will get the updated i_size when we unlock.
-	 */
-	if (!dio_data->overwrite && start + len > i_size_read(inode))
-		i_size_write(inode, start + len);
-
-	WARN_ON(dio_data->reserve < len);
-	dio_data->reserve -= len;
-	dio_data->unsubmitted_oe_range_end = start + len;
-	current->journal_info = dio_data;
 	return ret;
 }
 
@@ -7706,7 +7687,6 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em;
 	struct extent_state *cached_state = NULL;
-	struct btrfs_dio_data *dio_data = NULL;
 	u64 start = iblock << inode->i_blkbits;
 	u64 lockstart, lockend;
 	u64 len = bh_result->b_size;
@@ -7721,16 +7701,6 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 	lockstart = start;
 	lockend = start + len - 1;
 
-	if (current->journal_info) {
-		/*
-		 * Need to pull our outstanding extents and set journal_info to NULL so
-		 * that anything that needs to check if there's a transaction doesn't get
-		 * confused.
-		 */
-		dio_data = current->journal_info;
-		current->journal_info = NULL;
-	}
-
 	/*
 	 * If this errors out it's because we couldn't invalidate pagecache for
 	 * this range and we need to fallback to buffered.
@@ -7770,7 +7740,7 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 
 	if (create) {
 		ret = btrfs_get_blocks_direct_write(&em, bh_result, inode,
-						    dio_data, start, len);
+						    start, len);
 		if (ret < 0)
 			goto unlock_err;
 
@@ -7808,8 +7778,6 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 	clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend,
 			 unlock_bits, 1, 0, &cached_state);
 err:
-	if (dio_data)
-		current->journal_info = dio_data;
 	return ret;
 }
 
@@ -8498,21 +8466,6 @@ void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		dip->subio_endio = btrfs_subio_endio_read;
 	}
 
-	/*
-	 * Reset the range for unsubmitted ordered extents (to a 0 length range)
-	 * even if we fail to submit a bio, because in such case we do the
-	 * corresponding error handling below and it must not be done a second
-	 * time by btrfs_direct_IO().
-	 */
-	if (write) {
-		struct btrfs_dio_data *dio_data = current->journal_info;
-
-		dio_data->unsubmitted_oe_range_end = dip->logical_offset +
-			dip->bytes;
-		dio_data->unsubmitted_oe_range_start =
-			dio_data->unsubmitted_oe_range_end;
-	}
-
 	ret = btrfs_submit_direct_hook(dip);
 	if (!ret)
 		return;
@@ -8598,7 +8551,6 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct btrfs_dio_data dio_data = { 0 };
 	struct extent_changeset *data_reserved = NULL;
 	loff_t offset = iocb->ki_pos;
 	size_t count = 0;
@@ -8631,7 +8583,6 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		 * not unlock the i_mutex at this case.
 		 */
 		if (offset + count <= inode->i_size) {
-			dio_data.overwrite = 1;
 			inode_unlock(inode);
 			relock = true;
 		} else if (iocb->ki_flags & IOCB_NOWAIT) {
@@ -8643,16 +8594,6 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		if (ret)
 			goto out;
 
-		/*
-		 * We need to know how many extents we reserved so that we can
-		 * do the accounting properly if we go over the number we
-		 * originally calculated.  Abuse current->journal_info for this.
-		 */
-		dio_data.reserve = round_up(count,
-					    fs_info->sectorsize);
-		dio_data.unsubmitted_oe_range_start = (u64)offset;
-		dio_data.unsubmitted_oe_range_end = (u64)offset;
-		current->journal_info = &dio_data;
 		down_read(&BTRFS_I(inode)->dio_sem);
 	} else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
 				     &BTRFS_I(inode)->runtime_flags)) {
@@ -8667,25 +8608,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 				   btrfs_submit_direct, flags);
 	if (iov_iter_rw(iter) == WRITE) {
 		up_read(&BTRFS_I(inode)->dio_sem);
-		current->journal_info = NULL;
-		if (ret < 0 && ret != -EIOCBQUEUED) {
-			if (dio_data.reserve)
-				btrfs_delalloc_release_space(inode, data_reserved,
-					offset, dio_data.reserve, true);
-			/*
-			 * On error we might have left some ordered extents
-			 * without submitting corresponding bios for them, so
-			 * cleanup them up to avoid other tasks getting them
-			 * and waiting for them to complete forever.
-			 */
-			if (dio_data.unsubmitted_oe_range_start <
-			    dio_data.unsubmitted_oe_range_end)
-				btrfs_update_ordered_extent(inode,
-					dio_data.unsubmitted_oe_range_start,
-					dio_data.unsubmitted_oe_range_end -
-					dio_data.unsubmitted_oe_range_start,
-					false);
-		} else if (ret >= 0 && (size_t)ret < count)
+		if (ret >= 0 && (size_t)ret < count)
 			btrfs_delalloc_release_space(inode, data_reserved,
 					offset, count - (size_t)ret, true);
 		btrfs_delalloc_release_extents(BTRFS_I(inode), count, false);
-- 
2.16.4



* [PATCH 13/13] btrfs: update inode size during bio completion
  2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
                   ` (11 preceding siblings ...)
  2019-08-02 22:00 ` [PATCH 12/13] btrfs: Remove btrfs_dio_data and __btrfs_direct_write Goldwyn Rodrigues
@ 2019-08-02 22:00 ` Goldwyn Rodrigues
  12 siblings, 0 replies; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-02 22:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Update the inode size for dio writes during bio completion. This ties
the inode size update to the success of the underlying block layer I/O,
which matters especially in AIO cases.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/inode.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 87fbe73ca2e4..f87a9dd154a9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8191,9 +8191,13 @@ static void btrfs_endio_direct_write(struct bio *bio)
 {
 	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio *dio_bio = dip->dio_bio;
+	struct inode *inode = dip->inode;
 
-	btrfs_update_ordered_extent(dip->inode, dip->logical_offset,
+	btrfs_update_ordered_extent(inode, dip->logical_offset,
 				     dip->bytes, !bio->bi_status);
+	if (!bio->bi_status &&
+	    i_size_read(inode) < dip->logical_offset + dip->bytes)
+		i_size_write(inode, dip->logical_offset + dip->bytes);
 
 	kfree(dip);
 
-- 
2.16.4



* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-02 22:00 ` [PATCH 10/13] iomap: use a function pointer for dio submits Goldwyn Rodrigues
@ 2019-08-03  0:21   ` Darrick J. Wong
  2019-08-05 16:08     ` Goldwyn Rodrigues
  2019-08-04 23:43   ` Dave Chinner
  1 sibling, 1 reply; 42+ messages in thread
From: Darrick J. Wong @ 2019-08-03  0:21 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-btrfs, hch, ruansy.fnst, Goldwyn Rodrigues

On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> This helps filesystems perform tasks on the bio while
> submitting it for I/O. Since btrfs requires the position
> we are working on, pass pos to iomap_dio_submit_bio().

What /does/ btrfs_submit_direct do, anyway?  Looks like it's a custom
submission function that ... does something related to setting
checksums?  And, uh, RAID?

> The correct place for submit_io() is not page_ops. Would it
> be better to rename the structure to something like iomap_io_ops,
> or to put the pointer directly in struct iomap?

Seeing as the ->iomap_begin handler knows if the requested op is a
buffered write or a direct write, what if we just declare a union of
ops?

e.g.

struct iomap_page_ops;
struct iomap_directio_ops;

struct iomap {
	<usual stuff>
	union {
		const struct iomap_page_ops *page_ops;
		const struct iomap_directio_ops *directio_ops;
	};
};

--D

> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/iomap/direct-io.c  | 16 +++++++++++-----
>  include/linux/iomap.h |  1 +
>  2 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 5279029c7a3c..a802e66bf11f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
>  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
>  
>  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
> -		struct bio *bio)
> +		struct bio *bio, loff_t pos)
>  {
>  	atomic_inc(&dio->ref);
>  
> @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
>  		bio_set_polled(bio, dio->iocb);
>  
>  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> -	dio->submit.cookie = submit_bio(bio);
> +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> +		iomap->page_ops->submit_io(bio, file_inode(dio->iocb->ki_filp),
> +				pos);
> +		dio->submit.cookie = BLK_QC_T_NONE;
> +	} else {
> +		dio->submit.cookie = submit_bio(bio);
> +	}
>  }
>  
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> @@ -195,7 +201,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  	get_page(page);
>  	__bio_add_page(bio, page, len, 0);
>  	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
> -	iomap_dio_submit_bio(dio, iomap, bio);
> +	iomap_dio_submit_bio(dio, iomap, bio, pos);
>  }
>  
>  static loff_t
> @@ -301,11 +307,11 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		iov_iter_advance(dio->submit.iter, n);
>  
>  		dio->size += n;
> -		pos += n;
>  		copied += n;
>  
>  		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
> -		iomap_dio_submit_bio(dio, iomap, bio);
> +		iomap_dio_submit_bio(dio, iomap, bio, pos);
> +		pos += n;
>  	} while (nr_pages);
>  
>  	/*
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5b2055e8ca8a..6617e4b6fb6d 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -92,6 +92,7 @@ struct iomap_page_ops {
>  			struct iomap *iomap);
>  	void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
>  			struct page *page, struct iomap *iomap);
> +	dio_submit_t 		*submit_io;
>  };
>  
>  /*
> -- 
> 2.16.4
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/13] iomap: Read page from srcmap for IOMAP_COW
  2019-08-02 22:00 ` [PATCH 02/13] iomap: Read page from srcmap for IOMAP_COW Goldwyn Rodrigues
@ 2019-08-03  0:23   ` Darrick J. Wong
  2019-08-04 23:52   ` Dave Chinner
  1 sibling, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2019-08-03  0:23 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-btrfs, hch, ruansy.fnst, Goldwyn Rodrigues

On Fri, Aug 02, 2019 at 05:00:37PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> In case of an IOMAP_COW, read a page from the srcmap before

> performing a write on the page.

Looks ok, I think...
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/iomap/buffered-io.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index f27756c0b31c..a96cc26eec92 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -581,7 +581,7 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
>  
>  static int
>  iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
> -		struct page **pagep, struct iomap *iomap)
> +		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	const struct iomap_page_ops *page_ops = iomap->page_ops;
>  	pgoff_t index = pos >> PAGE_SHIFT;
> @@ -607,6 +607,8 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  
>  	if (iomap->type == IOMAP_INLINE)
>  		iomap_read_inline_data(inode, page, iomap);
> +	else if (iomap->type == IOMAP_COW)
> +		status = __iomap_write_begin(inode, pos, len, page, srcmap);
>  	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>  		status = __block_write_begin_int(page, pos, len, NULL, iomap);
>  	else
> @@ -772,7 +774,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		}
>  
>  		status = iomap_write_begin(inode, pos, bytes, flags, &page,
> -				iomap);
> +				iomap, srcmap);
>  		if (unlikely(status))
>  			break;
>  
> @@ -871,7 +873,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  			return PTR_ERR(rpage);
>  
>  		status = iomap_write_begin(inode, pos, bytes,
> -					   AOP_FLAG_NOFS, &page, iomap);
> +					   AOP_FLAG_NOFS, &page, iomap, srcmap);
>  		put_page(rpage);
>  		if (unlikely(status))
>  			return status;
> @@ -917,13 +919,13 @@ iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
>  EXPORT_SYMBOL_GPL(iomap_file_dirty);
>  
>  static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
> -		unsigned bytes, struct iomap *iomap)
> +		unsigned bytes, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct page *page;
>  	int status;
>  
>  	status = iomap_write_begin(inode, pos, bytes, AOP_FLAG_NOFS, &page,
> -				   iomap);
> +				   iomap, srcmap);
>  	if (status)
>  		return status;
>  
> @@ -961,7 +963,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
>  		if (IS_DAX(inode))
>  			status = iomap_dax_zero(pos, offset, bytes, iomap);
>  		else
> -			status = iomap_zero(inode, pos, offset, bytes, iomap);
> +			status = iomap_zero(inode, pos, offset, bytes, iomap, srcmap);
>  		if (status < 0)
>  			return status;
>  
> -- 
> 2.16.4
> 

* Re: [PATCH 01/13] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-08-02 22:00 ` [PATCH 01/13] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O Goldwyn Rodrigues
@ 2019-08-03  0:39   ` Darrick J. Wong
  2019-08-05  0:06   ` Dave Chinner
  1 sibling, 0 replies; 42+ messages in thread
From: Darrick J. Wong @ 2019-08-03  0:39 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-btrfs, hch, ruansy.fnst, Goldwyn Rodrigues

On Fri, Aug 02, 2019 at 05:00:36PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Introduces a new type IOMAP_COW, which means the data at offset
> must be read from a srcmap and copied before performing the
> write on the offset.
> 
> The srcmap is used to identify where the read is to be performed
> from. This is passed to iomap->begin() of the respective
> filesystem, which is supposed to put in the details for
> reading before performing the copy for CoW.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/dax.c               |  8 +++++---
>  fs/ext2/inode.c        |  2 +-
>  fs/ext4/inode.c        |  2 +-
>  fs/gfs2/bmap.c         |  3 ++-
>  fs/iomap/apply.c       |  5 +++--
>  fs/iomap/buffered-io.c | 14 +++++++-------
>  fs/iomap/direct-io.c   |  2 +-
>  fs/iomap/fiemap.c      |  4 ++--
>  fs/iomap/seek.c        |  4 ++--
>  fs/iomap/swapfile.c    |  3 ++-
>  fs/xfs/xfs_iomap.c     |  9 ++++++---
>  include/linux/iomap.h  |  6 ++++--
>  12 files changed, 36 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index a237141d8787..b21d9a9cde2b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1090,7 +1090,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap)
> +		struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct block_device *bdev = iomap->bdev;
>  	struct dax_device *dax_dev = iomap->dax_dev;
> @@ -1248,6 +1248,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	unsigned long vaddr = vmf->address;
>  	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
>  	struct iomap iomap = { 0 };
> +	struct iomap srcmap = { 0 };
>  	unsigned flags = IOMAP_FAULT;
>  	int error, major = 0;
>  	bool write = vmf->flags & FAULT_FLAG_WRITE;
> @@ -1292,7 +1293,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	 * the file system block size to be equal the page size, which means
>  	 * that we never have to deal with more than a single extent here.
>  	 */
> -	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
> +	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap, &srcmap);
>  	if (iomap_errp)
>  		*iomap_errp = error;
>  	if (error) {
> @@ -1472,6 +1473,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	struct inode *inode = mapping->host;
>  	vm_fault_t result = VM_FAULT_FALLBACK;
>  	struct iomap iomap = { 0 };
> +	struct iomap srcmap = { 0 };
>  	pgoff_t max_pgoff;
>  	void *entry;
>  	loff_t pos;
> @@ -1546,7 +1548,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	 * to look up our filesystem block.
>  	 */
>  	pos = (loff_t)xas.xa_index << PAGE_SHIFT;
> -	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
> +	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap, &srcmap);

/me wonders aloud if he ought to add a helper function to standardize at
least some of the validation of the iomap that gets returned from
->iomap_begin invocations...

>  	if (error)
>  		goto unlock_entry;
>  

<snip>

> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 54c02aecf3cd..6cdb362fff36 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -24,6 +24,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  		const struct iomap_ops *ops, void *data, iomap_actor_t actor)
>  {
>  	struct iomap iomap = { 0 };
> +	struct iomap srcmap = { 0 };
>  	loff_t written = 0, ret;
>  
>  	/*
> @@ -38,7 +39,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  	 * expose transient stale data. If the reserve fails, we can safely
>  	 * back out at this point as there is nothing to undo.
>  	 */
> -	ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
> +	ret = ops->iomap_begin(inode, pos, length, flags, &iomap, &srcmap);
>  	if (ret)
>  		return ret;
>  	if (WARN_ON(iomap.offset > pos))

...because I wonder if we ought to have a debugging assert here just in
case an ->iomap_begin returns IOMAP_COW in response to an IOMAP_WRITE
request?  Basic sanity checks to catch accidental API misuse, etc.

Eh, we probably ought to have a CONFIG_IOMAP_DEBUG so that non-developers
don't necessarily have to pay the assert costs or something like that.
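
As a sketch of the kind of check being proposed (constants mocked up
here; the real policy and messages would live behind the suggested
CONFIG_IOMAP_DEBUG):

```c
#include <assert.h>
#include <stdbool.h>

/* Mapping types as in the patch (IOMAP_COW is added by this series). */
#define IOMAP_HOLE	0x01
#define IOMAP_COW	0x06

/* Request flag; the value here is illustrative only. */
#define IOMAP_WRITE	(1 << 0)

/*
 * Hypothetical sanity check on what ->iomap_begin returned: IOMAP_COW
 * asks the caller to copy source data before writing, which only makes
 * sense when the request actually is a write.
 */
static bool iomap_begin_result_sane(unsigned flags, int type)
{
	if (type == IOMAP_COW && !(flags & IOMAP_WRITE))
		return false;
	return true;
}
```

iomap_apply() could WARN_ON() when this returns false, catching API
misuse at the single choke point every actor goes through.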

> @@ -58,7 +59,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  	 * we can do the copy-in page by page without having to worry about
>  	 * failures exposing transient data.
>  	 */
> -	written = actor(inode, pos, length, data, &iomap);
> +	written = actor(inode, pos, length, data, &iomap, &srcmap);
>  
>  	/*
>  	 * Now the data has been copied, commit the range we've copied.  This
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e25901ae3ff4..f27756c0b31c 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -205,7 +205,7 @@ iomap_read_inline_data(struct inode *inode, struct page *page,
>  
>  static loff_t
>  iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap)
> +		struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct iomap_readpage_ctx *ctx = data;
>  	struct page *page = ctx->cur_page;
> @@ -351,7 +351,7 @@ iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
>  
>  static loff_t
>  iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> -		void *data, struct iomap *iomap)
> +		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct iomap_readpage_ctx *ctx = data;
>  	loff_t done, ret;
> @@ -371,7 +371,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  			ctx->cur_page_in_bio = false;
>  		}
>  		ret = iomap_readpage_actor(inode, pos + done, length - done,
> -				ctx, iomap);
> +				ctx, iomap, srcmap);
>  	}
>  
>  	return done;
> @@ -736,7 +736,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
>  
>  static loff_t
>  iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap)
> +		struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct iov_iter *i = data;
>  	long status = 0;
> @@ -853,7 +853,7 @@ __iomap_read_page(struct inode *inode, loff_t offset)
>  
>  static loff_t
>  iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap)
> +		struct iomap *iomap, struct iomap *srcmap)
>  {
>  	long status = 0;
>  	ssize_t written = 0;
> @@ -942,7 +942,7 @@ static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
>  
>  static loff_t
>  iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
> -		void *data, struct iomap *iomap)
> +		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	bool *did_zero = data;
>  	loff_t written = 0;
> @@ -1011,7 +1011,7 @@ EXPORT_SYMBOL_GPL(iomap_truncate_page);
>  
>  static loff_t
>  iomap_page_mkwrite_actor(struct inode *inode, loff_t pos, loff_t length,
> -		void *data, struct iomap *iomap)
> +		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct page *page = data;
>  	int ret;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 10517cea9682..5279029c7a3c 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -362,7 +362,7 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>  static loff_t
>  iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
> -		void *data, struct iomap *iomap)
> +		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct iomap_dio *dio = data;
>  
> diff --git a/fs/iomap/fiemap.c b/fs/iomap/fiemap.c
> index f26fdd36e383..690ef2d7c6c8 100644
> --- a/fs/iomap/fiemap.c
> +++ b/fs/iomap/fiemap.c
> @@ -44,7 +44,7 @@ static int iomap_to_fiemap(struct fiemap_extent_info *fi,
>  
>  static loff_t
>  iomap_fiemap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap)
> +		struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct fiemap_ctx *ctx = data;
>  	loff_t ret = length;
> @@ -111,7 +111,7 @@ EXPORT_SYMBOL_GPL(iomap_fiemap);
>  
>  static loff_t
>  iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
> -		void *data, struct iomap *iomap)
> +		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	sector_t *bno = data, addr;
>  
> diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
> index c04bad4b2b43..89f61d93c0bc 100644
> --- a/fs/iomap/seek.c
> +++ b/fs/iomap/seek.c
> @@ -119,7 +119,7 @@ page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
>  
>  static loff_t
>  iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length,
> -		      void *data, struct iomap *iomap)
> +		      void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	switch (iomap->type) {
>  	case IOMAP_UNWRITTEN:
> @@ -165,7 +165,7 @@ EXPORT_SYMBOL_GPL(iomap_seek_hole);
>  
>  static loff_t
>  iomap_seek_data_actor(struct inode *inode, loff_t offset, loff_t length,
> -		      void *data, struct iomap *iomap)
> +		      void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	switch (iomap->type) {
>  	case IOMAP_HOLE:
> diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
> index 152a230f668d..a648dbf6991e 100644
> --- a/fs/iomap/swapfile.c
> +++ b/fs/iomap/swapfile.c
> @@ -76,7 +76,8 @@ static int iomap_swapfile_add_extent(struct iomap_swapfile_info *isi)
>   * distinction between written and unwritten extents.
>   */
>  static loff_t iomap_swapfile_activate_actor(struct inode *inode, loff_t pos,
> -		loff_t count, void *data, struct iomap *iomap)
> +		loff_t count, void *data, struct iomap *iomap,
> +		struct iomap *srcmap)

The switch(iomap->type) probably ought to have a separate printk for the
IOMAP_COW case so that we don't go complaining about "unwritten" extents
in the swap file.
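
A sketch of that distinction, reduced to a reason-lookup helper (the
message strings here are invented for illustration; the kernel would
emit them via pr_err()):

```c
#include <stddef.h>

/* Type values as in the patch. */
#define IOMAP_UNWRITTEN	0x04
#define IOMAP_COW	0x06

/*
 * Hypothetical helper: report CoW extents in a would-be swapfile with
 * their own message instead of complaining about "unwritten" extents.
 * Returns NULL for types that are acceptable in a swapfile.
 */
static const char *swapfile_reject_reason(int type)
{
	switch (type) {
	case IOMAP_UNWRITTEN:
		return "swapfile has unwritten extents";
	case IOMAP_COW:
		return "swapfile has shared (CoW) extents";
	default:
		return NULL;
	}
}
```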

>  {
>  	struct iomap_swapfile_info *isi = data;
>  	int error;
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 3a4310d7cb59..8321733c16c3 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -922,7 +922,8 @@ xfs_file_iomap_begin(
>  	loff_t			offset,
>  	loff_t			length,
>  	unsigned		flags,
> -	struct iomap		*iomap)
> +	struct iomap		*iomap,
> +	struct iomap		*srcmap)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1145,7 +1146,8 @@ xfs_seek_iomap_begin(
>  	loff_t			offset,
>  	loff_t			length,
>  	unsigned		flags,
> -	struct iomap		*iomap)
> +	struct iomap		*iomap,
> +	struct iomap		*srcmap)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1231,7 +1233,8 @@ xfs_xattr_iomap_begin(
>  	loff_t			offset,
>  	loff_t			length,
>  	unsigned		flags,
> -	struct iomap		*iomap)
> +	struct iomap		*iomap,
> +	struct iomap		*srcmap)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;

XFS part looks ok... I guess I'll get to Shiyuan's series next.

> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index bc499ceae392..5b2055e8ca8a 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -26,6 +26,7 @@ struct vm_fault;
>  #define IOMAP_MAPPED	0x03	/* blocks allocated at @addr */
>  #define IOMAP_UNWRITTEN	0x04	/* blocks allocated at @addr in unwritten state */
>  #define IOMAP_INLINE	0x05	/* data inline in the inode */
> +#define IOMAP_COW	0x06	/* copy data from srcmap before writing */

Hm, ok, at least the comment references that this is only for writes.
Looks good!

--D

>  
>  /*
>   * Flags for all iomap mappings:
> @@ -110,7 +111,8 @@ struct iomap_ops {
>  	 * The actual length is returned in iomap->length.
>  	 */
>  	int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
> -			unsigned flags, struct iomap *iomap);
> +			unsigned flags, struct iomap *iomap,
> +			struct iomap *srcmap);
>  
>  	/*
>  	 * Commit and/or unreserve space previous allocated using iomap_begin.
> @@ -126,7 +128,7 @@ struct iomap_ops {
>   * Main iomap iterator function.
>   */
>  typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
> -		void *data, struct iomap *iomap);
> +		void *data, struct iomap *iomap, struct iomap *srcmap);
>  
>  loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
>  		unsigned flags, const struct iomap_ops *ops, void *data,
> -- 
> 2.16.4
> 

* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-02 22:00 ` [PATCH 10/13] iomap: use a function pointer for dio submits Goldwyn Rodrigues
  2019-08-03  0:21   ` Darrick J. Wong
@ 2019-08-04 23:43   ` Dave Chinner
  2019-08-05 16:08     ` Goldwyn Rodrigues
  1 sibling, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2019-08-04 23:43 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-btrfs, hch, darrick.wong, ruansy.fnst,
	Goldwyn Rodrigues

On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> This helps filesystems to perform tasks on the bio while
> submitting for I/O. Since btrfs requires the position
> we are working on, pass pos to iomap_dio_submit_bio()
> 
> The correct place for submit_io() is not page_ops. Would it
> better to rename the structure to something like iomap_io_ops
> or put it directly under struct iomap?
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/iomap/direct-io.c  | 16 +++++++++++-----
>  include/linux/iomap.h |  1 +
>  2 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 5279029c7a3c..a802e66bf11f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
>  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
>  
>  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
> -		struct bio *bio)
> +		struct bio *bio, loff_t pos)
>  {
>  	atomic_inc(&dio->ref);
>  
> @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
>  		bio_set_polled(bio, dio->iocb);
>  
>  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> -	dio->submit.cookie = submit_bio(bio);
> +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> +		iomap->page_ops->submit_io(bio, file_inode(dio->iocb->ki_filp),
> +				pos);
> +		dio->submit.cookie = BLK_QC_T_NONE;
> +	} else {
> +		dio->submit.cookie = submit_bio(bio);
> +	}

I don't really like this at all. Apart from the fact it doesn't work
with block device polling (RWF_HIPRI), the iomap architecture is
supposed to resolve the file offset -> block device + LBA mapping
completely up front and so all that remains to be done is build and
submit the bio(s) to the block device.

What I see here is a hack to work around the fact that btrfs has
implemented both file data transformations and device mapping layer
functionality as a filesystem layer between file data bio building
and device bio submission. And as the btrfs file data mapping
(->iomap_begin) is completely unaware that there is further block
mapping to be done before block device bio submission, any generic
code that btrfs uses requires special IO submission hooks rather
than just calling submit_bio().

I'm not 100% sure what the solution here is, but the one thing we
must resist is turning the iomap code into a mess of custom hooks
that only one filesystem uses. We've been taught this lesson time
and time again - the iomap infrastructure exists because stuff like
bufferheads and the old direct IO code ended up so full of special
case code that it ossified and became unmodifiable and
unmaintainable.

We do not want to go down that path again. 

IMO, the iomap IO model needs to be restructured to support post-IO
and pre-IO data verification/calculation/transformation operations
so all the work that needs to be done at the inode/offset context
level can be done in the iomap path before bio submission/after
bio completion. This will allow infrastructure like fscrypt, data
compression, data checksums, etc to be suported generically, not
just by individual filesystems that provide a ->submit_io hook.

As for the btrfs needing to slice and dice bios for multiple
devices?  That should be done via a block device ->make_request
function, not a custom hook in the iomap code.

That's why I don't like this hook - I think hiding data operations
and/or custom bio manipulations in opaque filesystem callouts is
completely the wrong approach to be taking. We need to do these
things in a generic manner so that all filesystems (and block
devices!) that use the iomap infrastructure can take advantage of
them, not just one of them.

Quite frankly, I don't care if it takes more time and work up front,
I'm tired of expedient hacks to merge code quickly repeatedly biting
us on the arse and wasting far more time sorting out than we would
have spent getting it right in the first place.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 02/13] iomap: Read page from srcmap for IOMAP_COW
  2019-08-02 22:00 ` [PATCH 02/13] iomap: Read page from srcmap for IOMAP_COW Goldwyn Rodrigues
  2019-08-03  0:23   ` Darrick J. Wong
@ 2019-08-04 23:52   ` Dave Chinner
  1 sibling, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2019-08-04 23:52 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-btrfs, hch, darrick.wong, ruansy.fnst,
	Goldwyn Rodrigues

On Fri, Aug 02, 2019 at 05:00:37PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> In case of an IOMAP_COW, read a page from the srcmap before
> performing a write on the page.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/iomap/buffered-io.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index f27756c0b31c..a96cc26eec92 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -581,7 +581,7 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
>  
>  static int
>  iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
> -		struct page **pagep, struct iomap *iomap)
> +		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	const struct iomap_page_ops *page_ops = iomap->page_ops;
>  	pgoff_t index = pos >> PAGE_SHIFT;
> @@ -607,6 +607,8 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  
>  	if (iomap->type == IOMAP_INLINE)
>  		iomap_read_inline_data(inode, page, iomap);
> +	else if (iomap->type == IOMAP_COW)
> +		status = __iomap_write_begin(inode, pos, len, page, srcmap);
>  	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>  		status = __block_write_begin_int(page, pos, len, NULL, iomap);
>  	else

This looks busted w.r.t. IOMAP_F_BUFFER_HEAD.  i.e. What's to stop
someone returning iomap->type == IOMAP_COW, iomap->flags &
IOMAP_F_BUFFER_HEAD?

In which case we can't call __iomap_write_begin(), right?

I'm with Darrick on CONFIG_IOMAP_DEBUG here - we need to start
locking down invalid behaviour and invalid combinations with asserts
that tell developers they've broken something.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 01/13] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O
  2019-08-02 22:00 ` [PATCH 01/13] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O Goldwyn Rodrigues
  2019-08-03  0:39   ` Darrick J. Wong
@ 2019-08-05  0:06   ` Dave Chinner
  1 sibling, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2019-08-05  0:06 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-btrfs, hch, darrick.wong, ruansy.fnst,
	Goldwyn Rodrigues

On Fri, Aug 02, 2019 at 05:00:36PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Introduces a new type IOMAP_COW, which means the data at offset
> must be read from a srcmap and copied before performing the
> write on the offset.
> 
> The srcmap is used to identify where the read is to be performed
> from. This is passed to iomap->begin() of the respective
> filesystem, which is supposed to put in the details for
> reading before performing the copy for CoW.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/dax.c               |  8 +++++---
>  fs/ext2/inode.c        |  2 +-
>  fs/ext4/inode.c        |  2 +-
>  fs/gfs2/bmap.c         |  3 ++-
>  fs/iomap/apply.c       |  5 +++--
>  fs/iomap/buffered-io.c | 14 +++++++-------
>  fs/iomap/direct-io.c   |  2 +-
>  fs/iomap/fiemap.c      |  4 ++--
>  fs/iomap/seek.c        |  4 ++--
>  fs/iomap/swapfile.c    |  3 ++-
>  fs/xfs/xfs_iomap.c     |  9 ++++++---
>  include/linux/iomap.h  |  6 ++++--
>  12 files changed, 36 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index a237141d8787..b21d9a9cde2b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1090,7 +1090,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> -		struct iomap *iomap)
> +		struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct block_device *bdev = iomap->bdev;
>  	struct dax_device *dax_dev = iomap->dax_dev;
> @@ -1248,6 +1248,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	unsigned long vaddr = vmf->address;
>  	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
>  	struct iomap iomap = { 0 };
> +	struct iomap srcmap = { 0 };

I'm not fond of defining multiple iomaps everywhere and passing
them explicitly to every function that might need them.  Perhaps
something like:

	DEFINE_IOMAP(iomap);

#define IOMAP_BASE_MAP		0
#define IOMAP_SOURCE_MAP	1
#define IOMAP_MAX_MAPS		2

#define DEFINE_IOMAP(name)	\
	struct iomap name[IOMAP_MAX_MAPS] = {{0}}

#define IOMAP_B(name)		((name)[IOMAP_BASE_MAP])
#define IOMAP_S(name)		((name)[IOMAP_SOURCE_MAP])

And now we only have to pass a single iomap parameter to each
function as "struct iomap **iomap". This makes the code somewhat
simpler, and we only ever need to use IOMAP_S(iomap) when
IOMAP_B(iomap)->type == IOMAP_COW.

The other advantage of this is that if we even need new
functionality that requires 2 (or more) iomaps, we don't have to
change APIs again....
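
A self-contained userspace reduction of this proposal (macro syntax
adjusted slightly so it compiles; struct iomap trimmed to two fields):

```c
/* Reduced struct iomap; the kernel one has many more fields. */
struct iomap { int type; long offset; };

#define IOMAP_BASE_MAP		0
#define IOMAP_SOURCE_MAP	1
#define IOMAP_MAX_MAPS		2

#define DEFINE_IOMAP(name) \
	struct iomap name[IOMAP_MAX_MAPS] = {{0}}

#define IOMAP_B(name)		((name)[IOMAP_BASE_MAP])
#define IOMAP_S(name)		((name)[IOMAP_SOURCE_MAP])

#define IOMAP_COW	0x06	/* value as in the patch */

/* A callee receives a single pointer and can reach both maps through
 * it; IOMAP_S() is only meaningful when the base map is IOMAP_COW. */
static int needs_srcmap(struct iomap *maps)
{
	return IOMAP_B(maps).type == IOMAP_COW;
}
```

Because the array decays to a pointer, existing call chains only need
their parameter type changed once, and a future third map slots in
without touching any signatures.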

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 04/13] btrfs: Add a simple buffered iomap write
  2019-08-02 22:00 ` [PATCH 04/13] btrfs: Add a simple buffered iomap write Goldwyn Rodrigues
@ 2019-08-05  0:11   ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2019-08-05  0:11 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-btrfs, hch, darrick.wong, ruansy.fnst,
	Goldwyn Rodrigues

On Fri, Aug 02, 2019 at 05:00:39PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Introduce a new btrfs_iomap structure which contains information
> about the filesystem between the iomap_begin() and iomap_end() calls.
> This contains information about reservations and extent locking.
> 
> This one is a long patch. Most of the code is "inspired" by
> fs/btrfs/file.c. To keep the size small, all removals are in
> following patches.

I can't comment on the btrfs code but this:

> +size_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	ssize_t written;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	written = iomap_file_buffered_write(iocb, from, &btrfs_buffered_iomap_ops);
> +	if (written > 0)
> +		iocb->ki_pos += written;
> +	if (iocb->ki_pos > i_size_read(inode))
> +		i_size_write(inode, iocb->ki_pos);
> +	return written;

Looks like it fails to handle O_[D]SYNC writes.
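
For comparison, mainline write paths usually finish by calling
generic_write_sync(), which issues the flush only when the iocb carries
the sync flag. A userspace mock of that control flow (the struct, flag,
and "synced" marker below are stand-ins for the kernel's):

```c
/* Minimal stand-ins for the kernel types involved. */
#define IOCB_DSYNC	(1 << 0)

struct kiocb { int ki_flags; int synced; };

/* Mock of generic_write_sync(): flush data for O_[D]SYNC writes. */
static long generic_write_sync_mock(struct kiocb *iocb, long written)
{
	if (written > 0 && (iocb->ki_flags & IOCB_DSYNC))
		iocb->synced = 1;	/* stands in for vfs_fsync_range() */
	return written;
}

/* Sketch of the missing tail of btrfs_buffered_iomap_write(). */
static long buffered_write_tail(struct kiocb *iocb, long written)
{
	if (written > 0)
		written = generic_write_sync_mock(iocb, written);
	return written;
}
```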

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 05/13] btrfs: Add CoW in iomap based writes
  2019-08-02 22:00 ` [PATCH 05/13] btrfs: Add CoW in iomap based writes Goldwyn Rodrigues
@ 2019-08-05  0:13   ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2019-08-05  0:13 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-btrfs, hch, darrick.wong, ruansy.fnst,
	Goldwyn Rodrigues

On Fri, Aug 02, 2019 at 05:00:40PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Set iomap->type to IOMAP_COW and fill up the source map in case
> the I/O is not page aligned.
.....
>  static void btrfs_buffered_page_done(struct inode *inode, loff_t pos,
>  		unsigned copied, struct page *page,
>  		struct iomap *iomap)
> @@ -188,6 +217,7 @@ static int btrfs_buffered_iomap_begin(struct inode *inode, loff_t pos,
>  	int ret;
>  	size_t write_bytes = length;
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	size_t end;
>  	size_t sector_offset = pos & (fs_info->sectorsize - 1);
>  	struct btrfs_iomap *bi;
>  
> @@ -255,6 +285,17 @@ static int btrfs_buffered_iomap_begin(struct inode *inode, loff_t pos,
>  	iomap->private = bi;
>  	iomap->length = round_up(write_bytes, fs_info->sectorsize);
>  	iomap->offset = round_down(pos, fs_info->sectorsize);
> +	end = pos + write_bytes;
> +	/* Set IOMAP_COW if start/end is not page aligned */
> +	if (((pos & (PAGE_SIZE - 1)) || (end & (PAGE_SIZE - 1)))) {
> +		iomap->type = IOMAP_COW;
> +		ret = get_iomap(inode, pos, length, srcmap);
> +		if (ret < 0)
> +			goto release;

I suspect you didn't test this case, because....

> +	} else {
> +		iomap->type = IOMAP_DELALLOC;
> +	}
> +
>  	iomap->addr = IOMAP_NULL_ADDR;
>  	iomap->type = IOMAP_DELALLOC;

The iomap->type is overwritten here and so IOMAP_COW will never be
seen by the iomap infrastructure...
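
One way to avoid the clobber would be to assign the IOMAP_DELALLOC
default before the alignment check rather than after it. A userspace
reduction of the corrected ordering (constants mocked; error handling
omitted):

```c
#define PAGE_SIZE	4096UL
#define IOMAP_DELALLOC	0x02
#define IOMAP_COW	0x06

struct iomap { int type; };

/* Corrected ordering: set the default first, then override for CoW,
 * so the default assignment can no longer clobber IOMAP_COW. */
static void set_iomap_type(struct iomap *iomap, unsigned long pos,
			   unsigned long write_bytes)
{
	unsigned long end = pos + write_bytes;

	iomap->type = IOMAP_DELALLOC;
	if ((pos & (PAGE_SIZE - 1)) || (end & (PAGE_SIZE - 1)))
		iomap->type = IOMAP_COW;
}
```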

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-03  0:21   ` Darrick J. Wong
@ 2019-08-05 16:08     ` Goldwyn Rodrigues
  0 siblings, 0 replies; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-05 16:08 UTC (permalink / raw)
  To: darrick.wong; +Cc: hch, linux-btrfs, ruansy.fnst, linux-fsdevel

On Fri, 2019-08-02 at 17:21 -0700,  Darrick J. Wong  wrote:
> On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > This helps filesystems to perform tasks on the bio while
> > submitting for I/O. Since btrfs requires the position
> > we are working on, pass pos to iomap_dio_submit_bio()
> 
> What /does/ btrfs_submit_direct do, anyway?  Looks like it's a custom
> submission function that ... does something related to setting
> checksums?  And, uh, RAID?

Yes and yes.

> 
> > The correct place for submit_io() is not page_ops. Would it
> > better to rename the structure to something like iomap_io_ops
> > or put it directly under struct iomap?
> 
> Seeing as the ->iomap_begin handler knows if the requested op is a
> buffered write or a direct write, what if we just declare a union of
> ops?
> 
> e.g.
> 
> struct iomap_page_ops;
> struct iomap_directio_ops;
> 
> struct iomap {
> 	<usual stuff>
> 	union {
> 		const struct iomap_page_ops *page_ops;
> 		const struct iomap_directio_ops *directio_ops;
> 	};
> };

Yes, that looks good. Thanks. I will incorporate it.

> 
> --D
> 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/iomap/direct-io.c  | 16 +++++++++++-----
> >  include/linux/iomap.h |  1 +
> >  2 files changed, 12 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 5279029c7a3c..a802e66bf11f 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool
> > spin)
> >  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
> >  
> >  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct
> > iomap *iomap,
> > -		struct bio *bio)
> > +		struct bio *bio, loff_t pos)
> >  {
> >  	atomic_inc(&dio->ref);
> >  
> > @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct
> > iomap_dio *dio, struct iomap *iomap,
> >  		bio_set_polled(bio, dio->iocb);
> >  
> >  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> > -	dio->submit.cookie = submit_bio(bio);
> > +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> > +		iomap->page_ops->submit_io(bio, file_inode(dio-
> > >iocb->ki_filp),
> > +				pos);
> > +		dio->submit.cookie = BLK_QC_T_NONE;
> > +	} else {
> > +		dio->submit.cookie = submit_bio(bio);
> > +	}
> >  }
> >  
> >  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > @@ -195,7 +201,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct
> > iomap *iomap, loff_t pos,
> >  	get_page(page);
> >  	__bio_add_page(bio, page, len, 0);
> >  	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
> > -	iomap_dio_submit_bio(dio, iomap, bio);
> > +	iomap_dio_submit_bio(dio, iomap, bio, pos);
> >  }
> >  
> >  static loff_t
> > @@ -301,11 +307,11 @@ iomap_dio_bio_actor(struct inode *inode,
> > loff_t pos, loff_t length,
> >  		iov_iter_advance(dio->submit.iter, n);
> >  
> >  		dio->size += n;
> > -		pos += n;
> >  		copied += n;
> >  
> >  		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
> > -		iomap_dio_submit_bio(dio, iomap, bio);
> > +		iomap_dio_submit_bio(dio, iomap, bio, pos);
> > +		pos += n;
> >  	} while (nr_pages);
> >  
> >  	/*
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 5b2055e8ca8a..6617e4b6fb6d 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -92,6 +92,7 @@ struct iomap_page_ops {
> >  			struct iomap *iomap);
> >  	void (*page_done)(struct inode *inode, loff_t pos,
> > unsigned copied,
> >  			struct page *page, struct iomap *iomap);
> > +	dio_submit_t 		*submit_io;
> >  };
> >  
> >  /*
> > -- 
> > 2.16.4
> > 
> 
> 


* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-04 23:43   ` Dave Chinner
@ 2019-08-05 16:08     ` Goldwyn Rodrigues
  2019-08-05 21:54       ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Goldwyn Rodrigues @ 2019-08-05 16:08 UTC (permalink / raw)
  To: david; +Cc: hch, darrick.wong, linux-btrfs, ruansy.fnst, linux-fsdevel

On Mon, 2019-08-05 at 09:43 +1000, Dave Chinner wrote:
> On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > This helps filesystems to perform tasks on the bio while
> > submitting for I/O. Since btrfs requires the position
> > we are working on, pass pos to iomap_dio_submit_bio()
> > 
> > The correct place for submit_io() is not page_ops. Would it
> > better to rename the structure to something like iomap_io_ops
> > or put it directly under struct iomap?
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/iomap/direct-io.c  | 16 +++++++++++-----
> >  include/linux/iomap.h |  1 +
> >  2 files changed, 12 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 5279029c7a3c..a802e66bf11f 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool
> > spin)
> >  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
> >  
> >  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct
> > iomap *iomap,
> > -		struct bio *bio)
> > +		struct bio *bio, loff_t pos)
> >  {
> >  	atomic_inc(&dio->ref);
> >  
> > @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct
> > iomap_dio *dio, struct iomap *iomap,
> >  		bio_set_polled(bio, dio->iocb);
> >  
> >  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> > -	dio->submit.cookie = submit_bio(bio);
> > +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> > +		iomap->page_ops->submit_io(bio, file_inode(dio-
> > >iocb->ki_filp),
> > +				pos);
> > +		dio->submit.cookie = BLK_QC_T_NONE;
> > +	} else {
> > +		dio->submit.cookie = submit_bio(bio);
> > +	}
> 
> I don't really like this at all. Apart from the fact it doesn't work
> with block device polling (RWF_HIPRI), the iomap architecture is

That can be added, no? It should be relayed when we clone the bio.

> supposed to resolve the file offset -> block device + LBA mapping
> completely up front and so all that remains to be done is build and
> submit the bio(s) to the block device.
> 
> What I see here is a hack to work around the fact that btrfs has
> implemented both file data transformations and device mapping layer
> functionality as a filesystem layer between file data bio building
> and device bio submission. And as the btrfs file data mapping
> (->iomap_begin) is completely unaware that there is further block
> mapping to be done before block device bio submission, any generic
> code that btrfs uses requires special IO submission hooks rather
> than just calling submit_bio().
> 
> I'm not 100% sure what the solution here is, but the one thing we
> must resist is turning the iomap code into a mess of custom hooks
> that only one filesystem uses. We've been taught this lesson time
> and time again - the iomap infrastructure exists because stuff like
> bufferheads and the old direct IO code ended up so full of special
> case code that it ossified and became unmodifiable and
> unmaintainable.
> 
> We do not want to go down that path again. 
> 
> IMO, the iomap IO model needs to be restructured to support post-IO
> and pre-IO data verification/calculation/transformation operations
> so all the work that needs to be done at the inode/offset context
> level can be done in the iomap path before bio submission/after
> bio completion. This will allow infrastructure like fscrypt, data
> compression, data checksums, etc to be suported generically, not
> just by individual filesystems that provide a ->submit_io hook.
> 
> As for the btrfs needing to slice and dice bios for multiple
> devices?  That should be done via a block device ->make_request
> function, not a custom hook in the iomap code.

btrfs differentiates how metadata and data are
handled/replicated/stored. We would still need an entry point in the
iomap code to handle the I/O submission.

> 
> That's why I don't like this hook - I think hiding data operations
> and/or custom bio manipulations in opaque filesystem callouts is
> completely the wrong approach to be taking. We need to do these
> things in a generic manner so that all filesystems (and block
> devices!) that use the iomap infrastructure can take advantage of
> them, not just one of them.
> 
> Quite frankly, I don't care if it takes more time and work up front,
> I'm tired of expedient hacks to merge code quickly repeatedly biting
> us on the arse and wasting far more time sorting out than we would
> have spent getting it right in the first place.

Sure. I am open to ideas. What are you proposing?

-- 
Goldwyn


* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-05 16:08     ` Goldwyn Rodrigues
@ 2019-08-05 21:54       ` Dave Chinner
  2019-08-08  4:26         ` Gao Xiang
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2019-08-05 21:54 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: hch, darrick.wong, linux-btrfs, ruansy.fnst, linux-fsdevel

On Mon, Aug 05, 2019 at 04:08:43PM +0000, Goldwyn Rodrigues wrote:
> On Mon, 2019-08-05 at 09:43 +1000, Dave Chinner wrote:
> > On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > 
> > > This helps filesystems to perform tasks on the bio while
> > > submitting for I/O. Since btrfs requires the position
> > > we are working on, pass pos to iomap_dio_submit_bio()
> > > 
> > > The correct place for submit_io() is not page_ops. Would it
> > > better to rename the structure to something like iomap_io_ops
> > > or put it directly under struct iomap?
> > > 
> > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > ---
> > >  fs/iomap/direct-io.c  | 16 +++++++++++-----
> > >  include/linux/iomap.h |  1 +
> > >  2 files changed, 12 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index 5279029c7a3c..a802e66bf11f 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool
> > > spin)
> > >  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
> > >  
> > >  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct
> > > iomap *iomap,
> > > -		struct bio *bio)
> > > +		struct bio *bio, loff_t pos)
> > >  {
> > >  	atomic_inc(&dio->ref);
> > >  
> > > @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct
> > > iomap_dio *dio, struct iomap *iomap,
> > >  		bio_set_polled(bio, dio->iocb);
> > >  
> > >  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> > > -	dio->submit.cookie = submit_bio(bio);
> > > +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> > > +		iomap->page_ops->submit_io(bio, file_inode(dio-
> > > >iocb->ki_filp),
> > > +				pos);
> > > +		dio->submit.cookie = BLK_QC_T_NONE;
> > > +	} else {
> > > +		dio->submit.cookie = submit_bio(bio);
> > > +	}
> > 
> > I don't really like this at all. Apart from the fact it doesn't work
> > with block device polling (RWF_HIPRI), the iomap architecture is
> 
> That can be added, no? Should be relayed when we clone the bio.

No idea how that all is supposed to work when you split a single bio
into multiple bios. I'm pretty sure the iomap code is broken for
that case, too -  Jens was silent on how to fix it other than to say
"it wasn't important so we didn't care to make sure it worked". So
it's not clear to me exactly how block polling is supposed to work
when an IO needs to be split into multiple submissions...

> > supposed to resolve the file offset -> block device + LBA mapping
> > completely up front and so all that remains to be done is build and
> > submit the bio(s) to the block device.
> > 
> > What I see here is a hack to work around the fact that btrfs has
> > implemented both file data transformations and device mapping layer
> > functionality as a filesystem layer between file data bio building
> > and device bio submission. And as the btrfs file data mapping
> > (->iomap_begin) is completely unaware that there is further block
> > mapping to be done before block device bio submission, any generic
> > code that btrfs uses requires special IO submission hooks rather
> > than just calling submit_bio().
> > 
> > I'm not 100% sure what the solution here is, but the one thing we
> > must resist is turning the iomap code into a mess of custom hooks
> > that only one filesystem uses. We've been taught this lesson time
> > and time again - the iomap infrastructure exists because stuff like
> > bufferheads and the old direct IO code ended up so full of special
> > case code that it ossified and became unmodifiable and
> > unmaintainable.
> > 
> > We do not want to go down that path again. 
> > 
> > IMO, the iomap IO model needs to be restructured to support post-IO
> > and pre-IO data verification/calculation/transformation operations
> > so all the work that needs to be done at the inode/offset context
> > level can be done in the iomap path before bio submission/after
> > bio completion. This will allow infrastructure like fscrypt, data
> > compression, data checksums, etc to be suported generically, not
> > just by individual filesystems that provide a ->submit_io hook.
> > 
> > As for the btrfs needing to slice and dice bios for multiple
> > devices?  That should be done via a block device ->make_request
> > function, not a custom hook in the iomap code.
> 
> btrfs differentiates the way how metadata and data is
> handled/replicated/stored. We would still need an entry point in the
> iomap code to handle the I/O submission.

This is a data IO path. How metadata is stored/replicated is
irrelevant to this code path...

> > That's why I don't like this hook - I think hiding data operations
> > and/or custom bio manipulations in opaque filesystem callouts is
> > completely the wrong approach to be taking. We need to do these
> > things in a generic manner so that all filesystems (and block
> > devices!) that use the iomap infrastructure can take advantage of
> > them, not just one of them.
> > 
> > Quite frankly, I don't care if it takes more time and work up front,
> > I'm tired of expedient hacks to merge code quickly repeatedly biting
> > us on the arse and wasting far more time sorting out than we would
> > have spent getting it right in the first place.
> 
> Sure. I am open to ideas. What are you proposing?

That you think about how to normalise the btrfs IO path to fit into
the standard iomap/blockdev model, rather than adding special hacks
to iomap to allow an opaque, custom, IO model to be shoe-horned into
the generic code.

For example, post-read validation requires end-io processing,
whether it be encryption, decompression, CRC/T10 validation, etc. The
iomap end-io completion has all the information needed to run these
things, whether it be a callout to the filesystem for custom
processing checking, or a generic "decrypt into supplied data page"
sort of thing. These all need to be done in the same place, so we
should have common support for this. And I suspect the iomap should
also state in a flag that something like this is necessary (e.g.
IOMAP_FL_ENCRYPTED indicates post-IO decryption needs to be run).

Similarly, on the IO submit side we have need for a pre-IO
processing hook. That can be used to encrypt, compress, calculate
data CRCs, do pre-IO COW processing (XFS requires a hook for this),
etc.

These hooks are needed for both buffered and direct IO, and they
are needed for more filesystems than just btrfs. fscrypt will need
them, XFS needs them, etc. So rather than hide data CRCs,
compression, and encryption deep inside the btrfs code, pull it up
into common layers that are called by the generic code. This will
leave just the things like mirroring, raid, IO retries, etc
below the iomap code, and that's all stuff that can be done behind a
->make_request function that is passed a bio...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-05 21:54       ` Dave Chinner
@ 2019-08-08  4:26         ` Gao Xiang
  2019-08-08  4:52           ` Gao Xiang
  2019-08-08  5:49           ` Eric Biggers
  0 siblings, 2 replies; 42+ messages in thread
From: Gao Xiang @ 2019-08-08  4:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Goldwyn Rodrigues, hch, darrick.wong, linux-btrfs, ruansy.fnst,
	linux-fsdevel, linux-erofs, miaoxie

On Tue, Aug 06, 2019 at 07:54:58AM +1000, Dave Chinner wrote:
> On Mon, Aug 05, 2019 at 04:08:43PM +0000, Goldwyn Rodrigues wrote:
> > On Mon, 2019-08-05 at 09:43 +1000, Dave Chinner wrote:
> > > On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > 
> > > > This helps filesystems to perform tasks on the bio while
> > > > submitting for I/O. Since btrfs requires the position
> > > > we are working on, pass pos to iomap_dio_submit_bio()
> > > > 
> > > > The correct place for submit_io() is not page_ops. Would it
> > > > better to rename the structure to something like iomap_io_ops
> > > > or put it directly under struct iomap?
> > > > 
> > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > ---
> > > >  fs/iomap/direct-io.c  | 16 +++++++++++-----
> > > >  include/linux/iomap.h |  1 +
> > > >  2 files changed, 12 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > index 5279029c7a3c..a802e66bf11f 100644
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool
> > > > spin)
> > > >  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
> > > >  
> > > >  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct
> > > > iomap *iomap,
> > > > -		struct bio *bio)
> > > > +		struct bio *bio, loff_t pos)
> > > >  {
> > > >  	atomic_inc(&dio->ref);
> > > >  
> > > > @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct
> > > > iomap_dio *dio, struct iomap *iomap,
> > > >  		bio_set_polled(bio, dio->iocb);
> > > >  
> > > >  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> > > > -	dio->submit.cookie = submit_bio(bio);
> > > > +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> > > > +		iomap->page_ops->submit_io(bio, file_inode(dio-
> > > > >iocb->ki_filp),
> > > > +				pos);
> > > > +		dio->submit.cookie = BLK_QC_T_NONE;
> > > > +	} else {
> > > > +		dio->submit.cookie = submit_bio(bio);
> > > > +	}
> > > 
> > > I don't really like this at all. Apart from the fact it doesn't work
> > > with block device polling (RWF_HIPRI), the iomap architecture is
> > 
> > That can be added, no? Should be relayed when we clone the bio.
> 
> No idea how that all is supposed to work when you split a single bio
> into multiple bios. I'm pretty sure the iomap code is broken for
> that case, too -  Jens was silent on how to fix other than to say
> "it wasn't important so we didn't care to make sure it worked". So
> it's not clear to me exactly how block polling is supposed to work
> when a an IO needs to be split into multiple submissions...
> 
> > > supposed to resolve the file offset -> block device + LBA mapping
> > > completely up front and so all that remains to be done is build and
> > > submit the bio(s) to the block device.
> > > 
> > > What I see here is a hack to work around the fact that btrfs has
> > > implemented both file data transformations and device mapping layer
> > > functionality as a filesystem layer between file data bio building
> > > and device bio submission. And as the btrfs file data mapping
> > > (->iomap_begin) is completely unaware that there is further block
> > > mapping to be done before block device bio submission, any generic
> > > code that btrfs uses requires special IO submission hooks rather
> > > than just calling submit_bio().
> > > 
> > > I'm not 100% sure what the solution here is, but the one thing we
> > > must resist is turning the iomap code into a mess of custom hooks
> > > that only one filesystem uses. We've been taught this lesson time
> > > and time again - the iomap infrastructure exists because stuff like
> > > bufferheads and the old direct IO code ended up so full of special
> > > case code that it ossified and became unmodifiable and
> > > unmaintainable.
> > > 
> > > We do not want to go down that path again. 
> > > 
> > > IMO, the iomap IO model needs to be restructured to support post-IO
> > > and pre-IO data verification/calculation/transformation operations
> > > so all the work that needs to be done at the inode/offset context
> > > level can be done in the iomap path before bio submission/after
> > > bio completion. This will allow infrastructure like fscrypt, data
> > > compression, data checksums, etc to be suported generically, not
> > > just by individual filesystems that provide a ->submit_io hook.
> > > 
> > > As for the btrfs needing to slice and dice bios for multiple
> > > devices?  That should be done via a block device ->make_request
> > > function, not a custom hook in the iomap code.
> > 
> > btrfs differentiates the way how metadata and data is
> > handled/replicated/stored. We would still need an entry point in the
> > iomap code to handle the I/O submission.
> 
> This is a data IO path. How metadata is stored/replicated is
> irrelevant to this code path...
> 
> > > That's why I don't like this hook - I think hiding data operations
> > > and/or custom bio manipulations in opaque filesystem callouts is
> > > completely the wrong approach to be taking. We need to do these
> > > things in a generic manner so that all filesystems (and block
> > > devices!) that use the iomap infrastructure can take advantage of
> > > them, not just one of them.
> > > 
> > > Quite frankly, I don't care if it takes more time and work up front,
> > > I'm tired of expedient hacks to merge code quickly repeatedly biting
> > > us on the arse and wasting far more time sorting out than we would
> > > have spent getting it right in the first place.
> > 
> > Sure. I am open to ideas. What are you proposing?
> 
> That you think about how to normalise the btrfs IO path to fit into
> the standard iomap/blockdev model, rather than adding special hacks
> to iomap to allow an opaque, custom, IO model to be shoe-horned into
> the generic code.
> 
> For example, post-read validation requires end-io processing,
> whether it be encryption, decompression, CRC/T10 validation, etc. The
> iomap end-io completion has all the information needed to run these
> things, whether it be a callout to the filesystem for custom
> processing checking, or a generic "decrypt into supplied data page"
> sort of thing. These all need to be done in the same place, so we
> should have common support for this. And I suspect the iomap should
> also state in a flag that something like this is necessary (e.g.
> IOMAP_FL_ENCRYPTED indicates post-IO decryption needs to be run).

To add a few words to this topic: I think introducing a generic, full
approach in IOMAP for encryption, decompression, and verification is hard
to make fit all filesystems, and seems unnecessary, especially when data
compression is involved.

Since decompression expands the data, the logical data size is not the
same as the physical data size:

1) IO submission is applied to the physical data, but decompression
   is eventually applied to the logical mapping.
   As for EROFS, it submits all physical pages with page->private
   pointing to a management structure which also maintains all the
   logical pages for later decompression, and a time-sharing approach
   is used to keep the L2P mapping array in those allocated pages themselves.

   In addition, IOMAP would also need to consider the fixed-sized
   output/input difference, which is filesystem specific, and I have no
   idea whether involving so much code for each requirement is really good for IOMAP;

2) The post-read processing order is another negotiable point.
   Although there is no benefit in selecting verity->decrypt over
   decrypt->verity, when compression is involved different orders
   could be selected by different filesystem users:

    1. decrypt->verity->decompress

    2. verity->decompress->decrypt

    3. decompress->decrypt->verity

   1. and 2. incur less computation since they process compressed
   data, and the security is good enough since the behavior of the
   decompression algorithm is deterministic; 3. incurs more computation.

All I want to say is that the post-processing is complicated, since we
have many choices when encryption, decompression, and verification are
all involved.

Maybe introducing a core subset into IOMAP is better for long-term
maintenance and performance, and we should consider it more carefully.

Thanks,
Gao Xiang

> 
> Similarly, on the IO submit side we have need for a pre-IO
> processing hook. That can be used to encrypt, compress, calculate
> data CRCs, do pre-IO COW processing (XFS requires a hook for this),
> etc.
> 
> These hooks are needed for for both buffered and direct IO, and they
> are needed for more filesystems than just btrfs. fscrypt will need
> them, XFS needs them, etc. So rather than hide data CRCs,
> compression, and encryption deep inside the btrfs code, pull it up
> into common layers that are called by the generic code. THis will
> leave with just the things like mirroring, raid, IO retries, etc
> below the iomap code, and that's all stuff that can be done behind a
> ->make_request function that is passed a bio...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-08  4:26         ` Gao Xiang
@ 2019-08-08  4:52           ` Gao Xiang
  2019-08-08  5:49           ` Eric Biggers
  1 sibling, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2019-08-08  4:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Goldwyn Rodrigues, hch, darrick.wong, linux-btrfs, ruansy.fnst,
	linux-fsdevel, linux-erofs, miaoxie

On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> On Tue, Aug 06, 2019 at 07:54:58AM +1000, Dave Chinner wrote:
> > On Mon, Aug 05, 2019 at 04:08:43PM +0000, Goldwyn Rodrigues wrote:
> > > On Mon, 2019-08-05 at 09:43 +1000, Dave Chinner wrote:
> > > > On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> > > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > > 
> > > > > This helps filesystems to perform tasks on the bio while
> > > > > submitting for I/O. Since btrfs requires the position
> > > > > we are working on, pass pos to iomap_dio_submit_bio()
> > > > > 
> > > > > The correct place for submit_io() is not page_ops. Would it
> > > > > better to rename the structure to something like iomap_io_ops
> > > > > or put it directly under struct iomap?
> > > > > 
> > > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > > ---
> > > > >  fs/iomap/direct-io.c  | 16 +++++++++++-----
> > > > >  include/linux/iomap.h |  1 +
> > > > >  2 files changed, 12 insertions(+), 5 deletions(-)
> > > > > 
> > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > > index 5279029c7a3c..a802e66bf11f 100644
> > > > > --- a/fs/iomap/direct-io.c
> > > > > +++ b/fs/iomap/direct-io.c
> > > > > @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool
> > > > > spin)
> > > > >  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
> > > > >  
> > > > >  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct
> > > > > iomap *iomap,
> > > > > -		struct bio *bio)
> > > > > +		struct bio *bio, loff_t pos)
> > > > >  {
> > > > >  	atomic_inc(&dio->ref);
> > > > >  
> > > > > @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct
> > > > > iomap_dio *dio, struct iomap *iomap,
> > > > >  		bio_set_polled(bio, dio->iocb);
> > > > >  
> > > > >  	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> > > > > -	dio->submit.cookie = submit_bio(bio);
> > > > > +	if (iomap->page_ops && iomap->page_ops->submit_io) {
> > > > > +		iomap->page_ops->submit_io(bio, file_inode(dio-
> > > > > >iocb->ki_filp),
> > > > > +				pos);
> > > > > +		dio->submit.cookie = BLK_QC_T_NONE;
> > > > > +	} else {
> > > > > +		dio->submit.cookie = submit_bio(bio);
> > > > > +	}
> > > > 
> > > > I don't really like this at all. Apart from the fact it doesn't work
> > > > with block device polling (RWF_HIPRI), the iomap architecture is
> > > 
> > > That can be added, no? Should be relayed when we clone the bio.
> > 
> > No idea how that all is supposed to work when you split a single bio
> > into multiple bios. I'm pretty sure the iomap code is broken for
> > that case, too -  Jens was silent on how to fix other than to say
> > "it wasn't important so we didn't care to make sure it worked". So
> > it's not clear to me exactly how block polling is supposed to work
> > when a an IO needs to be split into multiple submissions...
> > 
> > > > supposed to resolve the file offset -> block device + LBA mapping
> > > > completely up front and so all that remains to be done is build and
> > > > submit the bio(s) to the block device.
> > > > 
> > > > What I see here is a hack to work around the fact that btrfs has
> > > > implemented both file data transformations and device mapping layer
> > > > functionality as a filesystem layer between file data bio building
> > > > and device bio submission. And as the btrfs file data mapping
> > > > (->iomap_begin) is completely unaware that there is further block
> > > > mapping to be done before block device bio submission, any generic
> > > > code that btrfs uses requires special IO submission hooks rather
> > > > than just calling submit_bio().
> > > > 
> > > > I'm not 100% sure what the solution here is, but the one thing we
> > > > must resist is turning the iomap code into a mess of custom hooks
> > > > that only one filesystem uses. We've been taught this lesson time
> > > > and time again - the iomap infrastructure exists because stuff like
> > > > bufferheads and the old direct IO code ended up so full of special
> > > > case code that it ossified and became unmodifiable and
> > > > unmaintainable.
> > > > 
> > > > We do not want to go down that path again. 
> > > > 
> > > > IMO, the iomap IO model needs to be restructured to support post-IO
> > > > and pre-IO data verification/calculation/transformation operations
> > > > so all the work that needs to be done at the inode/offset context
> > > > level can be done in the iomap path before bio submission/after
> > > > bio completion. This will allow infrastructure like fscrypt, data
> > > > compression, data checksums, etc to be suported generically, not
> > > > just by individual filesystems that provide a ->submit_io hook.
> > > > 
> > > > As for the btrfs needing to slice and dice bios for multiple
> > > > devices?  That should be done via a block device ->make_request
> > > > function, not a custom hook in the iomap code.
> > > 
> > > btrfs differentiates the way how metadata and data is
> > > handled/replicated/stored. We would still need an entry point in the
> > > iomap code to handle the I/O submission.
> > 
> > This is a data IO path. How metadata is stored/replicated is
> > irrelevant to this code path...
> > 
> > > > That's why I don't like this hook - I think hiding data operations
> > > > and/or custom bio manipulations in opaque filesystem callouts is
> > > > completely the wrong approach to be taking. We need to do these
> > > > things in a generic manner so that all filesystems (and block
> > > > devices!) that use the iomap infrastructure can take advantage of
> > > > them, not just one of them.
> > > > 
> > > > Quite frankly, I don't care if it takes more time and work up front,
> > > > I'm tired of expedient hacks to merge code quickly repeatedly biting
> > > > us on the arse and wasting far more time sorting out than we would
> > > > have spent getting it right in the first place.
> > > 
> > > Sure. I am open to ideas. What are you proposing?
> > 
> > That you think about how to normalise the btrfs IO path to fit into
> > the standard iomap/blockdev model, rather than adding special hacks
> > to iomap to allow an opaque, custom, IO model to be shoe-horned into
> > the generic code.
> > 
> > For example, post-read validation requires end-io processing,
> > whether it be encryption, decompression, CRC/T10 validation, etc. The
> > iomap end-io completion has all the information needed to run these
> > things, whether it be a callout to the filesystem for custom
> > processing checking, or a generic "decrypt into supplied data page"
> > sort of thing. These all need to be done in the same place, so we
> > should have common support for this. And I suspect the iomap should
> > also state in a flag that something like this is necessary (e.g.
> > IOMAP_FL_ENCRYPTED indicates post-IO decryption needs to be run).
> 
> To add a word on this topic: I think introducing a fully generic approach
> in IOMAP for encryption, decompression, and verification is hard to fit all
> filesystems, and seems unnecessary, especially when data compression is involved.
> 
> Since data decompression expands the data, the logical
> data size is not the same as the physical data size:
> 
> 1) IO submission should be applied to all physical data, but data
>    decompression is eventually applied to the logical mapping.
>    As for EROFS, it submits all physical pages with page->private
>    pointing to a management structure which maintains all logical pages
>    as well for later decompression. A time-sharing approach is
>    used to store the L2P mapping array in these allocated pages themselves.
> 
>    In addition, IOMAP also needs to consider the fixed-sized output/input
>    difference, which is filesystem specific, and I have no idea whether
>    involving too much code for each requirement is really good for IOMAP;
> 
> 2) The post-read processing order is another negotiable matter.
>    There is no benefit in selecting verity->decrypt rather than
>    decrypt->verity; but when compression is involved, different
>    orders could be selected by different filesystem users:
> 
>     1. decrypt->verity->decompress
> 
>     2. verity->decompress->decrypt
> 
>     3. decompress->decrypt->verity

Maybe "4. decrypt->decompress->verity" is useful as well.

Some post-read processing operates on the physical data size, while
other post-read processing operates on the logical data size.

> 
>    1. and 2. could cause less computation since they process

And less data IO for verification as well.

>    compressed data, and the security is good enough since
>    the behavior of the decompression algorithm is deterministic.
>    3. could cause more computation.
> 
> All I want to say is the post process is so complicated since we have
> many selection if encryption, decompression, verification are all involved.

To correct the wording above: I mean "all I want to say is that the
pre/post processing is so complicated", and therefore a fully generic
approach for decryption, decompression, and verification is hard.

Thanks,
Gao Xiang

> 
> Maybe introducing a core subset into IOMAP is better for long-term
> maintenance and better performance. And we should consider it
> more carefully.
> 
> Thanks,
> Gao Xiang
> 
> > 
> > Similarly, on the IO submit side we have need for a pre-IO
> > processing hook. That can be used to encrypt, compress, calculate
> > data CRCs, do pre-IO COW processing (XFS requires a hook for this),
> > etc.
> > 
> > These hooks are needed for for both buffered and direct IO, and they
> > are needed for more filesystems than just btrfs. fscrypt will need
> > them, XFS needs them, etc. So rather than hide data CRCs,
> > compression, and encryption deep inside the btrfs code, pull it up
> > into common layers that are called by the generic code. THis will
> > leave with just the things like mirroring, raid, IO retries, etc
> > below the iomap code, and that's all stuff that can be done behind a
> > ->make_request function that is passed a bio...
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-08  4:26         ` Gao Xiang
  2019-08-08  4:52           ` Gao Xiang
@ 2019-08-08  5:49           ` Eric Biggers
  2019-08-08  6:28             ` Gao Xiang
                               ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Eric Biggers @ 2019-08-08  5:49 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Dave Chinner, Goldwyn Rodrigues, hch, darrick.wong, linux-btrfs,
	ruansy.fnst, linux-fsdevel, linux-erofs, miaoxie

On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > 
> > > > That's why I don't like this hook - I think hiding data operations
> > > > and/or custom bio manipulations in opaque filesystem callouts is
> > > > completely the wrong approach to be taking. We need to do these
> > > > things in a generic manner so that all filesystems (and block
> > > > devices!) that use the iomap infrastructure can take advantage of
> > > > them, not just one of them.
> > > > 
> > > > Quite frankly, I don't care if it takes more time and work up front,
> > > > I'm tired of expedient hacks to merge code quickly repeatedly biting
> > > > us on the arse and wasting far more time sorting out than we would
> > > > have spent getting it right in the first place.
> > > 
> > > Sure. I am open to ideas. What are you proposing?
> > 
> > That you think about how to normalise the btrfs IO path to fit into
> > the standard iomap/blockdev model, rather than adding special hacks
> > to iomap to allow an opaque, custom, IO model to be shoe-horned into
> > the generic code.
> > 
> > For example, post-read validation requires end-io processing,
> > whether it be encryption, decompression, CRC/T10 validation, etc. The
> > iomap end-io completion has all the information needed to run these
> > things, whether it be a callout to the filesystem for custom
> > processing checking, or a generic "decrypt into supplied data page"
> > sort of thing. These all need to be done in the same place, so we
> > should have common support for this. And I suspect the iomap should
> > also state in a flag that something like this is necessary (e.g.
> > IOMAP_FL_ENCRYPTED indicates post-IO decryption needs to be run).
> 
> To add a word on this topic: I think introducing a fully generic approach
> in IOMAP for encryption, decompression, and verification is hard to fit all
> filesystems, and seems unnecessary, especially when data compression is involved.
> 
> Since data decompression expands the data, the logical
> data size is not the same as the physical data size:
> 
> 1) IO submission should be applied to all physical data, but data
>    decompression is eventually applied to the logical mapping.
>    As for EROFS, it submits all physical pages with page->private
>    pointing to a management structure which maintains all logical pages
>    as well for later decompression. A time-sharing approach is
>    used to store the L2P mapping array in these allocated pages themselves.
> 
>    In addition, IOMAP also needs to consider the fixed-sized output/input
>    difference, which is filesystem specific, and I have no idea whether
>    involving too much code for each requirement is really good for IOMAP;
> 
> 2) The post-read processing order is another negotiable matter.
>    There is no benefit in selecting verity->decrypt rather than
>    decrypt->verity; but when compression is involved, different
>    orders could be selected by different filesystem users:
> 
>     1. decrypt->verity->decompress
> 
>     2. verity->decompress->decrypt
> 
>     3. decompress->decrypt->verity
> 
>    1. and 2. could cause less computation since they process
>    compressed data, and the security is good enough since
>    the behavior of the decompression algorithm is deterministic.
>    3. could cause more computation.
> 
> All I want to say is the post process is so complicated since we have
> many selection if encryption, decompression, verification are all involved.
> 
> Maybe introducing a core subset into IOMAP is better for long-term
> maintenance and better performance. And we should consider it
> more carefully.
> 

FWIW, the only order that actually makes sense is decrypt->decompress->verity.

Decrypt before decompress, i.e. encrypt after compress, because only the
plaintext can be compressible; the ciphertext isn't.

Verity last, on the original data, because otherwise the file hash that
fs-verity reports would be specific to that particular inode on-disk and
therefore would be useless for authenticating the file's user-visible contents.

[By "verity" I mean specifically fs-verity.  Integrity-only block checksums are
a different case; those can be done at any point, but doing them on the
compressed data would make sense as then there would be less to checksum.

And yes, compression+encryption leaks information about the original data, so
may not be advisable.  My point is just that if the two are nevertheless
combined, it only makes sense to compress the plaintext.]
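Eric's compressibility point can be demonstrated with a quick sketch, using uniformly random bytes as a stand-in for ciphertext (a modern cipher's output is computationally indistinguishable from random):

```python
import os
import zlib

plaintext = b"The quick brown fox jumps over the lazy dog. " * 200

# Stand-in for ciphertext: random bytes model its (in)compressibility.
ciphertext_like = os.urandom(len(plaintext))

compressed_plain = zlib.compress(plaintext)
compressed_cipher = zlib.compress(ciphertext_like)

assert len(compressed_plain) < len(plaintext) // 10   # plaintext shrinks a lot
assert len(compressed_cipher) > len(plaintext) - 64   # "ciphertext" doesn't shrink
```

Hence compress-then-encrypt is the only order in which compression actually buys anything.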

- Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-08  5:49           ` Eric Biggers
@ 2019-08-08  6:28             ` Gao Xiang
  2019-08-08  8:16             ` Dave Chinner
  2019-08-09 20:45             ` Matthew Wilcox
  2 siblings, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2019-08-08  6:28 UTC (permalink / raw)
  To: Dave Chinner, Goldwyn Rodrigues, hch, darrick.wong, linux-btrfs,
	ruansy.fnst, linux-fsdevel, linux-erofs, miaoxie

Hi Eric,

On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > > 
> > > > > That's why I don't like this hook - I think hiding data operations
> > > > > and/or custom bio manipulations in opaque filesystem callouts is
> > > > > completely the wrong approach to be taking. We need to do these
> > > > > things in a generic manner so that all filesystems (and block
> > > > > devices!) that use the iomap infrastructure can take advantage of
> > > > > them, not just one of them.
> > > > > 
> > > > > Quite frankly, I don't care if it takes more time and work up front,
> > > > > I'm tired of expedient hacks to merge code quickly repeatedly biting
> > > > > us on the arse and wasting far more time sorting out than we would
> > > > > have spent getting it right in the first place.
> > > > 
> > > > Sure. I am open to ideas. What are you proposing?
> > > 
> > > That you think about how to normalise the btrfs IO path to fit into
> > > the standard iomap/blockdev model, rather than adding special hacks
> > > to iomap to allow an opaque, custom, IO model to be shoe-horned into
> > > the generic code.
> > > 
> > > For example, post-read validation requires end-io processing,
> > > whether it be encryption, decompression, CRC/T10 validation, etc. The
> > > iomap end-io completion has all the information needed to run these
> > > things, whether it be a callout to the filesystem for custom
> > > processing checking, or a generic "decrypt into supplied data page"
> > > sort of thing. These all need to be done in the same place, so we
> > > should have common support for this. And I suspect the iomap should
> > > also state in a flag that something like this is necessary (e.g.
> > > IOMAP_FL_ENCRYPTED indicates post-IO decryption needs to be run).
> > 
> > To add a word on this topic: I think introducing a fully generic approach
> > in IOMAP for encryption, decompression, and verification is hard to fit all
> > filesystems, and seems unnecessary, especially when data compression is involved.
> > 
> > Since data decompression expands the data, the logical
> > data size is not the same as the physical data size:
> > 
> > 1) IO submission should be applied to all physical data, but data
> >    decompression is eventually applied to the logical mapping.
> >    As for EROFS, it submits all physical pages with page->private
> >    pointing to a management structure which maintains all logical pages
> >    as well for later decompression. A time-sharing approach is
> >    used to store the L2P mapping array in these allocated pages themselves.
> > 
> >    In addition, IOMAP also needs to consider the fixed-sized output/input
> >    difference, which is filesystem specific, and I have no idea whether
> >    involving too much code for each requirement is really good for IOMAP;
> > 
> > 2) The post-read processing order is another negotiable matter.
> >    There is no benefit in selecting verity->decrypt rather than
> >    decrypt->verity; but when compression is involved, different
> >    orders could be selected by different filesystem users:
> > 
> >     1. decrypt->verity->decompress
> > 
> >     2. verity->decompress->decrypt
> > 
> >     3. decompress->decrypt->verity
> > 
> >    1. and 2. could cause less computation since they process
> >    compressed data, and the security is good enough since
> >    the behavior of the decompression algorithm is deterministic.
> >    3. could cause more computation.
> > 
> > All I want to say is the post process is so complicated since we have
> > many selection if encryption, decompression, verification are all involved.
> > 
> > Maybe introducing a core subset into IOMAP is better for long-term
> > maintenance and better performance. And we should consider it
> > more carefully.
> > 
> 
> FWIW, the only order that actually makes sense is decrypt->decompress->verity.

I am not just talking about fs-verity as you mentioned below.

> 
> Decrypt before decompress, i.e. encrypt after compress, because only the
> plaintext can be compressible; the ciphertext isn't.

There could be some potential users that need partial decrypt/decompress,
but that is a minor case. I don't want to go into that detail in this topic.

> 
> Verity last, on the original data, because otherwise the file hash that
> fs-verity reports would be specific to that particular inode on-disk and
> therefore would be useless for authenticating the file's user-visible contents.
> 
> [By "verity" I mean specifically fs-verity.  Integrity-only block checksums are
> a different case; those can be done at any point, but doing them on the
> compressed data would make sense as then there would be less to checksum.
> 
> And yes, compression+encryption leaks information about the original data, so
> may not be advisable.  My point is just that if the two are nevertheless
> combined, it only makes sense to compress the plaintext.]

I cannot fully agree with your point. (I was not talking about fs-verity
specifically, but about a generic verity approach.)

Consider that we later introduce a block-based verity solution for all
on-disk data in EROFS. That would mean all data (compressed data and
metadata) already comes from a trusted source at least (like dm-verity).

Either verity->decompress or decompress->verity is then safe, since both
decompression algorithms and verity algorithms are _deterministic_ and
should be considered _bug-free_, therefore there is only one possible result.

And if you say the decompression algorithm is untrusted because of bugs
or the like, the same applies to the verity algorithm. In other words, if
we consider software/hardware bugs, we cannot trust any combination of results.

An advantage of verity->decompress over decompress->verity is that
the data to verify is smaller, so
  1) we have less I/O for most I/O patterns;
and
  2) we consume less CPU.
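To put a concrete shape on those two points, here is a small sketch (illustrative sizes only) of how much data each order has to read and hash for a compressible block:

```python
import zlib

logical = b"log line: everything is fine\n" * 4096   # data as the reader sees it
physical = zlib.compress(logical)                    # data as stored on disk

# verity->decompress reads and hashes only the (smaller) physical bytes;
# decompress->verity hashes the expanded logical bytes.
bytes_hashed_verity_first = len(physical)   # verify compressed data
bytes_hashed_verity_last = len(logical)     # verify decompressed data

assert bytes_hashed_verity_first < bytes_hashed_verity_last
```

The gap grows with the compression ratio, which is where the claimed I/O and CPU savings come from.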

Taking a step back, there are many compression formats in user space,
like apk or whatever, so "plaintext" is a relative term. We cannot
assume the data delivered to the end user is absolutely correct.

Thanks,
Gao Xiang


> 
> - Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-08  5:49           ` Eric Biggers
  2019-08-08  6:28             ` Gao Xiang
@ 2019-08-08  8:16             ` Dave Chinner
  2019-08-08  8:57               ` Gao Xiang
  2019-08-08  9:29               ` Gao Xiang
  2019-08-09 20:45             ` Matthew Wilcox
  2 siblings, 2 replies; 42+ messages in thread
From: Dave Chinner @ 2019-08-08  8:16 UTC (permalink / raw)
  To: Gao Xiang, Goldwyn Rodrigues, hch, darrick.wong, linux-btrfs,
	ruansy.fnst, linux-fsdevel, linux-erofs, miaoxie

On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> FWIW, the only order that actually makes sense is decrypt->decompress->verity.

*nod*

Especially once we get the inline encryption support for fscrypt so
the storage layer can offload the encrypt/decrypt to hardware via
the bio containing plaintext. That pretty much forces fscrypt to be
the lowest layer of the filesystem transformation stack.  This
hardware offload capability also places lots of limits on what you
can do with block-based verity layers below the filesystem. e.g.
using dm-verity when you don't know if there's hardware encryption
below or software encryption on top becomes problematic...

So really, from a filesystem and iomap perspective, what Eric says
is right - it's the only order that makes sense...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-08  8:16             ` Dave Chinner
@ 2019-08-08  8:57               ` Gao Xiang
  2019-08-08  9:29               ` Gao Xiang
  1 sibling, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2019-08-08  8:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Goldwyn Rodrigues, hch, darrick.wong, linux-btrfs, ruansy.fnst,
	linux-fsdevel, linux-erofs, miaoxie

Hi Dave,

On Thu, Aug 08, 2019 at 06:16:47PM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> 
> *nod*
> 
> Especially once we get the inline encryption support for fscrypt so
> the storage layer can offload the encrypt/decrypt to hardware via
> the bio containing plaintext. That pretty much forces fscrypt to be
> the lowest layer of the filesystem transformation stack.  This
> hardware offload capability also places lots of limits on what you
> can do with block-based verity layers below the filesystem. e.g.
> using dm-verity when you don't know if there's hardware encryption
> below or software encryption on top becomes problematic...
> 
> So really, from a filesystem and iomap perspective, what Eric says
> is right - it's the only order that makes sense...

Don't be surprised if there is a decrypt/verity/decompress
all-in-one hardware approach for such stuff. 30% of random IO (with
either a hardware or a software approach) can be saved, which is greatly
helpful for user experience on embedded devices with very limited resources.

And I really hit a SHA-256 CPU hardware bug years ago.

I don't want to talk more about trends; it depends on the real scenario
and user selection (server or embedded device).

From a security standpoint, these approaches are all at the same
level --- they all start from the same signed key and
storage source, so the transformations A->B->C and A->C->B are equivalent.

For bug-freeness, we can fuzz compression/verity algorithms and even
the whole filesystem stack. That is a separate matter from the
security consideration.

Thanks,
Gao Xiang

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-08  8:16             ` Dave Chinner
  2019-08-08  8:57               ` Gao Xiang
@ 2019-08-08  9:29               ` Gao Xiang
  2019-08-08 11:21                 ` Gao Xiang
  1 sibling, 1 reply; 42+ messages in thread
From: Gao Xiang @ 2019-08-08  9:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Goldwyn Rodrigues, hch, darrick.wong, linux-btrfs, ruansy.fnst,
	linux-fsdevel, linux-erofs, miaoxie

On Thu, Aug 08, 2019 at 06:16:47PM +1000, Dave Chinner wrote:
> On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> 
> *nod*
> 
> Especially once we get the inline encryption support for fscrypt so
> the storage layer can offload the encrypt/decrypt to hardware via
> the bio containing plaintext. That pretty much forces fscrypt to be
> the lowest layer of the filesystem transformation stack.  This
> hardware offload capability also places lots of limits on what you
> can do with block-based verity layers below the filesystem. e.g.
> using dm-verity when you don't know if there's hardware encryption
> below or software encryption on top becomes problematic...

To add a word: I was just talking about the trade-offs between "decrypt->
decompress->verity" and "decrypt->verity->decompress". I think both forms
are compatible with inline en/decryption. I don't care which level
"decrypt" is at... but maybe some users care. Am I missing something?

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-08  9:29               ` Gao Xiang
@ 2019-08-08 11:21                 ` Gao Xiang
  2019-08-08 13:11                   ` Gao Xiang
  0 siblings, 1 reply; 42+ messages in thread
From: Gao Xiang @ 2019-08-08 11:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Biggers, Goldwyn Rodrigues, hch, darrick.wong, linux-btrfs,
	ruansy.fnst, linux-fsdevel, linux-erofs, miaoxie

On Thu, Aug 08, 2019 at 05:29:47PM +0800, Gao Xiang wrote:
> On Thu, Aug 08, 2019 at 06:16:47PM +1000, Dave Chinner wrote:
> > On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> > 
> > *nod*
> > 
> > Especially once we get the inline encryption support for fscrypt so
> > the storage layer can offload the encrypt/decrypt to hardware via
> > the bio containing plaintext. That pretty much forces fscrypt to be
> > the lowest layer of the filesystem transformation stack.  This
> > hardware offload capability also places lots of limits on what you
> > can do with block-based verity layers below the filesystem. e.g.
> > using dm-verity when you don't know if there's hardware encryption
> > below or software encryption on top becomes problematic...

...and I'm not talking about fs-verity; I personally think fs-verity
is great. I am only talking about the generic case.

In order to know which level becomes problematic, there could even
be another choice, "decrypt->verity1->decompress->verity2", for such a
requirement (assuming verity1/2 themselves are absolutely bug-free):
verity1 can be a strong Merkle tree and verity2 a weak form (just
a simple Adler-32/CRC32 over the compressed block), so we can locate
whether it's a decrypt bug or a decompress bug.

Many compression container formats, such as gzip, already include
such a weak form, so there is no need to add an extra
post-processing step.
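For instance, the gzip container stores a CRC-32 of the uncompressed data in its trailer, so corruption is caught during decompression without any extra post-processing step (a sketch):

```python
import gzip

payload = b"erofs compressed block " * 64
blob = bytearray(gzip.compress(payload))
blob[len(blob) // 2] ^= 0xFF        # corrupt one byte inside the deflate stream

try:
    gzip.decompress(bytes(blob))
    detected = False
except Exception:                    # zlib.error mid-inflate, or BadGzipFile at the CRC
    detected = True

assert detected
assert gzip.decompress(gzip.compress(payload)) == payload  # intact data round-trips
```

This is exactly the kind of weak per-block check that comes "for free" with the container format.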

And I have no idea which (decrypt->verity1->decompress->verity2 or
decrypt->decompress->verity) is faster, since verity2 is rather simple.
However, if we use only the strong form at the end, there could be
a lot of extra IO and expensive multi-level computation if files
are highly compressible.

On the other hand, such a verity2 can be computed offline / avoided
by fuzzing tools for read-only scenarios (for example, after building
these images, doing a full image verification with the given kernel)
in order to ensure its stability. (In any case, I'm talking about
how to make those algorithms bug-free.)

All I want to say is I think "decrypt->verity->decompress" is
reasonable as well.

Thanks,
Gao Xiang

> 
> To add a word: I was just talking about the trade-offs between "decrypt->
> decompress->verity" and "decrypt->verity->decompress". I think both forms
> are compatible with inline en/decryption. I don't care which level
> "decrypt" is at... but maybe some users care. Am I missing something?
> 
> Thanks,
> Gao Xiang
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-08 11:21                 ` Gao Xiang
@ 2019-08-08 13:11                   ` Gao Xiang
  0 siblings, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2019-08-08 13:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Biggers, Goldwyn Rodrigues, hch, darrick.wong, linux-btrfs,
	ruansy.fnst, linux-fsdevel, linux-erofs, miaoxie

On Thu, Aug 08, 2019 at 07:21:39PM +0800, Gao Xiang wrote:
> On Thu, Aug 08, 2019 at 05:29:47PM +0800, Gao Xiang wrote:
> > On Thu, Aug 08, 2019 at 06:16:47PM +1000, Dave Chinner wrote:
> > > On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > > > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> > > 
> > > *nod*
> > > 
> > > Especially once we get the inline encryption support for fscrypt so
> > > the storage layer can offload the encrypt/decrypt to hardware via
> > > the bio containing plaintext. That pretty much forces fscrypt to be
> > > the lowest layer of the filesystem transformation stack.  This
> > > hardware offload capability also places lots of limits on what you
> > > can do with block-based verity layers below the filesystem. e.g.
> > > using dm-verity when you don't know if there's hardware encryption
> > > below or software encryption on top becomes problematic...
> 
> ...and I'm not talking about fs-verity; I personally think fs-verity
> is great. I am only talking about the generic case.
> 
> In order to know which level becomes problematic, there could even
> be another choice, "decrypt->verity1->decompress->verity2", for such a
> requirement (assuming verity1/2 themselves are absolutely bug-free):
> verity1 can be a strong Merkle tree and verity2 a weak form (just
> a simple Adler-32/CRC32 over the compressed block), so we can locate
> whether it's a decrypt bug or a decompress bug.
> 
> Many compression container formats, such as gzip, already include
> such a weak form, so there is no need to add an extra
> post-processing step.
> 
> And I have no idea which (decrypt->verity1->decompress->verity2 or
> decrypt->decompress->verity) is faster, since verity2 is rather simple.
> However, if we use only the strong form at the end, there could be
> a lot of extra IO and expensive multi-level computation if files
> are highly compressible.
> 
> On the other hand, such a verity2 can be computed offline / avoided
> by fuzzing tools for read-only scenarios (for example, after building
> these images, doing a full image verification with the given kernel)
> in order to ensure its stability. (In any case, I'm talking about
> how to make those algorithms bug-free.)
> 
> All I want to say is I think "decrypt->verity->decompress" is
> reasonable as well.

... And another fundamental concern is that if we don't verify earlier
(I mean the on-disk data), then untrusted data will be transformed
(decompressed, and even decrypted if there is no inline encryption) at
some risk, and it seems _vulnerable_ if such decrypt/decompress
algorithms have _security issues_ (such as buffer overflows). It seems
less secure than doing verity earlier.
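That verify-before-decompress discipline can be sketched as follows; the helper name and the source of the trusted digest (e.g. a signed Merkle tree) are hypothetical:

```python
import hashlib
import zlib

def safe_read(block: bytes, trusted_digest: bytes) -> bytes:
    # Verify the raw on-disk bytes before they ever reach the
    # decompressor, so a malformed block cannot exploit a bug in it.
    if hashlib.sha256(block).digest() != trusted_digest:
        raise ValueError("untrusted block, refusing to decompress")
    return zlib.decompress(block)

good = zlib.compress(b"trusted data")
digest = hashlib.sha256(good).digest()
assert safe_read(good, digest) == b"trusted data"

tampered = good[:-1] + bytes([good[-1] ^ 1])
try:
    safe_read(tampered, digest)
    rejected = False
except ValueError:
    rejected = True
assert rejected
```

The tampered block never touches the decompressor, which is the whole point of verifying earlier.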

Thanks,
Gao Xiang

> 
> Thanks,
> Gao Xiang
> 
> > 
> > Add a word, I was just talking benefits between "decrypt->decompress->
> > verity" and "decrypt->verity->decompress", I think both forms are
> > compatible with inline en/decryption. I don't care which level
> > "decrypt" is at... But maybe some user cares. Am I missing something?
> > 
> > Thanks,
> > Gao Xiang
> > 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-08  5:49           ` Eric Biggers
  2019-08-08  6:28             ` Gao Xiang
  2019-08-08  8:16             ` Dave Chinner
@ 2019-08-09 20:45             ` Matthew Wilcox
  2019-08-09 23:45               ` Gao Xiang
  2019-08-10  0:17               ` Eric Biggers
  2 siblings, 2 replies; 42+ messages in thread
From: Matthew Wilcox @ 2019-08-09 20:45 UTC (permalink / raw)
  To: Gao Xiang, Dave Chinner, Goldwyn Rodrigues, hch, darrick.wong,
	linux-btrfs, ruansy.fnst, linux-fsdevel, linux-erofs, miaoxie

On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> >     1. decrypt->verity->decompress
> > 
> >     2. verity->decompress->decrypt
> > 
> >     3. decompress->decrypt->verity
> > 
> >    1. and 2. could cause less computation since they process
> >    compressed data, and the security is good enough since
> >    the behavior of the decompression algorithm is deterministic.
> >    3. could cause more computation.
> > 
> > All I want to say is the post process is so complicated since we have
> > many selection if encryption, decompression, verification are all involved.
> > 
> > Maybe introducing a core subset into IOMAP is better for long-term
> > maintenance and better performance. And we should consider it
> > more carefully.
> > 
> 
> FWIW, the only order that actually makes sense is decrypt->decompress->verity.

That used to be true, but a paper in 2004 suggested it's not.
Further work in this space followed in 2009, based on block ciphers:
https://arxiv.org/pdf/1009.1759

It looks like it'd be computationally expensive to do, but feasible.

> Decrypt before decompress, i.e. encrypt after compress, because only the
> plaintext can be compressible; the ciphertext isn't.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-09 20:45             ` Matthew Wilcox
@ 2019-08-09 23:45               ` Gao Xiang
  2019-08-10  0:31                 ` Eric Biggers
  2019-08-10  0:17               ` Eric Biggers
  1 sibling, 1 reply; 42+ messages in thread
From: Gao Xiang @ 2019-08-09 23:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gao Xiang, Dave Chinner, Goldwyn Rodrigues, hch, darrick.wong,
	linux-btrfs, ruansy.fnst, linux-fsdevel, linux-erofs, miaoxie

Hi Willy,

On Fri, Aug 09, 2019 at 01:45:17PM -0700, Matthew Wilcox wrote:
> On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > >     1. decrypt->verity->decompress
> > > 
> > >     2. verity->decompress->decrypt
> > > 
> > >     3. decompress->decrypt->verity
> > > 
> > >    1. and 2. could cause less computation since they process
> > >    compressed data, and the security is good enough since
> > >    the behavior of the decompression algorithm is deterministic.
> > >    3. could cause more computation.
> > > 
> > > All I want to say is the post process is so complicated since we have
> > > many selection if encryption, decompression, verification are all involved.
> > > 
> > > Maybe introducing a core subset into IOMAP is better for long-term
> > > maintenance and better performance. And we should consider it
> > > more carefully.
> > > 
> > 
> > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> 
> That used to be true, but a paper in 2004 suggested it's not true.
> Further work in this space in 2009 based on block ciphers:
> https://arxiv.org/pdf/1009.1759
> 
> It looks like it'd be computationally expensive to do, but feasible.

Yes, maybe someone cares where encryption sits due to their system design.

And having thought it over these days, I have to repeat my thought on
verity again :( the meaningful order ought to be "decrypt->verity->decompress"
rather than "decrypt->decompress->verity" if compression is involved.

Since most (de)compression algorithms are complex enough (they allocate
memory and do a lot of unsafe things such as wildcopy), and may even be
unsafe by design, we cannot do verity at the end for security reasons;
with that order the whole system can be made vulnerable by malformed
on-disk data. In other words, we need to verify the compressed data.

fs-verity is fine for me since most decryption algorithms are stable and
reliable with no compression by design, but if some software decryption
algorithm is complicated enough, I'd suggest "verity->decrypt" as well
to some extent.

Considering a transformation chain "A->B->C->D->...->verity", if any of
"A->B->C->D->..." is attacked via malformed on-disk data, it could crash
or even root the whole operating system.

All in all, we have to verify data earlier in order to get trusted data
for later complex transformation chains.
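
The ordering argument above can be sketched in userspace terms (Python,
purely illustrative; the decrypt step is omitted and none of these names
correspond to any kernel API) -- verifying the compressed bytes first keeps
malformed data away from the complex decompressor entirely:

```python
import hashlib
import zlib

def read_block(disk_bytes: bytes, trusted_digest: bytes) -> bytes:
    """verity->decompress: verify *before* the complex decompressor
    ever sees the (possibly malformed) on-disk bytes."""
    if hashlib.sha256(disk_bytes).digest() != trusted_digest:
        raise IOError("verity failure: refusing to decompress untrusted data")
    return zlib.decompress(disk_bytes)

plaintext = b"hello btrfs" * 100
on_disk = zlib.compress(plaintext)
digest = hashlib.sha256(on_disk).digest()

# Good data passes verification and decompresses normally.
assert read_block(on_disk, digest) == plaintext

# A single flipped bit is rejected before zlib ever runs.
tampered = bytes([on_disk[0] ^ 1]) + on_disk[1:]
try:
    read_block(tampered, digest)
    assert False, "tampered block must be rejected"
except IOError:
    pass
```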

As for the performance benefit I described in my previous email, there
seems no need to repeat it... please take these points into consideration;
I think it is not easy to settle on a single generic post-read order for
all real systems.

Thanks,
Gao Xiang

> 
> > Decrypt before decompress, i.e. encrypt after compress, because only the
> > plaintext can be compressible; the ciphertext isn't.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-09 20:45             ` Matthew Wilcox
  2019-08-09 23:45               ` Gao Xiang
@ 2019-08-10  0:17               ` Eric Biggers
  1 sibling, 0 replies; 42+ messages in thread
From: Eric Biggers @ 2019-08-10  0:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gao Xiang, Dave Chinner, Goldwyn Rodrigues, hch, darrick.wong,
	linux-btrfs, ruansy.fnst, linux-fsdevel, linux-erofs, miaoxie

On Fri, Aug 09, 2019 at 01:45:17PM -0700, Matthew Wilcox wrote:
> On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > >     1. decrypt->verity->decompress
> > > 
> > >     2. verity->decompress->decrypt
> > > 
> > >     3. decompress->decrypt->verity
> > > 
> > >    1. and 2. could cause less computation since it processes
> > >    compressed data, and the security is good enough since
> > >    the behavior of decompression algorithm is deterministic.
> > >    3 could cause more computation.
> > > 
> > > All I want to say is the post process is so complicated since we have
> > > many selection if encryption, decompression, verification are all involved.
> > > 
> > > Maybe introduce a core subset to IOMAP is better for long-term
> > > maintainment and better performance. And we should consider it
> > > more carefully.
> > > 
> > 
> > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> 
> That used to be true, but a paper in 2004 suggested it's not true.
> Further work in this space in 2009 based on block ciphers:
> https://arxiv.org/pdf/1009.1759
> 
> It looks like it'd be computationally expensive to do, but feasible.
> 
> > Decrypt before decompress, i.e. encrypt after compress, because only the
> > plaintext can be compressible; the ciphertext isn't.

It's an interesting paper, but even assuming that "compress after encrypt" could
provide some actual benefit over the usual order (I can't think of any in this
context), it doesn't sound practical.  From what I understand from that paper:

- It assumes the compressor just *knows* a priori some pattern in the plaintext,
  i.e. it can't be arbitrary data.  E.g. the compressor for CBC encrypted data
  assumes that each 128 bits of plaintext is drawn from a distribution much
  smaller than the 2^128 possible values, e.g. at most a certain number of bits
  are set.  If any other data is encrypted+compressed, then the compressor will
  corrupt it, and it's impossible for it to detect that it did so.

  That alone makes it unusable for any use case we're talking about here.

- It only works for some specific encryption modes, and even then each
  encryption mode needs a custom compression algorithm designed just for it.
  I don't see how it could work for XTS, let alone a wide-block mode.

- The decompressor needs access to the encryption key.  [If that's allowed, why
  can't the compressor have access to it too?]

- It's almost certainly *much* slower and won't compress as well as conventional
  compression algorithms (gzip, LZ4, ZSTD, ...) that operate on the plaintext.

Eric


* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-09 23:45               ` Gao Xiang
@ 2019-08-10  0:31                 ` Eric Biggers
  2019-08-10  0:50                   ` Eric Biggers
  2019-08-10  1:13                   ` Gao Xiang
  0 siblings, 2 replies; 42+ messages in thread
From: Eric Biggers @ 2019-08-10  0:31 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Matthew Wilcox, Gao Xiang, Dave Chinner, Goldwyn Rodrigues, hch,
	darrick.wong, linux-btrfs, ruansy.fnst, linux-fsdevel,
	linux-erofs, miaoxie

On Sat, Aug 10, 2019 at 07:45:59AM +0800, Gao Xiang wrote:
> Hi Willy,
> 
> On Fri, Aug 09, 2019 at 01:45:17PM -0700, Matthew Wilcox wrote:
> > On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > > On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > > >     1. decrypt->verity->decompress
> > > > 
> > > >     2. verity->decompress->decrypt
> > > > 
> > > >     3. decompress->decrypt->verity
> > > > 
> > > >    1. and 2. could cause less computation since it processes
> > > >    compressed data, and the security is good enough since
> > > >    the behavior of decompression algorithm is deterministic.
> > > >    3 could cause more computation.
> > > > 
> > > > All I want to say is the post process is so complicated since we have
> > > > many selection if encryption, decompression, verification are all involved.
> > > > 
> > > > Maybe introduce a core subset to IOMAP is better for long-term
> > > > maintainment and better performance. And we should consider it
> > > > more carefully.
> > > > 
> > > 
> > > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> > 
> > That used to be true, but a paper in 2004 suggested it's not true.
> > Further work in this space in 2009 based on block ciphers:
> > https://arxiv.org/pdf/1009.1759
> > 
> > It looks like it'd be computationally expensive to do, but feasible.
> 
> Yes, maybe someone cares where encrypt is at due to their system design.
> 
> and I thought over these days, I have to repeat my thought of verity
> again :( the meaningful order ought to be "decrypt->verity->decompress"
> rather than "decrypt->decompress->verity" if compression is involved.
> 
> since most (de)compress algorithms are complex enough (allocate memory and
> do a lot of unsafe stuffes such as wildcopy) and even maybe unsafe by its
> design, we cannot do verity in the end for security consideration thus
> the whole system can be vulnerable by this order from malformed on-disk
> data. In other words, we need to verify on compressed data.
> 
> Fsverity is fine for me since most decrypt algorithms is stable and reliable
> and no compression by its design, but if some decrypt software algorithms is
> complicated enough, I'd suggest "verity->decrypt" as well to some extent.
> 
> Considering transformation "A->B->C->D->....->verity", if any of "A->B->C
> ->D->..." is attacked by the malformed on-disk data... It would crash or
> even root the whole operating system.
> 
> All in all, we have to verify data earlier in order to get trusted data
> for later complex transformation chains.
> 
> The performance benefit I described in my previous email, it seems no need
> to say again... please take them into consideration and I think it's no
> easy to get a unique generic post-read order for all real systems.
> 

While it would be nice to protect against filesystem bugs, it's not the point of
fs-verity.  fs-verity is about authenticating the contents the *user* sees, so
that e.g. a file can be distributed to many computers and it can be
authenticated regardless of exactly what other filesystem features were used
when it was stored on disk.  Different computers may use:

- Different filesystems
- Different compression algorithms (or no compression)
- Different compression strengths, even with same algorithm
- Different divisions of the file into compression units
- Different encryption algorithms (or no encryption)
- Different encryption keys, even with same algorithm
- Different encryption nonces, even with same key

All those change the on-disk data; only the user-visible data stays the same.
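
This invariance can be illustrated with a toy sketch (Python, illustrative
only -- real fs-verity uses a Merkle tree over data blocks, not a flat
hash): the digest covers the user-visible bytes, so it is identical no
matter how those bytes are represented on disk:

```python
import hashlib
import zlib

def verity_digest(user_visible: bytes) -> bytes:
    # Toy stand-in for the fs-verity file digest: computed over the
    # contents the user sees, never over the on-disk representation.
    return hashlib.sha256(user_visible).digest()

data = b"same user-visible file contents" * 64

# Different on-disk representations of the same file...
on_disk_stored = zlib.compress(data, level=0)  # stored, no compression
on_disk_best   = zlib.compress(data, level=9)  # maximum compression

assert on_disk_stored != on_disk_best          # on-disk bytes differ

# ...but the digest over the user-visible data is the same for all.
assert verity_digest(zlib.decompress(on_disk_stored)) == verity_digest(data)
assert verity_digest(zlib.decompress(on_disk_best)) == verity_digest(data)
```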

Bugs in filesystems may also be exploited regardless of fs-verity, since an
attacker able to manipulate the on-disk image can create a malicious file
without fs-verity enabled somewhere else on the filesystem.

If you actually want to authenticate the full filesystem image, you need to use
dm-verity, which is designed for that.

- Eric


* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-10  0:31                 ` Eric Biggers
@ 2019-08-10  0:50                   ` Eric Biggers
  2019-08-10  1:34                     ` Gao Xiang
  2019-08-10  1:13                   ` Gao Xiang
  1 sibling, 1 reply; 42+ messages in thread
From: Eric Biggers @ 2019-08-10  0:50 UTC (permalink / raw)
  To: Gao Xiang, Matthew Wilcox, Gao Xiang, Dave Chinner,
	Goldwyn Rodrigues, hch, darrick.wong, linux-btrfs, ruansy.fnst,
	linux-fsdevel, linux-erofs, miaoxie

On Fri, Aug 09, 2019 at 05:31:35PM -0700, Eric Biggers wrote:
> On Sat, Aug 10, 2019 at 07:45:59AM +0800, Gao Xiang wrote:
> > Hi Willy,
> > 
> > On Fri, Aug 09, 2019 at 01:45:17PM -0700, Matthew Wilcox wrote:
> > > On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > > > On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > > > >     1. decrypt->verity->decompress
> > > > > 
> > > > >     2. verity->decompress->decrypt
> > > > > 
> > > > >     3. decompress->decrypt->verity
> > > > > 
> > > > >    1. and 2. could cause less computation since it processes
> > > > >    compressed data, and the security is good enough since
> > > > >    the behavior of decompression algorithm is deterministic.
> > > > >    3 could cause more computation.
> > > > > 
> > > > > All I want to say is the post process is so complicated since we have
> > > > > many selection if encryption, decompression, verification are all involved.
> > > > > 
> > > > > Maybe introduce a core subset to IOMAP is better for long-term
> > > > > maintainment and better performance. And we should consider it
> > > > > more carefully.
> > > > > 
> > > > 
> > > > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> > > 
> > > That used to be true, but a paper in 2004 suggested it's not true.
> > > Further work in this space in 2009 based on block ciphers:
> > > https://arxiv.org/pdf/1009.1759
> > > 
> > > It looks like it'd be computationally expensive to do, but feasible.
> > 
> > Yes, maybe someone cares where encrypt is at due to their system design.
> > 
> > and I thought over these days, I have to repeat my thought of verity
> > again :( the meaningful order ought to be "decrypt->verity->decompress"
> > rather than "decrypt->decompress->verity" if compression is involved.
> > 
> > since most (de)compress algorithms are complex enough (allocate memory and
> > do a lot of unsafe stuffes such as wildcopy) and even maybe unsafe by its
> > design, we cannot do verity in the end for security consideration thus
> > the whole system can be vulnerable by this order from malformed on-disk
> > data. In other words, we need to verify on compressed data.
> > 
> > Fsverity is fine for me since most decrypt algorithms is stable and reliable
> > and no compression by its design, but if some decrypt software algorithms is
> > complicated enough, I'd suggest "verity->decrypt" as well to some extent.
> > 
> > Considering transformation "A->B->C->D->....->verity", if any of "A->B->C
> > ->D->..." is attacked by the malformed on-disk data... It would crash or
> > even root the whole operating system.
> > 
> > All in all, we have to verify data earlier in order to get trusted data
> > for later complex transformation chains.
> > 
> > The performance benefit I described in my previous email, it seems no need
> > to say again... please take them into consideration and I think it's no
> > easy to get a unique generic post-read order for all real systems.
> > 
> 
> While it would be nice to protect against filesystem bugs, it's not the point of
> fs-verity.  fs-verity is about authenticating the contents the *user* sees, so
> that e.g. a file can be distributed to many computers and it can be
> authenticated regardless of exactly what other filesystem features were used
> when it was stored on disk.  Different computers may use:
> 
> - Different filesystems
> - Different compression algorithms (or no compression)
> - Different compression strengths, even with same algorithm
> - Different divisions of the file into compression units
> - Different encryption algorithms (or no encryption)
> - Different encryption keys, even with same algorithm
> - Different encryption nonces, even with same key
> 
> All those change the on-disk data; only the user-visible data stays the same.
> 
> Bugs in filesystems may also be exploited regardless of fs-verity, as the
> attacker (able to manipulate on-disk image) can create a malicious file without
> fs-verity enabled, somewhere else on the filesystem.
> 
> If you actually want to authenticate the full filesystem image, you need to use
> dm-verity, which is designed for that.
> 

Also keep in mind that ideally the encryption layer would do authenticated
encryption, so that during decrypt->decompress->verity the blocks only get past
the decrypt step if they're authentically from someone with the encryption key.
That's currently missing from fscrypt for practical reasons (read/write
per-block metadata is really hard on most filesystems), but in an ideal world it
would be there.  The fs-verity step is conceptually different, but it seems it's
being conflated with this missing step.
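
The missing per-block authentication step can be sketched as a toy
encrypt-then-MAC scheme (Python, purely illustrative -- this is *not* how
fscrypt works and is not a secure cipher; the point is only that
authenticated decryption rejects data not produced with the key before any
later stage runs):

```python
import hashlib
import hmac

KEY = b"\x00" * 32  # toy key; real systems derive per-file keys

def toy_encrypt(plaintext: bytes) -> bytes:
    # Toy stream "cipher" (XOR with a keyed keystream) plus a MAC tag.
    stream = hashlib.sha256(KEY).digest() * (len(plaintext) // 32 + 1)
    ct = bytes(p ^ s for p, s in zip(plaintext, stream))
    tag = hmac.new(KEY, ct, hashlib.sha256).digest()
    return ct + tag

def toy_decrypt(blob: bytes) -> bytes:
    ct, tag = blob[:-32], blob[-32:]
    # Authenticated decryption: anything not made with KEY is rejected,
    # so later steps (decompress, verity) only ever see authentic data.
    want = hmac.new(KEY, ct, hashlib.sha256).digest()
    if not hmac.compare_digest(want, tag):
        raise ValueError("authentication failure")
    stream = hashlib.sha256(KEY).digest() * (len(ct) // 32 + 1)
    return bytes(c ^ s for c, s in zip(ct, stream))

blob = toy_encrypt(b"block contents")
assert toy_decrypt(blob) == b"block contents"

tampered = bytes([blob[0] ^ 1]) + blob[1:]
try:
    toy_decrypt(tampered)
    assert False, "tampered ciphertext must be rejected"
except ValueError:
    pass
```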

- Eric


* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-10  0:31                 ` Eric Biggers
  2019-08-10  0:50                   ` Eric Biggers
@ 2019-08-10  1:13                   ` Gao Xiang
  1 sibling, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2019-08-10  1:13 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Matthew Wilcox, Gao Xiang, Dave Chinner, Goldwyn Rodrigues, hch,
	darrick.wong, linux-btrfs, ruansy.fnst, linux-fsdevel,
	linux-erofs, miaoxie

On Fri, Aug 09, 2019 at 05:31:36PM -0700, Eric Biggers wrote:
> On Sat, Aug 10, 2019 at 07:45:59AM +0800, Gao Xiang wrote:
> > Hi Willy,
> > 
> > On Fri, Aug 09, 2019 at 01:45:17PM -0700, Matthew Wilcox wrote:
> > > On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > > > On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > > > >     1. decrypt->verity->decompress
> > > > > 
> > > > >     2. verity->decompress->decrypt
> > > > > 
> > > > >     3. decompress->decrypt->verity
> > > > > 
> > > > >    1. and 2. could cause less computation since it processes
> > > > >    compressed data, and the security is good enough since
> > > > >    the behavior of decompression algorithm is deterministic.
> > > > >    3 could cause more computation.
> > > > > 
> > > > > All I want to say is the post process is so complicated since we have
> > > > > many selection if encryption, decompression, verification are all involved.
> > > > > 
> > > > > Maybe introduce a core subset to IOMAP is better for long-term
> > > > > maintainment and better performance. And we should consider it
> > > > > more carefully.
> > > > > 
> > > > 
> > > > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> > > 
> > > That used to be true, but a paper in 2004 suggested it's not true.
> > > Further work in this space in 2009 based on block ciphers:
> > > https://arxiv.org/pdf/1009.1759
> > > 
> > > It looks like it'd be computationally expensive to do, but feasible.
> > 
> > Yes, maybe someone cares where encrypt is at due to their system design.
> > 
> > and I thought over these days, I have to repeat my thought of verity
> > again :( the meaningful order ought to be "decrypt->verity->decompress"
> > rather than "decrypt->decompress->verity" if compression is involved.
> > 
> > since most (de)compress algorithms are complex enough (allocate memory and
> > do a lot of unsafe stuffes such as wildcopy) and even maybe unsafe by its
> > design, we cannot do verity in the end for security consideration thus
> > the whole system can be vulnerable by this order from malformed on-disk
> > data. In other words, we need to verify on compressed data.
> > 
> > Fsverity is fine for me since most decrypt algorithms is stable and reliable
> > and no compression by its design, but if some decrypt software algorithms is
> > complicated enough, I'd suggest "verity->decrypt" as well to some extent.
> > 
> > Considering transformation "A->B->C->D->....->verity", if any of "A->B->C
> > ->D->..." is attacked by the malformed on-disk data... It would crash or
> > even root the whole operating system.
> > 
> > All in all, we have to verify data earlier in order to get trusted data
> > for later complex transformation chains.
> > 
> > The performance benefit I described in my previous email, it seems no need
> > to say again... please take them into consideration and I think it's no
> > easy to get a unique generic post-read order for all real systems.
> > 
> 
> While it would be nice to protect against filesystem bugs, it's not the point of
> fs-verity.  fs-verity is about authenticating the contents the *user* sees, so
> that e.g. a file can be distributed to many computers and it can be
> authenticated regardless of exactly what other filesystem features were used
> when it was stored on disk.  Different computers may use:
> 
> - Different filesystems
> - Different compression algorithms (or no compression)
> - Different compression strengths, even with same algorithm
> - Different divisions of the file into compression units
> - Different encryption algorithms (or no encryption)
> - Different encryption keys, even with same algorithm
> - Different encryption nonces, even with same key
> 
> All those change the on-disk data; only the user-visible data stays the same.

Yes, I agree with the fs-verity use case, and I can see its limitations
as well. (I am not arguing against fs-verity in this topic at all...)

> 
> Bugs in filesystems may also be exploited regardless of fs-verity, as the
> attacker (able to manipulate on-disk image) can create a malicious file without
> fs-verity enabled, somewhere else on the filesystem.
> 
> If you actually want to authenticate the full filesystem image, you need to use
> dm-verity, which is designed for that.

Yes, but as a generic consideration, dm-verity has a limitation: it
requires the filesystem to be read-only.

And that raises another consideration -- whether verity belongs in the
block layer or in the filesystem. I think the answer fscrypt gives also
applies to verity in the filesystem (since we have dm-crypt as well):
we could consider multi-key read/write verification too, and so on...

I think all is fine at the moment. In this topic, again, I am just trying
to say that a generic post-read approach is hard and complicated; I am not
arguing about any specific feature.

Thanks,
Gao Xiang

> 
> - Eric


* Re: [PATCH 10/13] iomap: use a function pointer for dio submits
  2019-08-10  0:50                   ` Eric Biggers
@ 2019-08-10  1:34                     ` Gao Xiang
  0 siblings, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2019-08-10  1:34 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Matthew Wilcox, Gao Xiang, Dave Chinner, Goldwyn Rodrigues, hch,
	darrick.wong, linux-btrfs, ruansy.fnst, linux-fsdevel,
	linux-erofs, miaoxie

On Fri, Aug 09, 2019 at 05:50:40PM -0700, Eric Biggers wrote:
> On Fri, Aug 09, 2019 at 05:31:35PM -0700, Eric Biggers wrote:
> > On Sat, Aug 10, 2019 at 07:45:59AM +0800, Gao Xiang wrote:
> > > Hi Willy,
> > > 
> > > On Fri, Aug 09, 2019 at 01:45:17PM -0700, Matthew Wilcox wrote:
> > > > On Wed, Aug 07, 2019 at 10:49:36PM -0700, Eric Biggers wrote:
> > > > > On Thu, Aug 08, 2019 at 12:26:42PM +0800, Gao Xiang wrote:
> > > > > >     1. decrypt->verity->decompress
> > > > > > 
> > > > > >     2. verity->decompress->decrypt
> > > > > > 
> > > > > >     3. decompress->decrypt->verity
> > > > > > 
> > > > > >    1. and 2. could cause less computation since it processes
> > > > > >    compressed data, and the security is good enough since
> > > > > >    the behavior of decompression algorithm is deterministic.
> > > > > >    3 could cause more computation.
> > > > > > 
> > > > > > All I want to say is the post process is so complicated since we have
> > > > > > many selection if encryption, decompression, verification are all involved.
> > > > > > 
> > > > > > Maybe introduce a core subset to IOMAP is better for long-term
> > > > > > maintainment and better performance. And we should consider it
> > > > > > more carefully.
> > > > > > 
> > > > > 
> > > > > FWIW, the only order that actually makes sense is decrypt->decompress->verity.
> > > > 
> > > > That used to be true, but a paper in 2004 suggested it's not true.
> > > > Further work in this space in 2009 based on block ciphers:
> > > > https://arxiv.org/pdf/1009.1759
> > > > 
> > > > It looks like it'd be computationally expensive to do, but feasible.
> > > 
> > > Yes, maybe someone cares where encrypt is at due to their system design.
> > > 
> > > and I thought over these days, I have to repeat my thought of verity
> > > again :( the meaningful order ought to be "decrypt->verity->decompress"
> > > rather than "decrypt->decompress->verity" if compression is involved.
> > > 
> > > since most (de)compress algorithms are complex enough (allocate memory and
> > > do a lot of unsafe stuffes such as wildcopy) and even maybe unsafe by its
> > > design, we cannot do verity in the end for security consideration thus
> > > the whole system can be vulnerable by this order from malformed on-disk
> > > data. In other words, we need to verify on compressed data.
> > > 
> > > Fsverity is fine for me since most decrypt algorithms is stable and reliable
> > > and no compression by its design, but if some decrypt software algorithms is
> > > complicated enough, I'd suggest "verity->decrypt" as well to some extent.
> > > 
> > > Considering transformation "A->B->C->D->....->verity", if any of "A->B->C
> > > ->D->..." is attacked by the malformed on-disk data... It would crash or
> > > even root the whole operating system.
> > > 
> > > All in all, we have to verify data earlier in order to get trusted data
> > > for later complex transformation chains.
> > > 
> > > The performance benefit I described in my previous email, it seems no need
> > > to say again... please take them into consideration and I think it's no
> > > easy to get a unique generic post-read order for all real systems.
> > > 
> > 
> > While it would be nice to protect against filesystem bugs, it's not the point of
> > fs-verity.  fs-verity is about authenticating the contents the *user* sees, so
> > that e.g. a file can be distributed to many computers and it can be
> > authenticated regardless of exactly what other filesystem features were used
> > when it was stored on disk.  Different computers may use:
> > 
> > - Different filesystems
> > - Different compression algorithms (or no compression)
> > - Different compression strengths, even with same algorithm
> > - Different divisions of the file into compression units
> > - Different encryption algorithms (or no encryption)
> > - Different encryption keys, even with same algorithm
> > - Different encryption nonces, even with same key
> > 
> > All those change the on-disk data; only the user-visible data stays the same.
> > 
> > Bugs in filesystems may also be exploited regardless of fs-verity, as the
> > attacker (able to manipulate on-disk image) can create a malicious file without
> > fs-verity enabled, somewhere else on the filesystem.
> > 
> > If you actually want to authenticate the full filesystem image, you need to use
> > dm-verity, which is designed for that.
> > 
> 
> Also keep in mind that ideally the encryption layer would do authenticated
> encryption, so that during decrypt->decompress->verity the blocks only get past
> the decrypt step if they're authentically from someone with the encryption key.
> That's currently missing from fscrypt for practical reasons (read/write
> per-block metadata is really hard on most filesystems), but in an ideal world it
> would be there.  The fs-verity step is conceptually different, but it seems it's
> being conflated with this missing step.

Yes, but encryption may not be enabled for all post-read data, and not all
encryption algorithms are authenticated encryption...

I want to stop here :) I think it depends on real requirements, and I don't
want the generic post-read process to be too limited by specific chains...

Thanks,
Gao Xiang

> 
> - Eric


* Re: [PATCH 07/13] btrfs: basic direct read operation
  2019-08-02 22:00 ` [PATCH 07/13] btrfs: basic direct read operation Goldwyn Rodrigues
@ 2019-08-12 12:32   ` RITESH HARJANI
  0 siblings, 0 replies; 42+ messages in thread
From: RITESH HARJANI @ 2019-08-12 12:32 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-fsdevel
  Cc: linux-btrfs, hch, darrick.wong, ruansy.fnst, Goldwyn Rodrigues


On 8/3/19 3:30 AM, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> Add btrfs_dio_iomap_ops for iomap.begin() function. In order to
> accommodate dio reads, add a new function btrfs_file_read_iter()
> which would call btrfs_dio_iomap_read() for DIO reads and
> fallback to generic_file_read_iter otherwise.
>
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>   fs/btrfs/ctree.h |  2 ++
>   fs/btrfs/file.c  | 10 +++++++++-
>   fs/btrfs/iomap.c | 20 ++++++++++++++++++++
>   3 files changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 7a4ff524dc77..9eca2d576dd1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3247,7 +3247,9 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
>   loff_t btrfs_remap_file_range(struct file *file_in, loff_t pos_in,
>   			      struct file *file_out, loff_t pos_out,
>   			      loff_t len, unsigned int remap_flags);
> +/* iomap.c */
>   size_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from);
> +ssize_t btrfs_dio_iomap_read(struct kiocb *iocb, struct iov_iter *to);
>   
>   /* tree-defrag.c */
>   int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index f7087e28ac08..997eb152a35a 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2839,9 +2839,17 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
>   	return generic_file_open(inode, filp);
>   }
>   
> +static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	if (iocb->ki_flags & IOCB_DIRECT)
> +		return btrfs_dio_iomap_read(iocb, to);

No provision to fall back to a buffered I/O read? I am not sure from the
btrfs perspective, but previously generic_file_read_iter could fall through
to a buffered read, e.g. when direct I/O could not be completed (returned 0
or fewer bytes than requested).
Is it no longer required for btrfs, after moving to the iomap
infrastructure, to still fall back to a buffered read?
Correct me if I am missing anything here.

> +
> +	return generic_file_read_iter(iocb, to);
> +}
> +
>   const struct file_operations btrfs_file_operations = {
>   	.llseek		= btrfs_file_llseek,
> -	.read_iter      = generic_file_read_iter,
> +	.read_iter      = btrfs_file_read_iter,
>   	.splice_read	= generic_file_splice_read,
>   	.write_iter	= btrfs_file_write_iter,
>   	.mmap		= btrfs_file_mmap,
> diff --git a/fs/btrfs/iomap.c b/fs/btrfs/iomap.c
> index 879038e2f1a0..36df606fc028 100644
> --- a/fs/btrfs/iomap.c
> +++ b/fs/btrfs/iomap.c
> @@ -420,3 +420,23 @@ size_t btrfs_buffered_iomap_write(struct kiocb *iocb, struct iov_iter *from)
>   	return written;
>   }
>   
> +static int btrfs_dio_iomap_begin(struct inode *inode, loff_t pos,
> +		loff_t length, unsigned flags, struct iomap *iomap,
> +		struct iomap *srcmap)
> +{
> +	return get_iomap(inode, pos, length, iomap);
> +}
> +
> +static const struct iomap_ops btrfs_dio_iomap_ops = {
> +	.iomap_begin            = btrfs_dio_iomap_begin,
> +};
> +
> +ssize_t btrfs_dio_iomap_read(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	ssize_t ret;
> +	inode_lock_shared(inode);
> +	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, NULL);
> +	inode_unlock_shared(inode);
> +	return ret;
> +}




Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-02 22:00 [PATCH v2 0/13] Btrfs iomap Goldwyn Rodrigues
2019-08-02 22:00 ` [PATCH 01/13] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O Goldwyn Rodrigues
2019-08-03  0:39   ` Darrick J. Wong
2019-08-05  0:06   ` Dave Chinner
2019-08-02 22:00 ` [PATCH 02/13] iomap: Read page from srcmap for IOMAP_COW Goldwyn Rodrigues
2019-08-03  0:23   ` Darrick J. Wong
2019-08-04 23:52   ` Dave Chinner
2019-08-02 22:00 ` [PATCH 03/13] btrfs: Eliminate PagePrivate for btrfs data pages Goldwyn Rodrigues
2019-08-02 22:00 ` [PATCH 04/13] btrfs: Add a simple buffered iomap write Goldwyn Rodrigues
2019-08-05  0:11   ` Dave Chinner
2019-08-02 22:00 ` [PATCH 05/13] btrfs: Add CoW in iomap based writes Goldwyn Rodrigues
2019-08-05  0:13   ` Dave Chinner
2019-08-02 22:00 ` [PATCH 06/13] btrfs: remove buffered write code made unnecessary Goldwyn Rodrigues
2019-08-02 22:00 ` [PATCH 07/13] btrfs: basic direct read operation Goldwyn Rodrigues
2019-08-12 12:32   ` RITESH HARJANI
2019-08-02 22:00 ` [PATCH 08/13] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write() Goldwyn Rodrigues
2019-08-02 22:00 ` [PATCH 09/13] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent() Goldwyn Rodrigues
2019-08-02 22:00 ` [PATCH 10/13] iomap: use a function pointer for dio submits Goldwyn Rodrigues
2019-08-03  0:21   ` Darrick J. Wong
2019-08-05 16:08     ` Goldwyn Rodrigues
2019-08-04 23:43   ` Dave Chinner
2019-08-05 16:08     ` Goldwyn Rodrigues
2019-08-05 21:54       ` Dave Chinner
2019-08-08  4:26         ` Gao Xiang
2019-08-08  4:52           ` Gao Xiang
2019-08-08  5:49           ` Eric Biggers
2019-08-08  6:28             ` Gao Xiang
2019-08-08  8:16             ` Dave Chinner
2019-08-08  8:57               ` Gao Xiang
2019-08-08  9:29               ` Gao Xiang
2019-08-08 11:21                 ` Gao Xiang
2019-08-08 13:11                   ` Gao Xiang
2019-08-09 20:45             ` Matthew Wilcox
2019-08-09 23:45               ` Gao Xiang
2019-08-10  0:31                 ` Eric Biggers
2019-08-10  0:50                   ` Eric Biggers
2019-08-10  1:34                     ` Gao Xiang
2019-08-10  1:13                   ` Gao Xiang
2019-08-10  0:17               ` Eric Biggers
2019-08-02 22:00 ` [PATCH 11/13] btrfs: Use iomap_dio_rw for performing direct I/O writes Goldwyn Rodrigues
2019-08-02 22:00 ` [PATCH 12/13] btrfs: Remove btrfs_dio_data and __btrfs_direct_write Goldwyn Rodrigues
2019-08-02 22:00 ` [PATCH 13/13] btrfs: update inode size during bio completion Goldwyn Rodrigues
