Linux-BTRFS Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/5 v2] btrfs direct-io using iomap
@ 2019-11-26  3:14 Goldwyn Rodrigues
  2019-11-26  3:14 ` [PATCH 1/5] fs: Export generic_file_buffered_read() Goldwyn Rodrigues
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Goldwyn Rodrigues @ 2019-11-26  3:14 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, fdmanana

This is an effort to use iomap for direct I/O in btrfs. This would
change the call from __blockdev_direct_io() to iomap_dio_rw().

The main objective is to lose the buffer head and use bio defined by
iomap code, and hopefully to use more of generic-FS codebase.

These patches are based on xfs/iomap-for-next, though I tested it
against the patches on xfs/iomap-for-next on top of v5.4 (there are no
changes to existing iomap patches).

I have tested it against xfstests/btrfs.

Changes since v1
- Incorporated back the efficiency change for inode locking
- Review comments about coding style and git comments
- Merge related patches into one
- Direct read to go through btrfs_direct_IO()
- Removal of no longer used function dio_end_io()

 fs/btrfs/ctree.h      |    2 
 fs/btrfs/extent_io.c  |   33 ++++-----
 fs/btrfs/file.c       |   15 +++-
 fs/btrfs/inode.c      |  171 +++++++++++++++++++++++---------------------------
 fs/direct-io.c        |   19 -----
 fs/iomap/direct-io.c  |   14 ++--
 include/linux/fs.h    |    3 
 include/linux/iomap.h |    2 
 mm/filemap.c          |   13 ++-
 9 files changed, 131 insertions(+), 141 deletions(-)

--
Goldwyn


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/5] fs: Export generic_file_buffered_read()
  2019-11-26  3:14 [PATCH 0/5 v2] btrfs direct-io using iomap Goldwyn Rodrigues
@ 2019-11-26  3:14 ` Goldwyn Rodrigues
  2019-11-26  3:43   ` Matthew Wilcox
  2019-11-26 10:10   ` Johannes Thumshirn
  2019-11-26  3:14 ` [PATCH 2/5] iomap: add a filesystem hook for direct I/O bio submission Goldwyn Rodrigues
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Goldwyn Rodrigues @ 2019-11-26  3:14 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Export generic_file_buffered_read() to be used to
supplement incomplete direct reads.

While we are at it, correct the comments and variable names.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 13 +++++++------
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e0d909d35763..5bc75cff3536 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3095,6 +3095,8 @@ extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
 extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 				    struct file *file_out, loff_t pos_out,
 				    size_t *count, unsigned int flags);
+extern ssize_t generic_file_buffered_read(struct kiocb *iocb,
+		struct iov_iter *to, ssize_t already_read);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 85b7d087eb45..6a1be2fcdfe7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1994,7 +1994,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * generic_file_buffered_read - generic file read routine
  * @iocb:	the iocb to read
  * @iter:	data destination
- * @written:	already copied
+ * @copied:	already copied
  *
  * This is a generic file read routine, and uses the
  * mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -2003,11 +2003,11 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  *
  * Return:
- * * total number of bytes copied, including those the were already @written
+ * * total number of bytes copied, including those the were @copied
  * * negative error code if nothing was copied
  */
-static ssize_t generic_file_buffered_read(struct kiocb *iocb,
-		struct iov_iter *iter, ssize_t written)
+ssize_t generic_file_buffered_read(struct kiocb *iocb,
+		struct iov_iter *iter, ssize_t copied)
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
@@ -2148,7 +2148,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		prev_offset = offset;
 
 		put_page(page);
-		written += ret;
+		copied += ret;
 		if (!iov_iter_count(iter))
 			goto out;
 		if (ret < nr) {
@@ -2256,8 +2256,9 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 	*ppos = ((loff_t)index << PAGE_SHIFT) + offset;
 	file_accessed(filp);
-	return written ? written : error;
+	return copied ? copied : error;
 }
+EXPORT_SYMBOL_GPL(generic_file_buffered_read);
 
 /**
  * generic_file_read_iter - generic filesystem read routine
-- 
2.16.4


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 2/5] iomap: add a filesystem hook for direct I/O bio submission
  2019-11-26  3:14 [PATCH 0/5 v2] btrfs direct-io using iomap Goldwyn Rodrigues
  2019-11-26  3:14 ` [PATCH 1/5] fs: Export generic_file_buffered_read() Goldwyn Rodrigues
@ 2019-11-26  3:14 ` Goldwyn Rodrigues
  2019-11-26 10:53   ` Nikolay Borisov
  2019-11-26 11:51   ` Johannes Thumshirn
  2019-11-26  3:14 ` [PATCH 3/5] btrfs: Switch to iomap_dio_rw() for dio Goldwyn Rodrigues
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Goldwyn Rodrigues @ 2019-11-26  3:14 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This helps filesystems to perform tasks on the bio while submitting for
I/O. This could be post-write operations such as data CRC or data
replication for fs-handled RAID.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/iomap/direct-io.c  | 14 +++++++++-----
 include/linux/iomap.h |  2 ++
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b77ec2acb066..2dac380f88a0 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
 EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
 
 static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
-		struct bio *bio)
+		struct bio *bio, loff_t pos)
 {
 	atomic_inc(&dio->ref);
 
@@ -67,7 +67,11 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
 		bio_set_polled(bio, dio->iocb);
 
 	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
-	dio->submit.cookie = submit_bio(bio);
+	if (dio->dops && dio->dops->submit_io)
+		dio->submit.cookie = dio->dops->submit_io(bio,
+				dio->iocb->ki_filp, pos);
+	else
+		dio->submit.cookie = submit_bio(bio);
 }
 
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
@@ -191,7 +195,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	get_page(page);
 	__bio_add_page(bio, page, len, 0);
 	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
-	iomap_dio_submit_bio(dio, iomap, bio);
+	iomap_dio_submit_bio(dio, iomap, bio, pos);
 }
 
 static loff_t
@@ -297,11 +301,11 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		iov_iter_advance(dio->submit.iter, n);
 
 		dio->size += n;
-		pos += n;
 		copied += n;
 
 		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
-		iomap_dio_submit_bio(dio, iomap, bio);
+		iomap_dio_submit_bio(dio, iomap, bio, pos);
+		pos += n;
 	} while (nr_pages);
 
 	/*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b09463dae0d..2b093a23ef1c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -252,6 +252,8 @@ int iomap_writepages(struct address_space *mapping,
 struct iomap_dio_ops {
 	int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
 		      unsigned flags);
+	blk_qc_t (*submit_io)(struct bio *bio, struct file *file,
+			  loff_t file_offset);
 };
 
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
-- 
2.16.4


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 3/5] btrfs: Switch to iomap_dio_rw() for dio
  2019-11-26  3:14 [PATCH 0/5 v2] btrfs direct-io using iomap Goldwyn Rodrigues
  2019-11-26  3:14 ` [PATCH 1/5] fs: Export generic_file_buffered_read() Goldwyn Rodrigues
  2019-11-26  3:14 ` [PATCH 2/5] iomap: add a filesystem hook for direct I/O bio submission Goldwyn Rodrigues
@ 2019-11-26  3:14 ` Goldwyn Rodrigues
  2019-11-26 11:54   ` Nikolay Borisov
  2019-11-26 12:13   ` Nikolay Borisov
  2019-11-26  3:14 ` [PATCH 4/5] btrfs: Wait for extent bits to release page Goldwyn Rodrigues
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Goldwyn Rodrigues @ 2019-11-26  3:14 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Switch from __blockdev_direct_IO() to iomap_dio_rw().
Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
as iomap_begin() for iomap direct I/O functions. This function
allocates and locks all the blocks required for the I/O.
btrfs_submit_direct() is used as the submit_io() hook for direct I/O
ops.

Since we need direct I/O reads to go through iomap_dio_rw(), we change
file_operations.read_iter() to a btrfs_file_read_iter() which calls
btrfs_direct_IO() for direct reads and falls back to
generic_file_buffered_read() for incomplete reads and buffered reads.

We don't need address_space.direct_IO() anymore so set it to noop.
Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
capable of direct I/O reads from a hole, so we don't need to return
-ENOENT.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |   2 +
 fs/btrfs/file.c  |  15 ++++-
 fs/btrfs/inode.c | 171 ++++++++++++++++++++++++++-----------------------------
 3 files changed, 97 insertions(+), 91 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fe2b8765d9e6..6f038ba34512 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2903,6 +2903,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
 extern const struct dentry_operations btrfs_dentry_operations;
+ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter);
+ssize_t btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter);
 
 /* ioctl.c */
 long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 435a502a3226..960a0767f532 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1834,7 +1834,7 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	loff_t endbyte;
 	int err;
 
-	written = generic_file_direct_write(iocb, from);
+	written = btrfs_dio_write(iocb, from);
 
 	if (written < 0 || !iov_iter_count(from))
 		return written;
@@ -3452,9 +3452,20 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 	return generic_file_open(inode, filp);
 }
 
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	ssize_t ret = 0;
+	if (iocb->ki_flags & IOCB_DIRECT)
+		ret = btrfs_dio_read(iocb, to);
+	if (ret < 0)
+		return ret;
+
+	return generic_file_buffered_read(iocb, to, ret);
+}
+
 const struct file_operations btrfs_file_operations = {
 	.llseek		= btrfs_file_llseek,
-	.read_iter      = generic_file_read_iter,
+	.read_iter      = btrfs_file_read_iter,
 	.splice_read	= generic_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.mmap		= btrfs_file_mmap,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 015910079e73..9a39a53b9068 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -29,6 +29,7 @@
 #include <linux/iversion.h>
 #include <linux/swap.h>
 #include <linux/sched/mm.h>
+#include <linux/iomap.h>
 #include <asm/unaligned.h>
 #include "misc.h"
 #include "ctree.h"
@@ -7619,28 +7620,7 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
 }
 
 
-static int btrfs_get_blocks_direct_read(struct extent_map *em,
-					struct buffer_head *bh_result,
-					struct inode *inode,
-					u64 start, u64 len)
-{
-	if (em->block_start == EXTENT_MAP_HOLE ||
-			test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
-		return -ENOENT;
-
-	len = min(len, em->len - (start - em->start));
-
-	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
-		inode->i_blkbits;
-	bh_result->b_size = len;
-	bh_result->b_bdev = em->bdev;
-	set_buffer_mapped(bh_result);
-
-	return 0;
-}
-
 static int btrfs_get_blocks_direct_write(struct extent_map **map,
-					 struct buffer_head *bh_result,
 					 struct inode *inode,
 					 struct btrfs_dio_data *dio_data,
 					 u64 start, u64 len)
@@ -7702,7 +7682,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	}
 
 	/* this will cow the extent */
-	len = bh_result->b_size;
 	free_extent_map(em);
 	*map = em = btrfs_new_extent_direct(inode, start, len);
 	if (IS_ERR(em)) {
@@ -7713,15 +7692,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	len = min(len, em->len - (start - em->start));
 
 skip_cow:
-	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
-		inode->i_blkbits;
-	bh_result->b_size = len;
-	bh_result->b_bdev = em->bdev;
-	set_buffer_mapped(bh_result);
-
-	if (!test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
-		set_buffer_new(bh_result);
-
 	/*
 	 * Need to update the i_size under the extent lock so buffered
 	 * readers will get the updated i_size when we unlock.
@@ -7737,17 +7707,19 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	return ret;
 }
 
-static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
-				   struct buffer_head *bh_result, int create)
+static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
+		loff_t length, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em;
 	struct extent_state *cached_state = NULL;
 	struct btrfs_dio_data *dio_data = NULL;
-	u64 start = iblock << inode->i_blkbits;
 	u64 lockstart, lockend;
-	u64 len = bh_result->b_size;
+	int create = flags & IOMAP_WRITE;
 	int ret = 0;
+	u64 len = length;
+	bool unlock_extents = false;
 
 	if (!create)
 		len = min_t(u64, len, fs_info->sectorsize);
@@ -7755,6 +7727,17 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 	lockstart = start;
 	lockend = start + len - 1;
 
+	/*
+	 * The generic stuff only does filemap_write_and_wait_range, which
+	 * isn't enough if we've written compressed pages to this area, so
+	 * we need to flush the dirty pages again to make absolutely sure
+	 * that any outstanding dirty pages are on disk.
+	 */
+	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
+		     &BTRFS_I(inode)->runtime_flags))
+		ret = filemap_fdatawrite_range(inode->i_mapping, start,
+					 start + length - 1);
+
 	if (current->journal_info) {
 		/*
 		 * Need to pull our outstanding extents and set journal_info to NULL so
@@ -7803,35 +7786,45 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 	}
 
 	if (create) {
-		ret = btrfs_get_blocks_direct_write(&em, bh_result, inode,
+		ret = btrfs_get_blocks_direct_write(&em, inode,
 						    dio_data, start, len);
 		if (ret < 0)
 			goto unlock_err;
-
-		unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
-				     lockend, &cached_state);
+		unlock_extents = true;
+	} else if (em->block_start == EXTENT_MAP_HOLE ||
+			test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
+		/* Unlock in case of direct reading from a hole */
+		unlock_extents = true;
 	} else {
-		ret = btrfs_get_blocks_direct_read(em, bh_result, inode,
-						   start, len);
-		/* Can be negative only if we read from a hole */
-		if (ret < 0) {
-			ret = 0;
-			free_extent_map(em);
-			goto unlock_err;
-		}
+
+		len = min(len, em->len - (start - em->start));
 		/*
 		 * We need to unlock only the end area that we aren't using.
 		 * The rest is going to be unlocked by the endio routine.
 		 */
-		lockstart = start + bh_result->b_size;
-		if (lockstart < lockend) {
-			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
-					     lockstart, lockend, &cached_state);
-		} else {
+		lockstart = start + len;
+		if (lockstart < lockend)
+			unlock_extents = true;
+		else
 			free_extent_state(cached_state);
-		}
 	}
 
+	if (unlock_extents)
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree,
+				lockstart, lockend, &cached_state);
+
+	if ((em->block_start == EXTENT_MAP_HOLE) ||
+	    (test_bit(EXTENT_FLAG_PREALLOC, &em->flags) && !create)) {
+		iomap->addr = IOMAP_NULL_ADDR;
+		iomap->type = IOMAP_HOLE;
+	} else {
+		iomap->addr = em->block_start;
+		iomap->type = IOMAP_MAPPED;
+	}
+	iomap->offset = em->start;
+	iomap->bdev = em->bdev;
+	iomap->length = em->len;
+
 	free_extent_map(em);
 
 	return 0;
@@ -8199,9 +8192,8 @@ static void btrfs_endio_direct_read(struct bio *bio)
 	kfree(dip);
 
 	dio_bio->bi_status = err;
-	dio_end_io(dio_bio);
+	bio_endio(dio_bio);
 	btrfs_io_bio_free_csum(io_bio);
-	bio_put(bio);
 }
 
 static void __endio_write_update_ordered(struct inode *inode,
@@ -8263,8 +8255,7 @@ static void btrfs_endio_direct_write(struct bio *bio)
 	kfree(dip);
 
 	dio_bio->bi_status = bio->bi_status;
-	dio_end_io(dio_bio);
-	bio_put(bio);
+	bio_endio(dio_bio);
 }
 
 static blk_status_t btrfs_submit_bio_start_direct_io(void *private_data,
@@ -8496,9 +8487,10 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 	return 0;
 }
 
-static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
+static blk_qc_t btrfs_submit_direct(struct bio *dio_bio, struct file *file,
 				loff_t file_offset)
 {
+	struct inode *inode = file_inode(file);
 	struct btrfs_dio_private *dip = NULL;
 	struct bio *bio = NULL;
 	struct btrfs_io_bio *io_bio;
@@ -8549,7 +8541,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 
 	ret = btrfs_submit_direct_hook(dip);
 	if (!ret)
-		return;
+		return BLK_QC_T_NONE;
 
 	btrfs_io_bio_free_csum(io_bio);
 
@@ -8568,7 +8560,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		/*
 		 * The end io callbacks free our dip, do the final put on bio
 		 * and all the cleanup and final put for dio_bio (through
-		 * dio_end_io()).
+		 * end_io()).
 		 */
 		dip = NULL;
 		bio = NULL;
@@ -8587,11 +8579,12 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		 * Releases and cleans up our dio_bio, no need to bio_put()
 		 * nor bio_endio()/bio_io_error() against dio_bio.
 		 */
-		dio_end_io(dio_bio);
+		bio_endio(dio_bio);
 	}
 	if (bio)
 		bio_put(bio);
 	kfree(dip);
+	return BLK_QC_T_NONE;
 }
 
 static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
@@ -8627,6 +8620,14 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
 	return retval;
 }
 
+static const struct iomap_ops btrfs_dio_iomap_ops = {
+	.iomap_begin            = btrfs_dio_iomap_begin,
+};
+
+static const struct iomap_dio_ops btrfs_dops = {
+	.submit_io		= btrfs_submit_direct,
+};
+
 static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct file *file = iocb->ki_filp;
@@ -8636,28 +8637,13 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	struct extent_changeset *data_reserved = NULL;
 	loff_t offset = iocb->ki_pos;
 	size_t count = 0;
-	int flags = 0;
-	bool wakeup = true;
 	bool relock = false;
 	ssize_t ret;
 
 	if (check_direct_IO(fs_info, iter, offset))
 		return 0;
 
-	inode_dio_begin(inode);
-
-	/*
-	 * The generic stuff only does filemap_write_and_wait_range, which
-	 * isn't enough if we've written compressed pages to this area, so
-	 * we need to flush the dirty pages again to make absolutely sure
-	 * that any outstanding dirty pages are on disk.
-	 */
 	count = iov_iter_count(iter);
-	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-		     &BTRFS_I(inode)->runtime_flags))
-		filemap_fdatawrite_range(inode->i_mapping, offset,
-					 offset + count - 1);
-
 	if (iov_iter_rw(iter) == WRITE) {
 		/*
 		 * If the write DIO is beyond the EOF, we need update
@@ -8688,17 +8674,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		dio_data.unsubmitted_oe_range_end = (u64)offset;
 		current->journal_info = &dio_data;
 		down_read(&BTRFS_I(inode)->dio_sem);
-	} else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
-				     &BTRFS_I(inode)->runtime_flags)) {
-		inode_dio_end(inode);
-		flags = DIO_LOCKING | DIO_SKIP_HOLES;
-		wakeup = false;
 	}
 
-	ret = __blockdev_direct_IO(iocb, inode,
-				   fs_info->fs_devices->latest_bdev,
-				   iter, btrfs_get_blocks_direct, NULL,
-				   btrfs_submit_direct, flags);
+	ret = iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dops,
+			is_sync_kiocb(iocb));
+
 	if (iov_iter_rw(iter) == WRITE) {
 		up_read(&BTRFS_I(inode)->dio_sem);
 		current->journal_info = NULL;
@@ -8725,15 +8705,28 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		btrfs_delalloc_release_extents(BTRFS_I(inode), count);
 	}
 out:
-	if (wakeup)
-		inode_dio_end(inode);
 	if (relock)
 		inode_lock(inode);
-
 	extent_changeset_free(data_reserved);
 	return ret;
 }
 
+ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+
+	inode_lock_shared(inode);
+	ret = btrfs_direct_IO(iocb, to);
+	inode_unlock_shared(inode);
+	return ret;
+}
+
+ssize_t btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter)
+{
+	return btrfs_direct_IO(iocb, iter);
+}
+
 #define BTRFS_FIEMAP_FLAGS	(FIEMAP_FLAG_SYNC)
 
 static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
@@ -11019,7 +11012,7 @@ static const struct address_space_operations btrfs_aops = {
 	.writepage	= btrfs_writepage,
 	.writepages	= btrfs_writepages,
 	.readpages	= btrfs_readpages,
-	.direct_IO	= btrfs_direct_IO,
+	.direct_IO	= noop_direct_IO,
 	.invalidatepage = btrfs_invalidatepage,
 	.releasepage	= btrfs_releasepage,
 	.set_page_dirty	= btrfs_set_page_dirty,
-- 
2.16.4


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 4/5] btrfs: Wait for extent bits to release page
  2019-11-26  3:14 [PATCH 0/5 v2] btrfs direct-io using iomap Goldwyn Rodrigues
                   ` (2 preceding siblings ...)
  2019-11-26  3:14 ` [PATCH 3/5] btrfs: Switch to iomap_dio_rw() for dio Goldwyn Rodrigues
@ 2019-11-26  3:14 ` Goldwyn Rodrigues
  2019-11-26 12:38   ` Nikolay Borisov
  2019-11-26 12:42   ` Johannes Thumshirn
  2019-11-26  3:14 ` [PATCH 5/5] fs: Remove dio_end_io() Goldwyn Rodrigues
  2019-11-27 15:51 ` [PATCH 0/5 v2] btrfs direct-io using iomap David Sterba
  5 siblings, 2 replies; 18+ messages in thread
From: Goldwyn Rodrigues @ 2019-11-26  3:14 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

While trying to release a page, the extent containing the page may be locked
which would stop the page from being released. Wait for the
extent lock to be cleared, if blocking is allowed and then clear
the bits.

While we are at it, clean the code of try_release_extent_state() to make
it simpler.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/extent_io.c | 33 ++++++++++++++-------------------
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cceaf05aada2..a7c32276702d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4367,28 +4367,23 @@ static int try_release_extent_state(struct extent_io_tree *tree,
 {
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_SIZE - 1;
-	int ret = 1;
 
 	if (test_range_bit(tree, start, end, EXTENT_LOCKED, 0, NULL)) {
-		ret = 0;
-	} else {
-		/*
-		 * at this point we can safely clear everything except the
-		 * locked bit and the nodatasum bit
-		 */
-		ret = __clear_extent_bit(tree, start, end,
-				 ~(EXTENT_LOCKED | EXTENT_NODATASUM),
-				 0, 0, NULL, mask, NULL);
-
-		/* if clear_extent_bit failed for enomem reasons,
-		 * we can't allow the release to continue.
-		 */
-		if (ret < 0)
-			ret = 0;
-		else
-			ret = 1;
+		if (!gfpflags_allow_blocking(mask))
+			return 0;
+		wait_extent_bit(tree, start, end, EXTENT_LOCKED);
 	}
-	return ret;
+	/*
+	 * At this point we can safely clear everything except the locked and
+	 * nodatasum bits. If clear_extent_bit failed due to -ENOMEM,
+	 * don't allow release.
+	 */
+	if (__clear_extent_bit(tree, start, end,
+				~(EXTENT_LOCKED | EXTENT_NODATASUM), 0, 0,
+				NULL, mask, NULL) < 0)
+		return 0;
+
+	return 1;
 }
 
 /*
-- 
2.16.4


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 5/5] fs: Remove dio_end_io()
  2019-11-26  3:14 [PATCH 0/5 v2] btrfs direct-io using iomap Goldwyn Rodrigues
                   ` (3 preceding siblings ...)
  2019-11-26  3:14 ` [PATCH 4/5] btrfs: Wait for extent bits to release page Goldwyn Rodrigues
@ 2019-11-26  3:14 ` Goldwyn Rodrigues
  2019-11-26 12:39   ` Nikolay Borisov
  2019-11-26 12:50   ` Johannes Thumshirn
  2019-11-27 15:51 ` [PATCH 0/5 v2] btrfs direct-io using iomap David Sterba
  5 siblings, 2 replies; 18+ messages in thread
From: Goldwyn Rodrigues @ 2019-11-26  3:14 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Since we removed the last user of dio_end_io(), remove the helper
function dio_end_io().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/direct-io.c     | 19 -------------------
 include/linux/fs.h |  1 -
 2 files changed, 20 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9329ced91f1d..9d6fc6467d9c 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -405,25 +405,6 @@ static void dio_bio_end_io(struct bio *bio)
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 }
 
-/**
- * dio_end_io - handle the end io action for the given bio
- * @bio: The direct io bio thats being completed
- *
- * This is meant to be called by any filesystem that uses their own dio_submit_t
- * so that the DIO specific endio actions are dealt with after the filesystem
- * has done it's completion work.
- */
-void dio_end_io(struct bio *bio)
-{
-	struct dio *dio = bio->bi_private;
-
-	if (dio->is_async)
-		dio_bio_end_aio(bio);
-	else
-		dio_bio_end_io(bio);
-}
-EXPORT_SYMBOL_GPL(dio_end_io);
-
 static inline void
 dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	      struct block_device *bdev,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bc75cff3536..eab2c26fb32a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3154,7 +3154,6 @@ enum {
 	DIO_SKIP_HOLES	= 0x02,
 };
 
-void dio_end_io(struct bio *bio);
 void dio_warn_stale_pagecache(struct file *filp);
 
 ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
-- 
2.16.4


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] fs: Export generic_file_buffered_read()
  2019-11-26  3:14 ` [PATCH 1/5] fs: Export generic_file_buffered_read() Goldwyn Rodrigues
@ 2019-11-26  3:43   ` Matthew Wilcox
  2019-11-26 10:10   ` Johannes Thumshirn
  1 sibling, 0 replies; 18+ messages in thread
From: Matthew Wilcox @ 2019-11-26  3:43 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, linux-fsdevel, hch, darrick.wong, fdmanana,
	Goldwyn Rodrigues

On Mon, Nov 25, 2019 at 09:14:52PM -0600, Goldwyn Rodrigues wrote:
> @@ -2003,11 +2003,11 @@ static void shrink_readahead_size_eio(struct file *filp,
>   * of the logic when it comes to error handling etc.
>   *
>   * Return:
> - * * total number of bytes copied, including those the were already @written
> + * * total number of bytes copied, including those the were @copied

"those that were"


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] fs: Export generic_file_buffered_read()
  2019-11-26  3:14 ` [PATCH 1/5] fs: Export generic_file_buffered_read() Goldwyn Rodrigues
  2019-11-26  3:43   ` Matthew Wilcox
@ 2019-11-26 10:10   ` Johannes Thumshirn
  2019-11-26 10:43     ` Johannes Thumshirn
  1 sibling, 1 reply; 18+ messages in thread
From: Johannes Thumshirn @ 2019-11-26 10:10 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs
  Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues

Apart from Willy's comment
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] fs: Export generic_file_buffered_read()
  2019-11-26 10:10   ` Johannes Thumshirn
@ 2019-11-26 10:43     ` Johannes Thumshirn
  0 siblings, 0 replies; 18+ messages in thread
From: Johannes Thumshirn @ 2019-11-26 10:43 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs
  Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues

On 26/11/2019 11:10, Johannes Thumshirn wrote:
> Apart from Willy's comment
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

On the other hand, you could as well do:

static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	ssize_t ret = 0;
+	if (iocb->ki_flags & IOCB_DIRECT)
+		ret = btrfs_dio_read(iocb, to);
+	if (ret < 0)
+		return ret;
+
+	return generic_file_read_iter(iocb, ret);
+}

In patch 3/5, if IOCB_DIRECT is set we're branching into
btrfs_dio_read() above and if not generic_file_read_iter() will then
directly call generic_file_buffered_read()

-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/5] iomap: add a filesystem hook for direct I/O bio submission
  2019-11-26  3:14 ` [PATCH 2/5] iomap: add a filesystem hook for direct I/O bio submission Goldwyn Rodrigues
@ 2019-11-26 10:53   ` Nikolay Borisov
  2019-11-26 11:51   ` Johannes Thumshirn
  1 sibling, 0 replies; 18+ messages in thread
From: Nikolay Borisov @ 2019-11-26 10:53 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs
  Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues



On 26.11.19 г. 5:14 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> This helps filesystems to perform tasks on the bio while submitting for
> I/O. This could be post-write operations such as data CRC or data
> replication for fs-handled RAID.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/5] iomap: add a filesystem hook for direct I/O bio submission
  2019-11-26  3:14 ` [PATCH 2/5] iomap: add a filesystem hook for direct I/O bio submission Goldwyn Rodrigues
  2019-11-26 10:53   ` Nikolay Borisov
@ 2019-11-26 11:51   ` Johannes Thumshirn
  1 sibling, 0 replies; 18+ messages in thread
From: Johannes Thumshirn @ 2019-11-26 11:51 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs
  Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] btrfs: Switch to iomap_dio_rw() for dio
  2019-11-26  3:14 ` [PATCH 3/5] btrfs: Switch to iomap_dio_rw() for dio Goldwyn Rodrigues
@ 2019-11-26 11:54   ` Nikolay Borisov
  2019-11-26 12:13   ` Nikolay Borisov
  1 sibling, 0 replies; 18+ messages in thread
From: Nikolay Borisov @ 2019-11-26 11:54 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs
  Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues



On 26.11.19 г. 5:14 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Switch from __blockdev_direct_IO() to iomap_dio_rw().
> Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
> as iomap_begin() for iomap direct I/O functions. This function
> allocates and locks all the blocks required for the I/O.
> btrfs_submit_direct() is used as the submit_io() hook for direct I/O
> ops.
> 
> Since we need direct I/O reads to go through iomap_dio_rw(), we change
> file_operations.read_iter() to a btrfs_file_read_iter() which calls
> btrfs_direct_IO() for direct reads and falls back to
> generic_file_buffered_read() for incomplete reads and buffered reads.
> 
> We don't need address_space.direct_IO() anymore so set it to noop.
> Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
> capable of direct I/O reads from a hole, so we don't need to return
> -ENOENT.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/btrfs/ctree.h |   2 +
>  fs/btrfs/file.c  |  15 ++++-
>  fs/btrfs/inode.c | 171 ++++++++++++++++++++++++++-----------------------------
>  3 files changed, 97 insertions(+), 91 deletions(-)
> 

<snip>

> -static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
> -				   struct buffer_head *bh_result, int create)
> +static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> +		loff_t length, unsigned flags, struct iomap *iomap,
> +		struct iomap *srcmap)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct extent_map *em;
>  	struct extent_state *cached_state = NULL;
>  	struct btrfs_dio_data *dio_data = NULL;
> -	u64 start = iblock << inode->i_blkbits;
>  	u64 lockstart, lockend;
> -	u64 len = bh_result->b_size;
> +	int create = flags & IOMAP_WRITE;

nit: Imo this should be turned into a bool and renamed to write or
is_write. Create implies we are always creating blocks which is not true
if we are doing overwrite. This has been a misnomer ever since it was
introduced. We really care to distinguish read vs write.

<snip>

> @@ -8636,28 +8637,13 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  	struct extent_changeset *data_reserved = NULL;
>  	loff_t offset = iocb->ki_pos;
>  	size_t count = 0;
> -	int flags = 0;
> -	bool wakeup = true;
>  	bool relock = false;
>  	ssize_t ret;
>  
>  	if (check_direct_IO(fs_info, iter, offset))
>  		return 0;
>  
> -	inode_dio_begin(inode);
> -
> -	/*
> -	 * The generic stuff only does filemap_write_and_wait_range, which
> -	 * isn't enough if we've written compressed pages to this area, so
> -	 * we need to flush the dirty pages again to make absolutely sure
> -	 * that any outstanding dirty pages are on disk.
> -	 */
>  	count = iov_iter_count(iter);
> -	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> -		     &BTRFS_I(inode)->runtime_flags))
> -		filemap_fdatawrite_range(inode->i_mapping, offset,
> -					 offset + count - 1);
> -
>  	if (iov_iter_rw(iter) == WRITE) {
>  		/*
>  		 * If the write DIO is beyond the EOF, we need update
> @@ -8688,17 +8674,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  		dio_data.unsubmitted_oe_range_end = (u64)offset;
>  		current->journal_info = &dio_data;
>  		down_read(&BTRFS_I(inode)->dio_sem);
> -	} else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
> -				     &BTRFS_I(inode)->runtime_flags)) {

This is the sole reader of BTRFS_INODE_READDIO_NEED_LOCK flag. Have you
verified this is correct w.r.t btrfs_setsize. I'm very much in favor or
removing the subtle behavior this flag introduced.

On the other hand, with iomap we no longer have control over when
inode_dio_end is called e.g. inode_dio_begin is called before calling
iomap_apply and then it's finished in iomap_dio_complete. Also for DIO
reads you now hold the inode lock which is also held during setattr
(notify_change calls ->setattr callback and it has a
WARN_ON_ONCE(!inode_is_locked(inode)); at the beginning) so perhaps you
can simply delete relevant code in btrfs_setattr as well.

> -		inode_dio_end(inode);
> -		flags = DIO_LOCKING | DIO_SKIP_HOLES;
> -		wakeup = false;
>  	}

<snip>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] btrfs: Switch to iomap_dio_rw() for dio
  2019-11-26  3:14 ` [PATCH 3/5] btrfs: Switch to iomap_dio_rw() for dio Goldwyn Rodrigues
  2019-11-26 11:54   ` Nikolay Borisov
@ 2019-11-26 12:13   ` Nikolay Borisov
  1 sibling, 0 replies; 18+ messages in thread
From: Nikolay Borisov @ 2019-11-26 12:13 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs
  Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues



On 26.11.19 г. 5:14 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Switch from __blockdev_direct_IO() to iomap_dio_rw().
> Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
> as iomap_begin() for iomap direct I/O functions. This function
> allocates and locks all the blocks required for the I/O.
> btrfs_submit_direct() is used as the submit_io() hook for direct I/O
> ops.
> 
> Since we need direct I/O reads to go through iomap_dio_rw(), we change
> file_operations.read_iter() to a btrfs_file_read_iter() which calls
> btrfs_direct_IO() for direct reads and falls back to
> generic_file_buffered_read() for incomplete reads and buffered reads.
> 
> We don't need address_space.direct_IO() anymore so set it to noop.
> Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
> capable of direct I/O reads from a hole, so we don't need to return
> -ENOENT.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/btrfs/ctree.h |   2 +
>  fs/btrfs/file.c  |  15 ++++-
>  fs/btrfs/inode.c | 171 ++++++++++++++++++++++++++-----------------------------
>  3 files changed, 97 insertions(+), 91 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index fe2b8765d9e6..6f038ba34512 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2903,6 +2903,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
>  void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
>  					  u64 end, int uptodate);
>  extern const struct dentry_operations btrfs_dentry_operations;
> +ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter);
> +ssize_t btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter);
>  
>  /* ioctl.c */
>  long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 435a502a3226..960a0767f532 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1834,7 +1834,7 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	loff_t endbyte;
>  	int err;
>  
> -	written = generic_file_direct_write(iocb, from);
> +	written = btrfs_dio_write(iocb, from);
>  
>  	if (written < 0 || !iov_iter_count(from))
>  		return written;
> @@ -3452,9 +3452,20 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
>  	return generic_file_open(inode, filp);
>  }
>  
> +static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	ssize_t ret = 0;
> +	if (iocb->ki_flags & IOCB_DIRECT)
> +		ret = btrfs_dio_read(iocb, to);
> +	if (ret < 0)
> +		return ret;
> +
> +	return generic_file_buffered_read(iocb, to, ret);
> +}
> +
>  const struct file_operations btrfs_file_operations = {
>  	.llseek		= btrfs_file_llseek,
> -	.read_iter      = generic_file_read_iter,
> +	.read_iter      = btrfs_file_read_iter,
>  	.splice_read	= generic_file_splice_read,
>  	.write_iter	= btrfs_file_write_iter,
>  	.mmap		= btrfs_file_mmap,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 015910079e73..9a39a53b9068 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -29,6 +29,7 @@
>  #include <linux/iversion.h>
>  #include <linux/swap.h>
>  #include <linux/sched/mm.h>
> +#include <linux/iomap.h>
>  #include <asm/unaligned.h>
>  #include "misc.h"
>  #include "ctree.h"
> @@ -7619,28 +7620,7 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
>  }
>  
>  
> -static int btrfs_get_blocks_direct_read(struct extent_map *em,
> -					struct buffer_head *bh_result,
> -					struct inode *inode,
> -					u64 start, u64 len)
> -{
> -	if (em->block_start == EXTENT_MAP_HOLE ||
> -			test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> -		return -ENOENT;
> -
> -	len = min(len, em->len - (start - em->start));
> -
> -	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
> -		inode->i_blkbits;
> -	bh_result->b_size = len;
> -	bh_result->b_bdev = em->bdev;
> -	set_buffer_mapped(bh_result);
> -
> -	return 0;
> -}
> -
>  static int btrfs_get_blocks_direct_write(struct extent_map **map,
> -					 struct buffer_head *bh_result,
>  					 struct inode *inode,
>  					 struct btrfs_dio_data *dio_data,
>  					 u64 start, u64 len)
> @@ -7702,7 +7682,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>  	}
>  
>  	/* this will cow the extent */
> -	len = bh_result->b_size;
>  	free_extent_map(em);
>  	*map = em = btrfs_new_extent_direct(inode, start, len);
>  	if (IS_ERR(em)) {
> @@ -7713,15 +7692,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>  	len = min(len, em->len - (start - em->start));
>  
>  skip_cow:
> -	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
> -		inode->i_blkbits;
> -	bh_result->b_size = len;
> -	bh_result->b_bdev = em->bdev;
> -	set_buffer_mapped(bh_result);
> -
> -	if (!test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
> -		set_buffer_new(bh_result);
> -
>  	/*
>  	 * Need to update the i_size under the extent lock so buffered
>  	 * readers will get the updated i_size when we unlock.
> @@ -7737,17 +7707,19 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>  	return ret;
>  }
>  
> -static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
> -				   struct buffer_head *bh_result, int create)
> +static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> +		loff_t length, unsigned flags, struct iomap *iomap,
> +		struct iomap *srcmap)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct extent_map *em;
>  	struct extent_state *cached_state = NULL;
>  	struct btrfs_dio_data *dio_data = NULL;
> -	u64 start = iblock << inode->i_blkbits;
>  	u64 lockstart, lockend;
> -	u64 len = bh_result->b_size;
> +	int create = flags & IOMAP_WRITE;
>  	int ret = 0;
> +	u64 len = length;
> +	bool unlock_extents = false;
>  
>  	if (!create)
>  		len = min_t(u64, len, fs_info->sectorsize);
> @@ -7755,6 +7727,17 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
>  	lockstart = start;
>  	lockend = start + len - 1;
>  
> +	/*
> +	 * The generic stuff only does filemap_write_and_wait_range, which
> +	 * isn't enough if we've written compressed pages to this area, so
> +	 * we need to flush the dirty pages again to make absolutely sure
> +	 * that any outstanding dirty pages are on disk.
> +	 */
> +	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> +		     &BTRFS_I(inode)->runtime_flags))
> +		ret = filemap_fdatawrite_range(inode->i_mapping, start,
> +					 start + length - 1);
> +
>  	if (current->journal_info) {
>  		/*
>  		 * Need to pull our outstanding extents and set journal_info to NULL so
> @@ -7803,35 +7786,45 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
>  	}
>  
>  	if (create) {
> -		ret = btrfs_get_blocks_direct_write(&em, bh_result, inode,
> +		ret = btrfs_get_blocks_direct_write(&em, inode,
>  						    dio_data, start, len);
>  		if (ret < 0)
>  			goto unlock_err;
> -
> -		unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
> -				     lockend, &cached_state);
> +		unlock_extents = true;
> +	} else if (em->block_start == EXTENT_MAP_HOLE ||
> +			test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
> +		/* Unlock in case of direct reading from a hole */
> +		unlock_extents = true;
>  	} else {
> -		ret = btrfs_get_blocks_direct_read(em, bh_result, inode,
> -						   start, len);
> -		/* Can be negative only if we read from a hole */
> -		if (ret < 0) {
> -			ret = 0;
> -			free_extent_map(em);
> -			goto unlock_err;
> -		}
> +
> +		len = min(len, em->len - (start - em->start));
>  		/*
>  		 * We need to unlock only the end area that we aren't using.
>  		 * The rest is going to be unlocked by the endio routine.
>  		 */
> -		lockstart = start + bh_result->b_size;
> -		if (lockstart < lockend) {
> -			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
> -					     lockstart, lockend, &cached_state);
> -		} else {
> +		lockstart = start + len;
> +		if (lockstart < lockend)
> +			unlock_extents = true;
> +		else
>  			free_extent_state(cached_state);
> -		}
>  	}
>  
> +	if (unlock_extents)
> +		unlock_extent_cached(&BTRFS_I(inode)->io_tree,
> +				lockstart, lockend, &cached_state);
> +
> +	if ((em->block_start == EXTENT_MAP_HOLE) ||
> +	    (test_bit(EXTENT_FLAG_PREALLOC, &em->flags) && !create)) {
> +		iomap->addr = IOMAP_NULL_ADDR;
> +		iomap->type = IOMAP_HOLE;
> +	} else {
> +		iomap->addr = em->block_start;
> +		iomap->type = IOMAP_MAPPED;
> +	}
> +	iomap->offset = em->start;
> +	iomap->bdev = em->bdev;
> +	iomap->length = em->len;
> +

So here you always return a full extent worth of data, why not trimming
it to the requested range? I think this is still correct because the
generic iomap infrastructure will only process the requested range, at
least as far as iomap->length is concerned. But I'm not so sure about
iomap->offset though.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] btrfs: Wait for extent bits to release page
  2019-11-26  3:14 ` [PATCH 4/5] btrfs: Wait for extent bits to release page Goldwyn Rodrigues
@ 2019-11-26 12:38   ` Nikolay Borisov
  2019-11-26 12:42   ` Johannes Thumshirn
  1 sibling, 0 replies; 18+ messages in thread
From: Nikolay Borisov @ 2019-11-26 12:38 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs
  Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues



On 26.11.19 г. 5:14 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> While trying to release a page, the extent containing the page may be locked
> which would stop the page from being released. Wait for the
> extent lock to be cleared, if blocking is allowed and then clear
> the bits.
> 
> While we are at it, clean the code of try_release_extent_state() to make
> it simpler.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> ---
>  fs/btrfs/extent_io.c | 33 ++++++++++++++-------------------
>  1 file changed, 14 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index cceaf05aada2..a7c32276702d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4367,28 +4367,23 @@ static int try_release_extent_state(struct extent_io_tree *tree,

nit: While on it you can change the return type to bool and propagate it
up to try_release_extent_mapping and __btrfs_releasepage.

But in any case it looks good :

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

<snip>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5] fs: Remove dio_end_io()
  2019-11-26  3:14 ` [PATCH 5/5] fs: Remove dio_end_io() Goldwyn Rodrigues
@ 2019-11-26 12:39   ` Nikolay Borisov
  2019-11-26 12:50   ` Johannes Thumshirn
  1 sibling, 0 replies; 18+ messages in thread
From: Nikolay Borisov @ 2019-11-26 12:39 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs
  Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues



On 26.11.19 г. 5:14 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Since we removed the last user of dio_end_io(), remove the helper
> function dio_end_io().
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

Imo this and the previous patch need to be re-ordered. Since 4/5 is an
unrelated cleanup and this one is directly dependent on 3/5.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] btrfs: Wait for extent bits to release page
  2019-11-26  3:14 ` [PATCH 4/5] btrfs: Wait for extent bits to release page Goldwyn Rodrigues
  2019-11-26 12:38   ` Nikolay Borisov
@ 2019-11-26 12:42   ` Johannes Thumshirn
  1 sibling, 0 replies; 18+ messages in thread
From: Johannes Thumshirn @ 2019-11-26 12:42 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs
  Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5] fs: Remove dio_end_io()
  2019-11-26  3:14 ` [PATCH 5/5] fs: Remove dio_end_io() Goldwyn Rodrigues
  2019-11-26 12:39   ` Nikolay Borisov
@ 2019-11-26 12:50   ` Johannes Thumshirn
  1 sibling, 0 replies; 18+ messages in thread
From: Johannes Thumshirn @ 2019-11-26 12:50 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs
  Cc: linux-fsdevel, hch, darrick.wong, fdmanana, Goldwyn Rodrigues

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5 v2] btrfs direct-io using iomap
  2019-11-26  3:14 [PATCH 0/5 v2] btrfs direct-io using iomap Goldwyn Rodrigues
                   ` (4 preceding siblings ...)
  2019-11-26  3:14 ` [PATCH 5/5] fs: Remove dio_end_io() Goldwyn Rodrigues
@ 2019-11-27 15:51 ` David Sterba
  5 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2019-11-27 15:51 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, linux-fsdevel, hch, darrick.wong, fdmanana

On Mon, Nov 25, 2019 at 09:14:51PM -0600, Goldwyn Rodrigues wrote:
> This is an effort to use iomap for direct I/O in btrfs. This would
> change the call from __blockdev_direct_io() to iomap_dio_rw().
> 
> The main objective is to lose the buffer head and use bio defined by
> iomap code, and hopefully to use more of generic-FS codebase.
> 
> These patches are based on xfs/iomap-for-next, though I tested it
> against the patches on xfs/iomap-for-next on top of v5.4 (there are no
> changes to existing iomap patches).
> 
> I have tested it against xfstests/btrfs.
> 
> Changes since v1
> - Incorporated back the efficiency change for inode locking
> - Review comments about coding style and git comments
> - Merge related patches into one
> - Direct read to go through btrfs_direct_IO()
> - Removal of no longer used function dio_end_io()

I see a lockdep assertion during test btrfs/004

iomap_dio_rw()
...
	lockdep_assert_held(&inode->i_rwsem);

btrfs/004               [15:46:30][   69.749908] run fstests btrfs/004 at 2019-11-27 15:46:30
[   70.180401] BTRFS info (device vda): disk space caching is enabled
[   70.183496] BTRFS info (device vda): has skinny extents
[   70.378865] BTRFS: device fsid a10200b4-f17d-49a0-8c82-b06b69871613 devid 1 transid 5 /dev/vdb scanned by mkfs.btrfs (21271)
[   70.405858] BTRFS info (device vdb): turning on discard
[   70.408279] BTRFS info (device vdb): disk space caching is enabled
[   70.410630] BTRFS info (device vdb): has skinny extents
[   70.412358] BTRFS info (device vdb): flagging fs with big metadata feature
[   70.420386] BTRFS info (device vdb): checking UUID tree
[   70.427675] ------------[ cut here ]------------
[   70.429375] WARNING: CPU: 2 PID: 21300 at fs/iomap/direct-io.c:413 iomap_dio_rw+0x255/0x560
[   70.432208] Modules linked in: xxhash_generic btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
[   70.437096] CPU: 2 PID: 21300 Comm: fsstress Not tainted 5.4.0-default+ #881
[   70.439135] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-rebuilt.opensuse.org 04/01/2014
[   70.442462] RIP: 0010:iomap_dio_rw+0x255/0x560
[   70.449092] RSP: 0018:ffffa36dc5b37c80 EFLAGS: 00010246
[   70.450701] RAX: 0000000000000000 RBX: ffffa36dc5b37e78 RCX: 0000000000000001
[   70.453062] RDX: ffff88d5b3f75500 RSI: ffff88d5a0c29018 RDI: ffff88d5b3f75d10
[   70.455308] RBP: ffffa36dc5b37d20 R08: 0000000000000001 R09: ffffffffaef84850
[   70.457479] R10: ffffa36dc5b37d40 R11: 0000000000007000 R12: ffffa36dc5b37e50
[   70.459479] R13: 0000000000049000 R14: ffff88d5a0c28ec0 R15: 0000000000049000
[   70.461790] FS:  00007fab1f5cdb80(0000) GS:ffff88d5bda00000(0000) knlGS:0000000000000000
[   70.464601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.466503] CR2: 000000000041e000 CR3: 0000000061097006 CR4: 0000000000160ee0
[   70.468475] Call Trace:
[   70.469477]  ? btrfs_direct_IO+0x215/0x360 [btrfs]
[   70.471201]  ? __lock_acquired+0x1f0/0x320
[   70.472837]  ? btrfs_direct_IO+0xda/0x360 [btrfs]
[   70.474453]  btrfs_direct_IO+0xda/0x360 [btrfs]
[   70.476101]  ? lockdep_hardirqs_on+0x103/0x190
[   70.477732]  btrfs_file_write_iter+0x20b/0x5f0 [btrfs]
[   70.479450]  ? _copy_to_user+0x76/0x90
[   70.480943]  new_sync_write+0x118/0x1b0
[   70.482446]  vfs_write+0xde/0x1d0
[   70.483789]  ksys_write+0x68/0xe0
[   70.485136]  do_syscall_64+0x50/0x210
[   70.486578]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   70.488390] RIP: 0033:0x7fab1f7a8f93
[   70.495054] RSP: 002b:00007ffea7b21208 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   70.497693] RAX: ffffffffffffffda RBX: 0000000000007000 RCX: 00007fab1f7a8f93
[   70.499708] RDX: 0000000000007000 RSI: 0000000000416000 RDI: 0000000000000003
[   70.501754] RBP: 0000000000000003 R08: 0000000000416000 R09: 0000000000000231
[   70.503682] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009
[   70.505622] R13: 0000000000049000 R14: 0000000000416000 R15: 0000000000000000
[   70.507683] irq event stamp: 1532
[   70.508931] hardirqs last  enabled at (1531): [<ffffffffc01d7bfb>] get_alloc_profile+0x1ab/0x2b0 [btrfs]
[   70.512256] hardirqs last disabled at (1532): [<ffffffffad002a8b>] trace_hardirqs_off_thunk+0x1a/0x1c
[   70.515039] softirqs last  enabled at (0): [<ffffffffad07efd0>] copy_process+0x680/0x1b70
[   70.517193] softirqs last disabled at (0): [<0000000000000000>] 0x0
[   70.519153] ---[ end trace 7fa25c75a89dbc97 ]---

The branch is xfs/iomap-for-next + this patchset.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, back to index

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-26  3:14 [PATCH 0/5 v2] btrfs direct-io using iomap Goldwyn Rodrigues
2019-11-26  3:14 ` [PATCH 1/5] fs: Export generic_file_buffered_read() Goldwyn Rodrigues
2019-11-26  3:43   ` Matthew Wilcox
2019-11-26 10:10   ` Johannes Thumshirn
2019-11-26 10:43     ` Johannes Thumshirn
2019-11-26  3:14 ` [PATCH 2/5] iomap: add a filesystem hook for direct I/O bio submission Goldwyn Rodrigues
2019-11-26 10:53   ` Nikolay Borisov
2019-11-26 11:51   ` Johannes Thumshirn
2019-11-26  3:14 ` [PATCH 3/5] btrfs: Switch to iomap_dio_rw() for dio Goldwyn Rodrigues
2019-11-26 11:54   ` Nikolay Borisov
2019-11-26 12:13   ` Nikolay Borisov
2019-11-26  3:14 ` [PATCH 4/5] btrfs: Wait for extent bits to release page Goldwyn Rodrigues
2019-11-26 12:38   ` Nikolay Borisov
2019-11-26 12:42   ` Johannes Thumshirn
2019-11-26  3:14 ` [PATCH 5/5] fs: Remove dio_end_io() Goldwyn Rodrigues
2019-11-26 12:39   ` Nikolay Borisov
2019-11-26 12:50   ` Johannes Thumshirn
2019-11-27 15:51 ` [PATCH 0/5 v2] btrfs direct-io using iomap David Sterba

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org
	public-inbox-index linux-btrfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git