linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7 v8] btrfs direct-io using iomap
@ 2020-05-22 12:38 Goldwyn Rodrigues
  2020-05-22 12:38 ` [PATCH 1/7] fs: Export generic_file_buffered_read() Goldwyn Rodrigues
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-05-22 12:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: hch, dsterba

This is an effort to use iomap for direct I/O in btrfs. This would
change the call from __blockdev_direct_io() to iomap_dio_rw().

The main objective is to lose the buffer head and use bio defined by
iomap code, and hopefully to use more of generic-FS codebase.

These patches are based and tested on kdave/for-next-20200511. I
have tested it against xfstests/btrfs.

The tree is available at
https://github.com/goldwynr/linux/tree/dio-merge


Changes since v1
- Incorporated back the efficiency change for inode locking
- Review comments about coding style and git comments
- Merge related patches into one
- Direct read to go through btrfs_direct_IO()
- Removal of no longer used function dio_end_io()

Changes since v2
- aligning iomap offset/length to the position/length of I/O
- Removed btrfs_dio_data
- Removed BTRFS_INODE_READDIO_NEED_LOCK
- Re-incorporating write efficiency changes caused lockdep_assert() in
  iomap to be triggered, remove that code.

Changes since v3
- Fixed freeze on generic/095. Use iomap_end() to account for
  failed/incomplete dio instead of btrfs_dio_data

Changes since v4
- moved lockdep_assert_held() to functions calling iomap_dio_rw()
  This may be called immidiately after calling inode lock and
  may feel not required, but it seems important.
- Removed comments which are no longer required
- Changed commit comments to make them more appropriate

Changes since v5
- restore inode_dio_wait() in truncate
- Removed lockdep_assert_held() near callers

Changes since v6
- Fixed hangs due to underlying device failures
- Removed the patch to wait while releasing pages

Changes since v7
- Moved reservation into btrfs iomap begin()/end() for correct
  reservation cleanup in case of device error.
- Merged patches switch to iomap, and adding btrfs_iomap_end()
  for easier bisection. The switch to iomap would fail in case
  of dio after buffered writes.




^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/7] fs: Export generic_file_buffered_read()
  2020-05-22 12:38 [PATCH 0/7 v8] btrfs direct-io using iomap Goldwyn Rodrigues
@ 2020-05-22 12:38 ` Goldwyn Rodrigues
  2020-05-25 12:25   ` David Sterba
  2020-05-22 12:38 ` [PATCH 2/7] iomap: add a filesystem hook for direct I/O bio submission Goldwyn Rodrigues
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-05-22 12:38 UTC (permalink / raw)
  To: linux-btrfs
  Cc: hch, dsterba, Goldwyn Rodrigues, Johannes Thumshirn, Christoph Hellwig

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Export generic_file_buffered_read() to be used to
supplement incomplete direct reads.

While we are at it, correct the comments and variable names.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 13 +++++++------
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 45cc10cdf6dd..366c533d30cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3124,6 +3124,8 @@ extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
 extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 				    struct file *file_out, loff_t pos_out,
 				    size_t *count, unsigned int flags);
+extern ssize_t generic_file_buffered_read(struct kiocb *iocb,
+		struct iov_iter *to, ssize_t already_read);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 23a051a7ef0f..27df1cf35eb4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1979,7 +1979,7 @@ static void shrink_readahead_size_eio(struct file_ra_state *ra)
  * generic_file_buffered_read - generic file read routine
  * @iocb:	the iocb to read
  * @iter:	data destination
- * @written:	already copied
+ * @copied:	already copied
  *
  * This is a generic file read routine, and uses the
  * mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1988,11 +1988,11 @@ static void shrink_readahead_size_eio(struct file_ra_state *ra)
  * of the logic when it comes to error handling etc.
  *
  * Return:
- * * total number of bytes copied, including those the were already @written
+ * * total number of bytes copied, including those that were @copied
  * * negative error code if nothing was copied
  */
-static ssize_t generic_file_buffered_read(struct kiocb *iocb,
-		struct iov_iter *iter, ssize_t written)
+ssize_t generic_file_buffered_read(struct kiocb *iocb,
+		struct iov_iter *iter, ssize_t copied)
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
@@ -2133,7 +2133,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		prev_offset = offset;
 
 		put_page(page);
-		written += ret;
+		copied += ret;
 		if (!iov_iter_count(iter))
 			goto out;
 		if (ret < nr) {
@@ -2241,8 +2241,9 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 	*ppos = ((loff_t)index << PAGE_SHIFT) + offset;
 	file_accessed(filp);
-	return written ? written : error;
+	return copied ? copied : error;
 }
+EXPORT_SYMBOL_GPL(generic_file_buffered_read);
 
 /**
  * generic_file_read_iter - generic filesystem read routine
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/7] iomap: add a filesystem hook for direct I/O bio submission
  2020-05-22 12:38 [PATCH 0/7 v8] btrfs direct-io using iomap Goldwyn Rodrigues
  2020-05-22 12:38 ` [PATCH 1/7] fs: Export generic_file_buffered_read() Goldwyn Rodrigues
@ 2020-05-22 12:38 ` Goldwyn Rodrigues
  2020-05-22 12:38 ` [PATCH 3/7] iomap: Remove lockdep_assert_held() Goldwyn Rodrigues
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-05-22 12:38 UTC (permalink / raw)
  To: linux-btrfs
  Cc: hch, dsterba, Goldwyn Rodrigues, Johannes Thumshirn,
	Nikolay Borisov, Christoph Hellwig, Darrick J . Wong

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This helps filesystems to perform tasks on the bio while submitting for
I/O. This could be post-write operations such as data CRC or data
replication for fs-handled RAID.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/iomap/direct-io.c  | 15 ++++++++++-----
 include/linux/iomap.h |  2 ++
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 20dde5aadcdd..f88ba6e7f6af 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
 EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
 
 static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
-		struct bio *bio)
+		struct bio *bio, loff_t pos)
 {
 	atomic_inc(&dio->ref);
 
@@ -67,7 +67,12 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
 		bio_set_polled(bio, dio->iocb);
 
 	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
-	dio->submit.cookie = submit_bio(bio);
+	if (dio->dops && dio->dops->submit_io)
+		dio->submit.cookie = dio->dops->submit_io(
+				file_inode(dio->iocb->ki_filp),
+				iomap, bio, pos);
+	else
+		dio->submit.cookie = submit_bio(bio);
 }
 
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
@@ -191,7 +196,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	get_page(page);
 	__bio_add_page(bio, page, len, 0);
 	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
-	iomap_dio_submit_bio(dio, iomap, bio);
+	iomap_dio_submit_bio(dio, iomap, bio, pos);
 }
 
 static loff_t
@@ -299,11 +304,11 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		}
 
 		dio->size += n;
-		pos += n;
 		copied += n;
 
 		nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
-		iomap_dio_submit_bio(dio, iomap, bio);
+		iomap_dio_submit_bio(dio, iomap, bio, pos);
+		pos += n;
 	} while (nr_pages);
 
 	/*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b09463dae0d..5b4875344874 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -252,6 +252,8 @@ int iomap_writepages(struct address_space *mapping,
 struct iomap_dio_ops {
 	int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
 		      unsigned flags);
+	blk_qc_t (*submit_io)(struct inode *inode, struct iomap *iomap,
+			struct bio *bio, loff_t file_offset);
 };
 
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/7] iomap: Remove lockdep_assert_held()
  2020-05-22 12:38 [PATCH 0/7 v8] btrfs direct-io using iomap Goldwyn Rodrigues
  2020-05-22 12:38 ` [PATCH 1/7] fs: Export generic_file_buffered_read() Goldwyn Rodrigues
  2020-05-22 12:38 ` [PATCH 2/7] iomap: add a filesystem hook for direct I/O bio submission Goldwyn Rodrigues
@ 2020-05-22 12:38 ` Goldwyn Rodrigues
  2020-05-28 15:40   ` Darrick J. Wong
  2020-05-22 12:38 ` [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio Goldwyn Rodrigues
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-05-22 12:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: hch, dsterba, Goldwyn Rodrigues, Darrick J . Wong

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Filesystems such as btrfs can perform direct I/O without holding the
inode->i_rwsem in some of the cases like writing within i_size.
So, remove the check for lockdep_assert_held() in iomap_dio_rw()

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/iomap/direct-io.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f88ba6e7f6af..e4addfc58107 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -416,8 +416,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	struct blk_plug plug;
 	struct iomap_dio *dio;
 
-	lockdep_assert_held(&inode->i_rwsem);
-
 	if (!count)
 		return 0;
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio
  2020-05-22 12:38 [PATCH 0/7 v8] btrfs direct-io using iomap Goldwyn Rodrigues
                   ` (2 preceding siblings ...)
  2020-05-22 12:38 ` [PATCH 3/7] iomap: Remove lockdep_assert_held() Goldwyn Rodrigues
@ 2020-05-22 12:38 ` Goldwyn Rodrigues
  2020-05-26 15:03   ` Johannes Thumshirn
  2020-05-22 12:38 ` [PATCH 5/7] fs: Remove dio_end_io() Goldwyn Rodrigues
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-05-22 12:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: hch, dsterba, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Switch from __blockdev_direct_IO() to iomap_dio_rw().
Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
as iomap_begin() for iomap direct I/O functions. This function
allocates and locks all the blocks required for the I/O.
btrfs_submit_direct() is used as the submit_io() hook for direct I/O
ops.

Since we need direct I/O reads to go through iomap_dio_rw(), we change
file_operations.read_iter() to a btrfs_file_read_iter() which calls
btrfs_direct_IO() for direct reads and falls back to
generic_file_buffered_read() for incomplete reads and buffered reads.

We don't need address_space.direct_IO() anymore so set it to noop.
Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
capable of direct I/O reads from a hole, so we don't need to return
-ENOENT.

BTRFS direct I/O is now done under i_rwsem, shared in case of
reads and exclusive in case of writes. This guards against simultaneous
truncates.

Use iomap->iomap_end() to check for failed or incomplete direct I/O:
For writes, call __endio_write_update_ordered()
For reads, Unlock extents

btrfs_dio_data is now hooked in iomap->private and not
current->journal_info. It carries the reservation variable and the
amount of data submitted, so we can calculate the amount of data to call
__endio_write_update_ordered in case of an error.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |   1 +
 fs/btrfs/file.c  |  21 +++-
 fs/btrfs/inode.c | 313 ++++++++++++++++++++++-------------------------
 3 files changed, 169 insertions(+), 166 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7c24abc3d302..1ed8a780930f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2911,6 +2911,7 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
 extern const struct dentry_operations btrfs_dentry_operations;
+ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
 
 /* ioctl.c */
 long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 719e68ab552c..cc42f0752625 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1827,7 +1827,7 @@ static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	loff_t endbyte;
 	int err;
 
-	written = generic_file_direct_write(iocb, from);
+	written = btrfs_direct_IO(iocb, from);
 
 	if (written < 0 || !iov_iter_count(from))
 		return written;
@@ -3484,9 +3484,26 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 	return generic_file_open(inode, filp);
 }
 
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	ssize_t ret = 0;
+
+	if (iocb->ki_flags & IOCB_DIRECT) {
+		struct inode *inode = file_inode(iocb->ki_filp);
+
+		inode_lock_shared(inode);
+		ret = btrfs_direct_IO(iocb, to);
+		inode_unlock_shared(inode);
+		if (ret < 0)
+			return ret;
+	}
+
+	return generic_file_buffered_read(iocb, to, ret);
+}
+
 const struct file_operations btrfs_file_operations = {
 	.llseek		= btrfs_file_llseek,
-	.read_iter      = generic_file_read_iter,
+	.read_iter      = btrfs_file_read_iter,
 	.splice_read	= generic_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.mmap		= btrfs_file_mmap,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3ea694ee1c90..ebf97286b110 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -30,6 +30,7 @@
 #include <linux/swap.h>
 #include <linux/migrate.h>
 #include <linux/sched/mm.h>
+#include <linux/iomap.h>
 #include <asm/unaligned.h>
 #include "misc.h"
 #include "ctree.h"
@@ -57,9 +58,9 @@ struct btrfs_iget_args {
 
 struct btrfs_dio_data {
 	u64 reserve;
-	u64 unsubmitted_oe_range_start;
-	u64 unsubmitted_oe_range_end;
-	int overwrite;
+	loff_t length;
+	ssize_t submitted;
+	struct extent_changeset *data_reserved;
 };
 
 static const struct inode_operations btrfs_dir_inode_operations;
@@ -6983,7 +6984,7 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 }
 
 static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
-			      struct extent_state **cached_state, int writing)
+			      struct extent_state **cached_state, bool writing)
 {
 	struct btrfs_ordered_extent *ordered;
 	int ret = 0;
@@ -7121,30 +7122,7 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
 }
 
 
-static int btrfs_get_blocks_direct_read(struct extent_map *em,
-					struct buffer_head *bh_result,
-					struct inode *inode,
-					u64 start, u64 len)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-
-	if (em->block_start == EXTENT_MAP_HOLE ||
-			test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
-		return -ENOENT;
-
-	len = min(len, em->len - (start - em->start));
-
-	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
-		inode->i_blkbits;
-	bh_result->b_size = len;
-	bh_result->b_bdev = fs_info->fs_devices->latest_bdev;
-	set_buffer_mapped(bh_result);
-
-	return 0;
-}
-
 static int btrfs_get_blocks_direct_write(struct extent_map **map,
-					 struct buffer_head *bh_result,
 					 struct inode *inode,
 					 struct btrfs_dio_data *dio_data,
 					 u64 start, u64 len)
@@ -7206,7 +7184,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	}
 
 	/* this will cow the extent */
-	len = bh_result->b_size;
 	free_extent_map(em);
 	*map = em = btrfs_new_extent_direct(inode, start, len);
 	if (IS_ERR(em)) {
@@ -7217,64 +7194,74 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	len = min(len, em->len - (start - em->start));
 
 skip_cow:
-	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
-		inode->i_blkbits;
-	bh_result->b_size = len;
-	bh_result->b_bdev = fs_info->fs_devices->latest_bdev;
-	set_buffer_mapped(bh_result);
-
-	if (!test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
-		set_buffer_new(bh_result);
-
 	/*
 	 * Need to update the i_size under the extent lock so buffered
 	 * readers will get the updated i_size when we unlock.
 	 */
-	if (!dio_data->overwrite && start + len > i_size_read(inode))
+	if (start + len > i_size_read(inode))
 		i_size_write(inode, start + len);
 
-	WARN_ON(dio_data->reserve < len);
 	dio_data->reserve -= len;
-	dio_data->unsubmitted_oe_range_end = start + len;
-	current->journal_info = dio_data;
 out:
 	return ret;
 }
 
-static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
-				   struct buffer_head *bh_result, int create)
+static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
+		loff_t length, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em;
 	struct extent_state *cached_state = NULL;
 	struct btrfs_dio_data *dio_data = NULL;
-	u64 start = iblock << inode->i_blkbits;
 	u64 lockstart, lockend;
-	u64 len = bh_result->b_size;
+	bool write = !!(flags & IOMAP_WRITE);
 	int ret = 0;
+	u64 len = length;
+	bool unlock_extents = false;
 
-	if (!create)
+	if (!write)
 		len = min_t(u64, len, fs_info->sectorsize);
 
 	lockstart = start;
 	lockend = start + len - 1;
 
-	if (current->journal_info) {
-		/*
-		 * Need to pull our outstanding extents and set journal_info to NULL so
-		 * that anything that needs to check if there's a transaction doesn't get
-		 * confused.
-		 */
-		dio_data = current->journal_info;
-		current->journal_info = NULL;
+	/*
+	 * The generic stuff only does filemap_write_and_wait_range, which
+	 * isn't enough if we've written compressed pages to this area, so
+	 * we need to flush the dirty pages again to make absolutely sure
+	 * that any outstanding dirty pages are on disk.
+	 */
+	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
+		     &BTRFS_I(inode)->runtime_flags))
+		ret = filemap_fdatawrite_range(inode->i_mapping, start,
+					 start + length - 1);
+
+	dio_data = kzalloc(sizeof(*dio_data), GFP_NOFS);
+	if (!dio_data)
+		return -ENOMEM;
+
+	dio_data->length = length;
+	if (write) {
+		dio_data->reserve = round_up(length, fs_info->sectorsize);
+		ret = btrfs_delalloc_reserve_space(inode,
+				&dio_data->data_reserved,
+				start, dio_data->reserve);
+		if (ret) {
+			extent_changeset_free(dio_data->data_reserved);
+			kfree(dio_data);
+			return ret;
+		}
 	}
+	iomap->private = dio_data;
+
 
 	/*
 	 * If this errors out it's because we couldn't invalidate pagecache for
 	 * this range and we need to fallback to buffered.
 	 */
 	if (lock_extent_direct(inode, lockstart, lockend, &cached_state,
-			       create)) {
+			       write)) {
 		ret = -ENOTBLK;
 		goto err;
 	}
@@ -7306,35 +7293,48 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 		goto unlock_err;
 	}
 
-	if (create) {
-		ret = btrfs_get_blocks_direct_write(&em, bh_result, inode,
+	len = min(len, em->len - (start - em->start));
+	if (write) {
+		ret = btrfs_get_blocks_direct_write(&em, inode,
 						    dio_data, start, len);
 		if (ret < 0)
 			goto unlock_err;
-
-		unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
-				     lockend, &cached_state);
+		unlock_extents = true;
+		/* Recalc len in case the new em is smaller than requested */
+		len = min(len, em->len - (start - em->start));
 	} else {
-		ret = btrfs_get_blocks_direct_read(em, bh_result, inode,
-						   start, len);
-		/* Can be negative only if we read from a hole */
-		if (ret < 0) {
-			ret = 0;
-			free_extent_map(em);
-			goto unlock_err;
-		}
 		/*
 		 * We need to unlock only the end area that we aren't using.
 		 * The rest is going to be unlocked by the endio routine.
 		 */
-		lockstart = start + bh_result->b_size;
-		if (lockstart < lockend) {
-			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
-					     lockstart, lockend, &cached_state);
-		} else {
-			free_extent_state(cached_state);
-		}
+		lockstart = start + len;
+		if (lockstart < lockend)
+			unlock_extents = true;
+	}
+
+	if (unlock_extents)
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree,
+				lockstart, lockend, &cached_state);
+	else
+		free_extent_state(cached_state);
+
+	/*
+	 * Translate extent map information to iomap
+	 * We trim the extents (and move the addr) even though
+	 * iomap code does that, since we have locked only the parts
+	 * we are performing I/O in.
+	 */
+	if ((em->block_start == EXTENT_MAP_HOLE) ||
+	    (test_bit(EXTENT_FLAG_PREALLOC, &em->flags) && !write)) {
+		iomap->addr = IOMAP_NULL_ADDR;
+		iomap->type = IOMAP_HOLE;
+	} else {
+		iomap->addr = em->block_start + (start - em->start);
+		iomap->type = IOMAP_MAPPED;
 	}
+	iomap->offset = start;
+	iomap->bdev = fs_info->fs_devices->latest_bdev;
+	iomap->length = len;
 
 	free_extent_map(em);
 
@@ -7344,11 +7344,58 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend,
 			     &cached_state);
 err:
-	if (dio_data)
-		current->journal_info = dio_data;
+	if (dio_data) {
+		btrfs_delalloc_release_space(inode, dio_data->data_reserved,
+				start, dio_data->reserve, true);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), dio_data->reserve);
+		extent_changeset_free(dio_data->data_reserved);
+		kfree(dio_data);
+	}
 	return ret;
 }
 
+static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
+		ssize_t written, unsigned flags, struct iomap *iomap)
+{
+	int ret = 0;
+	struct btrfs_dio_data *dio_data = iomap->private;
+	size_t submitted = dio_data->submitted;
+	bool write = !!(flags & IOMAP_WRITE);
+
+	if (!write && (iomap->type == IOMAP_HOLE)) {
+		/* If reading from a hole, unlock and return */
+		unlock_extent(&BTRFS_I(inode)->io_tree, pos,
+				pos + length - 1);
+		goto out;
+	}
+
+	if (submitted < length) {
+		pos += submitted;
+		length -= submitted;
+		if (write)
+			__endio_write_update_ordered(inode, pos, length,
+					false);
+		else
+			unlock_extent(&BTRFS_I(inode)->io_tree, pos,
+					pos + length - 1);
+		ret = -ENOTBLK;
+	}
+
+	if (write) {
+		if (dio_data->reserve)
+			btrfs_delalloc_release_space(inode, dio_data->data_reserved,
+					pos, dio_data->reserve, true);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), dio_data->length);
+		extent_changeset_free(dio_data->data_reserved);
+	}
+out:
+	kfree(dio_data);
+	iomap->private = NULL;
+
+	return ret;
+}
+
+
 static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
 {
 	/*
@@ -7368,7 +7415,7 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
 			      dip->logical_offset + dip->bytes - 1);
 	}
 
-	dio_end_io(dip->dio_bio);
+	bio_endio(dip->dio_bio);
 	kfree(dip);
 }
 
@@ -7604,24 +7651,12 @@ static struct btrfs_dio_private *btrfs_create_dio_private(struct bio *dio_bio,
 	dip->disk_bytenr = (u64)dio_bio->bi_iter.bi_sector << 9;
 	dip->dio_bio = dio_bio;
 	refcount_set(&dip->refs, 1);
-
-	if (write) {
-		struct btrfs_dio_data *dio_data = current->journal_info;
-
-		/*
-		 * Setting range start and end to the same value means that
-		 * no cleanup will happen in btrfs_direct_IO
-		 */
-		dio_data->unsubmitted_oe_range_end = dip->logical_offset +
-			dip->bytes;
-		dio_data->unsubmitted_oe_range_start =
-			dio_data->unsubmitted_oe_range_end;
-	}
 	return dip;
 }
 
-static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
-				loff_t file_offset)
+static blk_qc_t btrfs_submit_direct(struct inode *inode,
+		struct iomap *iomap, struct bio *dio_bio,
+		loff_t file_offset)
 {
 	const bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
 	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
@@ -7638,6 +7673,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 	int ret;
 	blk_status_t status;
 	struct btrfs_io_geometry geom;
+	struct btrfs_dio_data *dio_data = iomap->private;
 
 	dip = btrfs_create_dio_private(dio_bio, inode, file_offset);
 	if (!dip) {
@@ -7646,8 +7682,8 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 				file_offset + dio_bio->bi_iter.bi_size - 1);
 		}
 		dio_bio->bi_status = BLK_STS_RESOURCE;
-		dio_end_io(dio_bio);
-		return;
+		bio_endio(dio_bio);
+		return BLK_QC_T_NONE;
 	}
 
 	if (!write && csum) {
@@ -7718,15 +7754,17 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 			goto out_err;
 		}
 
+		dio_data->submitted += clone_len;
 		clone_offset += clone_len;
 		start_sector += clone_len >> 9;
 		file_offset += clone_len;
 	} while (submit_len > 0);
-	return;
+	return BLK_QC_T_NONE;
 
 out_err:
 	dip->dio_bio->bi_status = status;
 	btrfs_dio_private_put(dip);
+	return BLK_QC_T_NONE;
 }
 
 static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
@@ -7762,37 +7800,31 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
 	return retval;
 }
 
-static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+static const struct iomap_ops btrfs_dio_iomap_ops = {
+	.iomap_begin            = btrfs_dio_iomap_begin,
+	.iomap_end              = btrfs_dio_iomap_end,
+};
+
+static const struct iomap_dio_ops btrfs_dops = {
+	.submit_io		= btrfs_submit_direct,
+};
+
+
+ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct btrfs_dio_data dio_data = { 0 };
 	struct extent_changeset *data_reserved = NULL;
 	loff_t offset = iocb->ki_pos;
 	size_t count = 0;
-	int flags = 0;
-	bool wakeup = true;
 	bool relock = false;
 	ssize_t ret;
 
 	if (check_direct_IO(fs_info, iter, offset))
 		return 0;
 
-	inode_dio_begin(inode);
-
-	/*
-	 * The generic stuff only does filemap_write_and_wait_range, which
-	 * isn't enough if we've written compressed pages to this area, so
-	 * we need to flush the dirty pages again to make absolutely sure
-	 * that any outstanding dirty pages are on disk.
-	 */
 	count = iov_iter_count(iter);
-	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-		     &BTRFS_I(inode)->runtime_flags))
-		filemap_fdatawrite_range(inode->i_mapping, offset,
-					 offset + count - 1);
-
 	if (iov_iter_rw(iter) == WRITE) {
 		/*
 		 * If the write DIO is beyond the EOF, we need update
@@ -7800,71 +7832,24 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		 * not unlock the i_mutex at this case.
 		 */
 		if (offset + count <= inode->i_size) {
-			dio_data.overwrite = 1;
 			inode_unlock(inode);
 			relock = true;
 		} else if (iocb->ki_flags & IOCB_NOWAIT) {
 			ret = -EAGAIN;
 			goto out;
 		}
-		ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
-						   offset, count);
-		if (ret)
-			goto out;
-
-		/*
-		 * We need to know how many extents we reserved so that we can
-		 * do the accounting properly if we go over the number we
-		 * originally calculated.  Abuse current->journal_info for this.
-		 */
-		dio_data.reserve = round_up(count,
-					    fs_info->sectorsize);
-		dio_data.unsubmitted_oe_range_start = (u64)offset;
-		dio_data.unsubmitted_oe_range_end = (u64)offset;
-		current->journal_info = &dio_data;
 		down_read(&BTRFS_I(inode)->dio_sem);
-	} else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
-				     &BTRFS_I(inode)->runtime_flags)) {
-		inode_dio_end(inode);
-		flags = DIO_LOCKING | DIO_SKIP_HOLES;
-		wakeup = false;
 	}
 
-	ret = __blockdev_direct_IO(iocb, inode,
-				   fs_info->fs_devices->latest_bdev,
-				   iter, btrfs_get_blocks_direct, NULL,
-				   btrfs_submit_direct, flags);
+	ret = iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dops,
+			is_sync_kiocb(iocb));
+
 	if (iov_iter_rw(iter) == WRITE) {
 		up_read(&BTRFS_I(inode)->dio_sem);
-		current->journal_info = NULL;
-		if (ret < 0 && ret != -EIOCBQUEUED) {
-			if (dio_data.reserve)
-				btrfs_delalloc_release_space(inode, data_reserved,
-					offset, dio_data.reserve, true);
-			/*
-			 * On error we might have left some ordered extents
-			 * without submitting corresponding bios for them, so
-			 * cleanup them up to avoid other tasks getting them
-			 * and waiting for them to complete forever.
-			 */
-			if (dio_data.unsubmitted_oe_range_start <
-			    dio_data.unsubmitted_oe_range_end)
-				__endio_write_update_ordered(inode,
-					dio_data.unsubmitted_oe_range_start,
-					dio_data.unsubmitted_oe_range_end -
-					dio_data.unsubmitted_oe_range_start,
-					false);
-		} else if (ret >= 0 && (size_t)ret < count)
-			btrfs_delalloc_release_space(inode, data_reserved,
-					offset, count - (size_t)ret, true);
-		btrfs_delalloc_release_extents(BTRFS_I(inode), count);
 	}
 out:
-	if (wakeup)
-		inode_dio_end(inode);
 	if (relock)
 		inode_lock(inode);
-
 	extent_changeset_free(data_reserved);
 	return ret;
 }
@@ -10180,7 +10165,7 @@ static const struct address_space_operations btrfs_aops = {
 	.writepage	= btrfs_writepage,
 	.writepages	= btrfs_writepages,
 	.readpages	= btrfs_readpages,
-	.direct_IO	= btrfs_direct_IO,
+	.direct_IO	= noop_direct_IO,
 	.invalidatepage = btrfs_invalidatepage,
 	.releasepage	= btrfs_releasepage,
 #ifdef CONFIG_MIGRATION
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 5/7] fs: Remove dio_end_io()
  2020-05-22 12:38 [PATCH 0/7 v8] btrfs direct-io using iomap Goldwyn Rodrigues
                   ` (3 preceding siblings ...)
  2020-05-22 12:38 ` [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio Goldwyn Rodrigues
@ 2020-05-22 12:38 ` Goldwyn Rodrigues
  2020-05-22 12:38 ` [PATCH 6/7] btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK Goldwyn Rodrigues
  2020-05-22 12:38 ` [PATCH 7/7] btrfs: btrfs: split btrfs_direct_IO Goldwyn Rodrigues
  6 siblings, 0 replies; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-05-22 12:38 UTC (permalink / raw)
  To: linux-btrfs
  Cc: hch, dsterba, Goldwyn Rodrigues, Nikolay Borisov,
	Johannes Thumshirn, Christoph Hellwig

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Since we removed the last user of dio_end_io(), remove the helper
function dio_end_io().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/direct-io.c     | 19 -------------------
 include/linux/fs.h |  2 --
 2 files changed, 21 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 00b4d15bb811..c44d60f375bc 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,25 +386,6 @@ static void dio_bio_end_io(struct bio *bio)
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 }
 
-/**
- * dio_end_io - handle the end io action for the given bio
- * @bio: The direct io bio thats being completed
- *
- * This is meant to be called by any filesystem that uses their own dio_submit_t
- * so that the DIO specific endio actions are dealt with after the filesystem
- * has done it's completion work.
- */
-void dio_end_io(struct bio *bio)
-{
-	struct dio *dio = bio->bi_private;
-
-	if (dio->is_async)
-		dio_bio_end_aio(bio);
-	else
-		dio_bio_end_io(bio);
-}
-EXPORT_SYMBOL_GPL(dio_end_io);
-
 static inline void
 dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	      struct block_device *bdev,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 366c533d30cd..e84623d5e173 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3187,8 +3187,6 @@ enum {
 	DIO_SKIP_HOLES	= 0x02,
 };
 
-void dio_end_io(struct bio *bio);
-
 ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 			     struct block_device *bdev, struct iov_iter *iter,
 			     get_block_t get_block,
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 6/7] btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
  2020-05-22 12:38 [PATCH 0/7 v8] btrfs direct-io using iomap Goldwyn Rodrigues
                   ` (4 preceding siblings ...)
  2020-05-22 12:38 ` [PATCH 5/7] fs: Remove dio_end_io() Goldwyn Rodrigues
@ 2020-05-22 12:38 ` Goldwyn Rodrigues
  2020-05-22 12:38 ` [PATCH 7/7] btrfs: btrfs: split btrfs_direct_IO Goldwyn Rodrigues
  6 siblings, 0 replies; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-05-22 12:38 UTC (permalink / raw)
  To: linux-btrfs
  Cc: hch, dsterba, Goldwyn Rodrigues, Nikolay Borisov, Johannes Thumshirn

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Since we now perform direct reads using i_rwsem, we can remove this
inode flag used to co-ordinate unlocked reads.

The truncate call takes i_rwsem. This means it is correctly synchronized
with concurrent direct reads.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jth@kernel.org>
---
 fs/btrfs/btrfs_inode.h | 18 ------------------
 fs/btrfs/inode.c       |  3 ---
 2 files changed, 21 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index e7d709505cb1..aeff56a0e105 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -28,7 +28,6 @@ enum {
 	BTRFS_INODE_NEEDS_FULL_SYNC,
 	BTRFS_INODE_COPY_EVERYTHING,
 	BTRFS_INODE_IN_DELALLOC_LIST,
-	BTRFS_INODE_READDIO_NEED_LOCK,
 	BTRFS_INODE_HAS_PROPS,
 	BTRFS_INODE_SNAPSHOT_FLUSH,
 };
@@ -313,23 +312,6 @@ struct btrfs_dio_private {
 	u8 csums[];
 };
 
-/*
- * Disable DIO read nolock optimization, so new dio readers will be forced
- * to grab i_mutex. It is used to avoid the endless truncate due to
- * nonlocked dio read.
- */
-static inline void btrfs_inode_block_unlocked_dio(struct btrfs_inode *inode)
-{
-	set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
-	smp_mb();
-}
-
-static inline void btrfs_inode_resume_unlocked_dio(struct btrfs_inode *inode)
-{
-	smp_mb__before_atomic();
-	clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
-}
-
 /* Array of bytes with variable length, hexadecimal format 0x1234 */
 #define CSUM_FMT				"0x%*phN"
 #define CSUM_FMT_VALUE(size, bytes)		size, bytes
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ebf97286b110..a975d2c61d68 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4752,10 +4752,7 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 
 		truncate_setsize(inode, newsize);
 
-		/* Disable nonlocked read DIO to avoid the endless truncate */
-		btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
 		inode_dio_wait(inode);
-		btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));
 
 		ret = btrfs_truncate(inode, newsize == oldsize);
 		if (ret && inode->i_nlink) {
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 7/7] btrfs: btrfs: split btrfs_direct_IO
  2020-05-22 12:38 [PATCH 0/7 v8] btrfs direct-io using iomap Goldwyn Rodrigues
                   ` (5 preceding siblings ...)
  2020-05-22 12:38 ` [PATCH 6/7] btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK Goldwyn Rodrigues
@ 2020-05-22 12:38 ` Goldwyn Rodrigues
  6 siblings, 0 replies; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-05-22 12:38 UTC (permalink / raw)
  To: linux-btrfs; +Cc: hch, dsterba, Christoph Hellwig, Goldwyn Rodrigues

From: Christoph Hellwig <hch@lst.de>

The read and write versions don't have anything in common except
for the call to iomap_dio_rw.  So split this function, and merge
each half into its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[rgoldwyn: reservation changes, check_direct_IO changes]
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |  3 ++
 fs/btrfs/file.c  | 88 ++++++++++++++++++++++++++++++++++++++++++------
 fs/btrfs/inode.c | 83 ++-------------------------------------------
 3 files changed, 83 insertions(+), 91 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1ed8a780930f..bc078e3e6684 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -28,6 +28,7 @@
 #include <linux/dynamic_debug.h>
 #include <linux/refcount.h>
 #include <linux/crc32c.h>
+#include <linux/iomap.h>
 #include "extent-io-tree.h"
 #include "extent_io.h"
 #include "extent_map.h"
@@ -2912,6 +2913,8 @@ void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
 extern const struct dentry_operations btrfs_dentry_operations;
 ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
+extern const struct iomap_ops btrfs_dio_iomap_ops;
+extern const struct iomap_dio_ops btrfs_dops;
 
 /* ioctl.c */
 long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index cc42f0752625..4fcfabed917b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1817,21 +1817,61 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
 	return num_written ? num_written : ret;
 }
 
-static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
+                               const struct iov_iter *iter, loff_t offset)
+{
+        unsigned int blocksize_mask = fs_info->sectorsize - 1;
+
+        if (offset & blocksize_mask)
+                return -EINVAL;
+
+        if (iov_iter_alignment(iter) & blocksize_mask)
+                return -EINVAL;
+
+	return 0;
+}
+
+static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
-	loff_t pos;
-	ssize_t written;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	loff_t pos = iocb->ki_pos;
+	ssize_t written = 0;
 	ssize_t written_buffered;
 	loff_t endbyte;
 	int err;
+	size_t count = 0;
+	bool relock = false;
 
-	written = btrfs_direct_IO(iocb, from);
+	if (check_direct_IO(fs_info, from, pos))
+		goto buffered;
+
+	count = iov_iter_count(from);
+	/*
+	 * If the write DIO is beyond the EOF, we need update
+	 * the isize, but it is protected by i_mutex. So we can
+	 * not unlock the i_mutex at this case.
+	 */
+	if (pos + count <= inode->i_size) {
+		inode_unlock(inode);
+		relock = true;
+	} else if (iocb->ki_flags & IOCB_NOWAIT) {
+		return -EAGAIN;
+	}
+
+	down_read(&BTRFS_I(inode)->dio_sem);
+	written = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dops,
+			is_sync_kiocb(iocb));
+	up_read(&BTRFS_I(inode)->dio_sem);
+
+	if (relock)
+		inode_lock(inode);
 
 	if (written < 0 || !iov_iter_count(from))
 		return written;
 
+buffered:
 	pos = iocb->ki_pos;
 	written_buffered = btrfs_buffered_write(iocb, from);
 	if (written_buffered < 0) {
@@ -1970,7 +2010,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
-		num_written = __btrfs_direct_write(iocb, from);
+		num_written = btrfs_direct_write(iocb, from);
 	} else {
 		num_written = btrfs_buffered_write(iocb, from);
 		if (num_written > 0)
@@ -3484,16 +3524,44 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
 	return generic_file_open(inode, filp);
 }
 
+static int check_direct_read(struct btrfs_fs_info *fs_info,
+                               const struct iov_iter *iter, loff_t offset)
+{
+	int ret;
+	int i, seg;
+
+	ret = check_direct_IO(fs_info, iter, offset);
+	if (ret < 0)
+		return ret;
+
+	for (seg = 0; seg < iter->nr_segs; seg++)
+		for (i = seg + 1; i < iter->nr_segs; i++)
+			if (iter->iov[seg].iov_base == iter->iov[i].iov_base)
+				return -EINVAL;
+	return 0;
+}
+
+static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	ssize_t ret;
+
+	if (check_direct_read(btrfs_sb(inode->i_sb), to, iocb->ki_pos))
+		return 0;
+
+	inode_lock_shared(inode);
+        ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dops,
+                        is_sync_kiocb(iocb));
+	inode_unlock_shared(inode);
+	return ret;
+}
+
 static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	ssize_t ret = 0;
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
-		struct inode *inode = file_inode(iocb->ki_filp);
-
-		inode_lock_shared(inode);
-		ret = btrfs_direct_IO(iocb, to);
-		inode_unlock_shared(inode);
+		ret = btrfs_direct_read(iocb, to);
 		if (ret < 0)
 			return ret;
 	}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a975d2c61d68..d9da33a0d83b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -30,7 +30,6 @@
 #include <linux/swap.h>
 #include <linux/migrate.h>
 #include <linux/sched/mm.h>
-#include <linux/iomap.h>
 #include <asm/unaligned.h>
 #include "misc.h"
 #include "ctree.h"
@@ -7764,93 +7763,15 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode,
 	return BLK_QC_T_NONE;
 }
 
-static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
-			       const struct iov_iter *iter, loff_t offset)
-{
-	int seg;
-	int i;
-	unsigned int blocksize_mask = fs_info->sectorsize - 1;
-	ssize_t retval = -EINVAL;
-
-	if (offset & blocksize_mask)
-		goto out;
-
-	if (iov_iter_alignment(iter) & blocksize_mask)
-		goto out;
-
-	/* If this is a write we don't need to check anymore */
-	if (iov_iter_rw(iter) != READ || !iter_is_iovec(iter))
-		return 0;
-	/*
-	 * Check to make sure we don't have duplicate iov_base's in this
-	 * iovec, if so return EINVAL, otherwise we'll get csum errors
-	 * when reading back.
-	 */
-	for (seg = 0; seg < iter->nr_segs; seg++) {
-		for (i = seg + 1; i < iter->nr_segs; i++) {
-			if (iter->iov[seg].iov_base == iter->iov[i].iov_base)
-				goto out;
-		}
-	}
-	retval = 0;
-out:
-	return retval;
-}
-
-static const struct iomap_ops btrfs_dio_iomap_ops = {
+const struct iomap_ops btrfs_dio_iomap_ops = {
 	.iomap_begin            = btrfs_dio_iomap_begin,
 	.iomap_end              = btrfs_dio_iomap_end,
 };
 
-static const struct iomap_dio_ops btrfs_dops = {
+const struct iomap_dio_ops btrfs_dops = {
 	.submit_io		= btrfs_submit_direct,
 };
 
-
-ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct extent_changeset *data_reserved = NULL;
-	loff_t offset = iocb->ki_pos;
-	size_t count = 0;
-	bool relock = false;
-	ssize_t ret;
-
-	if (check_direct_IO(fs_info, iter, offset))
-		return 0;
-
-	count = iov_iter_count(iter);
-	if (iov_iter_rw(iter) == WRITE) {
-		/*
-		 * If the write DIO is beyond the EOF, we need update
-		 * the isize, but it is protected by i_mutex. So we can
-		 * not unlock the i_mutex at this case.
-		 */
-		if (offset + count <= inode->i_size) {
-			inode_unlock(inode);
-			relock = true;
-		} else if (iocb->ki_flags & IOCB_NOWAIT) {
-			ret = -EAGAIN;
-			goto out;
-		}
-		down_read(&BTRFS_I(inode)->dio_sem);
-	}
-
-	ret = iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dops,
-			is_sync_kiocb(iocb));
-
-	if (iov_iter_rw(iter) == WRITE) {
-		up_read(&BTRFS_I(inode)->dio_sem);
-	}
-out:
-	if (relock)
-		inode_lock(inode);
-	extent_changeset_free(data_reserved);
-	return ret;
-}
-
 #define BTRFS_FIEMAP_FLAGS	(FIEMAP_FLAG_SYNC)
 
 static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/7] fs: Export generic_file_buffered_read()
  2020-05-22 12:38 ` [PATCH 1/7] fs: Export generic_file_buffered_read() Goldwyn Rodrigues
@ 2020-05-25 12:25   ` David Sterba
  0 siblings, 0 replies; 23+ messages in thread
From: David Sterba @ 2020-05-25 12:25 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-btrfs, hch, dsterba, Goldwyn Rodrigues, Johannes Thumshirn,
	Christoph Hellwig

On Fri, May 22, 2020 at 07:38:31AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Export generic_file_buffered_read() to be used to
> supplement incomplete direct reads.
> 
> While we are at it, correct the comments and variable names.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/linux/fs.h |  2 ++
>  mm/filemap.c       | 13 +++++++------
>  2 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 45cc10cdf6dd..366c533d30cd 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3124,6 +3124,8 @@ extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
>  extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
>  				    struct file *file_out, loff_t pos_out,
>  				    size_t *count, unsigned int flags);
> +extern ssize_t generic_file_buffered_read(struct kiocb *iocb,
> +		struct iov_iter *to, ssize_t already_read);
>  extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
>  extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
>  extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 23a051a7ef0f..27df1cf35eb4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1979,7 +1979,7 @@ static void shrink_readahead_size_eio(struct file_ra_state *ra)
>   * generic_file_buffered_read - generic file read routine
>   * @iocb:	the iocb to read
>   * @iter:	data destination
> - * @written:	already copied
> + * @copied:	already copied
>   *
>   * This is a generic file read routine, and uses the
>   * mapping->a_ops->readpage() function for the actual low-level stuff.
> @@ -1988,11 +1988,11 @@ static void shrink_readahead_size_eio(struct file_ra_state *ra)
>   * of the logic when it comes to error handling etc.
>   *
>   * Return:
> - * * total number of bytes copied, including those the were already @written
> + * * total number of bytes copied, including those that were @copied
>   * * negative error code if nothing was copied
>   */
> -static ssize_t generic_file_buffered_read(struct kiocb *iocb,
> -		struct iov_iter *iter, ssize_t written)
> +ssize_t generic_file_buffered_read(struct kiocb *iocb,
> +		struct iov_iter *iter, ssize_t copied)
>  {
>  	struct file *filp = iocb->ki_filp;
>  	struct address_space *mapping = filp->f_mapping;
> @@ -2133,7 +2133,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
>  		prev_offset = offset;
>  
>  		put_page(page);
> -		written += ret;
> +		copied += ret;
>  		if (!iov_iter_count(iter))
>  			goto out;
>  		if (ret < nr) {
> @@ -2241,8 +2241,9 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
>  
>  	*ppos = ((loff_t)index << PAGE_SHIFT) + offset;
>  	file_accessed(filp);
> -	return written ? written : error;
> +	return copied ? copied : error;
>  }
> +EXPORT_SYMBOL_GPL(generic_file_buffered_read);

The renamed variable causes build issues with block tree so I've
s/copied/written/ back as we want only the function export.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio
  2020-05-22 12:38 ` [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio Goldwyn Rodrigues
@ 2020-05-26 15:03   ` Johannes Thumshirn
  2020-05-26 16:44     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2020-05-26 15:03 UTC (permalink / raw)
  To: Goldwyn Rodrigues, linux-btrfs; +Cc: hch, dsterba, Goldwyn Rodrigues

Just as a heads up, this one gives me lot's of Page cache invalidation
failure prints from dio_warn_stale_pagecache() on btrfs/004 with 
current misc-next:

[   16.607545] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   16.609328] File: /mnt/scratch/next/p0/d0/d77/de2/d5c/dc9/fee PID: 766 Comm: fsstress
[   16.743572] BTRFS info (device zram1): disk space caching is enabled
[   16.744620] BTRFS info (device zram1): has skinny extents
[   16.747458] BTRFS info (device zram1): enabling ssd optimizations
[   18.303877] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   18.305476] File: /mnt/scratch/bgnoise/p0/d5/d53/d21/f27 PID: 2064 Comm: fsstress
[   18.768426] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   18.770074] File: /mnt/scratch/bgnoise/p0/d9/de/f15 PID: 2490 Comm: fsstress
[   18.916118] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   18.917843] File: /mnt/scratch/bgnoise/p0/f0 PID: 2694 Comm: fsstress
[   21.170384] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   21.172375] File: /mnt/scratch/bgnoise/p0/f3 PID: 4325 Comm: fsstress
[   21.812452] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   21.814232] File: /mnt/scratch/bgnoise/p0/fb PID: 5000 Comm: fsstress
[   21.826027] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   21.827741] File: /mnt/scratch/bgnoise/p0/fb PID: 5000 Comm: fsstress
[   22.127966] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   22.129413] File: /mnt/scratch/bgnoise/p0/df/d28/d26/f3b PID: 5196 Comm: fsstress
[   22.160542] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   22.161972] File: /mnt/scratch/bgnoise/p0/df/d10/d5f/f64 PID: 5196 Comm: fsstress
[   23.696400] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
[   23.698115] File: /mnt/scratch/bgnoise/p0/f0 PID: 6562 Comm: fsstress

I have no idea yet why but I'm investigating.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio
  2020-05-26 15:03   ` Johannes Thumshirn
@ 2020-05-26 16:44     ` Goldwyn Rodrigues
  2020-05-28 15:13       ` Filipe Manana
  0 siblings, 1 reply; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-05-26 16:44 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs, hch, dsterba

On 15:03 26/05, Johannes Thumshirn wrote:
> Just as a heads up, this one gives me lot's of Page cache invalidation
> failure prints from dio_warn_stale_pagecache() on btrfs/004 with 
> current misc-next:
> 
<snip>

> [   23.696400] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   23.698115] File: /mnt/scratch/bgnoise/p0/f0 PID: 6562 Comm: fsstress
> 
> I have no idea yet why but I'm investigating.

This is caused because we are trying to release a page when the extent
has locked the page and release page returns false. To minimize the
effect, I had proposed a patch [1] in v6. However, this created
more extent locking issues and so was dropped.

[1] https://patchwork.kernel.org/patch/11275063/

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio
  2020-05-26 16:44     ` Goldwyn Rodrigues
@ 2020-05-28 15:13       ` Filipe Manana
  2020-05-28 16:34         ` Goldwyn Rodrigues
  0 siblings, 1 reply; 23+ messages in thread
From: Filipe Manana @ 2020-05-28 15:13 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: Johannes Thumshirn, linux-btrfs, hch, dsterba

On Tue, May 26, 2020 at 5:47 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
> On 15:03 26/05, Johannes Thumshirn wrote:
> > Just as a heads up, this one gives me lot's of Page cache invalidation
> > failure prints from dio_warn_stale_pagecache() on btrfs/004 with
> > current misc-next:
> >
> <snip>
>
> > [   23.696400] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> > [   23.698115] File: /mnt/scratch/bgnoise/p0/f0 PID: 6562 Comm: fsstress
> >
> > I have no idea yet why but I'm investigating.
>
> This is caused because we are trying to release a page when the extent
> has locked the page and release page returns false.

By "we" you mean what exaclty, a direct IO read, a direct IO write?

And who locked the extent range before?

That seems alarming to me, specially if it's a direct IO write failing
to invalidate the page cache, since a subsequent buffered read could
get stale data (what's in the page cache), and not what the direct IO
write wrote.

Can you elaborate more on all those details?

Thanks.


> To minimize the
> effect, I had proposed a patch [1] in v6. However, this created
> more extent locking issues and so was dropped.
>
> [1] https://patchwork.kernel.org/patch/11275063/
>
> --
> Goldwyn



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/7] iomap: Remove lockdep_assert_held()
  2020-05-22 12:38 ` [PATCH 3/7] iomap: Remove lockdep_assert_held() Goldwyn Rodrigues
@ 2020-05-28 15:40   ` Darrick J. Wong
  2020-05-28 16:45     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2020-05-28 15:40 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: linux-btrfs, hch, dsterba, Goldwyn Rodrigues

On Fri, May 22, 2020 at 07:38:33AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Filesystems such as btrfs can perform direct I/O without holding the
> inode->i_rwsem in some of the cases like writing within i_size.
> So, remove the check for lockdep_assert_held() in iomap_dio_rw()
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/iomap/direct-io.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f88ba6e7f6af..e4addfc58107 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -416,8 +416,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	struct blk_plug plug;
>  	struct iomap_dio *dio;
>  
> -	lockdep_assert_held(&inode->i_rwsem);
> -

I could've sworn that I saw a reply from Dave asking to hoist this check
into all the /other/ iomap_dio_rw callers, but I can't find it and maybe
I just dreamed the whole thing.

Also, please cc fsdevel any time you make change to fs/iomap/, even if
we've already reviewed the patches.

--D

>  	if (!count)
>  		return 0;
>  
> -- 
> 2.25.0
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio
  2020-05-28 15:13       ` Filipe Manana
@ 2020-05-28 16:34         ` Goldwyn Rodrigues
  2020-05-28 16:45           ` Filipe Manana
  0 siblings, 1 reply; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-05-28 16:34 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Johannes Thumshirn, linux-btrfs, hch, dsterba

On 16:13 28/05, Filipe Manana wrote:
> On Tue, May 26, 2020 at 5:47 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> >
> > On 15:03 26/05, Johannes Thumshirn wrote:
> > > Just as a heads up, this one gives me lot's of Page cache invalidation
> > > failure prints from dio_warn_stale_pagecache() on btrfs/004 with
> > > current misc-next:
> > >
> > <snip>
> >
> > > [   23.696400] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> > > [   23.698115] File: /mnt/scratch/bgnoise/p0/f0 PID: 6562 Comm: fsstress
> > >
> > > I have no idea yet why but I'm investigating.
> >
> > This is caused because we are trying to release a page when the extent
> > has locked the page and release page returns false.
> 
> By "we" you mean what exaclty, a direct IO read, a direct IO write?
> 
> And who locked the extent range before?

This is usually locked by a previous buffered write or read.

> 
> That seems alarming to me, specially if it's a direct IO write failing
> to invalidate the page cache, since a subsequent buffered read could
> get stale data (what's in the page cache), and not what the direct IO
> write wrote.
> 
> Can you elaborate more on all those details?

The origin of the message is when iomap_dio_rw() tries to invalidate the
inode pages, but fails and calls dio_warn_stale_pagecache().

In the vanilla code, generic_file_direct_write() aborts direct writes
and returns 0 so that it may fallback to buffered I/O. Perhaps this
should be changed in iomap_dio_rw() as well. I will write a patch to
accomodate that.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio
  2020-05-28 16:34         ` Goldwyn Rodrigues
@ 2020-05-28 16:45           ` Filipe Manana
  2020-05-28 18:38             ` Goldwyn Rodrigues
  0 siblings, 1 reply; 23+ messages in thread
From: Filipe Manana @ 2020-05-28 16:45 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: Johannes Thumshirn, linux-btrfs, hch, dsterba

On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
> On 16:13 28/05, Filipe Manana wrote:
> > On Tue, May 26, 2020 at 5:47 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > >
> > > On 15:03 26/05, Johannes Thumshirn wrote:
> > > > Just as a heads up, this one gives me lot's of Page cache invalidation
> > > > failure prints from dio_warn_stale_pagecache() on btrfs/004 with
> > > > current misc-next:
> > > >
> > > <snip>
> > >
> > > > [   23.696400] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> > > > [   23.698115] File: /mnt/scratch/bgnoise/p0/f0 PID: 6562 Comm: fsstress
> > > >
> > > > I have no idea yet why but I'm investigating.
> > >
> > > This is caused because we are trying to release a page when the extent
> > > has locked the page and release page returns false.
> >
> > By "we" you mean what exaclty, a direct IO read, a direct IO write?
> >
> > And who locked the extent range before?
>
> This is usually locked by a previous buffered write or read.

A previous buffered write/read that has already finished or is still
in progress?

Because if it has finished we're not supposed to have the file range
locked anymore.

>
> >
> > That seems alarming to me, specially if it's a direct IO write failing
> > to invalidate the page cache, since a subsequent buffered read could
> > get stale data (what's in the page cache), and not what the direct IO
> > write wrote.
> >
> > Can you elaborate more on all those details?
>
> The origin of the message is when iomap_dio_rw() tries to invalidate the
> inode pages, but fails and calls dio_warn_stale_pagecache().
>
> In the vanilla code, generic_file_direct_write() aborts direct writes
> and returns 0 so that it may fallback to buffered I/O. Perhaps this
> should be changed in iomap_dio_rw() as well. I will write a patch to
> accomodate that.

On vanilla we have no problems of mixing buffered and direct
operations as long as they are done sequentially at least.
And even if done concurrently we take several measures to ensure that
are no surprises (locking ranges, waiting for any ordered extents in
progress, etc).

Thanks.

>
> --
> Goldwyn



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/7] iomap: Remove lockdep_assert_held()
  2020-05-28 15:40   ` Darrick J. Wong
@ 2020-05-28 16:45     ` Goldwyn Rodrigues
  0 siblings, 0 replies; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-05-28 16:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-btrfs, hch, dsterba

On  8:40 28/05, Darrick J. Wong wrote:
> On Fri, May 22, 2020 at 07:38:33AM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > Filesystems such as btrfs can perform direct I/O without holding the
> > inode->i_rwsem in some of the cases like writing within i_size.
> > So, remove the check for lockdep_assert_held() in iomap_dio_rw()
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/iomap/direct-io.c | 2 --
> >  1 file changed, 2 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index f88ba6e7f6af..e4addfc58107 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -416,8 +416,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  	struct blk_plug plug;
> >  	struct iomap_dio *dio;
> >  
> > -	lockdep_assert_held(&inode->i_rwsem);
> > -
> 
> I could've sworn that I saw a reply from Dave asking to hoist this check
> into all the /other/ iomap_dio_rw callers, but I can't find it and maybe
> I just dreamed the whole thing.

It did happen! However, hch mentioned it is not required [1].
I did promise him to remove the entire concept of dio_sem
locking in btrfs, and just rely on inode->i_mutex. It is still a work in
progress.

> 
> Also, please cc fsdevel any time you make change to fs/iomap/, even if
> we've already reviewed the patches.
> 

Yes, missed that. Sorry.

[1] https://www.spinics.net/lists/linux-btrfs/msg96016.html

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio
  2020-05-28 16:45           ` Filipe Manana
@ 2020-05-28 18:38             ` Goldwyn Rodrigues
  2020-06-03 11:35               ` Filipe Manana
  0 siblings, 1 reply; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-05-28 18:38 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Johannes Thumshirn, linux-btrfs, hch, dsterba

On 17:45 28/05, Filipe Manana wrote:
> On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > And who locked the extent range before?
> >
> > This is usually locked by a previous buffered write or read.
> 
> A previous buffered write/read that has already finished or is still
> in progress?
> 
> Because if it has finished we're not supposed to have the file range
> locked anymore.

In progress. Mixing buffered I/O with direct writes.

> 
> >
> > >
> > > That seems alarming to me, specially if it's a direct IO write failing
> > > to invalidate the page cache, since a subsequent buffered read could
> > > get stale data (what's in the page cache), and not what the direct IO
> > > write wrote.
> > >
> > > Can you elaborate more on all those details?
> >
> > The origin of the message is when iomap_dio_rw() tries to invalidate the
> > inode pages, but fails and calls dio_warn_stale_pagecache().
> >
> > In the vanilla code, generic_file_direct_write() aborts direct writes
> > and returns 0 so that it may fallback to buffered I/O. Perhaps this
> > should be changed in iomap_dio_rw() as well. I will write a patch to
> > accomodate that.
> 
> On vanilla we have no problems of mixing buffered and direct
> operations as long as they are done sequentially at least.
> And even if done concurrently we take several measures to ensure that
> are no surprises (locking ranges, waiting for any ordered extents in
> progress, etc).

Yes, it is because of the code in generic_file_direct_write(). Anyways,
I did some tests with the following patch, and it seems to work. I will
send a formal patch to so that it gets incorporated in iomap sequence as
well.

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index e4addfc58107..215315be6233 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
         */
        ret = invalidate_inode_pages2_range(mapping,
                        pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
-       if (ret)
-               dio_warn_stale_pagecache(iocb->ki_filp);
-       ret = 0;
+       /*
+        * If a page can not be invalidated, return 0 to fall back
+        * to buffered write.
+        */
+       if (ret) {
+               if (ret == -EBUSY)
+                       ret = 0;
+               goto out_free_dio;
+       }
 
        if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
            !inode->i_sb->s_dio_done_wq) {



-- 
Goldwyn

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio
  2020-05-28 18:38             ` Goldwyn Rodrigues
@ 2020-06-03 11:35               ` Filipe Manana
  2020-06-05 15:17                 ` Filipe Manana
  0 siblings, 1 reply; 23+ messages in thread
From: Filipe Manana @ 2020-06-03 11:35 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: Johannes Thumshirn, linux-btrfs, hch, dsterba

On Thu, May 28, 2020 at 7:38 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
> On 17:45 28/05, Filipe Manana wrote:
> > On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > > And who locked the extent range before?
> > >
> > > This is usually locked by a previous buffered write or read.
> >
> > A previous buffered write/read that has already finished or is still
> > in progress?
> >
> > Because if it has finished we're not supposed to have the file range
> > locked anymore.
>
> In progress. Mixing buffered I/O with direct writes.
>
> >
> > >
> > > >
> > > > That seems alarming to me, specially if it's a direct IO write failing
> > > > to invalidate the page cache, since a subsequent buffered read could
> > > > get stale data (what's in the page cache), and not what the direct IO
> > > > write wrote.
> > > >
> > > > Can you elaborate more on all those details?
> > >
> > > The origin of the message is when iomap_dio_rw() tries to invalidate the
> > > inode pages, but fails and calls dio_warn_stale_pagecache().
> > >
> > > In the vanilla code, generic_file_direct_write() aborts direct writes
> > > and returns 0 so that it may fallback to buffered I/O. Perhaps this
> > > should be changed in iomap_dio_rw() as well. I will write a patch to
> > > accomodate that.
> >
> > On vanilla we have no problems of mixing buffered and direct
> > operations as long as they are done sequentially at least.
> > And even if done concurrently we take several measures to ensure that
> > are no surprises (locking ranges, waiting for any ordered extents in
> > progress, etc).
>
> Yes, it is because of the code in generic_file_direct_write(). Anyways,
> I did some tests with the following patch, and it seems to work. I will
> send a formal patch to so that it gets incorporated in iomap sequence as
> well.
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index e4addfc58107..215315be6233 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>          */
>         ret = invalidate_inode_pages2_range(mapping,
>                         pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -       if (ret)
> -               dio_warn_stale_pagecache(iocb->ki_filp);
> -       ret = 0;
> +       /*
> +        * If a page can not be invalidated, return 0 to fall back
> +        * to buffered write.
> +        */
> +       if (ret) {
> +               if (ret == -EBUSY)
> +                       ret = 0;
> +               goto out_free_dio;
> +       }
>
>         if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
>             !inode->i_sb->s_dio_done_wq) {
>
>

Thanks. As I just replied on another thread for that patch, we
actually have a regression.
There's more than the annoying warning in dmesg, it also sets -EIO on
the inode's mapping and makes future fsyncs return that error despite
the fact that no actual errors or corruptions happened:

https://patchwork.kernel.org/patch/11576677/



>
> --
> Goldwyn



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio
  2020-06-03 11:35               ` Filipe Manana
@ 2020-06-05 15:17                 ` Filipe Manana
  2020-06-05 20:43                   ` Goldwyn Rodrigues
  0 siblings, 1 reply; 23+ messages in thread
From: Filipe Manana @ 2020-06-05 15:17 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: Johannes Thumshirn, linux-btrfs, hch, dsterba

On Wed, Jun 3, 2020 at 12:35 PM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Thu, May 28, 2020 at 7:38 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> >
> > On 17:45 28/05, Filipe Manana wrote:
> > > On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > > > And who locked the extent range before?
> > > >
> > > > This is usually locked by a previous buffered write or read.
> > >
> > > A previous buffered write/read that has already finished or is still
> > > in progress?
> > >
> > > Because if it has finished we're not supposed to have the file range
> > > locked anymore.
> >
> > In progress. Mixing buffered I/O with direct writes.
> >
> > >
> > > >
> > > > >
> > > > > That seems alarming to me, specially if it's a direct IO write failing
> > > > > to invalidate the page cache, since a subsequent buffered read could
> > > > > get stale data (what's in the page cache), and not what the direct IO
> > > > > write wrote.
> > > > >
> > > > > Can you elaborate more on all those details?
> > > >
> > > > The origin of the message is when iomap_dio_rw() tries to invalidate the
> > > > inode pages, but fails and calls dio_warn_stale_pagecache().
> > > >
> > > > In the vanilla code, generic_file_direct_write() aborts direct writes
> > > > and returns 0 so that it may fallback to buffered I/O. Perhaps this
> > > > should be changed in iomap_dio_rw() as well. I will write a patch to
> > > > accomodate that.
> > >
> > > On vanilla we have no problems of mixing buffered and direct
> > > operations as long as they are done sequentially at least.
> > > And even if done concurrently we take several measures to ensure that
> > > are no surprises (locking ranges, waiting for any ordered extents in
> > > progress, etc).
> >
> > Yes, it is because of the code in generic_file_direct_write(). Anyways,
> > I did some tests with the following patch, and it seems to work. I will
> > send a formal patch to so that it gets incorporated in iomap sequence as
> > well.
> >
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index e4addfc58107..215315be6233 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >          */
> >         ret = invalidate_inode_pages2_range(mapping,
> >                         pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > -       if (ret)
> > -               dio_warn_stale_pagecache(iocb->ki_filp);
> > -       ret = 0;
> > +       /*
> > +        * If a page can not be invalidated, return 0 to fall back
> > +        * to buffered write.
> > +        */
> > +       if (ret) {
> > +               if (ret == -EBUSY)
> > +                       ret = 0;
> > +               goto out_free_dio;
> > +       }
> >
> >         if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> >             !inode->i_sb->s_dio_done_wq) {
> >
> >
>
> Thanks. As I just replied on another thread for that patch, we
> actually have a regression.
> There's more than the annoying warning in dmesg, it also sets -EIO on
> the inode's mapping and makes future fsyncs return that error despite
> the fact that no actual errors or corruptions happened:
>
> https://patchwork.kernel.org/patch/11576677/
>

There's also some deadlock/hang, I have triggered it twice today with
generic/113 on two different VMs:

[14621.297370] INFO: task kworker/1:117:15962 blocked for more than 120 seconds.
[14621.298491]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
[14621.299231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[14621.300523] kworker/1:117   D    0 15962      2 0x80004000
[14621.301558] Workqueue: dio/sdb iomap_dio_complete_work
[14621.302389] Call Trace:
[14621.302877]  __schedule+0x384/0xa30
[14621.303555]  schedule+0x33/0xe0
[14621.304167]  rwsem_down_write_slowpath+0x2c2/0x750
[14621.305121]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14621.306217]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14621.307113]  iomap_dio_complete+0x11b/0x260
[14621.307888]  ? aio_fsync_work+0x5b0/0x5b0
[14621.308585]  iomap_dio_complete_work+0x17/0x30
[14621.309476]  process_one_work+0x275/0x6b0
[14621.310275]  worker_thread+0x4f/0x3e0
[14621.310869]  ? process_one_work+0x6b0/0x6b0
[14621.311403]  kthread+0x12a/0x170
[14621.311819]  ? kthread_create_worker_on_cpu+0x70/0x70
[14621.312460]  ret_from_fork+0x3a/0x50
[14621.312983] INFO: task kworker/1:199:16063 blocked for more than 120 seconds.
[14621.313921]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
[14621.314680] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[14621.315724] kworker/1:199   D    0 16063      2 0x80004000
[14621.316445] Workqueue: dio/sdb iomap_dio_complete_work
[14621.317101] Call Trace:
[14621.317437]  __schedule+0x384/0xa30
[14621.317928]  schedule+0x33/0xe0
[14621.318339]  rwsem_down_write_slowpath+0x2c2/0x750
[14621.318981]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14621.319609]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14621.320203]  iomap_dio_complete+0x11b/0x260
[14621.320721]  ? aio_fsync_work+0x5b0/0x5b0
[14621.321249]  iomap_dio_complete_work+0x17/0x30
[14621.321844]  process_one_work+0x275/0x6b0
[14621.322376]  worker_thread+0x4f/0x3e0
[14621.322871]  ? process_one_work+0x6b0/0x6b0
[14621.323408]  kthread+0x12a/0x170
[14621.323827]  ? kthread_create_worker_on_cpu+0x70/0x70
[14621.324473]  ret_from_fork+0x3a/0x50
[14621.324983] INFO: task aio-stress:16274 blocked for more than 120 seconds.
[14621.325896]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
[14621.326579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[14621.327580] aio-stress      D    0 16274  14855 0x00004000
[14621.328280] Call Trace:
[14621.328602]  __schedule+0x384/0xa30
[14621.329056]  schedule+0x33/0xe0
[14621.329478]  rwsem_down_write_slowpath+0x2c2/0x750
[14621.330118]  ? btrfs_sync_file+0x219/0x4d0 [btrfs]
[14621.330747]  btrfs_sync_file+0x219/0x4d0 [btrfs]
[14621.331346]  iomap_dio_complete+0x11b/0x260
[14621.331886]  iomap_dio_rw+0x3bc/0x4c0
[14621.332372]  ? btrfs_file_write_iter+0x645/0x870 [btrfs]
[14621.333076]  btrfs_file_write_iter+0x645/0x870 [btrfs]
[14621.333749]  aio_write+0x148/0x1d0
[14621.334196]  ? lock_acquire+0xb1/0x4a0
[14621.334682]  ? __might_fault+0x3e/0x90
[14621.335172]  ? __fget_files+0x132/0x270
[14621.335668]  ? io_submit_one+0x946/0x1630
[14621.336184]  io_submit_one+0x946/0x1630
[14621.336680]  ? lock_acquire+0xb1/0x4a0
[14621.337175]  ? __might_fault+0x3e/0x90
[14621.337707]  ? __x64_sys_io_submit+0x9c/0x330
[14621.338269]  __x64_sys_io_submit+0x9c/0x330
[14621.338812]  ? do_syscall_64+0x5c/0x280
[14621.339303]  do_syscall_64+0x5c/0x280
[14621.339774]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
[14621.340416] RIP: 0033:0x7fb6cd395717
[14621.340875] Code: Bad RIP value.
[14621.341304] RSP: 002b:00007fb6bf7e1de8 EFLAGS: 00000202 ORIG_RAX:
00000000000000d1
[14621.342262] RAX: ffffffffffffffda RBX: 0000560d3d92ea60 RCX: 00007fb6cd395717
[14621.343180] RDX: 0000560d3d92ea60 RSI: 0000000000000008 RDI: 00007fb6cdb32000
[14621.344081] RBP: 0000000000000008 R08: 0000150e50ac6651 R09: 00000000003081a8
[14621.344981] R10: 00007fb6bf7e1df0 R11: 0000000000000202 R12: 0000560d3d8fe110
[14621.345897] R13: 00007fb6bf7e1e10 R14: 00007fb6bf7e1e00 R15: 0000560d3d8fe110
[14621.346820] INFO: lockdep is turned off.
[14742.125500] INFO: task kworker/1:117:15962 blocked for more than 241 seconds.
[14742.126456]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
[14742.127156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[14742.128156] kworker/1:117   D    0 15962      2 0x80004000
[14742.128875] Workqueue: dio/sdb iomap_dio_complete_work
[14742.129633] Call Trace:
[14742.130010]  __schedule+0x384/0xa30
[14742.130494]  schedule+0x33/0xe0
[14742.131068]  rwsem_down_write_slowpath+0x2c2/0x750
[14742.131956]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14742.132834]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14742.133712]  iomap_dio_complete+0x11b/0x260
[14742.134475]  ? aio_fsync_work+0x5b0/0x5b0
[14742.135205]  iomap_dio_complete_work+0x17/0x30
[14742.136018]  process_one_work+0x275/0x6b0
[14742.136677]  worker_thread+0x4f/0x3e0
[14742.137154]  ? process_one_work+0x6b0/0x6b0
[14742.137805]  kthread+0x12a/0x170
[14742.138236]  ? kthread_create_worker_on_cpu+0x70/0x70
[14742.138901]  ret_from_fork+0x3a/0x50
[14742.139389] INFO: task kworker/1:199:16063 blocked for more than 241 seconds.
[14742.140305]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
[14742.140998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[14742.142056] kworker/1:199   D    0 16063      2 0x80004000
[14742.142877] Workqueue: dio/sdb iomap_dio_complete_work
[14742.143397] Call Trace:
[14742.143654]  __schedule+0x384/0xa30
[14742.144017]  schedule+0x33/0xe0
[14742.144352]  rwsem_down_write_slowpath+0x2c2/0x750
[14742.144859]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14742.145386]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[14742.145863]  iomap_dio_complete+0x11b/0x260
[14742.146289]  ? aio_fsync_work+0x5b0/0x5b0
[14742.146701]  iomap_dio_complete_work+0x17/0x30
[14742.147168]  process_one_work+0x275/0x6b0
[14742.147579]  worker_thread+0x4f/0x3e0
[14742.147954]  ? process_one_work+0x6b0/0x6b0
[14742.148377]  kthread+0x12a/0x170
[14742.148722]  ? kthread_create_worker_on_cpu+0x70/0x70
[14742.149257]  ret_from_fork+0x3a/0x50
[14742.149671] INFO: task aio-stress:16274 blocked for more than 241 seconds.
[14742.150376]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
[14742.150948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[14742.151962] aio-stress      D    0 16274  14855 0x00004000
(...)

Thanks.

>
>
> >
> > --
> > Goldwyn
>
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio
  2020-06-05 15:17                 ` Filipe Manana
@ 2020-06-05 20:43                   ` Goldwyn Rodrigues
  2020-06-06  9:57                     ` Filipe Manana
  0 siblings, 1 reply; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-06-05 20:43 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Johannes Thumshirn, linux-btrfs, hch, dsterba

On 16:17 05/06, Filipe Manana wrote:
> On Wed, Jun 3, 2020 at 12:35 PM Filipe Manana <fdmanana@gmail.com> wrote:
> >
> > On Thu, May 28, 2020 at 7:38 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > >
> > > On 17:45 28/05, Filipe Manana wrote:
> > > > On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > > > > And who locked the extent range before?
> > > > >
> > > > > This is usually locked by a previous buffered write or read.
> > > >
> > > > A previous buffered write/read that has already finished or is still
> > > > in progress?
> > > >
> > > > Because if it has finished we're not supposed to have the file range
> > > > locked anymore.
> > >
> > > In progress. Mixing buffered I/O with direct writes.
> > >
> > > >
> > > > >
> > > > > >
> > > > > > That seems alarming to me, specially if it's a direct IO write failing
> > > > > > to invalidate the page cache, since a subsequent buffered read could
> > > > > > get stale data (what's in the page cache), and not what the direct IO
> > > > > > write wrote.
> > > > > >
> > > > > > Can you elaborate more on all those details?
> > > > >
> > > > > The origin of the message is when iomap_dio_rw() tries to invalidate the
> > > > > inode pages, but fails and calls dio_warn_stale_pagecache().
> > > > >
> > > > > In the vanilla code, generic_file_direct_write() aborts direct writes
> > > > > and returns 0 so that it may fallback to buffered I/O. Perhaps this
> > > > > should be changed in iomap_dio_rw() as well. I will write a patch to
> > > > > accomodate that.
> > > >
> > > > On vanilla we have no problems of mixing buffered and direct
> > > > operations as long as they are done sequentially at least.
> > > > And even if done concurrently we take several measures to ensure that
> > > > are no surprises (locking ranges, waiting for any ordered extents in
> > > > progress, etc).
> > >
> > > Yes, it is because of the code in generic_file_direct_write(). Anyways,
> > > I did some tests with the following patch, and it seems to work. I will
> > > send a formal patch to so that it gets incorporated in iomap sequence as
> > > well.
> > >
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index e4addfc58107..215315be6233 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >          */
> > >         ret = invalidate_inode_pages2_range(mapping,
> > >                         pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > -       if (ret)
> > > -               dio_warn_stale_pagecache(iocb->ki_filp);
> > > -       ret = 0;
> > > +       /*
> > > +        * If a page can not be invalidated, return 0 to fall back
> > > +        * to buffered write.
> > > +        */
> > > +       if (ret) {
> > > +               if (ret == -EBUSY)
> > > +                       ret = 0;
> > > +               goto out_free_dio;
> > > +       }
> > >
> > >         if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > >             !inode->i_sb->s_dio_done_wq) {
> > >
> > >
> >
> > Thanks. As I just replied on another thread for that patch, we
> > actually have a regression.
> > There's more than the annoying warning in dmesg, it also sets -EIO on
> > the inode's mapping and makes future fsyncs return that error despite
> > the fact that no actual errors or corruptions happened:
> >
> > https://patchwork.kernel.org/patch/11576677/
> >
> 
> There's also some deadlock/hang, I have triggered it twice today with
> generic/113 on two different VMs:
> 
> [14621.297370] INFO: task kworker/1:117:15962 blocked for more than 120 seconds.
> [14621.298491]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> [14621.299231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [14621.300523] kworker/1:117   D    0 15962      2 0x80004000
> [14621.301558] Workqueue: dio/sdb iomap_dio_complete_work
> [14621.302389] Call Trace:
> [14621.302877]  __schedule+0x384/0xa30
> [14621.303555]  schedule+0x33/0xe0
> [14621.304167]  rwsem_down_write_slowpath+0x2c2/0x750
> [14621.305121]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14621.306217]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14621.307113]  iomap_dio_complete+0x11b/0x260
> [14621.307888]  ? aio_fsync_work+0x5b0/0x5b0
> [14621.308585]  iomap_dio_complete_work+0x17/0x30
> [14621.309476]  process_one_work+0x275/0x6b0
> [14621.310275]  worker_thread+0x4f/0x3e0
> [14621.310869]  ? process_one_work+0x6b0/0x6b0
> [14621.311403]  kthread+0x12a/0x170
> [14621.311819]  ? kthread_create_worker_on_cpu+0x70/0x70
> [14621.312460]  ret_from_fork+0x3a/0x50
> [14621.312983] INFO: task kworker/1:199:16063 blocked for more than 120 seconds.
> [14621.313921]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> [14621.314680] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [14621.315724] kworker/1:199   D    0 16063      2 0x80004000
> [14621.316445] Workqueue: dio/sdb iomap_dio_complete_work
> [14621.317101] Call Trace:
> [14621.317437]  __schedule+0x384/0xa30
> [14621.317928]  schedule+0x33/0xe0
> [14621.318339]  rwsem_down_write_slowpath+0x2c2/0x750
> [14621.318981]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14621.319609]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14621.320203]  iomap_dio_complete+0x11b/0x260
> [14621.320721]  ? aio_fsync_work+0x5b0/0x5b0
> [14621.321249]  iomap_dio_complete_work+0x17/0x30
> [14621.321844]  process_one_work+0x275/0x6b0
> [14621.322376]  worker_thread+0x4f/0x3e0
> [14621.322871]  ? process_one_work+0x6b0/0x6b0
> [14621.323408]  kthread+0x12a/0x170
> [14621.323827]  ? kthread_create_worker_on_cpu+0x70/0x70
> [14621.324473]  ret_from_fork+0x3a/0x50
> [14621.324983] INFO: task aio-stress:16274 blocked for more than 120 seconds.
> [14621.325896]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> [14621.326579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [14621.327580] aio-stress      D    0 16274  14855 0x00004000
> [14621.328280] Call Trace:
> [14621.328602]  __schedule+0x384/0xa30
> [14621.329056]  schedule+0x33/0xe0
> [14621.329478]  rwsem_down_write_slowpath+0x2c2/0x750
> [14621.330118]  ? btrfs_sync_file+0x219/0x4d0 [btrfs]
> [14621.330747]  btrfs_sync_file+0x219/0x4d0 [btrfs]
> [14621.331346]  iomap_dio_complete+0x11b/0x260
> [14621.331886]  iomap_dio_rw+0x3bc/0x4c0
> [14621.332372]  ? btrfs_file_write_iter+0x645/0x870 [btrfs]
> [14621.333076]  btrfs_file_write_iter+0x645/0x870 [btrfs]
> [14621.333749]  aio_write+0x148/0x1d0
> [14621.334196]  ? lock_acquire+0xb1/0x4a0
> [14621.334682]  ? __might_fault+0x3e/0x90
> [14621.335172]  ? __fget_files+0x132/0x270
> [14621.335668]  ? io_submit_one+0x946/0x1630
> [14621.336184]  io_submit_one+0x946/0x1630
> [14621.336680]  ? lock_acquire+0xb1/0x4a0
> [14621.337175]  ? __might_fault+0x3e/0x90
> [14621.337707]  ? __x64_sys_io_submit+0x9c/0x330
> [14621.338269]  __x64_sys_io_submit+0x9c/0x330
> [14621.338812]  ? do_syscall_64+0x5c/0x280
> [14621.339303]  do_syscall_64+0x5c/0x280
> [14621.339774]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [14621.340416] RIP: 0033:0x7fb6cd395717
> [14621.340875] Code: Bad RIP value.
> [14621.341304] RSP: 002b:00007fb6bf7e1de8 EFLAGS: 00000202 ORIG_RAX:
> 00000000000000d1
> [14621.342262] RAX: ffffffffffffffda RBX: 0000560d3d92ea60 RCX: 00007fb6cd395717
> [14621.343180] RDX: 0000560d3d92ea60 RSI: 0000000000000008 RDI: 00007fb6cdb32000
> [14621.344081] RBP: 0000000000000008 R08: 0000150e50ac6651 R09: 00000000003081a8
> [14621.344981] R10: 00007fb6bf7e1df0 R11: 0000000000000202 R12: 0000560d3d8fe110
> [14621.345897] R13: 00007fb6bf7e1e10 R14: 00007fb6bf7e1e00 R15: 0000560d3d8fe110
> [14621.346820] INFO: lockdep is turned off.
> [14742.125500] INFO: task kworker/1:117:15962 blocked for more than 241 seconds.
> [14742.126456]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> [14742.127156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [14742.128156] kworker/1:117   D    0 15962      2 0x80004000
> [14742.128875] Workqueue: dio/sdb iomap_dio_complete_work
> [14742.129633] Call Trace:
> [14742.130010]  __schedule+0x384/0xa30
> [14742.130494]  schedule+0x33/0xe0
> [14742.131068]  rwsem_down_write_slowpath+0x2c2/0x750
> [14742.131956]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14742.132834]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14742.133712]  iomap_dio_complete+0x11b/0x260
> [14742.134475]  ? aio_fsync_work+0x5b0/0x5b0
> [14742.135205]  iomap_dio_complete_work+0x17/0x30
> [14742.136018]  process_one_work+0x275/0x6b0
> [14742.136677]  worker_thread+0x4f/0x3e0
> [14742.137154]  ? process_one_work+0x6b0/0x6b0
> [14742.137805]  kthread+0x12a/0x170
> [14742.138236]  ? kthread_create_worker_on_cpu+0x70/0x70
> [14742.138901]  ret_from_fork+0x3a/0x50
> [14742.139389] INFO: task kworker/1:199:16063 blocked for more than 241 seconds.
> [14742.140305]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> [14742.140998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [14742.142056] kworker/1:199   D    0 16063      2 0x80004000
> [14742.142877] Workqueue: dio/sdb iomap_dio_complete_work
> [14742.143397] Call Trace:
> [14742.143654]  __schedule+0x384/0xa30
> [14742.144017]  schedule+0x33/0xe0
> [14742.144352]  rwsem_down_write_slowpath+0x2c2/0x750
> [14742.144859]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14742.145386]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> [14742.145863]  iomap_dio_complete+0x11b/0x260
> [14742.146289]  ? aio_fsync_work+0x5b0/0x5b0
> [14742.146701]  iomap_dio_complete_work+0x17/0x30
> [14742.147168]  process_one_work+0x275/0x6b0
> [14742.147579]  worker_thread+0x4f/0x3e0
> [14742.147954]  ? process_one_work+0x6b0/0x6b0
> [14742.148377]  kthread+0x12a/0x170
> [14742.148722]  ? kthread_create_worker_on_cpu+0x70/0x70
> [14742.149257]  ret_from_fork+0x3a/0x50
> [14742.149671] INFO: task aio-stress:16274 blocked for more than 241 seconds.
> [14742.150376]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> [14742.150948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [14742.151962] aio-stress      D    0 16274  14855 0x00004000
> (...)
> 

Seems like the btrfs_inode->dio_sem. Would you have more information on
which process is holding on to it, or if there was a failure?

I will try to reproduce at my end. Thanks for testing.

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio
  2020-06-05 20:43                   ` Goldwyn Rodrigues
@ 2020-06-06  9:57                     ` Filipe Manana
  2020-06-08 15:39                       ` Goldwyn Rodrigues
  0 siblings, 1 reply; 23+ messages in thread
From: Filipe Manana @ 2020-06-06  9:57 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: Johannes Thumshirn, linux-btrfs, hch, dsterba

On Fri, Jun 5, 2020 at 9:43 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
> On 16:17 05/06, Filipe Manana wrote:
> > On Wed, Jun 3, 2020 at 12:35 PM Filipe Manana <fdmanana@gmail.com> wrote:
> > >
> > > On Thu, May 28, 2020 at 7:38 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > >
> > > > On 17:45 28/05, Filipe Manana wrote:
> > > > > On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > > > > > And who locked the extent range before?
> > > > > >
> > > > > > This is usually locked by a previous buffered write or read.
> > > > >
> > > > > A previous buffered write/read that has already finished or is still
> > > > > in progress?
> > > > >
> > > > > Because if it has finished we're not supposed to have the file range
> > > > > locked anymore.
> > > >
> > > > In progress. Mixing buffered I/O with direct writes.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > That seems alarming to me, specially if it's a direct IO write failing
> > > > > > > to invalidate the page cache, since a subsequent buffered read could
> > > > > > > get stale data (what's in the page cache), and not what the direct IO
> > > > > > > write wrote.
> > > > > > >
> > > > > > > Can you elaborate more on all those details?
> > > > > >
> > > > > > The origin of the message is when iomap_dio_rw() tries to invalidate the
> > > > > > inode pages, but fails and calls dio_warn_stale_pagecache().
> > > > > >
> > > > > > In the vanilla code, generic_file_direct_write() aborts direct writes
> > > > > > and returns 0 so that it may fallback to buffered I/O. Perhaps this
> > > > > > should be changed in iomap_dio_rw() as well. I will write a patch to
> > > > > > accomodate that.
> > > > >
> > > > > On vanilla we have no problems of mixing buffered and direct
> > > > > operations as long as they are done sequentially at least.
> > > > > And even if done concurrently we take several measures to ensure that
> > > > > are no surprises (locking ranges, waiting for any ordered extents in
> > > > > progress, etc).
> > > >
> > > > Yes, it is because of the code in generic_file_direct_write(). Anyways,
> > > > I did some tests with the following patch, and it seems to work. I will
> > > > send a formal patch to so that it gets incorporated in iomap sequence as
> > > > well.
> > > >
> > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > index e4addfc58107..215315be6233 100644
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > >          */
> > > >         ret = invalidate_inode_pages2_range(mapping,
> > > >                         pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > -       if (ret)
> > > > -               dio_warn_stale_pagecache(iocb->ki_filp);
> > > > -       ret = 0;
> > > > +       /*
> > > > +        * If a page can not be invalidated, return 0 to fall back
> > > > +        * to buffered write.
> > > > +        */
> > > > +       if (ret) {
> > > > +               if (ret == -EBUSY)
> > > > +                       ret = 0;
> > > > +               goto out_free_dio;
> > > > +       }
> > > >
> > > >         if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > > >             !inode->i_sb->s_dio_done_wq) {
> > > >
> > > >
> > >
> > > Thanks. As I just replied on another thread for that patch, we
> > > actually have a regression.
> > > There's more than the annoying warning in dmesg, it also sets -EIO on
> > > the inode's mapping and makes future fsyncs return that error despite
> > > the fact that no actual errors or corruptions happened:
> > >
> > > https://patchwork.kernel.org/patch/11576677/
> > >
> >
> > There's also some deadlock/hang, I have triggered it twice today with
> > generic/113 on two different VMs:
> >
> > [14621.297370] INFO: task kworker/1:117:15962 blocked for more than 120 seconds.
> > [14621.298491]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > [14621.299231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [14621.300523] kworker/1:117   D    0 15962      2 0x80004000
> > [14621.301558] Workqueue: dio/sdb iomap_dio_complete_work
> > [14621.302389] Call Trace:
> > [14621.302877]  __schedule+0x384/0xa30
> > [14621.303555]  schedule+0x33/0xe0
> > [14621.304167]  rwsem_down_write_slowpath+0x2c2/0x750
> > [14621.305121]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14621.306217]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14621.307113]  iomap_dio_complete+0x11b/0x260
> > [14621.307888]  ? aio_fsync_work+0x5b0/0x5b0
> > [14621.308585]  iomap_dio_complete_work+0x17/0x30
> > [14621.309476]  process_one_work+0x275/0x6b0
> > [14621.310275]  worker_thread+0x4f/0x3e0
> > [14621.310869]  ? process_one_work+0x6b0/0x6b0
> > [14621.311403]  kthread+0x12a/0x170
> > [14621.311819]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [14621.312460]  ret_from_fork+0x3a/0x50
> > [14621.312983] INFO: task kworker/1:199:16063 blocked for more than 120 seconds.
> > [14621.313921]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > [14621.314680] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [14621.315724] kworker/1:199   D    0 16063      2 0x80004000
> > [14621.316445] Workqueue: dio/sdb iomap_dio_complete_work
> > [14621.317101] Call Trace:
> > [14621.317437]  __schedule+0x384/0xa30
> > [14621.317928]  schedule+0x33/0xe0
> > [14621.318339]  rwsem_down_write_slowpath+0x2c2/0x750
> > [14621.318981]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14621.319609]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14621.320203]  iomap_dio_complete+0x11b/0x260
> > [14621.320721]  ? aio_fsync_work+0x5b0/0x5b0
> > [14621.321249]  iomap_dio_complete_work+0x17/0x30
> > [14621.321844]  process_one_work+0x275/0x6b0
> > [14621.322376]  worker_thread+0x4f/0x3e0
> > [14621.322871]  ? process_one_work+0x6b0/0x6b0
> > [14621.323408]  kthread+0x12a/0x170
> > [14621.323827]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [14621.324473]  ret_from_fork+0x3a/0x50
> > [14621.324983] INFO: task aio-stress:16274 blocked for more than 120 seconds.
> > [14621.325896]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > [14621.326579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [14621.327580] aio-stress      D    0 16274  14855 0x00004000
> > [14621.328280] Call Trace:
> > [14621.328602]  __schedule+0x384/0xa30
> > [14621.329056]  schedule+0x33/0xe0
> > [14621.329478]  rwsem_down_write_slowpath+0x2c2/0x750
> > [14621.330118]  ? btrfs_sync_file+0x219/0x4d0 [btrfs]
> > [14621.330747]  btrfs_sync_file+0x219/0x4d0 [btrfs]
> > [14621.331346]  iomap_dio_complete+0x11b/0x260
> > [14621.331886]  iomap_dio_rw+0x3bc/0x4c0
> > [14621.332372]  ? btrfs_file_write_iter+0x645/0x870 [btrfs]
> > [14621.333076]  btrfs_file_write_iter+0x645/0x870 [btrfs]
> > [14621.333749]  aio_write+0x148/0x1d0
> > [14621.334196]  ? lock_acquire+0xb1/0x4a0
> > [14621.334682]  ? __might_fault+0x3e/0x90
> > [14621.335172]  ? __fget_files+0x132/0x270
> > [14621.335668]  ? io_submit_one+0x946/0x1630
> > [14621.336184]  io_submit_one+0x946/0x1630
> > [14621.336680]  ? lock_acquire+0xb1/0x4a0
> > [14621.337175]  ? __might_fault+0x3e/0x90
> > [14621.337707]  ? __x64_sys_io_submit+0x9c/0x330
> > [14621.338269]  __x64_sys_io_submit+0x9c/0x330
> > [14621.338812]  ? do_syscall_64+0x5c/0x280
> > [14621.339303]  do_syscall_64+0x5c/0x280
> > [14621.339774]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [14621.340416] RIP: 0033:0x7fb6cd395717
> > [14621.340875] Code: Bad RIP value.
> > [14621.341304] RSP: 002b:00007fb6bf7e1de8 EFLAGS: 00000202 ORIG_RAX:
> > 00000000000000d1
> > [14621.342262] RAX: ffffffffffffffda RBX: 0000560d3d92ea60 RCX: 00007fb6cd395717
> > [14621.343180] RDX: 0000560d3d92ea60 RSI: 0000000000000008 RDI: 00007fb6cdb32000
> > [14621.344081] RBP: 0000000000000008 R08: 0000150e50ac6651 R09: 00000000003081a8
> > [14621.344981] R10: 00007fb6bf7e1df0 R11: 0000000000000202 R12: 0000560d3d8fe110
> > [14621.345897] R13: 00007fb6bf7e1e10 R14: 00007fb6bf7e1e00 R15: 0000560d3d8fe110
> > [14621.346820] INFO: lockdep is turned off.
> > [14742.125500] INFO: task kworker/1:117:15962 blocked for more than 241 seconds.
> > [14742.126456]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > [14742.127156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [14742.128156] kworker/1:117   D    0 15962      2 0x80004000
> > [14742.128875] Workqueue: dio/sdb iomap_dio_complete_work
> > [14742.129633] Call Trace:
> > [14742.130010]  __schedule+0x384/0xa30
> > [14742.130494]  schedule+0x33/0xe0
> > [14742.131068]  rwsem_down_write_slowpath+0x2c2/0x750
> > [14742.131956]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14742.132834]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14742.133712]  iomap_dio_complete+0x11b/0x260
> > [14742.134475]  ? aio_fsync_work+0x5b0/0x5b0
> > [14742.135205]  iomap_dio_complete_work+0x17/0x30
> > [14742.136018]  process_one_work+0x275/0x6b0
> > [14742.136677]  worker_thread+0x4f/0x3e0
> > [14742.137154]  ? process_one_work+0x6b0/0x6b0
> > [14742.137805]  kthread+0x12a/0x170
> > [14742.138236]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [14742.138901]  ret_from_fork+0x3a/0x50
> > [14742.139389] INFO: task kworker/1:199:16063 blocked for more than 241 seconds.
> > [14742.140305]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > [14742.140998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [14742.142056] kworker/1:199   D    0 16063      2 0x80004000
> > [14742.142877] Workqueue: dio/sdb iomap_dio_complete_work
> > [14742.143397] Call Trace:
> > [14742.143654]  __schedule+0x384/0xa30
> > [14742.144017]  schedule+0x33/0xe0
> > [14742.144352]  rwsem_down_write_slowpath+0x2c2/0x750
> > [14742.144859]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14742.145386]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > [14742.145863]  iomap_dio_complete+0x11b/0x260
> > [14742.146289]  ? aio_fsync_work+0x5b0/0x5b0
> > [14742.146701]  iomap_dio_complete_work+0x17/0x30
> > [14742.147168]  process_one_work+0x275/0x6b0
> > [14742.147579]  worker_thread+0x4f/0x3e0
> > [14742.147954]  ? process_one_work+0x6b0/0x6b0
> > [14742.148377]  kthread+0x12a/0x170
> > [14742.148722]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [14742.149257]  ret_from_fork+0x3a/0x50
> > [14742.149671] INFO: task aio-stress:16274 blocked for more than 241 seconds.
> > [14742.150376]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > [14742.150948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [14742.151962] aio-stress      D    0 16274  14855 0x00004000
> > (...)
> >
>
> Seems like the btrfs_inode->dio_sem. Would you have more information on
> which process is holding on to it, or if there was a failure?

This happens 100% of the time.

At btrfs_direct_write():

down_read(&BTRFS_I(inode)->dio_sem);
written = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dops,
       is_sync_kiocb(iocb));
up_read(&BTRFS_I(inode)->dio_sem);

than through iomap_dio_rw(), we end up calling btrfs_sync_file(),
which tries to down_write() the dio_sem.

>
> I will try to reproduce at my end. Thanks for testing.

You should have no difficulty reproducing it.

And another lockdep warning:

[  422.202844] ============================================
[  422.204113] WARNING: possible recursive locking detected
[  422.205014] 5.7.0-rc7-btrfs-next-59 #1 Not tainted
[  422.205722] --------------------------------------------
[  422.206598] aio-stress/1665 is trying to acquire lock:
[  422.207776] ffff8fbc9a24d620
(&sb->s_type->i_mutex_key#14){++++}-{3:3}, at:
btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[  422.209538]
               but task is already holding lock:
[  422.210424] ffff8fbc9a24d620
(&sb->s_type->i_mutex_key#14){++++}-{3:3}, at:
btrfs_file_write_iter+0x80/0x870 [btrfs]
[  422.212075]
               other info that might help us debug this:
[  422.213089]  Possible unsafe locking scenario:

[  422.214141]        CPU0
[  422.214578]        ----
[  422.215098]   lock(&sb->s_type->i_mutex_key#14);
[  422.215990]   lock(&sb->s_type->i_mutex_key#14);
[  422.216940]
                *** DEADLOCK ***

[  422.218115]  May be due to missing lock nesting notation

[  422.219060] 2 locks held by aio-stress/1665:
[  422.219815]  #0: ffff8fbc9a24d620
(&sb->s_type->i_mutex_key#14){++++}-{3:3}, at:
btrfs_file_write_iter+0x80/0x870 [btrfs]
[  422.221516]  #1: ffff8fbc9a24d4a8 (&ei->dio_sem){++++}-{3:3}, at:
btrfs_file_write_iter+0x6b4/0x870 [btrfs]
[  422.223437]
               stack backtrace:
[  422.224335] CPU: 0 PID: 1665 Comm: aio-stress Not tainted
5.7.0-rc7-btrfs-next-59 #1
[  422.225960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[  422.227728] Call Trace:
[  422.228046]  dump_stack+0x87/0xcb
[  422.228667]  __lock_acquire+0x17d0/0x21a0
[  422.229545]  lock_acquire+0xb1/0x4a0
[  422.230272]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[  422.231250]  down_write+0x38/0x70
[  422.231821]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[  422.232502]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
[  422.233169]  iomap_dio_complete+0x11b/0x260
[  422.233838]  iomap_dio_rw+0x3bc/0x4c0
[  422.234404]  ? btrfs_file_write_iter+0x6d9/0x870 [btrfs]
[  422.235311]  btrfs_file_write_iter+0x6d9/0x870 [btrfs]
[  422.236250]  aio_write+0x148/0x1d0
[  422.236872]  ? lock_acquire+0xb1/0x4a0
[  422.237560]  ? __might_fault+0x3e/0x90
[  422.238291]  ? __might_fault+0x3e/0x90
[  422.239002]  ? io_submit_one+0x946/0x1630
[  422.239874]  io_submit_one+0x946/0x1630
[  422.240565]  ? lock_acquire+0xb1/0x4a0
[  422.241162]  ? __might_fault+0x3e/0x90
[  422.241711]  ? __x64_sys_io_submit+0x9c/0x330
[  422.242336]  __x64_sys_io_submit+0x9c/0x330
[  422.242943]  ? do_syscall_64+0x5c/0x280
[  422.243420]  do_syscall_64+0x5c/0x280
[  422.243878]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
[  422.244626] RIP: 0033:0x7f77e38aa717
[  422.245219] Code: 00 75 08 8b 47 0c 39 47 08 74 08 e9 c3 ff ff ff
0f 1f 00 31 c0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 d1 00 00
00 0f 05 <c3> 0f 1f 84 00 00 00 00 00 b8 d2 00 00 00 0f 05 c3 0f 1f 84
00 00
[  422.247827] RSP: 002b:00007f77d6cf8de8 EFLAGS: 00000202 ORIG_RAX:
00000000000000d1
[  422.248924] RAX: ffffffffffffffda RBX: 000055c78674c400 RCX: 00007f77e38aa717
[  422.249959] RDX: 000055c78674c400 RSI: 0000000000000008 RDI: 00007f77e4059000
[  422.250924] RBP: 0000000000000008 R08: 000000a0afa4d2c9 R09: 0000000000009e34
[  422.251958] R10: 00007f77d6cf8df0 R11: 0000000000000202 R12: 000055c786720f10
[  422.252989] R13: 00007f77d6cf8e10 R14: 00007f77d6cf8e00 R15: 000055c786720f10

Thanks.

>
> --
> Goldwyn



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio
  2020-06-06  9:57                     ` Filipe Manana
@ 2020-06-08 15:39                       ` Goldwyn Rodrigues
  0 siblings, 0 replies; 23+ messages in thread
From: Goldwyn Rodrigues @ 2020-06-08 15:39 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Johannes Thumshirn, linux-btrfs, hch, dsterba

On 10:57 06/06, Filipe Manana wrote:
> On Fri, Jun 5, 2020 at 9:43 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> >
> > On 16:17 05/06, Filipe Manana wrote:
> > > On Wed, Jun 3, 2020 at 12:35 PM Filipe Manana <fdmanana@gmail.com> wrote:
> > > >
> > > > On Thu, May 28, 2020 at 7:38 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > > >
> > > > > On 17:45 28/05, Filipe Manana wrote:
> > > > > > On Thu, May 28, 2020 at 5:34 PM Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> > > > > > > > And who locked the extent range before?
> > > > > > >
> > > > > > > This is usually locked by a previous buffered write or read.
> > > > > >
> > > > > > A previous buffered write/read that has already finished or is still
> > > > > > in progress?
> > > > > >
> > > > > > Because if it has finished we're not supposed to have the file range
> > > > > > locked anymore.
> > > > >
> > > > > In progress. Mixing buffered I/O with direct writes.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > That seems alarming to me, specially if it's a direct IO write failing
> > > > > > > > to invalidate the page cache, since a subsequent buffered read could
> > > > > > > > get stale data (what's in the page cache), and not what the direct IO
> > > > > > > > write wrote.
> > > > > > > >
> > > > > > > > Can you elaborate more on all those details?
> > > > > > >
> > > > > > > The origin of the message is when iomap_dio_rw() tries to invalidate the
> > > > > > > inode pages, but fails and calls dio_warn_stale_pagecache().
> > > > > > >
> > > > > > > In the vanilla code, generic_file_direct_write() aborts direct writes
> > > > > > > and returns 0 so that it may fallback to buffered I/O. Perhaps this
> > > > > > > should be changed in iomap_dio_rw() as well. I will write a patch to
> > > > > > > accomodate that.
> > > > > >
> > > > > > On vanilla we have no problems of mixing buffered and direct
> > > > > > operations as long as they are done sequentially at least.
> > > > > > And even if done concurrently we take several measures to ensure that
> > > > > > are no surprises (locking ranges, waiting for any ordered extents in
> > > > > > progress, etc).
> > > > >
> > > > > Yes, it is because of the code in generic_file_direct_write(). Anyways,
> > > > > I did some tests with the following patch, and it seems to work. I will
> > > > > send a formal patch to so that it gets incorporated in iomap sequence as
> > > > > well.
> > > > >
> > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > > index e4addfc58107..215315be6233 100644
> > > > > --- a/fs/iomap/direct-io.c
> > > > > +++ b/fs/iomap/direct-io.c
> > > > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > > >          */
> > > > >         ret = invalidate_inode_pages2_range(mapping,
> > > > >                         pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > > -       if (ret)
> > > > > -               dio_warn_stale_pagecache(iocb->ki_filp);
> > > > > -       ret = 0;
> > > > > +       /*
> > > > > +        * If a page can not be invalidated, return 0 to fall back
> > > > > +        * to buffered write.
> > > > > +        */
> > > > > +       if (ret) {
> > > > > +               if (ret == -EBUSY)
> > > > > +                       ret = 0;
> > > > > +               goto out_free_dio;
> > > > > +       }
> > > > >
> > > > >         if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > > > >             !inode->i_sb->s_dio_done_wq) {
> > > > >
> > > > >
> > > >
> > > > Thanks. As I just replied on another thread for that patch, we
> > > > actually have a regression.
> > > > There's more than the annoying warning in dmesg, it also sets -EIO on
> > > > the inode's mapping and makes future fsyncs return that error despite
> > > > the fact that no actual errors or corruptions happened:
> > > >
> > > > https://patchwork.kernel.org/patch/11576677/
> > > >
> > >
> > > There's also some deadlock/hang, I have triggered it twice today with
> > > generic/113 on two different VMs:
> > >
> > > [14621.297370] INFO: task kworker/1:117:15962 blocked for more than 120 seconds.
> > > [14621.298491]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > > [14621.299231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [14621.300523] kworker/1:117   D    0 15962      2 0x80004000
> > > [14621.301558] Workqueue: dio/sdb iomap_dio_complete_work
> > > [14621.302389] Call Trace:
> > > [14621.302877]  __schedule+0x384/0xa30
> > > [14621.303555]  schedule+0x33/0xe0
> > > [14621.304167]  rwsem_down_write_slowpath+0x2c2/0x750
> > > [14621.305121]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14621.306217]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14621.307113]  iomap_dio_complete+0x11b/0x260
> > > [14621.307888]  ? aio_fsync_work+0x5b0/0x5b0
> > > [14621.308585]  iomap_dio_complete_work+0x17/0x30
> > > [14621.309476]  process_one_work+0x275/0x6b0
> > > [14621.310275]  worker_thread+0x4f/0x3e0
> > > [14621.310869]  ? process_one_work+0x6b0/0x6b0
> > > [14621.311403]  kthread+0x12a/0x170
> > > [14621.311819]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [14621.312460]  ret_from_fork+0x3a/0x50
> > > [14621.312983] INFO: task kworker/1:199:16063 blocked for more than 120 seconds.
> > > [14621.313921]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > > [14621.314680] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [14621.315724] kworker/1:199   D    0 16063      2 0x80004000
> > > [14621.316445] Workqueue: dio/sdb iomap_dio_complete_work
> > > [14621.317101] Call Trace:
> > > [14621.317437]  __schedule+0x384/0xa30
> > > [14621.317928]  schedule+0x33/0xe0
> > > [14621.318339]  rwsem_down_write_slowpath+0x2c2/0x750
> > > [14621.318981]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14621.319609]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14621.320203]  iomap_dio_complete+0x11b/0x260
> > > [14621.320721]  ? aio_fsync_work+0x5b0/0x5b0
> > > [14621.321249]  iomap_dio_complete_work+0x17/0x30
> > > [14621.321844]  process_one_work+0x275/0x6b0
> > > [14621.322376]  worker_thread+0x4f/0x3e0
> > > [14621.322871]  ? process_one_work+0x6b0/0x6b0
> > > [14621.323408]  kthread+0x12a/0x170
> > > [14621.323827]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [14621.324473]  ret_from_fork+0x3a/0x50
> > > [14621.324983] INFO: task aio-stress:16274 blocked for more than 120 seconds.
> > > [14621.325896]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > > [14621.326579] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [14621.327580] aio-stress      D    0 16274  14855 0x00004000
> > > [14621.328280] Call Trace:
> > > [14621.328602]  __schedule+0x384/0xa30
> > > [14621.329056]  schedule+0x33/0xe0
> > > [14621.329478]  rwsem_down_write_slowpath+0x2c2/0x750
> > > [14621.330118]  ? btrfs_sync_file+0x219/0x4d0 [btrfs]
> > > [14621.330747]  btrfs_sync_file+0x219/0x4d0 [btrfs]
> > > [14621.331346]  iomap_dio_complete+0x11b/0x260
> > > [14621.331886]  iomap_dio_rw+0x3bc/0x4c0
> > > [14621.332372]  ? btrfs_file_write_iter+0x645/0x870 [btrfs]
> > > [14621.333076]  btrfs_file_write_iter+0x645/0x870 [btrfs]
> > > [14621.333749]  aio_write+0x148/0x1d0
> > > [14621.334196]  ? lock_acquire+0xb1/0x4a0
> > > [14621.334682]  ? __might_fault+0x3e/0x90
> > > [14621.335172]  ? __fget_files+0x132/0x270
> > > [14621.335668]  ? io_submit_one+0x946/0x1630
> > > [14621.336184]  io_submit_one+0x946/0x1630
> > > [14621.336680]  ? lock_acquire+0xb1/0x4a0
> > > [14621.337175]  ? __might_fault+0x3e/0x90
> > > [14621.337707]  ? __x64_sys_io_submit+0x9c/0x330
> > > [14621.338269]  __x64_sys_io_submit+0x9c/0x330
> > > [14621.338812]  ? do_syscall_64+0x5c/0x280
> > > [14621.339303]  do_syscall_64+0x5c/0x280
> > > [14621.339774]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > [14621.340416] RIP: 0033:0x7fb6cd395717
> > > [14621.340875] Code: Bad RIP value.
> > > [14621.341304] RSP: 002b:00007fb6bf7e1de8 EFLAGS: 00000202 ORIG_RAX:
> > > 00000000000000d1
> > > [14621.342262] RAX: ffffffffffffffda RBX: 0000560d3d92ea60 RCX: 00007fb6cd395717
> > > [14621.343180] RDX: 0000560d3d92ea60 RSI: 0000000000000008 RDI: 00007fb6cdb32000
> > > [14621.344081] RBP: 0000000000000008 R08: 0000150e50ac6651 R09: 00000000003081a8
> > > [14621.344981] R10: 00007fb6bf7e1df0 R11: 0000000000000202 R12: 0000560d3d8fe110
> > > [14621.345897] R13: 00007fb6bf7e1e10 R14: 00007fb6bf7e1e00 R15: 0000560d3d8fe110
> > > [14621.346820] INFO: lockdep is turned off.
> > > [14742.125500] INFO: task kworker/1:117:15962 blocked for more than 241 seconds.
> > > [14742.126456]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > > [14742.127156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [14742.128156] kworker/1:117   D    0 15962      2 0x80004000
> > > [14742.128875] Workqueue: dio/sdb iomap_dio_complete_work
> > > [14742.129633] Call Trace:
> > > [14742.130010]  __schedule+0x384/0xa30
> > > [14742.130494]  schedule+0x33/0xe0
> > > [14742.131068]  rwsem_down_write_slowpath+0x2c2/0x750
> > > [14742.131956]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14742.132834]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14742.133712]  iomap_dio_complete+0x11b/0x260
> > > [14742.134475]  ? aio_fsync_work+0x5b0/0x5b0
> > > [14742.135205]  iomap_dio_complete_work+0x17/0x30
> > > [14742.136018]  process_one_work+0x275/0x6b0
> > > [14742.136677]  worker_thread+0x4f/0x3e0
> > > [14742.137154]  ? process_one_work+0x6b0/0x6b0
> > > [14742.137805]  kthread+0x12a/0x170
> > > [14742.138236]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [14742.138901]  ret_from_fork+0x3a/0x50
> > > [14742.139389] INFO: task kworker/1:199:16063 blocked for more than 241 seconds.
> > > [14742.140305]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > > [14742.140998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [14742.142056] kworker/1:199   D    0 16063      2 0x80004000
> > > [14742.142877] Workqueue: dio/sdb iomap_dio_complete_work
> > > [14742.143397] Call Trace:
> > > [14742.143654]  __schedule+0x384/0xa30
> > > [14742.144017]  schedule+0x33/0xe0
> > > [14742.144352]  rwsem_down_write_slowpath+0x2c2/0x750
> > > [14742.144859]  ? btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14742.145386]  btrfs_sync_file+0x1fe/0x4d0 [btrfs]
> > > [14742.145863]  iomap_dio_complete+0x11b/0x260
> > > [14742.146289]  ? aio_fsync_work+0x5b0/0x5b0
> > > [14742.146701]  iomap_dio_complete_work+0x17/0x30
> > > [14742.147168]  process_one_work+0x275/0x6b0
> > > [14742.147579]  worker_thread+0x4f/0x3e0
> > > [14742.147954]  ? process_one_work+0x6b0/0x6b0
> > > [14742.148377]  kthread+0x12a/0x170
> > > [14742.148722]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [14742.149257]  ret_from_fork+0x3a/0x50
> > > [14742.149671] INFO: task aio-stress:16274 blocked for more than 241 seconds.
> > > [14742.150376]       Not tainted 5.7.0-rc7-btrfs-next-59 #1
> > > [14742.150948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [14742.151962] aio-stress      D    0 16274  14855 0x00004000
> > > (...)
> > >
> >
> > Seems like the btrfs_inode->dio_sem. Would you have more information on
> > which process is holding on to it, or if there was a failure?
> 
> This happens 100% of the time.
> 
> At btrfs_direct_write():
> 
> down_read(&BTRFS_I(inode)->dio_sem);
> written = iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dops,
>        is_sync_kiocb(iocb));
> up_read(&BTRFS_I(inode)->dio_sem);
> 
> than through iomap_dio_rw(), we end up calling btrfs_sync_file(),
> which tries to down_write() the dio_sem.

I see what is happening in the code. Also, we have generic_write_sync()
called  twice. I will work on this. Surprisingly generic/113 works just
fine for me.

Thanks,



-- 
Goldwyn

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/7] fs: Export generic_file_buffered_read()
  2019-11-15 16:16 btrfs direct-io using iomap Goldwyn Rodrigues
@ 2019-11-15 16:16 ` Goldwyn Rodrigues
  0 siblings, 0 replies; 23+ messages in thread
From: Goldwyn Rodrigues @ 2019-11-15 16:16 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-fsdevel, hch, darrick.wong, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Export generic_file_buffered_read() to be used to
supplement incomplete direct reads.

While we are at it, correct the comments and variable names.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 13 +++++++------
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e0d909d35763..5bc75cff3536 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3095,6 +3095,8 @@ extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
 extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 				    struct file *file_out, loff_t pos_out,
 				    size_t *count, unsigned int flags);
+extern ssize_t generic_file_buffered_read(struct kiocb *iocb,
+		struct iov_iter *to, ssize_t already_read);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 85b7d087eb45..6a1be2fcdfe7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1994,7 +1994,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * generic_file_buffered_read - generic file read routine
  * @iocb:	the iocb to read
  * @iter:	data destination
- * @written:	already copied
+ * @copied:	already copied
  *
  * This is a generic file read routine, and uses the
  * mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -2003,11 +2003,11 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  *
  * Return:
- * * total number of bytes copied, including those the were already @written
+ * * total number of bytes copied, including those the were @copied
  * * negative error code if nothing was copied
  */
-static ssize_t generic_file_buffered_read(struct kiocb *iocb,
-		struct iov_iter *iter, ssize_t written)
+ssize_t generic_file_buffered_read(struct kiocb *iocb,
+		struct iov_iter *iter, ssize_t copied)
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
@@ -2148,7 +2148,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		prev_offset = offset;
 
 		put_page(page);
-		written += ret;
+		copied += ret;
 		if (!iov_iter_count(iter))
 			goto out;
 		if (ret < nr) {
@@ -2256,8 +2256,9 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 	*ppos = ((loff_t)index << PAGE_SHIFT) + offset;
 	file_accessed(filp);
-	return written ? written : error;
+	return copied ? copied : error;
 }
+EXPORT_SYMBOL_GPL(generic_file_buffered_read);
 
 /**
  * generic_file_read_iter - generic filesystem read routine
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-06-08 15:39 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-22 12:38 [PATCH 0/7 v8] btrfs direct-io using iomap Goldwyn Rodrigues
2020-05-22 12:38 ` [PATCH 1/7] fs: Export generic_file_buffered_read() Goldwyn Rodrigues
2020-05-25 12:25   ` David Sterba
2020-05-22 12:38 ` [PATCH 2/7] iomap: add a filesystem hook for direct I/O bio submission Goldwyn Rodrigues
2020-05-22 12:38 ` [PATCH 3/7] iomap: Remove lockdep_assert_held() Goldwyn Rodrigues
2020-05-28 15:40   ` Darrick J. Wong
2020-05-28 16:45     ` Goldwyn Rodrigues
2020-05-22 12:38 ` [PATCH 4/7] btrfs: Switch to iomap_dio_rw() for dio Goldwyn Rodrigues
2020-05-26 15:03   ` Johannes Thumshirn
2020-05-26 16:44     ` Goldwyn Rodrigues
2020-05-28 15:13       ` Filipe Manana
2020-05-28 16:34         ` Goldwyn Rodrigues
2020-05-28 16:45           ` Filipe Manana
2020-05-28 18:38             ` Goldwyn Rodrigues
2020-06-03 11:35               ` Filipe Manana
2020-06-05 15:17                 ` Filipe Manana
2020-06-05 20:43                   ` Goldwyn Rodrigues
2020-06-06  9:57                     ` Filipe Manana
2020-06-08 15:39                       ` Goldwyn Rodrigues
2020-05-22 12:38 ` [PATCH 5/7] fs: Remove dio_end_io() Goldwyn Rodrigues
2020-05-22 12:38 ` [PATCH 6/7] btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK Goldwyn Rodrigues
2020-05-22 12:38 ` [PATCH 7/7] btrfs: btrfs: split btrfs_direct_IO Goldwyn Rodrigues
  -- strict thread matches above, loose matches on Subject: below --
2019-11-15 16:16 btrfs direct-io using iomap Goldwyn Rodrigues
2019-11-15 16:16 ` [PATCH 1/7] fs: Export generic_file_buffered_read() Goldwyn Rodrigues

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).