All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/15] btrfs: read repair/direct I/O improvements
@ 2020-03-09 21:32 Omar Sandoval
  2020-03-09 21:32 ` [PATCH 01/15] btrfs: fix error handling when submitting direct I/O bio Omar Sandoval
                   ` (17 more replies)
  0 siblings, 18 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

Hi,

This series includes several fixes, cleanups, and improvements to direct
I/O and read repair. It's preparation for adding read repair to my
RWF_ENCODED series [1], but it can go in independently.

Patches 1 and 2 are direct I/O error handling fixes. Patch 3 is a
buffered read repair fix. Patch 4 is a buffered read repair improvement.
Patches 5-9 are trivial cleanups. Patch 10 converts direct I/O to use
refcount_t, which would've helped catch the bug fixed by patch 1.
Patches 11-14 drastically simplify the direct I/O code. Patch 15 gets
unifies buffered and direct I/O read repair, which also makes direct I/O
repair actually do validation for large failed reads instead of
rewriting the whole thing.

Overall, this is net about -400 lines of code and actually makes direct
I/O more functional.

Note that this series causes btrfs/142 to fail. This is a bug in the
test, as it assumes that direct I/O doesn't do read validation. I'm
working on a fix for the test.

Christoph is cc'd for patch 3. The fix looks at the bio internals in a
way that I wasn't sure was recommended, although there is precedent in
the bcache code. I'd appreciate if Christoph acked that patch or
suggested a better approach.

This series is based on misc-next.

Thanks!

1: https://lore.kernel.org/linux-fsdevel/cover.1582930832.git.osandov@fb.com/

Omar Sandoval (15):
  btrfs: fix error handling when submitting direct I/O bio
  btrfs: fix double __endio_write_update_ordered in direct I/O
  btrfs: look at full bi_io_vec for repair decision
  btrfs: don't do repair validation for checksum errors
  btrfs: clarify btrfs_lookup_bio_sums documentation
  btrfs: rename __readpage_endio_check to check_data_csum
  btrfs: make btrfs_check_repairable() static
  btrfs: move btrfs_dio_private to inode.c
  btrfs: kill btrfs_dio_private->private
  btrfs: convert btrfs_dio_private->pending_bios to refcount_t
  btrfs: put direct I/O checksums in btrfs_dio_private instead of bio
  btrfs: get rid of one layer of bios in direct I/O
  btrfs: simplify direct I/O read repair
  btrfs: get rid of endio_repair_workers
  btrfs: unify buffered and direct I/O read repair

 fs/btrfs/btrfs_inode.h |  30 --
 fs/btrfs/ctree.h       |   1 -
 fs/btrfs/disk-io.c     |   8 +-
 fs/btrfs/disk-io.h     |   1 -
 fs/btrfs/extent_io.c   | 141 +++++----
 fs/btrfs/extent_io.h   |  19 +-
 fs/btrfs/file-item.c   |  11 +-
 fs/btrfs/inode.c       | 694 ++++++++++-------------------------------
 8 files changed, 260 insertions(+), 645 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 01/15] btrfs: fix error handling when submitting direct I/O bio
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-11 17:54   ` Josef Bacik
  2020-03-17 13:46   ` Nikolay Borisov
  2020-03-09 21:32 ` [PATCH 02/15] btrfs: fix double __endio_write_update_ordered in direct I/O Omar Sandoval
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

If we submit orig_bio in btrfs_submit_direct_hook(), we never increment
pending_bios. Then, if btrfs_submit_dio_bio() fails, we decrement
pending_bios to -1, and we never complete orig_bio. Fix it by
initializing pending_bios to 1 instead of incrementing later.

Fixing this exposes another bug: we put orig_bio prematurely and then
put it again from end_io. Fix it by not putting orig_bio.

After this change, pending_bios is really more of a reference count, but
I'll leave that cleanup separate to keep the fix small.

Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO")
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8a3bc19d83ff..d48a2010f24a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7948,7 +7948,6 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 
 	/* bio split */
 	ASSERT(geom.len <= INT_MAX);
-	atomic_inc(&dip->pending_bios);
 	do {
 		clone_len = min_t(int, submit_len, geom.len);
 
@@ -7998,7 +7997,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 	if (!status)
 		return 0;
 
-	bio_put(bio);
+	if (bio != orig_bio)
+		bio_put(bio);
 out_err:
 	dip->errors = 1;
 	/*
@@ -8039,7 +8039,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 	bio->bi_private = dip;
 	dip->orig_bio = bio;
 	dip->dio_bio = dio_bio;
-	atomic_set(&dip->pending_bios, 0);
+	atomic_set(&dip->pending_bios, 1);
 	io_bio = btrfs_io_bio(bio);
 	io_bio->logical = file_offset;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 02/15] btrfs: fix double __endio_write_update_ordered in direct I/O
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
  2020-03-09 21:32 ` [PATCH 01/15] btrfs: fix error handling when submitting direct I/O bio Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-10 16:30   ` Christoph Hellwig
  2020-03-17 14:04   ` Nikolay Borisov
  2020-03-09 21:32 ` [PATCH 03/15] btrfs: look at full bi_io_vec for repair decision Omar Sandoval
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

In btrfs_submit_direct(), if we fail to allocate the btrfs_dio_private,
we complete the ordered extent range. However, we don't mark that the
range doesn't need to be cleaned up from btrfs_direct_IO() until later.
Therefore, if we fail to allocate the btrfs_dio_private, we complete the
ordered extent range twice. We could fix this by updating
unsubmitted_oe_range earlier, but it's simpler to always clean up via
the bio once the btrfs_dio_private is allocated and leave it for
btrfs_direct_IO() before that.

Fixes: f28a49287817 ("Btrfs: fix leaking of ordered extents after direct IO write error")
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c | 92 ++++++++++++++----------------------------------
 1 file changed, 26 insertions(+), 66 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d48a2010f24a..8e986056be3c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7912,7 +7912,7 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 	return ret;
 }
 
-static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
+static void btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 {
 	struct inode *inode = dip->inode;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -7932,7 +7932,7 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 	ret = btrfs_get_io_geometry(fs_info, btrfs_op(orig_bio),
 				    start_sector << 9, submit_len, &geom);
 	if (ret)
-		return -EIO;
+		goto out_err;
 
 	if (geom.len >= submit_len) {
 		bio = orig_bio;
@@ -7995,7 +7995,7 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 submit:
 	status = btrfs_submit_dio_bio(bio, inode, file_offset, async_submit);
 	if (!status)
-		return 0;
+		return;
 
 	if (bio != orig_bio)
 		bio_put(bio);
@@ -8009,9 +8009,6 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 	 */
 	if (atomic_dec_and_test(&dip->pending_bios))
 		bio_io_error(dip->orig_bio);
-
-	/* bio_end_io() will handle error, so we needn't return it */
-	return 0;
 }
 
 static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
@@ -8021,14 +8018,24 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 	struct bio *bio = NULL;
 	struct btrfs_io_bio *io_bio;
 	bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
-	int ret = 0;
 
 	bio = btrfs_bio_clone(dio_bio);
 
 	dip = kzalloc(sizeof(*dip), GFP_NOFS);
 	if (!dip) {
-		ret = -ENOMEM;
-		goto free_ordered;
+		if (!write) {
+			unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
+				file_offset + dio_bio->bi_iter.bi_size - 1);
+		}
+
+		dio_bio->bi_status = BLK_STS_RESOURCE;
+		/*
+		 * Releases and cleans up our dio_bio, no need to bio_put() nor
+		 * bio_endio()/bio_io_error() against dio_bio.
+		 */
+		dio_end_io(dio_bio);
+		bio_put(bio);
+		return;
 	}
 
 	dip->private = dio_bio->bi_private;
@@ -8044,72 +8051,25 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 	io_bio->logical = file_offset;
 
 	if (write) {
-		bio->bi_end_io = btrfs_endio_direct_write;
-	} else {
-		bio->bi_end_io = btrfs_endio_direct_read;
-		dip->subio_endio = btrfs_subio_endio_read;
-	}
-
-	/*
-	 * Reset the range for unsubmitted ordered extents (to a 0 length range)
-	 * even if we fail to submit a bio, because in such case we do the
-	 * corresponding error handling below and it must not be done a second
-	 * time by btrfs_direct_IO().
-	 */
-	if (write) {
+		/*
+		 * At this point, the btrfs_dio_private is responsible for
+		 * cleaning up the ordered extents whether or not we submit any
+		 * bios.
+		 */
 		struct btrfs_dio_data *dio_data = current->journal_info;
 
 		dio_data->unsubmitted_oe_range_end = dip->logical_offset +
 			dip->bytes;
 		dio_data->unsubmitted_oe_range_start =
 			dio_data->unsubmitted_oe_range_end;
-	}
-
-	ret = btrfs_submit_direct_hook(dip);
-	if (!ret)
-		return;
-
-	btrfs_io_bio_free_csum(io_bio);
 
-free_ordered:
-	/*
-	 * If we arrived here it means either we failed to submit the dip
-	 * or we either failed to clone the dio_bio or failed to allocate the
-	 * dip. If we cloned the dio_bio and allocated the dip, we can just
-	 * call bio_endio against our io_bio so that we get proper resource
-	 * cleanup if we fail to submit the dip, otherwise, we must do the
-	 * same as btrfs_endio_direct_[write|read] because we can't call these
-	 * callbacks - they require an allocated dip and a clone of dio_bio.
-	 */
-	if (bio && dip) {
-		bio_io_error(bio);
-		/*
-		 * The end io callbacks free our dip, do the final put on bio
-		 * and all the cleanup and final put for dio_bio (through
-		 * dio_end_io()).
-		 */
-		dip = NULL;
-		bio = NULL;
+		bio->bi_end_io = btrfs_endio_direct_write;
 	} else {
-		if (write)
-			__endio_write_update_ordered(inode,
-						file_offset,
-						dio_bio->bi_iter.bi_size,
-						false);
-		else
-			unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
-			      file_offset + dio_bio->bi_iter.bi_size - 1);
-
-		dio_bio->bi_status = BLK_STS_IOERR;
-		/*
-		 * Releases and cleans up our dio_bio, no need to bio_put()
-		 * nor bio_endio()/bio_io_error() against dio_bio.
-		 */
-		dio_end_io(dio_bio);
+		bio->bi_end_io = btrfs_endio_direct_read;
+		dip->subio_endio = btrfs_subio_endio_read;
 	}
-	if (bio)
-		bio_put(bio);
-	kfree(dip);
+
+	btrfs_submit_direct_hook(dip);
 }
 
 static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 03/15] btrfs: look at full bi_io_vec for repair decision
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
  2020-03-09 21:32 ` [PATCH 01/15] btrfs: fix error handling when submitting direct I/O bio Omar Sandoval
  2020-03-09 21:32 ` [PATCH 02/15] btrfs: fix double __endio_write_update_ordered in direct I/O Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-10 16:33   ` Christoph Hellwig
  2020-03-17 14:38   ` Nikolay Borisov
  2020-03-09 21:32 ` [PATCH 04/15] btrfs: don't do repair validation for checksum errors Omar Sandoval
                   ` (14 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

Read repair does two things: it finds a good copy of data to return to
the reader, and it corrects the bad copy on disk. If a read of multiple
sectors has an I/O error, repair does an extra "validation" step that
issues a separate read for each sector. This allows us to find the exact
failing sectors and only rewrite those.

This heuristic is implemented in
bio_readpage_error()/btrfs_check_repairable() as:

	failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
	if (failed_bio_pages > 1)
		do validation

However, at this point, bi_iter may have already been advanced. This
means that we'll skip the validation step and rewrite the entire failed
read.

Fix it by getting the actual size from the biovec (which we can do
because this is only called for non-cloned bios, although that will
change in a later commit).

Fixes: 8a2ee44a371c ("btrfs: look at bi_size for repair decisions")
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/extent_io.c | 28 ++++++++++++++++++++++------
 fs/btrfs/extent_io.h |  5 +++--
 2 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 837262d54e28..279731bff0a8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2528,8 +2528,9 @@ int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
 	return 0;
 }
 
-bool btrfs_check_repairable(struct inode *inode, unsigned failed_bio_pages,
-			   struct io_failure_record *failrec, int failed_mirror)
+bool btrfs_check_repairable(struct inode *inode, bool need_validation,
+			    struct io_failure_record *failrec,
+			    int failed_mirror)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	int num_copies;
@@ -2552,7 +2553,7 @@ bool btrfs_check_repairable(struct inode *inode, unsigned failed_bio_pages,
 	 *	a) deliver good data to the caller
 	 *	b) correct the bad sectors on disk
 	 */
-	if (failed_bio_pages > 1) {
+	if (need_validation) {
 		/*
 		 * to fulfill b), we need to know the exact failing sectors, as
 		 * we don't want to rewrite any more than the failed ones. thus,
@@ -2638,11 +2639,13 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
+	bool need_validation = false;
+	u64 len;
+	int i;
 	struct bio *bio;
 	int read_mode = 0;
 	blk_status_t status;
 	int ret;
-	unsigned failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
 
 	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
@@ -2650,13 +2653,26 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	if (ret)
 		return ret;
 
-	if (!btrfs_check_repairable(inode, failed_bio_pages, failrec,
+	/*
+	 * We need to validate each sector individually if the I/O was for
+	 * multiple sectors.
+	 */
+	len = 0;
+	for (i = 0; i < failed_bio->bi_vcnt; i++) {
+		len += failed_bio->bi_io_vec[i].bv_len;
+		if (len > inode->i_sb->s_blocksize) {
+			need_validation = true;
+			break;
+		}
+	}
+
+	if (!btrfs_check_repairable(inode, need_validation, failrec,
 				    failed_mirror)) {
 		free_io_failure(failure_tree, tree, failrec);
 		return -EIO;
 	}
 
-	if (failed_bio_pages > 1)
+	if (need_validation)
 		read_mode |= REQ_FAILFAST_DEV;
 
 	phy_offset >>= inode->i_sb->s_blocksize_bits;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 234622101230..64e176995af2 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -312,8 +312,9 @@ struct io_failure_record {
 };
 
 
-bool btrfs_check_repairable(struct inode *inode, unsigned failed_bio_pages,
-			    struct io_failure_record *failrec, int fail_mirror);
+bool btrfs_check_repairable(struct inode *inode, bool need_validation,
+			    struct io_failure_record *failrec,
+			    int failed_mirror);
 struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
 				    struct io_failure_record *failrec,
 				    struct page *page, int pg_offset, int icsum,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 04/15] btrfs: don't do repair validation for checksum errors
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (2 preceding siblings ...)
  2020-03-09 21:32 ` [PATCH 03/15] btrfs: look at full bi_io_vec for repair decision Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-11 17:55   ` Josef Bacik
  2020-03-09 21:32 ` [PATCH 05/15] btrfs: clarify btrfs_lookup_bio_sums documentation Omar Sandoval
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

The purpose of the validation step is to distinguish between good and
bad sectors in a failed multi-sector read. If a multi-sector read
succeeded but some of those sectors had checksum errors, we don't need
to validate anything; we know the sectors with bad checksums need to be
repaired.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/extent_io.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 279731bff0a8..104374854cf1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2640,8 +2640,6 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	bool need_validation = false;
-	u64 len;
-	int i;
 	struct bio *bio;
 	int read_mode = 0;
 	blk_status_t status;
@@ -2654,15 +2652,19 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		return ret;
 
 	/*
-	 * We need to validate each sector individually if the I/O was for
-	 * multiple sectors.
+	 * If there was an I/O error and the I/O was for multiple sectors, we
+	 * need to validate each sector individually.
 	 */
-	len = 0;
-	for (i = 0; i < failed_bio->bi_vcnt; i++) {
-		len += failed_bio->bi_io_vec[i].bv_len;
-		if (len > inode->i_sb->s_blocksize) {
-			need_validation = true;
-			break;
+	if (failed_bio->bi_status != BLK_STS_OK) {
+		u64 len = 0;
+		int i;
+
+		for (i = 0; i < failed_bio->bi_vcnt; i++) {
+			len += failed_bio->bi_io_vec[i].bv_len;
+			if (len > inode->i_sb->s_blocksize) {
+				need_validation = true;
+				break;
+			}
 		}
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 05/15] btrfs: clarify btrfs_lookup_bio_sums documentation
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (3 preceding siblings ...)
  2020-03-09 21:32 ` [PATCH 04/15] btrfs: don't do repair validation for checksum errors Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-11 17:56   ` Josef Bacik
  2020-03-17 14:38   ` Nikolay Borisov
  2020-03-09 21:32 ` [PATCH 06/15] btrfs: rename __readpage_endio_check to check_data_csum Omar Sandoval
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

Fix a couple of issues in the btrfs_lookup_bio_sums documentation:

* The bio doesn't need to be a btrfs_io_bio if dst was provided. Move
  the declaration in the code to make that clear, too.
* dst must be large enough to hold nblocks * csum_size, not just
  csum_size.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/file-item.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 6c849e8fd5a1..fa9f4a92f74d 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -242,11 +242,13 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 /**
  * btrfs_lookup_bio_sums - Look up checksums for a bio.
  * @inode: inode that the bio is for.
- * @bio: bio embedded in btrfs_io_bio.
+ * @bio: bio to look up.
  * @offset: Unless (u64)-1, look up checksums for this offset in the file.
  *          If (u64)-1, use the page offsets from the bio instead.
- * @dst: Buffer of size btrfs_super_csum_size() used to return checksum. If
- *       NULL, the checksum is returned in btrfs_io_bio(bio)->csum instead.
+ * @dst: Buffer of size nblocks * btrfs_super_csum_size() used to return
+ *       checksum (nblocks = bio->bi_iter.bi_size / sectorsize). If NULL, the
+ *       checksum buffer is allocated and returned in btrfs_io_bio(bio)->csum
+ *       instead.
  *
  * Return: BLK_STS_RESOURCE if allocating memory fails, BLK_STS_OK otherwise.
  */
@@ -256,7 +258,6 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct bio_vec bvec;
 	struct bvec_iter iter;
-	struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
 	struct btrfs_csum_item *item = NULL;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_path *path;
@@ -277,6 +278,8 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
 
 	nblocks = bio->bi_iter.bi_size >> inode->i_sb->s_blocksize_bits;
 	if (!dst) {
+		struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
+
 		if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
 			btrfs_bio->csum = kmalloc_array(nblocks, csum_size,
 							GFP_NOFS);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 06/15] btrfs: rename __readpage_endio_check to check_data_csum
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (4 preceding siblings ...)
  2020-03-09 21:32 ` [PATCH 05/15] btrfs: clarify btrfs_lookup_bio_sums documentation Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-10 14:46   ` Johannes Thumshirn
                     ` (2 more replies)
  2020-03-09 21:32 ` [PATCH 07/15] btrfs: make btrfs_check_repairable() static Omar Sandoval
                   ` (11 subsequent siblings)
  17 siblings, 3 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

__readpage_endio_check() is also used from the direct I/O read code, so
give it a more descriptive name.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8e986056be3c..50476ae96552 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2725,10 +2725,9 @@ void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 	btrfs_queue_work(wq, &ordered_extent->work);
 }
 
-static int __readpage_endio_check(struct inode *inode,
-				  struct btrfs_io_bio *io_bio,
-				  int icsum, struct page *page,
-				  int pgoff, u64 start, size_t len)
+static int check_data_csum(struct inode *inode, struct btrfs_io_bio *io_bio,
+			   int icsum, struct page *page, int pgoff, u64 start,
+			   size_t len)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
@@ -2789,8 +2788,8 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	}
 
 	phy_offset >>= inode->i_sb->s_blocksize_bits;
-	return __readpage_endio_check(inode, io_bio, phy_offset, page, offset,
-				      start, (size_t)(end - start + 1));
+	return check_data_csum(inode, io_bio, phy_offset, page, offset, start,
+			       (size_t)(end - start + 1));
 }
 
 /*
@@ -7593,9 +7592,9 @@ static void btrfs_retry_endio(struct bio *bio)
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
-					     bvec->bv_offset, done->start,
-					     bvec->bv_len);
+		ret = check_data_csum(inode, io_bio, i, bvec->bv_page,
+				      bvec->bv_offset, done->start,
+				      bvec->bv_len);
 		if (!ret)
 			clean_io_failure(BTRFS_I(inode)->root->fs_info,
 					 failure_tree, io_tree, done->start,
@@ -7645,8 +7644,9 @@ static blk_status_t __btrfs_subio_endio_read(struct inode *inode,
 next_block:
 		if (uptodate) {
 			csum_pos = BTRFS_BYTES_TO_BLKS(fs_info, offset);
-			ret = __readpage_endio_check(inode, io_bio, csum_pos,
-					bvec.bv_page, pgoff, start, sectorsize);
+			ret = check_data_csum(inode, io_bio, csum_pos,
+					      bvec.bv_page, pgoff, start,
+					      sectorsize);
 			if (likely(!ret))
 				goto next;
 		}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 07/15] btrfs: make btrfs_check_repairable() static
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (5 preceding siblings ...)
  2020-03-09 21:32 ` [PATCH 06/15] btrfs: rename __readpage_endio_check to check_data_csum Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-10 14:53   ` Johannes Thumshirn
                     ` (2 more replies)
  2020-03-09 21:32 ` [PATCH 08/15] btrfs: move btrfs_dio_private to inode.c Omar Sandoval
                   ` (10 subsequent siblings)
  17 siblings, 3 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

Since its introduction in commit 2fe6303e7cd0 ("Btrfs: split
bio_readpage_error into several functions"), btrfs_check_repairable()
has only been used from extent_io.c where it is defined.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/extent_io.c | 7 ++++---
 fs/btrfs/extent_io.h | 3 ---
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 104374854cf1..aee35d431f91 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2528,9 +2528,10 @@ int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
 	return 0;
 }
 
-bool btrfs_check_repairable(struct inode *inode, bool need_validation,
-			    struct io_failure_record *failrec,
-			    int failed_mirror)
+static bool btrfs_check_repairable(struct inode *inode,
+				   bool need_validation,
+				   struct io_failure_record *failrec,
+				   int failed_mirror)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	int num_copies;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 64e176995af2..11341a430007 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -312,9 +312,6 @@ struct io_failure_record {
 };
 
 
-bool btrfs_check_repairable(struct inode *inode, bool need_validation,
-			    struct io_failure_record *failrec,
-			    int failed_mirror);
 struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
 				    struct io_failure_record *failrec,
 				    struct page *page, int pg_offset, int icsum,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 08/15] btrfs: move btrfs_dio_private to inode.c
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (6 preceding siblings ...)
  2020-03-09 21:32 ` [PATCH 07/15] btrfs: make btrfs_check_repairable() static Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-10 14:56   ` Johannes Thumshirn
                     ` (2 more replies)
  2020-03-09 21:32 ` [PATCH 09/15] btrfs: kill btrfs_dio_private->private Omar Sandoval
                   ` (9 subsequent siblings)
  17 siblings, 3 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

This hasn't been needed outside of inode.c since commit 23ea8e5a0767
("Btrfs: load checksum data once when submitting a direct read io").

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/btrfs_inode.h | 30 ------------------------------
 fs/btrfs/inode.c       | 30 ++++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 27a1fefce508..ade5c6adec06 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -293,36 +293,6 @@ static inline int btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
 	return ret;
 }
 
-#define BTRFS_DIO_ORIG_BIO_SUBMITTED	0x1
-
-struct btrfs_dio_private {
-	struct inode *inode;
-	unsigned long flags;
-	u64 logical_offset;
-	u64 disk_bytenr;
-	u64 bytes;
-	void *private;
-
-	/* number of bios pending for this dio */
-	atomic_t pending_bios;
-
-	/* IO errors */
-	int errors;
-
-	/* orig_bio is our btrfs_io_bio */
-	struct bio *orig_bio;
-
-	/* dio_bio came from fs/direct-io.c */
-	struct bio *dio_bio;
-
-	/*
-	 * The original bio may be split to several sub-bios, this is
-	 * done during endio of sub-bios
-	 */
-	blk_status_t (*subio_endio)(struct inode *, struct btrfs_io_bio *,
-			blk_status_t);
-};
-
 /*
  * Disable DIO read nolock optimization, so new dio readers will be forced
  * to grab i_mutex. It is used to avoid the endless truncate due to
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 50476ae96552..9d3a275ef253 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -54,6 +54,36 @@ struct btrfs_iget_args {
 	struct btrfs_root *root;
 };
 
+#define BTRFS_DIO_ORIG_BIO_SUBMITTED	0x1
+
+struct btrfs_dio_private {
+	struct inode *inode;
+	unsigned long flags;
+	u64 logical_offset;
+	u64 disk_bytenr;
+	u64 bytes;
+	void *private;
+
+	/* number of bios pending for this dio */
+	atomic_t pending_bios;
+
+	/* IO errors */
+	int errors;
+
+	/* orig_bio is our btrfs_io_bio */
+	struct bio *orig_bio;
+
+	/* dio_bio came from fs/direct-io.c */
+	struct bio *dio_bio;
+
+	/*
+	 * The original bio may be split to several sub-bios, this is
+	 * done during endio of sub-bios
+	 */
+	blk_status_t (*subio_endio)(struct inode *, struct btrfs_io_bio *,
+			blk_status_t);
+};
+
 struct btrfs_dio_data {
 	u64 reserve;
 	u64 unsubmitted_oe_range_start;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 09/15] btrfs: kill btrfs_dio_private->private
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (7 preceding siblings ...)
  2020-03-09 21:32 ` [PATCH 08/15] btrfs: move btrfs_dio_private to inode.c Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-11 17:59   ` Josef Bacik
  2020-03-17 14:54   ` Nikolay Borisov
  2020-03-09 21:32 ` [PATCH 10/15] btrfs: convert btrfs_dio_private->pending_bios to refcount_t Omar Sandoval
                   ` (8 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

We haven't used this since commit 9be3395bcd4a ("Btrfs: use a btrfs
bioset instead of abusing bio internals").

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9d3a275ef253..8cc8741b3fec 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -62,7 +62,6 @@ struct btrfs_dio_private {
 	u64 logical_offset;
 	u64 disk_bytenr;
 	u64 bytes;
-	void *private;
 
 	/* number of bios pending for this dio */
 	atomic_t pending_bios;
@@ -8068,7 +8067,6 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		return;
 	}
 
-	dip->private = dio_bio->bi_private;
 	dip->inode = inode;
 	dip->logical_offset = file_offset;
 	dip->bytes = dio_bio->bi_iter.bi_size;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 10/15] btrfs: convert btrfs_dio_private->pending_bios to refcount_t
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (8 preceding siblings ...)
  2020-03-09 21:32 ` [PATCH 09/15] btrfs: kill btrfs_dio_private->private Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-11 18:00   ` Josef Bacik
  2020-03-17 15:10   ` Nikolay Borisov
  2020-03-09 21:32 ` [PATCH 11/15] btrfs: put direct I/O checksums in btrfs_dio_private instead of bio Omar Sandoval
                   ` (7 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

This is really a reference count now, so convert it to refcount_t and
rename it to refs.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8cc8741b3fec..a7fb0ba8cde4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -63,8 +63,11 @@ struct btrfs_dio_private {
 	u64 disk_bytenr;
 	u64 bytes;
 
-	/* number of bios pending for this dio */
-	atomic_t pending_bios;
+	/*
+	 * References to this structure. There is one reference per in-flight
+	 * bio plus one while we're still setting up.
+	 */
+	refcount_t refs;
 
 	/* IO errors */
 	int errors;
@@ -7849,7 +7852,7 @@ static void btrfs_end_dio_bio(struct bio *bio)
 	}
 
 	/* if there are more bios still pending for this dio, just exit */
-	if (!atomic_dec_and_test(&dip->pending_bios))
+	if (!refcount_dec_and_test(&dip->refs))
 		goto out;
 
 	if (dip->errors) {
@@ -8001,13 +8004,13 @@ static void btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 		 * count. Otherwise, the dip might get freed before we're
 		 * done setting it up.
 		 */
-		atomic_inc(&dip->pending_bios);
+		refcount_inc(&dip->refs);
 
 		status = btrfs_submit_dio_bio(bio, inode, file_offset,
 						async_submit);
 		if (status) {
 			bio_put(bio);
-			atomic_dec(&dip->pending_bios);
+			refcount_dec(&dip->refs);
 			goto out_err;
 		}
 
@@ -8036,7 +8039,7 @@ static void btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 	 * atomic operations with a return value are fully ordered as per
 	 * atomic_t.txt
 	 */
-	if (atomic_dec_and_test(&dip->pending_bios))
+	if (refcount_dec_and_test(&dip->refs))
 		bio_io_error(dip->orig_bio);
 }
 
@@ -8074,7 +8077,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 	bio->bi_private = dip;
 	dip->orig_bio = bio;
 	dip->dio_bio = dio_bio;
-	atomic_set(&dip->pending_bios, 1);
+	refcount_set(&dip->refs, 1);
 	io_bio = btrfs_io_bio(bio);
 	io_bio->logical = file_offset;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 11/15] btrfs: put direct I/O checksums in btrfs_dio_private instead of bio
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (9 preceding siblings ...)
  2020-03-09 21:32 ` [PATCH 10/15] btrfs: convert btrfs_dio_private->pending_bios to refcount_t Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-11 18:04   ` Josef Bacik
                     ` (2 more replies)
  2020-03-09 21:32 ` [PATCH 12/15] btrfs: get rid of one layer of bios in direct I/O Omar Sandoval
                   ` (6 subsequent siblings)
  17 siblings, 3 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

The next commit will get rid of btrfs_dio_private->orig_bio. The only
thing we really need it for is containing all of the checksums, but we
can easily put those in btrfs_dio_private and get rid of the awkward
logic that looks up the checksums for orig_bio when the first split bio
is submitted. (Interestingly, btrfs_dio_private did contain the
checksums before commit 23ea8e5a0767 ("Btrfs: load checksum data once
when submitting a direct read io"), but it didn't look them up up
front.)

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c | 79 ++++++++++++++++++++++++------------------------
 1 file changed, 39 insertions(+), 40 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a7fb0ba8cde4..4a2e44f3e66e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -84,6 +84,9 @@ struct btrfs_dio_private {
 	 */
 	blk_status_t (*subio_endio)(struct inode *, struct btrfs_io_bio *,
 			blk_status_t);
+
+	/* Checksums. */
+	u8 sums[];
 };
 
 struct btrfs_dio_data {
@@ -7753,7 +7756,6 @@ static void btrfs_endio_direct_read(struct bio *bio)
 
 	dio_bio->bi_status = err;
 	dio_end_io(dio_bio);
-	btrfs_io_bio_free_csum(io_bio);
 	bio_put(bio);
 }
 
@@ -7865,39 +7867,6 @@ static void btrfs_end_dio_bio(struct bio *bio)
 	bio_put(bio);
 }
 
-static inline blk_status_t btrfs_lookup_and_bind_dio_csum(struct inode *inode,
-						 struct btrfs_dio_private *dip,
-						 struct bio *bio,
-						 u64 file_offset)
-{
-	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
-	struct btrfs_io_bio *orig_io_bio = btrfs_io_bio(dip->orig_bio);
-	u16 csum_size;
-	blk_status_t ret;
-
-	/*
-	 * We load all the csum data we need when we submit
-	 * the first bio to reduce the csum tree search and
-	 * contention.
-	 */
-	if (dip->logical_offset == file_offset) {
-		ret = btrfs_lookup_bio_sums(inode, dip->orig_bio, file_offset,
-					    NULL);
-		if (ret)
-			return ret;
-	}
-
-	if (bio == dip->orig_bio)
-		return 0;
-
-	file_offset -= dip->logical_offset;
-	file_offset >>= inode->i_sb->s_blocksize_bits;
-	csum_size = btrfs_super_csum_size(btrfs_sb(inode->i_sb)->super_copy);
-	io_bio->csum = orig_io_bio->csum + csum_size * file_offset;
-
-	return 0;
-}
-
 static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 		struct inode *inode, u64 file_offset, int async_submit)
 {
@@ -7933,10 +7902,12 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 		if (ret)
 			goto err;
 	} else {
-		ret = btrfs_lookup_and_bind_dio_csum(inode, dip, bio,
-						     file_offset);
-		if (ret)
-			goto err;
+		u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
+		size_t csum_offset;
+
+		csum_offset = ((file_offset - dip->logical_offset) >>
+			       inode->i_sb->s_blocksize_bits) * csum_size;
+		btrfs_io_bio(bio)->csum = dip->sums + csum_offset;
 	}
 map:
 	ret = btrfs_map_bio(fs_info, bio, 0);
@@ -8047,13 +8018,25 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 				loff_t file_offset)
 {
 	struct btrfs_dio_private *dip = NULL;
+	size_t dip_size;
 	struct bio *bio = NULL;
 	struct btrfs_io_bio *io_bio;
 	bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
+	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
 
 	bio = btrfs_bio_clone(dio_bio);
 
-	dip = kzalloc(sizeof(*dip), GFP_NOFS);
+	dip_size = sizeof(*dip);
+	if (!write && csum) {
+		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+		u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
+		size_t nblocks = (dio_bio->bi_iter.bi_size >>
+				  inode->i_sb->s_blocksize_bits);
+
+		dip_size += csum_size * nblocks;
+	}
+
+	dip = kzalloc(dip_size, GFP_NOFS);
 	if (!dip) {
 		if (!write) {
 			unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
@@ -8093,11 +8076,27 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 			dip->bytes;
 		dio_data->unsubmitted_oe_range_start =
 			dio_data->unsubmitted_oe_range_end;
-
 		bio->bi_end_io = btrfs_endio_direct_write;
 	} else {
 		bio->bi_end_io = btrfs_endio_direct_read;
 		dip->subio_endio = btrfs_subio_endio_read;
+
+		if (csum) {
+			blk_status_t status;
+
+			/*
+			 * Load the csums up front to reduce csum tree searches
+			 * and contention when submitting bios.
+			 */
+			status = btrfs_lookup_bio_sums(inode, dio_bio,
+						       file_offset, dip->sums);
+			if (status != BLK_STS_OK) {
+				dip->errors = 1;
+				if (refcount_dec_and_test(&dip->refs))
+					bio_io_error(dip->orig_bio);
+				return;
+			}
+		}
 	}
 
 	btrfs_submit_direct_hook(dip);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 12/15] btrfs: get rid of one layer of bios in direct I/O
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (10 preceding siblings ...)
  2020-03-09 21:32 ` [PATCH 11/15] btrfs: put direct I/O checksums in btrfs_dio_private instead of bio Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-10 16:38   ` Christoph Hellwig
                     ` (2 more replies)
  2020-03-09 21:32 ` [PATCH 13/15] btrfs: simplify direct I/O read repair Omar Sandoval
                   ` (5 subsequent siblings)
  17 siblings, 3 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

In the worst case, there are _4_ layers of bios in the Btrfs direct I/O
path:

1. The bio created by the generic direct I/O code (dio_bio).
2. A clone of dio_bio we create in btrfs_submit_direct() to represent
   the entire direct I/O range (orig_bio).
3. A partial clone of orig_bio limited to the size of a RAID stripe that
   we create in btrfs_submit_direct_hook().
4. Clones of each of those split bios for each RAID stripe that we
   create in btrfs_map_bio().

As of the previous commit, the second layer (orig_bio) is no longer
needed for anything: we can split dio_bio instead, and complete dio_bio
directly when all of the cloned bios complete. This lets us clean up a
bunch of cruft, including dip->subio_endio and dip->errors (we can use
dio_bio->bi_status instead). It also enables the next big cleanup of
direct I/O read repair.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/inode.c | 213 +++++++++++++++--------------------------------
 1 file changed, 66 insertions(+), 147 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4a2e44f3e66e..40c1562704e9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -54,11 +54,8 @@ struct btrfs_iget_args {
 	struct btrfs_root *root;
 };
 
-#define BTRFS_DIO_ORIG_BIO_SUBMITTED	0x1
-
 struct btrfs_dio_private {
 	struct inode *inode;
-	unsigned long flags;
 	u64 logical_offset;
 	u64 disk_bytenr;
 	u64 bytes;
@@ -69,22 +66,9 @@ struct btrfs_dio_private {
 	 */
 	refcount_t refs;
 
-	/* IO errors */
-	int errors;
-
-	/* orig_bio is our btrfs_io_bio */
-	struct bio *orig_bio;
-
 	/* dio_bio came from fs/direct-io.c */
 	struct bio *dio_bio;
 
-	/*
-	 * The original bio may be split to several sub-bios, this is
-	 * done during endio of sub-bios
-	 */
-	blk_status_t (*subio_endio)(struct inode *, struct btrfs_io_bio *,
-			blk_status_t);
-
 	/* Checksums. */
 	u8 sums[];
 };
@@ -7400,6 +7384,29 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 	return ret;
 }
 
+static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
+{
+	/*
+	 * This implies a barrier so that stores to dio_bio->bi_status before
+	 * this and the following load are fully ordered.
+	 */
+	if (!refcount_dec_and_test(&dip->refs))
+		return;
+
+	if (bio_op(dip->dio_bio) == REQ_OP_WRITE) {
+		__endio_write_update_ordered(dip->inode, dip->logical_offset,
+					     dip->bytes,
+					     !dip->dio_bio->bi_status);
+	} else {
+		unlock_extent(&BTRFS_I(dip->inode)->io_tree,
+			      dip->logical_offset,
+			      dip->logical_offset + dip->bytes - 1);
+	}
+
+	dio_end_io(dip->dio_bio);
+	kfree(dip);
+}
+
 static inline blk_status_t submit_dio_repair_bio(struct inode *inode,
 						 struct bio *bio,
 						 int mirror_num)
@@ -7722,8 +7729,9 @@ static blk_status_t __btrfs_subio_endio_read(struct inode *inode,
 	return err;
 }
 
-static blk_status_t btrfs_subio_endio_read(struct inode *inode,
-		struct btrfs_io_bio *io_bio, blk_status_t err)
+static blk_status_t btrfs_check_read_dio_bio(struct inode *inode,
+					     struct btrfs_io_bio *io_bio,
+					     blk_status_t err)
 {
 	bool skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
 
@@ -7737,28 +7745,6 @@ static blk_status_t btrfs_subio_endio_read(struct inode *inode,
 	}
 }
 
-static void btrfs_endio_direct_read(struct bio *bio)
-{
-	struct btrfs_dio_private *dip = bio->bi_private;
-	struct inode *inode = dip->inode;
-	struct bio *dio_bio;
-	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
-	blk_status_t err = bio->bi_status;
-
-	if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED)
-		err = btrfs_subio_endio_read(inode, io_bio, err);
-
-	unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
-		      dip->logical_offset + dip->bytes - 1);
-	dio_bio = dip->dio_bio;
-
-	kfree(dip);
-
-	dio_bio->bi_status = err;
-	dio_end_io(dio_bio);
-	bio_put(bio);
-}
-
 static void __endio_write_update_ordered(struct inode *inode,
 					 const u64 offset, const u64 bytes,
 					 const bool uptodate)
@@ -7802,21 +7788,6 @@ static void __endio_write_update_ordered(struct inode *inode,
 	}
 }
 
-static void btrfs_endio_direct_write(struct bio *bio)
-{
-	struct btrfs_dio_private *dip = bio->bi_private;
-	struct bio *dio_bio = dip->dio_bio;
-
-	__endio_write_update_ordered(dip->inode, dip->logical_offset,
-				     dip->bytes, !bio->bi_status);
-
-	kfree(dip);
-
-	dio_bio->bi_status = bio->bi_status;
-	dio_end_io(dio_bio);
-	bio_put(bio);
-}
-
 static blk_status_t btrfs_submit_bio_start_direct_io(void *private_data,
 				    struct bio *bio, u64 offset)
 {
@@ -7840,31 +7811,16 @@ static void btrfs_end_dio_bio(struct bio *bio)
 			   (unsigned long long)bio->bi_iter.bi_sector,
 			   bio->bi_iter.bi_size, err);
 
-	if (dip->subio_endio)
-		err = dip->subio_endio(dip->inode, btrfs_io_bio(bio), err);
-
-	if (err) {
-		/*
-		 * We want to perceive the errors flag being set before
-		 * decrementing the reference count. We don't need a barrier
-		 * since atomic operations with a return value are fully
-		 * ordered as per atomic_t.txt
-		 */
-		dip->errors = 1;
+	if (bio_op(bio) == REQ_OP_READ) {
+		err = btrfs_check_read_dio_bio(dip->inode, btrfs_io_bio(bio),
+					       err);
 	}
 
-	/* if there are more bios still pending for this dio, just exit */
-	if (!refcount_dec_and_test(&dip->refs))
-		goto out;
+	if (err)
+		dip->dio_bio->bi_status = err;
 
-	if (dip->errors) {
-		bio_io_error(dip->orig_bio);
-	} else {
-		dip->dio_bio->bi_status = BLK_STS_OK;
-		bio_endio(dip->orig_bio);
-	}
-out:
 	bio_put(bio);
+	btrfs_dio_private_put(dip);
 }
 
 static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
@@ -7920,98 +7876,77 @@ static void btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 	struct inode *inode = dip->inode;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct bio *bio;
-	struct bio *orig_bio = dip->orig_bio;
-	u64 start_sector = orig_bio->bi_iter.bi_sector;
+	struct bio *dio_bio = dip->dio_bio;
+	u64 start_sector = dio_bio->bi_iter.bi_sector;
 	u64 file_offset = dip->logical_offset;
 	int async_submit = 0;
-	u64 submit_len;
+	u64 submit_len = dio_bio->bi_iter.bi_size;
 	int clone_offset = 0;
 	int clone_len;
 	int ret;
 	blk_status_t status;
 	struct btrfs_io_geometry geom;
 
-	submit_len = orig_bio->bi_iter.bi_size;
-	ret = btrfs_get_io_geometry(fs_info, btrfs_op(orig_bio),
-				    start_sector << 9, submit_len, &geom);
-	if (ret)
-		goto out_err;
-
-	if (geom.len >= submit_len) {
-		bio = orig_bio;
-		dip->flags |= BTRFS_DIO_ORIG_BIO_SUBMITTED;
-		goto submit;
-	}
-
 	/* async crcs make it difficult to collect full stripe writes. */
 	if (btrfs_data_alloc_profile(fs_info) & BTRFS_BLOCK_GROUP_RAID56_MASK)
 		async_submit = 0;
 	else
 		async_submit = 1;
 
-	/* bio split */
 	ASSERT(geom.len <= INT_MAX);
 	do {
+		ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio),
+					    start_sector << 9, submit_len,
+					    &geom);
+		if (ret) {
+			status = errno_to_blk_status(ret);
+			goto out_err;
+		}
+
 		clone_len = min_t(int, submit_len, geom.len);
 
 		/*
 		 * This will never fail as it's passing GPF_NOFS and
 		 * the allocation is backed by btrfs_bioset.
 		 */
-		bio = btrfs_bio_clone_partial(orig_bio, clone_offset,
-					      clone_len);
+		bio = btrfs_bio_clone_partial(dio_bio, clone_offset, clone_len);
 		bio->bi_private = dip;
 		bio->bi_end_io = btrfs_end_dio_bio;
 		btrfs_io_bio(bio)->logical = file_offset;
 
 		ASSERT(submit_len >= clone_len);
 		submit_len -= clone_len;
-		if (submit_len == 0)
-			break;
 
 		/*
 		 * Increase the count before we submit the bio so we know
 		 * the end IO handler won't happen before we increase the
 		 * count. Otherwise, the dip might get freed before we're
 		 * done setting it up.
+		 *
+		 * We transfer the initial reference to the last bio, so we
+		 * don't need to increment the reference count for the last one.
 		 */
-		refcount_inc(&dip->refs);
+		if (submit_len > 0)
+			refcount_inc(&dip->refs);
 
 		status = btrfs_submit_dio_bio(bio, inode, file_offset,
 						async_submit);
 		if (status) {
 			bio_put(bio);
-			refcount_dec(&dip->refs);
+			if (submit_len > 0)
+				refcount_dec(&dip->refs);
 			goto out_err;
 		}
 
 		clone_offset += clone_len;
 		start_sector += clone_len >> 9;
 		file_offset += clone_len;
-
-		ret = btrfs_get_io_geometry(fs_info, btrfs_op(orig_bio),
-				      start_sector << 9, submit_len, &geom);
-		if (ret)
-			goto out_err;
 	} while (submit_len > 0);
+	return;
 
-submit:
-	status = btrfs_submit_dio_bio(bio, inode, file_offset, async_submit);
-	if (!status)
-		return;
-
-	if (bio != orig_bio)
-		bio_put(bio);
 out_err:
-	dip->errors = 1;
-	/*
-	 * Before atomic variable goto zero, we must  make sure dip->errors is
-	 * perceived to be set. This ordering is ensured by the fact that an
-	 * atomic operations with a return value are fully ordered as per
-	 * atomic_t.txt
-	 */
-	if (refcount_dec_and_test(&dip->refs))
-		bio_io_error(dip->orig_bio);
+	dip->dio_bio->bi_status = status;
+	btrfs_dio_private_put(dip);
 }
 
 static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
@@ -8019,13 +7954,9 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 {
 	struct btrfs_dio_private *dip = NULL;
 	size_t dip_size;
-	struct bio *bio = NULL;
-	struct btrfs_io_bio *io_bio;
 	bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
 	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
 
-	bio = btrfs_bio_clone(dio_bio);
-
 	dip_size = sizeof(*dip);
 	if (!write && csum) {
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -8049,7 +7980,6 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		 * bio_endio()/bio_io_error() against dio_bio.
 		 */
 		dio_end_io(dio_bio);
-		bio_put(bio);
 		return;
 	}
 
@@ -8057,12 +7987,8 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 	dip->logical_offset = file_offset;
 	dip->bytes = dio_bio->bi_iter.bi_size;
 	dip->disk_bytenr = (u64)dio_bio->bi_iter.bi_sector << 9;
-	bio->bi_private = dip;
-	dip->orig_bio = bio;
 	dip->dio_bio = dio_bio;
 	refcount_set(&dip->refs, 1);
-	io_bio = btrfs_io_bio(bio);
-	io_bio->logical = file_offset;
 
 	if (write) {
 		/*
@@ -8076,26 +8002,19 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 			dip->bytes;
 		dio_data->unsubmitted_oe_range_start =
 			dio_data->unsubmitted_oe_range_end;
-		bio->bi_end_io = btrfs_endio_direct_write;
-	} else {
-		bio->bi_end_io = btrfs_endio_direct_read;
-		dip->subio_endio = btrfs_subio_endio_read;
+	} else if (csum) {
+		blk_status_t status;
 
-		if (csum) {
-			blk_status_t status;
-
-			/*
-			 * Load the csums up front to reduce csum tree searches
-			 * and contention when submitting bios.
-			 */
-			status = btrfs_lookup_bio_sums(inode, dio_bio,
-						       file_offset, dip->sums);
-			if (status != BLK_STS_OK) {
-				dip->errors = 1;
-				if (refcount_dec_and_test(&dip->refs))
-					bio_io_error(dip->orig_bio);
-				return;
-			}
+		/*
+		 * Load the csums up front to reduce csum tree searches and
+		 * contention when submitting bios.
+		 */
+		status = btrfs_lookup_bio_sums(inode, dio_bio, file_offset,
+					       dip->sums);
+		if (status != BLK_STS_OK) {
+			dip->dio_bio->bi_status = status;
+			btrfs_dio_private_put(dip);
+			return;
 		}
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 13/15] btrfs: simplify direct I/O read repair
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (11 preceding siblings ...)
  2020-03-09 21:32 ` [PATCH 12/15] btrfs: get rid of one layer of bios in direct I/O Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-11 18:16   ` Josef Bacik
  2020-04-03 16:40   ` David Sterba
  2020-03-09 21:32 ` [PATCH 14/15] btrfs: get rid of endio_repair_workers Omar Sandoval
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

Direct I/O read repair is an over-complicated mess. There is major code
duplication between __btrfs_subio_endio_read() (checks checksums and
handles I/O errors for files with checksums),
__btrfs_correct_data_nocsum() (handles I/O errors for files without
checksums), btrfs_retry_endio() (checks checksums and handles I/O errors
for retries of files with checksums), and btrfs_retry_endio_nocsum()
(handles I/O errors for retries of files without checksum). If it sounds
like these should be one function, that's because they should.

After the previous commit getting rid of orig_bio, we can reuse the same
endio callback for repair I/O and the original I/O, we just need to
track the file offset and original iterator in the repair bio. We can
also unify the handling of files with and without checksums and replace
the atrocity that was probably the inspiration for "Go To Statement
Considered Harmful" with normal loops. We also no longer have to wait
for each repair I/O to complete one by one.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/extent_io.c |   2 +
 fs/btrfs/inode.c     | 268 +++++++------------------------------------
 2 files changed, 44 insertions(+), 226 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index aee35d431f91..fad86ef4d09d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2622,6 +2622,8 @@ struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
 	}
 
 	bio_add_page(bio, page, failrec->len, pg_offset);
+	btrfs_io_bio(bio)->logical = failrec->start;
+	btrfs_io_bio(bio)->iter = bio->bi_iter;
 
 	return bio;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 40c1562704e9..ef302b7c6c2d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7463,19 +7463,17 @@ static int btrfs_check_dio_repairable(struct inode *inode,
 
 static blk_status_t dio_read_error(struct inode *inode, struct bio *failed_bio,
 				   struct page *page, unsigned int pgoff,
-				   u64 start, u64 end, int failed_mirror,
-				   bio_end_io_t *repair_endio, void *repair_arg)
+				   u64 start, u64 end, int failed_mirror)
 {
+	struct btrfs_dio_private *dip = failed_bio->bi_private;
 	struct io_failure_record *failrec;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	struct bio *bio;
 	int isector;
 	unsigned int read_mode = 0;
-	int segs;
 	int ret;
 	blk_status_t status;
-	struct bio_vec bvec;
 
 	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
@@ -7490,261 +7488,79 @@ static blk_status_t dio_read_error(struct inode *inode, struct bio *failed_bio,
 		return BLK_STS_IOERR;
 	}
 
-	segs = bio_segments(failed_bio);
-	bio_get_first_bvec(failed_bio, &bvec);
-	if (segs > 1 ||
-	    (bvec.bv_len > btrfs_inode_sectorsize(inode)))
+	if (btrfs_io_bio(failed_bio)->iter.bi_size > inode->i_sb->s_blocksize)
 		read_mode |= REQ_FAILFAST_DEV;
 
 	isector = start - btrfs_io_bio(failed_bio)->logical;
 	isector >>= inode->i_sb->s_blocksize_bits;
-	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
-				pgoff, isector, repair_endio, repair_arg);
+	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page, pgoff,
+				      isector, failed_bio->bi_end_io, dip);
 	bio->bi_opf = REQ_OP_READ | read_mode;
 
 	btrfs_debug(BTRFS_I(inode)->root->fs_info,
 		    "repair DIO read error: submitting new dio read[%#x] to this_mirror=%d, in_validation=%d",
 		    read_mode, failrec->this_mirror, failrec->in_validation);
 
+	refcount_inc(&dip->refs);
 	status = submit_dio_repair_bio(inode, bio, failrec->this_mirror);
 	if (status) {
 		free_io_failure(failure_tree, io_tree, failrec);
 		bio_put(bio);
+		refcount_dec(&dip->refs);
 	}
 
 	return status;
 }
 
-struct btrfs_retry_complete {
-	struct completion done;
-	struct inode *inode;
-	u64 start;
-	int uptodate;
-};
-
-static void btrfs_retry_endio_nocsum(struct bio *bio)
-{
-	struct btrfs_retry_complete *done = bio->bi_private;
-	struct inode *inode = done->inode;
-	struct bio_vec *bvec;
-	struct extent_io_tree *io_tree, *failure_tree;
-	struct bvec_iter_all iter_all;
-
-	if (bio->bi_status)
-		goto end;
-
-	ASSERT(bio->bi_vcnt == 1);
-	io_tree = &BTRFS_I(inode)->io_tree;
-	failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	ASSERT(bio_first_bvec_all(bio)->bv_len == btrfs_inode_sectorsize(inode));
-
-	done->uptodate = 1;
-	ASSERT(!bio_flagged(bio, BIO_CLONED));
-	bio_for_each_segment_all(bvec, bio, iter_all)
-		clean_io_failure(BTRFS_I(inode)->root->fs_info, failure_tree,
-				 io_tree, done->start, bvec->bv_page,
-				 btrfs_ino(BTRFS_I(inode)), 0);
-end:
-	complete(&done->done);
-	bio_put(bio);
-}
-
-static blk_status_t __btrfs_correct_data_nocsum(struct inode *inode,
-						struct btrfs_io_bio *io_bio)
+static blk_status_t btrfs_check_read_dio_bio(struct inode *inode,
+					     struct btrfs_io_bio *io_bio,
+					     const bool uptodate)
 {
-	struct btrfs_fs_info *fs_info;
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+	u32 sectorsize = fs_info->sectorsize;
+	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
 	struct bio_vec bvec;
 	struct bvec_iter iter;
-	struct btrfs_retry_complete done;
-	u64 start;
-	unsigned int pgoff;
-	u32 sectorsize;
-	int nr_sectors;
-	blk_status_t ret;
+	u64 start = io_bio->logical;
+	int icsum = 0;
 	blk_status_t err = BLK_STS_OK;
 
-	fs_info = BTRFS_I(inode)->root->fs_info;
-	sectorsize = fs_info->sectorsize;
-
-	start = io_bio->logical;
-	done.inode = inode;
-	io_bio->bio.bi_iter = io_bio->iter;
+	__bio_for_each_segment(bvec, &io_bio->bio, iter, io_bio->iter) {
+		unsigned int i, nr_sectors, pgoff;
 
-	bio_for_each_segment(bvec, &io_bio->bio, iter) {
 		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec.bv_len);
 		pgoff = bvec.bv_offset;
-
-next_block_or_try_again:
-		done.uptodate = 0;
-		done.start = start;
-		init_completion(&done.done);
-
-		ret = dio_read_error(inode, &io_bio->bio, bvec.bv_page,
-				pgoff, start, start + sectorsize - 1,
-				io_bio->mirror_num,
-				btrfs_retry_endio_nocsum, &done);
-		if (ret) {
-			err = ret;
-			goto next;
-		}
-
-		wait_for_completion_io(&done.done);
-
-		if (!done.uptodate) {
-			/* We might have another mirror, so try again */
-			goto next_block_or_try_again;
-		}
-
-next:
-		start += sectorsize;
-
-		nr_sectors--;
-		if (nr_sectors) {
-			pgoff += sectorsize;
-			ASSERT(pgoff < PAGE_SIZE);
-			goto next_block_or_try_again;
-		}
-	}
-
-	return err;
-}
-
-static void btrfs_retry_endio(struct bio *bio)
-{
-	struct btrfs_retry_complete *done = bio->bi_private;
-	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
-	struct extent_io_tree *io_tree, *failure_tree;
-	struct inode *inode = done->inode;
-	struct bio_vec *bvec;
-	int uptodate;
-	int ret;
-	int i = 0;
-	struct bvec_iter_all iter_all;
-
-	if (bio->bi_status)
-		goto end;
-
-	uptodate = 1;
-
-	ASSERT(bio->bi_vcnt == 1);
-	ASSERT(bio_first_bvec_all(bio)->bv_len == btrfs_inode_sectorsize(done->inode));
-
-	io_tree = &BTRFS_I(inode)->io_tree;
-	failure_tree = &BTRFS_I(inode)->io_failure_tree;
-
-	ASSERT(!bio_flagged(bio, BIO_CLONED));
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		ret = check_data_csum(inode, io_bio, i, bvec->bv_page,
-				      bvec->bv_offset, done->start,
-				      bvec->bv_len);
-		if (!ret)
-			clean_io_failure(BTRFS_I(inode)->root->fs_info,
-					 failure_tree, io_tree, done->start,
-					 bvec->bv_page,
-					 btrfs_ino(BTRFS_I(inode)),
-					 bvec->bv_offset);
-		else
-			uptodate = 0;
-		i++;
-	}
-
-	done->uptodate = uptodate;
-end:
-	complete(&done->done);
-	bio_put(bio);
-}
-
-static blk_status_t __btrfs_subio_endio_read(struct inode *inode,
-		struct btrfs_io_bio *io_bio, blk_status_t err)
-{
-	struct btrfs_fs_info *fs_info;
-	struct bio_vec bvec;
-	struct bvec_iter iter;
-	struct btrfs_retry_complete done;
-	u64 start;
-	u64 offset = 0;
-	u32 sectorsize;
-	int nr_sectors;
-	unsigned int pgoff;
-	int csum_pos;
-	bool uptodate = (err == 0);
-	int ret;
-	blk_status_t status;
-
-	fs_info = BTRFS_I(inode)->root->fs_info;
-	sectorsize = fs_info->sectorsize;
-
-	err = BLK_STS_OK;
-	start = io_bio->logical;
-	done.inode = inode;
-	io_bio->bio.bi_iter = io_bio->iter;
-
-	bio_for_each_segment(bvec, &io_bio->bio, iter) {
-		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec.bv_len);
-
-		pgoff = bvec.bv_offset;
-next_block:
-		if (uptodate) {
-			csum_pos = BTRFS_BYTES_TO_BLKS(fs_info, offset);
-			ret = check_data_csum(inode, io_bio, csum_pos,
-					      bvec.bv_page, pgoff, start,
-					      sectorsize);
-			if (likely(!ret))
-				goto next;
-		}
-try_again:
-		done.uptodate = 0;
-		done.start = start;
-		init_completion(&done.done);
-
-		status = dio_read_error(inode, &io_bio->bio, bvec.bv_page,
-					pgoff, start, start + sectorsize - 1,
-					io_bio->mirror_num, btrfs_retry_endio,
-					&done);
-		if (status) {
-			err = status;
-			goto next;
-		}
-
-		wait_for_completion_io(&done.done);
-
-		if (!done.uptodate) {
-			/* We might have another mirror, so try again */
-			goto try_again;
-		}
-next:
-		offset += sectorsize;
-		start += sectorsize;
-
-		ASSERT(nr_sectors);
-
-		nr_sectors--;
-		if (nr_sectors) {
+		for (i = 0; i < nr_sectors; i++) {
+			if (uptodate &&
+			    (!csum || !check_data_csum(inode, io_bio, icsum,
+						       bvec.bv_page, pgoff,
+						       start, sectorsize))) {
+				clean_io_failure(fs_info, failure_tree, io_tree,
+						 start, bvec.bv_page,
+						 btrfs_ino(BTRFS_I(inode)),
+						 pgoff);
+			} else {
+				blk_status_t status;
+
+				status = dio_read_error(inode, &io_bio->bio,
+							bvec.bv_page, pgoff,
+							start,
+							start + sectorsize - 1,
+							io_bio->mirror_num);
+				if (status)
+					err = status;
+			}
+			start += sectorsize;
+			icsum++;
 			pgoff += sectorsize;
 			ASSERT(pgoff < PAGE_SIZE);
-			goto next_block;
 		}
 	}
-
 	return err;
 }
 
-static blk_status_t btrfs_check_read_dio_bio(struct inode *inode,
-					     struct btrfs_io_bio *io_bio,
-					     blk_status_t err)
-{
-	bool skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
-
-	if (skip_csum) {
-		if (unlikely(err))
-			return __btrfs_correct_data_nocsum(inode, io_bio);
-		else
-			return BLK_STS_OK;
-	} else {
-		return __btrfs_subio_endio_read(inode, io_bio, err);
-	}
-}
-
 static void __endio_write_update_ordered(struct inode *inode,
 					 const u64 offset, const u64 bytes,
 					 const bool uptodate)
@@ -7813,7 +7629,7 @@ static void btrfs_end_dio_bio(struct bio *bio)
 
 	if (bio_op(bio) == REQ_OP_READ) {
 		err = btrfs_check_read_dio_bio(dip->inode, btrfs_io_bio(bio),
-					       err);
+					       !err);
 	}
 
 	if (err)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 14/15] btrfs: get rid of endio_repair_workers
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (12 preceding siblings ...)
  2020-03-09 21:32 ` [PATCH 13/15] btrfs: simplify direct I/O read repair Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-11 18:16   ` Josef Bacik
  2020-03-09 21:32 ` [PATCH 15/15] btrfs: unify buffered and direct I/O read repair Omar Sandoval
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

This was originally added in commit 8b110e393c5a ("Btrfs: implement
repair function when direct read fails") because the original bio waited
for the repair bio to complete, so the repair I/O couldn't go through
the same workqueue. As of the previous commit, this is no longer true,
so this separate workqueue is unnecessary.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h   | 1 -
 fs/btrfs/disk-io.c | 8 +-------
 fs/btrfs/disk-io.h | 1 -
 fs/btrfs/inode.c   | 2 +-
 4 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ecd016f7dab1..91c7ea587fcd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -759,7 +759,6 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *endio_workers;
 	struct btrfs_workqueue *endio_meta_workers;
 	struct btrfs_workqueue *endio_raid56_workers;
-	struct btrfs_workqueue *endio_repair_workers;
 	struct btrfs_workqueue *rmw_workers;
 	struct btrfs_workqueue *endio_meta_write_workers;
 	struct btrfs_workqueue *endio_write_workers;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6b00ddea0b48..e2d7915f5b03 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -709,9 +709,7 @@ static void end_workqueue_bio(struct bio *bio)
 		else
 			wq = fs_info->endio_write_workers;
 	} else {
-		if (unlikely(end_io_wq->metadata == BTRFS_WQ_ENDIO_DIO_REPAIR))
-			wq = fs_info->endio_repair_workers;
-		else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56)
+		if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56)
 			wq = fs_info->endio_raid56_workers;
 		else if (end_io_wq->metadata)
 			wq = fs_info->endio_meta_workers;
@@ -1955,7 +1953,6 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->workers);
 	btrfs_destroy_workqueue(fs_info->endio_workers);
 	btrfs_destroy_workqueue(fs_info->endio_raid56_workers);
-	btrfs_destroy_workqueue(fs_info->endio_repair_workers);
 	btrfs_destroy_workqueue(fs_info->rmw_workers);
 	btrfs_destroy_workqueue(fs_info->endio_write_workers);
 	btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
@@ -2141,8 +2138,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 	fs_info->endio_raid56_workers =
 		btrfs_alloc_workqueue(fs_info, "endio-raid56", flags,
 				      max_active, 4);
-	fs_info->endio_repair_workers =
-		btrfs_alloc_workqueue(fs_info, "endio-repair", flags, 1, 0);
 	fs_info->rmw_workers =
 		btrfs_alloc_workqueue(fs_info, "rmw", flags, max_active, 2);
 	fs_info->endio_write_workers =
@@ -2166,7 +2161,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 	      fs_info->flush_workers &&
 	      fs_info->endio_workers && fs_info->endio_meta_workers &&
 	      fs_info->endio_meta_write_workers &&
-	      fs_info->endio_repair_workers &&
 	      fs_info->endio_write_workers && fs_info->endio_raid56_workers &&
 	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
 	      fs_info->caching_workers && fs_info->readahead_workers &&
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 59c885860bf8..aef643f26d0c 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -25,7 +25,6 @@ enum btrfs_wq_endio_type {
 	BTRFS_WQ_ENDIO_METADATA,
 	BTRFS_WQ_ENDIO_FREE_SPACE,
 	BTRFS_WQ_ENDIO_RAID56,
-	BTRFS_WQ_ENDIO_DIO_REPAIR,
 };
 
 static inline u64 btrfs_sb_offset(int mirror)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ef302b7c6c2d..7f00fee5169b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7416,7 +7416,7 @@ static inline blk_status_t submit_dio_repair_bio(struct inode *inode,
 
 	BUG_ON(bio_op(bio) == REQ_OP_WRITE);
 
-	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DIO_REPAIR);
+	ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
 	if (ret)
 		return ret;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 15/15] btrfs: unify buffered and direct I/O read repair
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (13 preceding siblings ...)
  2020-03-09 21:32 ` [PATCH 14/15] btrfs: get rid of endio_repair_workers Omar Sandoval
@ 2020-03-09 21:32 ` Omar Sandoval
  2020-03-11 18:19   ` Josef Bacik
  2020-03-19  8:53   ` Nikolay Borisov
  2020-03-10 16:39 ` [PATCH 00/15] btrfs: read repair/direct I/O improvements Christoph Hellwig
                   ` (2 subsequent siblings)
  17 siblings, 2 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-09 21:32 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team, Christoph Hellwig

From: Omar Sandoval <osandov@fb.com>

Currently, direct I/O has its own versions of bio_readpage_error() and
btrfs_check_repairable() (dio_read_error() and
btrfs_check_dio_repairable(), respectively). The main difference is that
the direct I/O version doesn't do read validation. The rework of direct
I/O repair makes it possible to do validation, so we can get rid of
btrfs_check_dio_repairable() and combine bio_readpage_error() and
dio_read_error() into a new helper, btrfs_submit_read_repair().

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/extent_io.c | 126 +++++++++++++++++++------------------------
 fs/btrfs/extent_io.h |  17 +++---
 fs/btrfs/inode.c     | 103 ++++-------------------------------
 3 files changed, 76 insertions(+), 170 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fad86ef4d09d..a5cbe04da803 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2593,80 +2593,52 @@ static bool btrfs_check_repairable(struct inode *inode,
 	return true;
 }
 
-
-struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
-				    struct io_failure_record *failrec,
-				    struct page *page, int pg_offset, int icsum,
-				    bio_end_io_t *endio_func, void *data)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	struct bio *bio;
-	struct btrfs_io_bio *btrfs_failed_bio;
-	struct btrfs_io_bio *btrfs_bio;
-
-	bio = btrfs_io_bio_alloc(1);
-	bio->bi_end_io = endio_func;
-	bio->bi_iter.bi_sector = failrec->logical >> 9;
-	bio->bi_iter.bi_size = 0;
-	bio->bi_private = data;
-
-	btrfs_failed_bio = btrfs_io_bio(failed_bio);
-	if (btrfs_failed_bio->csum) {
-		u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
-
-		btrfs_bio = btrfs_io_bio(bio);
-		btrfs_bio->csum = btrfs_bio->csum_inline;
-		icsum *= csum_size;
-		memcpy(btrfs_bio->csum, btrfs_failed_bio->csum + icsum,
-		       csum_size);
-	}
-
-	bio_add_page(bio, page, failrec->len, pg_offset);
-	btrfs_io_bio(bio)->logical = failrec->start;
-	btrfs_io_bio(bio)->iter = bio->bi_iter;
-
-	return bio;
-}
-
-/*
- * This is a generic handler for readpage errors. If other copies exist, read
- * those and write back good data to the failed position. Does not investigate
- * in remapping the failed extent elsewhere, hoping the device will be smart
- * enough to do this as needed
- */
-static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
-			      struct page *page, u64 start, u64 end,
-			      int failed_mirror)
+blk_status_t btrfs_submit_read_repair(struct inode *inode,
+				      struct bio *failed_bio, u64 phy_offset,
+				      struct page *page, unsigned int pgoff,
+				      u64 start, u64 end, int failed_mirror,
+				      submit_bio_hook_t *submit_bio_hook)
 {
 	struct io_failure_record *failrec;
-	struct inode *inode = page->mapping->host;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
+	struct btrfs_io_bio *failed_io_bio = btrfs_io_bio(failed_bio);
+	struct btrfs_io_bio *io_bio;
+	int icsum = phy_offset >> inode->i_sb->s_blocksize_bits;
 	bool need_validation = false;
 	struct bio *bio;
-	int read_mode = 0;
 	blk_status_t status;
 	int ret;
 
+	btrfs_info(btrfs_sb(inode->i_sb),
+		   "Repair Read Error: read error at %llu", start);
+
 	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
 	ret = btrfs_get_io_failure_record(inode, start, end, &failrec);
 	if (ret)
-		return ret;
+		return errno_to_blk_status(ret);
 
 	/*
 	 * If there was an I/O error and the I/O was for multiple sectors, we
 	 * need to validate each sector individually.
 	 */
 	if (failed_bio->bi_status != BLK_STS_OK) {
-		u64 len = 0;
-		int i;
-
-		for (i = 0; i < failed_bio->bi_vcnt; i++) {
-			len += failed_bio->bi_io_vec[i].bv_len;
-			if (len > inode->i_sb->s_blocksize) {
+		if (bio_flagged(failed_bio, BIO_CLONED)) {
+			if (failed_io_bio->iter.bi_size >
+			    inode->i_sb->s_blocksize)
 				need_validation = true;
-				break;
+		} else {
+			u64 len = 0;
+			int i;
+
+			for (i = 0; i < failed_bio->bi_vcnt; i++) {
+				len += failed_bio->bi_io_vec[i].bv_len;
+				if (len > inode->i_sb->s_blocksize) {
+					need_validation = true;
+					break;
+				}
 			}
 		}
 	}
@@ -2674,32 +2646,41 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	if (!btrfs_check_repairable(inode, need_validation, failrec,
 				    failed_mirror)) {
 		free_io_failure(failure_tree, tree, failrec);
-		return -EIO;
+		return BLK_STS_IOERR;
 	}
 
+	bio = btrfs_io_bio_alloc(1);
+	io_bio = btrfs_io_bio(bio);
+	bio->bi_opf = REQ_OP_READ;
 	if (need_validation)
-		read_mode |= REQ_FAILFAST_DEV;
+		bio->bi_opf |= REQ_FAILFAST_DEV;
+	bio->bi_end_io = failed_bio->bi_end_io;
+	bio->bi_iter.bi_sector = failrec->logical >> 9;
+	bio->bi_private = failed_bio->bi_private;
 
-	phy_offset >>= inode->i_sb->s_blocksize_bits;
-	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
-				      start - page_offset(page),
-				      (int)phy_offset, failed_bio->bi_end_io,
-				      NULL);
-	bio->bi_opf = REQ_OP_READ | read_mode;
+	if (failed_io_bio->csum) {
+		u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
+
+		io_bio->csum = io_bio->csum_inline;
+		memcpy(io_bio->csum, failed_io_bio->csum + csum_size * icsum,
+		       csum_size);
+	}
+
+	bio_add_page(bio, page, failrec->len, pgoff);
+	io_bio->logical = failrec->start;
+	io_bio->iter = bio->bi_iter;
 
 	btrfs_debug(btrfs_sb(inode->i_sb),
-		"Repair Read Error: submitting new read[%#x] to this_mirror=%d, in_validation=%d",
-		read_mode, failrec->this_mirror, failrec->in_validation);
+"Repair Read Error: submitting new read to this_mirror=%d, in_validation=%d",
+		    failrec->this_mirror, failrec->in_validation);
 
-	status = tree->ops->submit_bio_hook(tree->private_data, bio, failrec->this_mirror,
-					 failrec->bio_flags);
+	status = submit_bio_hook(inode, bio, failrec->this_mirror,
+				 failrec->bio_flags);
 	if (status) {
 		free_io_failure(failure_tree, tree, failrec);
 		bio_put(bio);
-		ret = blk_status_to_errno(status);
 	}
-
-	return ret;
+	return status;
 }
 
 /* lots and lots of room for performance fixes in the end_bio funcs */
@@ -2871,9 +2852,10 @@ static void end_bio_extent_readpage(struct bio *bio)
 			 * If it can't handle the error it will return -EIO and
 			 * we remain responsible for that page.
 			 */
-			ret = bio_readpage_error(bio, offset, page, start, end,
-						 mirror);
-			if (ret == 0) {
+			if (!btrfs_submit_read_repair(inode, bio, offset, page,
+						start - page_offset(page),
+						start, end, mirror,
+						tree->ops->submit_bio_hook)) {
 				uptodate = !bio->bi_status;
 				offset += len;
 				continue;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 11341a430007..f269a4847d8b 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -66,6 +66,10 @@ struct btrfs_io_bio;
 struct io_failure_record;
 struct extent_io_tree;
 
+typedef blk_status_t (submit_bio_hook_t)(struct inode *inode, struct bio *bio,
+					 int mirror_num,
+					 unsigned long bio_flags);
+
 typedef blk_status_t (extent_submit_bio_start_t)(void *private_data,
 		struct bio *bio, u64 bio_offset);
 
@@ -74,8 +78,7 @@ struct extent_io_ops {
 	 * The following callbacks must be always defined, the function
 	 * pointer will be called unconditionally.
 	 */
-	blk_status_t (*submit_bio_hook)(struct inode *inode, struct bio *bio,
-					int mirror_num, unsigned long bio_flags);
+	submit_bio_hook_t *submit_bio_hook;
 	int (*readpage_end_io_hook)(struct btrfs_io_bio *io_bio, u64 phy_offset,
 				    struct page *page, u64 start, u64 end,
 				    int mirror);
@@ -312,10 +315,12 @@ struct io_failure_record {
 };
 
 
-struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
-				    struct io_failure_record *failrec,
-				    struct page *page, int pg_offset, int icsum,
-				    bio_end_io_t *endio_func, void *data);
+blk_status_t btrfs_submit_read_repair(struct inode *inode,
+				      struct bio *failed_bio, u64 phy_offset,
+				      struct page *page, unsigned int pgoff,
+				      u64 start, u64 end, int failed_mirror,
+				      submit_bio_hook_t *submit_bio_hook);
+
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 bool find_lock_delalloc_range(struct inode *inode,
 			     struct page *locked_page, u64 *start,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7f00fee5169b..d555f9bf5bbf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7407,10 +7407,11 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
 	kfree(dip);
 }
 
-static inline blk_status_t submit_dio_repair_bio(struct inode *inode,
-						 struct bio *bio,
-						 int mirror_num)
+static blk_status_t submit_dio_repair_bio(struct inode *inode, struct bio *bio,
+					  int mirror_num,
+					  unsigned long bio_flags)
 {
+	struct btrfs_dio_private *dip = bio->bi_private;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	blk_status_t ret;
 
@@ -7420,96 +7421,11 @@ static inline blk_status_t submit_dio_repair_bio(struct inode *inode,
 	if (ret)
 		return ret;
 
+	refcount_inc(&dip->refs);
 	ret = btrfs_map_bio(fs_info, bio, mirror_num);
-
-	return ret;
-}
-
-static int btrfs_check_dio_repairable(struct inode *inode,
-				      struct bio *failed_bio,
-				      struct io_failure_record *failrec,
-				      int failed_mirror)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-	int num_copies;
-
-	num_copies = btrfs_num_copies(fs_info, failrec->logical, failrec->len);
-	if (num_copies == 1) {
-		/*
-		 * we only have a single copy of the data, so don't bother with
-		 * all the retry and error correction code that follows. no
-		 * matter what the error is, it is very likely to persist.
-		 */
-		btrfs_debug(fs_info,
-			"Check DIO Repairable: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d",
-			num_copies, failrec->this_mirror, failed_mirror);
-		return 0;
-	}
-
-	failrec->failed_mirror = failed_mirror;
-	failrec->this_mirror++;
-	if (failrec->this_mirror == failed_mirror)
-		failrec->this_mirror++;
-
-	if (failrec->this_mirror > num_copies) {
-		btrfs_debug(fs_info,
-			"Check DIO Repairable: (fail) num_copies=%d, next_mirror %d, failed_mirror %d",
-			num_copies, failrec->this_mirror, failed_mirror);
-		return 0;
-	}
-
-	return 1;
-}
-
-static blk_status_t dio_read_error(struct inode *inode, struct bio *failed_bio,
-				   struct page *page, unsigned int pgoff,
-				   u64 start, u64 end, int failed_mirror)
-{
-	struct btrfs_dio_private *dip = failed_bio->bi_private;
-	struct io_failure_record *failrec;
-	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
-	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
-	struct bio *bio;
-	int isector;
-	unsigned int read_mode = 0;
-	int ret;
-	blk_status_t status;
-
-	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
-
-	ret = btrfs_get_io_failure_record(inode, start, end, &failrec);
 	if (ret)
-		return errno_to_blk_status(ret);
-
-	ret = btrfs_check_dio_repairable(inode, failed_bio, failrec,
-					 failed_mirror);
-	if (!ret) {
-		free_io_failure(failure_tree, io_tree, failrec);
-		return BLK_STS_IOERR;
-	}
-
-	if (btrfs_io_bio(failed_bio)->iter.bi_size > inode->i_sb->s_blocksize)
-		read_mode |= REQ_FAILFAST_DEV;
-
-	isector = start - btrfs_io_bio(failed_bio)->logical;
-	isector >>= inode->i_sb->s_blocksize_bits;
-	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page, pgoff,
-				      isector, failed_bio->bi_end_io, dip);
-	bio->bi_opf = REQ_OP_READ | read_mode;
-
-	btrfs_debug(BTRFS_I(inode)->root->fs_info,
-		    "repair DIO read error: submitting new dio read[%#x] to this_mirror=%d, in_validation=%d",
-		    read_mode, failrec->this_mirror, failrec->in_validation);
-
-	refcount_inc(&dip->refs);
-	status = submit_dio_repair_bio(inode, bio, failrec->this_mirror);
-	if (status) {
-		free_io_failure(failure_tree, io_tree, failrec);
-		bio_put(bio);
 		refcount_dec(&dip->refs);
-	}
-
-	return status;
+	return ret;
 }
 
 static blk_status_t btrfs_check_read_dio_bio(struct inode *inode,
@@ -7544,11 +7460,14 @@ static blk_status_t btrfs_check_read_dio_bio(struct inode *inode,
 			} else {
 				blk_status_t status;
 
-				status = dio_read_error(inode, &io_bio->bio,
+				status = btrfs_submit_read_repair(inode,
+							&io_bio->bio,
+							start - io_bio->logical,
 							bvec.bv_page, pgoff,
 							start,
 							start + sectorsize - 1,
-							io_bio->mirror_num);
+							io_bio->mirror_num,
+							submit_dio_repair_bio);
 				if (status)
 					err = status;
 			}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 06/15] btrfs: rename __readpage_endio_check to check_data_csum
  2020-03-09 21:32 ` [PATCH 06/15] btrfs: rename __readpage_endio_check to check_data_csum Omar Sandoval
@ 2020-03-10 14:46   ` Johannes Thumshirn
  2020-03-11 17:57   ` Josef Bacik
  2020-03-17 14:39   ` Nikolay Borisov
  2 siblings, 0 replies; 69+ messages in thread
From: Johannes Thumshirn @ 2020-03-10 14:46 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 07/15] btrfs: make btrfs_check_repairable() static
  2020-03-09 21:32 ` [PATCH 07/15] btrfs: make btrfs_check_repairable() static Omar Sandoval
@ 2020-03-10 14:53   ` Johannes Thumshirn
  2020-03-11 17:58   ` Josef Bacik
  2020-03-17 14:52   ` Nikolay Borisov
  2 siblings, 0 replies; 69+ messages in thread
From: Johannes Thumshirn @ 2020-03-10 14:53 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

Easy enough,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 08/15] btrfs: move btrfs_dio_private to inode.c
  2020-03-09 21:32 ` [PATCH 08/15] btrfs: move btrfs_dio_private to inode.c Omar Sandoval
@ 2020-03-10 14:56   ` Johannes Thumshirn
  2020-03-11  8:48     ` Omar Sandoval
  2020-03-17 14:53   ` Nikolay Borisov
  2020-03-19 16:16   ` David Sterba
  2 siblings, 1 reply; 69+ messages in thread
From: Johannes Thumshirn @ 2020-03-10 14:56 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

There's still a forward declaration in ctree.h:
johannes@redsun60:linux(btrfs-misc-next)$ git grep -n btrfs_dio_private
[...]
fs/btrfs/ctree.h:2808:struct btrfs_dio_private;
[...]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 02/15] btrfs: fix double __endio_write_update_ordered in direct I/O
  2020-03-09 21:32 ` [PATCH 02/15] btrfs: fix double __endio_write_update_ordered in direct I/O Omar Sandoval
@ 2020-03-10 16:30   ` Christoph Hellwig
  2020-03-11  9:03     ` Omar Sandoval
  2020-03-17 14:04   ` Nikolay Borisov
  1 sibling, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2020-03-10 16:30 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Christoph Hellwig

On Mon, Mar 09, 2020 at 02:32:28PM -0700, Omar Sandoval wrote:
> +static void btrfs_submit_direct_hook(struct btrfs_dio_private *dip)

Just curious: why is this routine called btrfs_submit_direct_hook?
it doesn't seem to hook up anything, but just contains the guts of
btrfs_submit_direct.  Any reason to keep the two functions separate?

Also instead of the separate bip allocation and the bio clone, why
not clone into a private bio_set that contains the private data
as part of the allocation?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 03/15] btrfs: look at full bi_io_vec for repair decision
  2020-03-09 21:32 ` [PATCH 03/15] btrfs: look at full bi_io_vec for repair decision Omar Sandoval
@ 2020-03-10 16:33   ` Christoph Hellwig
  2020-03-11  9:07     ` Omar Sandoval
  2020-03-17 14:38   ` Nikolay Borisov
  1 sibling, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2020-03-10 16:33 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Christoph Hellwig

On Mon, Mar 09, 2020 at 02:32:29PM -0700, Omar Sandoval wrote:
> +	/*
> +	 * We need to validate each sector individually if the I/O was for
> +	 * multiple sectors.
> +	 */
> +	len = 0;
> +	for (i = 0; i < failed_bio->bi_vcnt; i++) {
> +		len += failed_bio->bi_io_vec[i].bv_len;
> +		if (len > inode->i_sb->s_blocksize) {
> +			need_validation = true;
> +			break;
> +		}

So given that btrfs is the I/O submitter iterating over all bio_vecs
should probably be fine.  That being said I deeply dislike the idea
of having code like this inside the file system.  Maybe we can add
a helper like:

u32 bio_size_all(struct bio *bio)
{
	u32 len, i;

	for (i = 0; i < bio->bi_vcnt; i++)
		len += bio->bi_io_vec[i].bv_len;
	return len;
}

in the core block code, with a kerneldoc comment documenting the
exact use cases, and then use that?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 12/15] btrfs: get rid of one layer of bios in direct I/O
  2020-03-09 21:32 ` [PATCH 12/15] btrfs: get rid of one layer of bios in direct I/O Omar Sandoval
@ 2020-03-10 16:38   ` Christoph Hellwig
  2020-03-11  9:19     ` Omar Sandoval
  2020-03-11 18:07   ` Josef Bacik
  2020-03-17 16:48   ` Nikolay Borisov
  2 siblings, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2020-03-10 16:38 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Christoph Hellwig

On Mon, Mar 09, 2020 at 02:32:38PM -0700, Omar Sandoval wrote:
> 1. The bio created by the generic direct I/O code (dio_bio).
> 2. A clone of dio_bio we create in btrfs_submit_direct() to represent
>    the entire direct I/O range (orig_bio).
> 3. A partial clone of orig_bio limited to the size of a RAID stripe that
>    we create in btrfs_submit_direct_hook().
> 4. Clones of each of those split bios for each RAID stripe that we
>    create in btrfs_map_bio().

Just curious:  what is number 3 useful for?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 00/15] btrfs: read repair/direct I/O improvements
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (14 preceding siblings ...)
  2020-03-09 21:32 ` [PATCH 15/15] btrfs: unify buffered and direct I/O read repair Omar Sandoval
@ 2020-03-10 16:39 ` Christoph Hellwig
  2020-03-11  9:22   ` Omar Sandoval
  2020-03-18 22:07 ` David Sterba
  2020-03-20 14:10 ` Christoph Hellwig
  17 siblings, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2020-03-10 16:39 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Goldwyn Rodrigues

Adding Goldwyn,

as he has been looking into converting btrfs to the iomap direct
I/O code.  This doesn't look like a major conflict, but he should
be knowledgeable about the dio code by now after a few iterations
of that.

On Mon, Mar 09, 2020 at 02:32:26PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Hi,
> 
> This series includes several fixes, cleanups, and improvements to direct
> I/O and read repair. It's preparation for adding read repair to my
> RWF_ENCODED series [1], but it can go in independently.
> 
> Patches 1 and 2 are direct I/O error handling fixes. Patch 3 is a
> buffered read repair fix. Patch 4 is a buffered read repair improvement.
> Patches 5-9 are trivial cleanups. Patch 10 converts direct I/O to use
> refcount_t, which would've helped catch the bug fixed by patch 1.
> Patches 11-14 drastically simplify the direct I/O code. Patch 15 gets
> unifies buffered and direct I/O read repair, which also makes direct I/O
> repair actually do validation for large failed reads instead of
> rewriting the whole thing.
> 
> Overall, this is net about -400 lines of code and actually makes direct
> I/O more functional.
> 
> Note that this series causes btrfs/142 to fail. This is a bug in the
> test, as it assumes that direct I/O doesn't do read validation. I'm
> working on a fix for the test.
> 
> Christoph is cc'd for patch 3. The fix looks at the bio internals in a
> way that I wasn't sure was recommended, although there is precedent in
> the bcache code. I'd appreciate if Christoph acked that patch or
> suggested a better approach.
> 
> This series is based on misc-next.
> 
> Thanks!
> 
> 1: https://lore.kernel.org/linux-fsdevel/cover.1582930832.git.osandov@fb.com/
> 
> Omar Sandoval (15):
>   btrfs: fix error handling when submitting direct I/O bio
>   btrfs: fix double __endio_write_update_ordered in direct I/O
>   btrfs: look at full bi_io_vec for repair decision
>   btrfs: don't do repair validation for checksum errors
>   btrfs: clarify btrfs_lookup_bio_sums documentation
>   btrfs: rename __readpage_endio_check to check_data_csum
>   btrfs: make btrfs_check_repairable() static
>   btrfs: move btrfs_dio_private to inode.c
>   btrfs: kill btrfs_dio_private->private
>   btrfs: convert btrfs_dio_private->pending_bios to refcount_t
>   btrfs: put direct I/O checksums in btrfs_dio_private instead of bio
>   btrfs: get rid of one layer of bios in direct I/O
>   btrfs: simplify direct I/O read repair
>   btrfs: get rid of endio_repair_workers
>   btrfs: unify buffered and direct I/O read repair
> 
>  fs/btrfs/btrfs_inode.h |  30 --
>  fs/btrfs/ctree.h       |   1 -
>  fs/btrfs/disk-io.c     |   8 +-
>  fs/btrfs/disk-io.h     |   1 -
>  fs/btrfs/extent_io.c   | 141 +++++----
>  fs/btrfs/extent_io.h   |  19 +-
>  fs/btrfs/file-item.c   |  11 +-
>  fs/btrfs/inode.c       | 694 ++++++++++-------------------------------
>  8 files changed, 260 insertions(+), 645 deletions(-)
> 
> -- 
> 2.25.1
---end quoted text---

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 08/15] btrfs: move btrfs_dio_private to inode.c
  2020-03-10 14:56   ` Johannes Thumshirn
@ 2020-03-11  8:48     ` Omar Sandoval
  0 siblings, 0 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-11  8:48 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs, kernel-team, Christoph Hellwig

On Tue, Mar 10, 2020 at 02:56:55PM +0000, Johannes Thumshirn wrote:
> There's still a forward declaration in ctree.h:
> johannes@redsun60:linux(btrfs-misc-next)$ git grep -n btrfs_dio_private
> [...]
> fs/btrfs/ctree.h:2808:struct btrfs_dio_private;
> [...]

Good catch, I'll remove that.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 02/15] btrfs: fix double __endio_write_update_ordered in direct I/O
  2020-03-10 16:30   ` Christoph Hellwig
@ 2020-03-11  9:03     ` Omar Sandoval
  0 siblings, 0 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-11  9:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-btrfs, kernel-team

On Tue, Mar 10, 2020 at 05:30:24PM +0100, Christoph Hellwig wrote:
> On Mon, Mar 09, 2020 at 02:32:28PM -0700, Omar Sandoval wrote:
> > +static void btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
> 
> Just curious: why is this routine called btrfs_submit_direct_hook?

No idea. The name goes back to commit e65e15355429 ("btrfs: fix panic
caused by direct IO"), and it didn't make any sense then, either.

> it doesn't seem to hook up anything, but just contains the guts of
> btrfs_submit_direct.  Any reason to keep the two functions separate?

The only reason I didn't combine them is that it would make for a pretty
big, unwieldy function. Maybe folding btrfs_submit_direct_hook() into
btrfs_submit_direct() and splitting the dip setup into something like
btrfs_create_dio_private() would be more natural.

> Also instead of the separate bip allocation and the bio clone, why
> not clone into a private bio_set that contains the private data
> as part of the allocation?

Mainly because dip is not tied to any one bio, as we might need to split
up or retry bios. We also already clone the bios from a btrfs bioset,
although that doesn't have everything we need for direct I/O.

Thanks!

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 03/15] btrfs: look at full bi_io_vec for repair decision
  2020-03-10 16:33   ` Christoph Hellwig
@ 2020-03-11  9:07     ` Omar Sandoval
  2020-03-16 10:48       ` Christoph Hellwig
  0 siblings, 1 reply; 69+ messages in thread
From: Omar Sandoval @ 2020-03-11  9:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-btrfs, kernel-team

On Tue, Mar 10, 2020 at 05:33:19PM +0100, Christoph Hellwig wrote:
> On Mon, Mar 09, 2020 at 02:32:29PM -0700, Omar Sandoval wrote:
> > +	/*
> > +	 * We need to validate each sector individually if the I/O was for
> > +	 * multiple sectors.
> > +	 */
> > +	len = 0;
> > +	for (i = 0; i < failed_bio->bi_vcnt; i++) {
> > +		len += failed_bio->bi_io_vec[i].bv_len;
> > +		if (len > inode->i_sb->s_blocksize) {
> > +			need_validation = true;
> > +			break;
> > +		}
> 
> So given that btrfs is the I/O submitter iterating over all bio_vecs
> should probably be fine.  That being said I deeply dislike the idea
> of having code like this inside the file system.  Maybe we can add
> a helper like:
> 
> u32 bio_size_all(struct bio *bio)
> {
> 	u32 len, i;
> 
> 	for (i = 0; i < bio->bi_vcnt; i++)
> 		len += bio->bi_io_vec[i].bv_len;
> 	return len;
> }
> 
> in the core block code, with a kerneldoc comment documenting the
> exact use cases, and then use that?

That works for me. I was microoptimizing here since I can stop iterating
once I know that the bio is greater than one sector, but since this is
for the rare repair case, it really doesn't matter.

On the other hand, a bio_for_each_bvec_all() helper would feel less
intrusive into the bio guts and also support bailing early. I'm fine
either way. Thoughts?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 12/15] btrfs: get rid of one layer of bios in direct I/O
  2020-03-10 16:38   ` Christoph Hellwig
@ 2020-03-11  9:19     ` Omar Sandoval
  2020-03-16 10:49       ` Christoph Hellwig
  0 siblings, 1 reply; 69+ messages in thread
From: Omar Sandoval @ 2020-03-11  9:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-btrfs, kernel-team

On Tue, Mar 10, 2020 at 05:38:35PM +0100, Christoph Hellwig wrote:
> On Mon, Mar 09, 2020 at 02:32:38PM -0700, Omar Sandoval wrote:
> > 1. The bio created by the generic direct I/O code (dio_bio).
> > 2. A clone of dio_bio we create in btrfs_submit_direct() to represent
> >    the entire direct I/O range (orig_bio).
> > 3. A partial clone of orig_bio limited to the size of a RAID stripe that
> >    we create in btrfs_submit_direct_hook().
> > 4. Clones of each of those split bios for each RAID stripe that we
> >    create in btrfs_map_bio().
> 
> Just curious:  what is number 3 useful for?

The next thing we do with bio 2 (which has a logical block address) is
to map it to physical block addresses on each device (btrfs_map_bio()).
That mapping is per-stripe, so we either have to avoid building bios
that cross a stripe (which is what buffered I/O does) or we have to
split up the bio (which is what direct I/O does). We probably want to
move towards the first approach for direct I/O, as well, but reworking
get_blocks would conflict with the iomap series, and it looks like that
would be easier to do using iomap instead, anyways.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 00/15] btrfs: read repair/direct I/O improvements
  2020-03-10 16:39 ` [PATCH 00/15] btrfs: read repair/direct I/O improvements Christoph Hellwig
@ 2020-03-11  9:22   ` Omar Sandoval
  2020-03-18 16:33     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 69+ messages in thread
From: Omar Sandoval @ 2020-03-11  9:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-btrfs, kernel-team, Goldwyn Rodrigues

On Tue, Mar 10, 2020 at 05:39:40PM +0100, Christoph Hellwig wrote:
> Adding Goldwyn,
> 
> as he has been looking into converting btrfs to the iomap direct
> I/O code.  This doesn't look like a major conflict, but he should
> be knowledgeable about the dio code by now after a few iterations
> of that.

Thanks, I did try to avoid conflicting with the iomap rework, and I'm
looking forward to the further improvement.

Btw, Christoph, I'd love to get your opinion on the RWF_ENCODED series
("[PATCH v4 0/9] fs: interface for directly reading/writing compressed
data").

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 01/15] btrfs: fix error handling when submitting direct I/O bio
  2020-03-09 21:32 ` [PATCH 01/15] btrfs: fix error handling when submitting direct I/O bio Omar Sandoval
@ 2020-03-11 17:54   ` Josef Bacik
  2020-03-17 13:46   ` Nikolay Borisov
  1 sibling, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2020-03-11 17:54 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

On 3/9/20 5:32 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> If we submit orig_bio in btrfs_submit_direct_hook(), we never increment
> pending_bios. Then, if btrfs_submit_dio_bio() fails, we decrement
> pending_bios to -1, and we never complete orig_bio. Fix it by
> initializing pending_bios to 1 instead of incrementing later.
> 
> Fixing this exposes another bug: we put orig_bio prematurely and then
> put it again from end_io. Fix it by not putting orig_bio.
> 
> After this change, pending_bios is really more of a reference count, but
> I'll leave that cleanup separate to keep the fix small.
> 
> Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO")
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 04/15] btrfs: don't do repair validation for checksum errors
  2020-03-09 21:32 ` [PATCH 04/15] btrfs: don't do repair validation for checksum errors Omar Sandoval
@ 2020-03-11 17:55   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2020-03-11 17:55 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

On 3/9/20 5:32 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> The purpose of the validation step is to distinguish between good and
> bad sectors in a failed multi-sector read. If a multi-sector read
> succeeded but some of those sectors had checksum errors, we don't need
> to validate anything; we know the sectors with bad checksums need to be
> repaired.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 05/15] btrfs: clarify btrfs_lookup_bio_sums documentation
  2020-03-09 21:32 ` [PATCH 05/15] btrfs: clarify btrfs_lookup_bio_sums documentation Omar Sandoval
@ 2020-03-11 17:56   ` Josef Bacik
  2020-03-11 18:23     ` Omar Sandoval
  2020-03-17 14:38   ` Nikolay Borisov
  1 sibling, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2020-03-11 17:56 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

On 3/9/20 5:32 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Fix a couple of issues in the btrfs_lookup_bio_sums documentation:
> 
> * The bio doesn't need to be a btrfs_io_bio if dst was provided. Move
>    the declaration in the code to make that clear, too.
> * dst must be large enough to hold nblocks * csum_size, not just
>    csum_size.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>   fs/btrfs/file-item.c | 11 +++++++----
>   1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 6c849e8fd5a1..fa9f4a92f74d 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -242,11 +242,13 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
>   /**
>    * btrfs_lookup_bio_sums - Look up checksums for a bio.
>    * @inode: inode that the bio is for.
> - * @bio: bio embedded in btrfs_io_bio.
> + * @bio: bio to look up.
>    * @offset: Unless (u64)-1, look up checksums for this offset in the file.
>    *          If (u64)-1, use the page offsets from the bio instead.
> - * @dst: Buffer of size btrfs_super_csum_size() used to return checksum. If
> - *       NULL, the checksum is returned in btrfs_io_bio(bio)->csum instead.
> + * @dst: Buffer of size nblocks * btrfs_super_csum_size() used to return
> + *       checksum (nblocks = bio->bi_iter.bi_size / sectorsize). If NULL, the
> + *       checksum buffer is allocated and returned in btrfs_io_bio(bio)->csum
> + *       instead.
>    *
>    * Return: BLK_STS_RESOURCE if allocating memory fails, BLK_STS_OK otherwise.
>    */
> @@ -256,7 +258,6 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
>   	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   	struct bio_vec bvec;
>   	struct bvec_iter iter;
> -	struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
>   	struct btrfs_csum_item *item = NULL;
>   	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>   	struct btrfs_path *path;
> @@ -277,6 +278,8 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
>   
>   	nblocks = bio->bi_iter.bi_size >> inode->i_sb->s_blocksize_bits;
>   	if (!dst) {
> +		struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
> +

Looks like you have some extra changes in here?

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 06/15] btrfs: rename __readpage_endio_check to check_data_csum
  2020-03-09 21:32 ` [PATCH 06/15] btrfs: rename __readpage_endio_check to check_data_csum Omar Sandoval
  2020-03-10 14:46   ` Johannes Thumshirn
@ 2020-03-11 17:57   ` Josef Bacik
  2020-03-17 14:39   ` Nikolay Borisov
  2 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2020-03-11 17:57 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

On 3/9/20 5:32 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> __readpage_endio_check() is also used from the direct I/O read code, so
> give it a more descriptive name.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 07/15] btrfs: make btrfs_check_repairable() static
  2020-03-09 21:32 ` [PATCH 07/15] btrfs: make btrfs_check_repairable() static Omar Sandoval
  2020-03-10 14:53   ` Johannes Thumshirn
@ 2020-03-11 17:58   ` Josef Bacik
  2020-03-17 14:52   ` Nikolay Borisov
  2 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2020-03-11 17:58 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

On 3/9/20 5:32 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Since its introduction in commit 2fe6303e7cd0 ("Btrfs: split
> bio_readpage_error into several functions"), btrfs_check_repairable()
> has only been used from extent_io.c where it is defined.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 09/15] btrfs: kill btrfs_dio_private->private
  2020-03-09 21:32 ` [PATCH 09/15] btrfs: kill btrfs_dio_private->private Omar Sandoval
@ 2020-03-11 17:59   ` Josef Bacik
  2020-03-17 14:54   ` Nikolay Borisov
  1 sibling, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2020-03-11 17:59 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

On 3/9/20 5:32 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> We haven't used this since commit 9be3395bcd4a ("Btrfs: use a btrfs
> bioset instead of abusing bio internals").
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 10/15] btrfs: convert btrfs_dio_private->pending_bios to refcount_t
  2020-03-09 21:32 ` [PATCH 10/15] btrfs: convert btrfs_dio_private->pending_bios to refcount_t Omar Sandoval
@ 2020-03-11 18:00   ` Josef Bacik
  2020-03-17 15:10   ` Nikolay Borisov
  1 sibling, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2020-03-11 18:00 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

On 3/9/20 5:32 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This is really a reference count now, so convert it to refcount_t and
> rename it to refs.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 11/15] btrfs: put direct I/O checksums in btrfs_dio_private instead of bio
  2020-03-09 21:32 ` [PATCH 11/15] btrfs: put direct I/O checksums in btrfs_dio_private instead of bio Omar Sandoval
@ 2020-03-11 18:04   ` Josef Bacik
  2020-03-17 16:37   ` Nikolay Borisov
  2020-04-03 16:18   ` David Sterba
  2 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2020-03-11 18:04 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

On 3/9/20 5:32 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> The next commit will get rid of btrfs_dio_private->orig_bio. The only
> thing we really need it for is containing all of the checksums, but we
> can easily put those in btrfs_dio_private and get rid of the awkward
> logic that looks up the checksums for orig_bio when the first split bio
> is submitted. (Interestingly, btrfs_dio_private did contain the
> checksums before commit 23ea8e5a0767 ("Btrfs: load checksum data once
> when submitting a direct read io"), but it didn't look them up up
> front.)
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 12/15] btrfs: get rid of one layer of bios in direct I/O
  2020-03-09 21:32 ` [PATCH 12/15] btrfs: get rid of one layer of bios in direct I/O Omar Sandoval
  2020-03-10 16:38   ` Christoph Hellwig
@ 2020-03-11 18:07   ` Josef Bacik
  2020-03-17 16:48   ` Nikolay Borisov
  2 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2020-03-11 18:07 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

On 3/9/20 5:32 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> In the worst case, there are _4_ layers of bios in the Btrfs direct I/O
> path:
> 
> 1. The bio created by the generic direct I/O code (dio_bio).
> 2. A clone of dio_bio we create in btrfs_submit_direct() to represent
>     the entire direct I/O range (orig_bio).
> 3. A partial clone of orig_bio limited to the size of a RAID stripe that
>     we create in btrfs_submit_direct_hook().
> 4. Clones of each of those split bios for each RAID stripe that we
>     create in btrfs_map_bio().
> 
> As of the previous commit, the second layer (orig_bio) is no longer
> needed for anything: we can split dio_bio instead, and complete dio_bio
> directly when all of the cloned bios complete. This lets us clean up a
> bunch of cruft, including dip->subio_endio and dip->errors (we can use
> dio_bio->bi_status instead). It also enables the next big cleanup of
> direct I/O read repair.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 13/15] btrfs: simplify direct I/O read repair
  2020-03-09 21:32 ` [PATCH 13/15] btrfs: simplify direct I/O read repair Omar Sandoval
@ 2020-03-11 18:16   ` Josef Bacik
  2020-04-03 16:40   ` David Sterba
  1 sibling, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2020-03-11 18:16 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

On 3/9/20 5:32 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Direct I/O read repair is an over-complicated mess. There is major code
> duplication between __btrfs_subio_endio_read() (checks checksums and
> handles I/O errors for files with checksums),
> __btrfs_correct_data_nocsum() (handles I/O errors for files without
> checksums), btrfs_retry_endio() (checks checksums and handles I/O errors
> for retries of files with checksums), and btrfs_retry_endio_nocsum()
> (handles I/O errors for retries of files without checksum). If it sounds
> like these should be one function, that's because they should.
> 
> After the previous commit getting rid of orig_bio, we can reuse the same
> endio callback for repair I/O and the original I/O, we just need to
> track the file offset and original iterator in the repair bio. We can
> also unify the handling of files with and without checksums and replace
> the atrocity that was probably the inspiration for "Go To Statement
> Considered Harmful" with normal loops. We also no longer have to wait
> for each repair I/O to complete one by one.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Man that was a doozy, took me a while to realize we only ever use our ->logical 
for DIO.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 14/15] btrfs: get rid of endio_repair_workers
  2020-03-09 21:32 ` [PATCH 14/15] btrfs: get rid of endio_repair_workers Omar Sandoval
@ 2020-03-11 18:16   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2020-03-11 18:16 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

On 3/9/20 5:32 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This was originally added in commit 8b110e393c5a ("Btrfs: implement
> repair function when direct read fails") because the original bio waited
> for the repair bio to complete, so the repair I/O couldn't go through
> the same workqueue. As of the previous commit, this is no longer true,
> so this separate workqueue is unnecessary.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/15] btrfs: unify buffered and direct I/O read repair
  2020-03-09 21:32 ` [PATCH 15/15] btrfs: unify buffered and direct I/O read repair Omar Sandoval
@ 2020-03-11 18:19   ` Josef Bacik
  2020-03-19  8:53   ` Nikolay Borisov
  1 sibling, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2020-03-11 18:19 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig

On 3/9/20 5:32 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Currently, direct I/O has its own versions of bio_readpage_error() and
> btrfs_check_repairable() (dio_read_error() and
> btrfs_check_dio_repairable(), respectively). The main difference is that
> the direct I/O version doesn't do read validation. The rework of direct
> I/O repair makes it possible to do validation, so we can get rid of
> btrfs_check_dio_repairable() and combine bio_readpage_error() and
> dio_read_error() into a new helper, btrfs_submit_read_repair().
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 05/15] btrfs: clarify btrfs_lookup_bio_sums documentation
  2020-03-11 17:56   ` Josef Bacik
@ 2020-03-11 18:23     ` Omar Sandoval
  2020-03-11 18:34       ` Josef Bacik
  0 siblings, 1 reply; 69+ messages in thread
From: Omar Sandoval @ 2020-03-11 18:23 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team, Christoph Hellwig

On Wed, Mar 11, 2020 at 01:56:44PM -0400, Josef Bacik wrote:
> On 3/9/20 5:32 PM, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Fix a couple of issues in the btrfs_lookup_bio_sums documentation:
> > 
> > * The bio doesn't need to be a btrfs_io_bio if dst was provided. Move
> >    the declaration in the code to make that clear, too.
> > * dst must be large enough to hold nblocks * csum_size, not just
> >    csum_size.
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >   fs/btrfs/file-item.c | 11 +++++++----
> >   1 file changed, 7 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> > index 6c849e8fd5a1..fa9f4a92f74d 100644
> > --- a/fs/btrfs/file-item.c
> > +++ b/fs/btrfs/file-item.c
> > @@ -242,11 +242,13 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
> >   /**
> >    * btrfs_lookup_bio_sums - Look up checksums for a bio.
> >    * @inode: inode that the bio is for.
> > - * @bio: bio embedded in btrfs_io_bio.
> > + * @bio: bio to look up.
> >    * @offset: Unless (u64)-1, look up checksums for this offset in the file.
> >    *          If (u64)-1, use the page offsets from the bio instead.
> > - * @dst: Buffer of size btrfs_super_csum_size() used to return checksum. If
> > - *       NULL, the checksum is returned in btrfs_io_bio(bio)->csum instead.
> > + * @dst: Buffer of size nblocks * btrfs_super_csum_size() used to return
> > + *       checksum (nblocks = bio->bi_iter.bi_size / sectorsize). If NULL, the
> > + *       checksum buffer is allocated and returned in btrfs_io_bio(bio)->csum
> > + *       instead.
> >    *
> >    * Return: BLK_STS_RESOURCE if allocating memory fails, BLK_STS_OK otherwise.
> >    */
> > @@ -256,7 +258,6 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
> >   	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >   	struct bio_vec bvec;
> >   	struct bvec_iter iter;
> > -	struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
> >   	struct btrfs_csum_item *item = NULL;
> >   	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> >   	struct btrfs_path *path;
> > @@ -277,6 +278,8 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
> >   	nblocks = bio->bi_iter.bi_size >> inode->i_sb->s_blocksize_bits;
> >   	if (!dst) {
> > +		struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
> > +
> 
> Looks like you have some extra changes in here?

I mentioned it in the commit message: "Move the declaration in the code
to make that clear". It looks weird to document that the bio only needs
to be a btrfs_io_bio if dst == NULL and then always get the btrfs_bio,
even if we don't use it.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 05/15] btrfs: clarify btrfs_lookup_bio_sums documentation
  2020-03-11 18:23     ` Omar Sandoval
@ 2020-03-11 18:34       ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2020-03-11 18:34 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Christoph Hellwig

On 3/11/20 2:23 PM, Omar Sandoval wrote:
> On Wed, Mar 11, 2020 at 01:56:44PM -0400, Josef Bacik wrote:
>> On 3/9/20 5:32 PM, Omar Sandoval wrote:
>>> From: Omar Sandoval <osandov@fb.com>
>>>
>>> Fix a couple of issues in the btrfs_lookup_bio_sums documentation:
>>>
>>> * The bio doesn't need to be a btrfs_io_bio if dst was provided. Move
>>>     the declaration in the code to make that clear, too.
>>> * dst must be large enough to hold nblocks * csum_size, not just
>>>     csum_size.
>>>
>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>> ---
>>>    fs/btrfs/file-item.c | 11 +++++++----
>>>    1 file changed, 7 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
>>> index 6c849e8fd5a1..fa9f4a92f74d 100644
>>> --- a/fs/btrfs/file-item.c
>>> +++ b/fs/btrfs/file-item.c
>>> @@ -242,11 +242,13 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
>>>    /**
>>>     * btrfs_lookup_bio_sums - Look up checksums for a bio.
>>>     * @inode: inode that the bio is for.
>>> - * @bio: bio embedded in btrfs_io_bio.
>>> + * @bio: bio to look up.
>>>     * @offset: Unless (u64)-1, look up checksums for this offset in the file.
>>>     *          If (u64)-1, use the page offsets from the bio instead.
>>> - * @dst: Buffer of size btrfs_super_csum_size() used to return checksum. If
>>> - *       NULL, the checksum is returned in btrfs_io_bio(bio)->csum instead.
>>> + * @dst: Buffer of size nblocks * btrfs_super_csum_size() used to return
>>> + *       checksum (nblocks = bio->bi_iter.bi_size / sectorsize). If NULL, the
>>> + *       checksum buffer is allocated and returned in btrfs_io_bio(bio)->csum
>>> + *       instead.
>>>     *
>>>     * Return: BLK_STS_RESOURCE if allocating memory fails, BLK_STS_OK otherwise.
>>>     */
>>> @@ -256,7 +258,6 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
>>>    	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>>    	struct bio_vec bvec;
>>>    	struct bvec_iter iter;
>>> -	struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
>>>    	struct btrfs_csum_item *item = NULL;
>>>    	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>>>    	struct btrfs_path *path;
>>> @@ -277,6 +278,8 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
>>>    	nblocks = bio->bi_iter.bi_size >> inode->i_sb->s_blocksize_bits;
>>>    	if (!dst) {
>>> +		struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
>>> +
>>
>> Looks like you have some extra changes in here?
> 
> I mentioned it in the commit message: "Move the declaration in the code
> to make that clear". It looks weird to document that the bio only needs
> to be a btrfs_io_bio if dst == NULL and then always get the btrfs_bio,
> even if we don't use it.
> 

Apparently I can't read multi-sentence bullet points.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 03/15] btrfs: look at full bi_io_vec for repair decision
  2020-03-11  9:07     ` Omar Sandoval
@ 2020-03-16 10:48       ` Christoph Hellwig
  0 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2020-03-16 10:48 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Christoph Hellwig, linux-btrfs, kernel-team

On Wed, Mar 11, 2020 at 02:07:47AM -0700, Omar Sandoval wrote:
> That works for me. I was microoptimizing here since I can stop iterating
> once I know that the bio is greater than one sector, but since this is
> for the rare repair case, it really doesn't matter.
> 
> On the other hand, a bio_for_each_bvec_all() helper would feel less
> intrusive into the bio guts and also support bailing early. I'm fine
> either way. Thoughts?

I don't really care too much as long as we don't open code the bi_io_vec
access.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 12/15] btrfs: get rid of one layer of bios in direct I/O
  2020-03-11  9:19     ` Omar Sandoval
@ 2020-03-16 10:49       ` Christoph Hellwig
  0 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2020-03-16 10:49 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Christoph Hellwig, linux-btrfs, kernel-team

On Wed, Mar 11, 2020 at 02:19:40AM -0700, Omar Sandoval wrote:
> On Tue, Mar 10, 2020 at 05:38:35PM +0100, Christoph Hellwig wrote:
> > On Mon, Mar 09, 2020 at 02:32:38PM -0700, Omar Sandoval wrote:
> > > 1. The bio created by the generic direct I/O code (dio_bio).
> > > 2. A clone of dio_bio we create in btrfs_submit_direct() to represent
> > >    the entire direct I/O range (orig_bio).
> > > 3. A partial clone of orig_bio limited to the size of a RAID stripe that
> > >    we create in btrfs_submit_direct_hook().
> > > 4. Clones of each of those split bios for each RAID stripe that we
> > >    create in btrfs_map_bio().
> > 
> > Just curious:  what is number 3 useful for?
> 
> The next thing we do with bio 2 (which has a logical block address) is
> to map it to physical block addresses on each device (btrfs_map_bio()).
> That mapping is per-stripe, so we either have to avoid building bios
> that cross a stripe (which is what buffered I/O does)

And which seems inherently sensible..

> or we have to
> split up the bio (which is what direct I/O does). We probably want to
> move towards the first approach for direct I/O, as well, but reworking
> get_blocks would conflict with the iomap series, and it looks like that
> would be easier to do using iomap instead, anyways.

True.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 01/15] btrfs: fix error handling when submitting direct I/O bio
  2020-03-09 21:32 ` [PATCH 01/15] btrfs: fix error handling when submitting direct I/O bio Omar Sandoval
  2020-03-11 17:54   ` Josef Bacik
@ 2020-03-17 13:46   ` Nikolay Borisov
  1 sibling, 0 replies; 69+ messages in thread
From: Nikolay Borisov @ 2020-03-17 13:46 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig



On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> If we submit orig_bio in btrfs_submit_direct_hook(), we never increment
> pending_bios. Then, if btrfs_submit_dio_bio() fails, we decrement
> pending_bios to -1, and we never complete orig_bio. Fix it by
> initializing pending_bios to 1 instead of incrementing later.

nit: I'd rephrase this paragraph to put the emphasis on when this could
happen, which is when the write falls entirely within a chunk's stripe
(i.e doesn't span 64k region in case of having a block group with a
profile different than SINGLE) or doesn't span a chunk in case of a
profile different than SINGLE.

> 
> Fixing this exposes another bug: we put orig_bio prematurely and then
> put it again from end_io. Fix it by not putting orig_bio.
> 
> After this change, pending_bios is really more of a reference count, but
> I'll leave that cleanup separate to keep the fix small.
> 
> Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO")
> Signed-off-by: Omar Sandoval <osandov@fb.com>

The changes look good, I just wonder why didn't this trip earlier...

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 02/15] btrfs: fix double __endio_write_update_ordered in direct I/O
  2020-03-09 21:32 ` [PATCH 02/15] btrfs: fix double __endio_write_update_ordered in direct I/O Omar Sandoval
  2020-03-10 16:30   ` Christoph Hellwig
@ 2020-03-17 14:04   ` Nikolay Borisov
  1 sibling, 0 replies; 69+ messages in thread
From: Nikolay Borisov @ 2020-03-17 14:04 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig



On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> In btrfs_submit_direct(), if we fail to allocate the btrfs_dio_private,
> we complete the ordered extent range. However, we don't mark that the
> range doesn't need to be cleaned up from btrfs_direct_IO() until later.
> Therefore, if we fail to allocate the btrfs_dio_private, we complete the
> ordered extent range twice. We could fix this by updating
> unsubmitted_oe_range earlier, but it's simpler to always clean up via
> the bio once the btrfs_dio_private is allocated and leave it for
> btrfs_direct_IO() before that.
> 
> Fixes: f28a49287817 ("Btrfs: fix leaking of ordered extents after direct IO write error")
> Signed-off-by: Omar Sandoval <osandov@fb.com>

The result code is much nicer and also I think your suggestion of having
just btrfs_submit_direct and factoring out the dip setup code into a
separate function makes sense. In any case:

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 03/15] btrfs: look at full bi_io_vec for repair decision
  2020-03-09 21:32 ` [PATCH 03/15] btrfs: look at full bi_io_vec for repair decision Omar Sandoval
  2020-03-10 16:33   ` Christoph Hellwig
@ 2020-03-17 14:38   ` Nikolay Borisov
  1 sibling, 0 replies; 69+ messages in thread
From: Nikolay Borisov @ 2020-03-17 14:38 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig



On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Read repair does two things: it finds a good copy of data to return to
> the reader, and it corrects the bad copy on disk. If a read of multiple
> sectors has an I/O error, repair does an extra "validation" step that
> issues a separate read for each sector. This allows us to find the exact
> failing sectors and only rewrite those.
> 
> This heuristic is implemented in
> bio_readpage_error()/btrfs_check_repairable() as:
> 
> 	failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
> 	if (failed_bio_pages > 1)
> 		do validation
> 
> However, at this point, bi_iter may have already been advanced. This
> means that we'll skip the validation step and rewrite the entire failed
> read.
> 
> Fix it by getting the actual size from the biovec (which we can do
> because this is only called for non-cloned bios, although that will
> change in a later commit).
> 
> Fixes: 8a2ee44a371c ("btrfs: look at bi_size for repair decisions")
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/extent_io.c | 28 ++++++++++++++++++++++------
>  fs/btrfs/extent_io.h |  5 +++--
>  2 files changed, 25 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 837262d54e28..279731bff0a8 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2528,8 +2528,9 @@ int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
>  	return 0;
>  }
>  
> -bool btrfs_check_repairable(struct inode *inode, unsigned failed_bio_pages,
> -			   struct io_failure_record *failrec, int failed_mirror)
> +bool btrfs_check_repairable(struct inode *inode, bool need_validation,

nit: While at it this function can be made static. It's only used in
extent_io.c and it's defined before its sole caller - bio_readpage_error.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 05/15] btrfs: clarify btrfs_lookup_bio_sums documentation
  2020-03-09 21:32 ` [PATCH 05/15] btrfs: clarify btrfs_lookup_bio_sums documentation Omar Sandoval
  2020-03-11 17:56   ` Josef Bacik
@ 2020-03-17 14:38   ` Nikolay Borisov
  1 sibling, 0 replies; 69+ messages in thread
From: Nikolay Borisov @ 2020-03-17 14:38 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig



On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Fix a couple of issues in the btrfs_lookup_bio_sums documentation:
> 
> * The bio doesn't need to be a btrfs_io_bio if dst was provided. Move
>   the declaration in the code to make that clear, too.
> * dst must be large enough to hold nblocks * csum_size, not just
>   csum_size.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/file-item.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 6c849e8fd5a1..fa9f4a92f74d 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -242,11 +242,13 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
>  /**
>   * btrfs_lookup_bio_sums - Look up checksums for a bio.
>   * @inode: inode that the bio is for.
> - * @bio: bio embedded in btrfs_io_bio.
> + * @bio: bio to look up.
>   * @offset: Unless (u64)-1, look up checksums for this offset in the file.
>   *          If (u64)-1, use the page offsets from the bio instead.
> - * @dst: Buffer of size btrfs_super_csum_size() used to return checksum. If
> - *       NULL, the checksum is returned in btrfs_io_bio(bio)->csum instead.
> + * @dst: Buffer of size nblocks * btrfs_super_csum_size() used to return
> + *       checksum (nblocks = bio->bi_iter.bi_size / sectorsize). If NULL, the

nit: sector here refers to btrfs' notion of a sector which is 4k and not
the default understanding of 512 bytes. Dunno if it makes sense to make
that a bit more emphatic. Also nblocks is really number of checksums.
Looking at file-item.c I see there is only 1 menition of ncsums or
csums_in_item. I guess if we can make that nblocks a bit more explicit?

> + *       checksum buffer is allocated and returned in btrfs_io_bio(bio)->csum
> + *       instead.
>   *
>   * Return: BLK_STS_RESOURCE if allocating memory fails, BLK_STS_OK otherwise.
>   */
> @@ -256,7 +258,6 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct bio_vec bvec;
>  	struct bvec_iter iter;
> -	struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
>  	struct btrfs_csum_item *item = NULL;
>  	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>  	struct btrfs_path *path;
> @@ -277,6 +278,8 @@ blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio,
>  
>  	nblocks = bio->bi_iter.bi_size >> inode->i_sb->s_blocksize_bits;
>  	if (!dst) {
> +		struct btrfs_io_bio *btrfs_bio = btrfs_io_bio(bio);
> +
>  		if (nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE) {
>  			btrfs_bio->csum = kmalloc_array(nblocks, csum_size,
>  							GFP_NOFS);
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 06/15] btrfs: rename __readpage_endio_check to check_data_csum
  2020-03-09 21:32 ` [PATCH 06/15] btrfs: rename __readpage_endio_check to check_data_csum Omar Sandoval
  2020-03-10 14:46   ` Johannes Thumshirn
  2020-03-11 17:57   ` Josef Bacik
@ 2020-03-17 14:39   ` Nikolay Borisov
  2 siblings, 0 replies; 69+ messages in thread
From: Nikolay Borisov @ 2020-03-17 14:39 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig



On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> __readpage_endio_check() is also used from the direct I/O read code, so
> give it a more descriptive name.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 07/15] btrfs: make btrfs_check_repairable() static
  2020-03-09 21:32 ` [PATCH 07/15] btrfs: make btrfs_check_repairable() static Omar Sandoval
  2020-03-10 14:53   ` Johannes Thumshirn
  2020-03-11 17:58   ` Josef Bacik
@ 2020-03-17 14:52   ` Nikolay Borisov
  2 siblings, 0 replies; 69+ messages in thread
From: Nikolay Borisov @ 2020-03-17 14:52 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig



On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Since its introduction in commit 2fe6303e7cd0 ("Btrfs: split
> bio_readpage_error into several functions"), btrfs_check_repairable()
> has only been used from extent_io.c where it is defined.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Ok that addresses my earlier nit,

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 08/15] btrfs: move btrfs_dio_private to inode.c
  2020-03-09 21:32 ` [PATCH 08/15] btrfs: move btrfs_dio_private to inode.c Omar Sandoval
  2020-03-10 14:56   ` Johannes Thumshirn
@ 2020-03-17 14:53   ` Nikolay Borisov
  2020-03-19 16:16   ` David Sterba
  2 siblings, 0 replies; 69+ messages in thread
From: Nikolay Borisov @ 2020-03-17 14:53 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig



On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This hasn't been needed outside of inode.c since commit 23ea8e5a0767
> ("Btrfs: load checksum data once when submitting a direct read io").
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

While doing this can you perhaps remove the forward declaration in
ctree.h as well ?


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 09/15] btrfs: kill btrfs_dio_private->private
  2020-03-09 21:32 ` [PATCH 09/15] btrfs: kill btrfs_dio_private->private Omar Sandoval
  2020-03-11 17:59   ` Josef Bacik
@ 2020-03-17 14:54   ` Nikolay Borisov
  1 sibling, 0 replies; 69+ messages in thread
From: Nikolay Borisov @ 2020-03-17 14:54 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig



On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> We haven't used this since commit 9be3395bcd4a ("Btrfs: use a btrfs
> bioset instead of abusing bio internals").
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 10/15] btrfs: convert btrfs_dio_private->pending_bios to refcount_t
  2020-03-09 21:32 ` [PATCH 10/15] btrfs: convert btrfs_dio_private->pending_bios to refcount_t Omar Sandoval
  2020-03-11 18:00   ` Josef Bacik
@ 2020-03-17 15:10   ` Nikolay Borisov
  1 sibling, 0 replies; 69+ messages in thread
From: Nikolay Borisov @ 2020-03-17 15:10 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig



On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This is really a reference count now, so convert it to refcount_t and
> rename it to refs.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 11/15] btrfs: put direct I/O checksums in btrfs_dio_private instead of bio
  2020-03-09 21:32 ` [PATCH 11/15] btrfs: put direct I/O checksums in btrfs_dio_private instead of bio Omar Sandoval
  2020-03-11 18:04   ` Josef Bacik
@ 2020-03-17 16:37   ` Nikolay Borisov
  2020-04-03 16:18   ` David Sterba
  2 siblings, 0 replies; 69+ messages in thread
From: Nikolay Borisov @ 2020-03-17 16:37 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig



On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> The next commit will get rid of btrfs_dio_private->orig_bio. The only
> thing we really need it for is containing all of the checksums, but we
> can easily put those in btrfs_dio_private and get rid of the awkward
> logic that looks up the checksums for orig_bio when the first split bio
> is submitted. (Interestingly, btrfs_dio_private did contain the
> checksums before commit 23ea8e5a0767 ("Btrfs: load checksum data once
> when submitting a direct read io"), but it didn't look them up up
> front.)

nit: It would be useful to surmise the structural changes this patch
does. Essentially this makes each individual cloned bio to index its
checksums into the global checksum storage array anchored at
btrfs_dio_private::sums.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 12/15] btrfs: get rid of one layer of bios in direct I/O
  2020-03-09 21:32 ` [PATCH 12/15] btrfs: get rid of one layer of bios in direct I/O Omar Sandoval
  2020-03-10 16:38   ` Christoph Hellwig
  2020-03-11 18:07   ` Josef Bacik
@ 2020-03-17 16:48   ` Nikolay Borisov
  2 siblings, 0 replies; 69+ messages in thread
From: Nikolay Borisov @ 2020-03-17 16:48 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig



On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> In the worst case, there are _4_ layers of bios in the Btrfs direct I/O
> path:
> 
> 1. The bio created by the generic direct I/O code (dio_bio).
> 2. A clone of dio_bio we create in btrfs_submit_direct() to represent
>    the entire direct I/O range (orig_bio).
> 3. A partial clone of orig_bio limited to the size of a RAID stripe that
>    we create in btrfs_submit_direct_hook().
> 4. Clones of each of those split bios for each RAID stripe that we
>    create in btrfs_map_bio().
> 
> As of the previous commit, the second layer (orig_bio) is no longer
> needed for anything: we can split dio_bio instead, and complete dio_bio
> directly when all of the cloned bios complete. This lets us clean up a
> bunch of cruft, including dip->subio_endio and dip->errors (we can use
> dio_bio->bi_status instead). It also enables the next big cleanup of
> direct I/O read repair.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/inode.c | 213 +++++++++++++++--------------------------------
>  1 file changed, 66 insertions(+), 147 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 4a2e44f3e66e..40c1562704e9 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -54,11 +54,8 @@ struct btrfs_iget_args {
>  	struct btrfs_root *root;
>  };
>  

<snip>

> @@ -7400,6 +7384,29 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
>  	return ret;
>  }
>  
> +static void btrfs_dio_private_put(struct btrfs_dio_private *dip)
> +{
> +	/*
> +	 * This implies a barrier so that stores to dio_bio->bi_status before
> +	 * this and the following load are fully ordered.
> +	 */

It's not obvious which load this refers to. It's not obvious where this
ordering matters i.e what are the threads that care?

> +	if (!refcount_dec_and_test(&dip->refs))
> +		return;
> +
> +	if (bio_op(dip->dio_bio) == REQ_OP_WRITE) {
> +		__endio_write_update_ordered(dip->inode, dip->logical_offset,
> +					     dip->bytes,
> +					     !dip->dio_bio->bi_status);
> +	} else {
> +		unlock_extent(&BTRFS_I(dip->inode)->io_tree,
> +			      dip->logical_offset,
> +			      dip->logical_offset + dip->bytes - 1);
> +	}
> +
> +	dio_end_io(dip->dio_bio);
> +	kfree(dip);
> +}
> +
>  static inline blk_status_t submit_dio_repair_bio(struct inode *inode,
>  						 struct bio *bio,
>  						 int mirror_num)

<snip>

> @@ -7920,98 +7876,77 @@ static void btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
>  	struct inode *inode = dip->inode;
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct bio *bio;
> -	struct bio *orig_bio = dip->orig_bio;
> -	u64 start_sector = orig_bio->bi_iter.bi_sector;
> +	struct bio *dio_bio = dip->dio_bio;
> +	u64 start_sector = dio_bio->bi_iter.bi_sector;
>  	u64 file_offset = dip->logical_offset;
>  	int async_submit = 0;
> -	u64 submit_len;
> +	u64 submit_len = dio_bio->bi_iter.bi_size;
>  	int clone_offset = 0;
>  	int clone_len;
>  	int ret;
>  	blk_status_t status;
>  	struct btrfs_io_geometry geom;
>  
> -	submit_len = orig_bio->bi_iter.bi_size;
> -	ret = btrfs_get_io_geometry(fs_info, btrfs_op(orig_bio),
> -				    start_sector << 9, submit_len, &geom);
> -	if (ret)
> -		goto out_err;
> -
> -	if (geom.len >= submit_len) {
> -		bio = orig_bio;
> -		dip->flags |= BTRFS_DIO_ORIG_BIO_SUBMITTED;
> -		goto submit;
> -	}
> -
>  	/* async crcs make it difficult to collect full stripe writes. */
>  	if (btrfs_data_alloc_profile(fs_info) & BTRFS_BLOCK_GROUP_RAID56_MASK)
>  		async_submit = 0;
>  	else
>  		async_submit = 1;
>  
> -	/* bio split */
>  	ASSERT(geom.len <= INT_MAX);

geom.len now contains some random data since it's no longer initialised,
meaning this ASSERT hasn't triggered by pure luck. It should be
(re)moved inside the do {} while loop.

>  	do {
> +		ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio),
> +					    start_sector << 9, submit_len,
> +					    &geom);
> +		if (ret) {
> +			status = errno_to_blk_status(ret);
> +			goto out_err;
> +		}
> +
>  		clone_len = min_t(int, submit_len, geom.len);
>  
>  		/*
>  		 * This will never fail as it's passing GPF_NOFS and
>  		 * the allocation is backed by btrfs_bioset.
>  		 */
> -		bio = btrfs_bio_clone_partial(orig_bio, clone_offset,
> -					      clone_len);
> +		bio = btrfs_bio_clone_partial(dio_bio, clone_offset, clone_len);
>  		bio->bi_private = dip;
>  		bio->bi_end_io = btrfs_end_dio_bio;
>  		btrfs_io_bio(bio)->logical = file_offset;
>  
>  		ASSERT(submit_len >= clone_len);
>  		submit_len -= clone_len;
> -		if (submit_len == 0)
> -			break;
>  
>  		/*
>  		 * Increase the count before we submit the bio so we know
>  		 * the end IO handler won't happen before we increase the
>  		 * count. Otherwise, the dip might get freed before we're
>  		 * done setting it up.
> +		 *
> +		 * We transfer the initial reference to the last bio, so we
> +		 * don't need to increment the reference count for the last one.
>  		 */
> -		refcount_inc(&dip->refs);
> +		if (submit_len > 0)
> +			refcount_inc(&dip->refs);
>  
>  		status = btrfs_submit_dio_bio(bio, inode, file_offset,
>  						async_submit);
>  		if (status) {
>  			bio_put(bio);
> -			refcount_dec(&dip->refs);
> +			if (submit_len > 0)
> +				refcount_dec(&dip->refs);
>  			goto out_err;
>  		}
>  
>  		clone_offset += clone_len;
>  		start_sector += clone_len >> 9;
>  		file_offset += clone_len;
> -
> -		ret = btrfs_get_io_geometry(fs_info, btrfs_op(orig_bio),
> -				      start_sector << 9, submit_len, &geom);
> -		if (ret)
> -			goto out_err;
>  	} while (submit_len > 0);> +	return;

<snip>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 00/15] btrfs: read repair/direct I/O improvements
  2020-03-11  9:22   ` Omar Sandoval
@ 2020-03-18 16:33     ` Goldwyn Rodrigues
  2020-03-19 14:08       ` David Sterba
  0 siblings, 1 reply; 69+ messages in thread
From: Goldwyn Rodrigues @ 2020-03-18 16:33 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Christoph Hellwig, linux-btrfs, kernel-team

On  2:22 11/03, Omar Sandoval wrote:
> On Tue, Mar 10, 2020 at 05:39:40PM +0100, Christoph Hellwig wrote:
> > Adding Goldwyn,
> > 
> > as he has been looking into converting btrfs to the iomap direct
> > I/O code.  This doesn't look like a major conflict, but he should
> > be knowledgeable about the dio code by now after a few iterations
> > of that.
> 
> Thanks, I did try to avoid conflicting with the iomap rework, and I'm
> looking forward to the further improvement.

Sorry for the late response, I have been on vacation last week and
recovering from backlog. Most of the code does not seem to conflict with
the direct I/O iomap work.

I am ready with my patches in my git [1]. However, I would like to base
my patches on yours. Do you have them in a public git tree anywhere?
preferably with the acks/review comments.

[1] https://github.com/goldwynr/linux/tree/btrfs-iomap-dio

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 00/15] btrfs: read repair/direct I/O improvements
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (15 preceding siblings ...)
  2020-03-10 16:39 ` [PATCH 00/15] btrfs: read repair/direct I/O improvements Christoph Hellwig
@ 2020-03-18 22:07 ` David Sterba
  2020-03-20 21:29   ` Omar Sandoval
  2020-03-20 14:10 ` Christoph Hellwig
  17 siblings, 1 reply; 69+ messages in thread
From: David Sterba @ 2020-03-18 22:07 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Christoph Hellwig

On Mon, Mar 09, 2020 at 02:32:26PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> This series includes several fixes, cleanups, and improvements to direct
> I/O and read repair. It's preparation for adding read repair to my
> RWF_ENCODED series [1], but it can go in independently.

Overall it looks good, but also scary. There are some comments to
address, I haven't reviewed it thoroughly yes so please don't resend
unless there's something that would harm testing. I'll put the branch to
for-next for some coverage.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/15] btrfs: unify buffered and direct I/O read repair
  2020-03-09 21:32 ` [PATCH 15/15] btrfs: unify buffered and direct I/O read repair Omar Sandoval
  2020-03-11 18:19   ` Josef Bacik
@ 2020-03-19  8:53   ` Nikolay Borisov
  2020-03-19  9:03     ` Christoph Hellwig
  2020-03-20 21:28     ` Omar Sandoval
  1 sibling, 2 replies; 69+ messages in thread
From: Nikolay Borisov @ 2020-03-19  8:53 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: kernel-team, Christoph Hellwig



On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Currently, direct I/O has its own versions of bio_readpage_error() and
> btrfs_check_repairable() (dio_read_error() and
> btrfs_check_dio_repairable(), respectively). The main difference is that
> the direct I/O version doesn't do read validation. The rework of direct
> I/O repair makes it possible to do validation, so we can get rid of
> btrfs_check_dio_repairable() and combine bio_readpage_error() and
> dio_read_error() into a new helper, btrfs_submit_read_repair().
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/extent_io.c | 126 +++++++++++++++++++------------------------
>  fs/btrfs/extent_io.h |  17 +++---
>  fs/btrfs/inode.c     | 103 ++++-------------------------------
>  3 files changed, 76 insertions(+), 170 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index fad86ef4d09d..a5cbe04da803 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c

<snip>

> -/*
> - * This is a generic handler for readpage errors. If other copies exist, read
> - * those and write back good data to the failed position. Does not investigate
> - * in remapping the failed extent elsewhere, hoping the device will be smart
> - * enough to do this as needed
> - */
> -static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
> -			      struct page *page, u64 start, u64 end,
> -			      int failed_mirror)
> +blk_status_t btrfs_submit_read_repair(struct inode *inode,
> +				      struct bio *failed_bio, u64 phy_offset,
> +				      struct page *page, unsigned int pgoff,
> +				      u64 start, u64 end, int failed_mirror,
> +				      submit_bio_hook_t *submit_bio_hook)
>  {
>  	struct io_failure_record *failrec;
> -	struct inode *inode = page->mapping->host;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
>  	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
> +	struct btrfs_io_bio *failed_io_bio = btrfs_io_bio(failed_bio);
> +	struct btrfs_io_bio *io_bio;
> +	int icsum = phy_offset >> inode->i_sb->s_blocksize_bits;
>  	bool need_validation = false;
>  	struct bio *bio;
> -	int read_mode = 0;
>  	blk_status_t status;
>  	int ret;
>  
> +	btrfs_info(btrfs_sb(inode->i_sb),
> +		   "Repair Read Error: read error at %llu", start);
> +
>  	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
>  
>  	ret = btrfs_get_io_failure_record(inode, start, end, &failrec);
>  	if (ret)
> -		return ret;
> +		return errno_to_blk_status(ret);
>  
>  	/*
>  	 * If there was an I/O error and the I/O was for multiple sectors, we
>  	 * need to validate each sector individually.
>  	 */
>  	if (failed_bio->bi_status != BLK_STS_OK) {

Is this correct though, in case of buffered reads we are always called
with bi_status != BLK_STS_OK (we are called from end_bio_extent_readpage
in case uptodate is false,  which happens if failed_bio->bi_status is
non-zero. Additionally the bio is guaranteed to not be cloned because
there is : ASSERT(!bio_flagged(bio, BIO_CLONED));

The end effect of all of this is in case of buffered bios we never set
need_revalidate, is this intentional?

> -		u64 len = 0;
> -		int i;
> -
> -		for (i = 0; i < failed_bio->bi_vcnt; i++) {
> -			len += failed_bio->bi_io_vec[i].bv_len;
> -			if (len > inode->i_sb->s_blocksize) {
> +		if (bio_flagged(failed_bio, BIO_CLONED)) {

If I understand this correctly this is the "this is a DIO " branch. IMO
it'd be clearer if you had bool is_dio = bio_flagged(failed_bio,
BIO_CLONED) at the top of the function and you used that.

> +			if (failed_io_bio->iter.bi_size >
> +			    inode->i_sb->s_blocksize)
>  				need_validation = true;
> -				break;
> +		} else {

This branch will only ever be executed in case of DIO with csum failure.
So either add a comment to demarcate when various leaves of the 2 'if'
should be called or, and I think this would be the better solution,
rewrite it.
> +			u64 len = 0;
> +			int i;
> +
> +			for (i = 0; i < failed_bio->bi_vcnt; i++) {
> +				len += failed_bio->bi_io_vec[i].bv_len;
> +				if (len > inode->i_sb->s_blocksize) {
> +					need_validation = true;
> +					break;
> +				}
>  			}


>  		}
>  	}
> @@ -2674,32 +2646,41 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
>  	if (!btrfs_check_repairable(inode, need_validation, failrec,
>  				    failed_mirror)) {
>  		free_io_failure(failure_tree, tree, failrec);
> -		return -EIO;
> +		return BLK_STS_IOERR;
>  	}
>  
> +	bio = btrfs_io_bio_alloc(1);
> +	io_bio = btrfs_io_bio(bio);
> +	bio->bi_opf = REQ_OP_READ;
>  	if (need_validation)
> -		read_mode |= REQ_FAILFAST_DEV;
> +		bio->bi_opf |= REQ_FAILFAST_DEV;
> +	bio->bi_end_io = failed_bio->bi_end_io;
> +	bio->bi_iter.bi_sector = failrec->logical >> 9;
> +	bio->bi_private = failed_bio->bi_private;
nit: I'd rather have this named as repair_bio, right now the function has:

failed_bio, failed_io_bio and simply bio. But in fact the latter is a
repair bio derived from the io_failured_record.
>  
> -	phy_offset >>= inode->i_sb->s_blocksize_bits;
> -	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
> -				      start - page_offset(page),
> -				      (int)phy_offset, failed_bio->bi_end_io,
> -				      NULL);
> -	bio->bi_opf = REQ_OP_READ | read_mode;
> +	if (failed_io_bio->csum) {
> +		u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
> +
> +		io_bio->csum = io_bio->csum_inline;
> +		memcpy(io_bio->csum, failed_io_bio->csum + csum_size * icsum,
> +		       csum_size);
> +	}
> +
> +	bio_add_page(bio, page, failrec->len, pgoff);
> +	io_bio->logical = failrec->start;
> +	io_bio->iter = bio->bi_iter;
>  
>  	btrfs_debug(btrfs_sb(inode->i_sb),
> -		"Repair Read Error: submitting new read[%#x] to this_mirror=%d, in_validation=%d",
> -		read_mode, failrec->this_mirror, failrec->in_validation);
> +"Repair Read Error: submitting new read to this_mirror=%d, in_validation=%d",
> +		    failrec->this_mirror, failrec->in_validation);
>  
> -	status = tree->ops->submit_bio_hook(tree->private_data, bio, failrec->this_mirror,
> -					 failrec->bio_flags);
> +	status = submit_bio_hook(inode, bio, failrec->this_mirror,
> +				 failrec->bio_flags);
>  	if (status) {
>  		free_io_failure(failure_tree, tree, failrec);
>  		bio_put(bio);
> -		ret = blk_status_to_errno(status);
>  	}
> -
> -	return ret;
> +	return status;
>  }
>  
>  /* lots and lots of room for performance fixes in the end_bio funcs */

<snip>

> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/15] btrfs: unify buffered and direct I/O read repair
  2020-03-19  8:53   ` Nikolay Borisov
@ 2020-03-19  9:03     ` Christoph Hellwig
  2020-03-20 21:28     ` Omar Sandoval
  1 sibling, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2020-03-19  9:03 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Omar Sandoval, linux-btrfs, kernel-team, Christoph Hellwig

On Thu, Mar 19, 2020 at 10:53:22AM +0200, Nikolay Borisov wrote:
> Is this correct though, in case of buffered reads we are always called
> with bi_status != BLK_STS_OK (we are called from end_bio_extent_readpage
> in case uptodate is false,  which happens if failed_bio->bi_status is
> non-zero. Additionally the bio is guaranteed to not be cloned because
> there is : ASSERT(!bio_flagged(bio, BIO_CLONED));

> If I understand this correctly this is the "this is a DIO " branch. IMO
> it'd be clearer if you had bool is_dio = bio_flagged(failed_bio,
> BIO_CLONED) at the top of the function and you used that.

The non-fragile way would be to pass an explicit is_bio argument.
The is cloned thing is just a side effect of the weird cloning done
in the bio path, which hopefully won't survive too long.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 00/15] btrfs: read repair/direct I/O improvements
  2020-03-18 16:33     ` Goldwyn Rodrigues
@ 2020-03-19 14:08       ` David Sterba
  0 siblings, 0 replies; 69+ messages in thread
From: David Sterba @ 2020-03-19 14:08 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Omar Sandoval, Christoph Hellwig, linux-btrfs, kernel-team

On Wed, Mar 18, 2020 at 11:33:18AM -0500, Goldwyn Rodrigues wrote:
> On  2:22 11/03, Omar Sandoval wrote:

> I am ready with my patches in my git [1]. However, I would like to base
> my patches on yours. Do you have them in a public git tree anywhere?
> preferably with the acks/review comments.

I have the branch for testing in my github repository,
https://github.com/kdave/btrfs-devel/tree/ext/omar/dio-fixes .

Regarding the two patchsest, it makes things easier to test at least if
the code does not conflict but I'm a bit worried about adding 2
different reworks of the DIO in one release. OTOH, we' have focus on the
same subsystem, so at this point I'm preferring the option to merge both
unless something serious pops up. The merge target would be 5.8, plenty
of time to revise the desision if needed.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 08/15] btrfs: move btrfs_dio_private to inode.c
  2020-03-09 21:32 ` [PATCH 08/15] btrfs: move btrfs_dio_private to inode.c Omar Sandoval
  2020-03-10 14:56   ` Johannes Thumshirn
  2020-03-17 14:53   ` Nikolay Borisov
@ 2020-03-19 16:16   ` David Sterba
  2 siblings, 0 replies; 69+ messages in thread
From: David Sterba @ 2020-03-19 16:16 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Christoph Hellwig

On Mon, Mar 09, 2020 at 02:32:34PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This hasn't been needed outside of inode.c since commit 23ea8e5a0767
> ("Btrfs: load checksum data once when submitting a direct read io").

For ease of merging with Goldwyn's dio-iomap patches, can you please
avoid moving code? In this case it's btrfs_dio_private and some followup
patches that modify it.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 00/15] btrfs: read repair/direct I/O improvements
  2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
                   ` (16 preceding siblings ...)
  2020-03-18 22:07 ` David Sterba
@ 2020-03-20 14:10 ` Christoph Hellwig
  2020-03-20 14:11   ` Christoph Hellwig
  17 siblings, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2020-03-20 14:10 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Christoph Hellwig

Btw, while you touch this code:

It seems like btrfs_dio_private.subio_endio is rather pointless,
as it is always set to one function for reads and otherwise never
set.  De-virtualizing this call could help making the code a little
faster and easier to understand.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 00/15] btrfs: read repair/direct I/O improvements
  2020-03-20 14:10 ` Christoph Hellwig
@ 2020-03-20 14:11   ` Christoph Hellwig
  0 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2020-03-20 14:11 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Christoph Hellwig

On Fri, Mar 20, 2020 at 03:10:41PM +0100, Christoph Hellwig wrote:
> Btw, while you touch this code:
> 
> It seems like btrfs_dio_private.subio_endio is rather pointless,
> as it is always set to one function for reads and otherwise never
> set.  De-virtualizing this call could help making the code a little
> faster and easier to understand.

Doh, one of the later patches actually kills it.  Time for more coffee..

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/15] btrfs: unify buffered and direct I/O read repair
  2020-03-19  8:53   ` Nikolay Borisov
  2020-03-19  9:03     ` Christoph Hellwig
@ 2020-03-20 21:28     ` Omar Sandoval
  1 sibling, 0 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-20 21:28 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs, kernel-team, Christoph Hellwig

On Thu, Mar 19, 2020 at 10:53:22AM +0200, Nikolay Borisov wrote:
> 
> 
> On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Currently, direct I/O has its own versions of bio_readpage_error() and
> > btrfs_check_repairable() (dio_read_error() and
> > btrfs_check_dio_repairable(), respectively). The main difference is that
> > the direct I/O version doesn't do read validation. The rework of direct
> > I/O repair makes it possible to do validation, so we can get rid of
> > btrfs_check_dio_repairable() and combine bio_readpage_error() and
> > dio_read_error() into a new helper, btrfs_submit_read_repair().
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  fs/btrfs/extent_io.c | 126 +++++++++++++++++++------------------------
> >  fs/btrfs/extent_io.h |  17 +++---
> >  fs/btrfs/inode.c     | 103 ++++-------------------------------
> >  3 files changed, 76 insertions(+), 170 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index fad86ef4d09d..a5cbe04da803 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> 
> <snip>
> 
> > -/*
> > - * This is a generic handler for readpage errors. If other copies exist, read
> > - * those and write back good data to the failed position. Does not investigate
> > - * in remapping the failed extent elsewhere, hoping the device will be smart
> > - * enough to do this as needed
> > - */
> > -static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
> > -			      struct page *page, u64 start, u64 end,
> > -			      int failed_mirror)
> > +blk_status_t btrfs_submit_read_repair(struct inode *inode,
> > +				      struct bio *failed_bio, u64 phy_offset,
> > +				      struct page *page, unsigned int pgoff,
> > +				      u64 start, u64 end, int failed_mirror,
> > +				      submit_bio_hook_t *submit_bio_hook)
> >  {
> >  	struct io_failure_record *failrec;
> > -	struct inode *inode = page->mapping->host;
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> >  	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
> > +	struct btrfs_io_bio *failed_io_bio = btrfs_io_bio(failed_bio);
> > +	struct btrfs_io_bio *io_bio;
> > +	int icsum = phy_offset >> inode->i_sb->s_blocksize_bits;
> >  	bool need_validation = false;
> >  	struct bio *bio;
> > -	int read_mode = 0;
> >  	blk_status_t status;
> >  	int ret;
> >  
> > +	btrfs_info(btrfs_sb(inode->i_sb),
> > +		   "Repair Read Error: read error at %llu", start);
> > +
> >  	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
> >  
> >  	ret = btrfs_get_io_failure_record(inode, start, end, &failrec);
> >  	if (ret)
> > -		return ret;
> > +		return errno_to_blk_status(ret);
> >  
> >  	/*
> >  	 * If there was an I/O error and the I/O was for multiple sectors, we
> >  	 * need to validate each sector individually.
> >  	 */
> >  	if (failed_bio->bi_status != BLK_STS_OK) {
> 
> Is this correct though, in case of buffered reads we are always called
> with bi_status != BLK_STS_OK (we are called from end_bio_extent_readpage
> in case uptodate is false,  which happens if failed_bio->bi_status is
> non-zero. Additionally the bio is guaranteed to not be cloned because
> there is : ASSERT(!bio_flagged(bio, BIO_CLONED));
> 
> The end effect of all of this is in case of buffered bios we never set
> need_revalidate, is this intentional?

For buffered I/O, this is called when bi_status != BLK_STS_OK OR
readpage_end_io_hook (i.e., check_data_csum()) failed. This check
distinguishes between those two cases: if we didn't hit an I/O error
(bi_status == BLK_STS_OK), then we don't need validation, otherwise, we
need validation if the bio is more than one sector.

> > -		u64 len = 0;
> > -		int i;
> > -
> > -		for (i = 0; i < failed_bio->bi_vcnt; i++) {
> > -			len += failed_bio->bi_io_vec[i].bv_len;
> > -			if (len > inode->i_sb->s_blocksize) {
> > +		if (bio_flagged(failed_bio, BIO_CLONED)) {
> 
> If I understand this correctly this is the "this is a DIO " branch. IMO
> it'd be clearer if you had bool is_dio = bio_flagged(failed_bio,
> BIO_CLONED) at the top of the function and you used that.

Repair bios for direct I/O aren't cloned, so is_dio isn't accurate. IMO
it shouldn't matter whether it came from direct I/O or not. If it's a
cloned bio, you get the size out of io_bio->iter, and if it's not, you
get it out of bi_io_vec.

> > +			if (failed_io_bio->iter.bi_size >
> > +			    inode->i_sb->s_blocksize)
> >  				need_validation = true;
> > -				break;
> > +		} else {
> 
> This branch will only ever be executed in case of DIO with csum failure.
> So either add a comment to demarcate when various leaves of the 2 'if'
> should be called or, and I think this would be the better solution,
> rewrite it.

As commented above, this outer branch is for I/O errors (not checksum
errors), and this specific branch is for non-cloned bios, which happens
to be buffered read bios and buffered or direct I/O repair bios. Would
it be clearer as:

static u64 btrfs_bio_size(struct bio *bio)
{
	if (bio_flagged(bio, BIO_CLONED))
		return bio->iter.bi_size;
	else
		return bio_size_all(bio);
}

blk_status_t btrfs_submit_read_repair(...)
{
	...
	/*
	 * If there was an I/O error and the I/O was for multiple sectors, we
	 * need to validate each sector individually.
	 */
	need_validation = (failed_bio->bi_status != BLK_STS_OK &&
			   btrfs_bio_size() > inode->i_sb->s_blocksize);
	...
}

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 00/15] btrfs: read repair/direct I/O improvements
  2020-03-18 22:07 ` David Sterba
@ 2020-03-20 21:29   ` Omar Sandoval
  0 siblings, 0 replies; 69+ messages in thread
From: Omar Sandoval @ 2020-03-20 21:29 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team, Christoph Hellwig

On Wed, Mar 18, 2020 at 11:07:33PM +0100, David Sterba wrote:
> On Mon, Mar 09, 2020 at 02:32:26PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > This series includes several fixes, cleanups, and improvements to direct
> > I/O and read repair. It's preparation for adding read repair to my
> > RWF_ENCODED series [1], but it can go in independently.
> 
> Overall it looks good, but also scary. There are some comments to
> address, I haven't reviewed it thoroughly yes so please don't resend
> unless there's something that would harm testing. I'll put the branch to
> for-next for some coverage.

Thanks! The only thing so far that might affect testing is the
uninitialized geom.len ASSERT that Nikolay pointed out.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 11/15] btrfs: put direct I/O checksums in btrfs_dio_private instead of bio
  2020-03-09 21:32 ` [PATCH 11/15] btrfs: put direct I/O checksums in btrfs_dio_private instead of bio Omar Sandoval
  2020-03-11 18:04   ` Josef Bacik
  2020-03-17 16:37   ` Nikolay Borisov
@ 2020-04-03 16:18   ` David Sterba
  2 siblings, 0 replies; 69+ messages in thread
From: David Sterba @ 2020-04-03 16:18 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Christoph Hellwig

On Mon, Mar 09, 2020 at 02:32:37PM -0700, Omar Sandoval wrote:
> -	file_offset -= dip->logical_offset;
> -	file_offset >>= inode->i_sb->s_blocksize_bits;
> -	csum_size = btrfs_super_csum_size(btrfs_sb(inode->i_sb)->super_copy);
> -	io_bio->csum = orig_io_bio->csum + csum_size * file_offset;

> -		ret = btrfs_lookup_and_bind_dio_csum(inode, dip, bio,
> -						     file_offset);
> -		if (ret)
> -			goto err;
> +		u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
> +		size_t csum_offset;
> +
> +		csum_offset = ((file_offset - dip->logical_offset) >>
> +			       inode->i_sb->s_blocksize_bits) * csum_size;

Please split the expression like it used to be in the original code
(above).

> +		btrfs_io_bio(bio)->csum = dip->sums + csum_offset;
>  	}
>  map:
>  	ret = btrfs_map_bio(fs_info, bio, 0);
> @@ -8047,13 +8018,25 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
>  				loff_t file_offset)
>  {
>  	struct btrfs_dio_private *dip = NULL;
> +	size_t dip_size;
>  	struct bio *bio = NULL;
>  	struct btrfs_io_bio *io_bio;
>  	bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
> +	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
>  
>  	bio = btrfs_bio_clone(dio_bio);
>  
> -	dip = kzalloc(sizeof(*dip), GFP_NOFS);
> +	dip_size = sizeof(*dip);
> +	if (!write && csum) {
> +		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +		u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
> +		size_t nblocks = (dio_bio->bi_iter.bi_size >>
> +				  inode->i_sb->s_blocksize_bits);

		nblocks = dio_bio->bi_iter.bi_size >> inode->i_sb->s_blocksize_bits;

This overflows 80 chars/line but is IMHO more readable than the split
line.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 13/15] btrfs: simplify direct I/O read repair
  2020-03-09 21:32 ` [PATCH 13/15] btrfs: simplify direct I/O read repair Omar Sandoval
  2020-03-11 18:16   ` Josef Bacik
@ 2020-04-03 16:40   ` David Sterba
  2020-04-03 18:05     ` Omar Sandoval
  1 sibling, 1 reply; 69+ messages in thread
From: David Sterba @ 2020-04-03 16:40 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team, Christoph Hellwig

On Mon, Mar 09, 2020 at 02:32:39PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Direct I/O read repair is an over-complicated mess. There is major code
> duplication between __btrfs_subio_endio_read() (checks checksums and
> handles I/O errors for files with checksums),
> __btrfs_correct_data_nocsum() (handles I/O errors for files without
> checksums), btrfs_retry_endio() (checks checksums and handles I/O errors
> for retries of files with checksums), and btrfs_retry_endio_nocsum()
> (handles I/O errors for retries of files without checksum). If it sounds
> like these should be one function, that's because they should.
> 
> After the previous commit getting rid of orig_bio, we can reuse the same
> endio callback for repair I/O and the original I/O, we just need to
> track the file offset and original iterator in the repair bio. We can
> also unify the handling of files with and without checksums and replace
> the atrocity that was probably the inspiration for "Go To Statement
> Considered Harmful" with normal loops. We also no longer have to wait
> for each repair I/O to complete one by one.

This patch looks like a revert of 8b110e393c5a ("Btrfs: implement repair
function when direct read fails"), that probably added the extra layer
you're removing.

So instead of the funny remarks, I'd rather see some analysis that the
issues in the original patch are not coming back.  Quoting from the
changelog:

- When we find the data is not right, we try to read the data from the other
  mirror.
- When the io on the mirror ends, we will insert the endio work into the
  dedicated btrfs workqueue, not common read endio workqueue, because the
  original endio work is still blocked in the btrfs endio workqueue, if we
  insert the endio work of the io on the mirror into that workqueue, deadlock
  would happen.
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
  mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.

It's not too detailed either, but what immediatelly looks suspicious is
the extra workqueue that was added to avoid deadlocks. That is now going
to be removed. This seems like a high level change even for such an old
code (2014) so that its effects are not affected by some other changes
in the dio code.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 13/15] btrfs: simplify direct I/O read repair
  2020-04-03 16:40   ` David Sterba
@ 2020-04-03 18:05     ` Omar Sandoval
  2020-04-16 10:08       ` David Sterba
  0 siblings, 1 reply; 69+ messages in thread
From: Omar Sandoval @ 2020-04-03 18:05 UTC (permalink / raw)
  To: dsterba, linux-btrfs, kernel-team, Christoph Hellwig

On Fri, Apr 03, 2020 at 06:40:51PM +0200, David Sterba wrote:
> On Mon, Mar 09, 2020 at 02:32:39PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Direct I/O read repair is an over-complicated mess. There is major code
> > duplication between __btrfs_subio_endio_read() (checks checksums and
> > handles I/O errors for files with checksums),
> > __btrfs_correct_data_nocsum() (handles I/O errors for files without
> > checksums), btrfs_retry_endio() (checks checksums and handles I/O errors
> > for retries of files with checksums), and btrfs_retry_endio_nocsum()
> > (handles I/O errors for retries of files without checksum). If it sounds
> > like these should be one function, that's because they should.
> > 
> > After the previous commit getting rid of orig_bio, we can reuse the same
> > endio callback for repair I/O and the original I/O, we just need to
> > track the file offset and original iterator in the repair bio. We can
> > also unify the handling of files with and without checksums and replace
> > the atrocity that was probably the inspiration for "Go To Statement
> > Considered Harmful" with normal loops. We also no longer have to wait
> > for each repair I/O to complete one by one.
> 
> This patch looks like a revert of 8b110e393c5a ("Btrfs: implement repair
> function when direct read fails"), that probably added the extra layer
> you're removing.
> 
> So instead of the funny remarks, I'd rather see some analysis that the
> issues in the original patch are not coming back.  Quoting from the
> changelog:
> 
> - When we find the data is not right, we try to read the data from the other
>   mirror.
> - When the io on the mirror ends, we will insert the endio work into the
>   dedicated btrfs workqueue, not common read endio workqueue, because the
>   original endio work is still blocked in the btrfs endio workqueue, if we
>   insert the endio work of the io on the mirror into that workqueue, deadlock
>   would happen.
> - After we get right data, we write it back to the corrupted mirror.
> - And if the data on the new mirror is still corrupted, we will try next
>   mirror until we read right data or all the mirrors are traversed.
> - After the above work, we set the uptodate flag according to the result.
> 
> It's not too detailed either, but what immediatelly looks suspicious is
> the extra workqueue that was added to avoid deadlocks. That is now going
> to be removed. This seems like a high level change even for such an old
> code (2014) so that its effects are not affected by some other changes
> in the dio code.

This patch doesn't touch the extra workqueue. The next patch that gets
rid of it has an explanation:

    This was originally added in commit 8b110e393c5a ("Btrfs: implement
    repair function when direct read fails") because the original bio waited
    for the repair bio to complete, so the repair I/O couldn't go through
    the same workqueue. As of the previous commit, this is no longer true,
    so this separate workqueue is unnecessary.

I can expand on that for v2. The deadlock addressed by the original code
is pretty much that while the worker is executing the original bio, it
will be blocked on the repair bio completing, and the repair bio will be
blocked on the worker finishing the original bio. With this rework, the
original bio doesn't block on the repair bio, so the worker becomes
available for the repair bio right away.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 13/15] btrfs: simplify direct I/O read repair
  2020-04-03 18:05     ` Omar Sandoval
@ 2020-04-16 10:08       ` David Sterba
  0 siblings, 0 replies; 69+ messages in thread
From: David Sterba @ 2020-04-16 10:08 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: dsterba, linux-btrfs, kernel-team, Christoph Hellwig

On Fri, Apr 03, 2020 at 11:05:23AM -0700, Omar Sandoval wrote:
> On Fri, Apr 03, 2020 at 06:40:51PM +0200, David Sterba wrote:
> > On Mon, Mar 09, 2020 at 02:32:39PM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > Direct I/O read repair is an over-complicated mess. There is major code
> > > duplication between __btrfs_subio_endio_read() (checks checksums and
> > > handles I/O errors for files with checksums),
> > > __btrfs_correct_data_nocsum() (handles I/O errors for files without
> > > checksums), btrfs_retry_endio() (checks checksums and handles I/O errors
> > > for retries of files with checksums), and btrfs_retry_endio_nocsum()
> > > (handles I/O errors for retries of files without checksum). If it sounds
> > > like these should be one function, that's because they should.
> > > 
> > > After the previous commit getting rid of orig_bio, we can reuse the same
> > > endio callback for repair I/O and the original I/O, we just need to
> > > track the file offset and original iterator in the repair bio. We can
> > > also unify the handling of files with and without checksums and replace
> > > the atrocity that was probably the inspiration for "Go To Statement
> > > Considered Harmful" with normal loops. We also no longer have to wait
> > > for each repair I/O to complete one by one.
> > 
> > This patch looks like a revert of 8b110e393c5a ("Btrfs: implement repair
> > function when direct read fails"), that probably added the extra layer
> > you're removing.
> > 
> > So instead of the funny remarks, I'd rather see some analysis that the
> > issues in the original patch are not coming back.  Quoting from the
> > changelog:
> > 
> > - When we find the data is not right, we try to read the data from the other
> >   mirror.
> > - When the io on the mirror ends, we will insert the endio work into the
> >   dedicated btrfs workqueue, not common read endio workqueue, because the
> >   original endio work is still blocked in the btrfs endio workqueue, if we
> >   insert the endio work of the io on the mirror into that workqueue, deadlock
> >   would happen.
> > - After we get right data, we write it back to the corrupted mirror.
> > - And if the data on the new mirror is still corrupted, we will try next
> >   mirror until we read right data or all the mirrors are traversed.
> > - After the above work, we set the uptodate flag according to the result.
> > 
> > It's not too detailed either, but what immediatelly looks suspicious is
> > the extra workqueue that was added to avoid deadlocks. That is now going
> > to be removed. This seems like a high level change even for such an old
> > code (2014) so that its effects are not affected by some other changes
> > in the dio code.
> 
> This patch doesn't touch the extra workqueue. The next patch that gets
> rid of it has an explanation:
> 
>     This was originally added in commit 8b110e393c5a ("Btrfs: implement
>     repair function when direct read fails") because the original bio waited
>     for the repair bio to complete, so the repair I/O couldn't go through
>     the same workqueue. As of the previous commit, this is no longer true,
>     so this separate workqueue is unnecessary.
> 
> I can expand on that for v2. The deadlock addressed by the original code
> is pretty much that while the worker is executing the original bio, it
> will be blocked on the repair bio completing, and the repair bio will be
> blocked on the worker finishing the original bio. With this rework, the
> original bio doesn't block on the repair bio, so the worker becomes
> available for the repair bio right away.

I haven't noticed that the next patch mentions the commmit 8b110e393c5a,
so for clarity it would be better to have more references in both. The
explanation above sounds good to me.

Please send v2, of the whole patchset, it's post rc1 so time for new
code. I'm testing the dio-iomap code, it's getting better so hopefully
both patchsets can be merged together.

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2020-04-16 10:09 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-09 21:32 [PATCH 00/15] btrfs: read repair/direct I/O improvements Omar Sandoval
2020-03-09 21:32 ` [PATCH 01/15] btrfs: fix error handling when submitting direct I/O bio Omar Sandoval
2020-03-11 17:54   ` Josef Bacik
2020-03-17 13:46   ` Nikolay Borisov
2020-03-09 21:32 ` [PATCH 02/15] btrfs: fix double __endio_write_update_ordered in direct I/O Omar Sandoval
2020-03-10 16:30   ` Christoph Hellwig
2020-03-11  9:03     ` Omar Sandoval
2020-03-17 14:04   ` Nikolay Borisov
2020-03-09 21:32 ` [PATCH 03/15] btrfs: look at full bi_io_vec for repair decision Omar Sandoval
2020-03-10 16:33   ` Christoph Hellwig
2020-03-11  9:07     ` Omar Sandoval
2020-03-16 10:48       ` Christoph Hellwig
2020-03-17 14:38   ` Nikolay Borisov
2020-03-09 21:32 ` [PATCH 04/15] btrfs: don't do repair validation for checksum errors Omar Sandoval
2020-03-11 17:55   ` Josef Bacik
2020-03-09 21:32 ` [PATCH 05/15] btrfs: clarify btrfs_lookup_bio_sums documentation Omar Sandoval
2020-03-11 17:56   ` Josef Bacik
2020-03-11 18:23     ` Omar Sandoval
2020-03-11 18:34       ` Josef Bacik
2020-03-17 14:38   ` Nikolay Borisov
2020-03-09 21:32 ` [PATCH 06/15] btrfs: rename __readpage_endio_check to check_data_csum Omar Sandoval
2020-03-10 14:46   ` Johannes Thumshirn
2020-03-11 17:57   ` Josef Bacik
2020-03-17 14:39   ` Nikolay Borisov
2020-03-09 21:32 ` [PATCH 07/15] btrfs: make btrfs_check_repairable() static Omar Sandoval
2020-03-10 14:53   ` Johannes Thumshirn
2020-03-11 17:58   ` Josef Bacik
2020-03-17 14:52   ` Nikolay Borisov
2020-03-09 21:32 ` [PATCH 08/15] btrfs: move btrfs_dio_private to inode.c Omar Sandoval
2020-03-10 14:56   ` Johannes Thumshirn
2020-03-11  8:48     ` Omar Sandoval
2020-03-17 14:53   ` Nikolay Borisov
2020-03-19 16:16   ` David Sterba
2020-03-09 21:32 ` [PATCH 09/15] btrfs: kill btrfs_dio_private->private Omar Sandoval
2020-03-11 17:59   ` Josef Bacik
2020-03-17 14:54   ` Nikolay Borisov
2020-03-09 21:32 ` [PATCH 10/15] btrfs: convert btrfs_dio_private->pending_bios to refcount_t Omar Sandoval
2020-03-11 18:00   ` Josef Bacik
2020-03-17 15:10   ` Nikolay Borisov
2020-03-09 21:32 ` [PATCH 11/15] btrfs: put direct I/O checksums in btrfs_dio_private instead of bio Omar Sandoval
2020-03-11 18:04   ` Josef Bacik
2020-03-17 16:37   ` Nikolay Borisov
2020-04-03 16:18   ` David Sterba
2020-03-09 21:32 ` [PATCH 12/15] btrfs: get rid of one layer of bios in direct I/O Omar Sandoval
2020-03-10 16:38   ` Christoph Hellwig
2020-03-11  9:19     ` Omar Sandoval
2020-03-16 10:49       ` Christoph Hellwig
2020-03-11 18:07   ` Josef Bacik
2020-03-17 16:48   ` Nikolay Borisov
2020-03-09 21:32 ` [PATCH 13/15] btrfs: simplify direct I/O read repair Omar Sandoval
2020-03-11 18:16   ` Josef Bacik
2020-04-03 16:40   ` David Sterba
2020-04-03 18:05     ` Omar Sandoval
2020-04-16 10:08       ` David Sterba
2020-03-09 21:32 ` [PATCH 14/15] btrfs: get rid of endio_repair_workers Omar Sandoval
2020-03-11 18:16   ` Josef Bacik
2020-03-09 21:32 ` [PATCH 15/15] btrfs: unify buffered and direct I/O read repair Omar Sandoval
2020-03-11 18:19   ` Josef Bacik
2020-03-19  8:53   ` Nikolay Borisov
2020-03-19  9:03     ` Christoph Hellwig
2020-03-20 21:28     ` Omar Sandoval
2020-03-10 16:39 ` [PATCH 00/15] btrfs: read repair/direct I/O improvements Christoph Hellwig
2020-03-11  9:22   ` Omar Sandoval
2020-03-18 16:33     ` Goldwyn Rodrigues
2020-03-19 14:08       ` David Sterba
2020-03-18 22:07 ` David Sterba
2020-03-20 21:29   ` Omar Sandoval
2020-03-20 14:10 ` Christoph Hellwig
2020-03-20 14:11   ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.