* [PATCH 00/12] Implement the data repair function for direct read
@ 2014-06-28 11:34 Miao Xie
  2014-06-28 11:34 ` [PATCH 01/12] Btrfs: fix put dio bio twice when we submit dio bio fail Miao Xie
                   ` (12 more replies)
  0 siblings, 13 replies; 49+ messages in thread
From: Miao Xie @ 2014-06-28 11:34 UTC (permalink / raw)
  To: linux-btrfs

This patchset implements the data repair function for direct read. It
works like the repair for buffered read (a rough sketch follows below):
- When we find the data is not right, we try to read the data from another
  mirror.
- After we get the right data, we write it back to the corrupted mirror.
- If the data on the new mirror is still corrupted, we try the next mirror
  until we read the right data or all the mirrors have been traversed.
- After the above work, we set the uptodate flag according to the result.

The difference is that a direct read may be split into several small IOs;
in order to get the number of the mirror on which the IO error happened,
we have to do the data check and repair in the end IO function of those
sub-IO requests.
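
For clarity, here is a minimal sketch of the retry-and-writeback loop
described above; the helper names are hypothetical stand-ins, not the
actual functions added by this patchset:

/* Hypothetical helpers standing in for the real btrfs read/write paths. */
extern int read_block_from_mirror(u64 start, u64 len, int mirror, void *buf);
extern int write_block_to_mirror(u64 start, u64 len, int mirror, void *buf);

static int repair_read_failure(u64 start, u64 len, int failed_mirror,
                               int num_copies, void *buf)
{
        int mirror;

        for (mirror = 1; mirror <= num_copies; mirror++) {
                if (mirror == failed_mirror)
                        continue;       /* skip the known-bad copy */
                /* a read error or csum mismatch means: try the next mirror */
                if (read_block_from_mirror(start, len, mirror, buf))
                        continue;
                /* good data found, write it back over the corrupted mirror */
                write_block_to_mirror(start, len, failed_mirror, buf);
                return 0;       /* the caller sets the uptodate flag */
        }
        return -EIO;    /* all mirrors traversed without good data */
}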

Besides that, this patchset also fixes some direct IO bugs.

This patchset can be pulled from the URL

  https://github.com/miaoxie/linux-btrfs.git for-Chris

Thanks
Miao
---
Miao Xie (12):
  Btrfs: fix put dio bio twice when we submit dio bio fail
  Btrfs: load checksum data once when submitting a direct read io
  Btrfs: cleanup similar code of the buffered data check and dio read
    data check
  Btrfs: do file data check by sub-bio's self
  Btrfs: fix missing error handler if submitting re-read bio fails
  Btrfs: Cleanup unused variable and argument of IO failure handlers
  Btrfs: split bio_readpage_error into several functions
  Btrfs: modify repair_io_failure and make it suit direct io
  Btrfs: modify clean_io_failure and make it suit direct io
  Btrfs: Set real mirror number for read operation on RAID0/5/6
  Btrfs: implement repair function when direct read fails
  Btrfs: cleanup the read failure record after write or when the inode
    is being freed

 fs/btrfs/btrfs_inode.h |  10 +-
 fs/btrfs/ctree.h       |   3 +-
 fs/btrfs/disk-io.c     |  43 +++--
 fs/btrfs/disk-io.h     |   1 +
 fs/btrfs/extent_io.c   | 241 ++++++++++++++++----------
 fs/btrfs/extent_io.h   |  38 ++++-
 fs/btrfs/file-item.c   |  14 +-
 fs/btrfs/inode.c       | 453 +++++++++++++++++++++++++++++++++++++++----------
 fs/btrfs/scrub.c       |   4 +-
 fs/btrfs/volumes.c     |   5 +
 fs/btrfs/volumes.h     |   5 +-
 11 files changed, 612 insertions(+), 205 deletions(-)

-- 
1.9.3



* [PATCH 01/12] Btrfs: fix put dio bio twice when we submit dio bio fail
  2014-06-28 11:34 [PATCH 00/12] Implement the data repair function for direct read Miao Xie
@ 2014-06-28 11:34 ` Miao Xie
  2014-06-28 11:34 ` [PATCH 02/12] Btrfs: load checksum data once when submitting a direct read io Miao Xie
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-06-28 11:34 UTC (permalink / raw)
  To: linux-btrfs

The caller of btrfs_submit_direct_hook() will put the original dio bio
when btrfs_submit_direct_hook() returns an error number, so we needn't
put the original bio in btrfs_submit_direct_hook() itself.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/inode.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3668048..a3f102f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7294,10 +7294,8 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	map_length = orig_bio->bi_iter.bi_size;
 	ret = btrfs_map_block(root->fs_info, rw, start_sector << 9,
 			      &map_length, NULL, 0);
-	if (ret) {
-		bio_put(orig_bio);
+	if (ret)
 		return -EIO;
-	}
 
 	if (map_length >= orig_bio->bi_iter.bi_size) {
 		bio = orig_bio;
@@ -7314,6 +7312,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	bio = btrfs_dio_bio_alloc(orig_bio->bi_bdev, start_sector, GFP_NOFS);
 	if (!bio)
 		return -ENOMEM;
+
 	bio->bi_private = dip;
 	bio->bi_end_io = btrfs_end_dio_bio;
 	atomic_inc(&dip->pending_bios);
-- 
1.9.3



* [PATCH 02/12] Btrfs: load checksum data once when submitting a direct read io
  2014-06-28 11:34 [PATCH 00/12] Implement the data repair function for direct read Miao Xie
  2014-06-28 11:34 ` [PATCH 01/12] Btrfs: fix put dio bio twice when we submit dio bio fail Miao Xie
@ 2014-06-28 11:34 ` Miao Xie
  2014-07-28 17:24   ` Filipe David Manana
  2014-06-28 11:34 ` [PATCH 03/12] Btrfs: cleanup similar code of the buffered data check and dio read data check Miao Xie
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Miao Xie @ 2014-06-28 11:34 UTC (permalink / raw)
  To: linux-btrfs

The current code loads the checksum data several times when we split a
whole direct read IO because of the limit of the raid stripe, which makes
us search the csum tree several times. In fact, it just wastes time and
makes contention on the csum tree root more serious. This patch fixes the
problem by loading all the checksum data at once.
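
The heart of the change, condensed from the diff below (error handling
elided): only the first sub-bio triggers a csum tree search, and that
search loads the checksums for the whole original bio in one pass.

        /*
         * We have loaded all the csum data we need when we submit
         * the first bio, so skip it for the later sub-bios.
         */
        if (dip->logical_offset != file_offset)
                goto map;

        /* Load all csum data at once. */
        ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
                                        file_offset);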

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/btrfs_inode.h |  1 -
 fs/btrfs/ctree.h       |  3 +--
 fs/btrfs/file-item.c   | 14 ++------------
 fs/btrfs/inode.c       | 40 ++++++++++++++++++++++------------------
 4 files changed, 25 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 4794923..7e9f53b 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -263,7 +263,6 @@ struct btrfs_dio_private {
 
 	/* dio_bio came from fs/direct-io.c */
 	struct bio *dio_bio;
-	u8 csum[0];
 };
 
 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index be91397..40e9938 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3739,8 +3739,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 			  struct bio *bio, u32 *dst);
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
-			      struct btrfs_dio_private *dip, struct bio *bio,
-			      u64 logical_offset);
+			      struct bio *bio, u64 logical_offset);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
 			     u64 objectid, u64 pos,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index f46cfe4..cf1b94f 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -299,19 +299,9 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 }
 
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
-			      struct btrfs_dio_private *dip, struct bio *bio,
-			      u64 offset)
+			      struct bio *bio, u64 offset)
 {
-	int len = (bio->bi_iter.bi_sector << 9) - dip->disk_bytenr;
-	u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
-	int ret;
-
-	len >>= inode->i_sb->s_blocksize_bits;
-	len *= csum_size;
-
-	ret = __btrfs_lookup_bio_sums(root, inode, bio, offset,
-				      (u32 *)(dip->csum + len), 1);
-	return ret;
+	return __btrfs_lookup_bio_sums(root, inode, bio, offset, NULL, 1);
 }
 
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a3f102f..969fb22 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7081,7 +7081,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	struct inode *inode = dip->inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct bio *dio_bio;
-	u32 *csums = (u32 *)dip->csum;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	u32 *csums = (u32 *)io_bio->csum;
 	u64 start;
 	int i;
 
@@ -7123,6 +7124,9 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	if (err)
 		clear_bit(BIO_UPTODATE, &dio_bio->bi_flags);
 	dio_end_io(dio_bio, err);
+
+	if (io_bio->end_io)
+		io_bio->end_io(io_bio, err);
 	bio_put(bio);
 }
 
@@ -7261,13 +7265,20 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 		ret = btrfs_csum_one_bio(root, inode, bio, file_offset, 1);
 		if (ret)
 			goto err;
-	} else if (!skip_sum) {
-		ret = btrfs_lookup_bio_sums_dio(root, inode, dip, bio,
+	} else {
+		/*
+		 * We have loaded all the csum data we need when we submit
+		 * the first bio, so skip it.
+		 */
+		if (dip->logical_offset != file_offset)
+			goto map;
+
+		/* Load all csum data at once. */
+		ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
 						file_offset);
 		if (ret)
 			goto err;
 	}
-
 map:
 	ret = btrfs_map_bio(root, rw, bio, 0, async_submit);
 err:
@@ -7288,7 +7299,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	u64 submit_len = 0;
 	u64 map_length;
 	int nr_pages = 0;
-	int ret = 0;
+	int ret;
 	int async_submit = 0;
 
 	map_length = orig_bio->bi_iter.bi_size;
@@ -7392,30 +7403,20 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_dio_private *dip;
 	struct bio *io_bio;
+	struct btrfs_io_bio *btrfs_bio;
 	int skip_sum;
-	int sum_len;
 	int write = rw & REQ_WRITE;
 	int ret = 0;
-	u16 csum_size;
 
 	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
 
-	io_bio = btrfs_bio_clone(dio_bio, GFP_NOFS);
+	io_bio = btrfs_bio_clone(dio_bio, GFP_NOFS | __GFP_ZERO);
 	if (!io_bio) {
 		ret = -ENOMEM;
 		goto free_ordered;
 	}
 
-	if (!skip_sum && !write) {
-		csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
-		sum_len = dio_bio->bi_iter.bi_size >>
-			inode->i_sb->s_blocksize_bits;
-		sum_len *= csum_size;
-	} else {
-		sum_len = 0;
-	}
-
-	dip = kmalloc(sizeof(*dip) + sum_len, GFP_NOFS);
+	dip = kmalloc(sizeof(*dip), GFP_NOFS);
 	if (!dip) {
 		ret = -ENOMEM;
 		goto free_io_bio;
@@ -7441,6 +7442,9 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 	if (!ret)
 		return;
 
+	btrfs_bio = btrfs_io_bio(io_bio);
+	if (btrfs_bio->end_io)
+		btrfs_bio->end_io(btrfs_bio, ret);
 free_io_bio:
 	bio_put(io_bio);
 
-- 
1.9.3



* [PATCH 03/12] Btrfs: cleanup similar code of the buffered data check and dio read data check
  2014-06-28 11:34 [PATCH 00/12] Implement the data repair function for direct read Miao Xie
  2014-06-28 11:34 ` [PATCH 01/12] Btrfs: fix put dio bio twice when we submit dio bio fail Miao Xie
  2014-06-28 11:34 ` [PATCH 02/12] Btrfs: load checksum data once when submitting a direct read io Miao Xie
@ 2014-06-28 11:34 ` Miao Xie
  2014-06-28 11:34 ` [PATCH 04/12] Btrfs: do file data check by sub-bio's self Miao Xie
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-06-28 11:34 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/inode.c | 102 +++++++++++++++++++++++++------------------------------
 1 file changed, 47 insertions(+), 55 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 969fb22..962defb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2830,6 +2830,40 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
 	return 0;
 }
 
+static int __readpage_endio_check(struct inode *inode,
+				  struct btrfs_io_bio *io_bio,
+				  int icsum, struct page *page,
+				  int pgoff, u64 start, size_t len)
+{
+	char *kaddr;
+	u32 csum_expected;
+	u32 csum = ~(u32)0;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+
+	csum_expected = *(((u32 *)io_bio->csum) + icsum);
+
+	kaddr = kmap_atomic(page);
+	csum = btrfs_csum_data(kaddr + pgoff, csum,  len);
+	btrfs_csum_final(csum, (char *)&csum);
+	if (csum != csum_expected)
+		goto zeroit;
+
+	kunmap_atomic(kaddr);
+	return 0;
+zeroit:
+	if (__ratelimit(&_rs))
+		btrfs_info(BTRFS_I(inode)->root->fs_info,
+			   "csum failed ino %llu off %llu csum %u expected csum %u",
+			   btrfs_ino(inode), start, csum, csum_expected);
+	memset(kaddr + pgoff, 1, len);
+	flush_dcache_page(page);
+	kunmap_atomic(kaddr);
+	if (csum_expected == 0)
+		return 0;
+	return -EIO;
+}
+
 /*
  * when reads are done, we need to check csums to verify the data is correct
  * if there's a match, we allow the bio to finish.  If not, the code in
@@ -2842,20 +2876,15 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	size_t offset = start - page_offset(page);
 	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
-	char *kaddr;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	u32 csum_expected;
-	u32 csum = ~(u32)0;
-	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
-	                              DEFAULT_RATELIMIT_BURST);
 
 	if (PageChecked(page)) {
 		ClearPageChecked(page);
-		goto good;
+		return 0;
 	}
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
-		goto good;
+		return 0;
 
 	if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
 	    test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
@@ -2865,28 +2894,8 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	}
 
 	phy_offset >>= inode->i_sb->s_blocksize_bits;
-	csum_expected = *(((u32 *)io_bio->csum) + phy_offset);
-
-	kaddr = kmap_atomic(page);
-	csum = btrfs_csum_data(kaddr + offset, csum,  end - start + 1);
-	btrfs_csum_final(csum, (char *)&csum);
-	if (csum != csum_expected)
-		goto zeroit;
-
-	kunmap_atomic(kaddr);
-good:
-	return 0;
-
-zeroit:
-	if (__ratelimit(&_rs))
-		btrfs_info(root->fs_info, "csum failed ino %llu off %llu csum %u expected csum %u",
-			btrfs_ino(page->mapping->host), start, csum, csum_expected);
-	memset(kaddr + offset, 1, end - start + 1);
-	flush_dcache_page(page);
-	kunmap_atomic(kaddr);
-	if (csum_expected == 0)
-		return 0;
-	return -EIO;
+	return __readpage_endio_check(inode, io_bio, phy_offset, page, offset,
+				      start, (size_t)(end - start + 1));
 }
 
 struct delayed_iput {
@@ -7079,41 +7088,24 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio_vec *bvec;
 	struct inode *inode = dip->inode;
-	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct bio *dio_bio;
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
-	u32 *csums = (u32 *)io_bio->csum;
 	u64 start;
+	int ret;
 	int i;
 
+	if (err || (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
+		goto skip_checksum;
+
 	start = dip->logical_offset;
 	bio_for_each_segment_all(bvec, bio, i) {
-		if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
-			struct page *page = bvec->bv_page;
-			char *kaddr;
-			u32 csum = ~(u32)0;
-			unsigned long flags;
-
-			local_irq_save(flags);
-			kaddr = kmap_atomic(page);
-			csum = btrfs_csum_data(kaddr + bvec->bv_offset,
-					       csum, bvec->bv_len);
-			btrfs_csum_final(csum, (char *)&csum);
-			kunmap_atomic(kaddr);
-			local_irq_restore(flags);
-
-			flush_dcache_page(bvec->bv_page);
-			if (csum != csums[i]) {
-				btrfs_err(root->fs_info, "csum failed ino %llu off %llu csum %u expected csum %u",
-					  btrfs_ino(inode), start, csum,
-					  csums[i]);
-				err = -EIO;
-			}
-		}
-
+		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
+					     0, start, bvec->bv_len);
+		if (ret)
+			err = -EIO;
 		start += bvec->bv_len;
 	}
-
+skip_checksum:
 	unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
 		      dip->logical_offset + dip->bytes - 1);
 	dio_bio = dip->dio_bio;
-- 
1.9.3



* [PATCH 04/12] Btrfs: do file data check by sub-bio's self
  2014-06-28 11:34 [PATCH 00/12] Implement the data repair function for direct read Miao Xie
                   ` (2 preceding siblings ...)
  2014-06-28 11:34 ` [PATCH 03/12] Btrfs: cleanup similar code of the buffered data check and dio read data check Miao Xie
@ 2014-06-28 11:34 ` Miao Xie
  2014-06-28 11:34 ` [PATCH 05/12] Btrfs: fix missing error handler if submitting re-read bio fails Miao Xie
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-06-28 11:34 UTC (permalink / raw)
  To: linux-btrfs

Direct IO splits the original bio into several sub-bios because of the
raid stripe limit, and the filesystem waits for all the sub-bios and then
runs the final end IO process.

But it is very hard to implement data repair when a dio read failure
happens, because in the final end IO function we don't know which mirror
the data was read from. So in order to implement data repair, we have to
move the file data check from the final end IO function into the sub-bio
end IO function, in which we can get the mirror number of the device we
accessed. This patch does that work as the first step of the direct IO
data repair implementation.
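
A subtle point, condensed from the btrfs_lookup_and_bind_dio_csum hunk in
the diff below: the checksums are still loaded only once, for the whole
original bio, so each sub-bio simply points its csum pointer at the right
offset inside the original bio's csum array (the u32 arithmetic assumes
4-byte checksums):

        /* number of blocks between the original bio and this sub-bio */
        file_offset -= dip->logical_offset;
        file_offset >>= inode->i_sb->s_blocksize_bits;
        /* one u32 checksum per block in the original bio's csum array */
        io_bio->csum = (u8 *)(((u32 *)orig_io_bio->csum) + file_offset);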

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/btrfs_inode.h |   9 +++++
 fs/btrfs/extent_io.c   |   2 +-
 fs/btrfs/inode.c       | 100 ++++++++++++++++++++++++++++++++++++-------------
 fs/btrfs/volumes.h     |   5 ++-
 4 files changed, 87 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 7e9f53b..e82014f 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -245,8 +245,11 @@ static inline int btrfs_inode_in_log(struct inode *inode, u64 generation)
 	return 0;
 }
 
+#define BTRFS_DIO_ORIG_BIO_SUBMITTED	0x1
+
 struct btrfs_dio_private {
 	struct inode *inode;
+	unsigned long flags;
 	u64 logical_offset;
 	u64 disk_bytenr;
 	u64 bytes;
@@ -263,6 +266,12 @@ struct btrfs_dio_private {
 
 	/* dio_bio came from fs/direct-io.c */
 	struct bio *dio_bio;
+
+	/*
+	 * The original bio may be splited to several sub-bios, this is
+	 * done during endio of sub-bios
+	 */
+	int (*subio_endio)(struct inode *, struct btrfs_io_bio *);
 };
 
 /*
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a389820..5ac43b4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2469,7 +2469,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 		struct inode *inode = page->mapping->host;
 
 		pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
-			 "mirror=%lu\n", (u64)bio->bi_iter.bi_sector, err,
+			 "mirror=%u\n", (u64)bio->bi_iter.bi_sector, err,
 			 io_bio->mirror_num);
 		tree = &BTRFS_I(inode)->io_tree;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 962defb..8e39645 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7083,29 +7083,40 @@ unlock_err:
 	return ret;
 }
 
-static void btrfs_endio_direct_read(struct bio *bio, int err)
+static int btrfs_subio_endio_read(struct inode *inode,
+				  struct btrfs_io_bio *io_bio)
 {
-	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio_vec *bvec;
-	struct inode *inode = dip->inode;
-	struct bio *dio_bio;
-	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
 	u64 start;
-	int ret;
 	int i;
+	int ret;
+	int err = 0;
 
-	if (err || (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto skip_checksum;
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
+		return 0;
 
-	start = dip->logical_offset;
-	bio_for_each_segment_all(bvec, bio, i) {
+	start = io_bio->logical;
+	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
 		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
 					     0, start, bvec->bv_len);
 		if (ret)
 			err = -EIO;
 		start += bvec->bv_len;
 	}
-skip_checksum:
+
+	return err;
+}
+
+static void btrfs_endio_direct_read(struct bio *bio, int err)
+{
+	struct btrfs_dio_private *dip = bio->bi_private;
+	struct inode *inode = dip->inode;
+	struct bio *dio_bio;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+
+	if (!err && (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED))
+		err = btrfs_subio_endio_read(inode, io_bio);
+
 	unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
 		      dip->logical_offset + dip->bytes - 1);
 	dio_bio = dip->dio_bio;
@@ -7182,6 +7193,7 @@ static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
 static void btrfs_end_dio_bio(struct bio *bio, int err)
 {
 	struct btrfs_dio_private *dip = bio->bi_private;
+	int ret;
 
 	if (err) {
 		btrfs_err(BTRFS_I(dip->inode)->root->fs_info,
@@ -7189,6 +7201,13 @@ static void btrfs_end_dio_bio(struct bio *bio, int err)
 		      btrfs_ino(dip->inode), bio->bi_rw,
 		      (unsigned long long)bio->bi_iter.bi_sector,
 		      bio->bi_iter.bi_size, err);
+	} else if (dip->subio_endio) {
+		ret = dip->subio_endio(dip->inode, btrfs_io_bio(bio));
+		if (ret)
+			err = ret;
+	}
+
+	if (err) {
 		dip->errors = 1;
 
 		/*
@@ -7219,6 +7238,38 @@ static struct bio *btrfs_dio_bio_alloc(struct block_device *bdev,
 	return btrfs_bio_alloc(bdev, first_sector, nr_vecs, gfp_flags);
 }
 
+static inline int btrfs_lookup_and_bind_dio_csum(struct btrfs_root *root,
+						 struct inode *inode,
+						 struct btrfs_dio_private *dip,
+						 struct bio *bio,
+						 u64 file_offset)
+{
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	struct btrfs_io_bio *orig_io_bio = btrfs_io_bio(dip->orig_bio);
+	int ret;
+
+	/*
+	 * We load all the csum data we need when we submit
+	 * the first bio to reduce the csum tree search and
+	 * contention.
+	 */
+	if (dip->logical_offset == file_offset) {
+		ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
+						file_offset);
+		if (ret)
+			return ret;
+	}
+
+	if (bio == dip->orig_bio)
+		return 0;
+
+	file_offset -= dip->logical_offset;
+	file_offset >>= inode->i_sb->s_blocksize_bits;
+	io_bio->csum = (u8 *)(((u32 *)orig_io_bio->csum) + file_offset);
+
+	return 0;
+}
+
 static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 					 int rw, u64 file_offset, int skip_sum,
 					 int async_submit)
@@ -7258,16 +7309,8 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 		if (ret)
 			goto err;
 	} else {
-		/*
-		 * We have loaded all the csum data we need when we submit
-		 * the first bio, so skip it.
-		 */
-		if (dip->logical_offset != file_offset)
-			goto map;
-
-		/* Load all csum data at once. */
-		ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
-						file_offset);
+		ret = btrfs_lookup_and_bind_dio_csum(root, inode, dip, bio,
+						     file_offset);
 		if (ret)
 			goto err;
 	}
@@ -7302,6 +7345,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 
 	if (map_length >= orig_bio->bi_iter.bi_size) {
 		bio = orig_bio;
+		dip->flags |= BTRFS_DIO_ORIG_BIO_SUBMITTED;
 		goto submit;
 	}
 
@@ -7318,6 +7362,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 
 	bio->bi_private = dip;
 	bio->bi_end_io = btrfs_end_dio_bio;
+	btrfs_io_bio(bio)->logical = file_offset;
 	atomic_inc(&dip->pending_bios);
 
 	while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
@@ -7352,6 +7397,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 				goto out_err;
 			bio->bi_private = dip;
 			bio->bi_end_io = btrfs_end_dio_bio;
+			btrfs_io_bio(bio)->logical = file_offset;
 
 			map_length = orig_bio->bi_iter.bi_size;
 			ret = btrfs_map_block(root->fs_info, rw,
@@ -7408,7 +7454,7 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 		goto free_ordered;
 	}
 
-	dip = kmalloc(sizeof(*dip), GFP_NOFS);
+	dip = kzalloc(sizeof(*dip), GFP_NOFS);
 	if (!dip) {
 		ret = -ENOMEM;
 		goto free_io_bio;
@@ -7420,21 +7466,23 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 	dip->bytes = dio_bio->bi_iter.bi_size;
 	dip->disk_bytenr = (u64)dio_bio->bi_iter.bi_sector << 9;
 	io_bio->bi_private = dip;
-	dip->errors = 0;
 	dip->orig_bio = io_bio;
 	dip->dio_bio = dio_bio;
 	atomic_set(&dip->pending_bios, 0);
+	btrfs_bio = btrfs_io_bio(io_bio);
+	btrfs_bio->logical = file_offset;
 
-	if (write)
+	if (write) {
 		io_bio->bi_end_io = btrfs_endio_direct_write;
-	else
+	} else {
 		io_bio->bi_end_io = btrfs_endio_direct_read;
+		dip->subio_endio = btrfs_subio_endio_read;
+	}
 
 	ret = btrfs_submit_direct_hook(rw, dip, skip_sum);
 	if (!ret)
 		return;
 
-	btrfs_bio = btrfs_io_bio(io_bio);
 	if (btrfs_bio->end_io)
 		btrfs_bio->end_io(btrfs_bio, ret);
 free_io_bio:
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 2aaa00c..cd31de1 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -167,8 +167,9 @@ struct btrfs_fs_devices {
  */
 typedef void (btrfs_io_bio_end_io_t) (struct btrfs_io_bio *bio, int err);
 struct btrfs_io_bio {
-	unsigned long mirror_num;
-	unsigned long stripe_index;
+	unsigned int mirror_num;
+	unsigned int stripe_index;
+	u64 logical;
 	u8 *csum;
 	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
 	u8 *csum_allocated;
-- 
1.9.3



* [PATCH 05/12] Btrfs: fix missing error handler if submitting re-read bio fails
  2014-06-28 11:34 [PATCH 00/12] Implement the data repair function for direct read Miao Xie
                   ` (3 preceding siblings ...)
  2014-06-28 11:34 ` [PATCH 04/12] Btrfs: do file data check by sub-bio's self Miao Xie
@ 2014-06-28 11:34 ` Miao Xie
  2014-06-28 11:34 ` [PATCH 06/12] Btrfs: Cleanup unused variable and argument of IO failure handlers Miao Xie
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-06-28 11:34 UTC (permalink / raw)
  To: linux-btrfs

We forgot to free the failure record and the bio when submitting the
re-read bio fails; fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/extent_io.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5ac43b4..c49c1e1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2345,6 +2345,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	ret = tree->ops->submit_bio_hook(inode, read_mode, bio,
 					 failrec->this_mirror,
 					 failrec->bio_flags, 0);
+	if (ret) {
+		free_io_failure(inode, failrec, 0);
+		bio_put(bio);
+	}
+
 	return ret;
 }
 
-- 
1.9.3



* [PATCH 06/12] Btrfs: Cleanup unused variable and argument of IO failure handlers
  2014-06-28 11:34 [PATCH 00/12] Implement the data repair function for direct read Miao Xie
                   ` (4 preceding siblings ...)
  2014-06-28 11:34 ` [PATCH 05/12] Btrfs: fix missing error handler if submitting re-read bio fails Miao Xie
@ 2014-06-28 11:34 ` Miao Xie
  2014-06-28 11:34 ` [PATCH 07/12] Btrfs: split bio_readpage_error into several functions Miao Xie
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-06-28 11:34 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/extent_io.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c49c1e1..b6b391e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1978,8 +1978,7 @@ struct io_failure_record {
 	int in_validation;
 };
 
-static int free_io_failure(struct inode *inode, struct io_failure_record *rec,
-				int did_repair)
+static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
 	int ret;
 	int err = 0;
@@ -2106,7 +2105,6 @@ static int clean_io_failure(u64 start, struct page *page)
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	struct extent_state *state;
 	int num_copies;
-	int did_repair = 0;
 	int ret;
 
 	private = 0;
@@ -2127,7 +2125,6 @@ static int clean_io_failure(u64 start, struct page *page)
 		/* there was no real error, just free the record */
 		pr_debug("clean_io_failure: freeing dummy error at %llu\n",
 			 failrec->start);
-		did_repair = 1;
 		goto out;
 	}
 	if (fs_info->sb->s_flags & MS_RDONLY)
@@ -2144,19 +2141,16 @@ static int clean_io_failure(u64 start, struct page *page)
 		num_copies = btrfs_num_copies(fs_info, failrec->logical,
 					      failrec->len);
 		if (num_copies > 1)  {
-			ret = repair_io_failure(fs_info, start, failrec->len,
-						failrec->logical, page,
-						failrec->failed_mirror);
-			did_repair = !ret;
+			repair_io_failure(fs_info, start, failrec->len,
+					  failrec->logical, page,
+					  failrec->failed_mirror);
 		}
-		ret = 0;
 	}
 
 out:
-	if (!ret)
-		ret = free_io_failure(inode, failrec, did_repair);
+	free_io_failure(inode, failrec);
 
-	return ret;
+	return 0;
 }
 
 /*
@@ -2266,7 +2260,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		 */
 		pr_debug("bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-		free_io_failure(inode, failrec, 0);
+		free_io_failure(inode, failrec);
 		return -EIO;
 	}
 
@@ -2309,13 +2303,13 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	if (failrec->this_mirror > num_copies) {
 		pr_debug("bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-		free_io_failure(inode, failrec, 0);
+		free_io_failure(inode, failrec);
 		return -EIO;
 	}
 
 	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
 	if (!bio) {
-		free_io_failure(inode, failrec, 0);
+		free_io_failure(inode, failrec);
 		return -EIO;
 	}
 	bio->bi_end_io = failed_bio->bi_end_io;
@@ -2346,7 +2340,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 					 failrec->this_mirror,
 					 failrec->bio_flags, 0);
 	if (ret) {
-		free_io_failure(inode, failrec, 0);
+		free_io_failure(inode, failrec);
 		bio_put(bio);
 	}
 
-- 
1.9.3



* [PATCH 07/12] Btrfs: split bio_readpage_error into several functions
  2014-06-28 11:34 [PATCH 00/12] Implement the data repair function for direct read Miao Xie
                   ` (5 preceding siblings ...)
  2014-06-28 11:34 ` [PATCH 06/12] Btrfs: Cleanup unused variable and argument of IO failure handlers Miao Xie
@ 2014-06-28 11:34 ` Miao Xie
  2014-06-28 11:34 ` [PATCH 08/12] Btrfs: modify repair_io_failure and make it suit direct io Miao Xie
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-06-28 11:34 UTC (permalink / raw)
  To: linux-btrfs

The data repair function for direct read will be implemented later, and
some code in bio_readpage_error will be reused, so split
bio_readpage_error into several functions that the direct read repair can
use later.
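
After the split, the buffered read error path strings the new helpers
together like this (condensed from the reworked bio_readpage_error in the
diff below); the direct read repair will reuse the same three helpers:

        ret = btrfs_get_io_failure_record(inode, start, end, &failrec);
        if (ret)
                return ret;

        ret = btrfs_check_repairable(inode, failed_bio, failrec,
                                     failed_mirror);
        if (!ret) {
                free_io_failure(inode, failrec);
                return -EIO;
        }

        phy_offset >>= inode->i_sb->s_blocksize_bits;
        bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
                                      start - page_offset(page),
                                      (int)phy_offset,
                                      failed_bio->bi_end_io);
        if (!bio) {
                free_io_failure(inode, failrec);
                return -EIO;
        }
        /* ... then submit bio against failrec->this_mirror as before */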

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/extent_io.c | 159 ++++++++++++++++++++++++++++++---------------------
 fs/btrfs/extent_io.h |  28 +++++++++
 2 files changed, 123 insertions(+), 64 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b6b391e..f44c210 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1959,25 +1959,6 @@ static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
 		SetPageUptodate(page);
 }
 
-/*
- * When IO fails, either with EIO or csum verification fails, we
- * try other mirrors that might have a good copy of the data.  This
- * io_failure_record is used to record state as we go through all the
- * mirrors.  If another mirror has good data, the page is set up to date
- * and things continue.  If a good mirror can't be found, the original
- * bio end_io callback is called to indicate things have failed.
- */
-struct io_failure_record {
-	struct page *page;
-	u64 start;
-	u64 len;
-	u64 logical;
-	unsigned long bio_flags;
-	int this_mirror;
-	int failed_mirror;
-	int in_validation;
-};
-
 static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
 	int ret;
@@ -2153,40 +2134,24 @@ out:
 	return 0;
 }
 
-/*
- * this is a generic handler for readpage errors (default
- * readpage_io_failed_hook). if other copies exist, read those and write back
- * good data to the failed position. does not investigate in remapping the
- * failed extent elsewhere, hoping the device will be smart enough to do this as
- * needed
- */
-
-static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
-			      struct page *page, u64 start, u64 end,
-			      int failed_mirror)
+int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
+				struct io_failure_record **failrec_ret)
 {
-	struct io_failure_record *failrec = NULL;
+	struct io_failure_record *failrec;
 	u64 private;
 	struct extent_map *em;
-	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
-	struct bio *bio;
-	struct btrfs_io_bio *btrfs_failed_bio;
-	struct btrfs_io_bio *btrfs_bio;
-	int num_copies;
 	int ret;
-	int read_mode;
 	u64 logical;
 
-	BUG_ON(failed_bio->bi_rw & REQ_WRITE);
-
 	ret = get_state_private(failure_tree, start, &private);
 	if (ret) {
 		failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
 		if (!failrec)
 			return -ENOMEM;
+
 		failrec->start = start;
 		failrec->len = end - start + 1;
 		failrec->this_mirror = 0;
@@ -2206,11 +2171,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 			em = NULL;
 		}
 		read_unlock(&em_tree->lock);
-
 		if (!em) {
 			kfree(failrec);
 			return -EIO;
 		}
+
 		logical = start - em->start;
 		logical = em->block_start + logical;
 		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
@@ -2219,8 +2184,10 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 			extent_set_compress_type(&failrec->bio_flags,
 						 em->compress_type);
 		}
-		pr_debug("bio_readpage_error: (new) logical=%llu, start=%llu, "
-			 "len=%llu\n", logical, start, failrec->len);
+
+		pr_debug("Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu\n",
+			 logical, start, failrec->len);
+
 		failrec->logical = logical;
 		free_extent_map(em);
 
@@ -2240,8 +2207,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		}
 	} else {
 		failrec = (struct io_failure_record *)(unsigned long)private;
-		pr_debug("bio_readpage_error: (found) logical=%llu, "
-			 "start=%llu, len=%llu, validation=%d\n",
+		pr_debug("Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu, validation=%d\n",
 			 failrec->logical, failrec->start, failrec->len,
 			 failrec->in_validation);
 		/*
@@ -2250,6 +2216,17 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		 * clean_io_failure() clean all those errors at once.
 		 */
 	}
+
+	*failrec_ret = failrec;
+
+	return 0;
+}
+
+int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
+			   struct io_failure_record *failrec, int failed_mirror)
+{
+	int num_copies;
+
 	num_copies = btrfs_num_copies(BTRFS_I(inode)->root->fs_info,
 				      failrec->logical, failrec->len);
 	if (num_copies == 1) {
@@ -2258,10 +2235,9 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		 * all the retry and error correction code that follows. no
 		 * matter what the error is, it is very likely to persist.
 		 */
-		pr_debug("bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
+		pr_debug("Check Repairable: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-		free_io_failure(inode, failrec);
-		return -EIO;
+		return 0;
 	}
 
 	/*
@@ -2281,7 +2257,6 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		BUG_ON(failrec->in_validation);
 		failrec->in_validation = 1;
 		failrec->this_mirror = failed_mirror;
-		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
 	} else {
 		/*
 		 * we're ready to fulfill a) and b) alongside. get a good copy
@@ -2297,22 +2272,32 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		failrec->this_mirror++;
 		if (failrec->this_mirror == failed_mirror)
 			failrec->this_mirror++;
-		read_mode = READ_SYNC;
 	}
 
 	if (failrec->this_mirror > num_copies) {
-		pr_debug("bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
+		pr_debug("Check Repairable: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-		free_io_failure(inode, failrec);
-		return -EIO;
+		return 0;
 	}
 
+	return 1;
+}
+
+
+struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
+				    struct io_failure_record *failrec,
+				    struct page *page, int pg_offset, int icsum,
+				    bio_end_io_t *endio_func)
+{
+	struct bio *bio;
+	struct btrfs_io_bio *btrfs_failed_bio;
+	struct btrfs_io_bio *btrfs_bio;
+
 	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
-	if (!bio) {
-		free_io_failure(inode, failrec);
-		return -EIO;
-	}
-	bio->bi_end_io = failed_bio->bi_end_io;
+	if (!bio)
+		return NULL;
+
+	bio->bi_end_io = endio_func;
 	bio->bi_iter.bi_sector = failrec->logical >> 9;
 	bio->bi_bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
 	bio->bi_iter.bi_size = 0;
@@ -2324,17 +2309,63 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 
 		btrfs_bio = btrfs_io_bio(bio);
 		btrfs_bio->csum = btrfs_bio->csum_inline;
-		phy_offset >>= inode->i_sb->s_blocksize_bits;
-		phy_offset *= csum_size;
-		memcpy(btrfs_bio->csum, btrfs_failed_bio->csum + phy_offset,
+		icsum *= csum_size;
+		memcpy(btrfs_bio->csum, btrfs_failed_bio->csum + icsum,
 		       csum_size);
 	}
 
-	bio_add_page(bio, page, failrec->len, start - page_offset(page));
+	bio_add_page(bio, page, failrec->len, pg_offset);
+
+	return bio;
+}
+
+/*
+ * this is a generic handler for readpage errors (default
+ * readpage_io_failed_hook). if other copies exist, read those and write back
+ * good data to the failed position. does not investigate in remapping the
+ * failed extent elsewhere, hoping the device will be smart enough to do this as
+ * needed
+ */
+
+static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
+			      struct page *page, u64 start, u64 end,
+			      int failed_mirror)
+{
+	struct io_failure_record *failrec;
+	struct inode *inode = page->mapping->host;
+	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+	struct bio *bio;
+	int read_mode;
+	int ret;
+
+	BUG_ON(failed_bio->bi_rw & REQ_WRITE);
+
+	ret = btrfs_get_io_failure_record(inode, start, end, &failrec);
+	if (ret)
+		return ret;
+
+	ret = btrfs_check_repairable(inode, failed_bio, failrec, failed_mirror);
+	if (!ret) {
+		free_io_failure(inode, failrec);
+		return -EIO;
+	}
+
+	if (failed_bio->bi_vcnt > 1)
+		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
+	else
+		read_mode = READ_SYNC;
+
+	phy_offset >>= inode->i_sb->s_blocksize_bits;
+	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
+				      start - page_offset(page),
+				      (int)phy_offset, failed_bio->bi_end_io);
+	if (!bio) {
+		free_io_failure(inode, failrec);
+		return -EIO;
+	}
 
-	pr_debug("bio_readpage_error: submitting new read[%#x] to "
-		 "this_mirror=%d, num_copies=%d, in_validation=%d\n", read_mode,
-		 failrec->this_mirror, num_copies, failrec->in_validation);
+	pr_debug("Repair Read Error: submitting new read[%#x] to this_mirror=%d, in_validation=%d\n",
+		 read_mode, failrec->this_mirror, failrec->in_validation);
 
 	ret = tree->ops->submit_bio_hook(inode, read_mode, bio,
 					 failrec->this_mirror,
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index ccc264e..4ce0547 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -347,6 +347,34 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num);
+
+/*
+ * When IO fails, either with EIO or csum verification fails, we
+ * try other mirrors that might have a good copy of the data.  This
+ * io_failure_record is used to record state as we go through all the
+ * mirrors.  If another mirror has good data, the page is set up to date
+ * and things continue.  If a good mirror can't be found, the original
+ * bio end_io callback is called to indicate things have failed.
+ */
+struct io_failure_record {
+	struct page *page;
+	u64 start;
+	u64 len;
+	u64 logical;
+	unsigned long bio_flags;
+	int this_mirror;
+	int failed_mirror;
+	int in_validation;
+};
+
+int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
+				struct io_failure_record **failrec_ret);
+int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
+			   struct io_failure_record *failrec, int fail_mirror);
+struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
+				    struct io_failure_record *failrec,
+				    struct page *page, int pg_offset, int icsum,
+				    bio_end_io_t *endio_func);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 noinline u64 find_lock_delalloc_range(struct inode *inode,
 				      struct extent_io_tree *tree,
-- 
1.9.3



* [PATCH 08/12] Btrfs: modify repair_io_failure and make it suit direct io
  2014-06-28 11:34 [PATCH 00/12] Implement the data repair function for direct read Miao Xie
                   ` (6 preceding siblings ...)
  2014-06-28 11:34 ` [PATCH 07/12] Btrfs: split bio_readpage_error into several functions Miao Xie
@ 2014-06-28 11:34 ` Miao Xie
  2014-06-28 11:34 ` [PATCH 09/12] Btrfs: modify clean_io_failure " Miao Xie
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-06-28 11:34 UTC (permalink / raw)
  To: linux-btrfs

The original repair_io_failure was only usable for buffered read, because
it got some filesystem data from the page structure; that is safe for
pages in the page cache. But when we do a direct read, the pages in the
bio are not in the page cache, so there is no filesystem data in the page
structure. In order to implement direct read data repair, we need to
modify repair_io_failure and pass all the filesystem data it needs through
function parameters.
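
The change boils down to the new signature (from the extent_io.h hunk
below): the offset within the page is now passed in explicitly instead of
being derived from page_offset(page), which only works for page cache
pages:

        int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
                              u64 length, u64 logical, struct page *page,
                              unsigned int pg_offset, int mirror_num);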

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/extent_io.c | 8 +++++---
 fs/btrfs/extent_io.h | 2 +-
 fs/btrfs/scrub.c     | 1 +
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f44c210..20c3f2e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1994,7 +1994,7 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
  */
 int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 			u64 length, u64 logical, struct page *page,
-			int mirror_num)
+			unsigned int pg_offset, int mirror_num)
 {
 	struct bio *bio;
 	struct btrfs_device *dev;
@@ -2033,7 +2033,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 		return -EIO;
 	}
 	bio->bi_bdev = dev->bdev;
-	bio_add_page(bio, page, length, start - page_offset(page));
+	bio_add_page(bio, page, length, pg_offset);
 
 	if (btrfsic_submit_bio_wait(WRITE_SYNC, bio)) {
 		/* try to remap that extent elsewhere? */
@@ -2064,7 +2064,8 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = extent_buffer_page(eb, i);
 		ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE,
-					start, p, mirror_num);
+					start, p, start - page_offset(p),
+					mirror_num);
 		if (ret)
 			break;
 		start += PAGE_CACHE_SIZE;
@@ -2124,6 +2125,7 @@ static int clean_io_failure(u64 start, struct page *page)
 		if (num_copies > 1)  {
 			repair_io_failure(fs_info, start, failrec->len,
 					  failrec->logical, page,
+					  start - page_offset(page),
 					  failrec->failed_mirror);
 		}
 	}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 4ce0547..4366453 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -343,7 +343,7 @@ struct btrfs_fs_info;
 
 int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 			u64 length, u64 logical, struct page *page,
-			int mirror_num);
+			unsigned int pg_offset, int mirror_num);
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b6d198f..0609245 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -684,6 +684,7 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
 		fs_info = BTRFS_I(inode)->root->fs_info;
 		ret = repair_io_failure(fs_info, offset, PAGE_SIZE,
 					fixup->logical, page,
+					offset - page_offset(page),
 					fixup->mirror_num);
 		unlock_page(page);
 		corrected = !ret;
-- 
1.9.3



* [PATCH 09/12] Btrfs: modify clean_io_failure and make it suit direct io
  2014-06-28 11:34 [PATCH 00/12] Implement the data repair function for direct read Miao Xie
                   ` (7 preceding siblings ...)
  2014-06-28 11:34 ` [PATCH 08/12] Btrfs: modify repair_io_failure and make it suit direct io Miao Xie
@ 2014-06-28 11:34 ` Miao Xie
  2014-06-28 11:35 ` [PATCH 10/12] Btrfs: Set real mirror number for read operation on RAID0/5/6 Miao Xie
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-06-28 11:34 UTC (permalink / raw)
  To: linux-btrfs

We could not use clean_io_failure in the direct IO path because it got
the filesystem information from the page structure, but the pages in a
direct IO bio do not carry that information. So we need to modify it and
pass all the information it needs through parameters.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/extent_io.c | 31 +++++++++++++++----------------
 fs/btrfs/extent_io.h |  6 +++---
 fs/btrfs/scrub.c     |  3 +--
 3 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 20c3f2e..6e30a8c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1992,10 +1992,10 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
  * currently, there can be no more than two copies of every data bit. thus,
  * exactly one rewrite is required.
  */
-int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
-			u64 length, u64 logical, struct page *page,
-			unsigned int pg_offset, int mirror_num)
+int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical,
+		      struct page *page, unsigned int pg_offset, int mirror_num)
 {
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	struct bio *bio;
 	struct btrfs_device *dev;
 	u64 map_length = 0;
@@ -2043,10 +2043,9 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 	}
 
 	printk_ratelimited_in_rcu(KERN_INFO
-			"BTRFS: read error corrected: ino %lu off %llu "
-		    "(dev %s sector %llu)\n", page->mapping->host->i_ino,
-		    start, rcu_str_deref(dev->name), sector);
-
+				  "BTRFS: read error corrected: ino %llu off %llu (dev %s sector %llu)\n",
+				  btrfs_ino(inode), start,
+				  rcu_str_deref(dev->name), sector);
 	bio_put(bio);
 	return 0;
 }
@@ -2063,9 +2062,10 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = extent_buffer_page(eb, i);
-		ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE,
-					start, p, start - page_offset(p),
-					mirror_num);
+
+		ret = repair_io_failure(root->fs_info->btree_inode, start,
+					PAGE_CACHE_SIZE, start, p,
+					start - page_offset(p), mirror_num);
 		if (ret)
 			break;
 		start += PAGE_CACHE_SIZE;
@@ -2078,12 +2078,12 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
  * each time an IO finishes, we do a fast check in the IO failure tree
  * to see if we need to process or clean up an io_failure_record
  */
-static int clean_io_failure(u64 start, struct page *page)
+static int clean_io_failure(struct inode *inode, u64 start,
+			    struct page *page, unsigned int pg_offset)
 {
 	u64 private;
 	u64 private_failure;
 	struct io_failure_record *failrec;
-	struct inode *inode = page->mapping->host;
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	struct extent_state *state;
 	int num_copies;
@@ -2123,10 +2123,9 @@ static int clean_io_failure(u64 start, struct page *page)
 		num_copies = btrfs_num_copies(fs_info, failrec->logical,
 					      failrec->len);
 		if (num_copies > 1)  {
-			repair_io_failure(fs_info, start, failrec->len,
+			repair_io_failure(inode, start, failrec->len,
 					  failrec->logical, page,
-					  start - page_offset(page),
-					  failrec->failed_mirror);
+					  pg_offset, failrec->failed_mirror);
 		}
 	}
 
@@ -2535,7 +2534,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 			if (ret)
 				uptodate = 0;
 			else
-				clean_io_failure(start, page);
+				clean_io_failure(inode, start, page, 0);
 		}
 
 		if (likely(uptodate))
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 4366453..7662eaa 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -341,9 +341,9 @@ struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask);
 
 struct btrfs_fs_info;
 
-int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
-			u64 length, u64 logical, struct page *page,
-			unsigned int pg_offset, int mirror_num);
+int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical,
+		      struct page *page, unsigned int pg_offset,
+		      int mirror_num);
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 0609245..149a528 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -681,8 +681,7 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
 			ret = -EIO;
 			goto out;
 		}
-		fs_info = BTRFS_I(inode)->root->fs_info;
-		ret = repair_io_failure(fs_info, offset, PAGE_SIZE,
+		ret = repair_io_failure(inode, offset, PAGE_SIZE,
 					fixup->logical, page,
 					offset - page_offset(page),
 					fixup->mirror_num);
-- 
1.9.3



* [PATCH 10/12] Btrfs: Set real mirror number for read operation on RAID0/5/6
  2014-06-28 11:34 [PATCH 00/12] Implement the data repair function for direct read Miao Xie
                   ` (8 preceding siblings ...)
  2014-06-28 11:34 ` [PATCH 09/12] Btrfs: modify clean_io_failure " Miao Xie
@ 2014-06-28 11:35 ` Miao Xie
  2014-06-28 11:35 ` [PATCH 11/12] Btrfs: implement repair function when direct read fails Miao Xie
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-06-28 11:35 UTC (permalink / raw)
  To: linux-btrfs

We need the real mirror number for RAID0/5/6 when reading data, or else,
if a read error happens, we would pass 0 as the number of the mirror on
which the IO error happened. That is wrong and would cause the filesystem
to read the data from the corrupted mirror again.
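
The fix itself is tiny (condensed from the diff below): for plain read
requests on these profiles, report mirror 1 instead of leaving mirror_num
at 0:

        if (!(rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS)))
                mirror_num = 1;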

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/volumes.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c83b242..95828b0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4935,6 +4935,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			num_stripes = min_t(u64, map->num_stripes,
 					    stripe_nr_end - stripe_nr_orig);
 		stripe_index = do_div(stripe_nr, map->num_stripes);
+		if (!(rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS)))
+			mirror_num = 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
 		if (rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS))
 			num_stripes = map->num_stripes;
@@ -5038,6 +5040,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			/* We distribute the parity blocks across stripes */
 			tmp = stripe_nr + stripe_index;
 			stripe_index = do_div(tmp, map->num_stripes);
+			if (!(rw & (REQ_WRITE | REQ_DISCARD |
+				    REQ_GET_READ_MIRRORS)) && mirror_num <= 1)
+				mirror_num = 1;
 		}
 	} else {
 		/*
-- 
1.9.3



* [PATCH 11/12] Btrfs: implement repair function when direct read fails
  2014-06-28 11:34 [PATCH 00/12] Implement the data repair function for direct read Miao Xie
                   ` (9 preceding siblings ...)
  2014-06-28 11:35 ` [PATCH 10/12] Btrfs: Set real mirror number for read operation on RAID0/5/6 Miao Xie
@ 2014-06-28 11:35 ` Miao Xie
  2014-06-28 11:35 ` [PATCH 12/12] Btrfs: cleanup the read failure record after write or when the inode is being freed Miao Xie
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
  12 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-06-28 11:35 UTC (permalink / raw)
  To: linux-btrfs

This patch implements the data repair function for when a direct read
fails.

The details of the implementation are:
- When we find the data is not right, we try to read the data from another
  mirror.
- After we get the right data, we write it back to the corrupted mirror.
- If the data on the new mirror is still corrupted, we try the next mirror
  until we read the right data or all the mirrors have been traversed.
- After the above work, we set the uptodate flag according to the result.
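
One implementation detail visible in the disk-io.c hunks below: repair
bios get their own end IO type, BTRFS_WQ_ENDIO_DIO_REPAIR, and their
completion work is queued on the kernel's generic system_wq instead of
btrfs's own endio workers:

        if (unlikely(end_io_wq->metadata == BTRFS_WQ_ENDIO_DIO_REPAIR))
                queue_work(system_wq, &end_io_wq->work.normal_work);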

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/btrfs_inode.h |   2 +-
 fs/btrfs/disk-io.c     |  43 ++++++--
 fs/btrfs/disk-io.h     |   1 +
 fs/btrfs/extent_io.c   |  12 ++-
 fs/btrfs/extent_io.h   |   5 +-
 fs/btrfs/inode.c       | 276 +++++++++++++++++++++++++++++++++++++++++++++----
 6 files changed, 300 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index e82014f..e17c7ae 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -271,7 +271,7 @@ struct btrfs_dio_private {
 	 * The original bio may be splited to several sub-bios, this is
 	 * done during endio of sub-bios
 	 */
-	int (*subio_endio)(struct inode *, struct btrfs_io_bio *);
+	int (*subio_endio)(struct inode *, struct btrfs_io_bio *, int);
 };
 
 /*
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8bb4aa1..a55e594 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -690,6 +690,27 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	return -EIO;	/* we fixed nothing */
 }
 
+static inline void do_end_workqueue_fn(struct end_io_wq *end_io_wq)
+{
+	struct bio *bio = end_io_wq->bio;
+
+	bio->bi_private = end_io_wq->private;
+	bio->bi_end_io = end_io_wq->end_io;
+	bio_endio_nodec(bio, end_io_wq->error);
+	kfree(end_io_wq);
+}
+
+static void dio_end_workqueue_fn(struct work_struct *work)
+{
+	struct btrfs_work *bwork;
+	struct end_io_wq *end_io_wq;
+
+	bwork = container_of(work, struct btrfs_work, normal_work);
+	end_io_wq = container_of(bwork, struct end_io_wq, work);
+
+	do_end_workqueue_fn(end_io_wq);
+}
+
 static void end_workqueue_bio(struct bio *bio, int err)
 {
 	struct end_io_wq *end_io_wq = bio->bi_private;
@@ -697,7 +718,12 @@ static void end_workqueue_bio(struct bio *bio, int err)
 
 	fs_info = end_io_wq->info;
 	end_io_wq->error = err;
-	btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, NULL);
+
+	if (likely(end_io_wq->metadata != BTRFS_WQ_ENDIO_DIO_REPAIR))
+		btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL,
+				NULL);
+	else
+		INIT_WORK(&end_io_wq->work.normal_work, dio_end_workqueue_fn);
 
 	if (bio->bi_rw & REQ_WRITE) {
 		if (end_io_wq->metadata == BTRFS_WQ_ENDIO_METADATA)
@@ -713,7 +739,9 @@ static void end_workqueue_bio(struct bio *bio, int err)
 			btrfs_queue_work(fs_info->endio_write_workers,
 					 &end_io_wq->work);
 	} else {
-		if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56)
+		if (unlikely(end_io_wq->metadata == BTRFS_WQ_ENDIO_DIO_REPAIR))
+			queue_work(system_wq, &end_io_wq->work.normal_work);
+		else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56)
 			btrfs_queue_work(fs_info->endio_raid56_workers,
 					 &end_io_wq->work);
 		else if (end_io_wq->metadata)
@@ -737,6 +765,7 @@ int btrfs_bio_wq_end_io(struct btrfs_fs_info *info, struct bio *bio,
 			int metadata)
 {
 	struct end_io_wq *end_io_wq;
+
 	end_io_wq = kmalloc(sizeof(*end_io_wq), GFP_NOFS);
 	if (!end_io_wq)
 		return -ENOMEM;
@@ -1729,18 +1758,10 @@ static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
  */
 static void end_workqueue_fn(struct btrfs_work *work)
 {
-	struct bio *bio;
 	struct end_io_wq *end_io_wq;
-	int error;
 
 	end_io_wq = container_of(work, struct end_io_wq, work);
-	bio = end_io_wq->bio;
-
-	error = end_io_wq->error;
-	bio->bi_private = end_io_wq->private;
-	bio->bi_end_io = end_io_wq->end_io;
-	kfree(end_io_wq);
-	bio_endio_nodec(bio, error);
+	do_end_workqueue_fn(end_io_wq);
 }
 
 static int cleaner_kthread(void *arg)
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 23ce3ce..4fde7a0 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -30,6 +30,7 @@ enum {
 	BTRFS_WQ_ENDIO_METADATA = 1,
 	BTRFS_WQ_ENDIO_FREE_SPACE = 2,
 	BTRFS_WQ_ENDIO_RAID56 = 3,
+	BTRFS_WQ_ENDIO_DIO_REPAIR = 4,
 };
 
 static inline u64 btrfs_sb_offset(int mirror)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6e30a8c..edbc3b6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1959,7 +1959,7 @@ static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
 		SetPageUptodate(page);
 }
 
-static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
+int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
 	int ret;
 	int err = 0;
@@ -2078,8 +2078,8 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
  * each time an IO finishes, we do a fast check in the IO failure tree
  * to see if we need to process or clean up an io_failure_record
  */
-static int clean_io_failure(struct inode *inode, u64 start,
-			    struct page *page, unsigned int pg_offset)
+int clean_io_failure(struct inode *inode, u64 start, struct page *page,
+		     unsigned int pg_offset)
 {
 	u64 private;
 	u64 private_failure;
@@ -2288,7 +2288,7 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
 				    struct io_failure_record *failrec,
 				    struct page *page, int pg_offset, int icsum,
-				    bio_end_io_t *endio_func)
+				    bio_end_io_t *endio_func, void *data)
 {
 	struct bio *bio;
 	struct btrfs_io_bio *btrfs_failed_bio;
@@ -2302,6 +2302,7 @@ struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
 	bio->bi_iter.bi_sector = failrec->logical >> 9;
 	bio->bi_bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
 	bio->bi_iter.bi_size = 0;
+	bio->bi_private = data;
 
 	btrfs_failed_bio = btrfs_io_bio(failed_bio);
 	if (btrfs_failed_bio->csum) {
@@ -2359,7 +2360,8 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	phy_offset >>= inode->i_sb->s_blocksize_bits;
 	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
 				      start - page_offset(page),
-				      (int)phy_offset, failed_bio->bi_end_io);
+				      (int)phy_offset, failed_bio->bi_end_io,
+				      NULL);
 	if (!bio) {
 		free_io_failure(inode, failrec);
 		return -EIO;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 7662eaa..b23c7c2 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -344,6 +344,8 @@ struct btrfs_fs_info;
 int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical,
 		      struct page *page, unsigned int pg_offset,
 		      int mirror_num);
+int clean_io_failure(struct inode *inode, u64 start, struct page *page,
+		     unsigned int pg_offset);
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num);
@@ -374,7 +376,8 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
 				    struct io_failure_record *failrec,
 				    struct page *page, int pg_offset, int icsum,
-				    bio_end_io_t *endio_func);
+				    bio_end_io_t *endio_func, void *data);
+int free_io_failure(struct inode *inode, struct io_failure_record *rec);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 noinline u64 find_lock_delalloc_range(struct inode *inode,
 				      struct extent_io_tree *tree,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8e39645..d6b7048 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7083,30 +7083,267 @@ unlock_err:
 	return ret;
 }
 
-static int btrfs_subio_endio_read(struct inode *inode,
-				  struct btrfs_io_bio *io_bio)
+static inline int submit_dio_repair_bio(struct inode *inode, struct bio *bio,
+					int rw, int mirror_num)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	int ret;
+
+	BUG_ON(rw & REQ_WRITE);
+
+	bio_get(bio);
+
+	ret = btrfs_bio_wq_end_io(root->fs_info, bio,
+				  BTRFS_WQ_ENDIO_DIO_REPAIR);
+	if (ret)
+		goto err;
+
+	ret = btrfs_map_bio(root, rw, bio, mirror_num, 0);
+err:
+	bio_put(bio);
+	return ret;
+}
+
+static int btrfs_check_dio_repairable(struct inode *inode,
+				      struct bio *failed_bio,
+				      struct io_failure_record *failrec,
+				      int failed_mirror)
+{
+	int num_copies;
+
+	num_copies = btrfs_num_copies(BTRFS_I(inode)->root->fs_info,
+				      failrec->logical, failrec->len);
+	if (num_copies == 1) {
+		/*
+		 * we only have a single copy of the data, so don't bother with
+		 * all the retry and error correction code that follows. no
+		 * matter what the error is, it is very likely to persist.
+		 */
+		pr_debug("Check DIO Repairable: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
+			 num_copies, failrec->this_mirror, failed_mirror);
+		return 0;
+	}
+
+	failrec->failed_mirror = failed_mirror;
+	failrec->this_mirror++;
+	if (failrec->this_mirror == failed_mirror)
+		failrec->this_mirror++;
+
+	if (failrec->this_mirror > num_copies) {
+		pr_debug("Check DIO Repairable: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
+			 num_copies, failrec->this_mirror, failed_mirror);
+		return 0;
+	}
+
+	return 1;
+}
+
+static int dio_read_error(struct inode *inode, struct bio *failed_bio,
+			  struct page *page, u64 start, u64 end,
+			  int failed_mirror, bio_end_io_t *repair_endio,
+			  void *repair_arg)
+{
+	struct io_failure_record *failrec;
+	struct bio *bio;
+	int isector;
+	int read_mode;
+	int ret;
+
+	BUG_ON(failed_bio->bi_rw & REQ_WRITE);
+
+	ret = btrfs_get_io_failure_record(inode, start, end, &failrec);
+	if (ret)
+		return ret;
+
+	ret = btrfs_check_dio_repairable(inode, failed_bio, failrec,
+					 failed_mirror);
+	if (!ret) {
+		free_io_failure(inode, failrec);
+		return -EIO;
+	}
+
+	if (failed_bio->bi_vcnt > 1)
+		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
+	else
+		read_mode = READ_SYNC;
+
+	isector = start - btrfs_io_bio(failed_bio)->logical;
+	isector >>= inode->i_sb->s_blocksize_bits;
+	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
+				      0, isector, repair_endio, repair_arg);
+	if (!bio) {
+		free_io_failure(inode, failrec);
+		return -EIO;
+	}
+
+	btrfs_debug(BTRFS_I(inode)->root->fs_info,
+		    "Repair DIO Read Error: submitting new dio read[%#x] to this_mirror=%d, in_validation=%d\n",
+		    read_mode, failrec->this_mirror, failrec->in_validation);
+
+	ret = submit_dio_repair_bio(inode, bio, read_mode,
+				    failrec->this_mirror);
+	if (ret) {
+		free_io_failure(inode, failrec);
+		bio_put(bio);
+	}
+
+	return ret;
+}
+
+struct btrfs_retry_complete {
+	struct completion done;
+	struct inode *inode;
+	u64 start;
+	int uptodate;
+};
+
+static void btrfs_retry_endio_nocsum(struct bio *bio, int err)
+{
+	struct btrfs_retry_complete *done = bio->bi_private;
+	struct bio_vec *bvec;
+	int i;
+
+	if (err)
+		goto end;
+
+	done->uptodate = 1;
+	bio_for_each_segment_all(bvec, bio, i)
+		clean_io_failure(done->inode, done->start, bvec->bv_page, 0);
+end:
+	complete(&done->done);
+	bio_put(bio);
+}
+
+static int __btrfs_correct_data_nocsum(struct inode *inode,
+				       struct btrfs_io_bio *io_bio)
 {
 	struct bio_vec *bvec;
+	struct btrfs_retry_complete done;
 	u64 start;
 	int i;
 	int ret;
-	int err = 0;
 
-	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
-		return 0;
+	start = io_bio->logical;
+	done.inode = inode;
+
+	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
+try_again:
+		done.uptodate = 0;
+		done.start = start;
+		init_completion(&done.done);
+
+		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
+				     start + bvec->bv_len - 1,
+				     io_bio->mirror_num,
+				     btrfs_retry_endio_nocsum, &done);
+		if (ret)
+			return ret;
+
+		wait_for_completion(&done.done);
+
+		if (!done.uptodate) {
+			/* We might have another mirror, so try again */
+			goto try_again;
+		}
+
+		start += bvec->bv_len;
+	}
+
+	return 0;
+}
+
+static void btrfs_retry_endio(struct bio *bio, int err)
+{
+	struct btrfs_retry_complete *done = bio->bi_private;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	struct bio_vec *bvec;
+	int uptodate;
+	int ret;
+	int i;
+
+	if (err)
+		goto end;
+
+	uptodate = 1;
+	bio_for_each_segment_all(bvec, bio, i) {
+		ret = __readpage_endio_check(done->inode, io_bio, i,
+					     bvec->bv_page, 0,
+					     done->start, bvec->bv_len);
+		if (!ret)
+			clean_io_failure(done->inode, done->start,
+					 bvec->bv_page, 0);
+		else
+			uptodate = 0;
+	}
+
+	done->uptodate = uptodate;
+end:
+	complete(&done->done);
+	bio_put(bio);
+}
 
+static int __btrfs_subio_endio_read(struct inode *inode,
+				    struct btrfs_io_bio *io_bio, int err)
+{
+	struct bio_vec *bvec;
+	struct btrfs_retry_complete done;
+	u64 start;
+	u64 offset = 0;
+	int i;
+	int ret;
+
+	err = 0;
 	start = io_bio->logical;
+	done.inode = inode;
+
 	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
 		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
 					     0, start, bvec->bv_len);
-		if (ret)
-			err = -EIO;
+		if (likely(!ret))
+			goto next;
+try_again:
+		done.uptodate = 0;
+		done.start = start;
+		init_completion(&done.done);
+
+		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
+				     start + bvec->bv_len - 1,
+				     io_bio->mirror_num,
+				     btrfs_retry_endio, &done);
+		if (ret) {
+			err = ret;
+			goto next;
+		}
+
+		wait_for_completion(&done.done);
+
+		if (!done.uptodate) {
+			/* We might have another mirror, so try again */
+			goto try_again;
+		}
+next:
+		offset += bvec->bv_len;
 		start += bvec->bv_len;
 	}
 
 	return err;
 }
 
+static int btrfs_subio_endio_read(struct inode *inode,
+				  struct btrfs_io_bio *io_bio, int err)
+{
+	bool skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
+
+	if (skip_csum) {
+		if (unlikely(err))
+			return __btrfs_correct_data_nocsum(inode, io_bio);
+		else
+			return 0;
+	} else {
+		return __btrfs_subio_endio_read(inode, io_bio, err);
+	}
+}
+
 static void btrfs_endio_direct_read(struct bio *bio, int err)
 {
 	struct btrfs_dio_private *dip = bio->bi_private;
@@ -7114,8 +7351,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	struct bio *dio_bio;
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
 
-	if (!err && (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED))
-		err = btrfs_subio_endio_read(inode, io_bio);
+	if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED)
+		err = btrfs_subio_endio_read(inode, io_bio, err);
 
 	unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
 		      dip->logical_offset + dip->bytes - 1);
@@ -7193,19 +7430,16 @@ static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
 static void btrfs_end_dio_bio(struct bio *bio, int err)
 {
 	struct btrfs_dio_private *dip = bio->bi_private;
-	int ret;
 
-	if (err) {
-		btrfs_err(BTRFS_I(dip->inode)->root->fs_info,
-			  "direct IO failed ino %llu rw %lu sector %#Lx len %u err no %d",
-		      btrfs_ino(dip->inode), bio->bi_rw,
-		      (unsigned long long)bio->bi_iter.bi_sector,
-		      bio->bi_iter.bi_size, err);
-	} else if (dip->subio_endio) {
-		ret = dip->subio_endio(dip->inode, btrfs_io_bio(bio));
-		if (ret)
-			err = ret;
-	}
+	if (err)
+		btrfs_warn(BTRFS_I(dip->inode)->root->fs_info,
+			   "direct IO failed ino %llu rw %lu sector %#Lx len %u err no %d",
+			   btrfs_ino(dip->inode), bio->bi_rw,
+			   (unsigned long long)bio->bi_iter.bi_sector,
+			   bio->bi_iter.bi_size, err);
+
+	if (dip->subio_endio)
+		err = dip->subio_endio(dip->inode, btrfs_io_bio(bio), err);
 
 	if (err) {
 		dip->errors = 1;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 12/12] Btrfs: cleanup the read failure record after write or when the inode is freeing
  2014-06-28 11:34 [PATCH 00/12] Implement the data repair function for direct read Miao Xie
                   ` (10 preceding siblings ...)
  2014-06-28 11:35 ` [PATCH 11/12] Btrfs: implement repair function when direct read fails Miao Xie
@ 2014-06-28 11:35 ` Miao Xie
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
  12 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-06-28 11:35 UTC (permalink / raw)
  To: linux-btrfs

After the data is written successfully, we should clean up the read failure
records in that range, because:
- If data COW is set for the file, the range that the failure record points
  to is mapped to a new place, so the record is invalid.
- If the file is set to no data COW and no error happens during the write,
  the corrupted data is corrected, so the failure record can be removed. And
  if some errors happen on the mirrors, we needn't worry either, because the
  failure record will be recreated if we read the same place again.

Sometimes we may fail to correct the data, so failure records can be left in
the tree; we need to free them when the inode is freed, or the memory leaks.
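
For illustration, the two cleanup points this patch adds boil down to the
following calls (condensed from the inode.c hunks below):

  /* after the ordered extent for the written range completes: */
  btrfs_free_io_failure_record(inode, ordered_extent->file_offset,
                               ordered_extent->file_offset +
                               ordered_extent->len - 1);

  /* when the inode is evicted, drop whatever records are left: */
  btrfs_free_io_failure_record(inode, 0, (u64)-1);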

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/extent_io.c | 34 ++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_io.h |  1 +
 fs/btrfs/inode.c     |  6 ++++++
 3 files changed, 41 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index edbc3b6..7535978 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2135,6 +2135,40 @@ out:
 	return 0;
 }
 
+/*
+ * Can be called when
+ * - we hold the extent lock
+ * - we are under an ordered extent
+ * - the inode is being freed
+ */
+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end)
+{
+	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
+	struct io_failure_record *failrec;
+	struct extent_state *state, *next;
+
+	if (RB_EMPTY_ROOT(&failure_tree->state))
+		return;
+
+	spin_lock(&failure_tree->lock);
+	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
+	while (state) {
+		if (state->start > end)
+			break;
+
+		ASSERT(state->end <= end);
+
+		next = next_state(state);
+
+		failrec = (struct io_failure_record *)state->private;
+		free_extent_state(state);
+		kfree(failrec);
+
+		state = next;
+	}
+	spin_unlock(&failure_tree->lock);
+}
+
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
 				struct io_failure_record **failrec_ret)
 {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index b23c7c2..5c48eda 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -369,6 +369,7 @@ struct io_failure_record {
 	int in_validation;
 };
 
+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end);
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
 				struct io_failure_record **failrec_ret);
 int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d6b7048..801fb40 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2639,6 +2639,10 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		goto out;
 	}
 
+	btrfs_free_io_failure_record(inode, ordered_extent->file_offset,
+				     ordered_extent->file_offset +
+				     ordered_extent->len - 1);
+
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
 		truncated = true;
 		logical_len = ordered_extent->truncated_len;
@@ -4723,6 +4727,8 @@ void btrfs_evict_inode(struct inode *inode)
 	/* do we really want it for ->i_nlink > 0 and zero btrfs_root_refs? */
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
+	btrfs_free_io_failure_record(inode, 0, (u64)-1);
+
 	if (root->fs_info->log_root_recovering) {
 		BUG_ON(test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
 				 &BTRFS_I(inode)->runtime_flags));
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/12] Btrfs: load checksum data once when submitting a direct read io
  2014-06-28 11:34 ` [PATCH 02/12] Btrfs: load checksum data once when submitting a direct read io Miao Xie
@ 2014-07-28 17:24   ` Filipe David Manana
  2014-07-29  1:56     ` Miao Xie
  0 siblings, 1 reply; 49+ messages in thread
From: Filipe David Manana @ 2014-07-28 17:24 UTC (permalink / raw)
  To: Miao Xie; +Cc: linux-btrfs

On Sat, Jun 28, 2014 at 12:34 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> The current code would load checksum data for several times when we split
> a whole direct read io because of the limit of the raid stripe, it would
> make us search the csum tree for several times. In fact, it just wasted time,
> and made the contention of the csum tree root be more serious. This patch
> improves this problem by loading the data at once.
>
> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
> ---
>  fs/btrfs/btrfs_inode.h |  1 -
>  fs/btrfs/ctree.h       |  3 +--
>  fs/btrfs/file-item.c   | 14 ++------------
>  fs/btrfs/inode.c       | 40 ++++++++++++++++++++++------------------
>  4 files changed, 25 insertions(+), 33 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 4794923..7e9f53b 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -263,7 +263,6 @@ struct btrfs_dio_private {
>
>         /* dio_bio came from fs/direct-io.c */
>         struct bio *dio_bio;
> -       u8 csum[0];
>  };
>
>  /*
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index be91397..40e9938 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3739,8 +3739,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
>  int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
>                           struct bio *bio, u32 *dst);
>  int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
> -                             struct btrfs_dio_private *dip, struct bio *bio,
> -                             u64 logical_offset);
> +                             struct bio *bio, u64 logical_offset);
>  int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
>                              struct btrfs_root *root,
>                              u64 objectid, u64 pos,
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index f46cfe4..cf1b94f 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -299,19 +299,9 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
>  }
>
>  int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
> -                             struct btrfs_dio_private *dip, struct bio *bio,
> -                             u64 offset)
> +                             struct bio *bio, u64 offset)
>  {
> -       int len = (bio->bi_iter.bi_sector << 9) - dip->disk_bytenr;
> -       u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
> -       int ret;
> -
> -       len >>= inode->i_sb->s_blocksize_bits;
> -       len *= csum_size;
> -
> -       ret = __btrfs_lookup_bio_sums(root, inode, bio, offset,
> -                                     (u32 *)(dip->csum + len), 1);
> -       return ret;
> +       return __btrfs_lookup_bio_sums(root, inode, bio, offset, NULL, 1);
>  }
>
>  int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a3f102f..969fb22 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7081,7 +7081,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
>         struct inode *inode = dip->inode;
>         struct btrfs_root *root = BTRFS_I(inode)->root;
>         struct bio *dio_bio;
> -       u32 *csums = (u32 *)dip->csum;
> +       struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> +       u32 *csums = (u32 *)io_bio->csum;
>         u64 start;
>         int i;
>
> @@ -7123,6 +7124,9 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
>         if (err)
>                 clear_bit(BIO_UPTODATE, &dio_bio->bi_flags);
>         dio_end_io(dio_bio, err);
> +
> +       if (io_bio->end_io)
> +               io_bio->end_io(io_bio, err);
>         bio_put(bio);
>  }
>
> @@ -7261,13 +7265,20 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
>                 ret = btrfs_csum_one_bio(root, inode, bio, file_offset, 1);
>                 if (ret)
>                         goto err;
> -       } else if (!skip_sum) {
> -               ret = btrfs_lookup_bio_sums_dio(root, inode, dip, bio,
> +       } else {
> +               /*
> +                * We have loaded all the csum data we need when we submit
> +                * the first bio, so skip it.
> +                */
> +               if (dip->logical_offset != file_offset)
> +                       goto map;
> +
> +               /* Load all csum data at once. */
> +               ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
>                                                 file_offset);
>                 if (ret)
>                         goto err;
>         }
> -
>  map:
>         ret = btrfs_map_bio(root, rw, bio, 0, async_submit);
>  err:
> @@ -7288,7 +7299,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
>         u64 submit_len = 0;
>         u64 map_length;
>         int nr_pages = 0;
> -       int ret = 0;
> +       int ret;
>         int async_submit = 0;
>
>         map_length = orig_bio->bi_iter.bi_size;
> @@ -7392,30 +7403,20 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
>         struct btrfs_root *root = BTRFS_I(inode)->root;
>         struct btrfs_dio_private *dip;
>         struct bio *io_bio;
> +       struct btrfs_io_bio *btrfs_bio;
>         int skip_sum;
> -       int sum_len;
>         int write = rw & REQ_WRITE;
>         int ret = 0;
> -       u16 csum_size;
>
>         skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
>
> -       io_bio = btrfs_bio_clone(dio_bio, GFP_NOFS);
> +       io_bio = btrfs_bio_clone(dio_bio, GFP_NOFS | __GFP_ZERO);

Hi Miao,

With this change (adding the __GFP_ZERO flag), I ran once into the
following warning while running xfstests (dunno exactly which test
case triggered it, likely one of those that run fsstress):

[ 3941.856860] ------------[ cut here ]------------
[ 3941.856871] WARNING: CPU: 0 PID: 4154 at mm/mempool.c:205
mempool_alloc+0xc8/0x1c0()
[ 3941.856873] Modules linked in: btrfs xor raid6_pq binfmt_misc nfsd
auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc i2c_piix4
i2c_core pcspkr evbug psmouse serio_raw e1000 [
last unloaded: btrfs]
[ 3941.856886] CPU: 0 PID: 4154 Comm: xfs_io Tainted: G        W
3.16.0-rc6-fdm-btrfs-next-37+ #1
[ 3941.856887] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 3941.856889]  0000000000000009 ffff8800d569f778 ffffffff8169a687
00000000000077b0
[ 3941.856892]  0000000000000000 ffff8800d569f7b8 ffffffff8104fb4c
00000000ffffffff
[ 3941.856894]  0000000000008050 0000000000000001 0000000000008050
ffff88004f921918
[ 3941.856896] Call Trace:
[ 3941.856901]  [<ffffffff8169a687>] dump_stack+0x4e/0x68
[ 3941.856904]  [<ffffffff8104fb4c>] warn_slowpath_common+0x8c/0xc0
[ 3941.856905]  [<ffffffff8104fb9a>] warn_slowpath_null+0x1a/0x20
[ 3941.856907]  [<ffffffff81151fc8>] mempool_alloc+0xc8/0x1c0
[ 3941.856911]  [<ffffffff810129cf>] ? save_stack_trace+0x2f/0x50
[ 3941.856918]  [<ffffffff8131331a>] bio_alloc_bioset+0x10a/0x1c0
[ 3941.856921]  [<ffffffff81314c68>] bio_clone_bioset+0x88/0x310
[ 3941.856923]  [<ffffffff81151a65>] ? mempool_alloc_slab+0x15/0x20
[ 3941.856936]  [<ffffffffa0209385>] btrfs_bio_clone+0x15/0x20 [btrfs]
[ 3941.856944]  [<ffffffffa01ed47f>] btrfs_submit_direct+0x4f/0x7b0 [btrfs]
[ 3941.856948]  [<ffffffff811fc10a>] ? do_blockdev_direct_IO+0x17ea/0x1f60
[ 3941.856952]  [<ffffffff810afb35>] ? mark_held_locks+0x75/0xa0
[ 3941.856955]  [<ffffffff816a383f>] ? _raw_spin_unlock_irqrestore+0x3f/0x70
[ 3941.856956]  [<ffffffff811fc13e>] do_blockdev_direct_IO+0x181e/0x1f60
[ 3941.856965]  [<ffffffffa01f86d0>] ?
btrfs_page_exists_in_range+0x2a0/0x2a0 [btrfs]
[ 3941.856972]  [<ffffffffa01ed430>] ?
btrfs_writepage_start_hook+0xf0/0xf0 [btrfs]
[ 3941.856974]  [<ffffffff811fc8cc>] __blockdev_direct_IO+0x4c/0x50
[ 3941.856981]  [<ffffffffa01f86d0>] ?
btrfs_page_exists_in_range+0x2a0/0x2a0 [btrfs]
[ 3941.856987]  [<ffffffffa01ed430>] ?
btrfs_writepage_start_hook+0xf0/0xf0 [btrfs]
[ 3941.856993]  [<ffffffffa01eb591>] btrfs_direct_IO+0x1a1/0x340 [btrfs]
[ 3941.856999]  [<ffffffffa01f86d0>] ?
btrfs_page_exists_in_range+0x2a0/0x2a0 [btrfs]
[ 3941.857005]  [<ffffffffa01ed430>] ?
btrfs_writepage_start_hook+0xf0/0xf0 [btrfs]
[ 3941.857007]  [<ffffffff81150210>] generic_file_direct_write+0xb0/0x180
[ 3941.857014]  [<ffffffffa01fc4a1>] btrfs_file_write_iter+0x411/0x560 [btrfs]
[ 3941.857017]  [<ffffffff811ba541>] new_sync_write+0x81/0xb0
[ 3941.857019]  [<ffffffff811bb342>] vfs_write+0xc2/0x1f0
[ 3941.857020]  [<ffffffff811bba2a>] SyS_pwrite64+0x9a/0xb0
[ 3941.857022]  [<ffffffff816a3d92>] system_call_fastpath+0x16/0x1b
[ 3941.857024] ---[ end trace c1dfd29523250709 ]---

Thanks.


>         if (!io_bio) {
>                 ret = -ENOMEM;
>                 goto free_ordered;
>         }
>
> -       if (!skip_sum && !write) {
> -               csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
> -               sum_len = dio_bio->bi_iter.bi_size >>
> -                       inode->i_sb->s_blocksize_bits;
> -               sum_len *= csum_size;
> -       } else {
> -               sum_len = 0;
> -       }
> -
> -       dip = kmalloc(sizeof(*dip) + sum_len, GFP_NOFS);
> +       dip = kmalloc(sizeof(*dip), GFP_NOFS);
>         if (!dip) {
>                 ret = -ENOMEM;
>                 goto free_io_bio;
> @@ -7441,6 +7442,9 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
>         if (!ret)
>                 return;
>
> +       btrfs_bio = btrfs_io_bio(io_bio);
> +       if (btrfs_bio->end_io)
> +               btrfs_bio->end_io(btrfs_bio, ret);
>  free_io_bio:
>         bio_put(io_bio);
>
> --
> 1.9.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/12] Btrfs: load checksum data once when submitting a direct read io
  2014-07-28 17:24   ` Filipe David Manana
@ 2014-07-29  1:56     ` Miao Xie
  0 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-07-29  1:56 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, 28 Jul 2014 18:24:47 +0100, Filipe David Manana wrote:
> On Sat, Jun 28, 2014 at 12:34 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>> The current code would load checksum data for several times when we split
>> a whole direct read io because of the limit of the raid stripe, it would
>> make us search the csum tree for several times. In fact, it just wasted time,
>> and made the contention of the csum tree root be more serious. This patch
>> improves this problem by loading the data at once.
>>
>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>> ---
>>  fs/btrfs/btrfs_inode.h |  1 -
>>  fs/btrfs/ctree.h       |  3 +--
>>  fs/btrfs/file-item.c   | 14 ++------------
>>  fs/btrfs/inode.c       | 40 ++++++++++++++++++++++------------------
>>  4 files changed, 25 insertions(+), 33 deletions(-)
>>
>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>> index 4794923..7e9f53b 100644
>> --- a/fs/btrfs/btrfs_inode.h
>> +++ b/fs/btrfs/btrfs_inode.h
>> @@ -263,7 +263,6 @@ struct btrfs_dio_private {
>>
>>         /* dio_bio came from fs/direct-io.c */
>>         struct bio *dio_bio;
>> -       u8 csum[0];
>>  };
>>
>>  /*
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index be91397..40e9938 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -3739,8 +3739,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
>>  int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
>>                           struct bio *bio, u32 *dst);
>>  int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
>> -                             struct btrfs_dio_private *dip, struct bio *bio,
>> -                             u64 logical_offset);
>> +                             struct bio *bio, u64 logical_offset);
>>  int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
>>                              struct btrfs_root *root,
>>                              u64 objectid, u64 pos,
>> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
>> index f46cfe4..cf1b94f 100644
>> --- a/fs/btrfs/file-item.c
>> +++ b/fs/btrfs/file-item.c
>> @@ -299,19 +299,9 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
>>  }
>>
>>  int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
>> -                             struct btrfs_dio_private *dip, struct bio *bio,
>> -                             u64 offset)
>> +                             struct bio *bio, u64 offset)
>>  {
>> -       int len = (bio->bi_iter.bi_sector << 9) - dip->disk_bytenr;
>> -       u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
>> -       int ret;
>> -
>> -       len >>= inode->i_sb->s_blocksize_bits;
>> -       len *= csum_size;
>> -
>> -       ret = __btrfs_lookup_bio_sums(root, inode, bio, offset,
>> -                                     (u32 *)(dip->csum + len), 1);
>> -       return ret;
>> +       return __btrfs_lookup_bio_sums(root, inode, bio, offset, NULL, 1);
>>  }
>>
>>  int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index a3f102f..969fb22 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -7081,7 +7081,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
>>         struct inode *inode = dip->inode;
>>         struct btrfs_root *root = BTRFS_I(inode)->root;
>>         struct bio *dio_bio;
>> -       u32 *csums = (u32 *)dip->csum;
>> +       struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
>> +       u32 *csums = (u32 *)io_bio->csum;
>>         u64 start;
>>         int i;
>>
>> @@ -7123,6 +7124,9 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
>>         if (err)
>>                 clear_bit(BIO_UPTODATE, &dio_bio->bi_flags);
>>         dio_end_io(dio_bio, err);
>> +
>> +       if (io_bio->end_io)
>> +               io_bio->end_io(io_bio, err);
>>         bio_put(bio);
>>  }
>>
>> @@ -7261,13 +7265,20 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
>>                 ret = btrfs_csum_one_bio(root, inode, bio, file_offset, 1);
>>                 if (ret)
>>                         goto err;
>> -       } else if (!skip_sum) {
>> -               ret = btrfs_lookup_bio_sums_dio(root, inode, dip, bio,
>> +       } else {
>> +               /*
>> +                * We have loaded all the csum data we need when we submit
>> +                * the first bio, so skip it.
>> +                */
>> +               if (dip->logical_offset != file_offset)
>> +                       goto map;
>> +
>> +               /* Load all csum data at once. */
>> +               ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
>>                                                 file_offset);
>>                 if (ret)
>>                         goto err;
>>         }
>> -
>>  map:
>>         ret = btrfs_map_bio(root, rw, bio, 0, async_submit);
>>  err:
>> @@ -7288,7 +7299,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
>>         u64 submit_len = 0;
>>         u64 map_length;
>>         int nr_pages = 0;
>> -       int ret = 0;
>> +       int ret;
>>         int async_submit = 0;
>>
>>         map_length = orig_bio->bi_iter.bi_size;
>> @@ -7392,30 +7403,20 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
>>         struct btrfs_root *root = BTRFS_I(inode)->root;
>>         struct btrfs_dio_private *dip;
>>         struct bio *io_bio;
>> +       struct btrfs_io_bio *btrfs_bio;
>>         int skip_sum;
>> -       int sum_len;
>>         int write = rw & REQ_WRITE;
>>         int ret = 0;
>> -       u16 csum_size;
>>
>>         skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
>>
>> -       io_bio = btrfs_bio_clone(dio_bio, GFP_NOFS);
>> +       io_bio = btrfs_bio_clone(dio_bio, GFP_NOFS | __GFP_ZERO);
> 
> Hi Miao,
> 
> With this change (adding the __GFP_ZERO flag), I ran once into the
> following warning while running xfstests (dunno exactly which test
> case triggered it, likely one of those that run fsstress):

Thanks for testing.
I'll fix it.

Miao

> 
> [ 3941.856860] ------------[ cut here ]------------
> [ 3941.856871] WARNING: CPU: 0 PID: 4154 at mm/mempool.c:205
> mempool_alloc+0xc8/0x1c0()
> [ 3941.856873] Modules linked in: btrfs xor raid6_pq binfmt_misc nfsd
> auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc i2c_piix4
> i2c_core pcspkr evbug psmouse serio_raw e1000 [
> last unloaded: btrfs]
> [ 3941.856886] CPU: 0 PID: 4154 Comm: xfs_io Tainted: G        W
> 3.16.0-rc6-fdm-btrfs-next-37+ #1
> [ 3941.856887] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 3941.856889]  0000000000000009 ffff8800d569f778 ffffffff8169a687
> 00000000000077b0
> [ 3941.856892]  0000000000000000 ffff8800d569f7b8 ffffffff8104fb4c
> 00000000ffffffff
> [ 3941.856894]  0000000000008050 0000000000000001 0000000000008050
> ffff88004f921918
> [ 3941.856896] Call Trace:
> [ 3941.856901]  [<ffffffff8169a687>] dump_stack+0x4e/0x68
> [ 3941.856904]  [<ffffffff8104fb4c>] warn_slowpath_common+0x8c/0xc0
> [ 3941.856905]  [<ffffffff8104fb9a>] warn_slowpath_null+0x1a/0x20
> [ 3941.856907]  [<ffffffff81151fc8>] mempool_alloc+0xc8/0x1c0
> [ 3941.856911]  [<ffffffff810129cf>] ? save_stack_trace+0x2f/0x50
> [ 3941.856918]  [<ffffffff8131331a>] bio_alloc_bioset+0x10a/0x1c0
> [ 3941.856921]  [<ffffffff81314c68>] bio_clone_bioset+0x88/0x310
> [ 3941.856923]  [<ffffffff81151a65>] ? mempool_alloc_slab+0x15/0x20
> [ 3941.856936]  [<ffffffffa0209385>] btrfs_bio_clone+0x15/0x20 [btrfs]
> [ 3941.856944]  [<ffffffffa01ed47f>] btrfs_submit_direct+0x4f/0x7b0 [btrfs]
> [ 3941.856948]  [<ffffffff811fc10a>] ? do_blockdev_direct_IO+0x17ea/0x1f60
> [ 3941.856952]  [<ffffffff810afb35>] ? mark_held_locks+0x75/0xa0
> [ 3941.856955]  [<ffffffff816a383f>] ? _raw_spin_unlock_irqrestore+0x3f/0x70
> [ 3941.856956]  [<ffffffff811fc13e>] do_blockdev_direct_IO+0x181e/0x1f60
> [ 3941.856965]  [<ffffffffa01f86d0>] ?
> btrfs_page_exists_in_range+0x2a0/0x2a0 [btrfs]
> [ 3941.856972]  [<ffffffffa01ed430>] ?
> btrfs_writepage_start_hook+0xf0/0xf0 [btrfs]
> [ 3941.856974]  [<ffffffff811fc8cc>] __blockdev_direct_IO+0x4c/0x50
> [ 3941.856981]  [<ffffffffa01f86d0>] ?
> btrfs_page_exists_in_range+0x2a0/0x2a0 [btrfs]
> [ 3941.856987]  [<ffffffffa01ed430>] ?
> btrfs_writepage_start_hook+0xf0/0xf0 [btrfs]
> [ 3941.856993]  [<ffffffffa01eb591>] btrfs_direct_IO+0x1a1/0x340 [btrfs]
> [ 3941.856999]  [<ffffffffa01f86d0>] ?
> btrfs_page_exists_in_range+0x2a0/0x2a0 [btrfs]
> [ 3941.857005]  [<ffffffffa01ed430>] ?
> btrfs_writepage_start_hook+0xf0/0xf0 [btrfs]
> [ 3941.857007]  [<ffffffff81150210>] generic_file_direct_write+0xb0/0x180
> [ 3941.857014]  [<ffffffffa01fc4a1>] btrfs_file_write_iter+0x411/0x560 [btrfs]
> [ 3941.857017]  [<ffffffff811ba541>] new_sync_write+0x81/0xb0
> [ 3941.857019]  [<ffffffff811bb342>] vfs_write+0xc2/0x1f0
> [ 3941.857020]  [<ffffffff811bba2a>] SyS_pwrite64+0x9a/0xb0
> [ 3941.857022]  [<ffffffff816a3d92>] system_call_fastpath+0x16/0x1b
> [ 3941.857024] ---[ end trace c1dfd29523250709 ]---
> 
> Thanks.
> 
> 
>>         if (!io_bio) {
>>                 ret = -ENOMEM;
>>                 goto free_ordered;
>>         }
>>
>> -       if (!skip_sum && !write) {
>> -               csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
>> -               sum_len = dio_bio->bi_iter.bi_size >>
>> -                       inode->i_sb->s_blocksize_bits;
>> -               sum_len *= csum_size;
>> -       } else {
>> -               sum_len = 0;
>> -       }
>> -
>> -       dip = kmalloc(sizeof(*dip) + sum_len, GFP_NOFS);
>> +       dip = kmalloc(sizeof(*dip), GFP_NOFS);
>>         if (!dip) {
>>                 ret = -ENOMEM;
>>                 goto free_io_bio;
>> @@ -7441,6 +7442,9 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
>>         if (!ret)
>>                 return;
>>
>> +       btrfs_bio = btrfs_io_bio(io_bio);
>> +       if (btrfs_bio->end_io)
>> +               btrfs_bio->end_io(btrfs_bio, ret);
>>  free_io_bio:
>>         bio_put(io_bio);
>>
>> --
>> 1.9.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v2 00/12] Implement the data repair function for direct read
@ 2014-07-29  9:23 ` Miao Xie
  2014-07-29  9:23   ` [PATCH v2 01/12] Btrfs: fix put dio bio twice when we submit dio bio fail Miao Xie
                     ` (11 more replies)
  0 siblings, 12 replies; 49+ messages in thread
From: Miao Xie @ 2014-07-29  9:23 UTC (permalink / raw)
  To: linux-btrfs

This patchset implements the data repair function for direct reads; it is
implemented like the buffered read path:
1. When we find that the data is corrupted, we try to read it from another
   mirror.
2. When the IO on the mirror ends, we insert the endio work into the system
   workqueue, not btrfs's own endio workqueue, because the original endio
   work is still blocked in the btrfs endio workqueue; if we inserted the
   endio work of the mirror IO into that workqueue, a deadlock would happen.
3. If we get the right data, we write it back to repair the corrupted mirror.
4. If the data on the new mirror is still corrupted, we try the next mirror
   until we read the right data or all the mirrors have been traversed.
5. After the above work, we set the uptodate flag according to the result.

The difference is that a direct read may be split into several small IOs; in
order to get the number of the mirror on which an IO error happens, we have
to do the data check and repair in the endio function of those sub-IO
requests (see the condensed sketch below).

Besides that, we also fixed some bugs of direct IO.
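
As a condensed view of point 2 (taken from the end_workqueue_bio() hunk in
patch 11, not new code), the read-completion path dispatches repair bios to
the system workqueue:

  /* in end_workqueue_bio(), on the read-completion path: */
  if (unlikely(end_io_wq->metadata == BTRFS_WQ_ENDIO_DIO_REPAIR))
          /*
           * The original endio work is still blocked in the btrfs endio
           * workqueue, so queueing the repair endio there could deadlock;
           * use the system workqueue instead.
           */
          queue_work(system_wq, &end_io_wq->work.normal_work);
  else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56)
          btrfs_queue_work(fs_info->endio_raid56_workers,
                           &end_io_wq->work);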

Changelog v1 -> v2:
- Fix the warning which was triggered by __GFP_ZERO in the 2nd patch

We can pull this patchset from the URL

  https://github.com/miaoxie/linux-btrfs.git for-Chris

Thanks
Miao
---
Miao Xie (12):
  Btrfs: fix put dio bio twice when we submit dio bio fail
  Btrfs: load checksum data once when submitting a direct read io
  Btrfs: cleanup similar code of the buffered data data check and dio
    read data check
  Btrfs: do file data check by sub-bio's self
  Btrfs: fix missing error handler if submiting re-read bio fails
  Btrfs: Cleanup unused variant and argument of IO failure handlers
  Btrfs: split bio_readpage_error into several functions
  Btrfs: modify repair_io_failure and make it suit direct io
  Btrfs: modify clean_io_failure and make it suit direct io
  Btrfs: Set real mirror number for read operation on RAID0/5/6
  Btrfs: implement repair function when direct read fails
  Btrfs: cleanup the read failure record after write or when the inode
    is freeing

 fs/btrfs/btrfs_inode.h |  10 +-
 fs/btrfs/ctree.h       |   3 +-
 fs/btrfs/disk-io.c     |  43 +++--
 fs/btrfs/disk-io.h     |   1 +
 fs/btrfs/extent_io.c   | 254 ++++++++++++++++++----------
 fs/btrfs/extent_io.h   |  38 ++++-
 fs/btrfs/file-item.c   |  14 +-
 fs/btrfs/inode.c       | 451 ++++++++++++++++++++++++++++++++++++++++---------
 fs/btrfs/scrub.c       |   4 +-
 fs/btrfs/volumes.c     |   5 +
 fs/btrfs/volumes.h     |   5 +-
 11 files changed, 622 insertions(+), 206 deletions(-)

-- 
1.9.3


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v2 01/12] Btrfs: fix put dio bio twice when we submit dio bio fail
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
@ 2014-07-29  9:23   ` Miao Xie
  2014-07-29  9:24   ` [PATCH v2 02/12] Btrfs: load checksum data once when submitting a direct read io Miao Xie
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-07-29  9:23 UTC (permalink / raw)
  To: linux-btrfs

The caller of btrfs_submit_direct_hook() will put the original dio bio
when btrfs_submit_direct_hook() return a error number, so we needn't
put the original bio in btrfs_submit_direct_hook().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1-v2:
- None
---
 fs/btrfs/inode.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6b65fab..548489e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7294,10 +7294,8 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	map_length = orig_bio->bi_iter.bi_size;
 	ret = btrfs_map_block(root->fs_info, rw, start_sector << 9,
 			      &map_length, NULL, 0);
-	if (ret) {
-		bio_put(orig_bio);
+	if (ret)
 		return -EIO;
-	}
 
 	if (map_length >= orig_bio->bi_iter.bi_size) {
 		bio = orig_bio;
@@ -7314,6 +7312,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	bio = btrfs_dio_bio_alloc(orig_bio->bi_bdev, start_sector, GFP_NOFS);
 	if (!bio)
 		return -ENOMEM;
+
 	bio->bi_private = dip;
 	bio->bi_end_io = btrfs_end_dio_bio;
 	atomic_inc(&dip->pending_bios);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 02/12] Btrfs: load checksum data once when submitting a direct read io
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
  2014-07-29  9:23   ` [PATCH v2 01/12] Btrfs: fix put dio bio twice when we submit dio bio fail Miao Xie
@ 2014-07-29  9:24   ` Miao Xie
  2014-08-08  0:32     ` Filipe David Manana
  2014-07-29  9:24   ` [PATCH v2 03/12] Btrfs: cleanup similar code of the buffered data data check and dio read data check Miao Xie
                     ` (9 subsequent siblings)
  11 siblings, 1 reply; 49+ messages in thread
From: Miao Xie @ 2014-07-29  9:24 UTC (permalink / raw)
  To: linux-btrfs

The current code loads the checksum data several times when we split a whole
direct read IO because of the limit of the RAID stripe, which makes us search
the csum tree several times. That just wastes time and makes the contention
on the csum tree root more serious. This patch fixes the problem by loading
all the checksum data at once.
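
Condensed from the __btrfs_submit_dio_bio() hunk below, the read path now
looks up the csums for the whole original bio only once, when the first
sub-bio is submitted:

  /*
   * We have loaded all the csum data we need when we submitted the
   * first sub-bio, so skip the lookup for the later ones.
   */
  if (dip->logical_offset != file_offset)
          goto map;

  /* first sub-bio: load the csum data for the whole IO at once */
  ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
                                  file_offset);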

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v2:
- Remove the __GFP_ZERO flag in btrfs_submit_direct because it would trigger
  a WARNING. Reported by Filipe David Manana, thanks.
---
 fs/btrfs/btrfs_inode.h |  1 -
 fs/btrfs/ctree.h       |  3 +--
 fs/btrfs/extent_io.c   | 13 +++++++++++--
 fs/btrfs/file-item.c   | 14 ++------------
 fs/btrfs/inode.c       | 38 +++++++++++++++++++++-----------------
 5 files changed, 35 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index a0cf3e5..b69bf7e 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -263,7 +263,6 @@ struct btrfs_dio_private {
 
 	/* dio_bio came from fs/direct-io.c */
 	struct bio *dio_bio;
-	u8 csum[0];
 };
 
 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index be91397..40e9938 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3739,8 +3739,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 			  struct bio *bio, u32 *dst);
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
-			      struct btrfs_dio_private *dip, struct bio *bio,
-			      u64 logical_offset);
+			      struct bio *bio, u64 logical_offset);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
 			     u64 objectid, u64 pos,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 23398ad..0fb63c4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2617,9 +2617,18 @@ btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
 
 struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
 {
-	return bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
-}
+	struct btrfs_io_bio *btrfs_bio;
+	struct bio *new;
 
+	new = bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
+	if (new) {
+		btrfs_bio = btrfs_io_bio(new);
+		btrfs_bio->csum = NULL;
+		btrfs_bio->csum_allocated = NULL;
+		btrfs_bio->end_io = NULL;
+	}
+	return new;
+}
 
 /* this also allocates from the btrfs_bioset */
 struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index f46cfe4..cf1b94f 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -299,19 +299,9 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 }
 
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
-			      struct btrfs_dio_private *dip, struct bio *bio,
-			      u64 offset)
+			      struct bio *bio, u64 offset)
 {
-	int len = (bio->bi_iter.bi_sector << 9) - dip->disk_bytenr;
-	u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
-	int ret;
-
-	len >>= inode->i_sb->s_blocksize_bits;
-	len *= csum_size;
-
-	ret = __btrfs_lookup_bio_sums(root, inode, bio, offset,
-				      (u32 *)(dip->csum + len), 1);
-	return ret;
+	return __btrfs_lookup_bio_sums(root, inode, bio, offset, NULL, 1);
 }
 
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 548489e..fd88126 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7081,7 +7081,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	struct inode *inode = dip->inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct bio *dio_bio;
-	u32 *csums = (u32 *)dip->csum;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	u32 *csums = (u32 *)io_bio->csum;
 	u64 start;
 	int i;
 
@@ -7123,6 +7124,9 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	if (err)
 		clear_bit(BIO_UPTODATE, &dio_bio->bi_flags);
 	dio_end_io(dio_bio, err);
+
+	if (io_bio->end_io)
+		io_bio->end_io(io_bio, err);
 	bio_put(bio);
 }
 
@@ -7261,13 +7265,20 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 		ret = btrfs_csum_one_bio(root, inode, bio, file_offset, 1);
 		if (ret)
 			goto err;
-	} else if (!skip_sum) {
-		ret = btrfs_lookup_bio_sums_dio(root, inode, dip, bio,
+	} else {
+		/*
+		 * We have loaded all the csum data we need when we submit
+		 * the first bio, so skip it.
+		 */
+		if (dip->logical_offset != file_offset)
+			goto map;
+
+		/* Load all csum data at once. */
+		ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
 						file_offset);
 		if (ret)
 			goto err;
 	}
-
 map:
 	ret = btrfs_map_bio(root, rw, bio, 0, async_submit);
 err:
@@ -7288,7 +7299,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	u64 submit_len = 0;
 	u64 map_length;
 	int nr_pages = 0;
-	int ret = 0;
+	int ret;
 	int async_submit = 0;
 
 	map_length = orig_bio->bi_iter.bi_size;
@@ -7392,11 +7403,10 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_dio_private *dip;
 	struct bio *io_bio;
+	struct btrfs_io_bio *btrfs_bio;
 	int skip_sum;
-	int sum_len;
 	int write = rw & REQ_WRITE;
 	int ret = 0;
-	u16 csum_size;
 
 	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
 
@@ -7406,16 +7416,7 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 		goto free_ordered;
 	}
 
-	if (!skip_sum && !write) {
-		csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
-		sum_len = dio_bio->bi_iter.bi_size >>
-			inode->i_sb->s_blocksize_bits;
-		sum_len *= csum_size;
-	} else {
-		sum_len = 0;
-	}
-
-	dip = kmalloc(sizeof(*dip) + sum_len, GFP_NOFS);
+	dip = kmalloc(sizeof(*dip), GFP_NOFS);
 	if (!dip) {
 		ret = -ENOMEM;
 		goto free_io_bio;
@@ -7441,6 +7442,9 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 	if (!ret)
 		return;
 
+	btrfs_bio = btrfs_io_bio(io_bio);
+	if (btrfs_bio->end_io)
+		btrfs_bio->end_io(btrfs_bio, ret);
 free_io_bio:
 	bio_put(io_bio);
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 03/12] Btrfs: cleanup similar code of the buffered data data check and dio read data check
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
  2014-07-29  9:23   ` [PATCH v2 01/12] Btrfs: fix put dio bio twice when we submit dio bio fail Miao Xie
  2014-07-29  9:24   ` [PATCH v2 02/12] Btrfs: load checksum data once when submitting a direct read io Miao Xie
@ 2014-07-29  9:24   ` Miao Xie
  2014-07-29  9:24   ` [PATCH v2 04/12] Btrfs: do file data check by sub-bio's self Miao Xie
                     ` (8 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-07-29  9:24 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1-v2:
- None
---
 fs/btrfs/inode.c | 102 +++++++++++++++++++++++++------------------------------
 1 file changed, 47 insertions(+), 55 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fd88126..2e261b1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2830,6 +2830,40 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
 	return 0;
 }
 
+static int __readpage_endio_check(struct inode *inode,
+				  struct btrfs_io_bio *io_bio,
+				  int icsum, struct page *page,
+				  int pgoff, u64 start, size_t len)
+{
+	char *kaddr;
+	u32 csum_expected;
+	u32 csum = ~(u32)0;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+
+	csum_expected = *(((u32 *)io_bio->csum) + icsum);
+
+	kaddr = kmap_atomic(page);
+	csum = btrfs_csum_data(kaddr + pgoff, csum,  len);
+	btrfs_csum_final(csum, (char *)&csum);
+	if (csum != csum_expected)
+		goto zeroit;
+
+	kunmap_atomic(kaddr);
+	return 0;
+zeroit:
+	if (__ratelimit(&_rs))
+		btrfs_info(BTRFS_I(inode)->root->fs_info,
+			   "csum failed ino %llu off %llu csum %u expected csum %u",
+			   btrfs_ino(inode), start, csum, csum_expected);
+	memset(kaddr + pgoff, 1, len);
+	flush_dcache_page(page);
+	kunmap_atomic(kaddr);
+	if (csum_expected == 0)
+		return 0;
+	return -EIO;
+}
+
 /*
  * when reads are done, we need to check csums to verify the data is correct
  * if there's a match, we allow the bio to finish.  If not, the code in
@@ -2842,20 +2876,15 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	size_t offset = start - page_offset(page);
 	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
-	char *kaddr;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	u32 csum_expected;
-	u32 csum = ~(u32)0;
-	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
-	                              DEFAULT_RATELIMIT_BURST);
 
 	if (PageChecked(page)) {
 		ClearPageChecked(page);
-		goto good;
+		return 0;
 	}
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
-		goto good;
+		return 0;
 
 	if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
 	    test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
@@ -2865,28 +2894,8 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	}
 
 	phy_offset >>= inode->i_sb->s_blocksize_bits;
-	csum_expected = *(((u32 *)io_bio->csum) + phy_offset);
-
-	kaddr = kmap_atomic(page);
-	csum = btrfs_csum_data(kaddr + offset, csum,  end - start + 1);
-	btrfs_csum_final(csum, (char *)&csum);
-	if (csum != csum_expected)
-		goto zeroit;
-
-	kunmap_atomic(kaddr);
-good:
-	return 0;
-
-zeroit:
-	if (__ratelimit(&_rs))
-		btrfs_info(root->fs_info, "csum failed ino %llu off %llu csum %u expected csum %u",
-			btrfs_ino(page->mapping->host), start, csum, csum_expected);
-	memset(kaddr + offset, 1, end - start + 1);
-	flush_dcache_page(page);
-	kunmap_atomic(kaddr);
-	if (csum_expected == 0)
-		return 0;
-	return -EIO;
+	return __readpage_endio_check(inode, io_bio, phy_offset, page, offset,
+				      start, (size_t)(end - start + 1));
 }
 
 struct delayed_iput {
@@ -7079,41 +7088,24 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio_vec *bvec;
 	struct inode *inode = dip->inode;
-	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct bio *dio_bio;
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
-	u32 *csums = (u32 *)io_bio->csum;
 	u64 start;
+	int ret;
 	int i;
 
+	if (err || (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
+		goto skip_checksum;
+
 	start = dip->logical_offset;
 	bio_for_each_segment_all(bvec, bio, i) {
-		if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
-			struct page *page = bvec->bv_page;
-			char *kaddr;
-			u32 csum = ~(u32)0;
-			unsigned long flags;
-
-			local_irq_save(flags);
-			kaddr = kmap_atomic(page);
-			csum = btrfs_csum_data(kaddr + bvec->bv_offset,
-					       csum, bvec->bv_len);
-			btrfs_csum_final(csum, (char *)&csum);
-			kunmap_atomic(kaddr);
-			local_irq_restore(flags);
-
-			flush_dcache_page(bvec->bv_page);
-			if (csum != csums[i]) {
-				btrfs_err(root->fs_info, "csum failed ino %llu off %llu csum %u expected csum %u",
-					  btrfs_ino(inode), start, csum,
-					  csums[i]);
-				err = -EIO;
-			}
-		}
-
+		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
+					     0, start, bvec->bv_len);
+		if (ret)
+			err = -EIO;
 		start += bvec->bv_len;
 	}
-
+skip_checksum:
 	unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
 		      dip->logical_offset + dip->bytes - 1);
 	dio_bio = dip->dio_bio;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 04/12] Btrfs: do file data check by sub-bio's self
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
                     ` (2 preceding siblings ...)
  2014-07-29  9:24   ` [PATCH v2 03/12] Btrfs: cleanup similar code of the buffered data data check and dio read data check Miao Xie
@ 2014-07-29  9:24   ` Miao Xie
  2014-07-29  9:24   ` [PATCH v2 05/12] Btrfs: fix missing error handler if submiting re-read bio fails Miao Xie
                     ` (7 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-07-29  9:24 UTC (permalink / raw)
  To: linux-btrfs

Direct IO splits the original bio into several sub-bios because of the limit
of the RAID stripe, and the filesystem waits for all sub-bios and then runs
the final endio process.

But it is very hard to implement data repair when a dio read failure happens,
because in the final endio function we do not know which mirror the data was
read from. So in order to implement data repair, we have to move the file
data check from the final endio function to the sub-bio endio function, in
which we can get the mirror number of the device we accessed. This patch does
that work as the first step of the direct IO data repair implementation.
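
Condensed from the btrfs_end_dio_bio() hunk below: each sub-bio now verifies
its own pages at endio time, where btrfs_io_bio(bio)->mirror_num still
identifies the mirror that was read:

  /* in each sub-bio's endio, btrfs_end_dio_bio(): */
  if (!err && dip->subio_endio)
          /* the sub-bio still knows which mirror it was read from */
          err = dip->subio_endio(dip->inode, btrfs_io_bio(bio));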

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1-v2:
- None
---
 fs/btrfs/btrfs_inode.h |   9 +++++
 fs/btrfs/extent_io.c   |   2 +-
 fs/btrfs/inode.c       | 100 ++++++++++++++++++++++++++++++++++++-------------
 fs/btrfs/volumes.h     |   5 ++-
 4 files changed, 87 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index b69bf7e..745fca40 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -245,8 +245,11 @@ static inline int btrfs_inode_in_log(struct inode *inode, u64 generation)
 	return 0;
 }
 
+#define BTRFS_DIO_ORIG_BIO_SUBMITTED	0x1
+
 struct btrfs_dio_private {
 	struct inode *inode;
+	unsigned long flags;
 	u64 logical_offset;
 	u64 disk_bytenr;
 	u64 bytes;
@@ -263,6 +266,12 @@ struct btrfs_dio_private {
 
 	/* dio_bio came from fs/direct-io.c */
 	struct bio *dio_bio;
+
+	/*
+	 * The original bio may be split into several sub-bios; the data
+	 * check is done during the endio of those sub-bios
+	 */
+	int (*subio_endio)(struct inode *, struct btrfs_io_bio *);
 };
 
 /*
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0fb63c4..881fc49 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2469,7 +2469,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 		struct inode *inode = page->mapping->host;
 
 		pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
-			 "mirror=%lu\n", (u64)bio->bi_iter.bi_sector, err,
+			 "mirror=%u\n", (u64)bio->bi_iter.bi_sector, err,
 			 io_bio->mirror_num);
 		tree = &BTRFS_I(inode)->io_tree;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2e261b1..3e95a2b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7083,29 +7083,40 @@ unlock_err:
 	return ret;
 }
 
-static void btrfs_endio_direct_read(struct bio *bio, int err)
+static int btrfs_subio_endio_read(struct inode *inode,
+				  struct btrfs_io_bio *io_bio)
 {
-	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio_vec *bvec;
-	struct inode *inode = dip->inode;
-	struct bio *dio_bio;
-	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
 	u64 start;
-	int ret;
 	int i;
+	int ret;
+	int err = 0;
 
-	if (err || (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto skip_checksum;
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
+		return 0;
 
-	start = dip->logical_offset;
-	bio_for_each_segment_all(bvec, bio, i) {
+	start = io_bio->logical;
+	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
 		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
 					     0, start, bvec->bv_len);
 		if (ret)
 			err = -EIO;
 		start += bvec->bv_len;
 	}
-skip_checksum:
+
+	return err;
+}
+
+static void btrfs_endio_direct_read(struct bio *bio, int err)
+{
+	struct btrfs_dio_private *dip = bio->bi_private;
+	struct inode *inode = dip->inode;
+	struct bio *dio_bio;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+
+	if (!err && (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED))
+		err = btrfs_subio_endio_read(inode, io_bio);
+
 	unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
 		      dip->logical_offset + dip->bytes - 1);
 	dio_bio = dip->dio_bio;
@@ -7182,6 +7193,7 @@ static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
 static void btrfs_end_dio_bio(struct bio *bio, int err)
 {
 	struct btrfs_dio_private *dip = bio->bi_private;
+	int ret;
 
 	if (err) {
 		btrfs_err(BTRFS_I(dip->inode)->root->fs_info,
@@ -7189,6 +7201,13 @@ static void btrfs_end_dio_bio(struct bio *bio, int err)
 		      btrfs_ino(dip->inode), bio->bi_rw,
 		      (unsigned long long)bio->bi_iter.bi_sector,
 		      bio->bi_iter.bi_size, err);
+	} else if (dip->subio_endio) {
+		ret = dip->subio_endio(dip->inode, btrfs_io_bio(bio));
+		if (ret)
+			err = ret;
+	}
+
+	if (err) {
 		dip->errors = 1;
 
 		/*
@@ -7219,6 +7238,38 @@ static struct bio *btrfs_dio_bio_alloc(struct block_device *bdev,
 	return btrfs_bio_alloc(bdev, first_sector, nr_vecs, gfp_flags);
 }
 
+static inline int btrfs_lookup_and_bind_dio_csum(struct btrfs_root *root,
+						 struct inode *inode,
+						 struct btrfs_dio_private *dip,
+						 struct bio *bio,
+						 u64 file_offset)
+{
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	struct btrfs_io_bio *orig_io_bio = btrfs_io_bio(dip->orig_bio);
+	int ret;
+
+	/*
+	 * We load all the csum data we need when we submit
+	 * the first bio to reduce the csum tree search and
+	 * contention.
+	 */
+	if (dip->logical_offset == file_offset) {
+		ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
+						file_offset);
+		if (ret)
+			return ret;
+	}
+
+	if (bio == dip->orig_bio)
+		return 0;
+
+	file_offset -= dip->logical_offset;
+	file_offset >>= inode->i_sb->s_blocksize_bits;
+	io_bio->csum = (u8 *)(((u32 *)orig_io_bio->csum) + file_offset);
+
+	return 0;
+}
+
 static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 					 int rw, u64 file_offset, int skip_sum,
 					 int async_submit)
@@ -7258,16 +7309,8 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 		if (ret)
 			goto err;
 	} else {
-		/*
-		 * We have loaded all the csum data we need when we submit
-		 * the first bio, so skip it.
-		 */
-		if (dip->logical_offset != file_offset)
-			goto map;
-
-		/* Load all csum data at once. */
-		ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
-						file_offset);
+		ret = btrfs_lookup_and_bind_dio_csum(root, inode, dip, bio,
+						     file_offset);
 		if (ret)
 			goto err;
 	}
@@ -7302,6 +7345,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 
 	if (map_length >= orig_bio->bi_iter.bi_size) {
 		bio = orig_bio;
+		dip->flags |= BTRFS_DIO_ORIG_BIO_SUBMITTED;
 		goto submit;
 	}
 
@@ -7318,6 +7362,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 
 	bio->bi_private = dip;
 	bio->bi_end_io = btrfs_end_dio_bio;
+	btrfs_io_bio(bio)->logical = file_offset;
 	atomic_inc(&dip->pending_bios);
 
 	while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
@@ -7352,6 +7397,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 				goto out_err;
 			bio->bi_private = dip;
 			bio->bi_end_io = btrfs_end_dio_bio;
+			btrfs_io_bio(bio)->logical = file_offset;
 
 			map_length = orig_bio->bi_iter.bi_size;
 			ret = btrfs_map_block(root->fs_info, rw,
@@ -7408,7 +7454,7 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 		goto free_ordered;
 	}
 
-	dip = kmalloc(sizeof(*dip), GFP_NOFS);
+	dip = kzalloc(sizeof(*dip), GFP_NOFS);
 	if (!dip) {
 		ret = -ENOMEM;
 		goto free_io_bio;
@@ -7420,21 +7466,23 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 	dip->bytes = dio_bio->bi_iter.bi_size;
 	dip->disk_bytenr = (u64)dio_bio->bi_iter.bi_sector << 9;
 	io_bio->bi_private = dip;
-	dip->errors = 0;
 	dip->orig_bio = io_bio;
 	dip->dio_bio = dio_bio;
 	atomic_set(&dip->pending_bios, 0);
+	btrfs_bio = btrfs_io_bio(io_bio);
+	btrfs_bio->logical = file_offset;
 
-	if (write)
+	if (write) {
 		io_bio->bi_end_io = btrfs_endio_direct_write;
-	else
+	} else {
 		io_bio->bi_end_io = btrfs_endio_direct_read;
+		dip->subio_endio = btrfs_subio_endio_read;
+	}
 
 	ret = btrfs_submit_direct_hook(rw, dip, skip_sum);
 	if (!ret)
 		return;
 
-	btrfs_bio = btrfs_io_bio(io_bio);
 	if (btrfs_bio->end_io)
 		btrfs_bio->end_io(btrfs_bio, ret);
 free_io_bio:
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 2aaa00c..cd31de1 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -167,8 +167,9 @@ struct btrfs_fs_devices {
  */
 typedef void (btrfs_io_bio_end_io_t) (struct btrfs_io_bio *bio, int err);
 struct btrfs_io_bio {
-	unsigned long mirror_num;
-	unsigned long stripe_index;
+	unsigned int mirror_num;
+	unsigned int stripe_index;
+	u64 logical;
 	u8 *csum;
 	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
 	u8 *csum_allocated;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 05/12] Btrfs: fix missing error handler if submitting re-read bio fails
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
                     ` (3 preceding siblings ...)
  2014-07-29  9:24   ` [PATCH v2 04/12] Btrfs: do file data check by sub-bio's self Miao Xie
@ 2014-07-29  9:24   ` Miao Xie
  2014-07-29  9:24   ` [PATCH v2 06/12] Btrfs: Cleanup unused variable and argument of IO failure handlers Miao Xie
                     ` (6 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-07-29  9:24 UTC (permalink / raw)
  To: linux-btrfs

We forgot to free the failure record and the bio when submitting the re-read
bio failed; fix that.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1-v2:
- None
---
 fs/btrfs/extent_io.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 881fc49..fb00736 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2345,6 +2345,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	ret = tree->ops->submit_bio_hook(inode, read_mode, bio,
 					 failrec->this_mirror,
 					 failrec->bio_flags, 0);
+	if (ret) {
+		free_io_failure(inode, failrec, 0);
+		bio_put(bio);
+	}
+
 	return ret;
 }
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 06/12] Btrfs: Cleanup unused variable and argument of IO failure handlers
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
                     ` (4 preceding siblings ...)
  2014-07-29  9:24   ` [PATCH v2 05/12] Btrfs: fix missing error handler if submitting re-read bio fails Miao Xie
@ 2014-07-29  9:24   ` Miao Xie
  2014-07-29  9:24   ` [PATCH v2 07/12] Btrfs: split bio_readpage_error into several functions Miao Xie
                     ` (5 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-07-29  9:24 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1-v2:
- None
---
 fs/btrfs/extent_io.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fb00736..f71b34f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1978,8 +1978,7 @@ struct io_failure_record {
 	int in_validation;
 };
 
-static int free_io_failure(struct inode *inode, struct io_failure_record *rec,
-				int did_repair)
+static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
 	int ret;
 	int err = 0;
@@ -2106,7 +2105,6 @@ static int clean_io_failure(u64 start, struct page *page)
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	struct extent_state *state;
 	int num_copies;
-	int did_repair = 0;
 	int ret;
 
 	private = 0;
@@ -2127,7 +2125,6 @@ static int clean_io_failure(u64 start, struct page *page)
 		/* there was no real error, just free the record */
 		pr_debug("clean_io_failure: freeing dummy error at %llu\n",
 			 failrec->start);
-		did_repair = 1;
 		goto out;
 	}
 	if (fs_info->sb->s_flags & MS_RDONLY)
@@ -2144,19 +2141,16 @@ static int clean_io_failure(u64 start, struct page *page)
 		num_copies = btrfs_num_copies(fs_info, failrec->logical,
 					      failrec->len);
 		if (num_copies > 1)  {
-			ret = repair_io_failure(fs_info, start, failrec->len,
-						failrec->logical, page,
-						failrec->failed_mirror);
-			did_repair = !ret;
+			repair_io_failure(fs_info, start, failrec->len,
+					  failrec->logical, page,
+					  failrec->failed_mirror);
 		}
-		ret = 0;
 	}
 
 out:
-	if (!ret)
-		ret = free_io_failure(inode, failrec, did_repair);
+	free_io_failure(inode, failrec);
 
-	return ret;
+	return 0;
 }
 
 /*
@@ -2266,7 +2260,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		 */
 		pr_debug("bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-		free_io_failure(inode, failrec, 0);
+		free_io_failure(inode, failrec);
 		return -EIO;
 	}
 
@@ -2309,13 +2303,13 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	if (failrec->this_mirror > num_copies) {
 		pr_debug("bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-		free_io_failure(inode, failrec, 0);
+		free_io_failure(inode, failrec);
 		return -EIO;
 	}
 
 	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
 	if (!bio) {
-		free_io_failure(inode, failrec, 0);
+		free_io_failure(inode, failrec);
 		return -EIO;
 	}
 	bio->bi_end_io = failed_bio->bi_end_io;
@@ -2346,7 +2340,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 					 failrec->this_mirror,
 					 failrec->bio_flags, 0);
 	if (ret) {
-		free_io_failure(inode, failrec, 0);
+		free_io_failure(inode, failrec);
 		bio_put(bio);
 	}
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 07/12] Btrfs: split bio_readpage_error into several functions
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
                     ` (5 preceding siblings ...)
  2014-07-29  9:24   ` [PATCH v2 06/12] Btrfs: Cleanup unused variable and argument of IO failure handlers Miao Xie
@ 2014-07-29  9:24   ` Miao Xie
  2014-07-29  9:24   ` [PATCH v2 08/12] Btrfs: modify repair_io_failure and make it suit direct io Miao Xie
                     ` (4 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-07-29  9:24 UTC (permalink / raw)
  To: linux-btrfs

The data repair function for direct read will be implemented later, and some
of the code in bio_readpage_error will be reused by it, so split
bio_readpage_error into several functions that the direct read repair code
can call.
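
For orientation, the repair attempt composes the new helpers like this
(lightly condensed from the rewritten bio_readpage_error in the hunk below):

	ret = btrfs_get_io_failure_record(inode, start, end, &failrec);
	if (ret)
		return ret;

	ret = btrfs_check_repairable(inode, failed_bio, failrec, failed_mirror);
	if (!ret) {
		free_io_failure(inode, failrec);
		return -EIO;
	}

	phy_offset >>= inode->i_sb->s_blocksize_bits;
	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
				      start - page_offset(page),
				      (int)phy_offset, failed_bio->bi_end_io);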

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1-v2:
- None
---
 fs/btrfs/extent_io.c | 159 ++++++++++++++++++++++++++++++---------------------
 fs/btrfs/extent_io.h |  28 +++++++++
 2 files changed, 123 insertions(+), 64 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f71b34f..daa3e9c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1959,25 +1959,6 @@ static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
 		SetPageUptodate(page);
 }
 
-/*
- * When IO fails, either with EIO or csum verification fails, we
- * try other mirrors that might have a good copy of the data.  This
- * io_failure_record is used to record state as we go through all the
- * mirrors.  If another mirror has good data, the page is set up to date
- * and things continue.  If a good mirror can't be found, the original
- * bio end_io callback is called to indicate things have failed.
- */
-struct io_failure_record {
-	struct page *page;
-	u64 start;
-	u64 len;
-	u64 logical;
-	unsigned long bio_flags;
-	int this_mirror;
-	int failed_mirror;
-	int in_validation;
-};
-
 static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
 	int ret;
@@ -2153,40 +2134,24 @@ out:
 	return 0;
 }
 
-/*
- * this is a generic handler for readpage errors (default
- * readpage_io_failed_hook). if other copies exist, read those and write back
- * good data to the failed position. does not investigate in remapping the
- * failed extent elsewhere, hoping the device will be smart enough to do this as
- * needed
- */
-
-static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
-			      struct page *page, u64 start, u64 end,
-			      int failed_mirror)
+int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
+				struct io_failure_record **failrec_ret)
 {
-	struct io_failure_record *failrec = NULL;
+	struct io_failure_record *failrec;
 	u64 private;
 	struct extent_map *em;
-	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
-	struct bio *bio;
-	struct btrfs_io_bio *btrfs_failed_bio;
-	struct btrfs_io_bio *btrfs_bio;
-	int num_copies;
 	int ret;
-	int read_mode;
 	u64 logical;
 
-	BUG_ON(failed_bio->bi_rw & REQ_WRITE);
-
 	ret = get_state_private(failure_tree, start, &private);
 	if (ret) {
 		failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
 		if (!failrec)
 			return -ENOMEM;
+
 		failrec->start = start;
 		failrec->len = end - start + 1;
 		failrec->this_mirror = 0;
@@ -2206,11 +2171,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 			em = NULL;
 		}
 		read_unlock(&em_tree->lock);
-
 		if (!em) {
 			kfree(failrec);
 			return -EIO;
 		}
+
 		logical = start - em->start;
 		logical = em->block_start + logical;
 		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
@@ -2219,8 +2184,10 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 			extent_set_compress_type(&failrec->bio_flags,
 						 em->compress_type);
 		}
-		pr_debug("bio_readpage_error: (new) logical=%llu, start=%llu, "
-			 "len=%llu\n", logical, start, failrec->len);
+
+		pr_debug("Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu\n",
+			 logical, start, failrec->len);
+
 		failrec->logical = logical;
 		free_extent_map(em);
 
@@ -2240,8 +2207,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		}
 	} else {
 		failrec = (struct io_failure_record *)(unsigned long)private;
-		pr_debug("bio_readpage_error: (found) logical=%llu, "
-			 "start=%llu, len=%llu, validation=%d\n",
+		pr_debug("Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu, validation=%d\n",
 			 failrec->logical, failrec->start, failrec->len,
 			 failrec->in_validation);
 		/*
@@ -2250,6 +2216,17 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		 * clean_io_failure() clean all those errors at once.
 		 */
 	}
+
+	*failrec_ret = failrec;
+
+	return 0;
+}
+
+int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
+			   struct io_failure_record *failrec, int failed_mirror)
+{
+	int num_copies;
+
 	num_copies = btrfs_num_copies(BTRFS_I(inode)->root->fs_info,
 				      failrec->logical, failrec->len);
 	if (num_copies == 1) {
@@ -2258,10 +2235,9 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		 * all the retry and error correction code that follows. no
 		 * matter what the error is, it is very likely to persist.
 		 */
-		pr_debug("bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
+		pr_debug("Check Repairable: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-		free_io_failure(inode, failrec);
-		return -EIO;
+		return 0;
 	}
 
 	/*
@@ -2281,7 +2257,6 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		BUG_ON(failrec->in_validation);
 		failrec->in_validation = 1;
 		failrec->this_mirror = failed_mirror;
-		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
 	} else {
 		/*
 		 * we're ready to fulfill a) and b) alongside. get a good copy
@@ -2297,22 +2272,32 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		failrec->this_mirror++;
 		if (failrec->this_mirror == failed_mirror)
 			failrec->this_mirror++;
-		read_mode = READ_SYNC;
 	}
 
 	if (failrec->this_mirror > num_copies) {
-		pr_debug("bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
+		pr_debug("Check Repairable: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-		free_io_failure(inode, failrec);
-		return -EIO;
+		return 0;
 	}
 
+	return 1;
+}
+
+
+struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
+				    struct io_failure_record *failrec,
+				    struct page *page, int pg_offset, int icsum,
+				    bio_end_io_t *endio_func)
+{
+	struct bio *bio;
+	struct btrfs_io_bio *btrfs_failed_bio;
+	struct btrfs_io_bio *btrfs_bio;
+
 	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
-	if (!bio) {
-		free_io_failure(inode, failrec);
-		return -EIO;
-	}
-	bio->bi_end_io = failed_bio->bi_end_io;
+	if (!bio)
+		return NULL;
+
+	bio->bi_end_io = endio_func;
 	bio->bi_iter.bi_sector = failrec->logical >> 9;
 	bio->bi_bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
 	bio->bi_iter.bi_size = 0;
@@ -2324,17 +2309,63 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 
 		btrfs_bio = btrfs_io_bio(bio);
 		btrfs_bio->csum = btrfs_bio->csum_inline;
-		phy_offset >>= inode->i_sb->s_blocksize_bits;
-		phy_offset *= csum_size;
-		memcpy(btrfs_bio->csum, btrfs_failed_bio->csum + phy_offset,
+		icsum *= csum_size;
+		memcpy(btrfs_bio->csum, btrfs_failed_bio->csum + icsum,
 		       csum_size);
 	}
 
-	bio_add_page(bio, page, failrec->len, start - page_offset(page));
+	bio_add_page(bio, page, failrec->len, pg_offset);
+
+	return bio;
+}
+
+/*
+ * this is a generic handler for readpage errors (default
+ * readpage_io_failed_hook). if other copies exist, read those and write back
+ * good data to the failed position. does not investigate in remapping the
+ * failed extent elsewhere, hoping the device will be smart enough to do this as
+ * needed
+ */
+
+static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
+			      struct page *page, u64 start, u64 end,
+			      int failed_mirror)
+{
+	struct io_failure_record *failrec;
+	struct inode *inode = page->mapping->host;
+	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+	struct bio *bio;
+	int read_mode;
+	int ret;
+
+	BUG_ON(failed_bio->bi_rw & REQ_WRITE);
+
+	ret = btrfs_get_io_failure_record(inode, start, end, &failrec);
+	if (ret)
+		return ret;
+
+	ret = btrfs_check_repairable(inode, failed_bio, failrec, failed_mirror);
+	if (!ret) {
+		free_io_failure(inode, failrec);
+		return -EIO;
+	}
+
+	if (failed_bio->bi_vcnt > 1)
+		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
+	else
+		read_mode = READ_SYNC;
+
+	phy_offset >>= inode->i_sb->s_blocksize_bits;
+	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
+				      start - page_offset(page),
+				      (int)phy_offset, failed_bio->bi_end_io);
+	if (!bio) {
+		free_io_failure(inode, failrec);
+		return -EIO;
+	}
 
-	pr_debug("bio_readpage_error: submitting new read[%#x] to "
-		 "this_mirror=%d, num_copies=%d, in_validation=%d\n", read_mode,
-		 failrec->this_mirror, num_copies, failrec->in_validation);
+	pr_debug("Repair Read Error: submitting new read[%#x] to this_mirror=%d, in_validation=%d\n",
+		 read_mode, failrec->this_mirror, failrec->in_validation);
 
 	ret = tree->ops->submit_bio_hook(inode, read_mode, bio,
 					 failrec->this_mirror,
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index ccc264e..4ce0547 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -347,6 +347,34 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num);
+
+/*
+ * When IO fails, either with EIO or csum verification fails, we
+ * try other mirrors that might have a good copy of the data.  This
+ * io_failure_record is used to record state as we go through all the
+ * mirrors.  If another mirror has good data, the page is set up to date
+ * and things continue.  If a good mirror can't be found, the original
+ * bio end_io callback is called to indicate things have failed.
+ */
+struct io_failure_record {
+	struct page *page;
+	u64 start;
+	u64 len;
+	u64 logical;
+	unsigned long bio_flags;
+	int this_mirror;
+	int failed_mirror;
+	int in_validation;
+};
+
+int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
+				struct io_failure_record **failrec_ret);
+int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
+			   struct io_failure_record *failrec, int fail_mirror);
+struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
+				    struct io_failure_record *failrec,
+				    struct page *page, int pg_offset, int icsum,
+				    bio_end_io_t *endio_func);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 noinline u64 find_lock_delalloc_range(struct inode *inode,
 				      struct extent_io_tree *tree,
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 08/12] Btrfs: modify repair_io_failure and make it suit direct io
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
                     ` (6 preceding siblings ...)
  2014-07-29  9:24   ` [PATCH v2 07/12] Btrfs: split bio_readpage_error into several functions Miao Xie
@ 2014-07-29  9:24   ` Miao Xie
  2014-07-29  9:24   ` [PATCH v2 09/12] Btrfs: modify clean_io_failure " Miao Xie
                     ` (3 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-07-29  9:24 UTC (permalink / raw)
  To: linux-btrfs

The original repair_io_failure code was only used for buffered reads: it got
some filesystem data from the page structure, which is safe for pages in the
page cache. But when we do a direct read, the pages in the bio are not in the
page cache, so there is no filesystem data in the page structure. In order to
implement direct read data repair, we need to modify repair_io_failure and
pass all the filesystem data it needs via function parameters.
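
For example, after this change a buffered-read caller (whose page is in the
page cache) computes the in-page offset itself and passes it in, as in this
call taken from the hunks below:

	repair_io_failure(fs_info, start, failrec->len, failrec->logical,
			  page, start - page_offset(page),
			  failrec->failed_mirror);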

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1-v2:
- None
---
 fs/btrfs/extent_io.c | 8 +++++---
 fs/btrfs/extent_io.h | 2 +-
 fs/btrfs/scrub.c     | 1 +
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index daa3e9c..1389759 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1994,7 +1994,7 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
  */
 int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 			u64 length, u64 logical, struct page *page,
-			int mirror_num)
+			unsigned int pg_offset, int mirror_num)
 {
 	struct bio *bio;
 	struct btrfs_device *dev;
@@ -2033,7 +2033,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 		return -EIO;
 	}
 	bio->bi_bdev = dev->bdev;
-	bio_add_page(bio, page, length, start - page_offset(page));
+	bio_add_page(bio, page, length, pg_offset);
 
 	if (btrfsic_submit_bio_wait(WRITE_SYNC, bio)) {
 		/* try to remap that extent elsewhere? */
@@ -2064,7 +2064,8 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = extent_buffer_page(eb, i);
 		ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE,
-					start, p, mirror_num);
+					start, p, start - page_offset(p),
+					mirror_num);
 		if (ret)
 			break;
 		start += PAGE_CACHE_SIZE;
@@ -2124,6 +2125,7 @@ static int clean_io_failure(u64 start, struct page *page)
 		if (num_copies > 1)  {
 			repair_io_failure(fs_info, start, failrec->len,
 					  failrec->logical, page,
+					  start - page_offset(page),
 					  failrec->failed_mirror);
 		}
 	}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 4ce0547..4366453 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -343,7 +343,7 @@ struct btrfs_fs_info;
 
 int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 			u64 length, u64 logical, struct page *page,
-			int mirror_num);
+			unsigned int pg_offset, int mirror_num);
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b6d198f..0609245 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -684,6 +684,7 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
 		fs_info = BTRFS_I(inode)->root->fs_info;
 		ret = repair_io_failure(fs_info, offset, PAGE_SIZE,
 					fixup->logical, page,
+					offset - page_offset(page),
 					fixup->mirror_num);
 		unlock_page(page);
 		corrected = !ret;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 09/12] Btrfs: modify clean_io_failure and make it suit direct io
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
                     ` (7 preceding siblings ...)
  2014-07-29  9:24   ` [PATCH v2 08/12] Btrfs: modify repair_io_failure and make it suit direct io Miao Xie
@ 2014-07-29  9:24   ` Miao Xie
  2014-07-29  9:24   ` [PATCH v2 10/12] Btrfs: Set real mirror number for read operation on RAID0/5/6 Miao Xie
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-07-29  9:24 UTC (permalink / raw)
  To: linux-btrfs

We could not use clean_io_failure in the direct IO path because it got the
filesystem information from the page structure, but a page in a direct IO bio
does not carry that information. So modify it and pass all the information it
needs via parameters.
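
For instance, the direct read repair code added later in this series calls it
with an explicit inode and a zero in-page offset, because its pages never
came from the page cache:

	clean_io_failure(done->inode, done->start, bvec->bv_page, 0);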

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1-v2:
- None
---
 fs/btrfs/extent_io.c | 31 +++++++++++++++----------------
 fs/btrfs/extent_io.h |  6 +++---
 fs/btrfs/scrub.c     |  3 +--
 3 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1389759..8082220 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1992,10 +1992,10 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
  * currently, there can be no more than two copies of every data bit. thus,
  * exactly one rewrite is required.
  */
-int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
-			u64 length, u64 logical, struct page *page,
-			unsigned int pg_offset, int mirror_num)
+int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical,
+		      struct page *page, unsigned int pg_offset, int mirror_num)
 {
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	struct bio *bio;
 	struct btrfs_device *dev;
 	u64 map_length = 0;
@@ -2043,10 +2043,9 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 	}
 
 	printk_ratelimited_in_rcu(KERN_INFO
-			"BTRFS: read error corrected: ino %lu off %llu "
-		    "(dev %s sector %llu)\n", page->mapping->host->i_ino,
-		    start, rcu_str_deref(dev->name), sector);
-
+				  "BTRFS: read error corrected: ino %llu off %llu (dev %s sector %llu)\n",
+				  btrfs_ino(inode), start,
+				  rcu_str_deref(dev->name), sector);
 	bio_put(bio);
 	return 0;
 }
@@ -2063,9 +2062,10 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = extent_buffer_page(eb, i);
-		ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE,
-					start, p, start - page_offset(p),
-					mirror_num);
+
+		ret = repair_io_failure(root->fs_info->btree_inode, start,
+					PAGE_CACHE_SIZE, start, p,
+					start - page_offset(p), mirror_num);
 		if (ret)
 			break;
 		start += PAGE_CACHE_SIZE;
@@ -2078,12 +2078,12 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
  * each time an IO finishes, we do a fast check in the IO failure tree
  * to see if we need to process or clean up an io_failure_record
  */
-static int clean_io_failure(u64 start, struct page *page)
+static int clean_io_failure(struct inode *inode, u64 start,
+			    struct page *page, unsigned int pg_offset)
 {
 	u64 private;
 	u64 private_failure;
 	struct io_failure_record *failrec;
-	struct inode *inode = page->mapping->host;
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	struct extent_state *state;
 	int num_copies;
@@ -2123,10 +2123,9 @@ static int clean_io_failure(u64 start, struct page *page)
 		num_copies = btrfs_num_copies(fs_info, failrec->logical,
 					      failrec->len);
 		if (num_copies > 1)  {
-			repair_io_failure(fs_info, start, failrec->len,
+			repair_io_failure(inode, start, failrec->len,
 					  failrec->logical, page,
-					  start - page_offset(page),
-					  failrec->failed_mirror);
+					  pg_offset, failrec->failed_mirror);
 		}
 	}
 
@@ -2535,7 +2534,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 			if (ret)
 				uptodate = 0;
 			else
-				clean_io_failure(start, page);
+				clean_io_failure(inode, start, page, 0);
 		}
 
 		if (likely(uptodate))
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 4366453..7662eaa 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -341,9 +341,9 @@ struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask);
 
 struct btrfs_fs_info;
 
-int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
-			u64 length, u64 logical, struct page *page,
-			unsigned int pg_offset, int mirror_num);
+int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical,
+		      struct page *page, unsigned int pg_offset,
+		      int mirror_num);
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 0609245..149a528 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -681,8 +681,7 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
 			ret = -EIO;
 			goto out;
 		}
-		fs_info = BTRFS_I(inode)->root->fs_info;
-		ret = repair_io_failure(fs_info, offset, PAGE_SIZE,
+		ret = repair_io_failure(inode, offset, PAGE_SIZE,
 					fixup->logical, page,
 					offset - page_offset(page),
 					fixup->mirror_num);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 10/12] Btrfs: Set real mirror number for read operation on RAID0/5/6
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
                     ` (8 preceding siblings ...)
  2014-07-29  9:24   ` [PATCH v2 09/12] Btrfs: modify clean_io_failure " Miao Xie
@ 2014-07-29  9:24   ` Miao Xie
  2014-07-29  9:24   ` [PATCH v2 11/12] Btrfs: implement repair function when direct read fails Miao Xie
  2014-07-29  9:24   ` [PATCH v2 12/12] Btrfs: cleanup the read failure record after write or when the inode is freeing Miao Xie
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-07-29  9:24 UTC (permalink / raw)
  To: linux-btrfs

We need the real mirror number for RAID0/5/6 when reading data; otherwise, if
a read error happens, we would pass 0 as the number of the mirror on which
the io error happened. That is wrong and would cause the filesystem to read
the data from the corrupted mirror again.
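
In condensed form, the idea is to report mirror 1 for plain reads on stripes
that have only a single copy, so the recorded failed mirror is never 0 (a
sketch mirroring the hunks below):

	/*
	 * Plain read (not write/discard/get-read-mirrors): this stripe has
	 * exactly one copy, so report it as mirror 1 instead of leaving
	 * mirror_num == 0, which the repair code would treat as "mirror
	 * unknown" and then re-read the same corrupted copy.
	 */
	if (!(rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS)))
		mirror_num = 1;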

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1-v2:
- None
---
 fs/btrfs/volumes.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6cb82f6..eeb8291 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4955,6 +4955,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			num_stripes = min_t(u64, map->num_stripes,
 					    stripe_nr_end - stripe_nr_orig);
 		stripe_index = do_div(stripe_nr, map->num_stripes);
+		if (!(rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS)))
+			mirror_num = 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
 		if (rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS))
 			num_stripes = map->num_stripes;
@@ -5058,6 +5060,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			/* We distribute the parity blocks across stripes */
 			tmp = stripe_nr + stripe_index;
 			stripe_index = do_div(tmp, map->num_stripes);
+			if (!(rw & (REQ_WRITE | REQ_DISCARD |
+				    REQ_GET_READ_MIRRORS)) && mirror_num <= 1)
+				mirror_num = 1;
 		}
 	} else {
 		/*
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 11/12] Btrfs: implement repair function when direct read fails
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
                     ` (9 preceding siblings ...)
  2014-07-29  9:24   ` [PATCH v2 10/12] Btrfs: Set real mirror number for read operation on RAID0/5/6 Miao Xie
@ 2014-07-29  9:24   ` Miao Xie
  2014-08-29 18:31     ` Chris Mason
  2014-07-29  9:24   ` [PATCH v2 12/12] Btrfs: cleanup the read failure record after write or when the inode is freeing Miao Xie
  11 siblings, 1 reply; 49+ messages in thread
From: Miao Xie @ 2014-07-29  9:24 UTC (permalink / raw)
  To: linux-btrfs

This patch implements the data repair function for when a direct read fails;
a condensed sketch of the retry loop follows the list below.

The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
  mirror.
- After we get the right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we try the next
  mirror until we read the right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.
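
A condensed sketch of that retry loop, abstracted from
__btrfs_correct_data_nocsum in the diff below (dio_read_error() picks the
next mirror and resubmits, and the completion is signalled from the retry
bio's end io function):

	try_again:
		done.uptodate = 0;
		done.start = start;
		init_completion(&done.done);

		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page,
				     start, start + bvec->bv_len - 1,
				     io_bio->mirror_num,
				     btrfs_retry_endio_nocsum, &done);
		if (ret)
			return ret;

		wait_for_completion(&done.done);
		if (!done.uptodate)
			goto try_again; /* another mirror may still be good */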

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1-v2:
- None
---
 fs/btrfs/btrfs_inode.h |   2 +-
 fs/btrfs/disk-io.c     |  43 ++++++--
 fs/btrfs/disk-io.h     |   1 +
 fs/btrfs/extent_io.c   |  12 ++-
 fs/btrfs/extent_io.h   |   5 +-
 fs/btrfs/inode.c       | 276 +++++++++++++++++++++++++++++++++++++++++++++----
 6 files changed, 300 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 745fca40..20d4975 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -271,7 +271,7 @@ struct btrfs_dio_private {
 	 * The original bio may be split into several sub-bios; the data
 	 * check is done during the endio of those sub-bios
 	 */
-	int (*subio_endio)(struct inode *, struct btrfs_io_bio *);
+	int (*subio_endio)(struct inode *, struct btrfs_io_bio *, int);
 };
 
 /*
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 08e65e9..56b1546 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -691,6 +691,27 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
 	return -EIO;	/* we fixed nothing */
 }
 
+static inline void do_end_workqueue_fn(struct end_io_wq *end_io_wq)
+{
+	struct bio *bio = end_io_wq->bio;
+
+	bio->bi_private = end_io_wq->private;
+	bio->bi_end_io = end_io_wq->end_io;
+	bio_endio_nodec(bio, end_io_wq->error);
+	kfree(end_io_wq);
+}
+
+static void dio_end_workqueue_fn(struct work_struct *work)
+{
+	struct btrfs_work *bwork;
+	struct end_io_wq *end_io_wq;
+
+	bwork = container_of(work, struct btrfs_work, normal_work);
+	end_io_wq = container_of(bwork, struct end_io_wq, work);
+
+	do_end_workqueue_fn(end_io_wq);
+}
+
 static void end_workqueue_bio(struct bio *bio, int err)
 {
 	struct end_io_wq *end_io_wq = bio->bi_private;
@@ -698,7 +719,12 @@ static void end_workqueue_bio(struct bio *bio, int err)
 
 	fs_info = end_io_wq->info;
 	end_io_wq->error = err;
-	btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, NULL);
+
+	if (likely(end_io_wq->metadata != BTRFS_WQ_ENDIO_DIO_REPAIR))
+		btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL,
+				NULL);
+	else
+		INIT_WORK(&end_io_wq->work.normal_work, dio_end_workqueue_fn);
 
 	if (bio->bi_rw & REQ_WRITE) {
 		if (end_io_wq->metadata == BTRFS_WQ_ENDIO_METADATA)
@@ -714,7 +740,9 @@ static void end_workqueue_bio(struct bio *bio, int err)
 			btrfs_queue_work(fs_info->endio_write_workers,
 					 &end_io_wq->work);
 	} else {
-		if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56)
+		if (unlikely(end_io_wq->metadata == BTRFS_WQ_ENDIO_DIO_REPAIR))
+			queue_work(system_wq, &end_io_wq->work.normal_work);
+		else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56)
 			btrfs_queue_work(fs_info->endio_raid56_workers,
 					 &end_io_wq->work);
 		else if (end_io_wq->metadata)
@@ -738,6 +766,7 @@ int btrfs_bio_wq_end_io(struct btrfs_fs_info *info, struct bio *bio,
 			int metadata)
 {
 	struct end_io_wq *end_io_wq;
+
 	end_io_wq = kmalloc(sizeof(*end_io_wq), GFP_NOFS);
 	if (!end_io_wq)
 		return -ENOMEM;
@@ -1730,18 +1759,10 @@ static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
  */
 static void end_workqueue_fn(struct btrfs_work *work)
 {
-	struct bio *bio;
 	struct end_io_wq *end_io_wq;
-	int error;
 
 	end_io_wq = container_of(work, struct end_io_wq, work);
-	bio = end_io_wq->bio;
-
-	error = end_io_wq->error;
-	bio->bi_private = end_io_wq->private;
-	bio->bi_end_io = end_io_wq->end_io;
-	kfree(end_io_wq);
-	bio_endio_nodec(bio, error);
+	do_end_workqueue_fn(end_io_wq);
 }
 
 static int cleaner_kthread(void *arg)
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 23ce3ce..4fde7a0 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -30,6 +30,7 @@ enum {
 	BTRFS_WQ_ENDIO_METADATA = 1,
 	BTRFS_WQ_ENDIO_FREE_SPACE = 2,
 	BTRFS_WQ_ENDIO_RAID56 = 3,
+	BTRFS_WQ_ENDIO_DIO_REPAIR = 4,
 };
 
 static inline u64 btrfs_sb_offset(int mirror)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8082220..31600ef 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1959,7 +1959,7 @@ static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
 		SetPageUptodate(page);
 }
 
-static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
+int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
 	int ret;
 	int err = 0;
@@ -2078,8 +2078,8 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
  * each time an IO finishes, we do a fast check in the IO failure tree
  * to see if we need to process or clean up an io_failure_record
  */
-static int clean_io_failure(struct inode *inode, u64 start,
-			    struct page *page, unsigned int pg_offset)
+int clean_io_failure(struct inode *inode, u64 start, struct page *page,
+		     unsigned int pg_offset)
 {
 	u64 private;
 	u64 private_failure;
@@ -2288,7 +2288,7 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
 				    struct io_failure_record *failrec,
 				    struct page *page, int pg_offset, int icsum,
-				    bio_end_io_t *endio_func)
+				    bio_end_io_t *endio_func, void *data)
 {
 	struct bio *bio;
 	struct btrfs_io_bio *btrfs_failed_bio;
@@ -2302,6 +2302,7 @@ struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
 	bio->bi_iter.bi_sector = failrec->logical >> 9;
 	bio->bi_bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
 	bio->bi_iter.bi_size = 0;
+	bio->bi_private = data;
 
 	btrfs_failed_bio = btrfs_io_bio(failed_bio);
 	if (btrfs_failed_bio->csum) {
@@ -2359,7 +2360,8 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	phy_offset >>= inode->i_sb->s_blocksize_bits;
 	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
 				      start - page_offset(page),
-				      (int)phy_offset, failed_bio->bi_end_io);
+				      (int)phy_offset, failed_bio->bi_end_io,
+				      NULL);
 	if (!bio) {
 		free_io_failure(inode, failrec);
 		return -EIO;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 7662eaa..b23c7c2 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -344,6 +344,8 @@ struct btrfs_fs_info;
 int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical,
 		      struct page *page, unsigned int pg_offset,
 		      int mirror_num);
+int clean_io_failure(struct inode *inode, u64 start, struct page *page,
+		     unsigned int pg_offset);
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num);
@@ -374,7 +376,8 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
 				    struct io_failure_record *failrec,
 				    struct page *page, int pg_offset, int icsum,
-				    bio_end_io_t *endio_func);
+				    bio_end_io_t *endio_func, void *data);
+int free_io_failure(struct inode *inode, struct io_failure_record *rec);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 noinline u64 find_lock_delalloc_range(struct inode *inode,
 				      struct extent_io_tree *tree,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3e95a2b..e087189 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7083,30 +7083,267 @@ unlock_err:
 	return ret;
 }
 
-static int btrfs_subio_endio_read(struct inode *inode,
-				  struct btrfs_io_bio *io_bio)
+static inline int submit_dio_repair_bio(struct inode *inode, struct bio *bio,
+					int rw, int mirror_num)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	int ret;
+
+	BUG_ON(rw & REQ_WRITE);
+
+	bio_get(bio);
+
+	ret = btrfs_bio_wq_end_io(root->fs_info, bio,
+				  BTRFS_WQ_ENDIO_DIO_REPAIR);
+	if (ret)
+		goto err;
+
+	ret = btrfs_map_bio(root, rw, bio, mirror_num, 0);
+err:
+	bio_put(bio);
+	return ret;
+}
+
+static int btrfs_check_dio_repairable(struct inode *inode,
+				      struct bio *failed_bio,
+				      struct io_failure_record *failrec,
+				      int failed_mirror)
+{
+	int num_copies;
+
+	num_copies = btrfs_num_copies(BTRFS_I(inode)->root->fs_info,
+				      failrec->logical, failrec->len);
+	if (num_copies == 1) {
+		/*
+		 * we only have a single copy of the data, so don't bother with
+		 * all the retry and error correction code that follows. no
+		 * matter what the error is, it is very likely to persist.
+		 */
+		pr_debug("Check DIO Repairable: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
+			 num_copies, failrec->this_mirror, failed_mirror);
+		return 0;
+	}
+
+	failrec->failed_mirror = failed_mirror;
+	failrec->this_mirror++;
+	if (failrec->this_mirror == failed_mirror)
+		failrec->this_mirror++;
+
+	if (failrec->this_mirror > num_copies) {
+		pr_debug("Check DIO Repairable: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
+			 num_copies, failrec->this_mirror, failed_mirror);
+		return 0;
+	}
+
+	return 1;
+}
+
+static int dio_read_error(struct inode *inode, struct bio *failed_bio,
+			  struct page *page, u64 start, u64 end,
+			  int failed_mirror, bio_end_io_t *repair_endio,
+			  void *repair_arg)
+{
+	struct io_failure_record *failrec;
+	struct bio *bio;
+	int isector;
+	int read_mode;
+	int ret;
+
+	BUG_ON(failed_bio->bi_rw & REQ_WRITE);
+
+	ret = btrfs_get_io_failure_record(inode, start, end, &failrec);
+	if (ret)
+		return ret;
+
+	ret = btrfs_check_dio_repairable(inode, failed_bio, failrec,
+					 failed_mirror);
+	if (!ret) {
+		free_io_failure(inode, failrec);
+		return -EIO;
+	}
+
+	if (failed_bio->bi_vcnt > 1)
+		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
+	else
+		read_mode = READ_SYNC;
+
+	isector = start - btrfs_io_bio(failed_bio)->logical;
+	isector >>= inode->i_sb->s_blocksize_bits;
+	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
+				      0, isector, repair_endio, repair_arg);
+	if (!bio) {
+		free_io_failure(inode, failrec);
+		return -EIO;
+	}
+
+	btrfs_debug(BTRFS_I(inode)->root->fs_info,
+		    "Repair DIO Read Error: submitting new dio read[%#x] to this_mirror=%d, in_validation=%d\n",
+		    read_mode, failrec->this_mirror, failrec->in_validation);
+
+	ret = submit_dio_repair_bio(inode, bio, read_mode,
+				    failrec->this_mirror);
+	if (ret) {
+		free_io_failure(inode, failrec);
+		bio_put(bio);
+	}
+
+	return ret;
+}
+
+struct btrfs_retry_complete {
+	struct completion done;
+	struct inode *inode;
+	u64 start;
+	int uptodate;
+};
+
+static void btrfs_retry_endio_nocsum(struct bio *bio, int err)
+{
+	struct btrfs_retry_complete *done = bio->bi_private;
+	struct bio_vec *bvec;
+	int i;
+
+	if (err)
+		goto end;
+
+	done->uptodate = 1;
+	bio_for_each_segment_all(bvec, bio, i)
+		clean_io_failure(done->inode, done->start, bvec->bv_page, 0);
+end:
+	complete(&done->done);
+	bio_put(bio);
+}
+
+static int __btrfs_correct_data_nocsum(struct inode *inode,
+				       struct btrfs_io_bio *io_bio)
 {
 	struct bio_vec *bvec;
+	struct btrfs_retry_complete done;
 	u64 start;
 	int i;
 	int ret;
-	int err = 0;
 
-	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
-		return 0;
+	start = io_bio->logical;
+	done.inode = inode;
+
+	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
+try_again:
+		done.uptodate = 0;
+		done.start = start;
+		init_completion(&done.done);
+
+		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
+				     start + bvec->bv_len - 1,
+				     io_bio->mirror_num,
+				     btrfs_retry_endio_nocsum, &done);
+		if (ret)
+			return ret;
+
+		wait_for_completion(&done.done);
+
+		if (!done.uptodate) {
+			/* We might have another mirror, so try again */
+			goto try_again;
+		}
+
+		start += bvec->bv_len;
+	}
+
+	return 0;
+}
+
+static void btrfs_retry_endio(struct bio *bio, int err)
+{
+	struct btrfs_retry_complete *done = bio->bi_private;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	struct bio_vec *bvec;
+	int uptodate;
+	int ret;
+	int i;
+
+	if (err)
+		goto end;
+
+	uptodate = 1;
+	bio_for_each_segment_all(bvec, bio, i) {
+		ret = __readpage_endio_check(done->inode, io_bio, i,
+					     bvec->bv_page, 0,
+					     done->start, bvec->bv_len);
+		if (!ret)
+			clean_io_failure(done->inode, done->start,
+					 bvec->bv_page, 0);
+		else
+			uptodate = 0;
+	}
+
+	done->uptodate = uptodate;
+end:
+	complete(&done->done);
+	bio_put(bio);
+}
 
+static int __btrfs_subio_endio_read(struct inode *inode,
+				    struct btrfs_io_bio *io_bio, int err)
+{
+	struct bio_vec *bvec;
+	struct btrfs_retry_complete done;
+	u64 start;
+	u64 offset = 0;
+	int i;
+	int ret;
+
+	err = 0;
 	start = io_bio->logical;
+	done.inode = inode;
+
 	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
 		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
 					     0, start, bvec->bv_len);
-		if (ret)
-			err = -EIO;
+		if (likely(!ret))
+			goto next;
+try_again:
+		done.uptodate = 0;
+		done.start = start;
+		init_completion(&done.done);
+
+		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
+				     start + bvec->bv_len - 1,
+				     io_bio->mirror_num,
+				     btrfs_retry_endio, &done);
+		if (ret) {
+			err = ret;
+			goto next;
+		}
+
+		wait_for_completion(&done.done);
+
+		if (!done.uptodate) {
+			/* We might have another mirror, so try again */
+			goto try_again;
+		}
+next:
+		offset += bvec->bv_len;
 		start += bvec->bv_len;
 	}
 
 	return err;
 }
 
+static int btrfs_subio_endio_read(struct inode *inode,
+				  struct btrfs_io_bio *io_bio, int err)
+{
+	bool skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
+
+	if (skip_csum) {
+		if (unlikely(err))
+			return __btrfs_correct_data_nocsum(inode, io_bio);
+		else
+			return 0;
+	} else {
+		return __btrfs_subio_endio_read(inode, io_bio, err);
+	}
+}
+
 static void btrfs_endio_direct_read(struct bio *bio, int err)
 {
 	struct btrfs_dio_private *dip = bio->bi_private;
@@ -7114,8 +7351,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	struct bio *dio_bio;
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
 
-	if (!err && (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED))
-		err = btrfs_subio_endio_read(inode, io_bio);
+	if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED)
+		err = btrfs_subio_endio_read(inode, io_bio, err);
 
 	unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
 		      dip->logical_offset + dip->bytes - 1);
@@ -7193,19 +7430,16 @@ static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
 static void btrfs_end_dio_bio(struct bio *bio, int err)
 {
 	struct btrfs_dio_private *dip = bio->bi_private;
-	int ret;
 
-	if (err) {
-		btrfs_err(BTRFS_I(dip->inode)->root->fs_info,
-			  "direct IO failed ino %llu rw %lu sector %#Lx len %u err no %d",
-		      btrfs_ino(dip->inode), bio->bi_rw,
-		      (unsigned long long)bio->bi_iter.bi_sector,
-		      bio->bi_iter.bi_size, err);
-	} else if (dip->subio_endio) {
-		ret = dip->subio_endio(dip->inode, btrfs_io_bio(bio));
-		if (ret)
-			err = ret;
-	}
+	if (err)
+		btrfs_warn(BTRFS_I(dip->inode)->root->fs_info,
+			   "direct IO failed ino %llu rw %lu sector %#Lx len %u err no %d",
+			   btrfs_ino(dip->inode), bio->bi_rw,
+			   (unsigned long long)bio->bi_iter.bi_sector,
+			   bio->bi_iter.bi_size, err);
+
+	if (dip->subio_endio)
+		err = dip->subio_endio(dip->inode, btrfs_io_bio(bio), err);
 
 	if (err) {
 		dip->errors = 1;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 12/12] Btrfs: cleanup the read failure record after write or when the inode is freeing
  2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
                     ` (10 preceding siblings ...)
  2014-07-29  9:24   ` [PATCH v2 11/12] Btrfs: implement repair function when direct read fails Miao Xie
@ 2014-07-29  9:24   ` Miao Xie
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-07-29  9:24 UTC (permalink / raw)
  To: linux-btrfs

After the data is written successfully, we should clean up the read failure
record in that range, because:
- If the file is set to data COW, the range that the failure record points to
  is mapped to a new place, so the record is invalid.
- If the file is set to no data COW and there is no error during the write,
  the corrupted data is corrected, so the failure record can be removed. And
  if some errors happen on the mirrors, we needn't worry about them either,
  because the failure record will be recreated if we read the same place
  again.

Sometimes we may fail to correct the data, so failure records will be left in
the tree; we need to free them when we free the inode, or the memory is
leaked. A condensed view of the two call sites follows.
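
The two call sites added by this patch, in condensed form:

	/* after an ordered extent completes successfully: */
	btrfs_free_io_failure_record(inode, ordered_extent->file_offset,
				     ordered_extent->file_offset +
				     ordered_extent->len - 1);

	/* and when the inode is evicted, so leftover records don't leak: */
	btrfs_free_io_failure_record(inode, 0, (u64)-1);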

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1-v2:
- None
---
 fs/btrfs/extent_io.c | 34 ++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_io.h |  1 +
 fs/btrfs/inode.c     |  6 ++++++
 3 files changed, 41 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 31600ef..39783e7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2135,6 +2135,40 @@ out:
 	return 0;
 }
 
+/*
+ * Can be called when
+ * - hold extent lock
+ * - under ordered extent
+ * - the inode is freeing
+ */
+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end)
+{
+	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
+	struct io_failure_record *failrec;
+	struct extent_state *state, *next;
+
+	if (RB_EMPTY_ROOT(&failure_tree->state))
+		return;
+
+	spin_lock(&failure_tree->lock);
+	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
+	while (state) {
+		if (state->start > end)
+			break;
+
+		ASSERT(state->end <= end);
+
+		next = next_state(state);
+
+		failrec = (struct io_failure_record *)state->private;
+		free_extent_state(state);
+		kfree(failrec);
+
+		state = next;
+	}
+	spin_unlock(&failure_tree->lock);
+}
+
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
 				struct io_failure_record **failrec_ret)
 {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index b23c7c2..5c48eda 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -369,6 +369,7 @@ struct io_failure_record {
 	int in_validation;
 };
 
+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end);
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
 				struct io_failure_record **failrec_ret);
 int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e087189..56bd9c1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2639,6 +2639,10 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		goto out;
 	}
 
+	btrfs_free_io_failure_record(inode, ordered_extent->file_offset,
+				     ordered_extent->file_offset +
+				     ordered_extent->len - 1);
+
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
 		truncated = true;
 		logical_len = ordered_extent->truncated_len;
@@ -4723,6 +4727,8 @@ void btrfs_evict_inode(struct inode *inode)
 	/* do we really want it for ->i_nlink > 0 and zero btrfs_root_refs? */
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
+	btrfs_free_io_failure_record(inode, 0, (u64)-1);
+
 	if (root->fs_info->log_root_recovering) {
 		BUG_ON(test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
 				 &BTRFS_I(inode)->runtime_flags));
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 02/12] Btrfs: load checksum data once when submitting a direct read io
  2014-07-29  9:24   ` [PATCH v2 02/12] Btrfs: load checksum data once when submitting a direct read io Miao Xie
@ 2014-08-08  0:32     ` Filipe David Manana
  2014-08-08  9:22       ` Miao Xie
  2014-08-08  9:23       ` [PATCH v3 " Miao Xie
  0 siblings, 2 replies; 49+ messages in thread
From: Filipe David Manana @ 2014-08-08  0:32 UTC (permalink / raw)
  To: Miao Xie; +Cc: linux-btrfs

On Tue, Jul 29, 2014 at 10:24 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> The current code would load the checksum data several times when we split a
> whole direct read io because of the limit of the raid stripe, making us
> search the csum tree several times. That just wasted time and made the
> contention on the csum tree root more serious. This patch improves on this
> by loading the data at once.
>
> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
> ---
> Changelog v1 -> v2:
> - Remove the __GFP_ZERO flag in btrfs_submit_direct because it would trigger
>   a WARNING. It was reported by Filipe David Manana, thanks.

Hi Miao,

Without this flag (to zero the struct), I get a general protection fault at:

(gdb) list *(btrfs_endio_direct_read+0xb8)
0x38378 is in btrfs_endio_direct_read (fs/btrfs/inode.c:7242).
7237 if (err)
7238 clear_bit(BIO_UPTODATE, &dio_bio->bi_flags);
7239 dio_end_io(dio_bio, err);
7240
7241 if (io_bio->end_io)
7242 io_bio->end_io(io_bio, err);
7243 bio_put(bio);
7244 }

Happens when running xfstests/generic/091
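
Whatever the root cause turns out to be, the failure mode fits an
uninitialized callback: bio_clone_bioset() copies only the embedded
struct bio, so the surrounding btrfs_io_bio fields (csum, csum_allocated,
end_io) must be reset by hand after cloning, and jumping through a stale
end_io pointer faults. A self-contained userspace sketch of the intended
pattern, with invented names (io_bio_demo, clone_io_bio):

#include <stdlib.h>

struct io_bio_demo {
	void (*end_io)(struct io_bio_demo *self, int err);
	void *csum;
	void *csum_allocated;
};

static struct io_bio_demo *clone_io_bio(void)
{
	struct io_bio_demo *new = malloc(sizeof(*new));

	if (new) {
		/* the clone primitive does not zero the container,
		 * so clear the container-only fields explicitly */
		new->csum = NULL;
		new->csum_allocated = NULL;
		new->end_io = NULL;
	}
	return new;
}

int main(void)
{
	struct io_bio_demo *b = clone_io_bio();

	if (b && b->end_io)	/* guard the callback, as the endio path does */
		b->end_io(b, 0);
	free(b);
	return 0;
}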


> ---
>  fs/btrfs/btrfs_inode.h |  1 -
>  fs/btrfs/ctree.h       |  3 +--
>  fs/btrfs/extent_io.c   | 13 +++++++++++--
>  fs/btrfs/file-item.c   | 14 ++------------
>  fs/btrfs/inode.c       | 38 +++++++++++++++++++++-----------------
>  5 files changed, 35 insertions(+), 34 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index a0cf3e5..b69bf7e 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -263,7 +263,6 @@ struct btrfs_dio_private {
>
>         /* dio_bio came from fs/direct-io.c */
>         struct bio *dio_bio;
> -       u8 csum[0];
>  };
>
>  /*
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index be91397..40e9938 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3739,8 +3739,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
>  int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
>                           struct bio *bio, u32 *dst);
>  int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
> -                             struct btrfs_dio_private *dip, struct bio *bio,
> -                             u64 logical_offset);
> +                             struct bio *bio, u64 logical_offset);
>  int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
>                              struct btrfs_root *root,
>                              u64 objectid, u64 pos,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 23398ad..0fb63c4 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2617,9 +2617,18 @@ btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
>
>  struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
>  {
> -       return bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
> -}
> +       struct btrfs_io_bio *btrfs_bio;
> +       struct bio *new;
>
> +       new = bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
> +       if (new) {
> +               btrfs_bio = btrfs_io_bio(new);
> +               btrfs_bio->csum = NULL;
> +               btrfs_bio->csum_allocated = NULL;
> +               btrfs_bio->end_io = NULL;
> +       }
> +       return bio;
> +}
>
>  /* this also allocates from the btrfs_bioset */
>  struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index f46cfe4..cf1b94f 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -299,19 +299,9 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
>  }
>
>  int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
> -                             struct btrfs_dio_private *dip, struct bio *bio,
> -                             u64 offset)
> +                             struct bio *bio, u64 offset)
>  {
> -       int len = (bio->bi_iter.bi_sector << 9) - dip->disk_bytenr;
> -       u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
> -       int ret;
> -
> -       len >>= inode->i_sb->s_blocksize_bits;
> -       len *= csum_size;
> -
> -       ret = __btrfs_lookup_bio_sums(root, inode, bio, offset,
> -                                     (u32 *)(dip->csum + len), 1);
> -       return ret;
> +       return __btrfs_lookup_bio_sums(root, inode, bio, offset, NULL, 1);
>  }
>
>  int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 548489e..fd88126 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7081,7 +7081,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
>         struct inode *inode = dip->inode;
>         struct btrfs_root *root = BTRFS_I(inode)->root;
>         struct bio *dio_bio;
> -       u32 *csums = (u32 *)dip->csum;
> +       struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> +       u32 *csums = (u32 *)io_bio->csum;
>         u64 start;
>         int i;
>
> @@ -7123,6 +7124,9 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
>         if (err)
>                 clear_bit(BIO_UPTODATE, &dio_bio->bi_flags);
>         dio_end_io(dio_bio, err);
> +
> +       if (io_bio->end_io)
> +               io_bio->end_io(io_bio, err);
>         bio_put(bio);
>  }
>
> @@ -7261,13 +7265,20 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
>                 ret = btrfs_csum_one_bio(root, inode, bio, file_offset, 1);
>                 if (ret)
>                         goto err;
> -       } else if (!skip_sum) {
> -               ret = btrfs_lookup_bio_sums_dio(root, inode, dip, bio,
> +       } else {
> +               /*
> +                * We have loaded all the csum data we need when we submit
> +                * the first bio, so skip it.
> +                */
> +               if (dip->logical_offset != file_offset)
> +                       goto map;
> +
> +               /* Load all csum data at once. */
> +               ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
>                                                 file_offset);
>                 if (ret)
>                         goto err;
>         }
> -
>  map:
>         ret = btrfs_map_bio(root, rw, bio, 0, async_submit);
>  err:
> @@ -7288,7 +7299,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
>         u64 submit_len = 0;
>         u64 map_length;
>         int nr_pages = 0;
> -       int ret = 0;
> +       int ret;
>         int async_submit = 0;
>
>         map_length = orig_bio->bi_iter.bi_size;
> @@ -7392,11 +7403,10 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
>         struct btrfs_root *root = BTRFS_I(inode)->root;
>         struct btrfs_dio_private *dip;
>         struct bio *io_bio;
> +       struct btrfs_io_bio *btrfs_bio;
>         int skip_sum;
> -       int sum_len;
>         int write = rw & REQ_WRITE;
>         int ret = 0;
> -       u16 csum_size;
>
>         skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
>
> @@ -7406,16 +7416,7 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
>                 goto free_ordered;
>         }
>
> -       if (!skip_sum && !write) {
> -               csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
> -               sum_len = dio_bio->bi_iter.bi_size >>
> -                       inode->i_sb->s_blocksize_bits;
> -               sum_len *= csum_size;
> -       } else {
> -               sum_len = 0;
> -       }
> -
> -       dip = kmalloc(sizeof(*dip) + sum_len, GFP_NOFS);
> +       dip = kmalloc(sizeof(*dip), GFP_NOFS);
>         if (!dip) {
>                 ret = -ENOMEM;
>                 goto free_io_bio;
> @@ -7441,6 +7442,9 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
>         if (!ret)
>                 return;
>
> +       btrfs_bio = btrfs_io_bio(io_bio);
> +       if (btrfs_bio->end_io)
> +               btrfs_bio->end_io(btrfs_bio, ret);
>  free_io_bio:
>         bio_put(io_bio);
>
> --
> 1.9.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 02/12] Btrfs: load checksum data once when submitting a direct read io
  2014-08-08  0:32     ` Filipe David Manana
@ 2014-08-08  9:22       ` Miao Xie
  2014-08-08  9:23       ` [PATCH v3 " Miao Xie
  1 sibling, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-08-08  9:22 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Fri, 8 Aug 2014 01:32:00 +0100, Filipe David Manana wrote:
> On Tue, Jul 29, 2014 at 10:24 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>> The current code would load the checksum data several times when we split
>> a whole direct read io because of the raid stripe limit, making us search
>> the csum tree repeatedly. That just wasted time and made contention on
>> the csum tree root more serious. This patch fixes the problem by loading
>> all the data at once.
>>
>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>> ---
>> Changelog v1 -> v2:
>> - Remove the __GFP_ZERO flag in btrfs_submit_direct because it would trigger
>>   a WARNING. It was reported by Filipe David Manana, thanks.
> 
> Hi Miao,
> 
> Without this flag (to zero the struct), I get a general protection fault at:
> 
> (gdb) list *(btrfs_endio_direct_read+0xb8)
> 0x38378 is in btrfs_endio_direct_read (fs/btrfs/inode.c:7242).
> 7237 if (err)
> 7238 clear_bit(BIO_UPTODATE, &dio_bio->bi_flags);
> 7239 dio_end_io(dio_bio, err);
> 7240
> 7241 if (io_bio->end_io)
> 7242 io_bio->end_io(io_bio, err);
> 7243 bio_put(bio);
> 7244 }
> 
> Happens when running xfstests/generic/091

Sorry, I sent out the wrong version of this patchset, which has a bug
in btrfs_bio_clone.

I will send a new version of this patch. The other patches are unchanged
and can be applied cleanly on top of the new version, so I won't resend
them. Of course, the whole patchset can also be pulled from

https://github.com/miaoxie/linux-btrfs.git for-Chris

Thanks
Miao

> 
> 
>> ---
>>  fs/btrfs/btrfs_inode.h |  1 -
>>  fs/btrfs/ctree.h       |  3 +--
>>  fs/btrfs/extent_io.c   | 13 +++++++++++--
>>  fs/btrfs/file-item.c   | 14 ++------------
>>  fs/btrfs/inode.c       | 38 +++++++++++++++++++++-----------------
>>  5 files changed, 35 insertions(+), 34 deletions(-)
>>
>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>> index a0cf3e5..b69bf7e 100644
>> --- a/fs/btrfs/btrfs_inode.h
>> +++ b/fs/btrfs/btrfs_inode.h
>> @@ -263,7 +263,6 @@ struct btrfs_dio_private {
>>
>>         /* dio_bio came from fs/direct-io.c */
>>         struct bio *dio_bio;
>> -       u8 csum[0];
>>  };
>>
>>  /*
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index be91397..40e9938 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -3739,8 +3739,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
>>  int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
>>                           struct bio *bio, u32 *dst);
>>  int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
>> -                             struct btrfs_dio_private *dip, struct bio *bio,
>> -                             u64 logical_offset);
>> +                             struct bio *bio, u64 logical_offset);
>>  int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
>>                              struct btrfs_root *root,
>>                              u64 objectid, u64 pos,
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 23398ad..0fb63c4 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2617,9 +2617,18 @@ btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
>>
>>  struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
>>  {
>> -       return bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
>> -}
>> +       struct btrfs_io_bio *btrfs_bio;
>> +       struct bio *new;
>>
>> +       new = bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
>> +       if (new) {
>> +               btrfs_bio = btrfs_io_bio(new);
>> +               btrfs_bio->csum = NULL;
>> +               btrfs_bio->csum_allocated = NULL;
>> +               btrfs_bio->end_io = NULL;
>> +       }
>> +       return bio;
>> +}
>>
>>  /* this also allocates from the btrfs_bioset */
>>  struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
>> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
>> index f46cfe4..cf1b94f 100644
>> --- a/fs/btrfs/file-item.c
>> +++ b/fs/btrfs/file-item.c
>> @@ -299,19 +299,9 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
>>  }
>>
>>  int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
>> -                             struct btrfs_dio_private *dip, struct bio *bio,
>> -                             u64 offset)
>> +                             struct bio *bio, u64 offset)
>>  {
>> -       int len = (bio->bi_iter.bi_sector << 9) - dip->disk_bytenr;
>> -       u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
>> -       int ret;
>> -
>> -       len >>= inode->i_sb->s_blocksize_bits;
>> -       len *= csum_size;
>> -
>> -       ret = __btrfs_lookup_bio_sums(root, inode, bio, offset,
>> -                                     (u32 *)(dip->csum + len), 1);
>> -       return ret;
>> +       return __btrfs_lookup_bio_sums(root, inode, bio, offset, NULL, 1);
>>  }
>>
>>  int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 548489e..fd88126 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -7081,7 +7081,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
>>         struct inode *inode = dip->inode;
>>         struct btrfs_root *root = BTRFS_I(inode)->root;
>>         struct bio *dio_bio;
>> -       u32 *csums = (u32 *)dip->csum;
>> +       struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
>> +       u32 *csums = (u32 *)io_bio->csum;
>>         u64 start;
>>         int i;
>>
>> @@ -7123,6 +7124,9 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
>>         if (err)
>>                 clear_bit(BIO_UPTODATE, &dio_bio->bi_flags);
>>         dio_end_io(dio_bio, err);
>> +
>> +       if (io_bio->end_io)
>> +               io_bio->end_io(io_bio, err);
>>         bio_put(bio);
>>  }
>>
>> @@ -7261,13 +7265,20 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
>>                 ret = btrfs_csum_one_bio(root, inode, bio, file_offset, 1);
>>                 if (ret)
>>                         goto err;
>> -       } else if (!skip_sum) {
>> -               ret = btrfs_lookup_bio_sums_dio(root, inode, dip, bio,
>> +       } else {
>> +               /*
>> +                * We have loaded all the csum data we need when we submit
>> +                * the first bio, so skip it.
>> +                */
>> +               if (dip->logical_offset != file_offset)
>> +                       goto map;
>> +
>> +               /* Load all csum data at once. */
>> +               ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
>>                                                 file_offset);
>>                 if (ret)
>>                         goto err;
>>         }
>> -
>>  map:
>>         ret = btrfs_map_bio(root, rw, bio, 0, async_submit);
>>  err:
>> @@ -7288,7 +7299,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
>>         u64 submit_len = 0;
>>         u64 map_length;
>>         int nr_pages = 0;
>> -       int ret = 0;
>> +       int ret;
>>         int async_submit = 0;
>>
>>         map_length = orig_bio->bi_iter.bi_size;
>> @@ -7392,11 +7403,10 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
>>         struct btrfs_root *root = BTRFS_I(inode)->root;
>>         struct btrfs_dio_private *dip;
>>         struct bio *io_bio;
>> +       struct btrfs_io_bio *btrfs_bio;
>>         int skip_sum;
>> -       int sum_len;
>>         int write = rw & REQ_WRITE;
>>         int ret = 0;
>> -       u16 csum_size;
>>
>>         skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
>>
>> @@ -7406,16 +7416,7 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
>>                 goto free_ordered;
>>         }
>>
>> -       if (!skip_sum && !write) {
>> -               csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
>> -               sum_len = dio_bio->bi_iter.bi_size >>
>> -                       inode->i_sb->s_blocksize_bits;
>> -               sum_len *= csum_size;
>> -       } else {
>> -               sum_len = 0;
>> -       }
>> -
>> -       dip = kmalloc(sizeof(*dip) + sum_len, GFP_NOFS);
>> +       dip = kmalloc(sizeof(*dip), GFP_NOFS);
>>         if (!dip) {
>>                 ret = -ENOMEM;
>>                 goto free_io_bio;
>> @@ -7441,6 +7442,9 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
>>         if (!ret)
>>                 return;
>>
>> +       btrfs_bio = btrfs_io_bio(io_bio);
>> +       if (btrfs_bio->end_io)
>> +               btrfs_bio->end_io(btrfs_bio, ret);
>>  free_io_bio:
>>         bio_put(io_bio);
>>
>> --
>> 1.9.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v3 02/12] Btrfs: load checksum data once when submitting a direct read io
  2014-08-08  0:32     ` Filipe David Manana
  2014-08-08  9:22       ` Miao Xie
@ 2014-08-08  9:23       ` Miao Xie
  1 sibling, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-08-08  9:23 UTC (permalink / raw)
  To: linux-btrfs; +Cc: fdmanana

The current code would load the checksum data several times when we split
a whole direct read io because of the raid stripe limit, making us search
the csum tree repeatedly. That just wasted time and made contention on
the csum tree root more serious. This patch fixes the problem by loading
all the data at once.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v2 -> v3:
- Fix the wrong return value of btrfs_bio_clone

Changelog v1 -> v2:
- Remove the __GFP_ZERO flag in btrfs_submit_direct because it would trigger
  a WARNING. It was reported by Filipe David Manana, thanks.
---
 fs/btrfs/btrfs_inode.h |  1 -
 fs/btrfs/ctree.h       |  3 +--
 fs/btrfs/extent_io.c   | 13 +++++++++++--
 fs/btrfs/file-item.c   | 14 ++------------
 fs/btrfs/inode.c       | 38 +++++++++++++++++++++-----------------
 5 files changed, 35 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index a0cf3e5..b69bf7e 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -263,7 +263,6 @@ struct btrfs_dio_private {
 
 	/* dio_bio came from fs/direct-io.c */
 	struct bio *dio_bio;
-	u8 csum[0];
 };
 
 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index be91397..40e9938 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3739,8 +3739,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 			  struct bio *bio, u32 *dst);
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
-			      struct btrfs_dio_private *dip, struct bio *bio,
-			      u64 logical_offset);
+			      struct bio *bio, u64 logical_offset);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
 			     u64 objectid, u64 pos,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 23398ad..0fb63c4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2617,9 +2617,18 @@ btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
 
 struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
 {
-	return bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
-}
+	struct btrfs_io_bio *btrfs_bio;
+	struct bio *new;
 
+	new = bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
+	if (new) {
+		btrfs_bio = btrfs_io_bio(new);
+		btrfs_bio->csum = NULL;
+		btrfs_bio->csum_allocated = NULL;
+		btrfs_bio->end_io = NULL;
+	}
+	return new;
+}
 
 /* this also allocates from the btrfs_bioset */
 struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index f46cfe4..cf1b94f 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -299,19 +299,9 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 }
 
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
-			      struct btrfs_dio_private *dip, struct bio *bio,
-			      u64 offset)
+			      struct bio *bio, u64 offset)
 {
-	int len = (bio->bi_iter.bi_sector << 9) - dip->disk_bytenr;
-	u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
-	int ret;
-
-	len >>= inode->i_sb->s_blocksize_bits;
-	len *= csum_size;
-
-	ret = __btrfs_lookup_bio_sums(root, inode, bio, offset,
-				      (u32 *)(dip->csum + len), 1);
-	return ret;
+	return __btrfs_lookup_bio_sums(root, inode, bio, offset, NULL, 1);
 }
 
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 548489e..fd88126 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7081,7 +7081,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	struct inode *inode = dip->inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct bio *dio_bio;
-	u32 *csums = (u32 *)dip->csum;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	u32 *csums = (u32 *)io_bio->csum;
 	u64 start;
 	int i;
 
@@ -7123,6 +7124,9 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	if (err)
 		clear_bit(BIO_UPTODATE, &dio_bio->bi_flags);
 	dio_end_io(dio_bio, err);
+
+	if (io_bio->end_io)
+		io_bio->end_io(io_bio, err);
 	bio_put(bio);
 }
 
@@ -7261,13 +7265,20 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 		ret = btrfs_csum_one_bio(root, inode, bio, file_offset, 1);
 		if (ret)
 			goto err;
-	} else if (!skip_sum) {
-		ret = btrfs_lookup_bio_sums_dio(root, inode, dip, bio,
+	} else {
+		/*
+		 * We have loaded all the csum data we need when we submit
+		 * the first bio, so skip it.
+		 */
+		if (dip->logical_offset != file_offset)
+			goto map;
+
+		/* Load all csum data at once. */
+		ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
 						file_offset);
 		if (ret)
 			goto err;
 	}
-
 map:
 	ret = btrfs_map_bio(root, rw, bio, 0, async_submit);
 err:
@@ -7288,7 +7299,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	u64 submit_len = 0;
 	u64 map_length;
 	int nr_pages = 0;
-	int ret = 0;
+	int ret;
 	int async_submit = 0;
 
 	map_length = orig_bio->bi_iter.bi_size;
@@ -7392,11 +7403,10 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_dio_private *dip;
 	struct bio *io_bio;
+	struct btrfs_io_bio *btrfs_bio;
 	int skip_sum;
-	int sum_len;
 	int write = rw & REQ_WRITE;
 	int ret = 0;
-	u16 csum_size;
 
 	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
 
@@ -7406,16 +7416,7 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 		goto free_ordered;
 	}
 
-	if (!skip_sum && !write) {
-		csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
-		sum_len = dio_bio->bi_iter.bi_size >>
-			inode->i_sb->s_blocksize_bits;
-		sum_len *= csum_size;
-	} else {
-		sum_len = 0;
-	}
-
-	dip = kmalloc(sizeof(*dip) + sum_len, GFP_NOFS);
+	dip = kmalloc(sizeof(*dip), GFP_NOFS);
 	if (!dip) {
 		ret = -ENOMEM;
 		goto free_io_bio;
@@ -7441,6 +7442,9 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 	if (!ret)
 		return;
 
+	btrfs_bio = btrfs_io_bio(io_bio);
+	if (btrfs_bio->end_io)
+		btrfs_bio->end_io(btrfs_bio, ret);
 free_io_bio:
 	bio_put(io_bio);
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 11/12] Btrfs: implement repair function when direct read fails
  2014-07-29  9:24   ` [PATCH v2 11/12] Btrfs: implement repair function when direct read fails Miao Xie
@ 2014-08-29 18:31     ` Chris Mason
  2014-09-01  6:56       ` Miao Xie
  0 siblings, 1 reply; 49+ messages in thread
From: Chris Mason @ 2014-08-29 18:31 UTC (permalink / raw)
  To: Miao Xie, linux-btrfs

On 07/29/2014 05:24 AM, Miao Xie wrote:
> This patch implements the data repair function used when a direct read fails.
> 
> The details of the implementation are:
> - When we find the data is not right, we try to read the data from the other
>   mirror.
> - After we get the right data, we write it back to the corrupted mirror.
> - And if the data on the new mirror is still corrupted, we will try the next
>   mirror until we read the right data or all the mirrors are traversed.
> - After the above work, we set the uptodate flag according to the result.
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 08e65e9..56b1546 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -698,7 +719,12 @@ static void end_workqueue_bio(struct bio *bio, int err)
>  
>  	fs_info = end_io_wq->info;
>  	end_io_wq->error = err;
> -	btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, NULL);
> +
> +	if (likely(end_io_wq->metadata != BTRFS_WQ_ENDIO_DIO_REPAIR))
> +		btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL,
> +				NULL);
> +	else
> +		INIT_WORK(&end_io_wq->work.normal_work, dio_end_workqueue_fn);

It's not clear why this one is using INIT_WORK instead of
btrfs_init_work, or why we're calling directly into queue_work instead
of btrfs_queue_work.  What am I missing?

-chris

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 11/12] Btrfs: implement repair function when direct read fails
  2014-08-29 18:31     ` Chris Mason
@ 2014-09-01  6:56       ` Miao Xie
  2014-09-02 12:33         ` Liu Bo
  0 siblings, 1 reply; 49+ messages in thread
From: Miao Xie @ 2014-09-01  6:56 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs

On Fri, 29 Aug 2014 14:31:48 -0400, Chris Mason wrote:
> On 07/29/2014 05:24 AM, Miao Xie wrote:
>> This patch implements the data repair function used when a direct read fails.
>>
>> The details of the implementation are:
>> - When we find the data is not right, we try to read the data from the other
>>   mirror.
>> - After we get the right data, we write it back to the corrupted mirror.
>> - And if the data on the new mirror is still corrupted, we will try the next
>>   mirror until we read the right data or all the mirrors are traversed.
>> - After the above work, we set the uptodate flag according to the result.
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 08e65e9..56b1546 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -698,7 +719,12 @@ static void end_workqueue_bio(struct bio *bio, int err)
>>  
>>  	fs_info = end_io_wq->info;
>>  	end_io_wq->error = err;
>> -	btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, NULL);
>> +
>> +	if (likely(end_io_wq->metadata != BTRFS_WQ_ENDIO_DIO_REPAIR))
>> +		btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL,
>> +				NULL);
>> +	else
>> +		INIT_WORK(&end_io_wq->work.normal_work, dio_end_workqueue_fn);
> 
> It's not clear why this one is using INIT_WORK instead of
> btrfs_init_work, or why we're calling directly into queue_work instead
> of btrfs_queue_work.  What am I missing?

I'm sorry that I forgot to write the explanation in this patch's changelog;
I wrote it in Patch 0.

"2.When the io on the mirror ends, we will insert the endio work into the
   system workqueue, not btrfs own endio workqueue, because the original
   endio work is still blocked in the btrfs endio workqueue, if we insert
   the endio work of the io on the mirror into that workqueue, deadlock
   would happen."

Could you add it to the changelog of this patch when you apply it?

Thanks
Miao
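
A minimal sketch of the deadlock described above, assuming for clarity an
endio workqueue with a single worker; the names and the direct queueing of
the repair work are illustrative, not code from the patchset:

#include <linux/workqueue.h>
#include <linux/completion.h>

static struct workqueue_struct *endio_wq;	/* assume max_active == 1 */
static DECLARE_COMPLETION(repair_done);

static void repair_endio(struct work_struct *work)
{
	complete(&repair_done);		/* never runs: no worker is free */
}
static DECLARE_WORK(repair_work, repair_endio);

static void orig_endio(struct work_struct *work)
{
	/* the original read failed; the re-read of another mirror
	 * completes and queues its endio work on the SAME workqueue */
	queue_work(endio_wq, &repair_work);

	/* this work occupies the only worker while it waits, so
	 * repair_endio() can never start: deadlock */
	wait_for_completion(&repair_done);
}

Queueing repair_work on an independent workqueue breaks the cycle,
because its worker is not the one blocked in orig_endio().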

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 11/12] Btrfs: implement repair function when direct read fails
  2014-09-01  6:56       ` Miao Xie
@ 2014-09-02 12:33         ` Liu Bo
  2014-09-02 13:05           ` Chris Mason
  0 siblings, 1 reply; 49+ messages in thread
From: Liu Bo @ 2014-09-02 12:33 UTC (permalink / raw)
  To: Miao Xie; +Cc: Chris Mason, linux-btrfs

On Mon, Sep 01, 2014 at 02:56:15PM +0800, Miao Xie wrote:
> On Fri, 29 Aug 2014 14:31:48 -0400, Chris Mason wrote:
> > On 07/29/2014 05:24 AM, Miao Xie wrote:
> >> This patch implements the data repair function used when a direct read fails.
> >>
> >> The details of the implementation are:
> >> - When we find the data is not right, we try to read the data from the other
> >>   mirror.
> >> - After we get the right data, we write it back to the corrupted mirror.
> >> - And if the data on the new mirror is still corrupted, we will try the next
> >>   mirror until we read the right data or all the mirrors are traversed.
> >> - After the above work, we set the uptodate flag according to the result.
> >>
> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >> index 08e65e9..56b1546 100644
> >> --- a/fs/btrfs/disk-io.c
> >> +++ b/fs/btrfs/disk-io.c
> >> @@ -698,7 +719,12 @@ static void end_workqueue_bio(struct bio *bio, int err)
> >>  
> >>  	fs_info = end_io_wq->info;
> >>  	end_io_wq->error = err;
> >> -	btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, NULL);
> >> +
> >> +	if (likely(end_io_wq->metadata != BTRFS_WQ_ENDIO_DIO_REPAIR))
> >> +		btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL,
> >> +				NULL);
> >> +	else
> >> +		INIT_WORK(&end_io_wq->work.normal_work, dio_end_workqueue_fn);
> > 
> > It's not clear why this one is using INIT_WORK instead of
> > btrfs_init_work, or why we're calling directly into queue_work instead
> > of btrfs_queue_work.  What am I missing?
> 
> I'm sorry that I forgot to write the explanation in this patch's changelog;
> I wrote it in Patch 0.
> 
> "2.When the io on the mirror ends, we will insert the endio work into the
>    system workqueue, not btrfs own endio workqueue, because the original
>    endio work is still blocked in the btrfs endio workqueue, if we insert
>    the endio work of the io on the mirror into that workqueue, deadlock
>    would happen."

Can you elaborate on the deadlock?

Given that buffered read can insert a subsequent read-mirror bio into the
btrfs endio workqueue without problems, what's the difference?

thanks,
-liubo

> 
> Could you add it to the changelog of this patch when you apply it?
> 
> Thanks
> Miao
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 11/12] Btrfs: implement repair function when direct read fails
  2014-09-02 12:33         ` Liu Bo
@ 2014-09-02 13:05           ` Chris Mason
  2014-09-03  9:02             ` Miao Xie
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
  0 siblings, 2 replies; 49+ messages in thread
From: Chris Mason @ 2014-09-02 13:05 UTC (permalink / raw)
  To: bo.li.liu, Miao Xie; +Cc: linux-btrfs



On 09/02/2014 08:33 AM, Liu Bo wrote:
> On Mon, Sep 01, 2014 at 02:56:15PM +0800, Miao Xie wrote:
>> On Fri, 29 Aug 2014 14:31:48 -0400, Chris Mason wrote:
>>> On 07/29/2014 05:24 AM, Miao Xie wrote:
>>>> This patch implements the data repair function used when a direct read fails.
>>>>
>>>> The details of the implementation are:
>>>> - When we find the data is not right, we try to read the data from the other
>>>>   mirror.
>>>> - After we get the right data, we write it back to the corrupted mirror.
>>>> - And if the data on the new mirror is still corrupted, we will try the next
>>>>   mirror until we read the right data or all the mirrors are traversed.
>>>> - After the above work, we set the uptodate flag according to the result.
>>>>
>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>> index 08e65e9..56b1546 100644
>>>> --- a/fs/btrfs/disk-io.c
>>>> +++ b/fs/btrfs/disk-io.c
>>>> @@ -698,7 +719,12 @@ static void end_workqueue_bio(struct bio *bio, int err)
>>>>  
>>>>  	fs_info = end_io_wq->info;
>>>>  	end_io_wq->error = err;
>>>> -	btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, NULL);
>>>> +
>>>> +	if (likely(end_io_wq->metadata != BTRFS_WQ_ENDIO_DIO_REPAIR))
>>>> +		btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL,
>>>> +				NULL);
>>>> +	else
>>>> +		INIT_WORK(&end_io_wq->work.normal_work, dio_end_workqueue_fn);
>>>
>>> It's not clear why this one is using INIT_WORK instead of
>>> btrfs_init_work, or why we're calling directly into queue_work instead
>>> of btrfs_queue_work.  What am I missing?
>>
>> I'm sorry that I forgot to write the explanation in this patch's changelog;
>> I wrote it in Patch 0.
>>
>> "2.When the io on the mirror ends, we will insert the endio work into the
>>    system workqueue, not btrfs own endio workqueue, because the original
>>    endio work is still blocked in the btrfs endio workqueue, if we insert
>>    the endio work of the io on the mirror into that workqueue, deadlock
>>    would happen."
> 
> Can you elaborate on the deadlock?
> 
> Given that buffered read can insert a subsequent read-mirror bio into the
> btrfs endio workqueue without problems, what's the difference?

We do have problems if we're inserting dependent items in the same
workqueue.

Miao, please make a repair workqueue.  I'll also have a use for it in
the raid56 parity work I think.

-chris
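
A rough sketch of a dedicated repair queue in end_workqueue_bio(); the
endio_repair_workers field is an assumption modeled on the existing btrfs
workqueue helpers visible in the quoted diff, not the final patch:

static void queue_endio_work(struct btrfs_fs_info *fs_info,
			     struct end_io_wq *end_io_wq)
{
	struct btrfs_workqueue *wq;

	btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, NULL);

	if (end_io_wq->metadata == BTRFS_WQ_ENDIO_DIO_REPAIR)
		wq = fs_info->endio_repair_workers;	/* dedicated queue */
	else
		wq = fs_info->endio_workers;

	btrfs_queue_work(wq, &end_io_wq->work);
}

This keeps everything on btrfs_init_work()/btrfs_queue_work() while still
guaranteeing the repair endio work never sits behind the blocked original
endio work.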

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 11/12] Btrfs: implement repair function when direct read fails
  2014-09-02 13:05           ` Chris Mason
@ 2014-09-03  9:02             ` Miao Xie
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
  1 sibling, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-09-03  9:02 UTC (permalink / raw)
  To: Chris Mason, bo.li.liu; +Cc: linux-btrfs

On Tue, 2 Sep 2014 09:05:15 -0400, Chris Mason wrote:
>>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>>> index 08e65e9..56b1546 100644
>>>>> --- a/fs/btrfs/disk-io.c
>>>>> +++ b/fs/btrfs/disk-io.c
>>>>> @@ -698,7 +719,12 @@ static void end_workqueue_bio(struct bio *bio, int err)
>>>>>  
>>>>>  	fs_info = end_io_wq->info;
>>>>>  	end_io_wq->error = err;
>>>>> -	btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, NULL);
>>>>> +
>>>>> +	if (likely(end_io_wq->metadata != BTRFS_WQ_ENDIO_DIO_REPAIR))
>>>>> +		btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL,
>>>>> +				NULL);
>>>>> +	else
>>>>> +		INIT_WORK(&end_io_wq->work.normal_work, dio_end_workqueue_fn);
>>>>
>>>> It's not clear why this one is using INIT_WORK instead of
>>>> btrfs_init_work, or why we're calling directly into queue_work instead
>>>> of btrfs_queue_work.  What am I missing?
>>>
>>> I'm sorry that I forgot to write the explanation in this patch's changelog;
>>> I wrote it in Patch 0.
>>>
>>> "2.When the io on the mirror ends, we will insert the endio work into the
>>>    system workqueue, not btrfs own endio workqueue, because the original
>>>    endio work is still blocked in the btrfs endio workqueue, if we insert
>>>    the endio work of the io on the mirror into that workqueue, deadlock
>>>    would happen."
>>
>> Can you elaborate on the deadlock?
>>
>> Given that buffered read can insert a subsequent read-mirror bio into the
>> btrfs endio workqueue without problems, what's the difference?
> 
> We do have problems if we're inserting dependent items in the same
> workqueue.
> 
> Miao, please make a repair workqueue.  I'll also have a use for it in
> the raid56 parity work I think.

OK, I'll update the patch soon.

Thanks
Miao


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v4 00/11] Implement the data repair function for direct read
  2014-09-02 13:05           ` Chris Mason
  2014-09-03  9:02             ` Miao Xie
@ 2014-09-12 10:43             ` Miao Xie
  2014-09-12 10:43               ` [PATCH v4 01/11] Btrfs: load checksum data once when submitting a direct read io Miao Xie
                                 ` (11 more replies)
  1 sibling, 12 replies; 49+ messages in thread
From: Miao Xie @ 2014-09-12 10:43 UTC (permalink / raw)
  To: linux-btrfs

This patchset implements the data repair function for direct read; it
works like buffered read:
1.When we find the data is not right, we try to read the data from the other
  mirror.
2.When the io on the mirror ends, we will insert the endio work into a
  dedicated btrfs workqueue, not the common read endio workqueue, because
  the original endio work is still blocked in the btrfs endio workqueue;
  if we inserted the endio work of the io on the mirror into that
  workqueue, a deadlock would happen.
3.If we get the right data, we write it back to repair the corrupted mirror.
4.If the data on the new mirror is still corrupted, we will try the next
  mirror until we read the right data or all the mirrors are traversed.
5.After the above work, we set the uptodate flag according to the result.
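
A compile-only userspace sketch of steps 1, 3 and 4 above; the block type
and the read/verify/write helpers are invented stand-ins:

#include <errno.h>

struct block;				/* stands in for a bio's payload */

int read_from_mirror(struct block *b, int mirror);	/* assumed */
int csum_ok(const struct block *b);			/* assumed */
void write_to_mirror(struct block *b, int mirror);	/* assumed */

int read_and_repair(struct block *b, int failed_mirror, int num_copies)
{
	int mirror;

	for (mirror = 1; mirror <= num_copies; mirror++) {
		if (mirror == failed_mirror)
			continue;
		if (read_from_mirror(b, mirror) == 0 && csum_ok(b)) {
			write_to_mirror(b, failed_mirror);	/* repair */
			return 0;				/* uptodate */
		}
	}
	return -EIO;		/* every mirror tried, data still bad */
}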

The difference is that a direct read may be split into several small ios;
in order to get the number of the mirror on which the io error happens, we
have to do the data check and repair in the end IO function of those
sub-IO requests.

Besides that, we also fixed some direct io bugs.

Changelog v3 -> v4:
- Remove the 1st patch which has been applied into the upstream kernel.
- Use a dedicated btrfs workqueue instead of the system workqueue to
  deal with the completed repair bio; this was suggested by Chris.
- Rebase the patchset to integration branch of Chris's git tree.

Changelog v2 -> v3:
- Fix the wrong bio being returned when doing bio clone, which was
  reported by Filipe

Changelog v1 -> v2:
- Fix the warning which was triggered by __GFP_ZERO in the 2nd patch

Miao Xie (11):
  Btrfs: load checksum data once when submitting a direct read io
  Btrfs: cleanup similar code of the buffered data data check and dio
    read data check
  Btrfs: do file data check by sub-bio's self
  Btrfs: fix missing error handler if submiting re-read bio fails
  Btrfs: Cleanup unused variant and argument of IO failure handlers
  Btrfs: split bio_readpage_error into several functions
  Btrfs: modify repair_io_failure and make it suit direct io
  Btrfs: modify clean_io_failure and make it suit direct io
  Btrfs: Set real mirror number for read operation on RAID0/5/6
  Btrfs: implement repair function when direct read fails
  Btrfs: cleanup the read failure record after write or when the inode
    is freeing

 fs/btrfs/async-thread.c |   1 +
 fs/btrfs/async-thread.h |   1 +
 fs/btrfs/btrfs_inode.h  |  10 +-
 fs/btrfs/ctree.h        |   4 +-
 fs/btrfs/disk-io.c      |  11 +-
 fs/btrfs/disk-io.h      |   1 +
 fs/btrfs/extent_io.c    | 254 +++++++++++++++++----------
 fs/btrfs/extent_io.h    |  38 ++++-
 fs/btrfs/file-item.c    |  14 +-
 fs/btrfs/inode.c        | 446 +++++++++++++++++++++++++++++++++++++++---------
 fs/btrfs/scrub.c        |   4 +-
 fs/btrfs/volumes.c      |   5 +
 fs/btrfs/volumes.h      |   5 +-
 13 files changed, 601 insertions(+), 193 deletions(-)

-- 
1.9.3


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v4 01/11] Btrfs: load checksum data once when submitting a direct read io
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
@ 2014-09-12 10:43               ` Miao Xie
  2014-09-12 10:43               ` [PATCH v4 02/11] Btrfs: cleanup similar code of the buffered data data check and dio read data check Miao Xie
                                 ` (10 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-09-12 10:43 UTC (permalink / raw)
  To: linux-btrfs

The current code would load the checksum data several times when we split
a whole direct read io because of the raid stripe limit, making us search
the csum tree repeatedly. That just wasted time and made contention on
the csum tree root more serious. This patch fixes the problem by loading
all the data at once.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v3 -> v4:
- None

Changelog v2 -> v3:
- Fix the wrong return value of btrfs_bio_clone

Changelog v1 -> v2:
- Remove the __GFP_ZERO flag in btrfs_submit_direct because it would trigger
  a WARNING. It was reported by Filipe David Manana, thanks.
---
 fs/btrfs/btrfs_inode.h |  1 -
 fs/btrfs/ctree.h       |  3 +--
 fs/btrfs/extent_io.c   | 13 +++++++++++--
 fs/btrfs/file-item.c   | 14 ++------------
 fs/btrfs/inode.c       | 38 +++++++++++++++++++++-----------------
 5 files changed, 35 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index fd87941..8bea70e 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -263,7 +263,6 @@ struct btrfs_dio_private {
 
 	/* dio_bio came from fs/direct-io.c */
 	struct bio *dio_bio;
-	u8 csum[0];
 };
 
 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ded7781..7b54cd9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3719,8 +3719,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 			  struct bio *bio, u32 *dst);
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
-			      struct btrfs_dio_private *dip, struct bio *bio,
-			      u64 logical_offset);
+			      struct bio *bio, u64 logical_offset);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
 			     u64 objectid, u64 pos,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 86b39de..dfe1afe 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2621,9 +2621,18 @@ btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
 
 struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
 {
-	return bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
-}
+	struct btrfs_io_bio *btrfs_bio;
+	struct bio *new;
 
+	new = bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
+	if (new) {
+		btrfs_bio = btrfs_io_bio(new);
+		btrfs_bio->csum = NULL;
+		btrfs_bio->csum_allocated = NULL;
+		btrfs_bio->end_io = NULL;
+	}
+	return new;
+}
 
 /* this also allocates from the btrfs_bioset */
 struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 6e6262e..783a943 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -299,19 +299,9 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 }
 
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
-			      struct btrfs_dio_private *dip, struct bio *bio,
-			      u64 offset)
+			      struct bio *bio, u64 offset)
 {
-	int len = (bio->bi_iter.bi_sector << 9) - dip->disk_bytenr;
-	u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
-	int ret;
-
-	len >>= inode->i_sb->s_blocksize_bits;
-	len *= csum_size;
-
-	ret = __btrfs_lookup_bio_sums(root, inode, bio, offset,
-				      (u32 *)(dip->csum + len), 1);
-	return ret;
+	return __btrfs_lookup_bio_sums(root, inode, bio, offset, NULL, 1);
 }
 
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2118ea6..af304e1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7196,7 +7196,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	struct inode *inode = dip->inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct bio *dio_bio;
-	u32 *csums = (u32 *)dip->csum;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	u32 *csums = (u32 *)io_bio->csum;
 	u64 start;
 	int i;
 
@@ -7238,6 +7239,9 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	if (err)
 		clear_bit(BIO_UPTODATE, &dio_bio->bi_flags);
 	dio_end_io(dio_bio, err);
+
+	if (io_bio->end_io)
+		io_bio->end_io(io_bio, err);
 	bio_put(bio);
 }
 
@@ -7377,13 +7381,20 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 		ret = btrfs_csum_one_bio(root, inode, bio, file_offset, 1);
 		if (ret)
 			goto err;
-	} else if (!skip_sum) {
-		ret = btrfs_lookup_bio_sums_dio(root, inode, dip, bio,
+	} else {
+		/*
+		 * We have loaded all the csum data we need when we submit
+		 * the first bio, so skip it.
+		 */
+		if (dip->logical_offset != file_offset)
+			goto map;
+
+		/* Load all csum data at once. */
+		ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
 						file_offset);
 		if (ret)
 			goto err;
 	}
-
 map:
 	ret = btrfs_map_bio(root, rw, bio, 0, async_submit);
 err:
@@ -7404,7 +7415,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	u64 submit_len = 0;
 	u64 map_length;
 	int nr_pages = 0;
-	int ret = 0;
+	int ret;
 	int async_submit = 0;
 
 	map_length = orig_bio->bi_iter.bi_size;
@@ -7508,11 +7519,10 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_dio_private *dip;
 	struct bio *io_bio;
+	struct btrfs_io_bio *btrfs_bio;
 	int skip_sum;
-	int sum_len;
 	int write = rw & REQ_WRITE;
 	int ret = 0;
-	u16 csum_size;
 
 	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
 
@@ -7522,16 +7532,7 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 		goto free_ordered;
 	}
 
-	if (!skip_sum && !write) {
-		csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
-		sum_len = dio_bio->bi_iter.bi_size >>
-			inode->i_sb->s_blocksize_bits;
-		sum_len *= csum_size;
-	} else {
-		sum_len = 0;
-	}
-
-	dip = kmalloc(sizeof(*dip) + sum_len, GFP_NOFS);
+	dip = kmalloc(sizeof(*dip), GFP_NOFS);
 	if (!dip) {
 		ret = -ENOMEM;
 		goto free_io_bio;
@@ -7557,6 +7558,9 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 	if (!ret)
 		return;
 
+	btrfs_bio = btrfs_io_bio(io_bio);
+	if (btrfs_bio->end_io)
+		btrfs_bio->end_io(btrfs_bio, ret);
 free_io_bio:
 	bio_put(io_bio);
 
-- 
1.9.3
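
The effect of the patch, as a small userspace sketch with invented names:
the checksums for the whole original request are looked up into one array
up front, and each sub-request later just points into that array (the
per-sub-request binding itself lands in a later patch of this series):

#include <stdint.h>
#include <stddef.h>

#define BLOCKSIZE_BITS	12	/* assume 4 KiB blocks */

struct sub_req {
	uint64_t file_offset;	/* start of this sub-request */
	const uint32_t *csum;	/* slice of the shared csum array */
};

static void bind_csums(struct sub_req *sub, uint64_t orig_offset,
		       const uint32_t *all_csums)
{
	/* one u32 csum per block: index = byte distance / block size */
	size_t idx = (sub->file_offset - orig_offset) >> BLOCKSIZE_BITS;

	sub->csum = all_csums + idx;
}

For example, a sub-request starting 8 KiB into the original request gets
index 2, i.e. the third checksum, without another csum tree search.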


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 02/11] Btrfs: cleanup similar code of the buffered data data check and dio read data check
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
  2014-09-12 10:43               ` [PATCH v4 01/11] Btrfs: load checksum data once when submitting a direct read io Miao Xie
@ 2014-09-12 10:43               ` Miao Xie
  2014-09-12 10:43               ` [PATCH v4 03/11] Btrfs: do file data check by sub-bio's self Miao Xie
                                 ` (9 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-09-12 10:43 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/inode.c | 102 +++++++++++++++++++++++++------------------------------
 1 file changed, 47 insertions(+), 55 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index af304e1..e8139c6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2893,6 +2893,40 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
 	return 0;
 }
 
+static int __readpage_endio_check(struct inode *inode,
+				  struct btrfs_io_bio *io_bio,
+				  int icsum, struct page *page,
+				  int pgoff, u64 start, size_t len)
+{
+	char *kaddr;
+	u32 csum_expected;
+	u32 csum = ~(u32)0;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+
+	csum_expected = *(((u32 *)io_bio->csum) + icsum);
+
+	kaddr = kmap_atomic(page);
+	csum = btrfs_csum_data(kaddr + pgoff, csum,  len);
+	btrfs_csum_final(csum, (char *)&csum);
+	if (csum != csum_expected)
+		goto zeroit;
+
+	kunmap_atomic(kaddr);
+	return 0;
+zeroit:
+	if (__ratelimit(&_rs))
+		btrfs_info(BTRFS_I(inode)->root->fs_info,
+			   "csum failed ino %llu off %llu csum %u expected csum %u",
+			   btrfs_ino(inode), start, csum, csum_expected);
+	memset(kaddr + pgoff, 1, len);
+	flush_dcache_page(page);
+	kunmap_atomic(kaddr);
+	if (csum_expected == 0)
+		return 0;
+	return -EIO;
+}
+
 /*
  * when reads are done, we need to check csums to verify the data is correct
  * if there's a match, we allow the bio to finish.  If not, the code in
@@ -2905,20 +2939,15 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	size_t offset = start - page_offset(page);
 	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
-	char *kaddr;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	u32 csum_expected;
-	u32 csum = ~(u32)0;
-	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
-	                              DEFAULT_RATELIMIT_BURST);
 
 	if (PageChecked(page)) {
 		ClearPageChecked(page);
-		goto good;
+		return 0;
 	}
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
-		goto good;
+		return 0;
 
 	if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
 	    test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
@@ -2928,28 +2957,8 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	}
 
 	phy_offset >>= inode->i_sb->s_blocksize_bits;
-	csum_expected = *(((u32 *)io_bio->csum) + phy_offset);
-
-	kaddr = kmap_atomic(page);
-	csum = btrfs_csum_data(kaddr + offset, csum,  end - start + 1);
-	btrfs_csum_final(csum, (char *)&csum);
-	if (csum != csum_expected)
-		goto zeroit;
-
-	kunmap_atomic(kaddr);
-good:
-	return 0;
-
-zeroit:
-	if (__ratelimit(&_rs))
-		btrfs_info(root->fs_info, "csum failed ino %llu off %llu csum %u expected csum %u",
-			btrfs_ino(page->mapping->host), start, csum, csum_expected);
-	memset(kaddr + offset, 1, end - start + 1);
-	flush_dcache_page(page);
-	kunmap_atomic(kaddr);
-	if (csum_expected == 0)
-		return 0;
-	return -EIO;
+	return __readpage_endio_check(inode, io_bio, phy_offset, page, offset,
+				      start, (size_t)(end - start + 1));
 }
 
 struct delayed_iput {
@@ -7194,41 +7203,24 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio_vec *bvec;
 	struct inode *inode = dip->inode;
-	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct bio *dio_bio;
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
-	u32 *csums = (u32 *)io_bio->csum;
 	u64 start;
+	int ret;
 	int i;
 
+	if (err || (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
+		goto skip_checksum;
+
 	start = dip->logical_offset;
 	bio_for_each_segment_all(bvec, bio, i) {
-		if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
-			struct page *page = bvec->bv_page;
-			char *kaddr;
-			u32 csum = ~(u32)0;
-			unsigned long flags;
-
-			local_irq_save(flags);
-			kaddr = kmap_atomic(page);
-			csum = btrfs_csum_data(kaddr + bvec->bv_offset,
-					       csum, bvec->bv_len);
-			btrfs_csum_final(csum, (char *)&csum);
-			kunmap_atomic(kaddr);
-			local_irq_restore(flags);
-
-			flush_dcache_page(bvec->bv_page);
-			if (csum != csums[i]) {
-				btrfs_err(root->fs_info, "csum failed ino %llu off %llu csum %u expected csum %u",
-					  btrfs_ino(inode), start, csum,
-					  csums[i]);
-				err = -EIO;
-			}
-		}
-
+		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
+					     0, start, bvec->bv_len);
+		if (ret)
+			err = -EIO;
 		start += bvec->bv_len;
 	}
-
+skip_checksum:
 	unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
 		      dip->logical_offset + dip->bytes - 1);
 	dio_bio = dip->dio_bio;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 03/11] Btrfs: do file data check by sub-bio's self
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
  2014-09-12 10:43               ` [PATCH v4 01/11] Btrfs: load checksum data once when submitting a direct read io Miao Xie
  2014-09-12 10:43               ` [PATCH v4 02/11] Btrfs: cleanup similar code of the buffered data data check and dio read data check Miao Xie
@ 2014-09-12 10:43               ` Miao Xie
  2014-09-12 10:43               ` [PATCH v4 04/11] Btrfs: fix missing error handler if submiting re-read bio fails Miao Xie
                                 ` (8 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-09-12 10:43 UTC (permalink / raw)
  To: linux-btrfs

Direct IO splits the original bio into several sub-bios because of the
raid stripe limit, and the filesystem waits for all the sub-bios before
running the final end io process.

But it was very hard to implement data repair when a dio read failure
happened, because in the final end io function we didn't know which
mirror the data was read from. So in order to implement data repair, we
have to move the file data check from the final end io function into the
sub-bio end io function, where we can get the mirror number of the device
we accessed. This patch does that work as the first step of the direct io
data repair implementation.
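
The mirror number is the key piece of state; a compile-only sketch with
invented names (sub_read, verify_range) of why the check must live in the
sub-request's own completion handler:

struct sub_read {
	int mirror_num;		/* lost by the time the final endio runs */
	int error;
};

int verify_range(struct sub_read *sub);		/* assumed csum check */

static void sub_endio(struct sub_read *sub)
{
	/* only this handler knows which mirror served the data, so
	 * only it can verify the range and pick the next mirror */
	sub->error = verify_range(sub);
}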

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/btrfs_inode.h |   9 +++++
 fs/btrfs/extent_io.c   |   2 +-
 fs/btrfs/inode.c       | 100 ++++++++++++++++++++++++++++++++++++-------------
 fs/btrfs/volumes.h     |   5 ++-
 4 files changed, 87 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 8bea70e..4d30947 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -245,8 +245,11 @@ static inline int btrfs_inode_in_log(struct inode *inode, u64 generation)
 	return 0;
 }
 
+#define BTRFS_DIO_ORIG_BIO_SUBMITTED	0x1
+
 struct btrfs_dio_private {
 	struct inode *inode;
+	unsigned long flags;
 	u64 logical_offset;
 	u64 disk_bytenr;
 	u64 bytes;
@@ -263,6 +266,12 @@ struct btrfs_dio_private {
 
 	/* dio_bio came from fs/direct-io.c */
 	struct bio *dio_bio;
+
+	/*
+	 * The original bio may be split into several sub-bios; this is
+	 * done during endio of the sub-bios
+	 */
+	int (*subio_endio)(struct inode *, struct btrfs_io_bio *);
 };
 
 /*
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dfe1afe..92a6d9f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2472,7 +2472,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 		struct inode *inode = page->mapping->host;
 
 		pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
-			 "mirror=%lu\n", (u64)bio->bi_iter.bi_sector, err,
+			 "mirror=%u\n", (u64)bio->bi_iter.bi_sector, err,
 			 io_bio->mirror_num);
 		tree = &BTRFS_I(inode)->io_tree;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e8139c6..cf79f79 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7198,29 +7198,40 @@ unlock_err:
 	return ret;
 }
 
-static void btrfs_endio_direct_read(struct bio *bio, int err)
+static int btrfs_subio_endio_read(struct inode *inode,
+				  struct btrfs_io_bio *io_bio)
 {
-	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio_vec *bvec;
-	struct inode *inode = dip->inode;
-	struct bio *dio_bio;
-	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
 	u64 start;
-	int ret;
 	int i;
+	int ret;
+	int err = 0;
 
-	if (err || (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto skip_checksum;
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
+		return 0;
 
-	start = dip->logical_offset;
-	bio_for_each_segment_all(bvec, bio, i) {
+	start = io_bio->logical;
+	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
 		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
 					     0, start, bvec->bv_len);
 		if (ret)
 			err = -EIO;
 		start += bvec->bv_len;
 	}
-skip_checksum:
+
+	return err;
+}
+
+static void btrfs_endio_direct_read(struct bio *bio, int err)
+{
+	struct btrfs_dio_private *dip = bio->bi_private;
+	struct inode *inode = dip->inode;
+	struct bio *dio_bio;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+
+	if (!err && (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED))
+		err = btrfs_subio_endio_read(inode, io_bio);
+
 	unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
 		      dip->logical_offset + dip->bytes - 1);
 	dio_bio = dip->dio_bio;
@@ -7298,6 +7309,7 @@ static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
 static void btrfs_end_dio_bio(struct bio *bio, int err)
 {
 	struct btrfs_dio_private *dip = bio->bi_private;
+	int ret;
 
 	if (err) {
 		btrfs_err(BTRFS_I(dip->inode)->root->fs_info,
@@ -7305,6 +7317,13 @@ static void btrfs_end_dio_bio(struct bio *bio, int err)
 		      btrfs_ino(dip->inode), bio->bi_rw,
 		      (unsigned long long)bio->bi_iter.bi_sector,
 		      bio->bi_iter.bi_size, err);
+	} else if (dip->subio_endio) {
+		ret = dip->subio_endio(dip->inode, btrfs_io_bio(bio));
+		if (ret)
+			err = ret;
+	}
+
+	if (err) {
 		dip->errors = 1;
 
 		/*
@@ -7335,6 +7354,38 @@ static struct bio *btrfs_dio_bio_alloc(struct block_device *bdev,
 	return btrfs_bio_alloc(bdev, first_sector, nr_vecs, gfp_flags);
 }
 
+static inline int btrfs_lookup_and_bind_dio_csum(struct btrfs_root *root,
+						 struct inode *inode,
+						 struct btrfs_dio_private *dip,
+						 struct bio *bio,
+						 u64 file_offset)
+{
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	struct btrfs_io_bio *orig_io_bio = btrfs_io_bio(dip->orig_bio);
+	int ret;
+
+	/*
+	 * We load all the csum data we need when we submit
+	 * the first bio to reduce the csum tree search and
+	 * contention.
+	 */
+	if (dip->logical_offset == file_offset) {
+		ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
+						file_offset);
+		if (ret)
+			return ret;
+	}
+
+	if (bio == dip->orig_bio)
+		return 0;
+
+	file_offset -= dip->logical_offset;
+	file_offset >>= inode->i_sb->s_blocksize_bits;
+	io_bio->csum = (u8 *)(((u32 *)orig_io_bio->csum) + file_offset);
+
+	return 0;
+}
+
 static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 					 int rw, u64 file_offset, int skip_sum,
 					 int async_submit)
@@ -7374,16 +7425,8 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 		if (ret)
 			goto err;
 	} else {
-		/*
-		 * We have loaded all the csum data we need when we submit
-		 * the first bio, so skip it.
-		 */
-		if (dip->logical_offset != file_offset)
-			goto map;
-
-		/* Load all csum data at once. */
-		ret = btrfs_lookup_bio_sums_dio(root, inode, dip->orig_bio,
-						file_offset);
+		ret = btrfs_lookup_and_bind_dio_csum(root, inode, dip, bio,
+						     file_offset);
 		if (ret)
 			goto err;
 	}
@@ -7418,6 +7461,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 
 	if (map_length >= orig_bio->bi_iter.bi_size) {
 		bio = orig_bio;
+		dip->flags |= BTRFS_DIO_ORIG_BIO_SUBMITTED;
 		goto submit;
 	}
 
@@ -7434,6 +7478,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 
 	bio->bi_private = dip;
 	bio->bi_end_io = btrfs_end_dio_bio;
+	btrfs_io_bio(bio)->logical = file_offset;
 	atomic_inc(&dip->pending_bios);
 
 	while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
@@ -7468,6 +7513,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 				goto out_err;
 			bio->bi_private = dip;
 			bio->bi_end_io = btrfs_end_dio_bio;
+			btrfs_io_bio(bio)->logical = file_offset;
 
 			map_length = orig_bio->bi_iter.bi_size;
 			ret = btrfs_map_block(root->fs_info, rw,
@@ -7524,7 +7570,7 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 		goto free_ordered;
 	}
 
-	dip = kmalloc(sizeof(*dip), GFP_NOFS);
+	dip = kzalloc(sizeof(*dip), GFP_NOFS);
 	if (!dip) {
 		ret = -ENOMEM;
 		goto free_io_bio;
@@ -7536,21 +7582,23 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 	dip->bytes = dio_bio->bi_iter.bi_size;
 	dip->disk_bytenr = (u64)dio_bio->bi_iter.bi_sector << 9;
 	io_bio->bi_private = dip;
-	dip->errors = 0;
 	dip->orig_bio = io_bio;
 	dip->dio_bio = dio_bio;
 	atomic_set(&dip->pending_bios, 0);
+	btrfs_bio = btrfs_io_bio(io_bio);
+	btrfs_bio->logical = file_offset;
 
-	if (write)
+	if (write) {
 		io_bio->bi_end_io = btrfs_endio_direct_write;
-	else
+	} else {
 		io_bio->bi_end_io = btrfs_endio_direct_read;
+		dip->subio_endio = btrfs_subio_endio_read;
+	}
 
 	ret = btrfs_submit_direct_hook(rw, dip, skip_sum);
 	if (!ret)
 		return;
 
-	btrfs_bio = btrfs_io_bio(io_bio);
 	if (btrfs_bio->end_io)
 		btrfs_bio->end_io(btrfs_bio, ret);
 free_io_bio:
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 2b37da3..91998bc 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -268,8 +268,9 @@ struct btrfs_fs_devices {
  */
 typedef void (btrfs_io_bio_end_io_t) (struct btrfs_io_bio *bio, int err);
 struct btrfs_io_bio {
-	unsigned long mirror_num;
-	unsigned long stripe_index;
+	unsigned int mirror_num;
+	unsigned int stripe_index;
+	u64 logical;
 	u8 *csum;
 	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
 	u8 *csum_allocated;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 04/11] Btrfs: fix missing error handler if submiting re-read bio fails
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
                                 ` (2 preceding siblings ...)
  2014-09-12 10:43               ` [PATCH v4 03/11] Btrfs: do file data check by sub-bio's self Miao Xie
@ 2014-09-12 10:43               ` Miao Xie
  2014-09-12 10:43               ` [PATCH v4 05/11] Btrfs: Cleanup unused variant and argument of IO failure handlers Miao Xie
                                 ` (7 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-09-12 10:43 UTC (permalink / raw)
  To: linux-btrfs

We forgot to free the failure record and the bio when submitting the re-read
bio failed; fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 92a6d9f..f8dda46 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2348,6 +2348,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	ret = tree->ops->submit_bio_hook(inode, read_mode, bio,
 					 failrec->this_mirror,
 					 failrec->bio_flags, 0);
+	if (ret) {
+		free_io_failure(inode, failrec, 0);
+		bio_put(bio);
+	}
+
 	return ret;
 }
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 05/11] Btrfs: Cleanup unused variant and argument of IO failure handlers
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
                                 ` (3 preceding siblings ...)
  2014-09-12 10:43               ` [PATCH v4 04/11] Btrfs: fix missing error handler if submiting re-read bio fails Miao Xie
@ 2014-09-12 10:43               ` Miao Xie
  2014-09-12 10:43               ` [PATCH v4 06/11] Btrfs: split bio_readpage_error into several functions Miao Xie
                                 ` (6 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-09-12 10:43 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f8dda46..154cb8e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1981,8 +1981,7 @@ struct io_failure_record {
 	int in_validation;
 };
 
-static int free_io_failure(struct inode *inode, struct io_failure_record *rec,
-				int did_repair)
+static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
 	int ret;
 	int err = 0;
@@ -2109,7 +2108,6 @@ static int clean_io_failure(u64 start, struct page *page)
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	struct extent_state *state;
 	int num_copies;
-	int did_repair = 0;
 	int ret;
 
 	private = 0;
@@ -2130,7 +2128,6 @@ static int clean_io_failure(u64 start, struct page *page)
 		/* there was no real error, just free the record */
 		pr_debug("clean_io_failure: freeing dummy error at %llu\n",
 			 failrec->start);
-		did_repair = 1;
 		goto out;
 	}
 	if (fs_info->sb->s_flags & MS_RDONLY)
@@ -2147,19 +2144,16 @@ static int clean_io_failure(u64 start, struct page *page)
 		num_copies = btrfs_num_copies(fs_info, failrec->logical,
 					      failrec->len);
 		if (num_copies > 1)  {
-			ret = repair_io_failure(fs_info, start, failrec->len,
-						failrec->logical, page,
-						failrec->failed_mirror);
-			did_repair = !ret;
+			repair_io_failure(fs_info, start, failrec->len,
+					  failrec->logical, page,
+					  failrec->failed_mirror);
 		}
-		ret = 0;
 	}
 
 out:
-	if (!ret)
-		ret = free_io_failure(inode, failrec, did_repair);
+	free_io_failure(inode, failrec);
 
-	return ret;
+	return 0;
 }
 
 /*
@@ -2269,7 +2263,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		 */
 		pr_debug("bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-		free_io_failure(inode, failrec, 0);
+		free_io_failure(inode, failrec);
 		return -EIO;
 	}
 
@@ -2312,13 +2306,13 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	if (failrec->this_mirror > num_copies) {
 		pr_debug("bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-		free_io_failure(inode, failrec, 0);
+		free_io_failure(inode, failrec);
 		return -EIO;
 	}
 
 	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
 	if (!bio) {
-		free_io_failure(inode, failrec, 0);
+		free_io_failure(inode, failrec);
 		return -EIO;
 	}
 	bio->bi_end_io = failed_bio->bi_end_io;
@@ -2349,7 +2343,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 					 failrec->this_mirror,
 					 failrec->bio_flags, 0);
 	if (ret) {
-		free_io_failure(inode, failrec, 0);
+		free_io_failure(inode, failrec);
 		bio_put(bio);
 	}
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 06/11] Btrfs: split bio_readpage_error into several functions
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
                                 ` (4 preceding siblings ...)
  2014-09-12 10:43               ` [PATCH v4 05/11] Btrfs: Cleanup unused variant and argument of IO failure handlers Miao Xie
@ 2014-09-12 10:43               ` Miao Xie
  2014-09-12 10:44               ` [PATCH v4 07/11] Btrfs: modify repair_io_failure and make it suit direct io Miao Xie
                                 ` (5 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-09-12 10:43 UTC (permalink / raw)
  To: linux-btrfs

The data repair function for direct read will be implemented later, and some
code in bio_readpage_error will be reused, so split bio_readpage_error into
several functions that the direct read repair code can use later.
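
After the split, the buffered read error path composes the new helpers
roughly as follows (a condensed sketch of the resulting flow; the submit
step is elided, see the full diff below):

	static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
				      struct page *page, u64 start, u64 end,
				      int failed_mirror)
	{
		struct inode *inode = page->mapping->host;
		struct io_failure_record *failrec;
		struct bio *bio;
		int ret;

		/* 1. Find or create the failure record for this range. */
		ret = btrfs_get_io_failure_record(inode, start, end, &failrec);
		if (ret)
			return ret;

		/* 2. Decide whether another mirror is worth trying. */
		if (!btrfs_check_repairable(inode, failed_bio, failrec,
					    failed_mirror)) {
			free_io_failure(inode, failrec);
			return -EIO;
		}

		/* 3. Build the re-read bio aimed at the chosen mirror. */
		phy_offset >>= inode->i_sb->s_blocksize_bits;
		bio = btrfs_create_repair_bio(inode, failed_bio, failrec,
					      page, start - page_offset(page),
					      (int)phy_offset,
					      failed_bio->bi_end_io);
		/* 4. ... then submit it via tree->ops->submit_bio_hook(). */
	}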

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 159 ++++++++++++++++++++++++++++++---------------------
 fs/btrfs/extent_io.h |  28 +++++++++
 2 files changed, 123 insertions(+), 64 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 154cb8e..cf1de40 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1962,25 +1962,6 @@ static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
 		SetPageUptodate(page);
 }
 
-/*
- * When IO fails, either with EIO or csum verification fails, we
- * try other mirrors that might have a good copy of the data.  This
- * io_failure_record is used to record state as we go through all the
- * mirrors.  If another mirror has good data, the page is set up to date
- * and things continue.  If a good mirror can't be found, the original
- * bio end_io callback is called to indicate things have failed.
- */
-struct io_failure_record {
-	struct page *page;
-	u64 start;
-	u64 len;
-	u64 logical;
-	unsigned long bio_flags;
-	int this_mirror;
-	int failed_mirror;
-	int in_validation;
-};
-
 static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
 	int ret;
@@ -2156,40 +2137,24 @@ out:
 	return 0;
 }
 
-/*
- * this is a generic handler for readpage errors (default
- * readpage_io_failed_hook). if other copies exist, read those and write back
- * good data to the failed position. does not investigate in remapping the
- * failed extent elsewhere, hoping the device will be smart enough to do this as
- * needed
- */
-
-static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
-			      struct page *page, u64 start, u64 end,
-			      int failed_mirror)
+int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
+				struct io_failure_record **failrec_ret)
 {
-	struct io_failure_record *failrec = NULL;
+	struct io_failure_record *failrec;
 	u64 private;
 	struct extent_map *em;
-	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
-	struct bio *bio;
-	struct btrfs_io_bio *btrfs_failed_bio;
-	struct btrfs_io_bio *btrfs_bio;
-	int num_copies;
 	int ret;
-	int read_mode;
 	u64 logical;
 
-	BUG_ON(failed_bio->bi_rw & REQ_WRITE);
-
 	ret = get_state_private(failure_tree, start, &private);
 	if (ret) {
 		failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
 		if (!failrec)
 			return -ENOMEM;
+
 		failrec->start = start;
 		failrec->len = end - start + 1;
 		failrec->this_mirror = 0;
@@ -2209,11 +2174,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 			em = NULL;
 		}
 		read_unlock(&em_tree->lock);
-
 		if (!em) {
 			kfree(failrec);
 			return -EIO;
 		}
+
 		logical = start - em->start;
 		logical = em->block_start + logical;
 		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
@@ -2222,8 +2187,10 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 			extent_set_compress_type(&failrec->bio_flags,
 						 em->compress_type);
 		}
-		pr_debug("bio_readpage_error: (new) logical=%llu, start=%llu, "
-			 "len=%llu\n", logical, start, failrec->len);
+
+		pr_debug("Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu\n",
+			 logical, start, failrec->len);
+
 		failrec->logical = logical;
 		free_extent_map(em);
 
@@ -2243,8 +2210,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		}
 	} else {
 		failrec = (struct io_failure_record *)(unsigned long)private;
-		pr_debug("bio_readpage_error: (found) logical=%llu, "
-			 "start=%llu, len=%llu, validation=%d\n",
+		pr_debug("Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu, validation=%d\n",
 			 failrec->logical, failrec->start, failrec->len,
 			 failrec->in_validation);
 		/*
@@ -2253,6 +2219,17 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		 * clean_io_failure() clean all those errors at once.
 		 */
 	}
+
+	*failrec_ret = failrec;
+
+	return 0;
+}
+
+int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
+			   struct io_failure_record *failrec, int failed_mirror)
+{
+	int num_copies;
+
 	num_copies = btrfs_num_copies(BTRFS_I(inode)->root->fs_info,
 				      failrec->logical, failrec->len);
 	if (num_copies == 1) {
@@ -2261,10 +2238,9 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		 * all the retry and error correction code that follows. no
 		 * matter what the error is, it is very likely to persist.
 		 */
-		pr_debug("bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
+		pr_debug("Check Repairable: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-		free_io_failure(inode, failrec);
-		return -EIO;
+		return 0;
 	}
 
 	/*
@@ -2284,7 +2260,6 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		BUG_ON(failrec->in_validation);
 		failrec->in_validation = 1;
 		failrec->this_mirror = failed_mirror;
-		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
 	} else {
 		/*
 		 * we're ready to fulfill a) and b) alongside. get a good copy
@@ -2300,22 +2275,32 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		failrec->this_mirror++;
 		if (failrec->this_mirror == failed_mirror)
 			failrec->this_mirror++;
-		read_mode = READ_SYNC;
 	}
 
 	if (failrec->this_mirror > num_copies) {
-		pr_debug("bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
+		pr_debug("Check Repairable: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-		free_io_failure(inode, failrec);
-		return -EIO;
+		return 0;
 	}
 
+	return 1;
+}
+
+
+struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
+				    struct io_failure_record *failrec,
+				    struct page *page, int pg_offset, int icsum,
+				    bio_end_io_t *endio_func)
+{
+	struct bio *bio;
+	struct btrfs_io_bio *btrfs_failed_bio;
+	struct btrfs_io_bio *btrfs_bio;
+
 	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
-	if (!bio) {
-		free_io_failure(inode, failrec);
-		return -EIO;
-	}
-	bio->bi_end_io = failed_bio->bi_end_io;
+	if (!bio)
+		return NULL;
+
+	bio->bi_end_io = endio_func;
 	bio->bi_iter.bi_sector = failrec->logical >> 9;
 	bio->bi_bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
 	bio->bi_iter.bi_size = 0;
@@ -2327,17 +2312,63 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 
 		btrfs_bio = btrfs_io_bio(bio);
 		btrfs_bio->csum = btrfs_bio->csum_inline;
-		phy_offset >>= inode->i_sb->s_blocksize_bits;
-		phy_offset *= csum_size;
-		memcpy(btrfs_bio->csum, btrfs_failed_bio->csum + phy_offset,
+		icsum *= csum_size;
+		memcpy(btrfs_bio->csum, btrfs_failed_bio->csum + icsum,
 		       csum_size);
 	}
 
-	bio_add_page(bio, page, failrec->len, start - page_offset(page));
+	bio_add_page(bio, page, failrec->len, pg_offset);
+
+	return bio;
+}
+
+/*
+ * this is a generic handler for readpage errors (default
+ * readpage_io_failed_hook). if other copies exist, read those and write back
+ * good data to the failed position. does not investigate in remapping the
+ * failed extent elsewhere, hoping the device will be smart enough to do this as
+ * needed
+ */
+
+static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
+			      struct page *page, u64 start, u64 end,
+			      int failed_mirror)
+{
+	struct io_failure_record *failrec;
+	struct inode *inode = page->mapping->host;
+	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+	struct bio *bio;
+	int read_mode;
+	int ret;
+
+	BUG_ON(failed_bio->bi_rw & REQ_WRITE);
+
+	ret = btrfs_get_io_failure_record(inode, start, end, &failrec);
+	if (ret)
+		return ret;
+
+	ret = btrfs_check_repairable(inode, failed_bio, failrec, failed_mirror);
+	if (!ret) {
+		free_io_failure(inode, failrec);
+		return -EIO;
+	}
+
+	if (failed_bio->bi_vcnt > 1)
+		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
+	else
+		read_mode = READ_SYNC;
+
+	phy_offset >>= inode->i_sb->s_blocksize_bits;
+	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
+				      start - page_offset(page),
+				      (int)phy_offset, failed_bio->bi_end_io);
+	if (!bio) {
+		free_io_failure(inode, failrec);
+		return -EIO;
+	}
 
-	pr_debug("bio_readpage_error: submitting new read[%#x] to "
-		 "this_mirror=%d, num_copies=%d, in_validation=%d\n", read_mode,
-		 failrec->this_mirror, num_copies, failrec->in_validation);
+	pr_debug("Repair Read Error: submitting new read[%#x] to this_mirror=%d, in_validation=%d\n",
+		 read_mode, failrec->this_mirror, failrec->in_validation);
 
 	ret = tree->ops->submit_bio_hook(inode, read_mode, bio,
 					 failrec->this_mirror,
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 844b4c5..75b621b 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -344,6 +344,34 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num);
+
+/*
+ * When IO fails, either with EIO or csum verification fails, we
+ * try other mirrors that might have a good copy of the data.  This
+ * io_failure_record is used to record state as we go through all the
+ * mirrors.  If another mirror has good data, the page is set up to date
+ * and things continue.  If a good mirror can't be found, the original
+ * bio end_io callback is called to indicate things have failed.
+ */
+struct io_failure_record {
+	struct page *page;
+	u64 start;
+	u64 len;
+	u64 logical;
+	unsigned long bio_flags;
+	int this_mirror;
+	int failed_mirror;
+	int in_validation;
+};
+
+int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
+				struct io_failure_record **failrec_ret);
+int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
+			   struct io_failure_record *failrec, int fail_mirror);
+struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
+				    struct io_failure_record *failrec,
+				    struct page *page, int pg_offset, int icsum,
+				    bio_end_io_t *endio_func);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 noinline u64 find_lock_delalloc_range(struct inode *inode,
 				      struct extent_io_tree *tree,
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 07/11] Btrfs: modify repair_io_failure and make it suit direct io
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
                                 ` (5 preceding siblings ...)
  2014-09-12 10:43               ` [PATCH v4 06/11] Btrfs: split bio_readpage_error into several functions Miao Xie
@ 2014-09-12 10:44               ` Miao Xie
  2014-09-12 10:44               ` [PATCH v4 08/11] Btrfs: modify clean_io_failure " Miao Xie
                                 ` (4 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-09-12 10:44 UTC (permalink / raw)
  To: linux-btrfs

The original repair_io_failure code was only usable for buffered reads: it
got some filesystem data from the page structure, which is safe for pages in
the page cache. But when we do a direct read, the pages in the bio are not in
the page cache, so there is no filesystem data in the page structure. In
order to implement direct read data repair, we need to modify
repair_io_failure and pass all the filesystem data it needs via function
parameters.
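
With the new parameter, a buffered read caller computes the in-page offset
from the page cache, as in this call site from the diff below:

	repair_io_failure(fs_info, start, failrec->len, failrec->logical,
			  page, start - page_offset(page),
			  failrec->failed_mirror);

while a direct IO caller, whose pages have no mapping to consult, passes the
offset it already knows (the direct read repair added later in this series
passes 0, because each repair bio covers one page from its start).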

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 8 +++++---
 fs/btrfs/extent_io.h | 2 +-
 fs/btrfs/scrub.c     | 1 +
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cf1de40..9fbc005 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1997,7 +1997,7 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
  */
 int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 			u64 length, u64 logical, struct page *page,
-			int mirror_num)
+			unsigned int pg_offset, int mirror_num)
 {
 	struct bio *bio;
 	struct btrfs_device *dev;
@@ -2036,7 +2036,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 		return -EIO;
 	}
 	bio->bi_bdev = dev->bdev;
-	bio_add_page(bio, page, length, start - page_offset(page));
+	bio_add_page(bio, page, length, pg_offset);
 
 	if (btrfsic_submit_bio_wait(WRITE_SYNC, bio)) {
 		/* try to remap that extent elsewhere? */
@@ -2067,7 +2067,8 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = extent_buffer_page(eb, i);
 		ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE,
-					start, p, mirror_num);
+					start, p, start - page_offset(p),
+					mirror_num);
 		if (ret)
 			break;
 		start += PAGE_CACHE_SIZE;
@@ -2127,6 +2128,7 @@ static int clean_io_failure(u64 start, struct page *page)
 		if (num_copies > 1)  {
 			repair_io_failure(fs_info, start, failrec->len,
 					  failrec->logical, page,
+					  start - page_offset(page),
 					  failrec->failed_mirror);
 		}
 	}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 75b621b..a82ecbc 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -340,7 +340,7 @@ struct btrfs_fs_info;
 
 int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 			u64 length, u64 logical, struct page *page,
-			int mirror_num);
+			unsigned int pg_offset, int mirror_num);
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index cce122b..3978529 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -682,6 +682,7 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
 		fs_info = BTRFS_I(inode)->root->fs_info;
 		ret = repair_io_failure(fs_info, offset, PAGE_SIZE,
 					fixup->logical, page,
+					offset - page_offset(page),
 					fixup->mirror_num);
 		unlock_page(page);
 		corrected = !ret;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 08/11] Btrfs: modify clean_io_failure and make it suit direct io
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
                                 ` (6 preceding siblings ...)
  2014-09-12 10:44               ` [PATCH v4 07/11] Btrfs: modify repair_io_failure and make it suit direct io Miao Xie
@ 2014-09-12 10:44               ` Miao Xie
  2014-09-12 10:44               ` [PATCH v4 09/11] Btrfs: Set real mirror number for read operation on RAID0/5/6 Miao Xie
                                 ` (3 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-09-12 10:44 UTC (permalink / raw)
  To: linux-btrfs

We could not use clean_io_failure in the direct IO path because it got the
filesystem information from the page structure, but the pages in a direct IO
bio do not carry that information. So we need to modify it and pass all the
information it needs by parameters.
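
The effect on the call sites (the first is from the diff below, the second
from the direct read repair added later in this series):

	/* Buffered read endio: the inode comes from the page cache. */
	clean_io_failure(inode, start, page, 0);

	/* Direct read repair endio: the inode travels in the completion
	 * context instead of the page. */
	clean_io_failure(done->inode, done->start, bvec->bv_page, 0);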

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 31 +++++++++++++++----------------
 fs/btrfs/extent_io.h |  6 +++---
 fs/btrfs/scrub.c     |  3 +--
 3 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9fbc005..94c5c04 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1995,10 +1995,10 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
  * currently, there can be no more than two copies of every data bit. thus,
  * exactly one rewrite is required.
  */
-int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
-			u64 length, u64 logical, struct page *page,
-			unsigned int pg_offset, int mirror_num)
+int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical,
+		      struct page *page, unsigned int pg_offset, int mirror_num)
 {
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	struct bio *bio;
 	struct btrfs_device *dev;
 	u64 map_length = 0;
@@ -2046,10 +2046,9 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 	}
 
 	printk_ratelimited_in_rcu(KERN_INFO
-			"BTRFS: read error corrected: ino %lu off %llu "
-		    "(dev %s sector %llu)\n", page->mapping->host->i_ino,
-		    start, rcu_str_deref(dev->name), sector);
-
+				  "BTRFS: read error corrected: ino %llu off %llu (dev %s sector %llu)\n",
+				  btrfs_ino(inode), start,
+				  rcu_str_deref(dev->name), sector);
 	bio_put(bio);
 	return 0;
 }
@@ -2066,9 +2065,10 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = extent_buffer_page(eb, i);
-		ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE,
-					start, p, start - page_offset(p),
-					mirror_num);
+
+		ret = repair_io_failure(root->fs_info->btree_inode, start,
+					PAGE_CACHE_SIZE, start, p,
+					start - page_offset(p), mirror_num);
 		if (ret)
 			break;
 		start += PAGE_CACHE_SIZE;
@@ -2081,12 +2081,12 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
  * each time an IO finishes, we do a fast check in the IO failure tree
  * to see if we need to process or clean up an io_failure_record
  */
-static int clean_io_failure(u64 start, struct page *page)
+static int clean_io_failure(struct inode *inode, u64 start,
+			    struct page *page, unsigned int pg_offset)
 {
 	u64 private;
 	u64 private_failure;
 	struct io_failure_record *failrec;
-	struct inode *inode = page->mapping->host;
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	struct extent_state *state;
 	int num_copies;
@@ -2126,10 +2126,9 @@ static int clean_io_failure(u64 start, struct page *page)
 		num_copies = btrfs_num_copies(fs_info, failrec->logical,
 					      failrec->len);
 		if (num_copies > 1)  {
-			repair_io_failure(fs_info, start, failrec->len,
+			repair_io_failure(inode, start, failrec->len,
 					  failrec->logical, page,
-					  start - page_offset(page),
-					  failrec->failed_mirror);
+					  pg_offset, failrec->failed_mirror);
 		}
 	}
 
@@ -2538,7 +2537,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 			if (ret)
 				uptodate = 0;
 			else
-				clean_io_failure(start, page);
+				clean_io_failure(inode, start, page, 0);
 		}
 
 		if (likely(uptodate))
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index a82ecbc..bf0597f 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -338,9 +338,9 @@ struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask);
 
 struct btrfs_fs_info;
 
-int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
-			u64 length, u64 logical, struct page *page,
-			unsigned int pg_offset, int mirror_num);
+int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical,
+		      struct page *page, unsigned int pg_offset,
+		      int mirror_num);
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 3978529..d061949 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -679,8 +679,7 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
 			ret = -EIO;
 			goto out;
 		}
-		fs_info = BTRFS_I(inode)->root->fs_info;
-		ret = repair_io_failure(fs_info, offset, PAGE_SIZE,
+		ret = repair_io_failure(inode, offset, PAGE_SIZE,
 					fixup->logical, page,
 					offset - page_offset(page),
 					fixup->mirror_num);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 09/11] Btrfs: Set real mirror number for read operation on RAID0/5/6
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
                                 ` (7 preceding siblings ...)
  2014-09-12 10:44               ` [PATCH v4 08/11] Btrfs: modify clean_io_failure " Miao Xie
@ 2014-09-12 10:44               ` Miao Xie
  2014-09-12 10:44               ` [PATCH v4 10/11] Btrfs: implement repair function when direct read fails Miao Xie
                                 ` (2 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-09-12 10:44 UTC (permalink / raw)
  To: linux-btrfs

We need the real mirror number for RAID0/5/6 when reading data; otherwise, if
a read error happens, we would pass 0 as the number of the mirror on which
the IO error happened. That is wrong and would cause the filesystem to read
the data from the corrupted mirror again.
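
The fix itself is small; for RAID0 it boils down to the following (condensed
from the diff below):

	/* Plain read on RAID0: there is exactly one copy of the data,
	 * so report mirror 1 instead of leaving mirror_num at 0. */
	if (!(rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS)))
		mirror_num = 1;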

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/volumes.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1aacf5f..4856547 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5073,6 +5073,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			num_stripes = min_t(u64, map->num_stripes,
 					    stripe_nr_end - stripe_nr_orig);
 		stripe_index = do_div(stripe_nr, map->num_stripes);
+		if (!(rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS)))
+			mirror_num = 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
 		if (rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS))
 			num_stripes = map->num_stripes;
@@ -5176,6 +5178,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			/* We distribute the parity blocks across stripes */
 			tmp = stripe_nr + stripe_index;
 			stripe_index = do_div(tmp, map->num_stripes);
+			if (!(rw & (REQ_WRITE | REQ_DISCARD |
+				    REQ_GET_READ_MIRRORS)) && mirror_num <= 1)
+				mirror_num = 1;
 		}
 	} else {
 		/*
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 10/11] Btrfs: implement repair function when direct read fails
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
                                 ` (8 preceding siblings ...)
  2014-09-12 10:44               ` [PATCH v4 09/11] Btrfs: Set real mirror number for read operation on RAID0/5/6 Miao Xie
@ 2014-09-12 10:44               ` Miao Xie
  2014-09-12 10:44               ` [PATCH v4 11/11] Btrfs: cleanup the read failure record after write or when the inode is freeing Miao Xie
  2014-09-12 14:50               ` [PATCH v4 00/11] Implement the data repair function for direct read Chris Mason
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-09-12 10:44 UTC (permalink / raw)
  To: linux-btrfs

This patch implements the data repair function for when a direct read fails.

The details of the implementation are:
- When we find the data is not right, we try to read the data from the other
  mirror (the retry loop is sketched after this list).
- When the IO on the mirror ends, we insert the endio work into the dedicated
  btrfs workqueue, not the common read endio workqueue, because the original
  endio work is still blocked in the btrfs endio workqueue; if we inserted the
  endio work of the IO on the mirror into that workqueue, a deadlock would
  happen.
- After we get the right data, we write it back to the corrupted mirror.
- If the data on the new mirror is still corrupted, we try the next mirror
  until we read the right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.
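
A condensed sketch of the retry loop referred to above, as it appears in the
nodatasum path of this patch (error handling trimmed, see the diff below):

	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
	try_again:
		done.uptodate = 0;
		done.start = start;
		init_completion(&done.done);

		/* Re-read this segment from the next mirror... */
		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page,
				     start, start + bvec->bv_len - 1,
				     io_bio->mirror_num,
				     btrfs_retry_endio_nocsum, &done);
		if (ret)
			return ret;

		/* ...wait for the repair bio's endio to run... */
		wait_for_completion(&done.done);

		/* ...and keep trying mirrors until one reads clean;
		 * dio_read_error() returns -EIO once all are exhausted. */
		if (!done.uptodate)
			goto try_again;

		start += bvec->bv_len;
	}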

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v3 -> v4:
- Use a dedicated btrfs workqueue instead of the system workqueue to
  deal with the completed repair bio; this suggestion came from Chris.

Changelog v1 -> v3:
- None
---
 fs/btrfs/async-thread.c |   1 +
 fs/btrfs/async-thread.h |   1 +
 fs/btrfs/btrfs_inode.h  |   2 +-
 fs/btrfs/ctree.h        |   1 +
 fs/btrfs/disk-io.c      |  11 +-
 fs/btrfs/disk-io.h      |   1 +
 fs/btrfs/extent_io.c    |  12 ++-
 fs/btrfs/extent_io.h    |   5 +-
 fs/btrfs/inode.c        | 276 ++++++++++++++++++++++++++++++++++++++++++++----
 9 files changed, 281 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index fbd76de..2da0a66 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -74,6 +74,7 @@ BTRFS_WORK_HELPER(endio_helper);
 BTRFS_WORK_HELPER(endio_meta_helper);
 BTRFS_WORK_HELPER(endio_meta_write_helper);
 BTRFS_WORK_HELPER(endio_raid56_helper);
+BTRFS_WORK_HELPER(endio_repair_helper);
 BTRFS_WORK_HELPER(rmw_helper);
 BTRFS_WORK_HELPER(endio_write_helper);
 BTRFS_WORK_HELPER(freespace_write_helper);
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index e9e31c9..e386c29 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -53,6 +53,7 @@ BTRFS_WORK_HELPER_PROTO(endio_helper);
 BTRFS_WORK_HELPER_PROTO(endio_meta_helper);
 BTRFS_WORK_HELPER_PROTO(endio_meta_write_helper);
 BTRFS_WORK_HELPER_PROTO(endio_raid56_helper);
+BTRFS_WORK_HELPER_PROTO(endio_repair_helper);
 BTRFS_WORK_HELPER_PROTO(rmw_helper);
 BTRFS_WORK_HELPER_PROTO(endio_write_helper);
 BTRFS_WORK_HELPER_PROTO(freespace_write_helper);
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 4d30947..7a7521c 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -271,7 +271,7 @@ struct btrfs_dio_private {
 	 * The original bio may be split into several sub-bios; this
 	 * callback is invoked during the endio of those sub-bios.
 	 */
-	int (*subio_endio)(struct inode *, struct btrfs_io_bio *);
+	int (*subio_endio)(struct inode *, struct btrfs_io_bio *, int);
 };
 
 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7b54cd9..63acfd8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1538,6 +1538,7 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *endio_workers;
 	struct btrfs_workqueue *endio_meta_workers;
 	struct btrfs_workqueue *endio_raid56_workers;
+	struct btrfs_workqueue *endio_repair_workers;
 	struct btrfs_workqueue *rmw_workers;
 	struct btrfs_workqueue *endio_meta_write_workers;
 	struct btrfs_workqueue *endio_write_workers;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ff3ee22..1594d91 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -713,7 +713,11 @@ static void end_workqueue_bio(struct bio *bio, int err)
 			func = btrfs_endio_write_helper;
 		}
 	} else {
-		if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56) {
+		if (unlikely(end_io_wq->metadata ==
+			     BTRFS_WQ_ENDIO_DIO_REPAIR)) {
+			wq = fs_info->endio_repair_workers;
+			func = btrfs_endio_repair_helper;
+		} else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56) {
 			wq = fs_info->endio_raid56_workers;
 			func = btrfs_endio_raid56_helper;
 		} else if (end_io_wq->metadata) {
@@ -741,6 +745,7 @@ int btrfs_bio_wq_end_io(struct btrfs_fs_info *info, struct bio *bio,
 			int metadata)
 {
 	struct end_io_wq *end_io_wq;
+
 	end_io_wq = kmalloc(sizeof(*end_io_wq), GFP_NOFS);
 	if (!end_io_wq)
 		return -ENOMEM;
@@ -2059,6 +2064,7 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->endio_workers);
 	btrfs_destroy_workqueue(fs_info->endio_meta_workers);
 	btrfs_destroy_workqueue(fs_info->endio_raid56_workers);
+	btrfs_destroy_workqueue(fs_info->endio_repair_workers);
 	btrfs_destroy_workqueue(fs_info->rmw_workers);
 	btrfs_destroy_workqueue(fs_info->endio_meta_write_workers);
 	btrfs_destroy_workqueue(fs_info->endio_write_workers);
@@ -2576,6 +2582,8 @@ int open_ctree(struct super_block *sb,
 		btrfs_alloc_workqueue("endio-meta-write", flags, max_active, 2);
 	fs_info->endio_raid56_workers =
 		btrfs_alloc_workqueue("endio-raid56", flags, max_active, 4);
+	fs_info->endio_repair_workers =
+		btrfs_alloc_workqueue("endio-repair", flags, 1, 0);
 	fs_info->rmw_workers =
 		btrfs_alloc_workqueue("rmw", flags, max_active, 2);
 	fs_info->endio_write_workers =
@@ -2597,6 +2605,7 @@ int open_ctree(struct super_block *sb,
 	      fs_info->submit_workers && fs_info->flush_workers &&
 	      fs_info->endio_workers && fs_info->endio_meta_workers &&
 	      fs_info->endio_meta_write_workers &&
+	      fs_info->endio_repair_workers &&
 	      fs_info->endio_write_workers && fs_info->endio_raid56_workers &&
 	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
 	      fs_info->caching_workers && fs_info->readahead_workers &&
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 52a17db..14d06ee 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -30,6 +30,7 @@ enum {
 	BTRFS_WQ_ENDIO_METADATA = 1,
 	BTRFS_WQ_ENDIO_FREE_SPACE = 2,
 	BTRFS_WQ_ENDIO_RAID56 = 3,
+	BTRFS_WQ_ENDIO_DIO_REPAIR = 4,
 };
 
 static inline u64 btrfs_sb_offset(int mirror)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 94c5c04..86dc352 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1962,7 +1962,7 @@ static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
 		SetPageUptodate(page);
 }
 
-static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
+int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
 	int ret;
 	int err = 0;
@@ -2081,8 +2081,8 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
  * each time an IO finishes, we do a fast check in the IO failure tree
  * to see if we need to process or clean up an io_failure_record
  */
-static int clean_io_failure(struct inode *inode, u64 start,
-			    struct page *page, unsigned int pg_offset)
+int clean_io_failure(struct inode *inode, u64 start, struct page *page,
+		     unsigned int pg_offset)
 {
 	u64 private;
 	u64 private_failure;
@@ -2291,7 +2291,7 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
 				    struct io_failure_record *failrec,
 				    struct page *page, int pg_offset, int icsum,
-				    bio_end_io_t *endio_func)
+				    bio_end_io_t *endio_func, void *data)
 {
 	struct bio *bio;
 	struct btrfs_io_bio *btrfs_failed_bio;
@@ -2305,6 +2305,7 @@ struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
 	bio->bi_iter.bi_sector = failrec->logical >> 9;
 	bio->bi_bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
 	bio->bi_iter.bi_size = 0;
+	bio->bi_private = data;
 
 	btrfs_failed_bio = btrfs_io_bio(failed_bio);
 	if (btrfs_failed_bio->csum) {
@@ -2362,7 +2363,8 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	phy_offset >>= inode->i_sb->s_blocksize_bits;
 	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
 				      start - page_offset(page),
-				      (int)phy_offset, failed_bio->bi_end_io);
+				      (int)phy_offset, failed_bio->bi_end_io,
+				      NULL);
 	if (!bio) {
 		free_io_failure(inode, failrec);
 		return -EIO;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index bf0597f..176a4b1 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -341,6 +341,8 @@ struct btrfs_fs_info;
 int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical,
 		      struct page *page, unsigned int pg_offset,
 		      int mirror_num);
+int clean_io_failure(struct inode *inode, u64 start, struct page *page,
+		     unsigned int pg_offset);
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num);
@@ -371,7 +373,8 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
 				    struct io_failure_record *failrec,
 				    struct page *page, int pg_offset, int icsum,
-				    bio_end_io_t *endio_func);
+				    bio_end_io_t *endio_func, void *data);
+int free_io_failure(struct inode *inode, struct io_failure_record *rec);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 noinline u64 find_lock_delalloc_range(struct inode *inode,
 				      struct extent_io_tree *tree,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cf79f79..bc8cdaf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7198,30 +7198,267 @@ unlock_err:
 	return ret;
 }
 
-static int btrfs_subio_endio_read(struct inode *inode,
-				  struct btrfs_io_bio *io_bio)
+static inline int submit_dio_repair_bio(struct inode *inode, struct bio *bio,
+					int rw, int mirror_num)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	int ret;
+
+	BUG_ON(rw & REQ_WRITE);
+
+	bio_get(bio);
+
+	ret = btrfs_bio_wq_end_io(root->fs_info, bio,
+				  BTRFS_WQ_ENDIO_DIO_REPAIR);
+	if (ret)
+		goto err;
+
+	ret = btrfs_map_bio(root, rw, bio, mirror_num, 0);
+err:
+	bio_put(bio);
+	return ret;
+}
+
+static int btrfs_check_dio_repairable(struct inode *inode,
+				      struct bio *failed_bio,
+				      struct io_failure_record *failrec,
+				      int failed_mirror)
+{
+	int num_copies;
+
+	num_copies = btrfs_num_copies(BTRFS_I(inode)->root->fs_info,
+				      failrec->logical, failrec->len);
+	if (num_copies == 1) {
+		/*
+		 * we only have a single copy of the data, so don't bother with
+		 * all the retry and error correction code that follows. no
+		 * matter what the error is, it is very likely to persist.
+		 */
+		pr_debug("Check DIO Repairable: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
+			 num_copies, failrec->this_mirror, failed_mirror);
+		return 0;
+	}
+
+	failrec->failed_mirror = failed_mirror;
+	failrec->this_mirror++;
+	if (failrec->this_mirror == failed_mirror)
+		failrec->this_mirror++;
+
+	if (failrec->this_mirror > num_copies) {
+		pr_debug("Check DIO Repairable: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
+			 num_copies, failrec->this_mirror, failed_mirror);
+		return 0;
+	}
+
+	return 1;
+}
+
+static int dio_read_error(struct inode *inode, struct bio *failed_bio,
+			  struct page *page, u64 start, u64 end,
+			  int failed_mirror, bio_end_io_t *repair_endio,
+			  void *repair_arg)
+{
+	struct io_failure_record *failrec;
+	struct bio *bio;
+	int isector;
+	int read_mode;
+	int ret;
+
+	BUG_ON(failed_bio->bi_rw & REQ_WRITE);
+
+	ret = btrfs_get_io_failure_record(inode, start, end, &failrec);
+	if (ret)
+		return ret;
+
+	ret = btrfs_check_dio_repairable(inode, failed_bio, failrec,
+					 failed_mirror);
+	if (!ret) {
+		free_io_failure(inode, failrec);
+		return -EIO;
+	}
+
+	if (failed_bio->bi_vcnt > 1)
+		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
+	else
+		read_mode = READ_SYNC;
+
+	isector = start - btrfs_io_bio(failed_bio)->logical;
+	isector >>= inode->i_sb->s_blocksize_bits;
+	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
+				      0, isector, repair_endio, repair_arg);
+	if (!bio) {
+		free_io_failure(inode, failrec);
+		return -EIO;
+	}
+
+	btrfs_debug(BTRFS_I(inode)->root->fs_info,
+		    "Repair DIO Read Error: submitting new dio read[%#x] to this_mirror=%d, in_validation=%d\n",
+		    read_mode, failrec->this_mirror, failrec->in_validation);
+
+	ret = submit_dio_repair_bio(inode, bio, read_mode,
+				    failrec->this_mirror);
+	if (ret) {
+		free_io_failure(inode, failrec);
+		bio_put(bio);
+	}
+
+	return ret;
+}
+
+struct btrfs_retry_complete {
+	struct completion done;
+	struct inode *inode;
+	u64 start;
+	int uptodate;
+};
+
+static void btrfs_retry_endio_nocsum(struct bio *bio, int err)
+{
+	struct btrfs_retry_complete *done = bio->bi_private;
+	struct bio_vec *bvec;
+	int i;
+
+	if (err)
+		goto end;
+
+	done->uptodate = 1;
+	bio_for_each_segment_all(bvec, bio, i)
+		clean_io_failure(done->inode, done->start, bvec->bv_page, 0);
+end:
+	complete(&done->done);
+	bio_put(bio);
+}
+
+static int __btrfs_correct_data_nocsum(struct inode *inode,
+				       struct btrfs_io_bio *io_bio)
 {
 	struct bio_vec *bvec;
+	struct btrfs_retry_complete done;
 	u64 start;
 	int i;
 	int ret;
-	int err = 0;
 
-	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
-		return 0;
+	start = io_bio->logical;
+	done.inode = inode;
+
+	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
+try_again:
+		done.uptodate = 0;
+		done.start = start;
+		init_completion(&done.done);
+
+		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
+				     start + bvec->bv_len - 1,
+				     io_bio->mirror_num,
+				     btrfs_retry_endio_nocsum, &done);
+		if (ret)
+			return ret;
+
+		wait_for_completion(&done.done);
+
+		if (!done.uptodate) {
+			/* We might have another mirror, so try again */
+			goto try_again;
+		}
+
+		start += bvec->bv_len;
+	}
+
+	return 0;
+}
+
+static void btrfs_retry_endio(struct bio *bio, int err)
+{
+	struct btrfs_retry_complete *done = bio->bi_private;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	struct bio_vec *bvec;
+	int uptodate;
+	int ret;
+	int i;
+
+	if (err)
+		goto end;
+
+	uptodate = 1;
+	bio_for_each_segment_all(bvec, bio, i) {
+		ret = __readpage_endio_check(done->inode, io_bio, i,
+					     bvec->bv_page, 0,
+					     done->start, bvec->bv_len);
+		if (!ret)
+			clean_io_failure(done->inode, done->start,
+					 bvec->bv_page, 0);
+		else
+			uptodate = 0;
+	}
+
+	done->uptodate = uptodate;
+end:
+	complete(&done->done);
+	bio_put(bio);
+}
 
+static int __btrfs_subio_endio_read(struct inode *inode,
+				    struct btrfs_io_bio *io_bio, int err)
+{
+	struct bio_vec *bvec;
+	struct btrfs_retry_complete done;
+	u64 start;
+	u64 offset = 0;
+	int i;
+	int ret;
+
+	err = 0;
 	start = io_bio->logical;
+	done.inode = inode;
+
 	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
 		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
 					     0, start, bvec->bv_len);
-		if (ret)
-			err = -EIO;
+		if (likely(!ret))
+			goto next;
+try_again:
+		done.uptodate = 0;
+		done.start = start;
+		init_completion(&done.done);
+
+		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
+				     start + bvec->bv_len - 1,
+				     io_bio->mirror_num,
+				     btrfs_retry_endio, &done);
+		if (ret) {
+			err = ret;
+			goto next;
+		}
+
+		wait_for_completion(&done.done);
+
+		if (!done.uptodate) {
+			/* We might have another mirror, so try again */
+			goto try_again;
+		}
+next:
+		offset += bvec->bv_len;
 		start += bvec->bv_len;
 	}
 
 	return err;
 }
 
+static int btrfs_subio_endio_read(struct inode *inode,
+				  struct btrfs_io_bio *io_bio, int err)
+{
+	bool skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
+
+	if (skip_csum) {
+		if (unlikely(err))
+			return __btrfs_correct_data_nocsum(inode, io_bio);
+		else
+			return 0;
+	} else {
+		return __btrfs_subio_endio_read(inode, io_bio, err);
+	}
+}
+
 static void btrfs_endio_direct_read(struct bio *bio, int err)
 {
 	struct btrfs_dio_private *dip = bio->bi_private;
@@ -7229,8 +7466,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	struct bio *dio_bio;
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
 
-	if (!err && (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED))
-		err = btrfs_subio_endio_read(inode, io_bio);
+	if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED)
+		err = btrfs_subio_endio_read(inode, io_bio, err);
 
 	unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
 		      dip->logical_offset + dip->bytes - 1);
@@ -7309,19 +7546,16 @@ static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
 static void btrfs_end_dio_bio(struct bio *bio, int err)
 {
 	struct btrfs_dio_private *dip = bio->bi_private;
-	int ret;
 
-	if (err) {
-		btrfs_err(BTRFS_I(dip->inode)->root->fs_info,
-			  "direct IO failed ino %llu rw %lu sector %#Lx len %u err no %d",
-		      btrfs_ino(dip->inode), bio->bi_rw,
-		      (unsigned long long)bio->bi_iter.bi_sector,
-		      bio->bi_iter.bi_size, err);
-	} else if (dip->subio_endio) {
-		ret = dip->subio_endio(dip->inode, btrfs_io_bio(bio));
-		if (ret)
-			err = ret;
-	}
+	if (err)
+		btrfs_warn(BTRFS_I(dip->inode)->root->fs_info,
+			   "direct IO failed ino %llu rw %lu sector %#Lx len %u err no %d",
+			   btrfs_ino(dip->inode), bio->bi_rw,
+			   (unsigned long long)bio->bi_iter.bi_sector,
+			   bio->bi_iter.bi_size, err);
+
+	if (dip->subio_endio)
+		err = dip->subio_endio(dip->inode, btrfs_io_bio(bio), err);
 
 	if (err) {
 		dip->errors = 1;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v4 11/11] Btrfs: cleanup the read failure record after write or when the inode is freeing
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
                                 ` (9 preceding siblings ...)
  2014-09-12 10:44               ` [PATCH v4 10/11] Btrfs: implement repair function when direct read fails Miao Xie
@ 2014-09-12 10:44               ` Miao Xie
  2014-09-12 14:50               ` [PATCH v4 00/11] Implement the data repair function for direct read Chris Mason
  11 siblings, 0 replies; 49+ messages in thread
From: Miao Xie @ 2014-09-12 10:44 UTC (permalink / raw)
  To: linux-btrfs

After the data is written successfully, we should clean up the read failure
record in that range, because:
- If data COW is set for the file, the range that the failure record pointed
  to is mapped to a new place, so the record is invalid.
- If no data COW is set for the file and there is no error during the write,
  the corrupted data is corrected, so the failure record can be removed. And
  if some errors happen on the mirrors, we needn't worry about it either,
  because the failure record will be recreated if we read the same place
  again.

Sometimes we may fail to correct the data, so failure records will be left in
the tree; we need to free them when we free the inode, or a memory leak
happens (the cleanup walk is sketched below).
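
In sketch form, the cleanup walks the io_failure_tree and frees every record
in the given range (condensed from the diff below):

	spin_lock(&failure_tree->lock);
	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
	while (state && state->start <= end) {
		next = next_state(state);

		/* Each extent state stashes its io_failure_record
		 * in ->private. */
		failrec = (struct io_failure_record *)state->private;
		free_extent_state(state);
		kfree(failrec);

		state = next;
	}
	spin_unlock(&failure_tree->lock);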

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 34 ++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_io.h |  1 +
 fs/btrfs/inode.c     |  6 ++++++
 3 files changed, 41 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 86dc352..5427fd5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2138,6 +2138,40 @@ out:
 	return 0;
 }
 
+/*
+ * Can be called when
+ * - hold extent lock
+ * - under ordered extent
+ * - the inode is freeing
+ */
+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end)
+{
+	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
+	struct io_failure_record *failrec;
+	struct extent_state *state, *next;
+
+	if (RB_EMPTY_ROOT(&failure_tree->state))
+		return;
+
+	spin_lock(&failure_tree->lock);
+	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
+	while (state) {
+		if (state->start > end)
+			break;
+
+		ASSERT(state->end <= end);
+
+		next = next_state(state);
+
+		failrec = (struct io_failure_record *)state->private;
+		free_extent_state(state);
+		kfree(failrec);
+
+		state = next;
+	}
+	spin_unlock(&failure_tree->lock);
+}
+
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
 				struct io_failure_record **failrec_ret)
 {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 176a4b1..5e91fb9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -366,6 +366,7 @@ struct io_failure_record {
 	int in_validation;
 };
 
+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end);
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
 				struct io_failure_record **failrec_ret);
 int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bc8cdaf..c591af5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2697,6 +2697,10 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		goto out;
 	}
 
+	btrfs_free_io_failure_record(inode, ordered_extent->file_offset,
+				     ordered_extent->file_offset +
+				     ordered_extent->len - 1);
+
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
 		truncated = true;
 		logical_len = ordered_extent->truncated_len;
@@ -4792,6 +4796,8 @@ void btrfs_evict_inode(struct inode *inode)
 	/* do we really want it for ->i_nlink > 0 and zero btrfs_root_refs? */
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
+	btrfs_free_io_failure_record(inode, 0, (u64)-1);
+
 	if (root->fs_info->log_root_recovering) {
 		BUG_ON(test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
 				 &BTRFS_I(inode)->runtime_flags));
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v4 00/11] Implement the data repair function for direct read
  2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
                                 ` (10 preceding siblings ...)
  2014-09-12 10:44               ` [PATCH v4 11/11] Btrfs: cleanup the read failure record after write or when the inode is freeing Miao Xie
@ 2014-09-12 14:50               ` Chris Mason
  11 siblings, 0 replies; 49+ messages in thread
From: Chris Mason @ 2014-09-12 14:50 UTC (permalink / raw)
  To: Miao Xie, linux-btrfs



On 09/12/2014 06:43 AM, Miao Xie wrote:
> This patchset implements the data repair function for direct read; it
> is implemented like the buffered read path:
> 1. When we find the data is not right, we try to read the data from
>    another mirror.
> 2. When the I/O on the mirror ends, we insert the endio work into a
>    dedicated btrfs workqueue, not the common read endio workqueue,
>    because the original endio work is still blocked in the btrfs
>    endio workqueue; if we inserted the endio work of the I/O on the
>    mirror into that same workqueue, a deadlock would happen.
> 3. If we get the right data, we write it back to repair the corrupted
>    mirror.
> 4. If the data on the new mirror is still corrupted, we try the next
>    mirror until we read the right data or all the mirrors are
>    traversed.
> 5. After the above work, we set the uptodate flag according to the
>    result.
> 
> The difference is that a direct read may be split into several small
> I/Os; in order to get the number of the mirror on which an I/O error
> happens, we have to do the data check and repair in the end-I/O
> function of those sub-I/O requests.
> 
> Besides that, we also fix some bugs of direct I/O.
> 
> Changelog v3 -> v4:
> - Remove the 1st patch, which has been applied to the upstream kernel.
> - Use a dedicated btrfs workqueue instead of the system workqueue to
>   deal with the completed repair bio; this suggestion came from Chris.
> - Rebase the patchset onto the integration branch of Chris's git tree.

Perfect, thank you.

-chris

^ permalink raw reply	[flat|nested] 49+ messages in thread
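
The five numbered steps quoted above reduce to a retry loop over
mirrors. The sketch below models just that loop under stated
assumptions: read_mirror(), csum_ok() and write_mirror() are
hypothetical stand-ins for the bio submission, checksum verification
and repair-write paths, a fake two-mirror array replaces real devices,
and the sub-bio splitting and dedicated endio workqueue that the
patchset actually deals with are deliberately left out.

/*
 * Sketch of the mirror-retry repair flow described above. All names
 * are hypothetical stand-ins, not btrfs functions.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NUM_MIRRORS	2
#define BLOCK_SIZE	16

static char mirrors[NUM_MIRRORS][BLOCK_SIZE] = {
	"garbage-garbage",	/* mirror 0: corrupted copy */
	"expected-data!!",	/* mirror 1: good copy */
};
static const char good[BLOCK_SIZE] = "expected-data!!";

static void read_mirror(int mirror, char *buf)
{
	memcpy(buf, mirrors[mirror], BLOCK_SIZE);
}

/* stand-in for the per-block checksum verification */
static bool csum_ok(const char *buf)
{
	return memcmp(buf, good, BLOCK_SIZE) == 0;
}

static void write_mirror(int mirror, const char *buf)
{
	memcpy(mirrors[mirror], buf, BLOCK_SIZE);
}

/*
 * Steps 1-5 from the cover letter: try mirrors in turn (1, 4), write
 * the good copy back over the corrupted mirror (3), and report the
 * uptodate result (5). Step 2's workqueue handoff is omitted.
 */
static bool read_and_repair(char *buf)
{
	int failed_mirror = -1;
	int mirror;

	for (mirror = 0; mirror < NUM_MIRRORS; mirror++) {
		read_mirror(mirror, buf);
		if (csum_ok(buf)) {
			if (failed_mirror >= 0)
				write_mirror(failed_mirror, buf);
			return true;
		}
		failed_mirror = mirror;
	}
	return false;	/* every mirror returned bad data */
}

int main(void)
{
	char buf[BLOCK_SIZE];

	if (read_and_repair(buf))
		printf("read ok, mirror 0 now holds: %s\n", mirrors[0]);
	else
		printf("unrecoverable corruption\n");
	return 0;
}

The failed_mirror bookkeeping is the heart of step 3: once a good copy
is found, the loop knows which earlier mirror returned bad data and can
rewrite it in place before reporting the read as uptodate.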

end of thread

Thread overview: 49+ messages
2014-06-28 11:34 [PATCH 00/12] Implement the data repair function for direct read Miao Xie
2014-06-28 11:34 ` [PATCH 01/12] Btrfs: fix put dio bio twice when we submit dio bio fail Miao Xie
2014-06-28 11:34 ` [PATCH 02/12] Btrfs: load checksum data once when submitting a direct read io Miao Xie
2014-07-28 17:24   ` Filipe David Manana
2014-07-29  1:56     ` Miao Xie
2014-06-28 11:34 ` [PATCH 03/12] Btrfs: cleanup similar code of the buffered data data check and dio read data check Miao Xie
2014-06-28 11:34 ` [PATCH 04/12] Btrfs: do file data check by sub-bio's self Miao Xie
2014-06-28 11:34 ` [PATCH 05/12] Btrfs: fix missing error handler if submiting re-read bio fails Miao Xie
2014-06-28 11:34 ` [PATCH 06/12] Btrfs: Cleanup unused variant and argument of IO failure handlers Miao Xie
2014-06-28 11:34 ` [PATCH 07/12] Btrfs: split bio_readpage_error into several functions Miao Xie
2014-06-28 11:34 ` [PATCH 08/12] Btrfs: modify repair_io_failure and make it suit direct io Miao Xie
2014-06-28 11:34 ` [PATCH 09/12] Btrfs: modify clean_io_failure " Miao Xie
2014-06-28 11:35 ` [PATCH 10/12] Btrfs: Set real mirror number for read operation on RAID0/5/6 Miao Xie
2014-06-28 11:35 ` [PATCH 11/12] Btrfs: implement repair function when direct read fails Miao Xie
2014-06-28 11:35 ` [PATCH 12/12] Btrfs: cleanup the read failure record after write or when the inode is freeing Miao Xie
2014-07-29  9:23 ` [PATCH v2 00/12] Implement the data repair function for direct read Miao Xie
2014-07-29  9:23   ` [PATCH v2 01/12] Btrfs: fix put dio bio twice when we submit dio bio fail Miao Xie
2014-07-29  9:24   ` [PATCH v2 02/12] Btrfs: load checksum data once when submitting a direct read io Miao Xie
2014-08-08  0:32     ` Filipe David Manana
2014-08-08  9:22       ` Miao Xie
2014-08-08  9:23       ` [PATCH v3 " Miao Xie
2014-07-29  9:24   ` [PATCH v2 03/12] Btrfs: cleanup similar code of the buffered data data check and dio read data check Miao Xie
2014-07-29  9:24   ` [PATCH v2 04/12] Btrfs: do file data check by sub-bio's self Miao Xie
2014-07-29  9:24   ` [PATCH v2 05/12] Btrfs: fix missing error handler if submiting re-read bio fails Miao Xie
2014-07-29  9:24   ` [PATCH v2 06/12] Btrfs: Cleanup unused variant and argument of IO failure handlers Miao Xie
2014-07-29  9:24   ` [PATCH v2 07/12] Btrfs: split bio_readpage_error into several functions Miao Xie
2014-07-29  9:24   ` [PATCH v2 08/12] Btrfs: modify repair_io_failure and make it suit direct io Miao Xie
2014-07-29  9:24   ` [PATCH v2 09/12] Btrfs: modify clean_io_failure " Miao Xie
2014-07-29  9:24   ` [PATCH v2 10/12] Btrfs: Set real mirror number for read operation on RAID0/5/6 Miao Xie
2014-07-29  9:24   ` [PATCH v2 11/12] Btrfs: implement repair function when direct read fails Miao Xie
2014-08-29 18:31     ` Chris Mason
2014-09-01  6:56       ` Miao Xie
2014-09-02 12:33         ` Liu Bo
2014-09-02 13:05           ` Chris Mason
2014-09-03  9:02             ` Miao Xie
2014-09-12 10:43             ` [PATCH v4 00/11] Implement the data repair function for direct read Miao Xie
2014-09-12 10:43               ` [PATCH v4 01/11] Btrfs: load checksum data once when submitting a direct read io Miao Xie
2014-09-12 10:43               ` [PATCH v4 02/11] Btrfs: cleanup similar code of the buffered data data check and dio read data check Miao Xie
2014-09-12 10:43               ` [PATCH v4 03/11] Btrfs: do file data check by sub-bio's self Miao Xie
2014-09-12 10:43               ` [PATCH v4 04/11] Btrfs: fix missing error handler if submiting re-read bio fails Miao Xie
2014-09-12 10:43               ` [PATCH v4 05/11] Btrfs: Cleanup unused variant and argument of IO failure handlers Miao Xie
2014-09-12 10:43               ` [PATCH v4 06/11] Btrfs: split bio_readpage_error into several functions Miao Xie
2014-09-12 10:44               ` [PATCH v4 07/11] Btrfs: modify repair_io_failure and make it suit direct io Miao Xie
2014-09-12 10:44               ` [PATCH v4 08/11] Btrfs: modify clean_io_failure " Miao Xie
2014-09-12 10:44               ` [PATCH v4 09/11] Btrfs: Set real mirror number for read operation on RAID0/5/6 Miao Xie
2014-09-12 10:44               ` [PATCH v4 10/11] Btrfs: implement repair function when direct read fails Miao Xie
2014-09-12 10:44               ` [PATCH v4 11/11] Btrfs: cleanup the read failure record after write or when the inode is freeing Miao Xie
2014-09-12 14:50               ` [PATCH v4 00/11] Implement the data repair function for direct read Chris Mason
2014-07-29  9:24   ` [PATCH v2 12/12] Btrfs: cleanup the read failure record after write or when the inode is freeing Miao Xie
