* [PATCH RFC] btrfs: fix read corruption from disks of different generation
@ 2019-03-19 11:35 Anand Jain
  2019-03-19 11:35 ` [PATCH] fstests: btrfs test read " Anand Jain
  2019-03-19 12:07 ` [PATCH RFC] btrfs: fix read corruption " Qu Wenruo
  0 siblings, 2 replies; 15+ messages in thread
From: Anand Jain @ 2019-03-19 11:35 UTC (permalink / raw)
  To: linux-btrfs

   For a regular (non-inline) file extent, the metadata and the data are 
 read by two separate IO operations. When we read the metadata through 
 the btree, each extent block gets verified against the transid expected 
 by its parent. So if any of those blocks is stale, it gets reported 
 and corrected by reading from the other mirror (consider raid1).
 
   At this point the data extent has not yet been read or verified; that 
 only happens when the application reads the file.
 When the application reads the file, btrfs_readpage() first reads the 
 file extent map provided by the metadata from the operation above 
 (which is therefore assured to be consistent).
 But when the file data is read using that extent map, the read again 
 has the choice of either mirror. If the IO is routed to the disk whose 
 metadata we have not verified, and if there is a write hole on that 
 disk, then we read junk data without it being reported anywhere.
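
   A toy user-space simulation of this failure mode (illustration only, 
 not btrfs code; disk 2 plays the mirror with the write hole, still 
 holding the previous generation of both the metadata and the data):

	#include <stdio.h>
	#include <string.h>

	struct disk { int meta_gen; const char *data; };

	int main(void)
	{
		struct disk d[2] = {
			{ 7, "new data written at generation 7" },
			{ 6, "old data left from generation 6" }, /* stale mirror */
		};
		int expected_transid = 7;	/* what the parent block says */

		/* Metadata read: the transid check rejects the stale copy,
		 * so we always settle on the mirror holding generation 7. */
		int meta_mirror = (d[0].meta_gen == expected_transid) ? 0 : 1;

		/* Data read: the mirror is chosen again, independently of
		 * the metadata read, and nothing ties the data block to a
		 * generation (think nodatasum). */
		for (int data_mirror = 0; data_mirror < 2; data_mirror++)
			printf("meta disk %d (gen %d), data disk %d: \"%s\"%s\n",
			       meta_mirror + 1, d[meta_mirror].meta_gen,
			       data_mirror + 1, d[data_mirror].data,
			       strcmp(d[data_mirror].data, d[meta_mirror].data) ?
					"   <-- silent junk" : "");
		return 0;
	}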

   This is reproducible with the test case [1] in this email thread.
     [1]
     fstests: btrfs test read from disks of different generation

   As we have data csums, most of these cases get caught and reported as 
 csum errors. But csum verification is a point verification, not a 
 tree-based transid verification, which means that if there is stale 
 data with a matching csum we may return junk data silently. This 
 problem is easy to reproduce when csums are disabled, but not 
 impossible to hit when csums are enabled either. A tree-based integrity 
 verification, which is currently missing, is important for all data.
 
 Fix:
   This RFC patch proposes to read the data from the same disk from which 
 the metadata was read. This makes features such as readmirror less 
 effective (which is fine). But if the good copy fails (media error), we 
 fall back as usual to the mirrored data, without being aware that this 
 data may be stale. To fix that, we would have to invalidate the 
 corresponding metadata and re-read it from the other disk (this part is 
 not implemented in this patch).
 
    The other, better solution is a tree-based verification, which means 
 verifying the data extent's metadata in the extent tree when the data 
 is read. The problem there is that we do not update the generation 
 number in the extent tree for all types of write; for example, consider 
 an overwrite with the notrunc option on a filesystem mounted with the 
 nodatacow option [1]. I wonder if that is a bug, or if it is intentional 
 because there is no cow?
 
 [1] The EXTENT_DATA generation is the same before and after dd with notrunc
 
   mkfs.btrfs -fq /dev/sdb && mount -o notreelog,nodatacow /dev/sdb /btrfs
   dd if=/dev/urandom of=/btrfs/tf1 bs=4096 count=1 conv=fsync,notrunc && sync
   btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA 0"
   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) itemoff"
   dd if=/dev/urandom of=/btrfs/tf1 bs=4096 count=1 conv=fsync,notrunc && sync
   btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA 0"
   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) itemoff"
 
 ---------
      item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
          generation 6 type 1 (regular)
          extent data disk byte 13631488 nr 4096
          extent data offset 0 nr 4096 ram 4096
          extent compression 0 (none)
      item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
          generation 6 transid 6 size 4096 nbytes 4096
          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
          sequence 66785 flags 0x3(NODATASUM|NODATACOW)
          atime 1552993569.276079763 (2019-03-19 19:06:09)
          ctime 1552993569.276079763 (2019-03-19 19:06:09)
          mtime 1552993569.276079763 (2019-03-19 19:06:09)
          otime 1552993569.276079763 (2019-03-19 19:06:09)
 1+0 records in
 1+0 records out
 4096 bytes (4.1 kB) copied, 0.00302717 s, 1.4 MB/s
      item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
          generation 6 type 1 (regular)
          extent data disk byte 13631488 nr 4096
          extent data offset 0 nr 4096 ram 4096
          extent compression 0 (none)
      item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
          generation 6 transid 7 size 4096 nbytes 4096
          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
          sequence 66786 flags 0x3(NODATASUM|NODATACOW)
          atime 1552993569.276079763 (2019-03-19 19:06:09)
          ctime 1552993569.302079763 (2019-03-19 19:06:09)
          mtime 1552993569.302079763 (2019-03-19 19:06:09)
          otime 1552993569.276079763 (2019-03-19 19:06:09)
 -------
 
   Comments appreciated.


Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/extent_io.c | 41 ++++++++++++++++++++++++++++++++++-------
 1 file changed, 34 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 31892052a24f..f1cc21e8dff7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3079,7 +3079,7 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
 					     struct extent_map **em_cached,
 					     struct bio **bio,
 					     unsigned long *bio_flags,
-					     u64 *prev_em_start)
+					     u64 *prev_em_start, int mirror)
 {
 	struct inode *inode;
 	struct btrfs_ordered_extent *ordered;
@@ -3099,7 +3099,7 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
 
 	for (index = 0; index < nr_pages; index++) {
 		__do_readpage(tree, pages[index], btrfs_get_extent, em_cached,
-				bio, 0, bio_flags, REQ_RAHEAD, prev_em_start);
+				bio, mirror, bio_flags, REQ_RAHEAD, prev_em_start);
 		put_page(pages[index]);
 	}
 }
@@ -3109,7 +3109,7 @@ static void __extent_readpages(struct extent_io_tree *tree,
 			       int nr_pages,
 			       struct extent_map **em_cached,
 			       struct bio **bio, unsigned long *bio_flags,
-			       u64 *prev_em_start)
+			       u64 *prev_em_start, int mirror)
 {
 	u64 start = 0;
 	u64 end = 0;
@@ -3130,7 +3130,7 @@ static void __extent_readpages(struct extent_io_tree *tree,
 						  index - first_index, start,
 						  end, em_cached,
 						  bio, bio_flags,
-						  prev_em_start);
+						  prev_em_start, mirror);
 			start = page_start;
 			end = start + PAGE_SIZE - 1;
 			first_index = index;
@@ -3141,7 +3141,7 @@ static void __extent_readpages(struct extent_io_tree *tree,
 		__do_contiguous_readpages(tree, &pages[first_index],
 					  index - first_index, start,
 					  end, em_cached, bio,
-					  bio_flags, prev_em_start);
+					  bio_flags, prev_em_start, mirror);
 }
 
 static int __extent_read_full_page(struct extent_io_tree *tree,
@@ -4093,6 +4093,27 @@ int extent_writepages(struct address_space *mapping,
 	return ret;
 }
 
+static int btrfs_get_read_mirror(struct inode *inode)
+{
+	int ret;
+	struct btrfs_path *path;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
+				       btrfs_ino(BTRFS_I(inode)), 0, 0);
+	if (!ret) {
+		struct extent_buffer *eb = path->nodes[0];
+
+		ret = eb->read_mirror;
+	}
+
+	btrfs_free_path(path);
+	return ret;
+}
+
 int extent_readpages(struct address_space *mapping, struct list_head *pages,
 		     unsigned nr_pages)
 {
@@ -4103,6 +4124,11 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
 	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
 	int nr = 0;
 	u64 prev_em_start = (u64)-1;
+	int mirror;
+
+	mirror = btrfs_get_read_mirror(mapping->host);
+	if (mirror < 0)
+		return mirror;
 
 	while (!list_empty(pages)) {
 		for (nr = 0; nr < ARRAY_SIZE(pagepool) && !list_empty(pages);) {
@@ -4120,14 +4146,15 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
 		}
 
 		__extent_readpages(tree, pagepool, nr, &em_cached, &bio,
-				   &bio_flags, &prev_em_start);
+				   &bio_flags, &prev_em_start, mirror);
+
 	}
 
 	if (em_cached)
 		free_extent_map(em_cached);
 
 	if (bio)
-		return submit_one_bio(bio, 0, bio_flags);
+		return submit_one_bio(bio, mirror, bio_flags);
 	return 0;
 }
 
-- 
1.8.3.1



* [PATCH] fstests: btrfs test read from disks of different generation
  2019-03-19 11:35 [PATCH RFC] btrfs: fix read corruption from disks of different generation Anand Jain
@ 2019-03-19 11:35 ` Anand Jain
  2020-04-06 12:00   ` [PATCH v2] " Anand Jain
  2019-03-19 12:07 ` [PATCH RFC] btrfs: fix read corruption " Qu Wenruo
  1 sibling, 1 reply; 15+ messages in thread
From: Anand Jain @ 2019-03-19 11:35 UTC (permalink / raw)
  To: linux-btrfs

Verify file read from disks assembled with different generations.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 tests/btrfs/184     | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/184.out |   9 ++++
 tests/btrfs/group   |   1 +
 3 files changed, 142 insertions(+)
 create mode 100755 tests/btrfs/184
 create mode 100644 tests/btrfs/184.out

diff --git a/tests/btrfs/184 b/tests/btrfs/184
new file mode 100755
index 000000000000..1791fdb079b7
--- /dev/null
+++ b/tests/btrfs/184
@@ -0,0 +1,132 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2019 YOUR NAME HERE.  All Rights Reserved.
+#
+# FS QA Test 184
+#
+# Test read on raid1 assembled with disks of different generations.
+#  Steps:
+#   . Use btrfs snapshot on the test_dir to create disk images of
+#     different generations (lets say generation a and generation b).
+#   . Mount one disk from generation a and another disk from generation
+#     b, and verify the file md5sum.
+#  Note: readmirror feature (in the btrfs ml) helps to reproduce the
+#     problem consistently.
+#  Fix-by: kernel patch
+#   [PATCH] btrfs: fix read corruption on disks of different generation
+
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+	_scratch_unmount > /dev/null 2>&1
+	_destroy_loop_device $dev1
+	_destroy_loop_device $dev2
+	btrfs subvolume delete $img_subvol > /dev/null 2>&1
+	btrfs subvolume delete $img_subvol_at_gen_a > /dev/null 2>&1
+	[[ -z $scratch_dev_pool_saved ]] || \
+			SCRATCH_DEV_POOL="$scratch_dev_pool_saved"
+	_scratch_dev_pool_put
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_test
+_require_scratch_dev_pool 2
+
+_test_unmount
+_require_btrfs_forget_if_not_fs_loadable
+_test_mount
+
+_scratch_dev_pool_get 2
+scratch_dev_pool_saved=""
+
+img_subvol=$TEST_DIR/img_subvol
+img_subvol_at_gen_a=$TEST_DIR/img_subvol_at_gen_a
+dev1_img=$img_subvol/d1
+dev2_img=$img_subvol/d2
+dev1_img_at_gen_a=$img_subvol_at_gen_a/d1
+dev2_img_at_gen_a=$img_subvol_at_gen_a/d2
+
+$BTRFS_UTIL_PROG subvolume create $img_subvol >> /dev/null 2>&1 || \
+					_fail "subvol create failed"
+
+$XFS_IO_PROG -f -c "pwrite -S 0xdd 0 256m" $dev1_img >> /dev/null 2>&1
+$XFS_IO_PROG -f -c "pwrite -S 0xdd 0 256m" $dev2_img >> /dev/null 2>&1
+
+dev1=$(_create_loop_device $dev1_img)
+dev2=$(_create_loop_device $dev2_img)
+
+scratch_dev_pool_saved="$SCRATCH_DEV_POOL"
+SCRATCH_DEV_POOL="$dev1 $dev2"
+_scratch_pool_mkfs -d raid1 -m raid1 >> $seqres.full 2>&1 || \
+			_fail  "mkfs failed on $SCRATCH_DEV_POOL"
+
+
+_mount "-o max_inline=0,nodatacow,device=$dev1" $dev2 $SCRATCH_MNT || \
+					_fail "mount for gen a failed"
+
+$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 4K" $SCRATCH_MNT/foobar >> /dev/null 2>&1
+
+echo md5sum at generation aa
+md5sum $SCRATCH_MNT/foobar | _filter_scratch
+echo
+_scratch_unmount
+
+$BTRFS_UTIL_PROG subvolume snapshot $img_subvol $img_subvol_at_gen_a \
+		>> /dev/null 2>&1 || _fail "subvol create failed"
+
+_mount "-o max_inline=0,nodatacow,device=$dev1" $dev2 $SCRATCH_MNT || \
+					_fail "mount for gen b failed"
+
+$XFS_IO_PROG -f -c "pwrite -S 0xbb 0 4K" $SCRATCH_MNT/foobar >> /dev/null 2>&1
+
+echo md5sum at generation bb
+md5sum $SCRATCH_MNT/foobar | _filter_scratch
+echo
+
+_scratch_unmount
+
+_destroy_loop_device $dev1
+_destroy_loop_device $dev2
+
+_test_unmount
+_btrfs_forget_if_not_fs_reload
+_test_mount
+
+dev1=$(_create_loop_device $dev1_img_at_gen_a)
+dev2=$(_create_loop_device $dev2_img)
+
+# mount with readmirror option, if it fails try without but the
+# problem may or may not be reproduced without the readmirror option.
+_mount "-o ro,device=$dev1,readmirror=devid$dev1_devid" $dev2 $SCRATCH_MNT > \
+								/dev/null 2>&1
+[[ $? != 0 ]] && (echo "readmirror option not found" >> $seqres.full;
+		  _mount "-o ro,device=$dev1" $dev2 $SCRATCH_MNT || \
+					_fail "mount at gen ab failed";)
+echo md5sum at generation ab
+md5sum $SCRATCH_MNT/foobar | _filter_scratch
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/184.out b/tests/btrfs/184.out
new file mode 100644
index 000000000000..cc71898920ec
--- /dev/null
+++ b/tests/btrfs/184.out
@@ -0,0 +1,9 @@
+QA output created by 184
+md5sum at generation aa
+3e4309c7cc81f23d45e260a8f13ca860  SCRATCH_MNT/foobar
+
+md5sum at generation bb
+055a2a7342c0528969e1b6e32aadeee3  SCRATCH_MNT/foobar
+
+md5sum at generation ab
+055a2a7342c0528969e1b6e32aadeee3  SCRATCH_MNT/foobar
diff --git a/tests/btrfs/group b/tests/btrfs/group
index f3227c1708d9..a4f8fe1eb5ce 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -186,3 +186,4 @@
 181 auto quick balance
 182 auto quick balance
 183 auto quick clone compress punch
+184 other
-- 
1.8.3.1



* Re: [PATCH RFC] btrfs: fix read corruption from disks of different generation
  2019-03-19 11:35 [PATCH RFC] btrfs: fix read corruption from disks of different generation Anand Jain
  2019-03-19 11:35 ` [PATCH] fstests: btrfs test read " Anand Jain
@ 2019-03-19 12:07 ` Qu Wenruo
  2019-03-19 23:41   ` Anand Jain
  1 sibling, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2019-03-19 12:07 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs





On 2019/3/19 7:35 PM, Anand Jain wrote:
>    In case of file regular extent (non inline), the metadata and data are 
>  read from two different IO operations. When we read the metadata using 
>  the btree each extent block gets verified with the expected  transid as 
>  per its parent. So suppose if any of the block is stale it gets reported 
>  and corrected by reading from the mirror (consider raid1).
>  
>    This also means the data extent is not yet read or verified and this 
>  happens when application tries to read the file.
>  When application tries to read the file btrfs_readpage(), it shall first 
>  read file extent map as provided by the metadata in the above operation 
>  (which is assured to be consistent).
>  And when file data has to be read using the above filemap as it again 
>  has the choice of using either of the mirrors, suppose if the IO is 
>  routed to the disk on which we haven't verified the metadata on it and 
>  if there is write hole on that disk, then we shall be reading the junk 
>  data without being reported anywhere.
> 
>    This is reproducible with the test case [1] in this email thread.
>      [1]
>      fstests: btrfs test read from disks of different generation
> 
>    As we have data csum most of these get caught and reported as csum 
>  error.

So far so good.

>  But csum verification is a point in verification and its not a 
>  tree based transid verification. Which means if there is a stale data 
>  with matching csum we may return a junk data silently.

Then the normal idea is to use stronger but slower csum in the first
place, to avoid the csum match case.

>  This problem is 
>  easily reproducible when csum is disabled but not impossible to achieve 
>  when csum is not disabled as well.

In that case, it's the user who is to blame for the decision to disable
the csum in the first place.

> A tree based integrity verification 
>  is important for all data, which is missing.
>  
>  Fix:
>    In this RFC patch it proposes to use same disk from with the metadata 
>  is read to read the data.

The obvious problem I see is that the idea only works for RAID1/10.

For striped profiles it makes no sense, or even has a worse chance of
getting stale data.


To me, the idea of using a possibly better mirror makes some sense, but
it is very profile-limited.


Another idea this inspires is to make it more generic, so that a
bad/stale device gets a lower priority.
Although it suffers from the same problem I described.

To make the point short, the use case looks very limited.

Thanks,
Qu

> Which means the feature such as readmirror is 
>  less effective (which is fine). But if the good data fails (media error) 
>  as usual we shall read the mirrored data without being aware that this 
>  data may be stale. To fix that, we have to invalidate the corresponding 
>  metadata and re-read the metadata from other disk (this part is not 
>  implemented in this patch).
>  
>     The other better solution is - a tree based verification, which means 
>  we have to verify the data-extent's metadata in the extent tree when its 
>  read - but there is a problem we don't update the generation number in 
>  the extent tree for all types of write, for example consider overwrite 
>  with notrunc option on a filesystem mounted with nodatacow option [1], 
>  which I wonder if its a bug or if its like that for a reason? as because 
>  there is no cow?
>  
>  [1] The EXTENT_DATA generation is same before and after dd with notrunc
>  
>    mkfs.btrfs -fq /dev/sdb && mount -o notreelog,nodatacow /dev/sdb /btrfs
>    dd if=/dev/urandom of=/btrfs/tf1 bs=4096 count=1 conv=fsync,notrunc && 
>  sync
>    btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA 0"
>    btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) itemoff"
>    dd if=/dev/urandom of=/btrfs/tf1 bs=4096 count=1 conv=fsync,notrunc && 
>  sync
>    btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA 0"
>    btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) itemoff"
>  
>  ---------
>       item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
>           generation 6 type 1 (regular)
>           extent data disk byte 13631488 nr 4096
>           extent data offset 0 nr 4096 ram 4096
>           extent compression 0 (none)
>       item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
>           generation 6 transid 6 size 4096 nbytes 4096
>           block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>           sequence 66785 flags 0x3(NODATASUM|NODATACOW)
>           atime 1552993569.276079763 (2019-03-19 19:06:09)
>           ctime 1552993569.276079763 (2019-03-19 19:06:09)
>           mtime 1552993569.276079763 (2019-03-19 19:06:09)
>           otime 1552993569.276079763 (2019-03-19 19:06:09)
>  1+0 records in
>  1+0 records out
>  4096 bytes (4.1 kB) copied, 0.00302717 s, 1.4 MB/s
>       item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
>           generation 6 type 1 (regular)
>           extent data disk byte 13631488 nr 4096
>           extent data offset 0 nr 4096 ram 4096
>           extent compression 0 (none)
>       item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
>           generation 6 transid 7 size 4096 nbytes 4096
>           block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>           sequence 66786 flags 0x3(NODATASUM|NODATACOW)
>           atime 1552993569.276079763 (2019-03-19 19:06:09)
>           ctime 1552993569.302079763 (2019-03-19 19:06:09)
>           mtime 1552993569.302079763 (2019-03-19 19:06:09)
>           otime 1552993569.276079763 (2019-03-19 19:06:09)
>  -------
>  
>    Comments appreciated.
> 
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  fs/btrfs/extent_io.c | 41 ++++++++++++++++++++++++++++++++++-------
>  1 file changed, 34 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 31892052a24f..f1cc21e8dff7 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3079,7 +3079,7 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
>  					     struct extent_map **em_cached,
>  					     struct bio **bio,
>  					     unsigned long *bio_flags,
> -					     u64 *prev_em_start)
> +					     u64 *prev_em_start, int mirror)
>  {
>  	struct inode *inode;
>  	struct btrfs_ordered_extent *ordered;
> @@ -3099,7 +3099,7 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
>  
>  	for (index = 0; index < nr_pages; index++) {
>  		__do_readpage(tree, pages[index], btrfs_get_extent, em_cached,
> -				bio, 0, bio_flags, REQ_RAHEAD, prev_em_start);
> +				bio, mirror, bio_flags, REQ_RAHEAD, prev_em_start);
>  		put_page(pages[index]);
>  	}
>  }
> @@ -3109,7 +3109,7 @@ static void __extent_readpages(struct extent_io_tree *tree,
>  			       int nr_pages,
>  			       struct extent_map **em_cached,
>  			       struct bio **bio, unsigned long *bio_flags,
> -			       u64 *prev_em_start)
> +			       u64 *prev_em_start, int mirror)
>  {
>  	u64 start = 0;
>  	u64 end = 0;
> @@ -3130,7 +3130,7 @@ static void __extent_readpages(struct extent_io_tree *tree,
>  						  index - first_index, start,
>  						  end, em_cached,
>  						  bio, bio_flags,
> -						  prev_em_start);
> +						  prev_em_start, mirror);
>  			start = page_start;
>  			end = start + PAGE_SIZE - 1;
>  			first_index = index;
> @@ -3141,7 +3141,7 @@ static void __extent_readpages(struct extent_io_tree *tree,
>  		__do_contiguous_readpages(tree, &pages[first_index],
>  					  index - first_index, start,
>  					  end, em_cached, bio,
> -					  bio_flags, prev_em_start);
> +					  bio_flags, prev_em_start, mirror);
>  }
>  
>  static int __extent_read_full_page(struct extent_io_tree *tree,
> @@ -4093,6 +4093,27 @@ int extent_writepages(struct address_space *mapping,
>  	return ret;
>  }
>  
> +static int btrfs_get_read_mirror(struct inode *inode)
> +{
> +	int ret;
> +	struct btrfs_path *path;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
> +				       btrfs_ino(BTRFS_I(inode)), 0, 0);
> +	if (!ret) {
> +		struct extent_buffer *eb = path->nodes[0];
> +
> +		ret = eb->read_mirror;
> +	}
> +
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
>  int extent_readpages(struct address_space *mapping, struct list_head *pages,
>  		     unsigned nr_pages)
>  {
> @@ -4103,6 +4124,11 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
>  	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
>  	int nr = 0;
>  	u64 prev_em_start = (u64)-1;
> +	int mirror;
> +
> +	mirror = btrfs_get_read_mirror(mapping->host);
> +	if (mirror < 0)
> +		return mirror;
>  
>  	while (!list_empty(pages)) {
>  		for (nr = 0; nr < ARRAY_SIZE(pagepool) && !list_empty(pages);) {
> @@ -4120,14 +4146,15 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
>  		}
>  
>  		__extent_readpages(tree, pagepool, nr, &em_cached, &bio,
> -				   &bio_flags, &prev_em_start);
> +				   &bio_flags, &prev_em_start, mirror);
> +
>  	}
>  
>  	if (em_cached)
>  		free_extent_map(em_cached);
>  
>  	if (bio)
> -		return submit_one_bio(bio, 0, bio_flags);
> +		return submit_one_bio(bio, mirror, bio_flags);
>  	return 0;
>  }
>  
> 




* Re: [PATCH RFC] btrfs: fix read corruption from disks of different generation
  2019-03-19 12:07 ` [PATCH RFC] btrfs: fix read corruption " Qu Wenruo
@ 2019-03-19 23:41   ` Anand Jain
  2019-03-20  1:02     ` Qu Wenruo
  0 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2019-03-19 23:41 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs


>>   But csum verification is a point in verification and its not a
>>   tree based transid verification. Which means if there is a stale data
>>   with matching csum we may return a junk data silently.
> 
> Then the normal idea is to use stronger but slower csum in the first
> place, to avoid the csum match case.

  This is just a general observational comment; it's OK, let's assume the
  current point csum verification works (as opposed to the tree-based
  parent transid verification).

>>   This problem is
>>   easily reproducible when csum is disabled but not impossible to achieve
>>   when csum is not disabled as well.
> 
> Under this case, it's the user to be blamed for the decision to disable
> the csum in the first place.

  The point here is that the logic isn't aware of the write hole on the
  other disk, on which the metadata is not verified. I disagree that
  nocsum or the user is to blame.

> 
>> A tree based integrity verification
>>   is important for all data, which is missing.
>>   
>>   Fix:
>>     In this RFC patch it proposes to use same disk from with the metadata
>>   is read to read the data.
> 
> The obvious problem I found is, the idea only works for RAID1/10.
> 
> For striped profile it makes no sense, or even have a worse chance to
> get stale data.
> 
> 
> To me, the idea of using possible better mirror makes some sense, but
> very profile limited.

  Yep. This problem and the fix are only for the mirror-based profiles
  such as raid1/raid10.

> 
> Another idea I get inspired from the idea is, make it more generic so
> that bad/stale device get a lower priority.

  When it comes to reading junk data, it's not about priority, it's
  about elimination. When the problem is only a few blocks, I am
  against marking the whole disk as bad.

> Although it suffers the same problem as I described.
> 
> To make the point short, the use case looks very limited.

  It applies to raid1/raid10 with nodatacow (which implies nodatasum).
  In my understanding that's not rare.

  Any comments on the fix offered here?

Thanks, Anand


> Thanks,
> Qu


* Re: [PATCH RFC] btrfs: fix read corruption from disks of different generation
  2019-03-19 23:41   ` Anand Jain
@ 2019-03-20  1:02     ` Qu Wenruo
  2019-03-20  5:47       ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2019-03-20  1:02 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs





On 2019/3/20 7:41 AM, Anand Jain wrote:
> 
>>>   But csum verification is a point in verification and its not a
>>>   tree based transid verification. Which means if there is a stale data
>>>   with matching csum we may return a junk data silently.
>>
>> Then the normal idea is to use stronger but slower csum in the first
>> place, to avoid the csum match case.
> 
>  This is just a general observational comment, its ok lets assume
>  current point in csum verification works (as opposed to tree based
>  parent transid verification).
> 
>>>   This problem is
>>>   easily reproducible when csum is disabled but not impossible to
>>> achieve
>>>   when csum is not disabled as well.
>>
>> Under this case, it's the user to be blamed for the decision to disable
>> the csum in the first place.
> 
>  The point here is. The logic isn't aware of the write hole on the other
>  disk on which the metadata is not verified. I disagree that nocsum or
>  the user to be blamed.
> 
>>
>>> A tree based integrity verification
>>>   is important for all data, which is missing.
>>>     Fix:
>>>     In this RFC patch it proposes to use same disk from with the
>>> metadata
>>>   is read to read the data.
>>
>> The obvious problem I found is, the idea only works for RAID1/10.
>>
>> For striped profile it makes no sense, or even have a worse chance to
>> get stale data.
>>
>>
>> To me, the idea of using possible better mirror makes some sense, but
>> very profile limited.
> 
>  Yep. This problem and fix is only for the mirror based profiles
>  such as raid1/raid10.

Then the current implementation lacks such a check.

Furthermore, data and metadata can lie in different chunks and have
different chunk types.

> 
>>
>> Another idea I get inspired from the idea is, make it more generic so
>> that bad/stale device get a lower priority.
> 
>  When it comes to reading junk data, its not about the priority its
>  about the eliminating. When the problem is only few blocks, I am
>  against making the whole disk as bad.
> 
>> Although it suffers the same problem as I described.
>>
>> To make the point short, the use case looks very limited.
> 
>  It applies to raid1/raid10 with nodatacow (which implies nodatasum).
>  In my understanding that's not rare.
> 
>  Any comments on the fix offered here?

On the implementation part: is eb->read_mirror reliable?

E.g. what if the data and the eb are in different chunks, and the
staleness is in the eb's chunk but not in the data chunk?

Thanks,
Qu

> 
> Thanks, Anand
> 
> 
>> Thanks,
>> Qu




* Re: [PATCH RFC] btrfs: fix read corruption from disks of different generation
  2019-03-20  1:02     ` Qu Wenruo
@ 2019-03-20  5:47       ` Anand Jain
  2019-03-20  6:19         ` Qu Wenruo
  2019-03-20  6:27         ` Qu Wenruo
  0 siblings, 2 replies; 15+ messages in thread
From: Anand Jain @ 2019-03-20  5:47 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



>>>> A tree based integrity verification
>>>>    is important for all data, which is missing.
>>>>      Fix:
>>>>      In this RFC patch it proposes to use same disk from with the
>>>> metadata
>>>>    is read to read the data.
>>>
>>> The obvious problem I found is, the idea only works for RAID1/10.
>>>
>>> For striped profile it makes no sense, or even have a worse chance to
>>> get stale data.
>>>
>>>
>>> To me, the idea of using possible better mirror makes some sense, but
>>> very profile limited.
>>
>>   Yep. This problem and fix is only for the mirror based profiles
>>   such as raid1/raid10.
> 
> Then current implementation lacks such check.
> 
> Further more, data and metadata can lie in different chunks and have
> different chunk types.

  Right. Current tests for this RFC were only for raid1.

  But the final patch can fix that.

  In fact the current patch works for all the cases except when a replace
  is running, and for a mix of metadata:raid1 and data:raid56.

  We need some cleanups around mirror_num; basically we need to bring it
  under a #define and handle it accordingly in __btrfs_map_block().
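
  Something along these lines (the names are made up, only to illustrate
  what "bring it under a #define" could look like):

	/* Hypothetical constants replacing the magic 0/1/2 mirror_num
	 * values; __btrfs_map_block() and the read paths would then
	 * interpret them consistently. */
	#define BTRFS_MIRROR_ANY	0	/* let the mapping code pick */
	#define BTRFS_MIRROR_1		1
	#define BTRFS_MIRROR_2		2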

>>> Another idea I get inspired from the idea is, make it more generic so
>>> that bad/stale device get a lower priority.
>>
>>   When it comes to reading junk data, its not about the priority its
>>   about the eliminating. When the problem is only few blocks, I am
>>   against making the whole disk as bad.
>>
>>> Although it suffers the same problem as I described.
>>>
>>> To make the point short, the use case looks very limited.
>>
>>   It applies to raid1/raid10 with nodatacow (which implies nodatasum).
>>   In my understanding that's not rare.
>>
>>   Any comments on the fix offered here?
> 
> The implementation part is, is eb->read_mirror reliable?
> 
> E.g. if the data and the eb are in different chunks, and the stale
> happens in the chunk of eb but not in the data chunk?


  The eb and regular data are indeed always in different chunks. But an eb
  can never be stale, because the parent transid is verified against the
  eb that is read. However, we do not have such a check for the data (this
  is the core of the issue here), and so we return the junk data silently.

  Also, any idea why the generation number of the extent data is not
  incremented [2] when the -o nodatacow mount option and the notrunc dd
  option are used? Is it a bug? The dump-tree output below was taken with
  the script in [1].
  (This corruption is seen whether or not the generation number is
  incremented, but as another way to fix the corruption we could verify
  the inode's EXTENT_DATA generation from the same disk from which the
  data is read.)

[1]
  umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
  mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
  echo 1st write: && \
  dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync && \
  btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
  echo --- && \
  btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA" && \
  echo 2nd write: && \
  dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync && \
  btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
  echo --- && \
  btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA"


1st write:
	item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
		generation 6 transid 6 size 4096 nbytes 4096
		block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
		sequence 1 flags 0x3(NODATASUM|NODATACOW)
		atime 1553058460.163985452 (2019-03-20 13:07:40)
		ctime 1553058460.163985452 (2019-03-20 13:07:40)
		mtime 1553058460.163985452 (2019-03-20 13:07:40)
		otime 1553058460.163985452 (2019-03-20 13:07:40)
---
	item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
		generation 6 type 1 (regular)
		extent data disk byte 13631488 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0 (none)
2nd write:
	item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
		generation 6 transid 7 size 4096 nbytes 4096
		block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
		sequence 2 flags 0x3(NODATASUM|NODATACOW)
		atime 1553058460.163985452 (2019-03-20 13:07:40)
		ctime 1553058460.189985450 (2019-03-20 13:07:40)
		mtime 1553058460.189985450 (2019-03-20 13:07:40)
		otime 1553058460.163985452 (2019-03-20 13:07:40)
---
	item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
		generation 6 type 1 (regular)   <----- [2]
		extent data disk byte 13631488 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0 (none)


Thanks, Anand


* Re: [PATCH RFC] btrfs: fix read corruption from disks of different generation
  2019-03-20  5:47       ` Anand Jain
@ 2019-03-20  6:19         ` Qu Wenruo
  2019-03-20 14:00           ` Anand Jain
  2019-03-20  6:27         ` Qu Wenruo
  1 sibling, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2019-03-20  6:19 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs





On 2019/3/20 1:47 PM, Anand Jain wrote:
> 
> 
>>>>> A tree based integrity verification
>>>>>    is important for all data, which is missing.
>>>>>      Fix:
>>>>>      In this RFC patch it proposes to use same disk from with the
>>>>> metadata
>>>>>    is read to read the data.
>>>>
>>>> The obvious problem I found is, the idea only works for RAID1/10.
>>>>
>>>> For striped profile it makes no sense, or even have a worse chance to
>>>> get stale data.
>>>>
>>>>
>>>> To me, the idea of using possible better mirror makes some sense, but
>>>> very profile limited.
>>>
>>>   Yep. This problem and fix is only for the mirror based profiles
>>>   such as raid1/raid10.
>>
>> Then current implementation lacks such check.
>>
>> Further more, data and metadata can lie in different chunks and have
>> different chunk types.
> 
>  Right. Current tests for this RFC were only for raid1.
> 
>  But the final patch can fix that.
> 
>  In fact current patch works for all the cases except for the case of
>  replace is running and mix of metadata:raid1 and data:raid56
> 
>  We need some cleanups in mirror_num, basically we need to bring it
>  under #define. and handle it accordingly in __btrfs_map_block()
> 
>>>> Another idea I get inspired from the idea is, make it more generic so
>>>> that bad/stale device get a lower priority.
>>>
>>>   When it comes to reading junk data, its not about the priority its
>>>   about the eliminating. When the problem is only few blocks, I am
>>>   against making the whole disk as bad.
>>>
>>>> Although it suffers the same problem as I described.
>>>>
>>>> To make the point short, the use case looks very limited.
>>>
>>>   It applies to raid1/raid10 with nodatacow (which implies nodatasum).
>>>   In my understanding that's not rare.
>>>
>>>   Any comments on the fix offered here?
>>
>> The implementation part is, is eb->read_mirror reliable?
>>
>> E.g. if the data and the eb are in different chunks, and the stale
>> happens in the chunk of eb but not in the data chunk?
> 
> 
>  eb and regular data are indeed in different chunks always. But eb
>  can never be stale as there is parent transid which is verified against
>  the read eb. However we do not have such a check for the data (this is
>  the core of the issue here) and so we return the junk data silently.

Yes, that's the basis of your patch.
And I totally agree with this part, since it's not only the transid; the
eb's content is also checked almost bit by bit.

So the idea is OK for mirror-based profiles.

> 
>  Also any idea why the generation number for the extent data is not
>  incremented [2] when -o nodatacow and notrunc option is used, is it
>  a bug? the dump-tree is taken with the script as below [1]
>  (this corruption is seen with or without generation number is
>  being incremented, but as another way to fix for the corruption we can
>  verify the inode EXTENT_DATA generation from the same disk from which
>  the data is read).

For the generation part, it's the generation when data is written to disk.

Truncation/nocow overwrite shouldn't really change the generation of
existing file extents.

So I'm afraid you can't use that generation to do the check.

Thanks,
Qu

> 
> [1]
>  umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
>  mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
>  echo 1st write: && \
>  dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
> conv=fsync,notrunc && sync && \
>  btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>  echo --- && \
>  btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA" && \
>  echo 2nd write: && \
>  dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
> conv=fsync,notrunc && sync && \
>  btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>  echo --- && \
>  btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA"
> 
> 
> 1st write:
>     item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>         generation 6 transid 6 size 4096 nbytes 4096
>         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>         sequence 1 flags 0x3(NODATASUM|NODATACOW)
>         atime 1553058460.163985452 (2019-03-20 13:07:40)
>         ctime 1553058460.163985452 (2019-03-20 13:07:40)
>         mtime 1553058460.163985452 (2019-03-20 13:07:40)
>         otime 1553058460.163985452 (2019-03-20 13:07:40)
> ---
>     item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>         generation 6 type 1 (regular)
>         extent data disk byte 13631488 nr 4096
>         extent data offset 0 nr 4096 ram 4096
>         extent compression 0 (none)
> 2nd write:
>     item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>         generation 6 transid 7 size 4096 nbytes 4096
>         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>         sequence 2 flags 0x3(NODATASUM|NODATACOW)
>         atime 1553058460.163985452 (2019-03-20 13:07:40)
>         ctime 1553058460.189985450 (2019-03-20 13:07:40)
>         mtime 1553058460.189985450 (2019-03-20 13:07:40)
>         otime 1553058460.163985452 (2019-03-20 13:07:40)
> ---
>     item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>         generation 6 type 1 (regular)   <----- [2]
>         extent data disk byte 13631488 nr 4096
>         extent data offset 0 nr 4096 ram 4096
>         extent compression 0 (none)
> 
> 
> Thanks, Anand




* Re: [PATCH RFC] btrfs: fix read corruption from disks of different generation
  2019-03-20  5:47       ` Anand Jain
  2019-03-20  6:19         ` Qu Wenruo
@ 2019-03-20  6:27         ` Qu Wenruo
  2019-03-20 13:54           ` Anand Jain
  1 sibling, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2019-03-20  6:27 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs





On 2019/3/20 1:47 PM, Anand Jain wrote:
> 
> 
>>>>> A tree based integrity verification
>>>>>    is important for all data, which is missing.
>>>>>      Fix:
>>>>>      In this RFC patch it proposes to use same disk from with the
>>>>> metadata
>>>>>    is read to read the data.
>>>>
>>>> The obvious problem I found is, the idea only works for RAID1/10.
>>>>
>>>> For striped profile it makes no sense, or even have a worse chance to
>>>> get stale data.
>>>>
>>>>
>>>> To me, the idea of using possible better mirror makes some sense, but
>>>> very profile limited.
>>>
>>>   Yep. This problem and fix is only for the mirror based profiles
>>>   such as raid1/raid10.
>>
>> Then current implementation lacks such check.
>>
>> Further more, data and metadata can lie in different chunks and have
>> different chunk types.
> 
>  Right. Current tests for this RFC were only for raid1.
> 
>  But the final patch can fix that.
> 
>  In fact current patch works for all the cases except for the case of
>  replace is running and mix of metadata:raid1 and data:raid56
> 
>  We need some cleanups in mirror_num, basically we need to bring it
>  under #define. and handle it accordingly in __btrfs_map_block()


Wait a minute.

There is a hidden pitfall from the very beginning.

Consider such a chunk layout:
Chunk A Type DATA|RAID1
Stripe 1: Dev 1
Stripe 2: Dev 2

Chunk B Type METADATA|RAID1
Stripe 1: Dev 2
Stripe 2: Dev 1

Then when we find stale metadata in chunk B mirror 1, caused by dev 2,
your patch considers device 2 stale and ends up using mirror num 2 to
read from the data chunk.

However, in the data chunk mirror num 2 still maps to device 2, and we
get stale data.

So eb->read_mirror can still map to a bad/stale device, due to the
flexibility provided by btrfs per-chunk mapping.
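
To make the mapping concrete, here is a tiny user-space sketch (not
btrfs code; the stripe->device arrays simply encode the example layout
above):

	#include <stdio.h>

	/* mirror_num indexes a chunk's own stripe list, so the same
	 * number can mean different devices in different chunks. */
	struct chunk { const char *type; int stripe_dev[2]; };

	int main(void)
	{
		struct chunk data = { "DATA|RAID1",     { 1, 2 } };
		struct chunk meta = { "METADATA|RAID1", { 2, 1 } };
		/* metadata copy that passed the transid check: dev 1 */
		int eb_read_mirror = 2;

		printf("metadata mirror %d -> dev %d\n", eb_read_mirror,
		       meta.stripe_dev[eb_read_mirror - 1]);
		/* Reusing the same mirror number for the data chunk picks
		 * dev 2, which is exactly the stale device here. */
		printf("data     mirror %d -> dev %d\n", eb_read_mirror,
		       data.stripe_dev[eb_read_mirror - 1]);
		return 0;
	}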

Thanks,
Qu

> 
>>>> Another idea I get inspired from the idea is, make it more generic so
>>>> that bad/stale device get a lower priority.
>>>
>>>   When it comes to reading junk data, its not about the priority its
>>>   about the eliminating. When the problem is only few blocks, I am
>>>   against making the whole disk as bad.
>>>
>>>> Although it suffers the same problem as I described.
>>>>
>>>> To make the point short, the use case looks very limited.
>>>
>>>   It applies to raid1/raid10 with nodatacow (which implies nodatasum).
>>>   In my understanding that's not rare.
>>>
>>>   Any comments on the fix offered here?
>>
>> The implementation part is, is eb->read_mirror reliable?
>>
>> E.g. if the data and the eb are in different chunks, and the stale
>> happens in the chunk of eb but not in the data chunk?
> 
> 
>  eb and regular data are indeed in different chunks always. But eb
>  can never be stale as there is parent transid which is verified against
>  the read eb. However we do not have such a check for the data (this is
>  the core of the issue here) and so we return the junk data silently.
> 
>  Also any idea why the generation number for the extent data is not
>  incremented [2] when -o nodatacow and notrunc option is used, is it
>  a bug? the dump-tree is taken with the script as below [1]
>  (this corruption is seen with or without generation number is
>  being incremented, but as another way to fix for the corruption we can
>  verify the inode EXTENT_DATA generation from the same disk from which
>  the data is read).
> 
> [1]
>  umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
>  mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
>  echo 1st write: && \
>  dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
> conv=fsync,notrunc && sync && \
>  btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>  echo --- && \
>  btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA" && \
>  echo 2nd write: && \
>  dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
> conv=fsync,notrunc && sync && \
>  btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>  echo --- && \
>  btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA"
> 
> 
> 1st write:
>     item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>         generation 6 transid 6 size 4096 nbytes 4096
>         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>         sequence 1 flags 0x3(NODATASUM|NODATACOW)
>         atime 1553058460.163985452 (2019-03-20 13:07:40)
>         ctime 1553058460.163985452 (2019-03-20 13:07:40)
>         mtime 1553058460.163985452 (2019-03-20 13:07:40)
>         otime 1553058460.163985452 (2019-03-20 13:07:40)
> ---
>     item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>         generation 6 type 1 (regular)
>         extent data disk byte 13631488 nr 4096
>         extent data offset 0 nr 4096 ram 4096
>         extent compression 0 (none)
> 2nd write:
>     item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>         generation 6 transid 7 size 4096 nbytes 4096
>         block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>         sequence 2 flags 0x3(NODATASUM|NODATACOW)
>         atime 1553058460.163985452 (2019-03-20 13:07:40)
>         ctime 1553058460.189985450 (2019-03-20 13:07:40)
>         mtime 1553058460.189985450 (2019-03-20 13:07:40)
>         otime 1553058460.163985452 (2019-03-20 13:07:40)
> ---
>     item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>         generation 6 type 1 (regular)   <----- [2]
>         extent data disk byte 13631488 nr 4096
>         extent data offset 0 nr 4096 ram 4096
>         extent compression 0 (none)
> 
> 
> Thanks, Anand




* Re: [PATCH RFC] btrfs: fix read corruption from disks of different generation
  2019-03-20  6:27         ` Qu Wenruo
@ 2019-03-20 13:54           ` Anand Jain
  2019-03-20 15:46             ` Zygo Blaxell
  0 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2019-03-20 13:54 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 3/20/19 2:27 PM, Qu Wenruo wrote:
> 
> 
> On 2019/3/20 1:47 PM, Anand Jain wrote:
>>
>>
>>>>>> A tree based integrity verification
>>>>>>     is important for all data, which is missing.
>>>>>>       Fix:
>>>>>>       In this RFC patch it proposes to use same disk from with the
>>>>>> metadata
>>>>>>     is read to read the data.
>>>>>
>>>>> The obvious problem I found is, the idea only works for RAID1/10.
>>>>>
>>>>> For striped profile it makes no sense, or even have a worse chance to
>>>>> get stale data.
>>>>>
>>>>>
>>>>> To me, the idea of using possible better mirror makes some sense, but
>>>>> very profile limited.
>>>>
>>>>    Yep. This problem and fix is only for the mirror based profiles
>>>>    such as raid1/raid10.
>>>
>>> Then current implementation lacks such check.
>>>
>>> Further more, data and metadata can lie in different chunks and have
>>> different chunk types.
>>
>>   Right. Current tests for this RFC were only for raid1.
>>
>>   But the final patch can fix that.
>>
>>   In fact current patch works for all the cases except for the case of
>>   replace is running and mix of metadata:raid1 and data:raid56
>>
>>   We need some cleanups in mirror_num, basically we need to bring it
>>   under #define. and handle it accordingly in __btrfs_map_block()
> 
> 
> Wait for a minute.
> 
> There is a hidden pitfall from the very beginning.
> 
> Consider such chunk layout:
> Chunk A Type DATA|RAID1
> Stripe 1: Dev 1
> Stripe 2: Dev 2
> 
> Chunk B Type METADATA|RAID1
> Stripe 1: Dev 2
> Stripe 2: Dev 1

  Right, this fix can't solve this case. So that's one option down. The
  other choices I had are:
  - Quarantine the whole disk by comparing the dev1 and dev2 generation
    numbers in the superblock during mount (a rough userspace sketch of
    this check follows below), or
  - Quarantine whole chunk(s) by comparing the dev1 and dev2 generation
    numbers in the chunk tree during mount, or
  - Always read the eb from both dev1 and dev2; if that fails, also fail
    its corresponding data block, if any, or
  - During mount, compare the dev1 and dev2 generation numbers and record
    the range of generation numbers missing on the disk. But this needs
    on-disk format changes (and the nodatacow, notrunc regular data extent
    write must then change the generation number), or
  - Similar to the btree eb read pages, nest the extent data read as well
    (again, the nodatacow, notrunc regular data extent write must then
    change the generation number).
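
  A rough userspace sketch of the first idea, comparing the superblock
  generation of the two devices (device paths are placeholders, it assumes
  the usual dump-super output format, and the real check would of course
  have to live in the kernel mount path):

    gen1=$(btrfs in dump-super /dev/loop1 | awk '/^generation/ {print $2}')
    gen2=$(btrfs in dump-super /dev/loop2 | awk '/^generation/ {print $2}')
    [ "$gen1" = "$gen2" ] || \
        echo "superblock generation mismatch ($gen1 vs $gen2), quarantine the stale device"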

Thanks, Anand


> Then when we found stale metadata in chunk B mirror 1, caused by dev 2,
> then your patch consider device 2 stale, and try to use mirror num 2 to
> read from data chunk.
> 
> However in data chunk, mirror num 2 means it's still from device 2, and
> we get stale data.
> 
> So the eb->mirror_num can still map to bad/stale device, due to the
> flexibility provided by btrfs per-chunk mapping.
> 
> Thanks,
> Qu
> 
>>
>>>>> Another idea I get inspired from the idea is, make it more generic so
>>>>> that bad/stale device get a lower priority.
>>>>
>>>>    When it comes to reading junk data, its not about the priority its
>>>>    about the eliminating. When the problem is only few blocks, I am
>>>>    against making the whole disk as bad.
>>>>
>>>>> Although it suffers the same problem as I described.
>>>>>
>>>>> To make the point short, the use case looks very limited.
>>>>
>>>>    It applies to raid1/raid10 with nodatacow (which implies nodatasum).
>>>>    In my understanding that's not rare.
>>>>
>>>>    Any comments on the fix offered here?
>>>
>>> The implementation part is, is eb->read_mirror reliable?
>>>
>>> E.g. if the data and the eb are in different chunks, and the stale
>>> happens in the chunk of eb but not in the data chunk?
>>
>>
>>   eb and regular data are indeed in different chunks always. But eb
>>   can never be stale as there is parent transid which is verified against
>>   the read eb. However we do not have such a check for the data (this is
>>   the core of the issue here) and so we return the junk data silently.
>>
>>   Also any idea why the generation number for the extent data is not
>>   incremented [2] when -o nodatacow and notrunc option is used, is it
>>   a bug? the dump-tree is taken with the script as below [1]
>>   (this corruption is seen with or without generation number is
>>   being incremented, but as another way to fix for the corruption we can
>>   verify the inode EXTENT_DATA generation from the same disk from which
>>   the data is read).
>>
>> [1]
>>   umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
>>   mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
>>   echo 1st write: && \
>>   dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>> conv=fsync,notrunc && sync && \
>>   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>>   echo --- && \
>>   btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA" && \
>>   echo 2nd write: && \
>>   dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>> conv=fsync,notrunc && sync && \
>>   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>>   echo --- && \
>>   btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA"
>>
>>
>> 1st write:
>>      item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>>          generation 6 transid 6 size 4096 nbytes 4096
>>          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>>          sequence 1 flags 0x3(NODATASUM|NODATACOW)
>>          atime 1553058460.163985452 (2019-03-20 13:07:40)
>>          ctime 1553058460.163985452 (2019-03-20 13:07:40)
>>          mtime 1553058460.163985452 (2019-03-20 13:07:40)
>>          otime 1553058460.163985452 (2019-03-20 13:07:40)
>> ---
>>      item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>>          generation 6 type 1 (regular)
>>          extent data disk byte 13631488 nr 4096
>>          extent data offset 0 nr 4096 ram 4096
>>          extent compression 0 (none)
>> 2nd write:
>>      item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>>          generation 6 transid 7 size 4096 nbytes 4096
>>          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>>          sequence 2 flags 0x3(NODATASUM|NODATACOW)
>>          atime 1553058460.163985452 (2019-03-20 13:07:40)
>>          ctime 1553058460.189985450 (2019-03-20 13:07:40)
>>          mtime 1553058460.189985450 (2019-03-20 13:07:40)
>>          otime 1553058460.163985452 (2019-03-20 13:07:40)
>> ---
>>      item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>>          generation 6 type 1 (regular)   <----- [2]
>>          extent data disk byte 13631488 nr 4096
>>          extent data offset 0 nr 4096 ram 4096
>>          extent compression 0 (none)
>>
>>
>> Thanks, Anand
> 


* Re: [PATCH RFC] btrfs: fix read corruption from disks of different generation
  2019-03-20  6:19         ` Qu Wenruo
@ 2019-03-20 14:00           ` Anand Jain
  2019-03-20 14:40             ` Qu Wenruo
  0 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2019-03-20 14:00 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs


>>   Also any idea why the generation number for the extent data is not
>>   incremented [2] when -o nodatacow and notrunc option is used, is it
>>   a bug? the dump-tree is taken with the script as below [1]
>>   (this corruption is seen with or without generation number is
>>   being incremented, but as another way to fix for the corruption we can
>>   verify the inode EXTENT_DATA generation from the same disk from which
>>   the data is read).
> 
> For the generation part, it's the generation when data is written to disk.
> 
> Truncation/nocow overwrite shouldn't really change the generation of
> existing file extents.
> 
> So I'm afraid you can't use that generation to do the check.

  Any idea why it shouldn't change? Albeit there isn't a new allocation
  due to the nodatacow and notrunc overwrite, the data is surely
  overwritten. If that's the case then I would guess there will be a bug
  in send/receive as well.

Thanks, Anand

> Thanks,
> Qu
> 
>>
>> [1]
>>   umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
>>   mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
>>   echo 1st write: && \
>>   dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>> conv=fsync,notrunc && sync && \
>>   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>>   echo --- && \
>>   btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA" && \
>>   echo 2nd write: && \
>>   dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>> conv=fsync,notrunc && sync && \
>>   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>>   echo --- && \
>>   btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA"
>>
>>
>> 1st write:
>>      item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>>          generation 6 transid 6 size 4096 nbytes 4096
>>          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>>          sequence 1 flags 0x3(NODATASUM|NODATACOW)
>>          atime 1553058460.163985452 (2019-03-20 13:07:40)
>>          ctime 1553058460.163985452 (2019-03-20 13:07:40)
>>          mtime 1553058460.163985452 (2019-03-20 13:07:40)
>>          otime 1553058460.163985452 (2019-03-20 13:07:40)
>> ---
>>      item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>>          generation 6 type 1 (regular)
>>          extent data disk byte 13631488 nr 4096
>>          extent data offset 0 nr 4096 ram 4096
>>          extent compression 0 (none)
>> 2nd write:
>>      item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>>          generation 6 transid 7 size 4096 nbytes 4096
>>          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>>          sequence 2 flags 0x3(NODATASUM|NODATACOW)
>>          atime 1553058460.163985452 (2019-03-20 13:07:40)
>>          ctime 1553058460.189985450 (2019-03-20 13:07:40)
>>          mtime 1553058460.189985450 (2019-03-20 13:07:40)
>>          otime 1553058460.163985452 (2019-03-20 13:07:40)
>> ---
>>      item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>>          generation 6 type 1 (regular)   <----- [2]
>>          extent data disk byte 13631488 nr 4096
>>          extent data offset 0 nr 4096 ram 4096
>>          extent compression 0 (none)
>>
>>
>> Thanks, Anand
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC] btrfs: fix read corrpution from disks of different generation
  2019-03-20 14:00           ` Anand Jain
@ 2019-03-20 14:40             ` Qu Wenruo
  2019-03-20 15:40               ` Zygo Blaxell
  0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2019-03-20 14:40 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4209 bytes --]



On 2019/3/20 10:00 PM, Anand Jain wrote:
> 
>>>   Also, any idea why the generation number for the extent data is not
>>>   incremented [2] when the -o nodatacow and notrunc options are used? Is it
>>>   a bug? The dump-tree output is taken with the script below [1].
>>>   (This corruption is seen whether or not the generation number is
>>>   incremented, but as another way to fix the corruption we could
>>>   verify the inode EXTENT_DATA generation from the same disk from which
>>>   the data is read.)
>>
>> For the generation part, it's the generation when data is written to
>> disk.
>>
>> Truncation/nocow overwrite shouldn't really change the generation of
>> existing file extents.
>>
>> So I'm afraid you can't use that generation to do the check.
> 
>  Any idea why it shouldn't change? Although there is no new allocation
>  with a nodatacow, notrunc overwrite, the data is certainly overwritten.
>  If that's the case then I would guess there is a bug in send/receive
>  as well.

I'm not sure about the send part.

On the other hand, if btrfs were to update the generation on every
nodatacow file extent overwrite, it would cause a pretty big performance
degradation.

The idea of nodatacow is to skip all the expensive csum work, extent
allocation (maybe not that expensive) and the contention on the subvol tree.

If we're going to update file extents for such a case, we're re-introducing
a performance impact for users who don't want that impact at all.
I don't believe it's worth it.

Thanks,
Qu

> 
> Thanks, Anand
> 
>> Thanks,
>> Qu
>>
>>>
>>> [1]
>>>   umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
>>>   mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
>>>   echo 1st write: && \
>>>   dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>>> conv=fsync,notrunc && sync && \
>>>   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>>>   echo --- && \
>>>   btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA" && \
>>>   echo 2nd write: && \
>>>   dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>>> conv=fsync,notrunc && sync && \
>>>   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>>>   echo --- && \
>>>   btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA"
>>>
>>>
>>> 1st write:
>>>      item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>>>          generation 6 transid 6 size 4096 nbytes 4096
>>>          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>>>          sequence 1 flags 0x3(NODATASUM|NODATACOW)
>>>          atime 1553058460.163985452 (2019-03-20 13:07:40)
>>>          ctime 1553058460.163985452 (2019-03-20 13:07:40)
>>>          mtime 1553058460.163985452 (2019-03-20 13:07:40)
>>>          otime 1553058460.163985452 (2019-03-20 13:07:40)
>>> ---
>>>      item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>>>          generation 6 type 1 (regular)
>>>          extent data disk byte 13631488 nr 4096
>>>          extent data offset 0 nr 4096 ram 4096
>>>          extent compression 0 (none)
>>> 2nd write:
>>>      item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>>>          generation 6 transid 7 size 4096 nbytes 4096
>>>          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>>>          sequence 2 flags 0x3(NODATASUM|NODATACOW)
>>>          atime 1553058460.163985452 (2019-03-20 13:07:40)
>>>          ctime 1553058460.189985450 (2019-03-20 13:07:40)
>>>          mtime 1553058460.189985450 (2019-03-20 13:07:40)
>>>          otime 1553058460.163985452 (2019-03-20 13:07:40)
>>> ---
>>>      item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>>>          generation 6 type 1 (regular)   <----- [2]
>>>          extent data disk byte 13631488 nr 4096
>>>          extent data offset 0 nr 4096 ram 4096
>>>          extent compression 0 (none)
>>>
>>>
>>> Thanks, Anand
>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC] btrfs: fix read corrpution from disks of different generation
  2019-03-20 14:40             ` Qu Wenruo
@ 2019-03-20 15:40               ` Zygo Blaxell
  2019-03-21  6:37                 ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: Zygo Blaxell @ 2019-03-20 15:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Anand Jain, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5547 bytes --]

On Wed, Mar 20, 2019 at 10:40:07PM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/3/20 10:00 PM, Anand Jain wrote:
> > 
> >>>   Also, any idea why the generation number for the extent data is not
> >>>   incremented [2] when the -o nodatacow and notrunc options are used? Is it
> >>>   a bug? The dump-tree output is taken with the script below [1].
> >>>   (This corruption is seen whether or not the generation number is
> >>>   incremented, but as another way to fix the corruption we could
> >>>   verify the inode EXTENT_DATA generation from the same disk from which
> >>>   the data is read.)
> >>
> >> For the generation part, it's the generation when data is written to
> >> disk.
> >>
> >> Truncation/nocow overwrite shouldn't really change the generation of
> >> existing file extents.
> >>
> >> So I'm afraid you can't use that generation to do the check.
> > 
> >  Any idea why it shouldn't change? Although there is no new allocation
> >  with a nodatacow, notrunc overwrite, the data is certainly overwritten.

The references to the extent in the subvol trees hold a copy of the
extent's generation, so if the extent's generation is modified, all the
references to the extent in all the subvol trees have to be modified too,
or they fail transid verification later on.  Compared to pure datacow
(nodatasum, compress=none), this would be the same number of iops, only
fragmentation can be saved because of block overwrites (and even that
isn't saved all the time).

> >  If that's the case then I would guess there is a bug in send/receive
> >  as well.

Send requires a read-only snapshot, and the snapshot's reference to
the nodatacow extents automatically turns on datacow for those extents.
Thus, send behaves correctly because nodatacow is disabled.

The nodatacow flag is advisory.  It doesn't prevent btrfs from relocating
data when needed.

> I'm not sure about the send part.
> 
> On the other hand, if btrfs were to update the generation on every
> nodatacow file extent overwrite, it would cause a pretty big performance
> degradation.
> 
> The idea of nodatacow is to skip all the expensive csum work, extent
> allocation (maybe not that expensive) and the contention on the subvol tree.

nodatacow also skips RAID data integrity checks.  Generally, applications
and admins should plan for any data put in a nodatacow file to be silently
corrupted at any time, i.e. the same situation as an ext4-on-mdadm setup.
This is the price for minimizing the overhead for a write to a nodatacow
extent.
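
For reference, nodatacow is also commonly set per file with chattr rather
than per mount; a minimal sketch, assuming an existing btrfs mount at
/btrfs:

  touch /btrfs/vm.img
  # The NOCOW attribute must be set while the file is still empty; data
  # written afterwards carries no csum, so scrub/read-repair cannot tell
  # a stale copy from a good one.
  chattr +C /btrfs/vm.img
  lsattr /btrfs/vm.img      # the 'C' flag indicates nodatacow
  dd if=/dev/zero of=/btrfs/vm.img bs=1M count=16 conv=fsync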

> If we're going to update file extents for such a case, we're re-introducing
> a performance impact for users who don't want that impact at all.
> I don't believe it's worth it.
> 
> Thanks,
> Qu
> 
> > 
> > Thanks, Anand
> > 
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> [1]
> >>>   umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
> >>>   mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
> >>>   echo 1st write: && \
> >>>   dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
> >>> conv=fsync,notrunc && sync && \
> >>>   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
> >>>   echo --- && \
> >>>   btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA" && \
> >>>   echo 2nd write: && \
> >>>   dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
> >>> conv=fsync,notrunc && sync && \
> >>>   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
> >>>   echo --- && \
> >>>   btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA"
> >>>
> >>>
> >>> 1st write:
> >>>      item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
> >>>          generation 6 transid 6 size 4096 nbytes 4096
> >>>          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
> >>>          sequence 1 flags 0x3(NODATASUM|NODATACOW)
> >>>          atime 1553058460.163985452 (2019-03-20 13:07:40)
> >>>          ctime 1553058460.163985452 (2019-03-20 13:07:40)
> >>>          mtime 1553058460.163985452 (2019-03-20 13:07:40)
> >>>          otime 1553058460.163985452 (2019-03-20 13:07:40)
> >>> ---
> >>>      item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
> >>>          generation 6 type 1 (regular)
> >>>          extent data disk byte 13631488 nr 4096
> >>>          extent data offset 0 nr 4096 ram 4096
> >>>          extent compression 0 (none)
> >>> 2nd write:
> >>>      item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
> >>>          generation 6 transid 7 size 4096 nbytes 4096
> >>>          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
> >>>          sequence 2 flags 0x3(NODATASUM|NODATACOW)
> >>>          atime 1553058460.163985452 (2019-03-20 13:07:40)
> >>>          ctime 1553058460.189985450 (2019-03-20 13:07:40)
> >>>          mtime 1553058460.189985450 (2019-03-20 13:07:40)
> >>>          otime 1553058460.163985452 (2019-03-20 13:07:40)
> >>> ---
> >>>      item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
> >>>          generation 6 type 1 (regular)   <----- [2]
> >>>          extent data disk byte 13631488 nr 4096
> >>>          extent data offset 0 nr 4096 ram 4096
> >>>          extent compression 0 (none)
> >>>
> >>>
> >>> Thanks, Anand
> >>
> 




[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC] btrfs: fix read corrpution from disks of different generation
  2019-03-20 13:54           ` Anand Jain
@ 2019-03-20 15:46             ` Zygo Blaxell
  0 siblings, 0 replies; 15+ messages in thread
From: Zygo Blaxell @ 2019-03-20 15:46 UTC (permalink / raw)
  To: Anand Jain; +Cc: Qu Wenruo, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 8076 bytes --]

On Wed, Mar 20, 2019 at 09:54:16PM +0800, Anand Jain wrote:
> 
> 
> On 3/20/19 2:27 PM, Qu Wenruo wrote:
> > 
> > 
> > On 2019/3/20 1:47 PM, Anand Jain wrote:
> > > 
> > > 
> > > > > > > A tree based integrity verification
> > > > > > >     is important for all data, which is missing.
> > > > > > >       Fix:
> > > > > > >       In this RFC patch it proposes to use same disk from with the
> > > > > > > metadata
> > > > > > >     is read to read the data.
> > > > > > 
> > > > > > The obvious problem I found is, the idea only works for RAID1/10.
> > > > > > 
> > > > > > For striped profile it makes no sense, or even have a worse chance to
> > > > > > get stale data.
> > > > > > 
> > > > > > 
> > > > > > To me, the idea of using possible better mirror makes some sense, but
> > > > > > very profile limited.
> > > > > 
> > > > >    Yep. This problem and fix is only for the mirror based profiles
> > > > >    such as raid1/raid10.
> > > > 
> > > > Then the current implementation lacks such a check.
> > > > 
> > > > Furthermore, data and metadata can lie in different chunks and have
> > > > different chunk types.
> > > 
> > >   Right. Current tests for this RFC were only for raid1.
> > > 
> > >   But the final patch can fix that.
> > > 
> > >   In fact the current patch works for all the cases except when a
> > >   replace is running with a mix of metadata:raid1 and data:raid56.
> > > 
> > >   We need some cleanups around mirror_num; basically we need to bring it
> > >   under a #define and handle it accordingly in __btrfs_map_block().
> > 
> > 
> > Wait for a minute.
> > 
> > There is a hidden pitfall from the very beginning.
> > 
> > Consider such chunk layout:
> > Chunk A Type DATA|RAID1
> > Stripe 1: Dev 1
> > Stripe 2: Dev 2
> > 
> > Chunk B Type METADATA|RAID1
> > Stripe 1: Dev 2
> > Stripe 2: Dev 1
> 
>  Right, this fix can't solve this case.
>  So that's one down. The other choices I had are to:

I think it's more than one down.  I think *only* a handful of scenarios
can work.

What happens if there are 4 disks?

Chunk A Type DATA|RAID1
Stripe 1: Dev 1
Stripe 2: Dev 2

Chunk B Type METADATA|RAID1
Stripe 1: Dev 3
Stripe 2: Dev 4

How do you know if dev 2 lost some data writes, but came back in time
to update the superblock?  Which disk do you plan to read data from if
Dev 3 fails a generation check?
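
The actual per-chunk stripe to device mapping in a layout like the above
can be inspected directly; a rough sketch (the device name is only an
example):

  # Each CHUNK_ITEM records its profile and the devid of every stripe,
  # which is what decides where "mirror 1" of that particular chunk lives.
  btrfs inspect-internal dump-tree -t chunk /dev/sdb | \
        grep -E "CHUNK_ITEM|type|devid"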

>  Quarantine the whole disk by comparing the dev1 and dev2 generation # in
> the superblock during mount, or
>  Quarantine whole chunk(s) by comparing the dev1 and dev2 generation # in
> the chunk tree during mount, or
>  Always read the eb from both dev1 and dev2; if that fails, then fail its
> corresponding data block as well, if any, or
>  During mount compare the dev1 and dev2 generation #, and record the range
> of generation numbers missing on the disk. But this needs on-disk format
> changes. (But nodatacow, notrunc reg data extent writes must change gen#.) or
>  Similar to the btree eb read pages, nest the extent data read as well
>  (but nodatacow, notrunc reg data extent writes must change gen#).
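
As a rough sketch of the first option above (comparing the per-device
superblock generations; the device names are only examples):

  # A device that missed writes is left with an older committed
  # generation in its superblock than its raid1 peer.
  for d in /dev/sdb /dev/sdc; do
          echo -n "$d: "
          btrfs inspect-internal dump-super "$d" | grep '^generation'
  done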

> Thanks, Anand
> 
> 
> > Then when we find stale metadata in chunk B mirror 1, caused by dev 2,
> > your patch considers device 2 stale, and tries to use mirror num 2 to
> > read from the data chunk.
> > 
> > However in the data chunk, mirror num 2 still means device 2, and
> > we get stale data.
> > 
> > So the eb->mirror_num can still map to a bad/stale device, due to the
> > flexibility provided by btrfs per-chunk mapping.
> > 
> > Thanks,
> > Qu
> > 
> > > 
> > > > > > Another idea inspired by this is to make it more generic, so
> > > > > > that a bad/stale device gets a lower priority.
> > > > > 
> > > > >    When it comes to reading junk data, it's not about the priority, it's
> > > > >    about eliminating it. When the problem is only a few blocks, I am
> > > > >    against marking the whole disk as bad.
> > > > > 
> > > > > > Although it suffers the same problem as I described.
> > > > > > 
> > > > > > To make the point short, the use case looks very limited.
> > > > > 
> > > > >    It applies to raid1/raid10 with nodatacow (which implies nodatasum).
> > > > >    In my understanding that's not rare.
> > > > > 
> > > > >    Any comments on the fix offered here?
> > > > 
> > > > On the implementation side: is eb->read_mirror reliable?
> > > > 
> > > > E.g. what if the data and the eb are in different chunks, and the staleness
> > > > happens in the chunk of the eb but not in the data chunk?
> > > 
> > > 
> > >   The eb and regular data are indeed always in different chunks. But an eb
> > >   can never be stale, as the parent's transid is verified against the eb
> > >   that is read. However we do not have such a check for the data (this is
> > >   the core of the issue here) and so we return the junk data silently.
> > > 
> > >   Also, any idea why the generation number for the extent data is not
> > >   incremented [2] when the -o nodatacow and notrunc options are used? Is it
> > >   a bug? The dump-tree output is taken with the script below [1].
> > >   (This corruption is seen whether or not the generation number is
> > >   incremented, but as another way to fix the corruption we could
> > >   verify the inode EXTENT_DATA generation from the same disk from which
> > >   the data is read.)
> > > 
> > > [1]
> > >   umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
> > >   mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
> > >   echo 1st write: && \
> > >   dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
> > > conv=fsync,notrunc && sync && \
> > >   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
> > >   echo --- && \
> > >   btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA" && \
> > >   echo 2nd write: && \
> > >   dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
> > > conv=fsync,notrunc && sync && \
> > >   btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
> > >   echo --- && \
> > >   btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA"
> > > 
> > > 
> > > 1st write:
> > >      item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
> > >          generation 6 transid 6 size 4096 nbytes 4096
> > >          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
> > >          sequence 1 flags 0x3(NODATASUM|NODATACOW)
> > >          atime 1553058460.163985452 (2019-03-20 13:07:40)
> > >          ctime 1553058460.163985452 (2019-03-20 13:07:40)
> > >          mtime 1553058460.163985452 (2019-03-20 13:07:40)
> > >          otime 1553058460.163985452 (2019-03-20 13:07:40)
> > > ---
> > >      item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
> > >          generation 6 type 1 (regular)
> > >          extent data disk byte 13631488 nr 4096
> > >          extent data offset 0 nr 4096 ram 4096
> > >          extent compression 0 (none)
> > > 2nd write:
> > >      item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
> > >          generation 6 transid 7 size 4096 nbytes 4096
> > >          block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
> > >          sequence 2 flags 0x3(NODATASUM|NODATACOW)
> > >          atime 1553058460.163985452 (2019-03-20 13:07:40)
> > >          ctime 1553058460.189985450 (2019-03-20 13:07:40)
> > >          mtime 1553058460.189985450 (2019-03-20 13:07:40)
> > >          otime 1553058460.163985452 (2019-03-20 13:07:40)
> > > ---
> > >      item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
> > >          generation 6 type 1 (regular)   <----- [2]
> > >          extent data disk byte 13631488 nr 4096
> > >          extent data offset 0 nr 4096 ram 4096
> > >          extent compression 0 (none)
> > > 
> > > 
> > > Thanks, Anand
> > 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC] btrfs: fix read corrpution from disks of different generation
  2019-03-20 15:40               ` Zygo Blaxell
@ 2019-03-21  6:37                 ` Anand Jain
  0 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2019-03-21  6:37 UTC (permalink / raw)
  To: Zygo Blaxell, Qu Wenruo; +Cc: linux-btrfs



On 3/20/19 11:40 PM, Zygo Blaxell wrote:
> On Wed, Mar 20, 2019 at 10:40:07PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2019/3/20 10:00 PM, Anand Jain wrote:
>>>
>>>>>    Also, any idea why the generation number for the extent data is not
>>>>>    incremented [2] when the -o nodatacow and notrunc options are used? Is it
>>>>>    a bug? The dump-tree output is taken with the script below [1].
>>>>>    (This corruption is seen whether or not the generation number is
>>>>>    incremented, but as another way to fix the corruption we could
>>>>>    verify the inode EXTENT_DATA generation from the same disk from which
>>>>>    the data is read.)
>>>>
>>>> For the generation part, it's the generation when data is written to
>>>> disk.
>>>>
>>>> Truncation/nocow overwrite shouldn't really change the generation of
>>>> existing file extents.
>>>>
>>>> So I'm afraid you can't use that generation to do the check.
>>>
>>>   Any idea why it shouldn't change? Although there is no new allocation
>>>   with a nodatacow, notrunc overwrite, the data is certainly overwritten.
> 
> The references to the extent in the subvol trees hold a copy of the
> extent's generation, so if the extent's generation is modified, all the
> references to the extent in all the subvol trees have to be modified too,
> or they fail transid verification later on.


  Extents with refs > 1 will cow, so the gen # is incremented on the
  overwritten extents. But my scenario is when refs = 1. One of the
  ideas I am mulling is to track/record the missed generation numbers
  on the disks and quarantine the reads from those disks accordingly.
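
  A minimal sketch of that, in the style of [1] (using a nodatacow mount
  so the first overwrite stays nocow; not verified against every kernel):

  umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb
  mount -o notreelog,max_inline=0,nodatacow /dev/sdb /btrfs
  dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync
  # refs = 1: the overwrite stays nocow, same disk bytenr and generation
  dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync
  btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA"
  # a ro snapshot takes a reference (refs > 1), so the next overwrite cows
  btrfs subvolume snapshot -r /btrfs /btrfs/snap && sync
  dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1 conv=fsync,notrunc && sync
  btrfs in dump-tree /dev/sdb | grep -A4 "257 EXTENT_DATA"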


> Compared to pure datacow
> (nodatasum, compress=none), this would be the same number of iops, only
> fragmentation can be saved because of block overwrites (and even that
> isn't saved all the time).
> 
>>>   If that's the case then I would guess there is a bug in send/receive
>>>   as well.
> 
> Send requires a read-only snapshot, and the snapshot's reference to
> the nodatacow extents automatically turns on datacow for those extents.
> Thus, send behaves correctly because nodatacow is disabled.
> 
> The nodatacow flag is advisory.  It doesn't prevent btrfs from relocating
> data when needed.

  Oh yes, I forgot: there is a read-only snapshot prerequisite for send.

Thanks, Anand

>> I'm not sure about the send part.
>>
>> On the other hand, if btrfs is going to update the generation of
>> nodatacow file extent overwrite, it should cause pretty big performance
>> degradation.
>>
>> The idea of nodatacow is to skip all the expensive csum, extent
>> allocation (maybe not that expensive) and the race of subvol tree.
> 
> nodatacow also skips RAID data integrity checks.  Generally, applications
> and admins should plan for any data put in a nodatacow file to be silently
> corrupted at any time, i.e. the same situation as an ext4-on-mdadm setup.
> This is the price for minimizing the overhead for a write to a nodatacow
> extent.
> 
>> If we're going to update file extents for such case, we're re-introduce
>> performance impact to users who don't want that impact at all.
>> I don't believe it's worthy at all.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks, Anand
>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> [1]
>>>>>    umount /btrfs; mkfs.btrfs -fq -dsingle -msingle /dev/sdb && \
>>>>>    mount -o notreelog,max_inline=0,nodatasum /dev/sdb /btrfs && \
>>>>>    echo 1st write: && \
>>>>>    dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>>>>> conv=fsync,notrunc && sync && \
>>>>>    btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>>>>>    echo --- && \
>>>>>    btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA" && \
>>>>>    echo 2nd write: && \
>>>>>    dd status=none if=/dev/urandom of=/btrfs/anand bs=4096 count=1
>>>>> conv=fsync,notrunc && sync && \
>>>>>    btrfs in dump-tree /dev/sdb | egrep -A7 "257 INODE_ITEM 0\) item" && \
>>>>>    echo --- && \
>>>>>    btrfs in dump-tree /dev/sdb  | grep -A4 "257 EXTENT_DATA"
>>>>>
>>>>>
>>>>> 1st write:
>>>>>       item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>>>>>           generation 6 transid 6 size 4096 nbytes 4096
>>>>>           block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>>>>>           sequence 1 flags 0x3(NODATASUM|NODATACOW)
>>>>>           atime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>>           ctime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>>           mtime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>>           otime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>> ---
>>>>>       item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>>>>>           generation 6 type 1 (regular)
>>>>>           extent data disk byte 13631488 nr 4096
>>>>>           extent data offset 0 nr 4096 ram 4096
>>>>>           extent compression 0 (none)
>>>>> 2nd write:
>>>>>       item 4 key (257 INODE_ITEM 0) itemoff 15881 itemsize 160
>>>>>           generation 6 transid 7 size 4096 nbytes 4096
>>>>>           block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>>>>>           sequence 2 flags 0x3(NODATASUM|NODATACOW)
>>>>>           atime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>>           ctime 1553058460.189985450 (2019-03-20 13:07:40)
>>>>>           mtime 1553058460.189985450 (2019-03-20 13:07:40)
>>>>>           otime 1553058460.163985452 (2019-03-20 13:07:40)
>>>>> ---
>>>>>       item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
>>>>>           generation 6 type 1 (regular)   <----- [2]
>>>>>           extent data disk byte 13631488 nr 4096
>>>>>           extent data offset 0 nr 4096 ram 4096
>>>>>           extent compression 0 (none)
>>>>>
>>>>>
>>>>> Thanks, Anand
>>>>
>>
> 
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2] fstests: btrfs test read from disks of different generation
  2019-03-19 11:35 ` [PATCH] fstests: btrfs test read " Anand Jain
@ 2020-04-06 12:00   ` Anand Jain
  0 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2020-04-06 12:00 UTC (permalink / raw)
  To: linux-btrfs

Verify file read from disks assembled with different generations.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
(Not yet for integration; the prerequisites are not yet integrated.)

v2: rebase.
    use the sysfs interface to set the read_policy instead of mount

 tests/btrfs/210     | 147 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/210.out |  12 +++++
 tests/btrfs/group   |   1 +
 3 files changed, 160 insertions(+)
 create mode 100755 tests/btrfs/210
 create mode 100644 tests/btrfs/210.out

diff --git a/tests/btrfs/210 b/tests/btrfs/210
new file mode 100755
index 000000000000..df51ab584688
--- /dev/null
+++ b/tests/btrfs/210
@@ -0,0 +1,147 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2019 Oracle.  All Rights Reserved.
+#
+# FS QA Test 210
+#
+# Test read on raid1 assembled with disks of different generations.
+#  Steps:
+#   . Use btrfs snapshot on the test_dir to create disk images of
+#     different generations (let's say generation a and generation b).
+#   . Mount one disk from generation a and another disk from generation
+#     b, and verify the file md5sum.
+#  Note: readmirror feature (in the btrfs ml) helps to reproduce the
+#     problem consistently.
+#  Fix-by: kernel patch
+#   NA
+#  Temp workaround fix for two-disk raid1:
+#   [PATCH] btrfs: fix read corrpution on disks of different generation
+#  Needs read_policy patch-set to test:
+#   [PATCH v7 0/5] readmirror feature (sysfs and in-memory only approach; with new read_policy device)
+
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+	_scratch_unmount > /dev/null 2>&1
+	_destroy_loop_device $dev1
+	_destroy_loop_device $dev2
+	btrfs subvolume delete $img_subvol > /dev/null 2>&1
+	btrfs subvolume delete $img_subvol_at_gen_a > /dev/null 2>&1
+	[[ -z $scratch_dev_pool_saved ]] || \
+			SCRATCH_DEV_POOL="$scratch_dev_pool_saved"
+	_scratch_dev_pool_put
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_test
+_require_scratch_dev_pool 2
+_require_btrfs_forget_or_module_loadable
+
+_btrfs_forget_or_module_reload
+
+_scratch_dev_pool_get 2
+scratch_dev_pool_saved=""
+
+img_subvol=$TEST_DIR/img_subvol
+img_subvol_at_gen_a=$TEST_DIR/img_subvol_at_gen_a
+dev1_img=$img_subvol/d1
+dev2_img=$img_subvol/d2
+dev1_img_at_gen_a=$img_subvol_at_gen_a/d1
+dev2_img_at_gen_a=$img_subvol_at_gen_a/d2
+
+$BTRFS_UTIL_PROG subvolume create $img_subvol >> /dev/null 2>&1 || \
+					_fail "subvol create failed"
+
+$XFS_IO_PROG -f -c "pwrite -S 0xdd 0 256m" $dev1_img >> /dev/null 2>&1
+$XFS_IO_PROG -f -c "pwrite -S 0xdd 0 256m" $dev2_img >> /dev/null 2>&1
+
+dev1=$(_create_loop_device $dev1_img)
+dev2=$(_create_loop_device $dev2_img)
+
+scratch_dev_pool_saved="$SCRATCH_DEV_POOL"
+SCRATCH_DEV_POOL="$dev1 $dev2"
+_scratch_pool_mkfs -U 12345678-1234-1234-1234-123456789abc -d raid1 -m raid1 >>\
+		$seqres.full 2>&1 || _fail  "mkfs failed on $SCRATCH_DEV_POOL"
+
+
+_mount "-o max_inline=0,nodatacow,device=$dev1" $dev2 $SCRATCH_MNT || \
+					_fail "mount for gen a failed"
+
+$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 4K" $SCRATCH_MNT/foobar >> /dev/null 2>&1
+
+echo md5sum at generation aa
+md5sum $SCRATCH_MNT/foobar | _filter_scratch
+echo
+_scratch_unmount
+
+$BTRFS_UTIL_PROG subvolume snapshot $img_subvol $img_subvol_at_gen_a \
+		>> /dev/null 2>&1 || _fail "subvol snapshot failed"
+
+_mount "-o max_inline=0,nodatacow,device=$dev1" $dev2 $SCRATCH_MNT || \
+					_fail "mount for gen b failed"
+
+$XFS_IO_PROG -f -c "pwrite -S 0xbb 0 4K" $SCRATCH_MNT/foobar >> /dev/null 2>&1
+
+echo md5sum at generation bb
+md5sum $SCRATCH_MNT/foobar | _filter_scratch
+echo
+
+_scratch_unmount
+
+_destroy_loop_device $dev1
+_destroy_loop_device $dev2
+
+_test_unmount
+_btrfs_forget_or_module_reload
+_test_mount
+
+dev1=$(_create_loop_device $dev1_img_at_gen_a)
+dev2=$(_create_loop_device $dev2_img)
+
+_mount "-o ro,device=$dev1" $dev2 $SCRATCH_MNT || _fail "mount at gen ab failed"
+
+SCRATCH_SYSFS_PATH="/sys/fs/btrfs/12345678-1234-1234-1234-123456789abc"
+
+devid_for_read=$($BTRFS_UTIL_PROG filesystem show -m $SCRATCH_MNT| grep $dev1 |\
+							awk '{print $2}')
+echo 1 > $SCRATCH_SYSFS_PATH/devinfo/$devid_for_read/read_preferred
+echo "device" > $SCRATCH_SYSFS_PATH/read_policy
+
+echo md5sum at generation ab read from devid=$devid_for_read
+md5sum $SCRATCH_MNT/foobar | _filter_scratch
+
+echo 1 > /proc/sys/vm/drop_caches
+echo 0 > $SCRATCH_SYSFS_PATH/devinfo/$devid_for_read/read_preferred
+devid_for_read=$($BTRFS_UTIL_PROG filesystem show -m $SCRATCH_MNT| grep $dev2 |\
+							awk '{print $2}')
+echo 1 > $SCRATCH_SYSFS_PATH/devinfo/$devid_for_read/read_preferred
+
+echo
+echo md5sum at generation ab read from devid=$devid_for_read
+md5sum $SCRATCH_MNT/foobar | _filter_scratch
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/210.out b/tests/btrfs/210.out
new file mode 100644
index 000000000000..b55030fea1fa
--- /dev/null
+++ b/tests/btrfs/210.out
@@ -0,0 +1,12 @@
+QA output created by 210
+md5sum at generation aa
+3e4309c7cc81f23d45e260a8f13ca860  SCRATCH_MNT/foobar
+
+md5sum at generation bb
+055a2a7342c0528969e1b6e32aadeee3  SCRATCH_MNT/foobar
+
+md5sum at generation ab read from devid=1
+055a2a7342c0528969e1b6e32aadeee3  SCRATCH_MNT/foobar
+
+md5sum at generation ab read from devid=2
+055a2a7342c0528969e1b6e32aadeee3  SCRATCH_MNT/foobar
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 657ec548159b..3f71ce235d96 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -212,3 +212,4 @@
 207 auto rw raid
 208 auto quick subvol
 209 auto quick log
+210 auto quick volume
-- 
1.8.3.1
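
For reference, once the prerequisite read_policy patches are applied, the
test runs like any other fstests case (a sketch; the fstests checkout path
is only an example):

  cd xfstests-dev
  # needs SCRATCH_DEV_POOL with at least two devices and a btrfs TEST_DIR
  ./check btrfs/210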


^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-04-06 12:00 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-19 11:35 [PATCH RFC] btrfs: fix read corrpution from disks of different generation Anand Jain
2019-03-19 11:35 ` [PATCH] fstests: btrfs test read " Anand Jain
2020-04-06 12:00   ` [PATCH v2] " Anand Jain
2019-03-19 12:07 ` [PATCH RFC] btrfs: fix read corrpution " Qu Wenruo
2019-03-19 23:41   ` Anand Jain
2019-03-20  1:02     ` Qu Wenruo
2019-03-20  5:47       ` Anand Jain
2019-03-20  6:19         ` Qu Wenruo
2019-03-20 14:00           ` Anand Jain
2019-03-20 14:40             ` Qu Wenruo
2019-03-20 15:40               ` Zygo Blaxell
2019-03-21  6:37                 ` Anand Jain
2019-03-20  6:27         ` Qu Wenruo
2019-03-20 13:54           ` Anand Jain
2019-03-20 15:46             ` Zygo Blaxell
