linux-btrfs.vger.kernel.org archive mirror
* [PATCH 0/4] Fix raid6 reconstruction bug
@ 2017-11-02  0:54 Liu Bo
  2017-11-02  0:54 ` [PATCH 1/4] Btrfs: introduce device flags Liu Bo
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Liu Bo @ 2017-11-02  0:54 UTC (permalink / raw)
  To: linux-btrfs

This is an attempt to fix a raid6 bug: if two disks are hot-removed at
run time, the reconstruction process is not able to rebuild correct
data, even though raid6 can theoretically tolerate two disk failures.

Patch 1 is a preparation patch, which introduces the necessary flags
and flag handling.

More details about the above bug can be found in Patch 2.

Patch 3 applies the flags to the raid1/raid10 profiles.

Patch 4 changes how In_sync is set so that the flag is reported
correctly for each device.

A test case[1] for fstests has been sent out.

[1]: fstests: btrfs: add test case for raid6 reconstruction bug
https://patchwork.kernel.org/patch/10037769/

Liu Bo (4):
  Btrfs: introduce device flags
  Btrfs: fix data corruption in raid6
  Btrfs: make raid1 and raid10 be aware of device flag In_sync
  Btrfs: change how we set In_sync

 fs/btrfs/disk-io.c | 21 ++++++++++++++++-
 fs/btrfs/raid56.c  |  3 ++-
 fs/btrfs/volumes.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++--------
 fs/btrfs/volumes.h |  8 +++++++
 4 files changed, 88 insertions(+), 12 deletions(-)

-- 
2.9.4


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/4] Btrfs: introduce device flags
  2017-11-02  0:54 [PATCH 0/4] Fix raid6 reconstruction bug Liu Bo
@ 2017-11-02  0:54 ` Liu Bo
  2017-11-06 16:40   ` David Sterba
  2017-11-07  9:30   ` Anand Jain
  2017-11-02  0:54 ` [PATCH 2/4] Btrfs: fix data corruption in raid6 Liu Bo
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 16+ messages in thread
From: Liu Bo @ 2017-11-02  0:54 UTC (permalink / raw)
  To: linux-btrfs

Here we have defined two flags,
- Faulty
- In_sync

Currently only In_sync is in use; it only matters when the device
serves as part of a raid profile.  The In_sync flag is decided while
mounting a btrfs and opening its devices: by default every device is
set with In_sync, but the flag is not set if the device's generation
is out of date.

If the device has not been resynced, then to avoid overriding its
generation while writing the superblock, a still-writeable !In_sync
device records its last valid generation in
superblock.dev_item.generation instead of superblock.generation.  That
way the last valid generation is retained across several reboots and
btrfs is able to detect the out-of-sync device.
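
For illustration only, here is a stand-alone user-space sketch of the
detection idea (the device names and generation numbers are made up):
the device that recorded the highest generation is trusted, and any
device whose recorded generation lags behind is treated as !In_sync.

#include <stdio.h>

struct dev { const char *name; unsigned long long gen; int in_sync; };

int main(void)
{
	/* sdd was unplugged while sdb/sdc kept committing transactions */
	struct dev devs[] = {
		{ "sdb", 120, 0 }, { "sdc", 120, 0 }, { "sdd", 100, 0 },
	};
	int i, n = sizeof(devs) / sizeof(devs[0]);
	unsigned long long latest = 0;

	/* first pass: find the newest generation among all devices */
	for (i = 0; i < n; i++)
		if (devs[i].gen > latest)
			latest = devs[i].gen;

	/* second pass: only devices at the newest generation are In_sync */
	for (i = 0; i < n; i++) {
		devs[i].in_sync = (devs[i].gen == latest);
		printf("%s gen %llu is %sIn_sync\n", devs[i].name,
		       devs[i].gen, devs[i].in_sync ? "" : "not ");
	}
	return 0;
}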

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/disk-io.c | 21 ++++++++++++++++++++-
 fs/btrfs/volumes.c | 47 ++++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/volumes.h |  8 ++++++++
 3 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 487bbe4..a080d58 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3640,6 +3640,7 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 	int do_barriers;
 	int max_errors;
 	int total_errors = 0;
+	u64 generation;
 	u64 flags;
 
 	do_barriers = !btrfs_test_opt(fs_info, NOBARRIER);
@@ -3671,7 +3672,25 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 		if (!dev->in_fs_metadata || !dev->writeable)
 			continue;
 
-		btrfs_set_stack_device_generation(dev_item, 0);
+		/*
+		 * if dev is not in_sync, record the last valid
+		 * generation.
+		 *
+		 * This is used to detect !In_sync device after
+		 * reboot, now we check a device's In_sync status by
+		 * comparing its superblock->dev_item->generation with
+		 * the latest one.
+		 *
+		 * We don't compare superblock->generation because
+		 * after reboot and a successfuly superblock writing,
+		 * generation in superblock will be updated.
+		 */
+		if (!test_bit(In_sync, &dev->flags))
+			generation = dev->generation;
+		else
+			generation = 0;
+		btrfs_set_stack_device_generation(dev_item, generation);
+
 		btrfs_set_stack_device_type(dev_item, dev->type);
 		btrfs_set_stack_device_id(dev_item, dev->devid);
 		btrfs_set_stack_device_total_bytes(dev_item,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0e8f16c..7b29b1a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -250,6 +250,7 @@ static struct btrfs_device *__alloc_device(void)
 	btrfs_device_data_ordered_init(dev);
 	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
+	ASSERT(dev->flags == 0);
 
 	return dev;
 }
@@ -1003,10 +1004,32 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 			   BTRFS_UUID_SIZE))
 			goto error_brelse;
 
-		device->generation = btrfs_super_generation(disk_super);
-		if (!latest_dev ||
-		    device->generation > latest_dev->generation)
+		if (btrfs_stack_device_generation(&disk_super->dev_item) == 0)
+			device->generation = btrfs_super_generation(disk_super);
+		else
+			device->generation =
+				btrfs_stack_device_generation(&disk_super->dev_item);
+
+		/* no lock is required during the initial stage. */
+		if (!latest_dev) {
+			set_bit(In_sync, &device->flags);
 			latest_dev = device;
+		} else {
+			if (device->generation > latest_dev->generation) {
+				set_bit(In_sync, &device->flags);
+				clear_bit(In_sync, &latest_dev->flags);
+				latest_dev = device;
+			} else if (device->generation == latest_dev->generation) {
+				set_bit(In_sync, &device->flags);
+			}
+			/*
+			 * if (device->generation < latest_dev->generation)
+			 *	# don't set In_sync
+			 */
+		}
+
+		if (!test_bit(In_sync, &device->flags))
+			pr_info("dev %s gen %llu is not In_sync\n", device->name->str, device->generation);
 
 		if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING) {
 			device->writeable = 0;
@@ -2379,6 +2402,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	q = bdev_get_queue(bdev);
 	if (blk_queue_discard(q))
 		device->can_discard = 1;
+	set_bit(In_sync, &device->flags);
 	device->writeable = 1;
 	device->generation = trans->transid;
 	device->io_width = fs_info->sectorsize;
@@ -2599,6 +2623,7 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	device->dev_stats_valid = 1;
 	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 	device->fs_devices = fs_info->fs_devices;
+	set_bit(In_sync, &device->flags);
 	list_add(&device->dev_list, &fs_info->fs_devices->devices);
 	fs_info->fs_devices->num_devices++;
 	fs_info->fs_devices->open_devices++;
@@ -6450,6 +6475,14 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 			btrfs_report_missing_device(fs_info, devid, uuid);
 			return -EIO;
 		}
+		if (map->stripes[i].dev &&
+		    !test_bit(In_sync, &map->stripes[i].dev->flags) &&
+		    !btrfs_test_opt(fs_info, DEGRADED)) {
+			btrfs_warn(fs_info, "devid %llu uuid %pU is Not in_sync, but we're not under degraded mode",
+				   devid, uuid);
+			free_extent_map(em);
+			return -EIO;
+		}
 		if (!map->stripes[i].dev) {
 			map->stripes[i].dev =
 				add_missing_dev(fs_info->fs_devices, devid,
@@ -6592,6 +6625,14 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
 				return -EIO;
 		}
 
+		if (!test_bit(In_sync, &device->flags) &&
+		    !btrfs_test_opt(fs_info, DEGRADED)) {
+			btrfs_warn(fs_info,
+				   "devid %llu uuid %pU is Not in_sync, but we're not under degraded mode",
+				   devid, dev_uuid);
+			return -EIO;
+		}
+
 		if(!device->bdev && !device->missing) {
 			/*
 			 * this happens when a device that was properly setup
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6108fdf..65b928c 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -69,6 +69,9 @@ struct btrfs_device {
 	/* the mode sent to blkdev_get */
 	fmode_t mode;
 
+	/* bit set of 'enum flag_bits' bit  */
+	unsigned long flags;
+
 	int writeable;
 	int in_fs_metadata;
 	int missing;
@@ -261,6 +264,11 @@ struct btrfs_fs_devices {
 	struct completion kobj_unregister;
 };
 
+enum flag_bits {
+	Faulty,		/* device is known to have a fault */
+	In_sync,	/* device is in_sync with rest of array */
+};
+
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64
 
 /*
-- 
2.9.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/4] Btrfs: fix data corruption in raid6
  2017-11-02  0:54 [PATCH 0/4] Fix raid6 reconstruction bug Liu Bo
  2017-11-02  0:54 ` [PATCH 1/4] Btrfs: introduce device flags Liu Bo
@ 2017-11-02  0:54 ` Liu Bo
  2017-11-07  8:32   ` Anand Jain
  2017-11-02  0:54 ` [PATCH 3/4] Btrfs: make raid1 and raid10 be aware of device flag In_sync Liu Bo
  2017-11-02  0:54 ` [PATCH 4/4] Btrfs: change how we set In_sync Liu Bo
  3 siblings, 1 reply; 16+ messages in thread
From: Liu Bo @ 2017-11-02  0:54 UTC (permalink / raw)
  To: linux-btrfs

With the raid6 profile, btrfs can end up with data corruption via the
following steps.

Say we have 5 disks that are set up with the raid6 profile,

1) mount this btrfs
2) one disk gets pulled out
3) write something to btrfs and sync
4) another disk gets pulled out
5) write something to btrfs and sync
6) umount this btrfs
7) bring the two disks back
8) reboot
9) mount this btrfs

Chances are that the mount will fail (even with -o degraded) because of
failures to read metadata blocks, IOW, our raid6 setup is not able to
protect against two disk failures in some cases.

So it turns out that there is a bug in raid6's recovery code, that is,

if we have one of the stripes in the raid6 layout as shown here,

| D1(*)  | D2(*)  | D3 | D4 | D5    |
-------------------------------------
| data1  | data2  | P  | Q  | data0 |

D1 and D2 are the two disks which got pulled out and brought back.
When mount reads data1 and finds out that data1 doesn't match its crc,
btrfs goes to recover data1 from the other stripes; what it'll be doing is

1) read data2, parity P, parity Q and data0

2) as we have a valid parity P and two data stripes, it goes to
   recover in raid5 style.

(However, since disk D2 was pulled out and data2 on D2 could be stale,
we still get the wrong data1 from this reconstruction.)

3) btrfs continues to try to reconstruct data1 from parity Q, data2
   and data0, and we still get the wrong one for the same reason.

The fix here is to take advantage of the device flag, ie. 'In_sync':
all data on a device might be stale if 'In_sync' has not been set.

With this, we can rebuild the correct data1 from parity P, parity Q
and data0.
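
To make step 2) above concrete, here is a stand-alone sketch (single-byte
"stripes" and made-up values, not btrfs code) of how a raid5-style
rebuild that trusts the stale data2 silently produces a wrong data1:

#include <stdio.h>

int main(void)
{
	unsigned char data0 = 0x11, data1 = 0x22, data2 = 0x33;
	/* parity P computed back when all three data stripes were valid */
	unsigned char p = data0 ^ data1 ^ data2;
	/* what D2 actually holds after being pulled out, i.e. stale data2 */
	unsigned char stale2 = 0x44;

	/* raid5-style rebuild of data1 from P, data2 (stale!) and data0 */
	unsigned char rebuilt1 = p ^ stale2 ^ data0;

	printf("expected data1 0x%02x, rebuilt data1 0x%02x\n",
	       data1, rebuilt1);
	return 0;
}

The rebuilt value only matches when the data2 read from disk is the one
the parity was computed with, which is exactly what the In_sync flag
lets us rule out for a device that was unplugged.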

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/raid56.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 67262f8..3c0ce61 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1076,7 +1076,8 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 	disk_start = stripe->physical + (page_index << PAGE_SHIFT);
 
 	/* if the device is missing, just fail this stripe */
-	if (!stripe->dev->bdev)
+	if (!stripe->dev->bdev ||
+	    !test_bit(In_sync, &stripe->dev->flags))
 		return fail_rbio_index(rbio, stripe_nr);
 
 	/* see if we can add this page onto our existing bio */
-- 
2.9.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/4] Btrfs: make raid1 and raid10 be aware of device flag In_sync
  2017-11-02  0:54 [PATCH 0/4] Fix raid6 reconstruction bug Liu Bo
  2017-11-02  0:54 ` [PATCH 1/4] Btrfs: introduce device flags Liu Bo
  2017-11-02  0:54 ` [PATCH 2/4] Btrfs: fix data corruption in raid6 Liu Bo
@ 2017-11-02  0:54 ` Liu Bo
  2017-11-02  0:54 ` [PATCH 4/4] Btrfs: change how we set In_sync Liu Bo
  3 siblings, 0 replies; 16+ messages in thread
From: Liu Bo @ 2017-11-02  0:54 UTC (permalink / raw)
  To: linux-btrfs

If a device is not In_sync, avoid reading data from it, as the data on
it might be stale.

Although checksums can detect stale data, so we won't return stale data
to users, this helps us read the good copy directly.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/volumes.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7b29b1a..b83b7c8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5175,6 +5175,7 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	int i;
 	int tolerance;
 	struct btrfs_device *srcdev;
+	struct btrfs_device *curr;
 
 	if (dev_replace_is_ongoing &&
 	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
@@ -5184,17 +5185,25 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 		srcdev = NULL;
 
 	/*
-	 * try to avoid the drive that is the source drive for a
-	 * dev-replace procedure, only choose it if no other non-missing
-	 * mirror is available
+	 * try to avoid the drive that is
+	 *
+	 * a) the source drive for a dev-replace procedure,
+	 * b) not in_sync
+	 *
+	 * only choose it if no other non-missing mirror is available.
 	 */
 	for (tolerance = 0; tolerance < 2; tolerance++) {
-		if (map->stripes[optimal].dev->bdev &&
-		    (tolerance || map->stripes[optimal].dev != srcdev))
+		curr = map->stripes[optimal].dev;
+		if (curr->bdev &&
+		    test_bit(In_sync, &curr->flags) &&
+		    (tolerance || curr != srcdev))
 			return optimal;
+
 		for (i = first; i < first + num; i++) {
-			if (map->stripes[i].dev->bdev &&
-			    (tolerance || map->stripes[i].dev != srcdev))
+			curr = map->stripes[i].dev;
+			if (curr->bdev &&
+			    test_bit(In_sync, &curr->flags) &&
+			    (tolerance || curr != srcdev))
 				return i;
 		}
 	}
-- 
2.9.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/4] Btrfs: change how we set In_sync
  2017-11-02  0:54 [PATCH 0/4] Fix raid6 reconstruction bug Liu Bo
                   ` (2 preceding siblings ...)
  2017-11-02  0:54 ` [PATCH 3/4] Btrfs: make raid1 and raid10 be aware of device flag In_sync Liu Bo
@ 2017-11-02  0:54 ` Liu Bo
  3 siblings, 0 replies; 16+ messages in thread
From: Liu Bo @ 2017-11-02  0:54 UTC (permalink / raw)
  To: linux-btrfs

The current method can sometimes report the In_sync status wrongly: if
the out-of-sync device happens to be the first one we check, it is the
only one at that point and therefore gets set In_sync; later, when a
newer device is found, we change it to !In_sync, but that is not
reported in the kernel log.

This changes it to set the In_sync flag after iterating over all
devices; at that point we have the latest device, so we can report
out-of-sync devices precisely.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/volumes.c | 38 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b83b7c8..43f03a6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1010,26 +1010,9 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 			device->generation =
 				btrfs_stack_device_generation(&disk_super->dev_item);
 
-		/* no lock is required during the initial stage. */
-		if (!latest_dev) {
-			set_bit(In_sync, &device->flags);
-			latest_dev = device;
-		} else {
-			if (device->generation > latest_dev->generation) {
-				set_bit(In_sync, &device->flags);
-				clear_bit(In_sync, &latest_dev->flags);
-				latest_dev = device;
-			} else if (device->generation == latest_dev->generation) {
-				set_bit(In_sync, &device->flags);
-			}
-			/*
-			 * if (device->generation < latest_dev->generation)
-			 *	# don't set In_sync
-			 */
-		}
-
-		if (!test_bit(In_sync, &device->flags))
-			pr_info("dev %s gen %llu is not In_sync\n", device->name->str, device->generation);
+		if (!latest_dev ||
+                    device->generation > latest_dev->generation)
+                        latest_dev = device;
 
 		if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING) {
 			device->writeable = 0;
@@ -1063,6 +1046,21 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 		blkdev_put(bdev, flags);
 		continue;
 	}
+
+	/* set In_sync flag */
+	list_for_each_entry(device, head, dev_list) {
+		ASSERT(device->generation <= latest_dev->generation);
+		if (device->generation == latest_dev->generation) {
+			set_bit(In_sync, &device->flags);
+			pr_debug("btrfs: dev %s gen %llu is In_sync\n",
+				 device->name->str, device->generation);
+		} else {
+			clear_bit(In_sync, &device->flags);
+			pr_info("btrfs: dev %s gen %llu is not In_sync\n",
+				device->name->str, device->generation);
+		}
+	}
+
 	if (fs_devices->open_devices == 0) {
 		ret = -EINVAL;
 		goto out;
-- 
2.9.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/4] Btrfs: introduce device flags
  2017-11-02  0:54 ` [PATCH 1/4] Btrfs: introduce device flags Liu Bo
@ 2017-11-06 16:40   ` David Sterba
  2017-11-08 19:46     ` Liu Bo
  2017-11-07  9:30   ` Anand Jain
  1 sibling, 1 reply; 16+ messages in thread
From: David Sterba @ 2017-11-06 16:40 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs

On Wed, Nov 01, 2017 at 06:54:02PM -0600, Liu Bo wrote:
> Here we have defined two flags,
> - Fautly
> - In_sync
> 
> Currently only In_sync is in use, it only matters when device serves
> as part of a raid profile.  The flag In_sync is decided when mounting
> a btrfs and opening a device, by default every device is set with
> In_sync, but would not be set if its generation was out of date.
> 
> If the device doesn't get resync and to avoid overriding its
> generation during writing superblock, if the !In_sync device is still
> writeable, its last valid generation will be recorded in
> superblock.dev_item.generation instead of superblock.generation, such
> that the last valid generation can be retained through several reboot,
> btrfs is able to detect the out-of-sync device.
> 
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
>  fs/btrfs/disk-io.c | 21 ++++++++++++++++++++-
>  fs/btrfs/volumes.c | 47 ++++++++++++++++++++++++++++++++++++++++++++---
>  fs/btrfs/volumes.h |  8 ++++++++
>  3 files changed, 72 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 487bbe4..a080d58 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3640,6 +3640,7 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
>  	int do_barriers;
>  	int max_errors;
>  	int total_errors = 0;
> +	u64 generation;
>  	u64 flags;
>  
>  	do_barriers = !btrfs_test_opt(fs_info, NOBARRIER);
> @@ -3671,7 +3672,25 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
>  		if (!dev->in_fs_metadata || !dev->writeable)
>  			continue;
>  
> -		btrfs_set_stack_device_generation(dev_item, 0);
> +		/*
> +		 * if dev is not in_sync, record the last valid
> +		 * generation.
> +		 *
> +		 * This is used to detect !In_sync device after
> +		 * reboot, now we check a device's In_sync status by
> +		 * comparing its superblock->dev_item->generation with
> +		 * the latest one.
> +		 *
> +		 * We don't compare superblock->generation because
> +		 * after reboot and a successfuly superblock writing,
> +		 * generation in superblock will be updated.
> +		 */

Please format the comments to 80 columns, and use a capital first
letter if it's a sentence.

> +		if (!test_bit(In_sync, &dev->flags))
> +			generation = dev->generation;
> +		else
> +			generation = 0;
> +		btrfs_set_stack_device_generation(dev_item, generation);

Can generation == 0 possibly confuse older kernels that will not have
this patch?

> +
>  		btrfs_set_stack_device_type(dev_item, dev->type);
>  		btrfs_set_stack_device_id(dev_item, dev->devid);
>  		btrfs_set_stack_device_total_bytes(dev_item,
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 0e8f16c..7b29b1a 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -250,6 +250,7 @@ static struct btrfs_device *__alloc_device(void)
>  	btrfs_device_data_ordered_init(dev);
>  	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
>  	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
> +	ASSERT(dev->flags == 0);

I think you can drop this; the whole structure is zeroed a few lines
above.

>  
>  	return dev;
>  }
> @@ -1003,10 +1004,32 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
>  			   BTRFS_UUID_SIZE))
>  			goto error_brelse;
>  
> -		device->generation = btrfs_super_generation(disk_super);
> -		if (!latest_dev ||
> -		    device->generation > latest_dev->generation)
> +		if (btrfs_stack_device_generation(&disk_super->dev_item) == 0)
> +			device->generation = btrfs_super_generation(disk_super);
> +		else
> +			device->generation =
> +				btrfs_stack_device_generation(&disk_super->dev_item);
> +
> +		/* no lock is required during the initial stage. */
> +		if (!latest_dev) {
> +			set_bit(In_sync, &device->flags);
>  			latest_dev = device;
> +		} else {
> +			if (device->generation > latest_dev->generation) {
> +				set_bit(In_sync, &device->flags);
> +				clear_bit(In_sync, &latest_dev->flags);
> +				latest_dev = device;
> +			} else if (device->generation == latest_dev->generation) {
> +				set_bit(In_sync, &device->flags);
> +			}
> +			/*
> +			 * if (device->generation < latest_dev->generation)
> +			 *	# don't set In_sync
> +			 */
> +		}
> +
> +		if (!test_bit(In_sync, &device->flags))
> +			pr_info("dev %s gen %llu is not In_sync\n", device->name->str, device->generation);
>  
>  		if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING) {
>  			device->writeable = 0;
> @@ -2379,6 +2402,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	q = bdev_get_queue(bdev);
>  	if (blk_queue_discard(q))
>  		device->can_discard = 1;
> +	set_bit(In_sync, &device->flags);
>  	device->writeable = 1;
>  	device->generation = trans->transid;
>  	device->io_width = fs_info->sectorsize;
> @@ -2599,6 +2623,7 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>  	device->dev_stats_valid = 1;
>  	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
>  	device->fs_devices = fs_info->fs_devices;
> +	set_bit(In_sync, &device->flags);
>  	list_add(&device->dev_list, &fs_info->fs_devices->devices);
>  	fs_info->fs_devices->num_devices++;
>  	fs_info->fs_devices->open_devices++;
> @@ -6450,6 +6475,14 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>  			btrfs_report_missing_device(fs_info, devid, uuid);
>  			return -EIO;
>  		}
> +		if (map->stripes[i].dev &&
> +		    !test_bit(In_sync, &map->stripes[i].dev->flags) &&
> +		    !btrfs_test_opt(fs_info, DEGRADED)) {
> +			btrfs_warn(fs_info, "devid %llu uuid %pU is Not in_sync, but we're not under degraded mode",

Please move the string to the next line and un-indent it until it fits
an 80-column line.

> +				   devid, uuid);
> +			free_extent_map(em);
> +			return -EIO;
> +		}
>  		if (!map->stripes[i].dev) {
>  			map->stripes[i].dev =
>  				add_missing_dev(fs_info->fs_devices, devid,
> @@ -6592,6 +6625,14 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
>  				return -EIO;
>  		}
>  
> +		if (!test_bit(In_sync, &device->flags) &&
> +		    !btrfs_test_opt(fs_info, DEGRADED)) {
> +			btrfs_warn(fs_info,
> +				   "devid %llu uuid %pU is Not in_sync, but we're not under degraded mode",
> +				   devid, dev_uuid);
> +			return -EIO;
> +		}
> +
>  		if(!device->bdev && !device->missing) {
>  			/*
>  			 * this happens when a device that was properly setup
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 6108fdf..65b928c 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -69,6 +69,9 @@ struct btrfs_device {
>  	/* the mode sent to blkdev_get */
>  	fmode_t mode;
>  
> +	/* bit set of 'enum flag_bits' bit  */
> +	unsigned long flags;
> +
>  	int writeable;
>  	int in_fs_metadata;
>  	int missing;

It would be good if you first made a generic member 'flags', converted
writeable/in_fs_metadata/missing/... to that, and added the sync flags
later.

> @@ -261,6 +264,11 @@ struct btrfs_fs_devices {
>  	struct completion kobj_unregister;
>  };
>  
> +enum flag_bits {

This name is too generic.

> +	Faulty,		/* device is known to have a fault */
> +	In_sync,	/* device is in_sync with rest of array */

Enums are usually all caps, and with some prefix.

> +};

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/4] Btrfs: fix data corruption in raid6
  2017-11-02  0:54 ` [PATCH 2/4] Btrfs: fix data corruption in raid6 Liu Bo
@ 2017-11-07  8:32   ` Anand Jain
  2017-11-08 19:53     ` Liu Bo
  0 siblings, 1 reply; 16+ messages in thread
From: Anand Jain @ 2017-11-07  8:32 UTC (permalink / raw)
  To: Liu Bo, linux-btrfs



On 11/02/2017 08:54 AM, Liu Bo wrote:
> With raid6 profile, btrfs can end up with data corruption by the
> following steps.
> 
> Say we have a 5 disks that are set up with raid6 profile,
> 
> 1) mount this btrfs
> 2) one disk gets pulled out
> 3) write something to btrfs and sync
> 4) another disk gets pulled out
> 5) write something to btrfs and sync
> 6) umount this btrfs
> 7) bring the two disk back
> 8) reboot
> 9) mount this btrfs
> 
> Chances are mount will fail (even with -odegraded) because of failure
> on reading metadata blocks, IOW, our raid6 setup is not able to
> protect two disk failures in some cases.
> 
> So it turns out that there is a bug in raid6's recover code, that is,
> 
> if we have one of stripes in the raid6 layout as shown here,
> 
> | D1(*)  | D2(*)  | D3 | D4 | D5    |
> -------------------------------------
> | data1  | data2  | P  | Q  | data0 |


> D1 and D2 are the two disks which got pulled out and brought back.
> When mount reads data1 and finds out that data1 doesn't match its crc,
> btrfs goes to recover data1 from other stripes, what it'll be doing is
> 
> 1) read data2, parity P, parity Q and data0
>
> 2) as we have a valid parity P and two data stripes, it goes to
>     recover in raid5 style.


> (However, since disk D2 was pulled out and data2 on D2 could be stale,

  data2 should end up with a crc error; if not, then raid5 recovery is
  as expected, OR this example is confusing for explaining the context
  of two missing data stripes.

Thanks, Anand


> we still get the wrong data1 from this reconstruction.)
> 
> 3) btrfs continues to try to reconstruct data1 from parity Q, data2
>     and data0, we still get the wrong one for the same reason.
> 
> The fix here is to take advantage of the device flag, ie. 'In_sync',
> all data on a device might be stale if 'In_sync' has not been set.
 >
> With this, we can build the correct data2 from parity P, Q and data1.
> 
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
>   fs/btrfs/raid56.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index 67262f8..3c0ce61 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -1076,7 +1076,8 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
>   	disk_start = stripe->physical + (page_index << PAGE_SHIFT);
>   
>   	/* if the device is missing, just fail this stripe */
> -	if (!stripe->dev->bdev)
> +	if (!stripe->dev->bdev ||
> +	    !test_bit(In_sync, &stripe->dev->flags))
>   		return fail_rbio_index(rbio, stripe_nr);
>   
>   	/* see if we can add this page onto our existing bio */
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/4] Btrfs: introduce device flags
  2017-11-02  0:54 ` [PATCH 1/4] Btrfs: introduce device flags Liu Bo
  2017-11-06 16:40   ` David Sterba
@ 2017-11-07  9:30   ` Anand Jain
  1 sibling, 0 replies; 16+ messages in thread
From: Anand Jain @ 2017-11-07  9:30 UTC (permalink / raw)
  To: Liu Bo, linux-btrfs



On 11/02/2017 08:54 AM, Liu Bo wrote:
> Here we have defined two flags,
> - Fautly
> - In_sync
> 
> Currently only In_sync is in use, it only matters when device serves
> as part of a raid profile.  The flag In_sync is decided when mounting
> a btrfs and opening a device, by default every device is set with
> In_sync, but would not be set if its generation was out of date.
> 
> If the device doesn't get resync and to avoid overriding its
> generation during writing superblock, if the !In_sync device is still
> writeable, its last valid generation will be recorded in
> superblock.dev_item.generation instead of superblock.generation, such
> that the last valid generation can be retained through several reboot,
> btrfs is able to detect the out-of-sync device.

  Based on the bug you are trying to address in patch 2/4:
  - I still don't get the need for an explicit In_sync flag
  for the whole disk instead of just depending on the stripe,
  whose crc will fail anyway.
  - The whole disk isn't the problem here.  Only the stripes which
  were missed on the missing disk are the problem.

> @@ -6592,6 +6625,14 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
>   				return -EIO;
>   		}
>   
> +		if (!test_bit(In_sync, &device->flags) &&
> +		    !btrfs_test_opt(fs_info, DEGRADED)) {
> +			btrfs_warn(fs_info,
> +				   "devid %llu uuid %pU is Not in_sync, but we're not under degraded mode",
> +				   devid, dev_uuid);
> +			return -EIO;
> +		}
> +

  The degraded option is for a missing device when the user still wants
  to continue; here it's about a device with the !In_sync flag, so the
  degraded option check is not required.

  Similarly, the test case should be updated as well.

Thanks, Anand

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/4] Btrfs: introduce device flags
  2017-11-06 16:40   ` David Sterba
@ 2017-11-08 19:46     ` Liu Bo
  2017-11-13 17:13       ` David Sterba
  0 siblings, 1 reply; 16+ messages in thread
From: Liu Bo @ 2017-11-08 19:46 UTC (permalink / raw)
  To: dsterba, linux-btrfs

On Mon, Nov 06, 2017 at 05:40:25PM +0100, David Sterba wrote:
> On Wed, Nov 01, 2017 at 06:54:02PM -0600, Liu Bo wrote:
> > Here we have defined two flags,
> > - Fautly
> > - In_sync
> > 
> > Currently only In_sync is in use, it only matters when device serves
> > as part of a raid profile.  The flag In_sync is decided when mounting
> > a btrfs and opening a device, by default every device is set with
> > In_sync, but would not be set if its generation was out of date.
> > 
> > If the device doesn't get resync and to avoid overriding its
> > generation during writing superblock, if the !In_sync device is still
> > writeable, its last valid generation will be recorded in
> > superblock.dev_item.generation instead of superblock.generation, such
> > that the last valid generation can be retained through several reboot,
> > btrfs is able to detect the out-of-sync device.
> > 
> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > ---
> >  fs/btrfs/disk-io.c | 21 ++++++++++++++++++++-
> >  fs/btrfs/volumes.c | 47 ++++++++++++++++++++++++++++++++++++++++++++---
> >  fs/btrfs/volumes.h |  8 ++++++++
> >  3 files changed, 72 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 487bbe4..a080d58 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -3640,6 +3640,7 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
> >  	int do_barriers;
> >  	int max_errors;
> >  	int total_errors = 0;
> > +	u64 generation;
> >  	u64 flags;
> >  
> >  	do_barriers = !btrfs_test_opt(fs_info, NOBARRIER);
> > @@ -3671,7 +3672,25 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
> >  		if (!dev->in_fs_metadata || !dev->writeable)
> >  			continue;
> >  
> > -		btrfs_set_stack_device_generation(dev_item, 0);
> > +		/*
> > +		 * if dev is not in_sync, record the last valid
> > +		 * generation.
> > +		 *
> > +		 * This is used to detect !In_sync device after
> > +		 * reboot, now we check a device's In_sync status by
> > +		 * comparing its superblock->dev_item->generation with
> > +		 * the latest one.
> > +		 *
> > +		 * We don't compare superblock->generation because
> > +		 * after reboot and a successfuly superblock writing,
> > +		 * generation in superblock will be updated.
> > +		 */
> 
> Please format the comments to 80 columns, and use first capital letter
> if it's a stentence.

OK.

> 
> > +		if (!test_bit(In_sync, &dev->flags))
> > +			generation = dev->generation;
> > +		else
> > +			generation = 0;
> > +		btrfs_set_stack_device_generation(dev_item, generation);
> 
> Can generation == 0 possibly confuse older kernels that will not have
> this patch?
>

It is safe; the default value of dev_item.generation is always 0 in
old kernels, and old kernels don't check it at all.

> > +
> >  		btrfs_set_stack_device_type(dev_item, dev->type);
> >  		btrfs_set_stack_device_id(dev_item, dev->devid);
> >  		btrfs_set_stack_device_total_bytes(dev_item,
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 0e8f16c..7b29b1a 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -250,6 +250,7 @@ static struct btrfs_device *__alloc_device(void)
> >  	btrfs_device_data_ordered_init(dev);
> >  	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
> >  	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
> > +	ASSERT(dev->flags == 0);
> 
> I think you can drop this,  the whole structure is zeroed a few lines
> above.
>

Makes sense.

> >  
> >  	return dev;
> >  }
> > @@ -1003,10 +1004,32 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
> >  			   BTRFS_UUID_SIZE))
> >  			goto error_brelse;
> >  
> > -		device->generation = btrfs_super_generation(disk_super);
> > -		if (!latest_dev ||
> > -		    device->generation > latest_dev->generation)
> > +		if (btrfs_stack_device_generation(&disk_super->dev_item) == 0)
> > +			device->generation = btrfs_super_generation(disk_super);
> > +		else
> > +			device->generation =
> > +				btrfs_stack_device_generation(&disk_super->dev_item);
> > +
> > +		/* no lock is required during the initial stage. */
> > +		if (!latest_dev) {
> > +			set_bit(In_sync, &device->flags);
> >  			latest_dev = device;
> > +		} else {
> > +			if (device->generation > latest_dev->generation) {
> > +				set_bit(In_sync, &device->flags);
> > +				clear_bit(In_sync, &latest_dev->flags);
> > +				latest_dev = device;
> > +			} else if (device->generation == latest_dev->generation) {
> > +				set_bit(In_sync, &device->flags);
> > +			}
> > +			/*
> > +			 * if (device->generation < latest_dev->generation)
> > +			 *	# don't set In_sync
> > +			 */
> > +		}
> > +
> > +		if (!test_bit(In_sync, &device->flags))
> > +			pr_info("dev %s gen %llu is not In_sync\n", device->name->str, device->generation);
> >  
> >  		if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING) {
> >  			device->writeable = 0;
> > @@ -2379,6 +2402,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> >  	q = bdev_get_queue(bdev);
> >  	if (blk_queue_discard(q))
> >  		device->can_discard = 1;
> > +	set_bit(In_sync, &device->flags);
> >  	device->writeable = 1;
> >  	device->generation = trans->transid;
> >  	device->io_width = fs_info->sectorsize;
> > @@ -2599,6 +2623,7 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
> >  	device->dev_stats_valid = 1;
> >  	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
> >  	device->fs_devices = fs_info->fs_devices;
> > +	set_bit(In_sync, &device->flags);
> >  	list_add(&device->dev_list, &fs_info->fs_devices->devices);
> >  	fs_info->fs_devices->num_devices++;
> >  	fs_info->fs_devices->open_devices++;
> > @@ -6450,6 +6475,14 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
> >  			btrfs_report_missing_device(fs_info, devid, uuid);
> >  			return -EIO;
> >  		}
> > +		if (map->stripes[i].dev &&
> > +		    !test_bit(In_sync, &map->stripes[i].dev->flags) &&
> > +		    !btrfs_test_opt(fs_info, DEGRADED)) {
> > +			btrfs_warn(fs_info, "devid %llu uuid %pU is Not in_sync, but we're not under degraded mode",
> 
> Please move the string to the next line and un-indent until it fits 80
> columns line.

OK.

> 
> > +				   devid, uuid);
> > +			free_extent_map(em);
> > +			return -EIO;
> > +		}
> >  		if (!map->stripes[i].dev) {
> >  			map->stripes[i].dev =
> >  				add_missing_dev(fs_info->fs_devices, devid,
> > @@ -6592,6 +6625,14 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
> >  				return -EIO;
> >  		}
> >  
> > +		if (!test_bit(In_sync, &device->flags) &&
> > +		    !btrfs_test_opt(fs_info, DEGRADED)) {
> > +			btrfs_warn(fs_info,
> > +				   "devid %llu uuid %pU is Not in_sync, but we're not under degraded mode",
> > +				   devid, dev_uuid);
> > +			return -EIO;
> > +		}
> > +
> >  		if(!device->bdev && !device->missing) {
> >  			/*
> >  			 * this happens when a device that was properly setup
> > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> > index 6108fdf..65b928c 100644
> > --- a/fs/btrfs/volumes.h
> > +++ b/fs/btrfs/volumes.h
> > @@ -69,6 +69,9 @@ struct btrfs_device {
> >  	/* the mode sent to blkdev_get */
> >  	fmode_t mode;
> >  
> > +	/* bit set of 'enum flag_bits' bit  */
> > +	unsigned long flags;
> > +
> >  	int writeable;
> >  	int in_fs_metadata;
> >  	int missing;
> 
> It would be good if you first make a generic member 'flags', and convert
> writeable/in_fs_metadata/missing/... to that and add the sync flags
> later.
>

Sounds good.

> > @@ -261,6 +264,11 @@ struct btrfs_fs_devices {
> >  	struct completion kobj_unregister;
> >  };
> >  
> > +enum flag_bits {
> 
> This name is too generic.

Will add a 'btrfs_device' prefix.

> 
> > +	Faulty,		/* device is known to have a fault */
> > +	In_sync,	/* device is in_sync with rest of array */
> 
> Enums usually are all caps, and with some prefix.
>

Well, it's copied from md... I'll make it all caps, but having no
prefix seems better to me.

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/4] Btrfs: fix data corruption in raid6
  2017-11-07  8:32   ` Anand Jain
@ 2017-11-08 19:53     ` Liu Bo
  2017-11-09  9:29       ` Anand Jain
  0 siblings, 1 reply; 16+ messages in thread
From: Liu Bo @ 2017-11-08 19:53 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
> 
> 
> On 11/02/2017 08:54 AM, Liu Bo wrote:
> >With raid6 profile, btrfs can end up with data corruption by the
> >following steps.
> >
> >Say we have a 5 disks that are set up with raid6 profile,
> >
> >1) mount this btrfs
> >2) one disk gets pulled out
> >3) write something to btrfs and sync
> >4) another disk gets pulled out
> >5) write something to btrfs and sync
> >6) umount this btrfs
> >7) bring the two disk back
> >8) reboot
> >9) mount this btrfs
> >
> >Chances are mount will fail (even with -odegraded) because of failure
> >on reading metadata blocks, IOW, our raid6 setup is not able to
> >protect two disk failures in some cases.
> >
> >So it turns out that there is a bug in raid6's recover code, that is,
> >
> >if we have one of stripes in the raid6 layout as shown here,
> >
> >| D1(*)  | D2(*)  | D3 | D4 | D5    |
> >-------------------------------------
> >| data1  | data2  | P  | Q  | data0 |
> 
> 
> >D1 and D2 are the two disks which got pulled out and brought back.
> >When mount reads data1 and finds out that data1 doesn't match its crc,
> >btrfs goes to recover data1 from other stripes, what it'll be doing is
> >
> >1) read data2, parity P, parity Q and data0
> >
> >2) as we have a valid parity P and two data stripes, it goes to
> >    recover in raid5 style.
> 
> 
> >(However, since disk D2 was pulled out and data2 on D2 could be stale,
> 
>  data2 should end up crc error, if not then raid5 recover is as
>  expected OR this example is confusing to explain the context of
>  two data stipe missing.

The assumption you have is not true; when doing reconstruction, we
don't verify the checksum of each data stripe that is read from disk.

Thanks,

-liubo
> 
> Thanks, Anand
> 
> 
> >we still get the wrong data1 from this reconstruction.)
> >
> >3) btrfs continues to try to reconstruct data1 from parity Q, data2
> >    and data0, we still get the wrong one for the same reason.
> >
> >The fix here is to take advantage of the device flag, ie. 'In_sync',
> >all data on a device might be stale if 'In_sync' has not been set.
> >
> >With this, we can build the correct data2 from parity P, Q and data1.
> >
> >Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> >---
> >  fs/btrfs/raid56.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> >diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> >index 67262f8..3c0ce61 100644
> >--- a/fs/btrfs/raid56.c
> >+++ b/fs/btrfs/raid56.c
> >@@ -1076,7 +1076,8 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
> >  	disk_start = stripe->physical + (page_index << PAGE_SHIFT);
> >  	/* if the device is missing, just fail this stripe */
> >-	if (!stripe->dev->bdev)
> >+	if (!stripe->dev->bdev ||
> >+	    !test_bit(In_sync, &stripe->dev->flags))
> >  		return fail_rbio_index(rbio, stripe_nr);
> >  	/* see if we can add this page onto our existing bio */
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/4] Btrfs: fix data corruption in raid6
  2017-11-08 19:53     ` Liu Bo
@ 2017-11-09  9:29       ` Anand Jain
  2017-11-09  9:59         ` Qu Wenruo
  2017-11-10  0:12         ` Liu Bo
  0 siblings, 2 replies; 16+ messages in thread
From: Anand Jain @ 2017-11-09  9:29 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs



On 11/09/2017 03:53 AM, Liu Bo wrote:
> On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
>>
>>
>> On 11/02/2017 08:54 AM, Liu Bo wrote:
>>> With raid6 profile, btrfs can end up with data corruption by the
>>> following steps.
>>>
>>> Say we have a 5 disks that are set up with raid6 profile,
>>>
>>> 1) mount this btrfs
>>> 2) one disk gets pulled out
>>> 3) write something to btrfs and sync
>>> 4) another disk gets pulled out
>>> 5) write something to btrfs and sync
>>> 6) umount this btrfs
>>> 7) bring the two disk back
>>> 8) reboot
>>> 9) mount this btrfs
>>>
>>> Chances are mount will fail (even with -odegraded) because of failure
>>> on reading metadata blocks, IOW, our raid6 setup is not able to
>>> protect two disk failures in some cases.
>>>
>>> So it turns out that there is a bug in raid6's recover code, that is,
>>>
>>> if we have one of stripes in the raid6 layout as shown here,
>>>
>>> | D1(*)  | D2(*)  | D3 | D4 | D5    |
>>> -------------------------------------
>>> | data1  | data2  | P  | Q  | data0 |
>>
>>
>>> D1 and D2 are the two disks which got pulled out and brought back.
>>> When mount reads data1 and finds out that data1 doesn't match its crc,
>>> btrfs goes to recover data1 from other stripes, what it'll be doing is
>>>
>>> 1) read data2, parity P, parity Q and data0
>>>
>>> 2) as we have a valid parity P and two data stripes, it goes to
>>>     recover in raid5 style.
>>
>>
>>> (However, since disk D2 was pulled out and data2 on D2 could be stale,
>>
>>   data2 should end up crc error, if not then raid5 recover is as
>>   expected OR this example is confusing to explain the context of
>>   two data stipe missing.
> 
> The assumption you have is not true, 

> when doing reconstruction, we
> don't verify checksum for each data stripe that is read from disk.

  Why?  What if data0 (which has the In_sync flag) is wrong?

  I am wary about the In_sync approach.  IMO we are losing the
  advantage of the FS being merged with the volume manager - which is
  to know exactly which active stripes were lost, for quicker
  reconstruction.

  Let's say the RAID6 is 50% full when disk1 is pulled out, disk2 is
  pulled out at 75% full, and it reaches 90% full.

  So now, when all the disks are back, two disks will not have the
  In_sync flag?  Hope I understood this correctly.

  Now.
  15% of the data have lost two stripes.
  25% of the data have lost one stripe.
  50% of the data did not lose any stripe at all.

  Now, for read and reconstruct:
  50% of the data does not need any reconstruction.
  25% of the data needs RAID5 style reconstruction as it has lost
  only one stripe.
  15% of the data needs RAID6 reconstruction since it has lost two
  stripes.

  Does the In_sync design here help it work like this?

  Now for RAID1, let's say disk1 lost the first 10% of the stripes,
  disk2 lost the last 10% of the stripes, and these stripes are
  mutually exclusive.  That means there is no single stripe which has
  been lost completely.

  There is no code yet where I can see when In_sync will be reset,
  which is fine for this example; assume no reconstruction is done.
  But now, when both disks are being mounted, since In_sync is based
  on dev_item.generation, only one of them will get In_sync; to me it
  looks like 10% of the stripes will fail even though that's not
  practically true in this example.


Thanks, Anand



> Thanks,
> 
> -liubo
>>
>> Thanks, Anand
>>
>>
>>> we still get the wrong data1 from this reconstruction.)
>>>
>>> 3) btrfs continues to try to reconstruct data1 from parity Q, data2
>>>     and data0, we still get the wrong one for the same reason.
>>>
>>> The fix here is to take advantage of the device flag, ie. 'In_sync',
>>> all data on a device might be stale if 'In_sync' has not been set.
>>>
>>> With this, we can build the correct data2 from parity P, Q and data1.
>>>
>>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
>>> ---
>>>   fs/btrfs/raid56.c | 3 ++-
>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>>> index 67262f8..3c0ce61 100644
>>> --- a/fs/btrfs/raid56.c
>>> +++ b/fs/btrfs/raid56.c
>>> @@ -1076,7 +1076,8 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
>>>   	disk_start = stripe->physical + (page_index << PAGE_SHIFT);
>>>   	/* if the device is missing, just fail this stripe */
>>> -	if (!stripe->dev->bdev)
>>> +	if (!stripe->dev->bdev ||
>>> +	    !test_bit(In_sync, &stripe->dev->flags))
>>>   		return fail_rbio_index(rbio, stripe_nr);
>>>   	/* see if we can add this page onto our existing bio */
>>>
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/4] Btrfs: fix data corruption in raid6
  2017-11-09  9:29       ` Anand Jain
@ 2017-11-09  9:59         ` Qu Wenruo
  2017-11-10  0:12         ` Liu Bo
  1 sibling, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2017-11-09  9:59 UTC (permalink / raw)
  To: Anand Jain, bo.li.liu; +Cc: linux-btrfs





On 2017-11-09 17:29, Anand Jain wrote:
> 
> 
> On 11/09/2017 03:53 AM, Liu Bo wrote:
>> On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
>>>
>>>
>>> On 11/02/2017 08:54 AM, Liu Bo wrote:
>>>> With raid6 profile, btrfs can end up with data corruption by the
>>>> following steps.
>>>>
>>>> Say we have a 5 disks that are set up with raid6 profile,
>>>>
>>>> 1) mount this btrfs
>>>> 2) one disk gets pulled out
>>>> 3) write something to btrfs and sync
>>>> 4) another disk gets pulled out
>>>> 5) write something to btrfs and sync
>>>> 6) umount this btrfs
>>>> 7) bring the two disk back
>>>> 8) reboot
>>>> 9) mount this btrfs
>>>>
>>>> Chances are mount will fail (even with -odegraded) because of failure
>>>> on reading metadata blocks, IOW, our raid6 setup is not able to
>>>> protect two disk failures in some cases.
>>>>
>>>> So it turns out that there is a bug in raid6's recover code, that is,
>>>>
>>>> if we have one of stripes in the raid6 layout as shown here,
>>>>
>>>> | D1(*)  | D2(*)  | D3 | D4 | D5    |
>>>> -------------------------------------
>>>> | data1  | data2  | P  | Q  | data0 |
>>>
>>>
>>>> D1 and D2 are the two disks which got pulled out and brought back.
>>>> When mount reads data1 and finds out that data1 doesn't match its crc,
>>>> btrfs goes to recover data1 from other stripes, what it'll be doing is
>>>>
>>>> 1) read data2, parity P, parity Q and data0
>>>>
>>>> 2) as we have a valid parity P and two data stripes, it goes to
>>>>     recover in raid5 style.
>>>
>>>
>>>> (However, since disk D2 was pulled out and data2 on D2 could be stale,
>>>
>>>   data2 should end up crc error, if not then raid5 recover is as
>>>   expected OR this example is confusing to explain the context of
>>>   two data stipe missing.
>>
>> The assumption you have is not true, 
> 
>> when doing reconstruction, we
>> don't verify checksum for each data stripe that is read from disk.

I have a question: why not verify the data stripe's csum when doing
reconstruction?

I know it can be nasty for metadata, since until we fully rebuild the
stripe we won't be able to check if the reconstruction is correct.
(And if the reconstruction failed, we should try to rebuild the block
using other, untried methods.)

But for data (with csum), I think it's possible to verify the csum at
reconstruction time, and that would allow btrfs to have a better
understanding of which repair method should be applied.

So at least for data with csum (the easiest case), reconstruction for
the case of D1 corruption should be done as follows (a rough sketch in
code follows the steps):

| D1(*)  | D2(*)  | D3 | D4 | D5    |
-------------------------------------
| data1  | data2  | P  | Q  | data0 |

0) D1 csum mismatch
   Go to the repair procedure
1) Read out the csums for data0~data2
2) Read out data0~data2, P, Q
3) Verify data0~data2
   Both data1 and data2 fail the check
4) Repair using P, Q and data0
5) Recheck the csums
   If data1/2 match their csums, the repair went well.
   If data1/2 mismatch their csums, return EIO.
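
In other words, something along these lines (a stand-alone sketch of
the flow above; csum_matches() is an illustrative stub, not an existing
btrfs helper):

#include <stdio.h>

/* in the example layout, only data0 still matches its csum */
static int csum_matches(int stripe) { return stripe == 0; }

int main(void)
{
	int nr_data = 3, nr_bad = 0, i;

	/* steps 1)-3): read the csums and all data stripes, then verify */
	for (i = 0; i < nr_data; i++)
		if (!csum_matches(i))
			nr_bad++;

	/* step 4): pick the repair strategy from what actually failed */
	if (nr_bad == 0)
		puts("nothing to repair");
	else if (nr_bad == 1)
		puts("rebuild raid5 style from P and the good data stripes");
	else if (nr_bad == 2)
		puts("rebuild raid6 style from P, Q and data0, then recheck csums");
	else
		puts("more than two bad stripes: return EIO");

	return 0;
}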

Thanks,
Qu

> 
>  Why ? what if data0 (which has InSync flag) was wrong ?
> 
>  I am wary about the In_sync approach. IMO we are loosing the
>  advantage of FS being merged with volume manager - which is to
>  know exactly which active stripes were lost for quicker reconstruct.
> 
>  Lets say RAID6 is 50% full and disk1 is pulled out, and at 75%
>  full disk2 is pulled out and reaches 90% full.
> 
>  So now when all the disks are back, two disks will not have
>  In_sync flag? hope I understood this correctly.
> 
>  Now.
>  15% of the data have lost two stripes.
>  25% of the data have lost one stripe.
>  50% of the data did not loose any stripe at all.
> 
>  Now for read and reconstruct..
>  50% of the data does not need any reconstruction.
>  25% of the data needs RAID5 style reconstruction as they have lost only
> one stripe.
>  15% of the data needs RAID6 reconstruction since they have lost two
> stripes.
> 
>  Does InSync design here help to work like this ?
> 
>  Now for RAID1, lets say disk1 lost first 10% of the stripes and
>  disk2 lost the last 10% of the stripes and these stripes are mutually
>  exclusive. That means there is no single stripe which has lost
>  completely.
> 
>  There is no code yet where I can see when the In_sync will be reset,
>  which is fine for this eg. Assume no reconstruction is done. But now
>  when both the disks are being mounted and as the In_sync will be based
>  on dev_item.generation, one of it will get In_sync to me it looks like
>  there will be 10% of stripes which will fail even though its not
>  practically true in this example.
> 
> 
> Thanks, Anand
> 
> 
> 
>> Thanks,
>>
>> -liubo
>>>
>>> Thanks, Anand
>>>
>>>
>>>> we still get the wrong data1 from this reconstruction.)
>>>>
>>>> 3) btrfs continues to try to reconstruct data1 from parity Q, data2
>>>>     and data0, we still get the wrong one for the same reason.
>>>>
>>>> The fix here is to take advantage of the device flag, ie. 'In_sync',
>>>> all data on a device might be stale if 'In_sync' has not been set.
>>>>
>>>> With this, we can build the correct data2 from parity P, Q and data1.
>>>>
>>>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
>>>> ---
>>>>   fs/btrfs/raid56.c | 3 ++-
>>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>>>> index 67262f8..3c0ce61 100644
>>>> --- a/fs/btrfs/raid56.c
>>>> +++ b/fs/btrfs/raid56.c
>>>> @@ -1076,7 +1076,8 @@ static int rbio_add_io_page(struct
>>>> btrfs_raid_bio *rbio,
>>>>       disk_start = stripe->physical + (page_index << PAGE_SHIFT);
>>>>       /* if the device is missing, just fail this stripe */
>>>> -    if (!stripe->dev->bdev)
>>>> +    if (!stripe->dev->bdev ||
>>>> +        !test_bit(In_sync, &stripe->dev->flags))
>>>>           return fail_rbio_index(rbio, stripe_nr);
>>>>       /* see if we can add this page onto our existing bio */
>>>>



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/4] Btrfs: fix data corruption in raid6
  2017-11-09  9:29       ` Anand Jain
  2017-11-09  9:59         ` Qu Wenruo
@ 2017-11-10  0:12         ` Liu Bo
  2017-11-10  0:54           ` Qu Wenruo
  1 sibling, 1 reply; 16+ messages in thread
From: Liu Bo @ 2017-11-10  0:12 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, Qu Wenruo

On Thu, Nov 09, 2017 at 05:29:25PM +0800, Anand Jain wrote:
> 
> 
> On 11/09/2017 03:53 AM, Liu Bo wrote:
> >On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
> >>
> >>
> >>On 11/02/2017 08:54 AM, Liu Bo wrote:
> >>>With raid6 profile, btrfs can end up with data corruption by the
> >>>following steps.
> >>>
> >>>Say we have a 5 disks that are set up with raid6 profile,
> >>>
> >>>1) mount this btrfs
> >>>2) one disk gets pulled out
> >>>3) write something to btrfs and sync
> >>>4) another disk gets pulled out
> >>>5) write something to btrfs and sync
> >>>6) umount this btrfs
> >>>7) bring the two disk back
> >>>8) reboot
> >>>9) mount this btrfs
> >>>
> >>>Chances are mount will fail (even with -odegraded) because of failure
> >>>on reading metadata blocks, IOW, our raid6 setup is not able to
> >>>protect two disk failures in some cases.
> >>>
> >>>So it turns out that there is a bug in raid6's recover code, that is,
> >>>
> >>>if we have one of stripes in the raid6 layout as shown here,
> >>>
> >>>| D1(*)  | D2(*)  | D3 | D4 | D5    |
> >>>-------------------------------------
> >>>| data1  | data2  | P  | Q  | data0 |
> >>
> >>
> >>>D1 and D2 are the two disks which got pulled out and brought back.
> >>>When mount reads data1 and finds out that data1 doesn't match its crc,
> >>>btrfs goes to recover data1 from other stripes, what it'll be doing is
> >>>
> >>>1) read data2, parity P, parity Q and data0
> >>>
> >>>2) as we have a valid parity P and two data stripes, it goes to
> >>>    recover in raid5 style.
> >>
> >>
> >>>(However, since disk D2 was pulled out and data2 on D2 could be stale,
> >>
> >>  data2 should end up crc error, if not then raid5 recover is as
> >>  expected OR this example is confusing to explain the context of
> >>  two data stipe missing.
> >
> >The assumption you have is not true,
> 
> >when doing reconstruction, we
> >don't verify checksum for each data stripe that is read from disk.
> 
>  Why ? what if data0 (which has InSync flag) was wrong ?
> 

(Writing these took me the whole afternoon; hopefully they make
sense...)

Checksum verification happens much later, in readpage's end io, and if
data0 is wrong, then the rebuilt data wouldn't match its checksum.

As Qu Wenruo has the same question, here is my input on why checksum
verification is not done:

a) We may have multiple different raid profiles within one btrfs, and
   the checksum tree also resides somewhere on the disks; for
   raid1 (the default) and raid5, two disk failures may end up with
   data loss, and that applies to raid6 as well because of the bug
   this patch is addressing.

b) If this is a metadata stripe, then it's likely that the stale
   content still matches its checksum, as metadata blocks store
   their checksum in the header part, thus checksum verification
   cannot be trusted.

c) If this is a data stripe, then chances are it belongs to
   nocow/nodatasum inodes such that no checksum can be verified.  In
   case a checksum is available, yes, we can finally verify it, but
   keep in mind that the whole data stripe may consist of several
   pieces of data from different inodes, some of which may be
   nodatacow/nodatasum, some not.

Given all the troubles it could lead to, I'd like to avoid it.

As a raid6 system, 2 disks failures can always be tolerated by
rebuilding data from other good copies.

Now that we know two things, i.e.
a) data1 fails its checksum verification and
b) data2 from an out-of-sync disk might be stale,
it's straightforward to rebuild data1 from the other good copies.
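
To make that concrete, here is a minimal user-space sketch of the rule
being described: a stripe whose backing device lacks In_sync is treated
exactly like a failed stripe when picking rebuild sources.  The names
(struct stripe_view and friends) are made up; this is not btrfs code.

#include <stdbool.h>
#include <stdio.h>

/* Simplified view of one member of a raid6 stripe-set. */
struct stripe_view {
    bool csum_failed;   /* read back, but failed checksum verification */
    bool dev_in_sync;   /* backing device carries the In_sync flag */
};

/* Count stripes we must not use as a rebuild source. */
static int count_untrusted(const struct stripe_view *s, int n)
{
    int i, bad = 0;

    for (i = 0; i < n; i++)
        if (s[i].csum_failed || !s[i].dev_in_sync)
            bad++;
    return bad;
}

int main(void)
{
    /* The example above: data1, data2, P, Q, data0 on D1..D5. */
    struct stripe_view stripe[5] = {
        { .csum_failed = true,  .dev_in_sync = false }, /* data1 */
        { .csum_failed = false, .dev_in_sync = false }, /* data2 */
        { .csum_failed = false, .dev_in_sync = true  }, /* P     */
        { .csum_failed = false, .dev_in_sync = true  }, /* Q     */
        { .csum_failed = false, .dev_in_sync = true  }, /* data0 */
    };
    int bad = count_untrusted(stripe, 5);

    if (bad <= 2)
        printf("rebuild with raid6 math, %d stripe(s) excluded\n", bad);
    else
        printf("more than 2 untrusted stripes, cannot rebuild\n");
    return 0;
}

With data1 failing its csum and data2 sitting on a !In_sync device, two
stripes are excluded, so the rebuild uses the full raid6 math rather
than the raid5 shortcut.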


>  I am wary about the In_sync approach. IMO we are losing the
>  advantage of FS being merged with volume manager - which is to
>  know exactly which active stripes were lost for quicker reconstruct.
>

Thanks for pointing this out.  I'd say the filesystem still has some
advantages over a vanilla volume manager while resyncing out-of-sync
disks.

>  Lets say RAID6 is 50% full and disk1 is pulled out, and at 75%
>  full disk2 is pulled out and reaches 90% full.
> 
>  So now when all the disks are back, two disks will not have
>  In_sync flag? hope I understood this correctly.
> 
>  Now.
>  15% of the data have lost two stripes.
>  25% of the data have lost one stripe.
>  50% of the data did not lose any stripe at all.
> 
>  Now for read and reconstruct..
>  50% of the data does not need any reconstruction.

A 'yes' for this.  Actually I should have avoided reading from
out-of-sync disks, but I forgot to do so.  So as a side effect, if the
data read from an out-of-sync disk matches its checksum, then it's
proven to be OK.

>  25% of the data needs RAID5 style reconstruction as they have lost only one
> stripe.
>  15% of the data needs RAID6 reconstruction since they have lost two
> stripes.
>

It depends; the stripes on the out-of-sync disks could be
a) both data stripes (as in the example in this patch)
b) one data stripe and the P stripe
c) one data stripe and the Q stripe

with the In_sync flag,
for a) raid6 reconstruction,
for b) raid6 reconstruction,
for c) raid5 reconstruction.
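
For illustration, here is a tiny user-space sketch of that decision
(the names are made up; none of this is btrfs code):

#include <stdio.h>

enum slot { SLOT_DATA, SLOT_P, SLOT_Q };

/* Which recovery math is needed, given the two untrusted slots. */
static const char *pick_recovery(enum slot a, enum slot b)
{
    if (a != SLOT_DATA && b != SLOT_DATA)
        return "no data lost, just recompute P/Q";
    if (a == SLOT_Q || b == SLOT_Q)
        return "raid5 style: rebuild data by XOR against P";
    return "raid6 style: rebuild with P and Q math";
}

int main(void)
{
    printf("a) %s\n", pick_recovery(SLOT_DATA, SLOT_DATA));
    printf("b) %s\n", pick_recovery(SLOT_DATA, SLOT_P));
    printf("c) %s\n", pick_recovery(SLOT_DATA, SLOT_Q));
    return 0;
}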

>  Does InSync design here help to work like this ?
> 
>  Now for RAID1, lets say disk1 lost first 10% of the stripes and
>  disk2 lost the last 10% of the stripes and these stripes are mutually
>  exclusive. That means there is no single stripe which has lost
>  completely.
>

True, probably we can drop patch 3 to let reads happen on out-of-sync
disks and go through the existing repair routine.

>  There is no code yet where I can see when the In_sync will be reset,
>  which is fine for this eg. Assume no reconstruction is done. But now
>  when both the disks are being mounted and as the In_sync will be based
>  on dev_item.generation, one of it will get In_sync to me it looks like
>  there will be 10% of stripes which will fail even though its not
>  practically true in this example.
>

I'm not sure I understand this paragraph; are you talking about raid1
or raid6?

Regarding when to reset In_sync, my naive idea is to utilize scrub to
verify checksums on all content that the filesystem knows about and do
the repair work.  If it turns out that everything looks good, then we
know the disk has already been resynced, and it's a good time to set
In_sync.
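
Roughly, the idea in sketch form (toy types only -- neither these
structures nor the flag exist in btrfs today):

#include <stdint.h>

#define TOY_IN_SYNC (1UL << 0)

/* Toy types, purely to sketch the idea -- not btrfs structures. */
struct toy_scrub_stats {
    uint64_t unrecoverable_errors;  /* errors scrub could not repair */
};

struct toy_device {
    unsigned long flags;
    struct toy_scrub_stats last_scrub;
};

/*
 * After a full scrub of this device has verified (and, where needed,
 * repaired) everything the filesystem knows about, the device has
 * caught up and can be trusted again.
 */
static void toy_maybe_set_in_sync(struct toy_device *dev)
{
    if (dev->last_scrub.unrecoverable_errors == 0)
        dev->flags |= TOY_IN_SYNC;
}

int main(void)
{
    struct toy_device dev = { .flags = 0, .last_scrub = { 0 } };

    toy_maybe_set_in_sync(&dev);
    return dev.flags & TOY_IN_SYNC ? 0 : 1;
}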

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/4] Btrfs: fix data corruption in raid6
  2017-11-10  0:12         ` Liu Bo
@ 2017-11-10  0:54           ` Qu Wenruo
  2017-11-10 10:15             ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2017-11-10  0:54 UTC (permalink / raw)
  To: bo.li.liu, Anand Jain; +Cc: linux-btrfs





On 2017-11-10 08:12, Liu Bo wrote:
> On Thu, Nov 09, 2017 at 05:29:25PM +0800, Anand Jain wrote:
>>
>>
>> On 11/09/2017 03:53 AM, Liu Bo wrote:
>>> On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
>>>>
>>>>
>>>> On 11/02/2017 08:54 AM, Liu Bo wrote:
>>>>> With raid6 profile, btrfs can end up with data corruption by the
>>>>> following steps.
>>>>>
>>>>> Say we have 5 disks that are set up with raid6 profile,
>>>>>
>>>>> 1) mount this btrfs
>>>>> 2) one disk gets pulled out
>>>>> 3) write something to btrfs and sync
>>>>> 4) another disk gets pulled out
>>>>> 5) write something to btrfs and sync
>>>>> 6) umount this btrfs
>>>>> 7) bring the two disk back
>>>>> 8) reboot
>>>>> 9) mount this btrfs
>>>>>
>>>>> Chances are mount will fail (even with -odegraded) because of failure
>>>>> on reading metadata blocks, IOW, our raid6 setup is not able to
>>>>> protect two disk failures in some cases.
>>>>>
>>>>> So it turns out that there is a bug in raid6's recover code, that is,
>>>>>
>>>>> if we have one of stripes in the raid6 layout as shown here,
>>>>>
>>>>> | D1(*)  | D2(*)  | D3 | D4 | D5    |
>>>>> -------------------------------------
>>>>> | data1  | data2  | P  | Q  | data0 |
>>>>
>>>>
>>>>> D1 and D2 are the two disks which got pulled out and brought back.
>>>>> When mount reads data1 and finds out that data1 doesn't match its crc,
>>>>> btrfs goes to recover data1 from other stripes, what it'll be doing is
>>>>>
>>>>> 1) read data2, parity P, parity Q and data0
>>>>>
>>>>> 2) as we have a valid parity P and two data stripes, it goes to
>>>>>    recover in raid5 style.
>>>>
>>>>
>>>>> (However, since disk D2 was pulled out and data2 on D2 could be stale,
>>>>
>>>>  data2 should end up crc error, if not then raid5 recover is as
>>>>  expected OR this example is confusing to explain the context of
>>>>  two data stripes missing.
>>>
>>> The assumption you have is not true,
>>
>>> when doing reconstruction, we
>>> don't verify checksum for each data stripe that is read from disk.
>>
>>  Why ? what if data0 (which has InSync flag) was wrong ?
>>
> 
> (Writing these took me the whole afternoon, hopefully it makes
> sense...)
> 
> Checksum verification happens much later, in readpage's end io, and if
> data0 is wrong, then the rebuilt data couldn't match its checksum.
> 
> As Qu Wenruo also has the same question, here is my input about why
> checksum is not done,
> 
> a) We may have multiple different raid profiles within one btrfs, and
>    checksum tree also resides somewhere on the disks, for
>    raid1(default) and raid5, two disk failures may end up with data
>    loss, it applies to raid6 as well because of the bug this patch is
>    addressing.

However, raid56 is more complex than other profiles, so it even has its
own raid56.[ch].
Although currently it's mostly for the complex RMW cycle, I think it can
also be enhanced to do csum verification at repair time.

> 
> b) If this is a metadata stripe, then it's likely that the stale
>    content still matches with its checksum as metadata blocks store
>    their checksum in the header part, thus checksum verification can
>    not be trusted.

Metadata can be handled later; it's much more complex since (by
default) a metadata block crosses several pages, which makes it
impossible to verify until we rebuild all related pages.

BTW, if metadata (a tree block) has stale content, it will not match
the csum.
The tree block csum covers the whole tree block, so any stale data in
the tree block will cause a mismatched csum.
It's hard to do only because we can't verify whether the
rebuilt/original tree block is good until we rebuild/read the whole
tree block.

> 
> c) If this is a data stripe, then chances are they're belonging to
>    nocow/nodatasum inodes such that no checksum could be verified.  In
>    case that checksum is available, yes, finally we can verify
>    checksum, but keep in mind, the whole data stripe may consist of
>    several pieces of data which are from different inodes, some may
>    nodatacow/nodatasum, some may not.

Let's ignore such cases too.

> 
> Given all the troubles it could lead to, I'd like to avoid it.

For the hard parts, I think keeping them untouched until a good
solution is found is completely acceptable.

No need to do it all in one go; step by step should be the correct way.

> 
> As a raid6 system, 2 disks failures can always be tolerated by
> rebuilding data from other good copies.
> 
> Now that we have known two things, i.e.
> a) data1 fails its checksum verification and
> b) data2 from a out-of-sync disk might be stale,
> it's straightforward to rebuild data1 from other good copies.

But the out-of-sync flag is too generic to determine whether we can
trust the copy.
Csum is a much better indicator here, since it works at page level
(for data) rather than at device level.

For the case of 3 out-of-sync devices, do we just treat them as 3
missing devices?

Thanks,
Qu
> 
> 
>>  I am wary about the In_sync approach. IMO we are losing the
>>  advantage of FS being merged with volume manager - which is to
>>  know exactly which active stripes were lost for quicker reconstruct.
>>
> 
> Thanks for pointing this out, I'd say filesystem still has some
> advantages over a vanilla volume manager while resync-ing out-of-sync
> disks.
> 
>>  Lets say RAID6 is 50% full and disk1 is pulled out, and at 75%
>>  full disk2 is pulled out and reaches 90% full.
>>
>>  So now when all the disks are back, two disks will not have
>>  In_sync flag? hope I understood this correctly.
>>
>>  Now.
>>  15% of the data have lost two stripes.
>>  25% of the data have lost one stripe.
>>  50% of the data did not lose any stripe at all.
>>
>>  Now for read and reconstruct..
>>  50% of the data does not need any reconstruction.
> 
> A 'Yes' for this, actually I should have avoided reading from
> out-of-sync disks, but I forgot to do so. So as a side effect, if the
> data read out from a out-of-sync disk matches its checksum, then it's
> proven to be OK.
> 
>>  25% of the data needs RAID5 style reconstruction as they have lost only one
>> stripe.
>>  15% of the data needs RAID6 reconstruction since they have lost two
>> stripes.
>>
> 
> It depends, the stripes on the out-of-sync disks could be
> a) all data stripes(as the example in this patch)
> b) one data stripe and one stripe P
> c) one data stripe and one stripe Q
> 
> with In_sync flag,
> for a) raid6 reconstruction,
> for b) raid6 reconstruction,
> for c) raid5 reconstruction.
> 
>>  Does InSync design here help to work like this ?
>>
>>  Now for RAID1, lets say disk1 lost first 10% of the stripes and
>>  disk2 lost the last 10% of the stripes and these stripes are mutually
>>  exclusive. That means there is no single stripe which has lost
>>  completely.
>>
> 
> True, probably we can drop patch 3 to let read happen on out-of-sync
> disks and go the existing repair routine.
> 
>>  There is no code yet where I can see when the In_sync will be reset,
>>  which is fine for this eg. Assume no reconstruction is done. But now
>>  when both the disks are being mounted and as the In_sync will be based
>>  on dev_item.generation, one of it will get In_sync to me it looks like
>>  there will be 10% of stripes which will fail even though its not
>>  practically true in this example.
>>
> 
> Not sure if I understand this paragraph, are you talking about raid1
> or raid6?
> 
> Regarding to when to reset In_sync, my naive idea is to utilize scrub
> to verify checksum on all content that filesystem knows and do the
> repair work, If it turns out that everything looks good, then we know
> the disk is already resync, it's a good time to set In_sync.
> 
> Thanks,
> 
> -liubo
> 



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/4] Btrfs: fix data corruption in raid6
  2017-11-10  0:54           ` Qu Wenruo
@ 2017-11-10 10:15             ` Qu Wenruo
  0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2017-11-10 10:15 UTC (permalink / raw)
  To: bo.li.liu, Anand Jain; +Cc: linux-btrfs





On 2017-11-10 08:54, Qu Wenruo wrote:
> 
> 
> On 2017-11-10 08:12, Liu Bo wrote:
>> On Thu, Nov 09, 2017 at 05:29:25PM +0800, Anand Jain wrote:
>>>
>>>
>>> On 11/09/2017 03:53 AM, Liu Bo wrote:
>>>> On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
>>>>>
>>>>>
>>>>> On 11/02/2017 08:54 AM, Liu Bo wrote:
>>>>>> With raid6 profile, btrfs can end up with data corruption by the
>>>>>> following steps.
>>>>>>
>>>>>> Say we have 5 disks that are set up with raid6 profile,
>>>>>>
>>>>>> 1) mount this btrfs
>>>>>> 2) one disk gets pulled out
>>>>>> 3) write something to btrfs and sync
>>>>>> 4) another disk gets pulled out
>>>>>> 5) write something to btrfs and sync
>>>>>> 6) umount this btrfs
>>>>>> 7) bring the two disk back
>>>>>> 8) reboot
>>>>>> 9) mount this btrfs
>>>>>>
>>>>>> Chances are mount will fail (even with -odegraded) because of failure
>>>>>> on reading metadata blocks, IOW, our raid6 setup is not able to
>>>>>> protect two disk failures in some cases.
>>>>>>
>>>>>> So it turns out that there is a bug in raid6's recover code, that is,
>>>>>>
>>>>>> if we have one of stripes in the raid6 layout as shown here,
>>>>>>
>>>>>> | D1(*)  | D2(*)  | D3 | D4 | D5    |
>>>>>> -------------------------------------
>>>>>> | data1  | data2  | P  | Q  | data0 |
>>>>>
>>>>>
>>>>>> D1 and D2 are the two disks which got pulled out and brought back.
>>>>>> When mount reads data1 and finds out that data1 doesn't match its crc,
>>>>>> btrfs goes to recover data1 from other stripes, what it'll be doing is
>>>>>>
>>>>>> 1) read data2, parity P, parity Q and data0
>>>>>>
>>>>>> 2) as we have a valid parity P and two data stripes, it goes to
>>>>>>    recover in raid5 style.
>>>>>
>>>>>
>>>>>> (However, since disk D2 was pulled out and data2 on D2 could be stale,
>>>>>
>>>>>  data2 should end up crc error, if not then raid5 recover is as
>>>>>  expected OR this example is confusing to explain the context of
>>>>>  two data stripes missing.
>>>>
>>>> The assumption you have is not true,
>>>
>>>> when doing reconstruction, we
>>>> don't verify checksum for each data stripe that is read from disk.
>>>
>>>  Why ? what if data0 (which has InSync flag) was wrong ?
>>>
>>
>> (Writing these took me the whole afternoon, hopefully it makes
>> sense...)
>>
>> Checksum verification happens much later, in readpage's end io, and if
>> data0 is wrong, then the rebuilt data couldn't match its checksum.
>>
>> As Qu Wenruo also has the same question, here is my input about why
>> checksum is not done,
>>
>> a) We may have multiple different raid profiles within one btrfs, and
>>    checksum tree also resides somewhere on the disks, for
>>    raid1(default) and raid5, two disk failures may end up with data
>>    loss, it applies to raid6 as well because of the bug this patch is
>>    addressing.
> 
> However, raid56 is more complex than other profiles, so it even has its
> own raid56.[ch].
> Although currently it's mostly for the complex RMW cycle, I think it can
> also be enhanced to do csum verification at repair time.

BTW, repair can even be done without verifying csum of data2/data0.

Just try all repair combinations, until the repair result matches the
checksum, or no combination can produce a good repair.

I know this sounds stupid, but at least this means it's possible.

The main point here is, we should at least ensure that the repair
result matches its checksum.
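
To make the combination idea concrete, here is a toy user-space sketch,
simplified to a single XOR parity (raid5-like) with a made-up checksum;
the real raid6 case would iterate over pairs of suspect stripes and use
the P/Q math in raid56.c:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NDATA 3
#define LEN   8

/* Stand-in checksum, not btrfs's crc32c. */
static uint8_t toy_csum(const uint8_t *buf)
{
    uint8_t c = 0;
    int i;

    for (i = 0; i < LEN; i++)
        c ^= (uint8_t)(buf[i] + i);
    return c;
}

int main(void)
{
    uint8_t data[NDATA][LEN] = { "AAAAAAA", "BBBBBBB", "CCCCCCC" };
    uint8_t parity[LEN] = { 0 };
    uint8_t expected[NDATA];
    int i, j, cand;

    /* Checksums and parity as written while all disks were present;
     * the parity on disk stays up to date. */
    for (i = 0; i < NDATA; i++) {
        expected[i] = toy_csum(data[i]);
        for (j = 0; j < LEN; j++)
            parity[j] ^= data[i][j];
    }

    /* One stripe comes back stale, and we don't know which one. */
    memcpy(data[1], "zzzzzzz", LEN);

    /* Try each stripe as the "bad" one; accept only a csum match. */
    for (cand = 0; cand < NDATA; cand++) {
        uint8_t rebuilt[LEN];

        memcpy(rebuilt, parity, LEN);
        for (i = 0; i < NDATA; i++)
            if (i != cand)
                for (j = 0; j < LEN; j++)
                    rebuilt[j] ^= data[i][j];

        if (toy_csum(rebuilt) == expected[cand]) {
            printf("stripe %d repaired: %s\n", cand, (char *)rebuilt);
            break;
        }
    }
    return 0;
}

Real code would of course reuse the csum tree lookups and the existing
raid6 recovery helpers instead of this toy.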

This also reminds me that checksum verification should be handled at a
lower level, rather than being done at endio time.

(Not related to this topic, but I also want to implement an optional
csum check hook in bio, so that a setup like dm-integrity on dm-raid1
can recognize the bad and the good copy, and even let btrfs reuse
device mapper to do the chunk map, but that's another story.)

Thanks,
Qu

> 
>>
>> b) If this is a metadata stripe, then it's likely that the stale
>>    content still matches with its checksum as metadata blocks store
>>    their checksum in the header part, thus checksum verification can
>>    not be trusted.
> 
> Metadata can be handled later, it's much more complex since (by default)
> metadata crosses several pages, which makes it impossible to verify
> until we rebuild all related pages.
> 
> BTW, if metadata (tree block) has stale content, it will not match the csum.
> Tree block csum is for the whole tree block, so any stale data in the
> whole tree block will cause mismatched csum.
> It's hard to do just because we can't verify if the rebuilt/original
> tree block is good until we rebuild/read the whole tree block.
> 
>>
>> c) If this is a data stripe, then chances are they're belonging to
>>    nocow/nodatasum inodes such that no checksum could be verified.  In
>>    case that checksum is available, yes, finally we can verify
>>    checksum, but keep in mind, the whole data stripe may consist of
>>    several pieces of data which are from different inodes, some may
>>    nodatacow/nodatasum, some may not.
> 
> Ignore such case too.
> 
>>
>> Given all the troubles it could lead to, I'd like to avoid it.
> 
> For hard parts, I think keeping them untouched until good solution is
> found is completely acceptable.
> 
> No need to do it all-in-one, step by step should be the correct way.
> 
>>
>> As a raid6 system, 2 disks failures can always be tolerated by
>> rebuilding data from other good copies.
>>
>> Now that we have known two things, i.e.
>> a) data1 fails its checksum verification and
>> b) data2 from a out-of-sync disk might be stale,
>> it's straightforward to rebuild data1 from other good copies.
> 
> But the out-of-sync flag is too generic to determine if we can trust the
> copy.
> Csum here is a much better indicator, which can provide page level (for
> data) indicator other than device level.
> 
> For case 3 devices out-of-sync, we just treat it as 3 missing devices?
> 
> Thanks,
> Qu
>>
>>
>>>  I am wary about the In_sync approach. IMO we are losing the
>>>  advantage of FS being merged with volume manager - which is to
>>>  know exactly which active stripes were lost for quicker reconstruct.
>>>
>>
>> Thanks for pointing this out, I'd say filesystem still has some
>> advantages over a vanilla volume manager while resync-ing out-of-sync
>> disks.
>>
>>>  Lets say RAID6 is 50% full and disk1 is pulled out, and at 75%
>>>  full disk2 is pulled out and reaches 90% full.
>>>
>>>  So now when all the disks are back, two disks will not have
>>>  In_sync flag? hope I understood this correctly.
>>>
>>>  Now.
>>>  15% of the data have lost two stripes.
>>>  25% of the data have lost one stripe.
>>>  50% of the data did not lose any stripe at all.
>>>
>>>  Now for read and reconstruct..
>>>  50% of the data does not need any reconstruction.
>>
>> A 'Yes' for this, actually I should have avoided reading from
>> out-of-sync disks, but I forgot to do so. So as a side effect, if the
>> data read out from a out-of-sync disk matches its checksum, then it's
>> proven to be OK.
>>
>>>  25% of the data needs RAID5 style reconstruction as they have lost only one
>>> stripe.
>>>  15% of the data needs RAID6 reconstruction since they have lost two
>>> stripes.
>>>
>>
>> It depends, the stripes on the out-of-sync disks could be
>> a) all data stripes(as the example in this patch)
>> b) one data stripe and one stripe P
>> c) one data stripe and one stripe Q
>>
>> with In_sync flag,
>> for a) raid6 reconstruction,
>> for b) raid6 reconstruction,
>> for c) raid5 reconstruction.
>>
>>>  Does InSync design here help to work like this ?
>>>
>>>  Now for RAID1, lets say disk1 lost first 10% of the stripes and
>>>  disk2 lost the last 10% of the stripes and these stripes are mutually
>>>  exclusive. That means there is no single stripe which has lost
>>>  completely.
>>>
>>
>> True, probably we can drop patch 3 to let read happen on out-of-sync
>> disks and go the existing repair routine.
>>
>>>  There is no code yet where I can see when the In_sync will be reset,
>>>  which is fine for this eg. Assume no reconstruction is done. But now
>>>  when both the disks are being mounted and as the In_sync will be based
>>>  on dev_item.generation, one of it will get In_sync to me it looks like
>>>  there will be 10% of stripes which will fail even though its not
>>>  practically true in this example.
>>>
>>
>> Not sure if I understand this paragraph, are you talking about raid1
>> or raid6?
>>
>> Regarding to when to reset In_sync, my naive idea is to utilize scrub
>> to verify checksum on all content that filesystem knows and do the
>> repair work, If it turns out that everything looks good, then we know
>> the disk is already resync, it's a good time to set In_sync.
>>
>> Thanks,
>>
>> -liubo
>>
> 



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/4] Btrfs: introduce device flags
  2017-11-08 19:46     ` Liu Bo
@ 2017-11-13 17:13       ` David Sterba
  0 siblings, 0 replies; 16+ messages in thread
From: David Sterba @ 2017-11-13 17:13 UTC (permalink / raw)
  To: Liu Bo; +Cc: dsterba, linux-btrfs

On Wed, Nov 08, 2017 at 11:46:13AM -0800, Liu Bo wrote:
> On Mon, Nov 06, 2017 at 05:40:25PM +0100, David Sterba wrote:
> > > +	Faulty,		/* device is known to have a fault */
> > > +	In_sync,	/* device is in_sync with rest of array */
> > Enums usually are all caps, and with some prefix.
> Well, copied from md... I'll make it all caps, but no prefix name
> seems better to me.

There are 19 .c files including volumes.h, so this needs a prefix.  We
should keep the naming consistent within btrfs; the references to MD
can be put into a comment.
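
For example, something along these lines (just to illustrate the
naming; the exact prefix is of course up to the author):

enum btrfs_dev_flags {
        BTRFS_DEV_FAULTY,       /* device is known to have a fault */
        BTRFS_DEV_IN_SYNC,      /* device is in sync with the rest of the array */
};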

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-11-13 17:15 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-02  0:54 [PATCH 0/4] Fix raid6 reconstruction bug Liu Bo
2017-11-02  0:54 ` [PATCH 1/4] Btrfs: introduce device flags Liu Bo
2017-11-06 16:40   ` David Sterba
2017-11-08 19:46     ` Liu Bo
2017-11-13 17:13       ` David Sterba
2017-11-07  9:30   ` Anand Jain
2017-11-02  0:54 ` [PATCH 2/4] Btrfs: fix data corruption in raid6 Liu Bo
2017-11-07  8:32   ` Anand Jain
2017-11-08 19:53     ` Liu Bo
2017-11-09  9:29       ` Anand Jain
2017-11-09  9:59         ` Qu Wenruo
2017-11-10  0:12         ` Liu Bo
2017-11-10  0:54           ` Qu Wenruo
2017-11-10 10:15             ` Qu Wenruo
2017-11-02  0:54 ` [PATCH 3/4] Btrfs: make raid1 and raid10 be aware of device flag In_sync Liu Bo
2017-11-02  0:54 ` [PATCH 4/4] Btrfs: change how we set In_sync Liu Bo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).