[PATCH RFC] btrfs: Do extra device generation check at mount time

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC] btrfs: Do extra device generation check at mount time
@ 2018-06-28  7:04 Qu Wenruo
  2018-06-28  7:06 ` Nikolay Borisov
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Qu Wenruo @ 2018-06-28  7:04 UTC (permalink / raw)
  To: linux-btrfs

There is a reporter considering btrfs raid1 has a major design flaw
which can't handle nodatasum files.

Despite his incorrect expectation, btrfs indeed doesn't handle device
generation mismatch well.

This means if one devices missed and re-appeared, even its generation
no longer matches with the rest device pool, btrfs does nothing to it,
but treat it as normal good device.

At least let's detect such generation mismatch and avoid mounting the
fs.
Currently there is no automatic rebuild yet, which means if users find
device generation mismatch error message, they can only mount the fs
using "device" and "degraded" mount option (if possible), then replace
the offending device to manually "rebuild" the fs.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
I totally understand that, generation based solution can't handle
split-brain case (where 2 RAID1 devices get mounted degraded separately)
at all, but at least let's handle what we can do.

The best way to solve the problem is to make btrfs treat such lower gen
devices as some kind of missing device, and queue an automatic scrub for
that device.
But that's a lot of extra work, at least let's start from detecting such
problem first.
---
 fs/btrfs/volumes.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e034ad9e23b4..80a7c44993bc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
 			      devid, uuid);
 }
 
+static int verify_devices_generation(struct btrfs_fs_info *fs_info,
+				     struct btrfs_device *dev)
+{
+	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
+	struct btrfs_device *cur;
+	bool warn_only = false;
+	int ret = 0;
+
+	if (!fs_devices || fs_devices->seeding || !dev->generation)
+		return 0;
+
+	/*
+	 * If we're not replaying log, we're completely safe to allow
+	 * generation mismatch as it won't write anything to disks, nor
+	 * remount to rw.
+	 */
+	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
+		warn_only = true;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
+		if (cur->generation && cur->generation != dev->generation) {
+			if (warn_only) {
+				btrfs_warn_rl_in_rcu(fs_info,
+	"devid %llu has unexpected generation, has %llu expected %llu",
+						     dev->devid,
+						     dev->generation,
+						     cur->generation);
+			} else {
+				btrfs_err_rl_in_rcu(fs_info,
+	"devid %llu has unexpected generation, has %llu expected %llu",
+						     dev->devid,
+						     dev->generation,
+						     cur->generation);
+				ret = -EINVAL;
+				break;
+			}
+		}
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
 static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 			  struct extent_buffer *leaf,
 			  struct btrfs_chunk *chunk)
@@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 				return PTR_ERR(map->stripes[i].dev);
 			}
 			btrfs_report_missing_device(fs_info, devid, uuid, false);
+		} else {
+			ret = verify_devices_generation(fs_info,
+							map->stripes[i].dev);
+			if (ret < 0) {
+				free_extent_map(em);
+				return ret;
+			}
 		}
 		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
 				&(map->stripes[i].dev->dev_state));
-- 
2.18.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28  7:04 [PATCH RFC] btrfs: Do extra device generation check at mount time Qu Wenruo
@ 2018-06-28  7:06 ` Nikolay Borisov
  2018-06-28  7:15   ` Qu Wenruo
  2018-06-28  7:17 ` Su Yue
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Nikolay Borisov @ 2018-06-28  7:06 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 28.06.2018 10:04, Qu Wenruo wrote:
> There is a reporter considering btrfs raid1 has a major design flaw
> which can't handle nodatasum files.
> 
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.
> 
> This means if one devices missed and re-appeared, even its generation
> no longer matches with the rest device pool, btrfs does nothing to it,
> but treat it as normal good device.
> 
> At least let's detect such generation mismatch and avoid mounting the
> fs.
> Currently there is no automatic rebuild yet, which means if users find
> device generation mismatch error message, they can only mount the fs
> using "device" and "degraded" mount option (if possible), then replace
> the offending device to manually "rebuild" the fs.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

I think a testcase of this functionality is important as well.
> ---
> I totally understand that, generation based solution can't handle
> split-brain case (where 2 RAID1 devices get mounted degraded separately)
> at all, but at least let's handle what we can do.
> 
> The best way to solve the problem is to make btrfs treat such lower gen
> devices as some kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work, at least let's start from detecting such
> problem first.
> ---
>  fs/btrfs/volumes.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e034ad9e23b4..80a7c44993bc 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
>  			      devid, uuid);
>  }
>  
> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_device *dev)
> +{
> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
> +	struct btrfs_device *cur;
> +	bool warn_only = false;
> +	int ret = 0;
> +
> +	if (!fs_devices || fs_devices->seeding || !dev->generation)
> +		return 0;
> +
> +	/*
> +	 * If we're not replaying log, we're completely safe to allow
> +	 * generation mismatch as it won't write anything to disks, nor
> +	 * remount to rw.
> +	 */
> +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
> +		warn_only = true;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
> +		if (cur->generation && cur->generation != dev->generation) {
> +			if (warn_only) {
> +				btrfs_warn_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +						     dev->devid,
> +						     dev->generation,
> +						     cur->generation);
> +			} else {
> +				btrfs_err_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +						     dev->devid,
> +						     dev->generation,
> +						     cur->generation);
> +				ret = -EINVAL;
> +				break;
> +			}
> +		}
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>  			  struct extent_buffer *leaf,
>  			  struct btrfs_chunk *chunk)
> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>  				return PTR_ERR(map->stripes[i].dev);
>  			}
>  			btrfs_report_missing_device(fs_info, devid, uuid, false);
> +		} else {
> +			ret = verify_devices_generation(fs_info,
> +							map->stripes[i].dev);
> +			if (ret < 0) {
> +				free_extent_map(em);
> +				return ret;
> +			}
>  		}
>  		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>  				&(map->stripes[i].dev->dev_state));
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28  7:06 ` Nikolay Borisov
@ 2018-06-28  7:15   ` Qu Wenruo
  2018-06-28  7:30     ` Paul Jones
  2018-06-29  3:25     ` james harvey
  0 siblings, 2 replies; 16+ messages in thread
From: Qu Wenruo @ 2018-06-28  7:15 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2018年06月28日 15:06, Nikolay Borisov wrote:
> 
> 
> On 28.06.2018 10:04, Qu Wenruo wrote:
>> There is a reporter considering btrfs raid1 has a major design flaw
>> which can't handle nodatasum files.
>>
>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>> generation mismatch well.
>>
>> This means if one devices missed and re-appeared, even its generation
>> no longer matches with the rest device pool, btrfs does nothing to it,
>> but treat it as normal good device.
>>
>> At least let's detect such generation mismatch and avoid mounting the
>> fs.
>> Currently there is no automatic rebuild yet, which means if users find
>> device generation mismatch error message, they can only mount the fs
>> using "device" and "degraded" mount option (if possible), then replace
>> the offending device to manually "rebuild" the fs.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> I think a testcase of this functionality is important as well.

It's currently an RFC patch, test case would come along with final version.

I'd like to make sure everyone, including developers and end-users, are
fine with the restrict error-out behavior.

Thanks,
Qu

>> ---
>> I totally understand that, generation based solution can't handle
>> split-brain case (where 2 RAID1 devices get mounted degraded separately)
>> at all, but at least let's handle what we can do.
>>
>> The best way to solve the problem is to make btrfs treat such lower gen
>> devices as some kind of missing device, and queue an automatic scrub for
>> that device.
>> But that's a lot of extra work, at least let's start from detecting such
>> problem first.
>> ---
>>  fs/btrfs/volumes.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 50 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index e034ad9e23b4..80a7c44993bc 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
>>  			      devid, uuid);
>>  }
>>  
>> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
>> +				     struct btrfs_device *dev)
>> +{
>> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
>> +	struct btrfs_device *cur;
>> +	bool warn_only = false;
>> +	int ret = 0;
>> +
>> +	if (!fs_devices || fs_devices->seeding || !dev->generation)
>> +		return 0;
>> +
>> +	/*
>> +	 * If we're not replaying log, we're completely safe to allow
>> +	 * generation mismatch as it won't write anything to disks, nor
>> +	 * remount to rw.
>> +	 */
>> +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
>> +		warn_only = true;
>> +
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
>> +		if (cur->generation && cur->generation != dev->generation) {
>> +			if (warn_only) {
>> +				btrfs_warn_rl_in_rcu(fs_info,
>> +	"devid %llu has unexpected generation, has %llu expected %llu",
>> +						     dev->devid,
>> +						     dev->generation,
>> +						     cur->generation);
>> +			} else {
>> +				btrfs_err_rl_in_rcu(fs_info,
>> +	"devid %llu has unexpected generation, has %llu expected %llu",
>> +						     dev->devid,
>> +						     dev->generation,
>> +						     cur->generation);
>> +				ret = -EINVAL;
>> +				break;
>> +			}
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +	return ret;
>> +}
>> +
>>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>>  			  struct extent_buffer *leaf,
>>  			  struct btrfs_chunk *chunk)
>> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>>  				return PTR_ERR(map->stripes[i].dev);
>>  			}
>>  			btrfs_report_missing_device(fs_info, devid, uuid, false);
>> +		} else {
>> +			ret = verify_devices_generation(fs_info,
>> +							map->stripes[i].dev);
>> +			if (ret < 0) {
>> +				free_extent_map(em);
>> +				return ret;
>> +			}
>>  		}
>>  		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>>  				&(map->stripes[i].dev->dev_state));
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28  7:17 ` Su Yue
@ 2018-06-28  7:16   ` Qu Wenruo
  0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2018-06-28  7:16 UTC (permalink / raw)
  To: Su Yue, Qu Wenruo, linux-btrfs



On 2018年06月28日 15:17, Su Yue wrote:
> 
> 
> On 06/28/2018 03:04 PM, Qu Wenruo wrote:
>> There is a reporter considering btrfs raid1 has a major design flaw
>> which can't handle nodatasum files.
>>
>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>> generation mismatch well.
>>
> Just say " btrfs indeed doesn't handle device generation mismatch well."
> is enough.
> Otherwise on someday someone may be confused about who is the reporter
> and what was reported.

I'm just a little over-reacting to that rude reporter.

Will definitely change that part of commit message in final version.

Thanks,
Qu

> 
> Thanks,
> Su
>> This means if one devices missed and re-appeared, even its generation
>> no longer matches with the rest device pool, btrfs does nothing to it,
>> but treat it as normal good device.
>>
>> At least let's detect such generation mismatch and avoid mounting the
>> fs.
>> Currently there is no automatic rebuild yet, which means if users find
>> device generation mismatch error message, they can only mount the fs
>> using "device" and "degraded" mount option (if possible), then replace
>> the offending device to manually "rebuild" the fs.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> I totally understand that, generation based solution can't handle
>> split-brain case (where 2 RAID1 devices get mounted degraded separately)
>> at all, but at least let's handle what we can do.
>>
>> The best way to solve the problem is to make btrfs treat such lower gen
>> devices as some kind of missing device, and queue an automatic scrub for
>> that device.
>> But that's a lot of extra work, at least let's start from detecting such
>> problem first.
>> ---
>>   fs/btrfs/volumes.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 50 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index e034ad9e23b4..80a7c44993bc 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct
>> btrfs_fs_info *fs_info,
>>                     devid, uuid);
>>   }
>>   +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
>> +                     struct btrfs_device *dev)
>> +{
>> +    struct btrfs_fs_devices *fs_devices = dev->fs_devices;
>> +    struct btrfs_device *cur;
>> +    bool warn_only = false;
>> +    int ret = 0;
>> +
>> +    if (!fs_devices || fs_devices->seeding || !dev->generation)
>> +        return 0;
>> +
>> +    /*
>> +     * If we're not replaying log, we're completely safe to allow
>> +     * generation mismatch as it won't write anything to disks, nor
>> +     * remount to rw.
>> +     */
>> +    if (btrfs_test_opt(fs_info, NOLOGREPLAY))
>> +        warn_only = true;
>> +
>> +    rcu_read_lock();
>> +    list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
>> +        if (cur->generation && cur->generation != dev->generation) {
>> +            if (warn_only) {
>> +                btrfs_warn_rl_in_rcu(fs_info,
>> +    "devid %llu has unexpected generation, has %llu expected %llu",
>> +                             dev->devid,
>> +                             dev->generation,
>> +                             cur->generation);
>> +            } else {
>> +                btrfs_err_rl_in_rcu(fs_info,
>> +    "devid %llu has unexpected generation, has %llu expected %llu",
>> +                             dev->devid,
>> +                             dev->generation,
>> +                             cur->generation);
>> +                ret = -EINVAL;
>> +                break;
>> +            }
>> +        }
>> +    }
>> +    rcu_read_unlock();
>> +    return ret;
>> +}
>> +
>>   static int read_one_chunk(struct btrfs_fs_info *fs_info, struct
>> btrfs_key *key,
>>                 struct extent_buffer *leaf,
>>                 struct btrfs_chunk *chunk)
>> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info
>> *fs_info, struct btrfs_key *key,
>>                   return PTR_ERR(map->stripes[i].dev);
>>               }
>>               btrfs_report_missing_device(fs_info, devid, uuid, false);
>> +        } else {
>> +            ret = verify_devices_generation(fs_info,
>> +                            map->stripes[i].dev);
>> +            if (ret < 0) {
>> +                free_extent_map(em);
>> +                return ret;
>> +            }
>>           }
>>           set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>>                   &(map->stripes[i].dev->dev_state));
>>
> 
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28  7:04 [PATCH RFC] btrfs: Do extra device generation check at mount time Qu Wenruo
  2018-06-28  7:06 ` Nikolay Borisov
@ 2018-06-28  7:17 ` Su Yue
  2018-06-28  7:16   ` Qu Wenruo
  2018-06-28  8:02 ` Qu Wenruo
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Su Yue @ 2018-06-28  7:17 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 06/28/2018 03:04 PM, Qu Wenruo wrote:
> There is a reporter considering btrfs raid1 has a major design flaw
> which can't handle nodatasum files.
> 
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.
> 
Just say " btrfs indeed doesn't handle device generation mismatch well."
is enough.
Otherwise on someday someone may be confused about who is the reporter 
and what was reported.

Thanks,
Su
> This means if one devices missed and re-appeared, even its generation
> no longer matches with the rest device pool, btrfs does nothing to it,
> but treat it as normal good device.
> 
> At least let's detect such generation mismatch and avoid mounting the
> fs.
> Currently there is no automatic rebuild yet, which means if users find
> device generation mismatch error message, they can only mount the fs
> using "device" and "degraded" mount option (if possible), then replace
> the offending device to manually "rebuild" the fs.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> I totally understand that, generation based solution can't handle
> split-brain case (where 2 RAID1 devices get mounted degraded separately)
> at all, but at least let's handle what we can do.
> 
> The best way to solve the problem is to make btrfs treat such lower gen
> devices as some kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work, at least let's start from detecting such
> problem first.
> ---
>   fs/btrfs/volumes.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 50 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e034ad9e23b4..80a7c44993bc 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
>   			      devid, uuid);
>   }
>   
> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_device *dev)
> +{
> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
> +	struct btrfs_device *cur;
> +	bool warn_only = false;
> +	int ret = 0;
> +
> +	if (!fs_devices || fs_devices->seeding || !dev->generation)
> +		return 0;
> +
> +	/*
> +	 * If we're not replaying log, we're completely safe to allow
> +	 * generation mismatch as it won't write anything to disks, nor
> +	 * remount to rw.
> +	 */
> +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
> +		warn_only = true;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
> +		if (cur->generation && cur->generation != dev->generation) {
> +			if (warn_only) {
> +				btrfs_warn_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +						     dev->devid,
> +						     dev->generation,
> +						     cur->generation);
> +			} else {
> +				btrfs_err_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +						     dev->devid,
> +						     dev->generation,
> +						     cur->generation);
> +				ret = -EINVAL;
> +				break;
> +			}
> +		}
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
>   static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>   			  struct extent_buffer *leaf,
>   			  struct btrfs_chunk *chunk)
> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>   				return PTR_ERR(map->stripes[i].dev);
>   			}
>   			btrfs_report_missing_device(fs_info, devid, uuid, false);
> +		} else {
> +			ret = verify_devices_generation(fs_info,
> +							map->stripes[i].dev);
> +			if (ret < 0) {
> +				free_extent_map(em);
> +				return ret;
> +			}
>   		}
>   		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>   				&(map->stripes[i].dev->dev_state));
> 



^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28  7:15   ` Qu Wenruo
@ 2018-06-28  7:30     ` Paul Jones
  2018-06-29  3:25     ` james harvey
  1 sibling, 0 replies; 16+ messages in thread
From: Paul Jones @ 2018-06-28  7:30 UTC (permalink / raw)
  To: Qu Wenruo, Nikolay Borisov, Qu Wenruo, linux-btrfs

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 2020 bytes --]

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Qu Wenruo
> Sent: Thursday, 28 June 2018 5:16 PM
> To: Nikolay Borisov <nborisov@suse.com>; Qu Wenruo <wqu@suse.com>;
> linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH RFC] btrfs: Do extra device generation check at mount
> time
> 
> 
> 
> On 2018å¹´06æœˆ28æ—¥ 15:06, Nikolay Borisov wrote:
> >
> >
> > On 28.06.2018 10:04, Qu Wenruo wrote:
> >> There is a reporter considering btrfs raid1 has a major design flaw
> >> which can't handle nodatasum files.
> >>
> >> Despite his incorrect expectation, btrfs indeed doesn't handle device
> >> generation mismatch well.
> >>
> >> This means if one devices missed and re-appeared, even its generation
> >> no longer matches with the rest device pool, btrfs does nothing to
> >> it, but treat it as normal good device.
> >>
> >> At least let's detect such generation mismatch and avoid mounting the
> >> fs.
> >> Currently there is no automatic rebuild yet, which means if users
> >> find device generation mismatch error message, they can only mount
> >> the fs using "device" and "degraded" mount option (if possible), then
> >> replace the offending device to manually "rebuild" the fs.
> >>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >
> > I think a testcase of this functionality is important as well.
> 
> It's currently an RFC patch, test case would come along with final version.
> 
> I'd like to make sure everyone, including developers and end-users, are fine
> with the restrict error-out behavior.

I've been bitten by this before and was most surprised the first time it happened. I had assumed that of course btrfs would check such a thing before mounting.
Refusing to mount is a great first step, auto scrub is even better, and only "scrubbing" files with incorrect generation is better yet.

Paul.



 

ÿôèº{.nÇ+‰·Ÿ®‰†+%ŠËÿ±éÝ¶\x17¥Šwÿº{.nÇ+‰·¥Š{±ý»k~ÏâžØ^n‡r¡ö¦zË\x1aëh™¨èÚ&£ûàz¿äz¹Þ—ú+€Ê+zf£¢·hšˆ§~††Ûiÿÿïêÿ‘êçz_è®\x0fæj:+v‰¨þ)ß£øm

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28  7:04 [PATCH RFC] btrfs: Do extra device generation check at mount time Qu Wenruo
  2018-06-28  7:06 ` Nikolay Borisov
  2018-06-28  7:17 ` Su Yue
@ 2018-06-28  8:02 ` Qu Wenruo
  2018-06-28 13:26   ` Anand Jain
  2018-06-28  8:04 ` Nikolay Borisov
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2018-06-28  8:02 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, Anand Jain


[-- Attachment #1.1: Type: text/plain, Size: 3958 bytes --]

Also CC Anand Jain, since he is also working on various device related
work, and an expert on this.

Especially I'm not pretty sure if such enhancement is already on his
agenda or there are already some unmerged patch for this.

Thanks,
Qu

On 2018年06月28日 15:04, Qu Wenruo wrote:
> There is a reporter considering btrfs raid1 has a major design flaw
> which can't handle nodatasum files.
> 
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.
> 
> This means if one devices missed and re-appeared, even its generation
> no longer matches with the rest device pool, btrfs does nothing to it,
> but treat it as normal good device.
> 
> At least let's detect such generation mismatch and avoid mounting the
> fs.
> Currently there is no automatic rebuild yet, which means if users find
> device generation mismatch error message, they can only mount the fs
> using "device" and "degraded" mount option (if possible), then replace
> the offending device to manually "rebuild" the fs.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> I totally understand that, generation based solution can't handle
> split-brain case (where 2 RAID1 devices get mounted degraded separately)
> at all, but at least let's handle what we can do.
> 
> The best way to solve the problem is to make btrfs treat such lower gen
> devices as some kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work, at least let's start from detecting such
> problem first.
> ---
>  fs/btrfs/volumes.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e034ad9e23b4..80a7c44993bc 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
>  			      devid, uuid);
>  }
>  
> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_device *dev)
> +{
> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
> +	struct btrfs_device *cur;
> +	bool warn_only = false;
> +	int ret = 0;
> +
> +	if (!fs_devices || fs_devices->seeding || !dev->generation)
> +		return 0;
> +
> +	/*
> +	 * If we're not replaying log, we're completely safe to allow
> +	 * generation mismatch as it won't write anything to disks, nor
> +	 * remount to rw.
> +	 */
> +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
> +		warn_only = true;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
> +		if (cur->generation && cur->generation != dev->generation) {
> +			if (warn_only) {
> +				btrfs_warn_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +						     dev->devid,
> +						     dev->generation,
> +						     cur->generation);
> +			} else {
> +				btrfs_err_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +						     dev->devid,
> +						     dev->generation,
> +						     cur->generation);
> +				ret = -EINVAL;
> +				break;
> +			}
> +		}
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>  			  struct extent_buffer *leaf,
>  			  struct btrfs_chunk *chunk)
> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>  				return PTR_ERR(map->stripes[i].dev);
>  			}
>  			btrfs_report_missing_device(fs_info, devid, uuid, false);
> +		} else {
> +			ret = verify_devices_generation(fs_info,
> +							map->stripes[i].dev);
> +			if (ret < 0) {
> +				free_extent_map(em);
> +				return ret;
> +			}
>  		}
>  		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>  				&(map->stripes[i].dev->dev_state));
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28  7:04 [PATCH RFC] btrfs: Do extra device generation check at mount time Qu Wenruo
                   ` (2 preceding siblings ...)
  2018-06-28  8:02 ` Qu Wenruo
@ 2018-06-28  8:04 ` Nikolay Borisov
  2018-06-28 10:58   ` Qu Wenruo
  2018-06-28 14:34 ` Alberto Bursi
  2018-06-28 14:36 ` Adam Borowski
  5 siblings, 1 reply; 16+ messages in thread
From: Nikolay Borisov @ 2018-06-28  8:04 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 28.06.2018 10:04, Qu Wenruo wrote:
> There is a reporter considering btrfs raid1 has a major design flaw
> which can't handle nodatasum files.
> 
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.

I think this is getting a bit personal, no need for that. Just say that
btrfs doesn't handle this particular case and that's it. Though I agree
the reporter's attitude wasn't exactly nice...

> 
> This means if one devices missed and re-appeared, even its generation
> no longer matches with the rest device pool, btrfs does nothing to it,
> but treat it as normal good device.
> 
> At least let's detect such generation mismatch and avoid mounting the
> fs.
> Currently there is no automatic rebuild yet, which means if users find
> device generation mismatch error message, they can only mount the fs
> using "device" and "degraded" mount option (if possible), then replace
> the offending device to manually "rebuild" the fs.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> I totally understand that, generation based solution can't handle
> split-brain case (where 2 RAID1 devices get mounted degraded separately)
> at all, but at least let's handle what we can do.
> 
> The best way to solve the problem is to make btrfs treat such lower gen
> devices as some kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work, at least let's start from detecting such
> problem first.
> ---
>  fs/btrfs/volumes.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e034ad9e23b4..80a7c44993bc 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
>  			      devid, uuid);
>  }
>  
> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_device *dev)
> +{
> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
> +	struct btrfs_device *cur;
> +	bool warn_only = false;
> +	int ret = 0;
> +
> +	if (!fs_devices || fs_devices->seeding || !dev->generation)
> +		return 0;
> +
> +	/*
> +	 * If we're not replaying log, we're completely safe to allow
> +	 * generation mismatch as it won't write anything to disks, nor
> +	 * remount to rw.
> +	 */
> +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
> +		warn_only = true;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {> +		if (cur->generation && cur->generation != dev->generation) {
> +			if (warn_only) {
> +				btrfs_warn_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +						     dev->devid,
> +						     dev->generation,
> +						     cur->generation);
> +			} else {
> +				btrfs_err_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +						     dev->devid,
> +						     dev->generation,
> +						     cur->generation);
> +				ret = -EINVAL;
> +				break;
> +			}
> +		}
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>  			  struct extent_buffer *leaf,
>  			  struct btrfs_chunk *chunk)
> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>  				return PTR_ERR(map->stripes[i].dev);
>  			}
>  			btrfs_report_missing_device(fs_info, devid, uuid, false);
> +		} else {
> +			ret = verify_devices_generation(fs_info,
> +							map->stripes[i].dev);
> +			if (ret < 0) {
> +				free_extent_map(em);
> +				return ret;
> +			}

So this will be executed when doing mount. What about one device
disappearing, while the filesystem is still mounted and then later
re-appears. This code won't be executed in this case, no ?
>  		}
>  		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>  				&(map->stripes[i].dev->dev_state));
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28  8:04 ` Nikolay Borisov
@ 2018-06-28 10:58   ` Qu Wenruo
  0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2018-06-28 10:58 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs, Anand Jain



On 2018年06月28日 16:04, Nikolay Borisov wrote:
> 
> 
> On 28.06.2018 10:04, Qu Wenruo wrote:
>> There is a reporter considering btrfs raid1 has a major design flaw
>> which can't handle nodatasum files.
>>
>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>> generation mismatch well.
> 
> I think this is getting a bit personal, no need for that. Just say that
> btrfs doesn't handle this particular case and that's it. Though I agree
> the reporter's attitude wasn't exactly nice...
> 
>>
>> This means if one devices missed and re-appeared, even its generation
>> no longer matches with the rest device pool, btrfs does nothing to it,
>> but treat it as normal good device.
>>
>> At least let's detect such generation mismatch and avoid mounting the
>> fs.
>> Currently there is no automatic rebuild yet, which means if users find
>> device generation mismatch error message, they can only mount the fs
>> using "device" and "degraded" mount option (if possible), then replace
>> the offending device to manually "rebuild" the fs.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> I totally understand that, generation based solution can't handle
>> split-brain case (where 2 RAID1 devices get mounted degraded separately)
>> at all, but at least let's handle what we can do.
>>
>> The best way to solve the problem is to make btrfs treat such lower gen
>> devices as some kind of missing device, and queue an automatic scrub for
>> that device.
>> But that's a lot of extra work, at least let's start from detecting such
>> problem first.
>> ---
>>  fs/btrfs/volumes.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 50 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index e034ad9e23b4..80a7c44993bc 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
>>  			      devid, uuid);
>>  }
>>  
>> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
>> +				     struct btrfs_device *dev)
>> +{
>> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
>> +	struct btrfs_device *cur;
>> +	bool warn_only = false;
>> +	int ret = 0;
>> +
>> +	if (!fs_devices || fs_devices->seeding || !dev->generation)
>> +		return 0;
>> +
>> +	/*
>> +	 * If we're not replaying log, we're completely safe to allow
>> +	 * generation mismatch as it won't write anything to disks, nor
>> +	 * remount to rw.
>> +	 */
>> +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
>> +		warn_only = true;
>> +
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {> +		if (cur->generation && cur->generation != dev->generation) {
>> +			if (warn_only) {
>> +				btrfs_warn_rl_in_rcu(fs_info,
>> +	"devid %llu has unexpected generation, has %llu expected %llu",
>> +						     dev->devid,
>> +						     dev->generation,
>> +						     cur->generation);
>> +			} else {
>> +				btrfs_err_rl_in_rcu(fs_info,
>> +	"devid %llu has unexpected generation, has %llu expected %llu",
>> +						     dev->devid,
>> +						     dev->generation,
>> +						     cur->generation);
>> +				ret = -EINVAL;
>> +				break;
>> +			}
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +	return ret;
>> +}
>> +
>>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>>  			  struct extent_buffer *leaf,
>>  			  struct btrfs_chunk *chunk)
>> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>>  				return PTR_ERR(map->stripes[i].dev);
>>  			}
>>  			btrfs_report_missing_device(fs_info, devid, uuid, false);
>> +		} else {
>> +			ret = verify_devices_generation(fs_info,
>> +							map->stripes[i].dev);
>> +			if (ret < 0) {
>> +				free_extent_map(em);
>> +				return ret;
>> +			}
> 
> So this will be executed when doing mount. What about one device
> disappearing, while the filesystem is still mounted and then later
> re-appears. This code won't be executed in this case, no ?

This depends on how the re-appear happens.

If just several bio fails, it shouldn't be a big problem.
If it really re-appears, I'm not pretty sure how btrfs will handle it,
since there will be a new device with new device id.

If btrfs currently just replace the block device on-the-fly, this patch
can't handle it at all, just as you stated.

I think Anand Jain is more experienced in such situation.

Thanks,
Qu

>>  		}
>>  		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>>  				&(map->stripes[i].dev->dev_state));
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28  8:02 ` Qu Wenruo
@ 2018-06-28 13:26   ` Anand Jain
  2018-06-28 13:29     ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: Anand Jain @ 2018-06-28 13:26 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs



On 06/28/2018 04:02 PM, Qu Wenruo wrote:
> Also CC Anand Jain, since he is also working on various device related
> work, and an expert on this.
> 
> Especially I'm not pretty sure if such enhancement is already on his
> agenda or there are already some unmerged patch for this.

  Right, some of the patches are already in the ML and probably it needs
  a revival. Unless there are new challenges, my target is to consolidate
  them and get them integrated by this year. I am trying harder.

  I think its a good idea to report the write-hole status to the
  user-land instead of failing the mount, so that a script can trigger
  the re-silver as required by the use case. I introduced
  fs_devices::volume_flags, which is under review as of now, and have
  plans to add the volume degraded state bit which can be accessed by
  the sysfs. So yes this is been taken care.


Thanks, Anand


> Thanks,
> Qu
> 
> On 2018年06月28日 15:04, Qu Wenruo wrote:
>> There is a reporter considering btrfs raid1 has a major design flaw
>> which can't handle nodatasum files.
>>
>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>> generation mismatch well.
>>
>> This means if one devices missed and re-appeared, even its generation
>> no longer matches with the rest device pool, btrfs does nothing to it,
>> but treat it as normal good device.
>>
>> At least let's detect such generation mismatch and avoid mounting the
>> fs.
>> Currently there is no automatic rebuild yet, which means if users find
>> device generation mismatch error message, they can only mount the fs
>> using "device" and "degraded" mount option (if possible), then replace
>> the offending device to manually "rebuild" the fs.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> I totally understand that, generation based solution can't handle
>> split-brain case (where 2 RAID1 devices get mounted degraded separately)
>> at all, but at least let's handle what we can do.
>>
>> The best way to solve the problem is to make btrfs treat such lower gen
>> devices as some kind of missing device, and queue an automatic scrub for
>> that device.
>> But that's a lot of extra work, at least let's start from detecting such
>> problem first.
>> ---
>>   fs/btrfs/volumes.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 50 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index e034ad9e23b4..80a7c44993bc 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
>>   			      devid, uuid);
>>   }
>>   
>> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
>> +				     struct btrfs_device *dev)
>> +{
>> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
>> +	struct btrfs_device *cur;
>> +	bool warn_only = false;
>> +	int ret = 0;
>> +
>> +	if (!fs_devices || fs_devices->seeding || !dev->generation)
>> +		return 0;
>> +
>> +	/*
>> +	 * If we're not replaying log, we're completely safe to allow
>> +	 * generation mismatch as it won't write anything to disks, nor
>> +	 * remount to rw.
>> +	 */
>> +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
>> +		warn_only = true;
>> +
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
>> +		if (cur->generation && cur->generation != dev->generation) {
>> +			if (warn_only) {
>> +				btrfs_warn_rl_in_rcu(fs_info,
>> +	"devid %llu has unexpected generation, has %llu expected %llu",
>> +						     dev->devid,
>> +						     dev->generation,
>> +						     cur->generation);
>> +			} else {
>> +				btrfs_err_rl_in_rcu(fs_info,
>> +	"devid %llu has unexpected generation, has %llu expected %llu",
>> +						     dev->devid,
>> +						     dev->generation,
>> +						     cur->generation);
>> +				ret = -EINVAL;




>> +				break;
>> +			}
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +	return ret;
>> +}
>> +
>>   static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>>   			  struct extent_buffer *leaf,
>>   			  struct btrfs_chunk *chunk)
>> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>>   				return PTR_ERR(map->stripes[i].dev);
>>   			}
>>   			btrfs_report_missing_device(fs_info, devid, uuid, false);
>> +		} else {
>> +			ret = verify_devices_generation(fs_info,
>> +							map->stripes[i].dev);
>> +			if (ret < 0) {
>> +				free_extent_map(em);
>> +				return ret;
>> +			}
>>   		}
>>   		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>>   				&(map->stripes[i].dev->dev_state));
>>
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28 13:26   ` Anand Jain
@ 2018-06-28 13:29     ` Qu Wenruo
  0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2018-06-28 13:29 UTC (permalink / raw)
  To: Anand Jain, Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 6228 bytes --]



On 2018年06月28日 21:26, Anand Jain wrote:
> 
> 
> On 06/28/2018 04:02 PM, Qu Wenruo wrote:
>> Also CC Anand Jain, since he is also working on various device related
>> work, and an expert on this.
>>
>> Especially I'm not pretty sure if such enhancement is already on his
>> agenda or there are already some unmerged patch for this.
> 
>  Right, some of the patches are already in the ML and probably it needs
>  a revival. Unless there are new challenges, my target is to consolidate
>  them and get them integrated by this year. I am trying harder.
> 
>  I think its a good idea to report the write-hole status to the
>  user-land instead of failing the mount, so that a script can trigger
>  the re-silver as required by the use case. I introduced
>  fs_devices::volume_flags, which is under review as of now, and have
>  plans to add the volume degraded state bit which can be accessed by
>  the sysfs. So yes this is been taken care.

Glad to hear that!

However I found it pretty hard to trace your latest work, would you mind
to share the git repo and branch you're working on?

Maybe I could take some time to do some review.

Thanks,
Qu

> 
> 
> Thanks, Anand
> 
> 
>> Thanks,
>> Qu
>>
>> On 2018年06月28日 15:04, Qu Wenruo wrote:
>>> There is a reporter considering btrfs raid1 has a major design flaw
>>> which can't handle nodatasum files.
>>>
>>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>>> generation mismatch well.
>>>
>>> This means if one devices missed and re-appeared, even its generation
>>> no longer matches with the rest device pool, btrfs does nothing to it,
>>> but treat it as normal good device.
>>>
>>> At least let's detect such generation mismatch and avoid mounting the
>>> fs.
>>> Currently there is no automatic rebuild yet, which means if users find
>>> device generation mismatch error message, they can only mount the fs
>>> using "device" and "degraded" mount option (if possible), then replace
>>> the offending device to manually "rebuild" the fs.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>> I totally understand that, generation based solution can't handle
>>> split-brain case (where 2 RAID1 devices get mounted degraded separately)
>>> at all, but at least let's handle what we can do.
>>>
>>> The best way to solve the problem is to make btrfs treat such lower gen
>>> devices as some kind of missing device, and queue an automatic scrub for
>>> that device.
>>> But that's a lot of extra work, at least let's start from detecting such
>>> problem first.
>>> ---
>>>   fs/btrfs/volumes.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 50 insertions(+)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index e034ad9e23b4..80a7c44993bc 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct
>>> btrfs_fs_info *fs_info,
>>>                     devid, uuid);
>>>   }
>>>   +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
>>> +                     struct btrfs_device *dev)
>>> +{
>>> +    struct btrfs_fs_devices *fs_devices = dev->fs_devices;
>>> +    struct btrfs_device *cur;
>>> +    bool warn_only = false;
>>> +    int ret = 0;
>>> +
>>> +    if (!fs_devices || fs_devices->seeding || !dev->generation)
>>> +        return 0;
>>> +
>>> +    /*
>>> +     * If we're not replaying log, we're completely safe to allow
>>> +     * generation mismatch as it won't write anything to disks, nor
>>> +     * remount to rw.
>>> +     */
>>> +    if (btrfs_test_opt(fs_info, NOLOGREPLAY))
>>> +        warn_only = true;
>>> +
>>> +    rcu_read_lock();
>>> +    list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
>>> +        if (cur->generation && cur->generation != dev->generation) {
>>> +            if (warn_only) {
>>> +                btrfs_warn_rl_in_rcu(fs_info,
>>> +    "devid %llu has unexpected generation, has %llu expected %llu",
>>> +                             dev->devid,
>>> +                             dev->generation,
>>> +                             cur->generation);
>>> +            } else {
>>> +                btrfs_err_rl_in_rcu(fs_info,
>>> +    "devid %llu has unexpected generation, has %llu expected %llu",
>>> +                             dev->devid,
>>> +                             dev->generation,
>>> +                             cur->generation);
>>> +                ret = -EINVAL;
> 
> 
> 
> 
>>> +                break;
>>> +            }
>>> +        }
>>> +    }
>>> +    rcu_read_unlock();
>>> +    return ret;
>>> +}
>>> +
>>>   static int read_one_chunk(struct btrfs_fs_info *fs_info, struct
>>> btrfs_key *key,
>>>                 struct extent_buffer *leaf,
>>>                 struct btrfs_chunk *chunk)
>>> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info
>>> *fs_info, struct btrfs_key *key,
>>>                   return PTR_ERR(map->stripes[i].dev);
>>>               }
>>>               btrfs_report_missing_device(fs_info, devid, uuid, false);
>>> +        } else {
>>> +            ret = verify_devices_generation(fs_info,
>>> +                            map->stripes[i].dev);
>>> +            if (ret < 0) {
>>> +                free_extent_map(em);
>>> +                return ret;
>>> +            }
>>>           }
>>>           set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>>>                   &(map->stripes[i].dev->dev_state));
>>>
>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28  7:04 [PATCH RFC] btrfs: Do extra device generation check at mount time Qu Wenruo
                   ` (3 preceding siblings ...)
  2018-06-28  8:04 ` Nikolay Borisov
@ 2018-06-28 14:34 ` Alberto Bursi
  2018-06-28 14:36 ` Adam Borowski
  5 siblings, 0 replies; 16+ messages in thread
From: Alberto Bursi @ 2018-06-28 14:34 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1105 bytes --]



On 28/06/2018 09:04, Qu Wenruo wrote:
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.
>
> This means if one devices missed and re-appeared, even its generation
> no longer matches with the rest device pool, btrfs does nothing to it,
> but treat it as normal good device.
>
> At least let's detect such generation mismatch and avoid mounting the
> fs.
> Currently there is no automatic rebuild yet, which means if users find
> device generation mismatch error message, they can only mount the fs
> using "device" and "degraded" mount option (if possible), then replace
> the offending device to manually "rebuild" the fs.
>

Yes. This is a long-standing issue, handling it this way is similar to 
what mdadm
(software raid) also does.

Please get this merged fast, don't get bogged down too much with 
integrating with
Anand Jain's branch as this is a big issue and should get at least this 
basic mitigation asap.

-Alberto
ÿôèº{.nÇ+‰·Ÿ®‰†+%ŠËÿ±éÝ¶\x17¥Šwÿº{.nÇ+‰·¥Š{±ý»k~ÏâžØ^n‡r¡ö¦zË\x1aëh™¨èÚ&£ûàz¿äz¹Þ—ú+€Ê+zf£¢·hšˆ§~††Ûiÿÿïêÿ‘êçz_è®\x0fæj:+v‰¨þ)ß£øm

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28  7:04 [PATCH RFC] btrfs: Do extra device generation check at mount time Qu Wenruo
                   ` (4 preceding siblings ...)
  2018-06-28 14:34 ` Alberto Bursi
@ 2018-06-28 14:36 ` Adam Borowski
  2018-06-28 16:00   ` Remi Gauvin
  2018-06-29  0:17   ` Qu Wenruo
  5 siblings, 2 replies; 16+ messages in thread
From: Adam Borowski @ 2018-06-28 14:36 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Jun 28, 2018 at 03:04:43PM +0800, Qu Wenruo wrote:
> There is a reporter considering btrfs raid1 has a major design flaw
> which can't handle nodatasum files.
> 
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.
> 
> This means if one devices missed and re-appeared, even its generation
> no longer matches with the rest device pool, btrfs does nothing to it,
> but treat it as normal good device.
> 
> At least let's detect such generation mismatch and avoid mounting the
> fs.

Uhm, that'd be a nasty regression for the regular (no-nodatacow) case. 
The vast majority of data is fine, and extents that have been written to
while a device is missing will be either placed elsewhere (if the filesystem
knew it was degraded) or read one of the copies to notice a wrong checksum
and automatically recover (if the device was still falsely believed to be
good at write time).

We currently don't have selective scrub yet so resyncing such single-copy
extents is costly, but 1. all will be fine if the data is read, 2. it's
possible to add such a smart resync in the future, far better than a
write-intent bitmap can do.

To do the latter, we can note the last generation the filesystem was known
to be fully coherent (ie, all devices were successfully flushed with no
mysterious write failures), then run selective scrub (perhaps even
automatically) when the filesystem is no longer degraded.  There's some
extra complexity with 3- or 4-way RAID (multiple levels of degradation) but
a single number would help even there.

But even currently, without the above not-yet-written recovery, it's
reasonably safe to continue without scrub -- it's a case of running
partially degraded when the bad copy is already known to be suspicious.

For no-nodatacow data and metadata, that is.

> Currently there is no automatic rebuild yet, which means if users find
> device generation mismatch error message, they can only mount the fs
> using "device" and "degraded" mount option (if possible), then replace
> the offending device to manually "rebuild" the fs.

As nodatacow already means "I don't care about this data, or have another
way of recovering it", I don't quite get why we would drop existing
auto-recovery for a common transient failure case.

If you're paranoid, perhaps some bit "this filesystem has some nodatacow
data on it" could warrant such a block, but it would still need to be
overridable _without_ a need for replace.  There's also the problem that
systemd marks its journal nodatacow (despite it having infamously bad
handling of failures!), and too many distributions infect their default
installs with systemd, meaning such a bit would be on in most cases.

But why would I put all my other data at risk, just because there's a
nodatacow file?  There's a big difference between scrubbing when only a few
transactions worth of data is suspicious and completely throwing away a
mostly-good replica to replace it from the now fully degraded copy.

> I totally understand that, generation based solution can't handle
> split-brain case (where 2 RAID1 devices get mounted degraded separately)
> at all, but at least let's handle what we can do.

Generation can do well at least unless both devices were mounted elsewhere
and got the exact same number of transactions, the problem is that nodatacow
doesn't bump generation number.

> The best way to solve the problem is to make btrfs treat such lower gen
> devices as some kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work, at least let's start from detecting such
> problem first.

I wonder if there's some way to treat problematic nodatacow files as
degraded only?

Nodatacow misses most of btrfs mechanisms, thus to get it done right you'd
need to pretty much copy all of md's logic, with a write-intent bitmap or an
equivalent.

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ There's an easy way to tell toy operating systems from real ones.
⣾⠁⢰⠒⠀⣿⡁ Just look at how their shipped fonts display U+1F52B, this makes
⢿⡄⠘⠷⠚⠋⠀ the intended audience obvious.  It's also interesting to see OSes
⠈⠳⣄⠀⠀⠀⠀ go back and forth wrt their intended target.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28 14:36 ` Adam Borowski
@ 2018-06-28 16:00   ` Remi Gauvin
  2018-06-29  0:17   ` Qu Wenruo
  1 sibling, 0 replies; 16+ messages in thread
From: Remi Gauvin @ 2018-06-28 16:00 UTC (permalink / raw)
  To: linux-btrfs

On 2018-06-28 10:36 AM, Adam Borowski wrote:

> 
> Uhm, that'd be a nasty regression for the regular (no-nodatacow) case. 
> The vast majority of data is fine, and extents that have been written to
> while a device is missing will be either placed elsewhere (if the filesystem
> knew it was degraded) or read one of the copies to notice a wrong checksum
> and automatically recover (if the device was still falsely believed to be
> good at write time).
> 
> We currently don't have selective scrub yet so resyncing such single-copy

That might not be the case. though I don't really know the numbers
myself and repeating this is hearsay:

crc32 is not infallible.  1 in so many billion errors will be undetected
by it.  In the case of a dropped device with write failures, when you
*know* the data supposedly written to the disk is bad, re-synching from
believed good copy (so long as it passes checksum verification, of
course), is the only way to be certain that the data is good.

Otherwise, you can be left with a Schroedinger's bit somewhere,  (It's
not 0 or 1, but both, depending on which device the filesystem is
reading from at the time.)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28 14:36 ` Adam Borowski
  2018-06-28 16:00   ` Remi Gauvin
@ 2018-06-29  0:17   ` Qu Wenruo
  1 sibling, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2018-06-29  0:17 UTC (permalink / raw)
  To: Adam Borowski, Qu Wenruo; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 5299 bytes --]



On 2018年06月28日 22:36, Adam Borowski wrote:
> On Thu, Jun 28, 2018 at 03:04:43PM +0800, Qu Wenruo wrote:
>> There is a reporter considering btrfs raid1 has a major design flaw
>> which can't handle nodatasum files.
>>
>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>> generation mismatch well.
>>
>> This means if one devices missed and re-appeared, even its generation
>> no longer matches with the rest device pool, btrfs does nothing to it,
>> but treat it as normal good device.
>>
>> At least let's detect such generation mismatch and avoid mounting the
>> fs.
> 
> Uhm, that'd be a nasty regression for the regular (no-nodatacow) case. 
> The vast majority of data is fine, and extents that have been written to
> while a device is missing will be either placed elsewhere (if the filesystem
> knew it was degraded) or read one of the copies to notice a wrong checksum
> and automatically recover (if the device was still falsely believed to be
> good at write time).

Yes, for fs without any nodatasum usage, the behavior is indeed overkilled.
But sometimes such overkilled sanity check is really important, as long
as nodatasum is a provided feature.

> 
> We currently don't have selective scrub yet so resyncing such single-copy
> extents is costly, but 1. all will be fine if the data is read, 2. it's
> possible to add such a smart resync in the future, far better than a
> write-intent bitmap can do.

Well, auto scrub for a device looks not that bad to me.
Since normally scrub is scheduled as normal maintenance work, it should
not be a super expensive work.

We only need to teach btrfs to treat such device as kind of degraded.
Then we can reuse most of the scrub routine to fix it.

> 
> To do the latter, we can note the last generation the filesystem was known
> to be fully coherent (ie, all devices were successfully flushed with no
> mysterious write failures), then run selective scrub (perhaps even
> automatically) when the filesystem is no longer degraded.  There's some
> extra complexity with 3- or 4-way RAID (multiple levels of degradation) but
> a single number would help even there.
> 
> But even currently, without the above not-yet-written recovery, it's
> reasonably safe to continue without scrub -- it's a case of running
> partially degraded when the bad copy is already known to be suspicious.
> 
> For no-nodatacow data and metadata, that is.
> 
>> Currently there is no automatic rebuild yet, which means if users find
>> device generation mismatch error message, they can only mount the fs
>> using "device" and "degraded" mount option (if possible), then replace
>> the offending device to manually "rebuild" the fs.
> 
> As nodatacow already means "I don't care about this data, or have another
> way of recovering it", I don't quite get why we would drop existing
> auto-recovery for a common transient failure case.

Yep, exactly my understanding of nodatasum behavior.

However in real world, btrfs is the only *linux* fs supports data csum,
and the most widely used fs like ext4 xfs doesn't support data csum.

As the discussion about the behavior goes, I find that LVM/mdraid +
ext4/xfs could do better device missing management than btrfs nodatasum,
this means we need to at least do something that LVM/mdraid could provide.

> 
> If you're paranoid, perhaps some bit "this filesystem has some nodatacow
> data on it" could warrant such a block, but it would still need to be
> overridable _without_ a need for replace.  There's also the problem that
> systemd marks its journal nodatacow (despite it having infamously bad
> handling of failures!), and too many distributions infect their default
> installs with systemd, meaning such a bit would be on in most cases.
> 
> But why would I put all my other data at risk, just because there's a
> nodatacow file?  There's a big difference between scrubbing when only a few
> transactions worth of data is suspicious and completely throwing away a
> mostly-good replica to replace it from the now fully degraded copy.
> 
>> I totally understand that, generation based solution can't handle
>> split-brain case (where 2 RAID1 devices get mounted degraded separately)
>> at all, but at least let's handle what we can do.
> 
> Generation can do well at least unless both devices were mounted elsewhere
> and got the exact same number of transactions, the problem is that nodatacow
> doesn't bump generation number.

Generation is never a problem, as any metadata change will still bump
generation.

> 
>> The best way to solve the problem is to make btrfs treat such lower gen
>> devices as some kind of missing device, and queue an automatic scrub for
>> that device.
>> But that's a lot of extra work, at least let's start from detecting such
>> problem first.
> 
> I wonder if there's some way to treat problematic nodatacow files as
> degraded only?
> 
> Nodatacow misses most of btrfs mechanisms, thus to get it done right you'd
> need to pretty much copy all of md's logic, with a write-intent bitmap or an
> equivalent.

At least, let's see what LVM/md is doing, and try to learn something is
never a bad idea.

Thanks,
Qu

> 
> 
> Meow!
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
  2018-06-28  7:15   ` Qu Wenruo
  2018-06-28  7:30     ` Paul Jones
@ 2018-06-29  3:25     ` james harvey
  1 sibling, 0 replies; 16+ messages in thread
From: james harvey @ 2018-06-29  3:25 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Nikolay Borisov, Qu Wenruo, Btrfs BTRFS

On Thu, Jun 28, 2018 at 3:15 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> I'd like to make sure everyone, including developers and end-users, are
> fine with the restrict error-out behavior.

Yes, please error out, as a start.  Requesting this was on my
btrfs-to-do list.  A device generation mismatch from a drive going
offline until forced reboot was what ultimately sent me on the path of
problems I had.  Had parent transid errors, which I realized was due
to this, and decided to replace the outdated lvm partition with a new
one.  (Hiding the bad one to prevent UUID problems of course.)

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-06-29  3:25 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-28  7:04 [PATCH RFC] btrfs: Do extra device generation check at mount time Qu Wenruo
2018-06-28  7:06 ` Nikolay Borisov
2018-06-28  7:15   ` Qu Wenruo
2018-06-28  7:30     ` Paul Jones
2018-06-29  3:25     ` james harvey
2018-06-28  7:17 ` Su Yue
2018-06-28  7:16   ` Qu Wenruo
2018-06-28  8:02 ` Qu Wenruo
2018-06-28 13:26   ` Anand Jain
2018-06-28 13:29     ` Qu Wenruo
2018-06-28  8:04 ` Nikolay Borisov
2018-06-28 10:58   ` Qu Wenruo
2018-06-28 14:34 ` Alberto Bursi
2018-06-28 14:36 ` Adam Borowski
2018-06-28 16:00   ` Remi Gauvin
2018-06-29  0:17   ` Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.