Re: [PATCH 3/3] btrfs: volumes: Allocate degraded chunks if rw devices can't fullfil a chunk

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Anand Jain <anand.jain@oracle.com>, Qu Wenruo <wqu@suse.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 3/3] btrfs: volumes: Allocate degraded chunks if rw devices can't fullfil a chunk
Date: Tue, 19 Nov 2019 18:41:49 +0800	[thread overview]
Message-ID: <f928122d-4e77-e83b-9a53-d2eea7ee16d3@gmx.com> (raw)
In-Reply-To: <6cc25dbd-55e4-43bb-7b95-86c62bee27c7@oracle.com>

[-- Attachment #1.1: Type: text/plain, Size: 8059 bytes --]

On 2019/11/19 下午6:05, Anand Jain wrote:
> On 11/7/19 2:27 PM, Qu Wenruo wrote:
>> [PROBLEM]
>> Btrfs degraded mount will fallback to SINGLE profile if there are not
>> enough devices:
> 
>  Its better to keep it like this for now until there is a fix for the
>  write hole. Otherwise hitting the write hole bug in case of degraded
>  raid1 will be more prevalent.

Write hole should be a problem for RAID5/6, not the degraded chunk
feature itself.

Furthermore, this design will try to avoid allocating chunks using
missing devices.
So even for 3 devices RAID5, new chunks will be allocated by using
existing devices (2 devices RAID5), so no new write hole is introduced.

> 
>  I proposed a RFC a long time before [1] (also in there, there
>  is a commit id which turned the degraded raid1 profile into single
>  profile (without much write-up on it)).
> 
>    [1] [PATCH 0/2] [RFC] btrfs: create degraded-RAID1 chunks

My point for this patchset is:
- Create regular chunk if we have enough devices
- Create degraded chunk only when we have not enough devices

I guess since you didn't get the point of my preparation patches, your
patches aren't that good to avoid missing devices.

> 
>  Similarly the patch related to the reappearing missing device [2]
>  falls under the same category. Will push for the integration after
>  the write hole fix.
> 
>    [2] [PATCH] btrfs: handle dynamically reappearing missing device
>    (test case 154).

That's another case, and I didn't see how it affects this feature.

> 
>  If you look close enough the original author has quite nicely made
>  sure write hole bug will be very difficultly to hit. These fixes
>  shall make it easy to hit. So its better to work on the write hole
>  first.

If you're talking about RAID5/6, you are talking at the wrong thread.
Go implement some write-a-head log for RAID5/6, or mark all degraded
RAID5/6 chunks read-only at mount time.

> 
>  I am trying to fix write hole. First attempt has limited success
>  (works fine in two disk raid1 only). Now trying other ways to fix.
> 
>>   # mkfs.btrfs -f /dev/test/scratch[12] -m raid1 -d raid1
>>   # wipefs -fa /dev/test/scratch2
>>   # mount -o degraded /dev/test/scratch1 /mnt/btrfs
>>   # fallocate -l 1G /mnt/btrfs/foobar
>>   # btrfs ins dump-tree -t chunk /dev/test/scratch1
>>          item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 1674575872) itemoff
>> 15511 itemsize 80
>>                  length 536870912 owner 2 stripe_len 65536 type DATA
>>   New data chunk will fallback to SINGLE.
>>
>> If user doesn't balance those SINGLE chunks, even with missing devices
>> replaced, the fs is no longer full RAID1, and a missing device can break
>> the tolerance.
> 
>  As its been discussed quite a lot of time before, the current
>  re-silver/recovery approach for degraded-raid1 (with offload to Single)
>  requires balance. Its kind of known.

I'd call such "well-known" behavior BS.

All other raid1 implementation can accept single device RAID1 and
resilver itself with more device into a full RAID1 setup.

But for BTRFS you're calling SINGLE profile "well-known"?
It's "well-known" because it's not working properly, that's why I'm
trying to fix it.

> 
> Thanks, Anand
> 
> 
>> [CAUSE]
>> The cause is pretty simple, when mounted degraded, missing devices can't
>> be used for chunk allocation.
>> Thus btrfs has to fall back to SINGLE profile.
>>
>> [ENHANCEMENT]
>> To avoid such problem, this patch will:
>> - Make all profiler reducer/updater to consider missing devices as part
>>    of num_devices
>> - Make chunk allocator to fallback to missing_list as last resort
>>
>> If we have enough rw_devices, then go regular chunk allocation code.
> 
>> This can avoid allocating degraded chunks.
>> E.g. for 3 devices RAID1 degraded mount, we will use the 2 existing
>> devices to allocate chunk, avoid degraded chunk.
> 
>> But if we don't have enough rw_devices, then we check missing devices to
>> allocate degraded chunks.
>> E.g. for 2 devices RAID1 degraded mount, we have to allocate degraded
>> chunks to keep the RAID1 profile.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/block-group.c | 10 +++++++---
>>   fs/btrfs/volumes.c     | 18 +++++++++++++++---
>>   2 files changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index bf7e3f23bba7..1686fd31679b 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -52,11 +52,13 @@ static u64 get_restripe_target(struct
>> btrfs_fs_info *fs_info, u64 flags)
>>    */
>>   static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info,
>> u64 flags)
>>   {
>> -    u64 num_devices = fs_info->fs_devices->rw_devices;
>> +    u64 num_devices;
>>       u64 target;
>>       u64 raid_type;
>>       u64 allowed = 0;
>>   +    num_devices = fs_info->fs_devices->rw_devices +
>> +              fs_info->fs_devices->missing_devices;
>>       /*
>>        * See if restripe for this chunk_type is in progress, if so try to
>>        * reduce to the target profile
>> @@ -1986,7 +1988,8 @@ static u64 update_block_group_flags(struct
>> btrfs_fs_info *fs_info, u64 flags)
>>       if (stripped)
>>           return extended_to_chunk(stripped);
>>   -    num_devices = fs_info->fs_devices->rw_devices;
>> +    num_devices = fs_info->fs_devices->rw_devices +
>> +              fs_info->fs_devices->missing_devices;
>>         stripped = BTRFS_BLOCK_GROUP_RAID0 |
>> BTRFS_BLOCK_GROUP_RAID56_MASK |
>>           BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10;
>> @@ -2981,7 +2984,8 @@ static u64 get_profile_num_devs(struct
>> btrfs_fs_info *fs_info, u64 type)
>>         num_dev =
>> btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)].devs_max;
>>       if (!num_dev)
>> -        num_dev = fs_info->fs_devices->rw_devices;
>> +        num_dev = fs_info->fs_devices->rw_devices +
>> +              fs_info->fs_devices->missing_devices;
>>         return num_dev;
>>   }
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index a462d8de5d2a..4dee1974ceb7 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5052,8 +5052,9 @@ static int __btrfs_alloc_chunk(struct
>> btrfs_trans_handle *trans,
>>       max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
>>                    max_chunk_size);
>>   -    devices_info = kcalloc(fs_devices->rw_devices,
>> sizeof(*devices_info),
>> -                   GFP_NOFS);
>> +    devices_info = kcalloc(fs_devices->rw_devices +
>> +                   fs_devices->missing_devices,
>> +                   sizeof(*devices_info), GFP_NOFS);
>>       if (!devices_info)
>>           return -ENOMEM;
>>   @@ -5067,7 +5068,18 @@ static int __btrfs_alloc_chunk(struct
>> btrfs_trans_handle *trans,
>>               max_stripe_size, dev_stripes);
>>       if (ret < 0)
>>           goto error;
>> -
>> +    /*
>> +     * If rw devices can't fullfil the request, fallback to missing
>> devices
>> +     * as last resort.
>> +     */
>> +    if (ndevs < devs_min) {
>> +        ret = gather_dev_holes(info, devices_info + ndevs, &ndevs,
>> +                &fs_devices->missing_list,
>> +                fs_devices->missing_devices,
>> +                max_stripe_size, dev_stripes);
>> +        if (ret < 0)
>> +            goto error;
>> +    }
>>       /*
>>        * now sort the devices by hole size / available space
>>        */
>>
> 

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]