From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Anand Jain <anand.jain@oracle.com>, Qu Wenruo <wqu@suse.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 3/3] btrfs: volumes: Allocate degraded chunks if rw devices can't fullfil a chunk
Date: Tue, 19 Nov 2019 18:41:49 +0800	[thread overview]
Message-ID: <f928122d-4e77-e83b-9a53-d2eea7ee16d3@gmx.com> (raw)
In-Reply-To: <6cc25dbd-55e4-43bb-7b95-86c62bee27c7@oracle.com>


On 2019/11/19 6:05 PM, Anand Jain wrote:
> On 11/7/19 2:27 PM, Qu Wenruo wrote:
>> [PROBLEM]
>> Btrfs degraded mount will fallback to SINGLE profile if there are not
>> enough devices:
> 
>  It's better to keep it like this for now until there is a fix for the
>  write hole. Otherwise hitting the write hole bug in the degraded raid1
>  case will be more prevalent.

The write hole is a problem for RAID5/6, not for the degraded chunk
feature itself.

Furthermore, this design tries to avoid allocating chunks on missing
devices.
So even for a 3-device RAID5, new chunks will be allocated using the
existing devices (as a 2-device RAID5), so no new write hole is introduced.

> 
>  I proposed an RFC a long time ago [1] (in there, there is also the
>  commit id which turned the degraded raid1 profile into the single
>  profile, without much write-up on it).
> 
>    [1] [PATCH 0/2] [RFC] btrfs: create degraded-RAID1 chunks

My point for this patchset is:
- Create regular chunks if we have enough devices
- Create degraded chunks only when we don't have enough devices
  (a quick check is sketched right below)
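
As an illustrative check of that second point (after the mkfs and wipefs
steps of the reproducer quoted further down; the RAID1 result is the
intended behaviour of this patchset, not something re-verified here):

  # mount -o degraded /dev/test/scratch1 /mnt/btrfs
  # fallocate -l 1G /mnt/btrfs/foobar
  # btrfs filesystem df /mnt/btrfs

With degraded chunks allowed, Data should stay reported as RAID1 here
instead of falling back to single.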

I guess since you didn't get the point of my preparation patches, your
patches aren't that good at avoiding missing devices.

> 
>  Similarly, the patch related to the reappearing missing device [2]
>  falls under the same category. I will push for its integration after
>  the write hole fix.
> 
>    [2] [PATCH] btrfs: handle dynamically reappearing missing device
>    (test case 154).

That's a separate case, and I don't see how it affects this feature.

> 
>  If you look closely enough, the original author has quite nicely made
>  sure the write hole bug is very difficult to hit. These fixes
>  will make it easier to hit. So it's better to work on the write hole
>  first.

If you're talking about RAID5/6, you're talking in the wrong thread.
Go implement a write-ahead log for RAID5/6, or mark all degraded
RAID5/6 chunks read-only at mount time.

> 
>  I am trying to fix the write hole. The first attempt had limited success
>  (it works fine only for two-disk raid1). Now I am trying other ways to fix it.
> 
>>   # mkfs.btrfs -f /dev/test/scratch[12] -m raid1 -d raid1
>>   # wipefs -fa /dev/test/scratch2
>>   # mount -o degraded /dev/test/scratch1 /mnt/btrfs
>>   # fallocate -l 1G /mnt/btrfs/foobar
>>   # btrfs ins dump-tree -t chunk /dev/test/scratch1
>>          item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 1674575872) itemoff 15511 itemsize 80
>>                  length 536870912 owner 2 stripe_len 65536 type DATA
>>   The new data chunk will fall back to SINGLE.
>>
>> If the user doesn't balance those SINGLE chunks, then even with the missing
>> devices replaced, the fs is no longer fully RAID1, and a missing device can
>> break the failure tolerance.
> 
>  As has been discussed quite a lot before, the current
>  re-silver/recovery approach for degraded raid1 (with the offload to SINGLE)
>  requires a balance. It's kind of known.

I'd call such "well-known" behavior BS.

Every other raid1 implementation can accept a single-device RAID1 and
resilver itself into a full RAID1 setup once more devices are added.

But for BTRFS you're calling the SINGLE profile fallback "well-known"?
It's "well-known" because it's not working properly; that's why I'm
trying to fix it.
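
For context, what "requires balance" means in practice today is roughly
the following (devid 2 and /dev/test/scratch3 are only example values
for the missing device and its replacement):

  # btrfs replace start 2 /dev/test/scratch3 /mnt/btrfs
  # btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/btrfs
  # btrfs filesystem df /mnt/btrfs

The replace restores the device count, the balance with the soft filter
rewrites only the chunks that are not RAID1 yet, and the final df just
confirms no single chunks are left. Keeping the RAID1 profile for new
chunks is meant to make that convert step unnecessary.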


> 
> Thanks, Anand
> 
> 
>> [CAUSE]
>> The cause is pretty simple: when mounted degraded, missing devices can't
>> be used for chunk allocation.
>> Thus btrfs has to fall back to the SINGLE profile.
>>
>> [ENHANCEMENT]
>> To avoid this problem, this patch will:
>> - Make all profile reducers/updaters consider missing devices as part
>>    of num_devices
>> - Make the chunk allocator fall back to missing_list as a last resort
>>
>> If we have enough rw_devices, then go through the regular chunk allocation code.
>
>> This can avoid allocating degraded chunks.
>> E.g. for a 3-device RAID1 degraded mount, we will use the 2 existing
>> devices to allocate the chunk, avoiding a degraded chunk.
>
>> But if we don't have enough rw_devices, then we check missing devices to
>> allocate degraded chunks.
>> E.g. for a 2-device RAID1 degraded mount, we have to allocate degraded
>> chunks to keep the RAID1 profile.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/block-group.c | 10 +++++++---
>>   fs/btrfs/volumes.c     | 18 +++++++++++++++---
>>   2 files changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index bf7e3f23bba7..1686fd31679b 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -52,11 +52,13 @@ static u64 get_restripe_target(struct btrfs_fs_info *fs_info, u64 flags)
>>    */
>>   static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info, u64 flags)
>>   {
>> -    u64 num_devices = fs_info->fs_devices->rw_devices;
>> +    u64 num_devices;
>>       u64 target;
>>       u64 raid_type;
>>       u64 allowed = 0;
>>  
>> +    num_devices = fs_info->fs_devices->rw_devices +
>> +              fs_info->fs_devices->missing_devices;
>>       /*
>>        * See if restripe for this chunk_type is in progress, if so try to
>>        * reduce to the target profile
>> @@ -1986,7 +1988,8 @@ static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
>>       if (stripped)
>>           return extended_to_chunk(stripped);
>>  
>> -    num_devices = fs_info->fs_devices->rw_devices;
>> +    num_devices = fs_info->fs_devices->rw_devices +
>> +              fs_info->fs_devices->missing_devices;
>>  
>>       stripped = BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID56_MASK |
>>           BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10;
>> @@ -2981,7 +2984,8 @@ static u64 get_profile_num_devs(struct btrfs_fs_info *fs_info, u64 type)
>>  
>>       num_dev = btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)].devs_max;
>>       if (!num_dev)
>> -        num_dev = fs_info->fs_devices->rw_devices;
>> +        num_dev = fs_info->fs_devices->rw_devices +
>> +              fs_info->fs_devices->missing_devices;
>>  
>>       return num_dev;
>>   }
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index a462d8de5d2a..4dee1974ceb7 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5052,8 +5052,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>>       max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
>>                    max_chunk_size);
>>  
>> -    devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>> -                   GFP_NOFS);
>> +    devices_info = kcalloc(fs_devices->rw_devices +
>> +                   fs_devices->missing_devices,
>> +                   sizeof(*devices_info), GFP_NOFS);
>>       if (!devices_info)
>>           return -ENOMEM;
>>  
>> @@ -5067,7 +5068,18 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>>               max_stripe_size, dev_stripes);
>>       if (ret < 0)
>>           goto error;
>> -
>> +    /*
>> +     * If rw devices can't fullfil the request, fallback to missing devices
>> +     * as last resort.
>> +     */
>> +    if (ndevs < devs_min) {
>> +        ret = gather_dev_holes(info, devices_info + ndevs, &ndevs,
>> +                &fs_devices->missing_list,
>> +                fs_devices->missing_devices,
>> +                max_stripe_size, dev_stripes);
>> +        if (ret < 0)
>> +            goto error;
>> +    }
>>       /*
>>        * now sort the devices by hole size / available space
>>        */
>>
> 




Thread overview: 26+ messages
2019-11-07  6:27 [PATCH 0/3] btrfs: More intelligent degraded chunk allocator Qu Wenruo
2019-11-07  6:27 ` [PATCH 1/3] btrfs: volumes: Refactor device holes gathering into a separate function Qu Wenruo
2019-11-07  9:20   ` Johannes Thumshirn
2019-11-07  9:33     ` Qu Wenruo
2019-11-07  9:45       ` Johannes Thumshirn
2019-11-07  6:27 ` [PATCH 2/3] btrfs: volumes: Add btrfs_fs_devices::missing_list to collect missing devices Qu Wenruo
2019-11-07  9:31   ` Johannes Thumshirn
2019-11-19 10:03   ` Anand Jain
2019-11-19 10:29     ` Qu Wenruo
2019-11-27 19:36       ` David Sterba
2019-11-07  6:27 ` [PATCH 3/3] btrfs: volumes: Allocate degraded chunks if rw devices can't fullfil a chunk Qu Wenruo
2019-11-19 10:05   ` Anand Jain
2019-11-19 10:41     ` Qu Wenruo [this message]
2019-11-27 19:23       ` David Sterba
2019-11-27 23:36         ` Qu Wenruo
2019-11-28 11:24           ` David Sterba
2019-11-28 12:29             ` Qu Wenruo
2019-11-28 12:30             ` Qu WenRuo
2019-11-28 12:39               ` Qu Wenruo
2019-11-18 20:18 ` [PATCH 0/3] btrfs: More intelligent degraded chunk allocator David Sterba
2019-11-18 23:32   ` Qu Wenruo
2019-11-19  5:18     ` Alberto Bursi
2019-11-27 19:26       ` David Sterba
2019-12-02  3:22     ` Zygo Blaxell
2019-12-02  4:41       ` Qu Wenruo
2019-12-02 19:27         ` Zygo Blaxell
