On 2019/11/19 6:05 PM, Anand Jain wrote:
> On 11/7/19 2:27 PM, Qu Wenruo wrote:
>> [PROBLEM]
>> Btrfs degraded mount will fall back to the SINGLE profile if there are
>> not enough devices:
>
>  It's better to keep it like this for now until there is a fix for the
>  write hole. Otherwise hitting the write hole bug in the degraded
>  raid1 case will be more prevalent.

The write hole is a problem for RAID5/6, not for the degraded chunk
feature itself.

Furthermore, this design tries to avoid allocating chunks on missing
devices. So even for a 3-device RAID5 mounted degraded, new chunks will
be allocated using only the existing devices (as 2-device RAID5), and no
new write hole is introduced.

>
>  I proposed an RFC a long time ago [1] (also in there, there is a
>  commit id which turned the degraded raid1 profile into the single
>  profile, without much write-up on it).
>
>    [1] [PATCH 0/2] [RFC] btrfs: create degraded-RAID1 chunks

My point for this patchset is:
- Create regular chunks if we have enough devices
- Create degraded chunks only when we don't have enough devices

I guess you missed the point of my preparation patches; without them,
your patches can't avoid missing devices as well as this design does.

>
>  Similarly, the patch related to the reappearing missing device [2]
>  falls under the same category. Will push for the integration after
>  the write hole fix.
>
>    [2] [PATCH] btrfs: handle dynamically reappearing missing device
>    (test case 154).

That's a different case, and I don't see how it affects this feature.

>
>  If you look close enough, the original author has quite nicely made
>  sure the write hole bug will be very difficult to hit. These fixes
>  shall make it easy to hit. So it's better to work on the write hole
>  first.

If you're talking about RAID5/6, you're talking in the wrong thread.
Go implement a write-ahead log for RAID5/6, or mark all degraded
RAID5/6 chunks read-only at mount time.

>
>  I am trying to fix the write hole. The first attempt had limited
>  success (it works fine in two-disk raid1 only). Now trying other ways
>  to fix it.
>
>>   # mkfs.btrfs -f /dev/test/scratch[12] -m raid1 -d raid1
>>   # wipefs -fa /dev/test/scratch2
>>   # mount -o degraded /dev/test/scratch1 /mnt/btrfs
>>   # fallocate -l 1G /mnt/btrfs/foobar
>>   # btrfs ins dump-tree -t chunk /dev/test/scratch1
>>          item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 1674575872) itemoff 15511 itemsize 80
>>                  length 536870912 owner 2 stripe_len 65536 type DATA
>>
>>   New data chunks will fall back to SINGLE.
>>
>> If the user doesn't balance those SINGLE chunks, then even with the
>> missing device replaced, the fs is no longer fully RAID1, and a
>> missing device can break the tolerance.
>
>  As has been discussed quite a lot before, the current
>  re-silver/recovery approach for degraded raid1 (with offload to
>  SINGLE) requires a balance. It's kind of known.

I'd call such "well-known" behavior BS. Every other RAID1
implementation can accept a degraded single-device RAID1 and resilver
itself into a full RAID1 setup once more devices are added. But for
btrfs you're calling the SINGLE fallback "well-known"? It's
"well-known" because it's not working properly; that's why I'm trying
to fix it.

>
> Thanks, Anand
>
>> [CAUSE]
>> The cause is pretty simple: when mounted degraded, missing devices
>> can't be used for chunk allocation.
>> Thus btrfs has to fall back to the SINGLE profile.
>>
>> [ENHANCEMENT]
>> To avoid such problems, this patch will:
>> - Make all profile reducers/updaters consider missing devices as
>>   part of num_devices
>> - Make the chunk allocator fall back to missing_list as a last resort
>>
>> If we have enough rw_devices, then go through the regular chunk
>> allocation code.
>> This can avoid allocating degraded chunks.
>> E.g. for a 3-device RAID1 degraded mount, we will use the 2 existing
>> devices to allocate chunks, avoiding degraded chunks.
>>
>> But if we don't have enough rw_devices, then we check missing devices
>> to allocate degraded chunks.
>> E.g. for a 2-device RAID1 degraded mount, we have to allocate
>> degraded chunks to keep the RAID1 profile.
>>
>> Signed-off-by: Qu Wenruo
>> ---
>>  fs/btrfs/block-group.c | 10 +++++++---
>>  fs/btrfs/volumes.c     | 18 +++++++++++++++---
>>  2 files changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index bf7e3f23bba7..1686fd31679b 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -52,11 +52,13 @@ static u64 get_restripe_target(struct btrfs_fs_info *fs_info, u64 flags)
>>   */
>>  static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info, u64 flags)
>>  {
>> -	u64 num_devices = fs_info->fs_devices->rw_devices;
>> +	u64 num_devices;
>>  	u64 target;
>>  	u64 raid_type;
>>  	u64 allowed = 0;
>>
>> +	num_devices = fs_info->fs_devices->rw_devices +
>> +		      fs_info->fs_devices->missing_devices;
>>  	/*
>>  	 * See if restripe for this chunk_type is in progress, if so try to
>>  	 * reduce to the target profile
>> @@ -1986,7 +1988,8 @@ static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
>>  	if (stripped)
>>  		return extended_to_chunk(stripped);
>>
>> -	num_devices = fs_info->fs_devices->rw_devices;
>> +	num_devices = fs_info->fs_devices->rw_devices +
>> +		      fs_info->fs_devices->missing_devices;
>>
>>  	stripped = BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID56_MASK |
>>  		BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10;
>> @@ -2981,7 +2984,8 @@ static u64 get_profile_num_devs(struct btrfs_fs_info *fs_info, u64 type)
>>
>>  	num_dev = btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)].devs_max;
>>  	if (!num_dev)
>> -		num_dev = fs_info->fs_devices->rw_devices;
>> +		num_dev = fs_info->fs_devices->rw_devices +
>> +			  fs_info->fs_devices->missing_devices;
>>
>>  	return num_dev;
>>  }
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index a462d8de5d2a..4dee1974ceb7 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5052,8 +5052,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>>  	max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
>>  			     max_chunk_size);
>>
>> -	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>> -			       GFP_NOFS);
>> +	devices_info = kcalloc(fs_devices->rw_devices +
>> +			       fs_devices->missing_devices,
>> +			       sizeof(*devices_info), GFP_NOFS);
>>  	if (!devices_info)
>>  		return -ENOMEM;
>>
>> @@ -5067,7 +5068,18 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>>  			max_stripe_size, dev_stripes);
>>  	if (ret < 0)
>>  		goto error;
>> -
>> +	/*
>> +	 * If rw devices can't fulfil the request, fall back to missing
>> +	 * devices as a last resort.
>> +	 */
>> +	if (ndevs < devs_min) {
>> +		ret = gather_dev_holes(info, devices_info + ndevs, &ndevs,
>> +				&fs_devices->missing_list,
>> +				fs_devices->missing_devices,
>> +				max_stripe_size, dev_stripes);
>> +		if (ret < 0)
>> +			goto error;
>> +	}
>>  	/*
>>  	 * now sort the devices by hole size / available space
>>  	 */
>>
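For readers following along, the two-pass policy the volumes.c hunk implements can be sketched as plain user-space C. This is a simplified model, not btrfs code: `struct dev_info`, `pick_devices()`, and its fields are all hypothetical stand-ins for the kernel's device list and `gather_dev_holes()` calls.

```c
/* Hypothetical stand-in for per-device allocator state (not btrfs code). */
struct dev_info {
	int missing;	/* 1 if the device is absent under a degraded mount */
	int picked;	/* set once the allocator decides to stripe onto it */
};

/*
 * Model of the allocation policy in the patch: gather stripes from
 * present (rw) devices first; only if fewer than devs_min were found,
 * fall back to missing devices so the chunk keeps its RAID profile
 * (a degraded chunk) instead of dropping to SINGLE.
 * Returns the number of devices selected, or -1 if even the fallback
 * cannot satisfy devs_min.
 */
static int pick_devices(struct dev_info *devs, int ndevs_total,
			int devs_min, int devs_max)
{
	int picked = 0;

	/* Pass 1: rw devices only, like gather_dev_holes() on the rw list. */
	for (int i = 0; i < ndevs_total && picked < devs_max; i++) {
		if (!devs[i].missing) {
			devs[i].picked = 1;
			picked++;
		}
	}

	/* Pass 2: last resort, allow missing devices (degraded chunk). */
	if (picked < devs_min) {
		for (int i = 0; i < ndevs_total && picked < devs_max; i++) {
			if (devs[i].missing && !devs[i].picked) {
				devs[i].picked = 1;
				picked++;
			}
		}
	}
	return picked >= devs_min ? picked : -1;
}
```

With RAID1's devs_min = devs_max = 2, a 3-device degraded mount (one device missing) is satisfied by pass 1 alone, so no degraded chunk is created; a 2-device degraded mount reaches pass 2 and produces a degraded RAID1 chunk rather than a SINGLE one, matching the commit message's two examples.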