From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Anand Jain <anand.jain@oracle.com>, Qu Wenruo <wqu@suse.com>,
linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 3/3] btrfs: volumes: Allocate degraded chunks if rw devices can't fullfil a chunk
Date: Tue, 19 Nov 2019 18:41:49 +0800
Message-ID: <f928122d-4e77-e83b-9a53-d2eea7ee16d3@gmx.com>
In-Reply-To: <6cc25dbd-55e4-43bb-7b95-86c62bee27c7@oracle.com>
On 2019/11/19 6:05 PM, Anand Jain wrote:
> On 11/7/19 2:27 PM, Qu Wenruo wrote:
>> [PROBLEM]
>> Btrfs degraded mount will fallback to SINGLE profile if there are not
>> enough devices:
>
> It's better to keep it like this for now until there is a fix for the
> write hole. Otherwise hitting the write hole bug in case of degraded
> raid1 will be more prevalent.
Write hole should be a problem for RAID5/6, not the degraded chunk
feature itself.
Furthermore, this design tries to avoid allocating chunks on missing
devices.
So even for a 3-device RAID5, new chunks will be allocated using only the
existing devices (as 2-device RAID5), so no new write hole is introduced.
>
> I proposed an RFC a long time before [1] (also in there, there
> is a commit id which turned the degraded raid1 profile into single
> profile (without much write-up on it)).
>
> [1] [PATCH 0/2] [RFC] btrfs: create degraded-RAID1 chunks
My point for this patchset is:
- Create regular chunk if we have enough devices
- Create degraded chunk only when we don't have enough devices
I guess since you didn't get the point of my preparation patches, your
patches aren't that good at avoiding missing devices.
>
> Similarly the patch related to the reappearing missing device [2]
> falls under the same category. Will push for the integration after
> the write hole fix.
>
> [2] [PATCH] btrfs: handle dynamically reappearing missing device
> (test case 154).
That's another case, and I don't see how it affects this feature.
>
> If you look closely enough, the original author has quite nicely made
> sure the write hole bug will be very difficult to hit. These fixes
> shall make it easy to hit. So it's better to work on the write hole
> first.
If you're talking about RAID5/6, you are talking in the wrong thread.
Go implement some write-ahead log for RAID5/6, or mark all degraded
RAID5/6 chunks read-only at mount time.
>
> I am trying to fix write hole. First attempt has limited success
> (works fine in two disk raid1 only). Now trying other ways to fix.
>
>> # mkfs.btrfs -f /dev/test/scratch[12] -m raid1 -d raid1
>> # wipefs -fa /dev/test/scratch2
>> # mount -o degraded /dev/test/scratch1 /mnt/btrfs
>> # fallocate -l 1G /mnt/btrfs/foobar
>> # btrfs ins dump-tree -t chunk /dev/test/scratch1
>> item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 1674575872) itemoff
>> 15511 itemsize 80
>> length 536870912 owner 2 stripe_len 65536 type DATA
>> New data chunk will fallback to SINGLE.
>>
>> If user doesn't balance those SINGLE chunks, even with missing devices
>> replaced, the fs is no longer full RAID1, and a missing device can break
>> the tolerance.
>
> As it's been discussed quite a few times before, the current
> re-silver/recovery approach for degraded-raid1 (with offload to Single)
> requires balance. It's kind of known.
I'd call such "well-known" behavior BS.
All other RAID1 implementations can accept a single-device RAID1 and
resilver themselves into a full RAID1 setup once more devices are added.
But for BTRFS you're calling SINGLE profile "well-known"?
It's "well-known" because it's not working properly, that's why I'm
trying to fix it.
>
> Thanks, Anand
>
>
>> [CAUSE]
>> The cause is pretty simple, when mounted degraded, missing devices can't
>> be used for chunk allocation.
>> Thus btrfs has to fall back to SINGLE profile.
>>
>> [ENHANCEMENT]
>> To avoid such problem, this patch will:
>> - Make all profile reducers/updaters count missing devices as part
>> of num_devices
>> - Make the chunk allocator fall back to missing_list as a last resort
>>
>> If we have enough rw_devices, then go regular chunk allocation code.
>
>> This can avoid allocating degraded chunks.
>> E.g. for 3 devices RAID1 degraded mount, we will use the 2 existing
>> devices to allocate chunk, avoid degraded chunk.
>
>> But if we don't have enough rw_devices, then we check missing devices to
>> allocate degraded chunks.
>> E.g. for 2 devices RAID1 degraded mount, we have to allocate degraded
>> chunks to keep the RAID1 profile.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/block-group.c | 10 +++++++---
>> fs/btrfs/volumes.c | 18 +++++++++++++++---
>> 2 files changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index bf7e3f23bba7..1686fd31679b 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -52,11 +52,13 @@ static u64 get_restripe_target(struct
>> btrfs_fs_info *fs_info, u64 flags)
>> */
>> static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info,
>> u64 flags)
>> {
>> - u64 num_devices = fs_info->fs_devices->rw_devices;
>> + u64 num_devices;
>> u64 target;
>> u64 raid_type;
>> u64 allowed = 0;
>> + num_devices = fs_info->fs_devices->rw_devices +
>> + fs_info->fs_devices->missing_devices;
>> /*
>> * See if restripe for this chunk_type is in progress, if so try to
>> * reduce to the target profile
>> @@ -1986,7 +1988,8 @@ static u64 update_block_group_flags(struct
>> btrfs_fs_info *fs_info, u64 flags)
>> if (stripped)
>> return extended_to_chunk(stripped);
>> - num_devices = fs_info->fs_devices->rw_devices;
>> + num_devices = fs_info->fs_devices->rw_devices +
>> + fs_info->fs_devices->missing_devices;
>> stripped = BTRFS_BLOCK_GROUP_RAID0 |
>> BTRFS_BLOCK_GROUP_RAID56_MASK |
>> BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10;
>> @@ -2981,7 +2984,8 @@ static u64 get_profile_num_devs(struct
>> btrfs_fs_info *fs_info, u64 type)
>> num_dev =
>> btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)].devs_max;
>> if (!num_dev)
>> - num_dev = fs_info->fs_devices->rw_devices;
>> + num_dev = fs_info->fs_devices->rw_devices +
>> + fs_info->fs_devices->missing_devices;
>> return num_dev;
>> }
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index a462d8de5d2a..4dee1974ceb7 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5052,8 +5052,9 @@ static int __btrfs_alloc_chunk(struct
>> btrfs_trans_handle *trans,
>> max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
>> max_chunk_size);
>> - devices_info = kcalloc(fs_devices->rw_devices,
>> sizeof(*devices_info),
>> - GFP_NOFS);
>> + devices_info = kcalloc(fs_devices->rw_devices +
>> + fs_devices->missing_devices,
>> + sizeof(*devices_info), GFP_NOFS);
>> if (!devices_info)
>> return -ENOMEM;
>> @@ -5067,7 +5068,18 @@ static int __btrfs_alloc_chunk(struct
>> btrfs_trans_handle *trans,
>> max_stripe_size, dev_stripes);
>> if (ret < 0)
>> goto error;
>> -
>> + /*
>> + * If rw devices can't fullfil the request, fallback to missing
>> devices
>> + * as last resort.
>> + */
>> + if (ndevs < devs_min) {
>> + ret = gather_dev_holes(info, devices_info + ndevs, &ndevs,
>> + &fs_devices->missing_list,
>> + fs_devices->missing_devices,
>> + max_stripe_size, dev_stripes);
>> + if (ret < 0)
>> + goto error;
>> + }
>> /*
>> * now sort the devices by hole size / available space
>> */
>>
>