From: Josef Bacik <josef@toxicpanda.com>
To: Qu Wenruo <wqu@suse.com>, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v6 1/5] btrfs: Introduce per-profile available space facility
Date: Thu, 16 Jan 2020 11:14:38 -0500 [thread overview]
Message-ID: <49727617-91d3-9cff-c772-19d7cd371b55@toxicpanda.com> (raw)
In-Reply-To: <20200116060404.95200-2-wqu@suse.com>
On 1/16/20 1:04 AM, Qu Wenruo wrote:
> [PROBLEM]
> Several places in btrfs require an accurate estimate of how many new
> bytes can be allocated from unallocated space.
>
> We have two types of estimation:
> - Factor-based calculation
>   Take all unallocated space and divide by the profile factor.
>   One obvious user is can_overcommit().
>
> - Chunk-allocator-like calculation
>   This emulates the chunk allocator's behavior to get a proper
>   estimation.
>   The only user is btrfs_calc_avail_data_space(), utilized by
>   btrfs_statfs().
>   The problem is that the function is not generic enough and can't
>   handle profiles like RAID5/6.
>
> The current factor-based calculation can't handle the following case:
> devid 1 unallocated: 1T
> devid 2 unallocated: 10T
> metadata type: RAID1
>
> Using the factor, we would report (1T + 10T) / 2 = 5.5T of free space
> for metadata.
> But in fact we can only get 1T of free space, as RAID1 is limited by
> the smallest device.
>
> [SOLUTION]
> This patch will introduce per-profile available space calculation,
> which can give an estimation based on chunk-allocator-like behavior.
>
> The difference between this and the chunk allocator is mostly in
> rounding and in the handling of the [0, 1M) reserved space, which
> shouldn't cause any practical impact.
>
> The newly introduced facility calculates the available space for each
> profile type using a chunk-allocator-like calculation.
>
> With that facility, for the above device layout we get the full
> available space array:
> RAID10: 0 (not enough devices)
> RAID1: 1T
> RAID1C3: 0 (not enough devices)
> RAID1C4: 0 (not enough devices)
> DUP: 5.5T
> RAID0: 2T
> SINGLE: 11T
> RAID5: 1T
> RAID6: 0 (not enough devices)
>
> Or for a more complex example:
> devid 1 unallocated: 1T
> devid 2 unallocated: 1T
> devid 3 unallocated: 10T
>
> We will get an array of:
> RAID10: 0 (not enough devices)
> RAID1: 2T
> RAID1C3: 1T
> RAID1C4: 0 (not enough devices)
> DUP: 6T
> RAID0: 3T
> SINGLE: 12T
> RAID5: 2T
> RAID6: 0 (not enough devices)
>
> And for each profile, we do a chunk-allocator-level calculation.
> The pseudo code looks like:
>
> clear_virtual_used_space_of_all_rw_devices();
> do {
> 	/*
> 	 * As in the chunk allocator, but besides used space we also
> 	 * take virtual used space into consideration.
> 	 */
> 	sort_devices_by_virtual_free_space();
>
> 	/*
> 	 * Unlike the chunk allocator, we don't need to bother with
> 	 * hole/stripe size, so we use the smallest device to make
> 	 * sure we can allocate as many stripes as the regular chunk
> 	 * allocator.
> 	 */
> 	stripe_size = device_with_smallest_free->avail_space;
> 	stripe_size = min(stripe_size, to_alloc / ndevs);
>
> 	/*
> 	 * Allocate a virtual chunk; the allocated virtual chunk
> 	 * increases virtual used space, allowing the next iteration
> 	 * to properly emulate chunk allocator behavior.
> 	 */
> 	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
> 	if (ret == 0)
> 		avail += allocated_size;
> } while (ret == 0);
>
> As the stripe size is limited by the selected device with the least
> free space, while the devices with the most free space are utilized
> first (just like the chunk allocator), for the above 1T + 10T layout
> we allocate a 1T virtual chunk in the first iteration, then run out
> of devices in the next iteration.
>
> Thus we get only 1T of free space for the RAID1 type, exactly what
> the chunk allocator would do.
>
> The patch updates the per-profile available space at the following
> times:
> - Mount time
> - Chunk allocation
> - Chunk removal
> - Device grow
> - Device shrink
>
> All of those paths are protected by chunk_mutex, and we only iterate
> in-memory structures without triggering extra IO, so the performance
> impact should be pretty small.
>
> For the extra error handling, the principle is to keep the old
> behavior.
> That is to say, if the old error handler would just return an error,
> we follow it, no matter whether the caller reverts the device size.
>
> Proper error handling will be added in later patches.
> I don't want to bloat the core facility with error handling code,
> especially since some situations need quite a bit of new code to
> handle errors.
Instead of creating a weird error handling case, why not just set
per_profile_avail to 0 on error? That would simply disable overcommit and
we'd flush more. This way we avoid making a weird situation weirder, and we
don't have to worry about returning an error from calc_one_profile_avail().
Simply say "hey, we got ENOMEM, metadata overcommit is going off" with a
btrfs_err_ratelimited() and carry on. Maybe the next calculation will
succeed and we'll get overcommit turned back on. Thanks,
Josef
Thread overview: 15+ messages
2020-01-16 6:03 [PATCH v6 0/4] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
2020-01-16 6:04 ` [PATCH v6 1/5] btrfs: Introduce per-profile available space facility Qu Wenruo
2020-01-16 16:14 ` Josef Bacik [this message]
2020-01-17 0:55 ` Qu Wenruo
2020-01-17 1:50 ` Josef Bacik
2020-01-17 1:54 ` Qu Wenruo
2020-01-17 2:00 ` Josef Bacik
2020-01-16 6:04 ` [PATCH v6 2/5] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
2020-01-16 6:04 ` [PATCH v6 3/5] btrfs: statfs: Use pre-calculated per-profile available space Qu Wenruo
2020-01-16 16:21 ` Josef Bacik
2020-01-16 6:04 ` [PATCH v6 4/5] btrfs: Reset device size when btrfs_update_device() failed in btrfs_grow_device() Qu Wenruo
2020-01-16 6:04 ` [PATCH v6 5/5] btrfs: volumes: Revert device used bytes when calc_per_profile_avail() failed Qu Wenruo
2020-01-29 5:19 ` [PATCH v6 0/4] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu WenRuo
2020-01-29 9:23 ` Nikolay Borisov
2020-01-29 10:51 ` Qu Wenruo