linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Goffredo Baroncelli <kreijack@libero.it>
To: Hans van Kranenburg <hans@knorrie.org>, linux-btrfs@vger.kernel.org
Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.cz>,
	Sinnamohideen Shafeeq <shafeeqs@panasas.com>,
	Goffredo Baroncelli <kreijack@inwind.it>
Subject: Re: [PATCH 4/4] btrfs: add allocator_hint mode
Date: Fri, 17 Dec 2021 19:28:28 +0100	[thread overview]
Message-ID: <5767702c-665f-d1a1-ea65-12eb1db96c51@libero.it> (raw)
In-Reply-To: <0fbfde93-3a00-7f20-8891-dd0fa798676e@knorrie.org>

On 12/17/21 16:58, Hans van Kranenburg wrote:
> Hi,
> 
> On 10/24/21 5:31 PM, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> When this mode is enabled, the chunk allocation policy is modified as
>> follow.
>>
>> Each disk may have a different tag:
>> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
>> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
>>
>> Where:
>> - ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
>> the X chunk type (the other type may be allowed when the space is low)
>> - ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
>> This means also that it is a preferred choice.
>>
>> Each time the allocator allocates a chunk of type X , first it takes the
>> disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
>> is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
>> if the space is not enough, it uses also the other disks, with the
>> exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
>> type of chunk (i.e. not X).
> 
> I read this last paragraph for 5 times now, I think, but I still have no
> idea what's going on here. Can you rephrase it? It's more complex than
> the actual code below.
> 
> What the above text tells me is that if I start filling up my disks, it
> will happily write DATA to METADATA_ONLY disks when the DATA ones are
> full? And if the METADATA_ONLY disks are full with DATA also, it will
> not write DATA into PREFERRED_METADATA disks.
> 
> I don't think that's what the code actually does. At least, I hope not.
> 
Thanks for reviewing it: my English is very bad :-(

Yes, I wrote ALLOCATION_METADATA_ONLY when I would write ALLOCATION_PREFERRED_Y.

Let to me to rewrite it as following:

-----------------------------
The chunk allocation policy is modified as follow.

Each disk may have one of the following tags:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

During a *mixed data/metadata* chunk allocation, BTRFS works as
usual.

During a *data* chunk allocation, the space are searched first in
BTRFS_DEV_ALLOCATION_DATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_DATA
tagged disks. If no space is found or the space found is not enough (eg.
in raid5, only two disks are available), then also the disks tagged
BTRFS_DEV_ALLOCATION_PREFERRED_METADATA are evaluated. If even in this
case this the space is not sufficient, -ENOSPC is raised.
A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
for a data BG allocation.

During a *metadata* chunk allocation, the space are searched first in
BTRFS_DEV_ALLOCATION_METADATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
tagged disks. If no space is found or the space found is not enough (eg.
in raid5, only two disks are available), then also the disks tagged
BTRFS_DEV_ALLOCATION_PREFERRED_DATA are considered. If even in this
case this the space is not sufficient, -ENOSPC is raised.
A disk tagged with BTRFS_DEV_ALLOCATION_DATA_ONLY is never considered
for a metadata BG allocation.

By default the disks are tagged as BTRFS_DEV_ALLOCATION_PREFERRED_DATA,
so the default behavior happens. If the user prefer to store the
metadata in the faster disks (e.g. the SSD), he can tag these with
BTRFS_DEV_ALLOCATION_PREFERRED_DATA: in this case the data BG go in the
BTRFS_DEV_ALLOCATION_PREFERRED_DATA disks and the metadata BG in the
others, until there is enough space. Only if one disks set is filled,
the other is occupied.

WARNING: if the user tags a disk with BTRFS_DEV_ALLOCATION_DATA_ONLY,
this means that this disk will never be used for allocating metadata
increasing the likelihood of exhausting the metadata space.

---------------------------------



>> Where:
>> - ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
>> the X chunk type (the other type may be allowed when the space is low)
>> - ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
>> This means also that it is a preferred choice.
>>
>> Each time the allocator allocates a chunk of type X , first it takes the
>> disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
>> is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
>> if the space is not enough, it uses also the other disks, with the
>> exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
>> type of chunk (i.e. not X).

> Hans
> 
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>> ---
>>   fs/btrfs/volumes.c | 98 +++++++++++++++++++++++++++++++++++++++++++++-
>>   fs/btrfs/volumes.h |  1 +
>>   2 files changed, 98 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 8ac99771f43c..7ee9c6e7bd44 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -153,6 +153,20 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>>   	},
>>   };
>>   
>> +#define BTRFS_DEV_ALLOCATION_MASK ((1ULL << \
>> +		BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1)
>> +#define BTRFS_DEV_ALLOCATION_MASK_COUNT (1ULL << \
>> +		BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT)
>> +
>> +static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
>> +	[BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
>> +	[BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
>> +	[BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
>> +	[BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
>> +	/* the other values are set to 0 */
>> +};
>> +
>> +
>>   const char *btrfs_bg_type_to_raid_name(u64 flags)
>>   {
>>   	const int index = btrfs_bg_flags_to_raid_index(flags);
>> @@ -4997,13 +5011,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>   }
>>   
>>   /*
>> - * sort the devices in descending order by max_avail, total_avail
>> + * sort the devices in descending order by alloc_hint,
>> + * max_avail, total_avail
>>    */
>>   static int btrfs_cmp_device_info(const void *a, const void *b)
>>   {
>>   	const struct btrfs_device_info *di_a = a;
>>   	const struct btrfs_device_info *di_b = b;
>>   
>> +	if (di_a->alloc_hint > di_b->alloc_hint)
>> +		return -1;
>> +	if (di_a->alloc_hint < di_b->alloc_hint)
>> +		return 1;
>>   	if (di_a->max_avail > di_b->max_avail)
>>   		return -1;
>>   	if (di_a->max_avail < di_b->max_avail)
>> @@ -5166,6 +5185,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   	int ndevs = 0;
>>   	u64 max_avail;
>>   	u64 dev_offset;
>> +	int hint;
>> +	int i;
>>   
>>   	/*
>>   	 * in the first pass through the devices list, we gather information
>> @@ -5218,16 +5239,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   		devices_info[ndevs].max_avail = max_avail;
>>   		devices_info[ndevs].total_avail = total_avail;
>>   		devices_info[ndevs].dev = device;
>> +
>> +		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
>> +		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
>> +			/*
>> +			 * if mixed bg set all the alloc_hint
>> +			 * fields to the same value, so the sorting
>> +			 * is not affected
>> +			 */
>> +			devices_info[ndevs].alloc_hint = 0;
>> +		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
>> +			hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
>> +
>> +			/*
>> +			 * skip BTRFS_DEV_METADATA_ONLY disks
>> +			 */
>> +			if (hint == BTRFS_DEV_ALLOCATION_METADATA_ONLY)
>> +				continue;
>> +			/*
>> +			 * if a data chunk must be allocated,
>> +			 * sort also by hint (data disk
>> +			 * higher priority)
>> +			 */
>> +			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
>> +		} else { /* BTRFS_BLOCK_GROUP_METADATA */
>> +			hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
>> +
>> +			/*
>> +			 * skip BTRFS_DEV_DATA_ONLY disks
>> +			 */
>> +			if (hint == BTRFS_DEV_ALLOCATION_DATA_ONLY)
>> +				continue;
>> +			/*
>> +			 * if a data chunk must be allocated,
>> +			 * sort also by hint (metadata hint
>> +			 * higher priority)
>> +			 */
>> +			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
>> +		}
>> +
>>   		++ndevs;
>>   	}
>>   	ctl->ndevs = ndevs;
>>   
>> +	/*
>> +	 * no devices available
>> +	 */
>> +	if (!ndevs)
>> +		return 0;
>> +
>>   	/*
>>   	 * now sort the devices by hole size / available space
>>   	 */
>>   	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>   	     btrfs_cmp_device_info, NULL);
>>   
>> +	/*
>> +	 * select the minimum set of disks grouped by hint that
>> +	 * can host the chunk
>> +	 */
>> +	ndevs = 0;
>> +	while (ndevs < ctl->ndevs) {
>> +		hint = devices_info[ndevs++].alloc_hint;
>> +		while (ndevs < ctl->ndevs &&
>> +		       devices_info[ndevs].alloc_hint == hint)
>> +				ndevs++;
>> +		if (ndevs >= ctl->devs_min)
>> +			break;
>> +	}
>> +
>> +	BUG_ON(ndevs > ctl->ndevs);
>> +	ctl->ndevs = ndevs;
>> +
>> +	/*
>> +	 * the next layers require the devices_info ordered by
>> +	 * max_avail. If we are returing two (or more) different
>> +	 * group of alloc_hint, this is not always true. So sort
>> +	 * these gain.
>> +	 */
>> +
>> +	for (i = 0 ; i < ndevs ; i++)
>> +		devices_info[i].alloc_hint = 0;
>> +
>> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +	     btrfs_cmp_device_info, NULL);
>> +
>>   	return 0;
>>   }
>>   
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index b8250f29df6e..37eb37b533c5 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -369,6 +369,7 @@ struct btrfs_device_info {
>>   	u64 dev_offset;
>>   	u64 max_avail;
>>   	u64 total_avail;
>> +	int alloc_hint;
>>   };
>>   
>>   struct btrfs_raid_attr {
>>
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

  reply	other threads:[~2021-12-17 18:28 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-24 15:31 [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Goffredo Baroncelli
2021-10-24 15:31 ` [PATCH 1/4] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
2021-10-24 15:31 ` [PATCH 2/4] btrfs: export dev_item.type in /sys/fs/btrfs/<uuid>/devinfo/<devid>/type Goffredo Baroncelli
2021-10-24 15:31 ` [PATCH 3/4] btrfs: change the DEV_ITEM 'type' field via sysfs Goffredo Baroncelli
2021-10-24 15:31 ` [PATCH 4/4] btrfs: add allocator_hint mode Goffredo Baroncelli
2021-12-17 15:58   ` Hans van Kranenburg
2021-12-17 18:28     ` Goffredo Baroncelli [this message]
2021-12-17 19:41       ` Zygo Blaxell
2021-12-18  9:07         ` Goffredo Baroncelli
2021-12-18 22:48           ` Zygo Blaxell
2021-12-19  0:03             ` Graham Cobb
2021-12-19  2:30               ` Zygo Blaxell
2021-12-13  9:39 ` [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Paul Jones
2021-12-13 19:54   ` Goffredo Baroncelli
2021-12-13 21:15     ` Josef Bacik
2021-12-13 22:49       ` Zygo Blaxell
2021-12-14 14:31         ` Josef Bacik
2021-12-14 19:03         ` Goffredo Baroncelli
2021-12-14 20:04           ` Zygo Blaxell
2021-12-14 20:34             ` Josef Bacik
2021-12-14 20:41               ` Goffredo Baroncelli
2021-12-15 13:58                 ` Josef Bacik
2021-12-15 18:53                   ` Goffredo Baroncelli
2021-12-16  0:56                     ` Josef Bacik
2021-12-17  5:40                       ` Zygo Blaxell
2021-12-17 14:48                         ` Josef Bacik
2021-12-17 16:31                           ` Zygo Blaxell
2021-12-17 18:08                         ` Goffredo Baroncelli
2021-12-16  2:30                   ` Paul Jones
2021-12-14  1:03       ` Sinnamohideen, Shafeeq
2021-12-14 18:53       ` Goffredo Baroncelli
2021-12-14 20:35         ` Josef Bacik
     [not found] <cover.1614028083.git.kreijack@inwind.it>
2021-02-22 21:19 ` [PATCH 4/4] btrfs: add allocator_hint mode Goffredo Baroncelli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5767702c-665f-d1a1-ea65-12eb1db96c51@libero.it \
    --to=kreijack@libero.it \
    --cc=ce3g8jdj@umail.furryterror.org \
    --cc=dsterba@suse.cz \
    --cc=hans@knorrie.org \
    --cc=josef@toxicpanda.com \
    --cc=kreijack@inwind.it \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=shafeeqs@panasas.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).