All of lore.kernel.org
 help / color / mirror / Atom feed
From: Goffredo Baroncelli <kreijack@libero.it>
To: Hans van Kranenburg <hans@knorrie.org>, linux-btrfs@vger.kernel.org
Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.cz>,
	Sinnamohideen Shafeeq <shafeeqs@panasas.com>,
	Goffredo Baroncelli <kreijack@inwind.it>
Subject: Re: [PATCH 4/4] btrfs: add allocator_hint mode
Date: Fri, 17 Dec 2021 19:28:28 +0100	[thread overview]
Message-ID: <5767702c-665f-d1a1-ea65-12eb1db96c51@libero.it> (raw)
In-Reply-To: <0fbfde93-3a00-7f20-8891-dd0fa798676e@knorrie.org>

On 12/17/21 16:58, Hans van Kranenburg wrote:
> Hi,
> 
> On 10/24/21 5:31 PM, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> When this mode is enabled, the chunk allocation policy is modified as
>> follow.
>>
>> Each disk may have a different tag:
>> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
>> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
>>
>> Where:
>> - ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
>> the X chunk type (the other type may be allowed when the space is low)
>> - ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
>> This means also that it is a preferred choice.
>>
>> Each time the allocator allocates a chunk of type X , first it takes the
>> disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
>> is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
>> if the space is not enough, it uses also the other disks, with the
>> exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
>> type of chunk (i.e. not X).
> 
> I read this last paragraph for 5 times now, I think, but I still have no
> idea what's going on here. Can you rephrase it? It's more complex than
> the actual code below.
> 
> What the above text tells me is that if I start filling up my disks, it
> will happily write DATA to METADATA_ONLY disks when the DATA ones are
> full? And if the METADATA_ONLY disks are full with DATA also, it will
> not write DATA into PREFERRED_METADATA disks.
> 
> I don't think that's what the code actually does. At least, I hope not.
> 
Thanks for reviewing it: my English is very bad :-(

Yes, I wrote ALLOCATION_METADATA_ONLY when I would write ALLOCATION_PREFERRED_Y.

Let to me to rewrite it as following:

-----------------------------
The chunk allocation policy is modified as follow.

Each disk may have one of the following tags:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

During a *mixed data/metadata* chunk allocation, BTRFS works as
usual.

During a *data* chunk allocation, the space are searched first in
BTRFS_DEV_ALLOCATION_DATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_DATA
tagged disks. If no space is found or the space found is not enough (eg.
in raid5, only two disks are available), then also the disks tagged
BTRFS_DEV_ALLOCATION_PREFERRED_METADATA are evaluated. If even in this
case this the space is not sufficient, -ENOSPC is raised.
A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
for a data BG allocation.

During a *metadata* chunk allocation, the space are searched first in
BTRFS_DEV_ALLOCATION_METADATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
tagged disks. If no space is found or the space found is not enough (eg.
in raid5, only two disks are available), then also the disks tagged
BTRFS_DEV_ALLOCATION_PREFERRED_DATA are considered. If even in this
case this the space is not sufficient, -ENOSPC is raised.
A disk tagged with BTRFS_DEV_ALLOCATION_DATA_ONLY is never considered
for a metadata BG allocation.

By default the disks are tagged as BTRFS_DEV_ALLOCATION_PREFERRED_DATA,
so the default behavior happens. If the user prefer to store the
metadata in the faster disks (e.g. the SSD), he can tag these with
BTRFS_DEV_ALLOCATION_PREFERRED_DATA: in this case the data BG go in the
BTRFS_DEV_ALLOCATION_PREFERRED_DATA disks and the metadata BG in the
others, until there is enough space. Only if one disks set is filled,
the other is occupied.

WARNING: if the user tags a disk with BTRFS_DEV_ALLOCATION_DATA_ONLY,
this means that this disk will never be used for allocating metadata
increasing the likelihood of exhausting the metadata space.

---------------------------------



>> Where:
>> - ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
>> the X chunk type (the other type may be allowed when the space is low)
>> - ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
>> This means also that it is a preferred choice.
>>
>> Each time the allocator allocates a chunk of type X , first it takes the
>> disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
>> is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
>> if the space is not enough, it uses also the other disks, with the
>> exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
>> type of chunk (i.e. not X).

> Hans
> 
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>> ---
>>   fs/btrfs/volumes.c | 98 +++++++++++++++++++++++++++++++++++++++++++++-
>>   fs/btrfs/volumes.h |  1 +
>>   2 files changed, 98 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 8ac99771f43c..7ee9c6e7bd44 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -153,6 +153,20 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>>   	},
>>   };
>>   
>> +#define BTRFS_DEV_ALLOCATION_MASK ((1ULL << \
>> +		BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1)
>> +#define BTRFS_DEV_ALLOCATION_MASK_COUNT (1ULL << \
>> +		BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT)
>> +
>> +static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
>> +	[BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
>> +	[BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
>> +	[BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
>> +	[BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
>> +	/* the other values are set to 0 */
>> +};
>> +
>> +
>>   const char *btrfs_bg_type_to_raid_name(u64 flags)
>>   {
>>   	const int index = btrfs_bg_flags_to_raid_index(flags);
>> @@ -4997,13 +5011,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>   }
>>   
>>   /*
>> - * sort the devices in descending order by max_avail, total_avail
>> + * sort the devices in descending order by alloc_hint,
>> + * max_avail, total_avail
>>    */
>>   static int btrfs_cmp_device_info(const void *a, const void *b)
>>   {
>>   	const struct btrfs_device_info *di_a = a;
>>   	const struct btrfs_device_info *di_b = b;
>>   
>> +	if (di_a->alloc_hint > di_b->alloc_hint)
>> +		return -1;
>> +	if (di_a->alloc_hint < di_b->alloc_hint)
>> +		return 1;
>>   	if (di_a->max_avail > di_b->max_avail)
>>   		return -1;
>>   	if (di_a->max_avail < di_b->max_avail)
>> @@ -5166,6 +5185,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   	int ndevs = 0;
>>   	u64 max_avail;
>>   	u64 dev_offset;
>> +	int hint;
>> +	int i;
>>   
>>   	/*
>>   	 * in the first pass through the devices list, we gather information
>> @@ -5218,16 +5239,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   		devices_info[ndevs].max_avail = max_avail;
>>   		devices_info[ndevs].total_avail = total_avail;
>>   		devices_info[ndevs].dev = device;
>> +
>> +		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
>> +		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
>> +			/*
>> +			 * if mixed bg set all the alloc_hint
>> +			 * fields to the same value, so the sorting
>> +			 * is not affected
>> +			 */
>> +			devices_info[ndevs].alloc_hint = 0;
>> +		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
>> +			hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
>> +
>> +			/*
>> +			 * skip BTRFS_DEV_METADATA_ONLY disks
>> +			 */
>> +			if (hint == BTRFS_DEV_ALLOCATION_METADATA_ONLY)
>> +				continue;
>> +			/*
>> +			 * if a data chunk must be allocated,
>> +			 * sort also by hint (data disk
>> +			 * higher priority)
>> +			 */
>> +			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
>> +		} else { /* BTRFS_BLOCK_GROUP_METADATA */
>> +			hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
>> +
>> +			/*
>> +			 * skip BTRFS_DEV_DATA_ONLY disks
>> +			 */
>> +			if (hint == BTRFS_DEV_ALLOCATION_DATA_ONLY)
>> +				continue;
>> +			/*
>> +			 * if a data chunk must be allocated,
>> +			 * sort also by hint (metadata hint
>> +			 * higher priority)
>> +			 */
>> +			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
>> +		}
>> +
>>   		++ndevs;
>>   	}
>>   	ctl->ndevs = ndevs;
>>   
>> +	/*
>> +	 * no devices available
>> +	 */
>> +	if (!ndevs)
>> +		return 0;
>> +
>>   	/*
>>   	 * now sort the devices by hole size / available space
>>   	 */
>>   	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>   	     btrfs_cmp_device_info, NULL);
>>   
>> +	/*
>> +	 * select the minimum set of disks grouped by hint that
>> +	 * can host the chunk
>> +	 */
>> +	ndevs = 0;
>> +	while (ndevs < ctl->ndevs) {
>> +		hint = devices_info[ndevs++].alloc_hint;
>> +		while (ndevs < ctl->ndevs &&
>> +		       devices_info[ndevs].alloc_hint == hint)
>> +				ndevs++;
>> +		if (ndevs >= ctl->devs_min)
>> +			break;
>> +	}
>> +
>> +	BUG_ON(ndevs > ctl->ndevs);
>> +	ctl->ndevs = ndevs;
>> +
>> +	/*
>> +	 * the next layers require the devices_info ordered by
>> +	 * max_avail. If we are returing two (or more) different
>> +	 * group of alloc_hint, this is not always true. So sort
>> +	 * these gain.
>> +	 */
>> +
>> +	for (i = 0 ; i < ndevs ; i++)
>> +		devices_info[i].alloc_hint = 0;
>> +
>> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +	     btrfs_cmp_device_info, NULL);
>> +
>>   	return 0;
>>   }
>>   
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index b8250f29df6e..37eb37b533c5 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -369,6 +369,7 @@ struct btrfs_device_info {
>>   	u64 dev_offset;
>>   	u64 max_avail;
>>   	u64 total_avail;
>> +	int alloc_hint;
>>   };
>>   
>>   struct btrfs_raid_attr {
>>
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

  reply	other threads:[~2021-12-17 18:28 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-24 15:31 [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Goffredo Baroncelli
2021-10-24 15:31 ` [PATCH 1/4] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
2021-10-24 15:31 ` [PATCH 2/4] btrfs: export dev_item.type in /sys/fs/btrfs/<uuid>/devinfo/<devid>/type Goffredo Baroncelli
2021-10-24 15:31 ` [PATCH 3/4] btrfs: change the DEV_ITEM 'type' field via sysfs Goffredo Baroncelli
2021-10-24 15:31 ` [PATCH 4/4] btrfs: add allocator_hint mode Goffredo Baroncelli
2021-12-17 15:58   ` Hans van Kranenburg
2021-12-17 18:28     ` Goffredo Baroncelli [this message]
2021-12-17 19:41       ` Zygo Blaxell
2021-12-18  9:07         ` Goffredo Baroncelli
2021-12-18 22:48           ` Zygo Blaxell
2021-12-19  0:03             ` Graham Cobb
2021-12-19  2:30               ` Zygo Blaxell
2021-12-13  9:39 ` [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Paul Jones
2021-12-13 19:54   ` Goffredo Baroncelli
2021-12-13 21:15     ` Josef Bacik
2021-12-13 22:49       ` Zygo Blaxell
2021-12-14 14:31         ` Josef Bacik
2021-12-14 19:03         ` Goffredo Baroncelli
2021-12-14 20:04           ` Zygo Blaxell
2021-12-14 20:34             ` Josef Bacik
2021-12-14 20:41               ` Goffredo Baroncelli
2021-12-15 13:58                 ` Josef Bacik
2021-12-15 18:53                   ` Goffredo Baroncelli
2021-12-16  0:56                     ` Josef Bacik
2021-12-17  5:40                       ` Zygo Blaxell
2021-12-17 14:48                         ` Josef Bacik
2021-12-17 16:31                           ` Zygo Blaxell
2021-12-17 18:08                         ` Goffredo Baroncelli
2021-12-16  2:30                   ` Paul Jones
2021-12-14  1:03       ` Sinnamohideen, Shafeeq
2021-12-14 18:53       ` Goffredo Baroncelli
2021-12-14 20:35         ` Josef Bacik
     [not found] <cover.1614028083.git.kreijack@inwind.it>
2021-02-22 21:19 ` [PATCH 4/4] btrfs: add allocator_hint mode Goffredo Baroncelli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5767702c-665f-d1a1-ea65-12eb1db96c51@libero.it \
    --to=kreijack@libero.it \
    --cc=ce3g8jdj@umail.furryterror.org \
    --cc=dsterba@suse.cz \
    --cc=hans@knorrie.org \
    --cc=josef@toxicpanda.com \
    --cc=kreijack@inwind.it \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=shafeeqs@panasas.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.