Re: [PATCH rfc v3] New ioctl BTRFS_IOC_GET_CHUNK_INFO.

From: Goffredo Baroncelli <kreijack@inwind.it>
To: dsterba@suse.cz, linux-btrfs@vger.kernel.org,
	Josef Bacik <josef@toxicpanda.com>
Subject: Re: [PATCH rfc v3] New ioctl BTRFS_IOC_GET_CHUNK_INFO.
Date: Thu, 11 Jun 2020 13:50:09 +0200	[thread overview]
Message-ID: <d75fcfa5-0320-c6c2-13df-329cfdf45eaf@inwind.it> (raw)
In-Reply-To: <20200610203023.GL27795@twin.jikos.cz>

On 6/10/20 10:30 PM, David Sterba wrote:
> On Tue, May 26, 2020 at 10:19:35PM +0200, Goffredo Baroncelli wrote:
>> On 5/25/20 7:14 PM, David Sterba wrote:
>>> I'll start with the data structures
>>>
>>> On Thu, Mar 19, 2020 at 09:39:13PM +0100, Goffredo Baroncelli wrote:
>>>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>>> +struct btrfs_chunk_info_stripe {
>>>> +	__u64 devid;
>>>> +	__u64 offset;
>>>> +	__u8 dev_uuid[BTRFS_UUID_SIZE];
>>>> +};
>>>> +
>>>> +struct btrfs_chunk_info {
>>>> +	/* logical start of this chunk */
>>>> +	__u64 offset;
>>>> +	/* size of this chunk in bytes */
>>>> +	__u64 length;
>>>> +
>>>> +	__u64 stripe_len;
>>>> +	__u64 type;
>>>> +
>>>> +	/* 2^16 stripes is quite a lot, a second limit is the size of a single
>>>> +	 * item in the btree
>>>> +	 */
>>>> +	__u16 num_stripes;
>>>> +
>>>> +	/* sub stripes only matter for raid10 */
>>>> +	__u16 sub_stripes;
>>>> +
>>>> +	struct btrfs_chunk_info_stripe stripes[1];
>>>> +	/* additional stripes go here */
>>>> +};
>>>
>>> This looks like a copy of btrfs_chunk and stripe, only removing items
>>> not needed for the chunk information. Rather than copying the
>>> unnecessary fileds like dev_uuid in stripe, this should be designed for
>>> data exchange with the usecase in mind.
>>
>> There are two clients for this api:
>> - btrfs fi us
>> - btrfs dev us
>>
>> We can get rid of:
>> 	- "offset" fields (2x)
>> 	- "uuid" fields
>>
>> However the "offset" fields can be used to understand where a logical map
>> is on the physical disks. I am thinking about a graphical tool to show this
>> mapping, which doesn't exits yet -).
>> The offset field may be used as key to get further information (like the chunk
>> usage, see below)
>>
>> Regarding the UUID field, I agree it can be removed because it is redundant (there
>> is already the devid)
> 
> Offset is ok. I had something like this:
> 
> struct dump_chunks_entry {
>         u64 devid;
>         u64 start;
>         u64 lstart;
>         u64 length;
>         u64 flags;
>         u64 used;
> };
> 
> This selects the most interesting data from the CHUNK_ITEM, except the
> 'used' member, see below.

The structure above is a structure "device basis". This means that for (e.g.) raidX chunks the fields:
- lstart
- length
- flags
- used
are repeated

In fact only devid and start are device specific.
I see the following possibilities

1)

struct dump_chunks_entry {
          u64 lstart;
          u64 length;
          u64 flags;
          u64 used;
          u64 start;
	 u64 devid;
}

pro: simple api
cons: waste of space (60% of data are repeated

2)

struct dump_chunk_disk_entry {
          u64 devid;
          u64 start;
}

struct dump_chunks_entry {
          u64 lstart;
          u64 length;
          u64 flags;
          u64 used;
	 u16 disks_count;
	 struct dump_chunk_disk_entry disks[]
};

pro: smaller data
cons: variable length data

3)

two different ioctl

BTRFS_IOC_DUMP_BLOCK_GROUP

struct dump_block_group {
	u64	lstart
	u64	used
	u64	length
	u64	flags
}

BTRFS_IOC_DUMP_CHUNK

struct dump_chunks_entry {
          u64 lstart;
          u64 start;
	 u64 devid;
}

Where the filed lstart is the key

pro: smaller data (only lstart is repeated); quite orthogonal api
cons: two IOCTLs

Considering that having as optional the "used" field, means to have two ioctl (or one ioctl with a parameter, but this is not so different).
More I think, more I like option 3), however I am less happy to loose an information like sub_stripes (will something like RAID50 comes to BTRFS ?). Unfortunately to have this information, we need to duplicate it for each dump_chunks_entry...

GB

> 
>>> The format does not need follow the exact layout that kernel uses, ie.
>>> chunk info with one embedded stripe and then followed by variable length
>>> array of further stripes. This is convenient in one way but not in
>>> another one. Alternatively each chunk can be emitted as a single entry,
>>> duplicating part of the common fields and adding the stripe-specific
>>> ones. This is for consideration.
>>>
>>> I've looked at my old code doing the chunk dump based on the search
>>> ioctl and found that it also allows to read the chunk usage, with one
>>> extra search to the block group item where the usage is stored. As this
>>> is can be slow, it should be optional. Ie. the main ioctl structure
>>> needs flags where this can be requested.
>>
>> This info could be very useful. I think to something like a balance of
>> chunks which are near filled (or near empty). The question is if we
>> should have a different ioctl.
> 
> I was not proposing a new ioctl but designing the data exchange format
> to optionally provide a way to pass more information, like the usage.
> The reference to search ioctl was merely to point out that there's one
> more search for BLOCK_GROUP_ITEM where the 'used' is stored. As this is
> potentially expensive, it won't be filled by default.
> 
> The structure above does not capture all the chunk data. We could pack
> more such structures into one ioctl call. I think that num_stripes is
> missing from there as this would make possible the raid56 calculations
> but otherwise it should be it.
> 

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5