Re: What exactly is BTRFS Raid 10?

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Andrei Borzenkov <arvidjaar@gmail.com>,
	kreijack@inwind.it, George Shammas <btrfs@shamm.as>,
	linux-btrfs@vger.kernel.org
Subject: Re: What exactly is BTRFS Raid 10?
Date: Sun, 21 Aug 2022 08:23:00 +0800
Message-ID: <e341879f-9e16-d0f9-dbb5-7c54a6bd28c2@gmx.com>
In-Reply-To: <c0080bf6-c433-30f1-83aa-de8ecba60bee@gmail.com>

On 2022/8/21 02:11, Andrei Borzenkov wrote:
> On 20.08.2022 14:28, Goffredo Baroncelli wrote:
>>
>> RAID1:
>> A new chunk is allocated on the two disks with the most space available. Each new chunk has a size of 1GB x 2 = 2GB, but only 1GB is available for the data because the other 1GB contains a copy of the data.
>> A raid1 layout may have more than two disks. However, only two copies of the data are kept, which means that you can tolerate the loss of only one device.
>> For example the first chunk is allocated on the first two disks; the 2nd chunk is allocated on the first and the 3rd disk; the 3rd chunk is allocated on the 2nd and 3rd disk....
>>
> ...
>>
>> RAID10:
>> Is a mix of RAID0 and RAID1: two copies of the data are kept (so you can tolerate the loss of one device), but the data is spread over nearly all the disks.
>> If you have 7 disks, a new chunk is allocated over the 6 disks (the greatest even number <= the disk count) with the most space available.
>> When you write data, the first 64K is written to the 1st disk and the 2nd disk (as the 2nd copy). The 2nd 64K of data is written to the 3rd disk and the 4th disk (as the 2nd copy). And so on until you fill the chunk.
>> When the chunk is filled, a new allocation occurs. Likely the 7th disk is used for the new chunk and one of the first 6 isn't.
>>
>
> Is large IO processed in parallel? If I have 8 disks raid10 and issue
> 256K request - will btrfs submit 4 concurrent 64K requests to each disk?

That depends on the RAID0/RAID10 stripe size.
Btrfs uses a fixed stripe size (64K).

So if you have an 8-disk raid10 and issue a 256K request, it will first
be split into 4 stripes.

Then the first stripe goes to the first group of 2 disks (a substripe).
The 2nd stripe goes to the 2nd substripe,
and so on until the last stripe goes to the last substripe.

All the submissions happen in parallel.
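
As a rough illustration of the mapping described above, here is a minimal
Python sketch. It is not kernel code: the helper name raid10_targets and
the constants are hypothetical, and it assumes two copies per substripe
and disks numbered by their position within the chunk.

```python
STRIPE_LEN = 64 * 1024  # fixed btrfs stripe size mentioned in this mail
SUB_STRIPES = 2         # assumption: two copies per substripe


def raid10_targets(chunk_offset, num_disks):
    """Return the disk indexes (within the chunk) that receive the stripe
    containing chunk_offset. Hypothetical helper, not kernel code."""
    if num_disks < SUB_STRIPES or num_disks % SUB_STRIPES:
        raise ValueError("expected an even number of disks")
    groups = num_disks // SUB_STRIPES        # number of substripes
    stripe_nr = chunk_offset // STRIPE_LEN   # which 64K stripe this is
    first = (stripe_nr % groups) * SUB_STRIPES
    return list(range(first, first + SUB_STRIPES))


# An aligned 256K request on 8 disks touches 4 stripes, one per disk pair.
for off in range(0, 256 * 1024, STRIPE_LEN):
    print(off // 1024, "K ->", raid10_targets(off, 8))
```

With 8 disks, the fifth stripe wraps around to the first disk pair again.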

Although, in full technical detail, we never submit a full 256K
request. Btrfs submits the first 64K as soon as the write reaches a
stripe boundary.
(Which may very slightly reduce the parallelism, but also very slightly
reduces memory usage.)
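
To make the boundary behaviour concrete, here is a small Python sketch
that cuts a write at 64K stripe boundaries. The function name
split_at_stripe_boundaries is hypothetical and this only models the
arithmetic, not how btrfs actually builds or submits bios.

```python
STRIPE_LEN = 64 * 1024  # fixed btrfs stripe size mentioned in this mail


def split_at_stripe_boundaries(offset, length):
    """Split the byte range [offset, offset + length) into pieces that
    never cross a 64K stripe boundary. Returns (offset, length) tuples."""
    pieces = []
    end = offset + length
    while offset < end:
        # Next stripe boundary strictly after the current offset.
        boundary = (offset // STRIPE_LEN + 1) * STRIPE_LEN
        piece_end = min(boundary, end)
        pieces.append((offset, piece_end - offset))
        offset = piece_end
    return pieces


# An aligned 256K write yields four 64K pieces; an unaligned write gets
# a short first piece that ends at the first stripe boundary.
print(split_at_stripe_boundaries(0, 256 * 1024))
print(split_at_stripe_boundaries(16 * 1024, 100 * 1024))
```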

We have some pending changes to submit a larger bio at the logical
layer and then do the split.
But the change in performance should not even be observable.

>
> And for raid1 - will there be single 256K physical disk request or 4 x
> 64K requests?

Stripe length only matters for RAID0/RAID10/RAID5/RAID6.

DUP/SINGLE/RAID1* don't care about the stripe length, so a single
256K bio is submitted to each of the RAID1* disks.

>
> What about read requests - will all disks in raid1/raid10 be used
> concurrently or btrfs always reads from the "primary" copy (and how it
> is determined then)?

Currently we use the pid as the criterion to load balance reads for
DUP/RAID1* profiles.
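
A minimal Python sketch of pid-based selection, assuming the copy is
picked as the pid modulo the number of copies. The function pick_mirror
and its parameters are hypothetical names for illustration, not the
kernel interface.

```python
import os


def pick_mirror(pid, num_copies, first=0):
    """Pick which copy to read from, based only on the reader's pid.
    Hypothetical helper: assumes a simple pid % num_copies policy."""
    if num_copies < 1:
        raise ValueError("need at least one copy")
    return first + pid % num_copies


# The same process always gets the same copy; different pids spread out.
print("this process reads copy", pick_mirror(os.getpid(), 2))
```

Under this assumption a single reader process never spreads its reads
across both copies; only readers with different pids do.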

Anand Jain has some pending patches to allow different load-balancing
policies to be applied for DUP/RAID1* profiles, though.

Thanks,
Qu