From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: kreijack@inwind.it, Lukas Straub <lukasstraub2@web.de>
Cc: Martin Raiber <martin@urbackup.org>,
	Paul Jones <paul@pauljones.id.au>,
	Wang Yugui <wangyugui@e16-tech.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
Date: Wed, 8 Jun 2022 06:14:36 +0800
Message-ID: <49b3013f-68a3-648c-7b15-42d29b64d131@gmx.com>
In-Reply-To: <f5bf7ecb-8cb1-4da1-6052-a2968d4dc6b1@inwind.it>



On 2022/6/8 01:36, Goffredo Baroncelli wrote:
> On 07/06/2022 03.27, Qu Wenruo wrote:
>>
>>
>> On 2022/6/7 02:10, Goffredo Baroncelli wrote:
> [...]
>
>>>
>>> But with a battery backup (i.e. no power failure), the likelihood
>>> of b) becomes negligible.
>>>
>>> This is to say that a write-intent bitmap would provide a huge
>>> improvement in the resilience of btrfs raid5, and in turn raid6.
>>>
>>> My only suggestion is to find a way to store the write-intent
>>> bitmap not in the raid5/6 block group, but in a separate block
>>> group with the appropriate level of redundancy.
>>
>> That's why I want to reject RAID56 for metadata, and just store the
>> write-intent tree in the metadata, like what we do for fsync (the
>> log tree).
>>
>
> My suggestion was not to use the btrfs metadata to store the
> "write-intent", but to track the space used by the write-intent
> storage area with a bg. Then the write intent can be handled not
> with a btrfs btree, but (e.g.) by simply writing a bitmap of the
> used blocks, or pairs of [start, length]....

That solution requires a lot of extra changes to chunk allocation,
plus out-of-btree tracking.

Furthermore, the btrfs btree itself uses CoW to defend against power
loss.

By not using the btree, we would pay a much higher price in the
complexity of implementing everything.

>
> I really like the idea of storing the write intent in a btree; I
> find it very elegant. However, I don't think it is convenient.
>
> The write-intent disk format is not performance related: you don't
> need to seek inside it, and it is small. You need to read it
> (entirely) only in the case of a power failure, and in any case the
> biggest cost is scrubbing the last updated blocks. So a btree is not
> needed.

But such a write-intent bitmap must survive power loss by itself.

And in fact, that bitmap is not as small as you think.

For users who need a write-intent tree/bitmap, we're talking about at
least TiB-level usage.

4TiB of used space already needs a 128MiB bitmap if we go with a
straight bitmap.
Embedding it all on a per-device basis is completely possible, but the
implementation is much more complex.

128MiB is not that large, so in theory we're fine keeping an in-memory
bitmap.
But what happens if we go to 32TiB? Then a 1GiB in-memory bitmap is
needed, which is no longer really acceptable.
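
Just to make the arithmetic explicit, here is a quick sketch (assuming
the usual 4KiB sector size and one bit per sector; nothing
btrfs-specific about it):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t sectorsize = 4096;	/* assumed 4KiB sectors */
	const uint64_t used[] = { 4ULL << 40, 32ULL << 40 };

	for (int i = 0; i < 2; i++) {
		/* One bit per sector, then bits -> bytes -> MiB. */
		uint64_t bits = used[i] / sectorsize;
		uint64_t mib = bits / 8 >> 20;

		printf("%llu TiB used -> %llu MiB of bitmap\n",
		       (unsigned long long)(used[i] >> 40),
		       (unsigned long long)mib);
	}
	return 0;	/* prints 128 and 1024 respectively */
}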

Once we start choosing which parts of that large bitmap pool are
really needed, the btree starts to make sense. We can store a very
large bitmap using bitmap-based and extent-based entries pretty
easily, just like the free space cache tree does.
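
To illustrate the idea (a purely hypothetical item layout, not a
proposal for the actual on-disk format), the tree could mix two entry
types the same way the free space tree does, so dense dirty regions
collapse into bitmaps while sparse ones stay as plain extents:

/*
 * Hypothetical write-intent tree entries; kernel types from
 * <linux/types.h>, names and fields are illustrative only.
 */
struct wi_extent_entry {
	__le64 offset;		/* logical start of a dirty range */
	__le64 length;		/* length of the dirty range */
} __attribute__((packed));

struct wi_bitmap_entry {
	__le64 offset;		/* logical start covered by the bitmap */
	__le64 length;		/* bytes covered, one bit per sector */
	__u8   bitmap[];	/* inline bitmap, dirty sectors set */
} __attribute__((packed));

Whichever representation is smaller for a given region wins, which is
roughly how the free space tree keeps itself compact.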

>
> Moreover, the handling of raid5/6 is a layer below the btree.

CSUM is also a layer below, yet we still put it into the csum tree.

The handling of the write-intent bitmap/tree is indeed a layer lower.
But traditional DM lacks awareness of the upper-layer filesystem, and
thus has a lot of problems, like being unable to detect bit rot in
RAID1, for example.

Yes, we care about layer separation, but more at the code level.
Functionally, layer separation is no longer that big a deal.

> I think that updating the write-intent btree would be a performance
> bottleneck. I am quite sure that the write intent likely requires
> less than one metadata page (16K today); however, to store this page
> you need to update the metadata page tracking...

We already have the existing log tree code doing similar things
(though for a quite different purpose), and it's used to speed up
fsync.

Furthermore, the DM-layer bitmap is not a straight bitmap of all
sectors either, and its performance cost is almost negligible for
sequential RW.
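
For reference, here is a minimal sketch of that kind of coarse-grained
tracking (the 64MiB region size and the helper name are assumptions,
not the real md/DM format):

#include <stdint.h>
#include <limits.h>

#define WI_REGION_SHIFT	26	/* assumed 64MiB dirty-tracking regions */
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Mark the region containing @byte_offset dirty before issuing the write. */
static inline void wi_mark_dirty(unsigned long *bitmap, uint64_t byte_offset)
{
	uint64_t region = byte_offset >> WI_REGION_SHIFT;

	bitmap[region / BITS_PER_LONG] |= 1UL << (region % BITS_PER_LONG);
}

Since one bit covers a whole region, a long sequential write only sets
a handful of new bits, which is why the overhead stays negligible.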

I don't think btree handling would be a performance bottleneck, given
that NODATACOW for data doesn't improve performance much beyond the
implied NODATASUM.

>
>>>
>>> This for two main reasons:
>>> 1) in the future BTRFS may gain the ability to allocate this block
>>> group on a dedicated set of disks. I see two main cases:
>>> a) in the case of raid6, we can store the intent bitmap (or the
>>> journal) in a raid1C3 BG allocated on the faster disks. The con is
>>> that each block has to be written 3x2 times. But if you have a
>>> hybrid disk set (some SSDs and some HDDs), you get a noticeable
>>> gain in performance.
>>
>> In fact, for a 4-disk setup, RAID10 has a good enough chance to
>> tolerate 2 missing disks.
>>
>> The chance of tolerating two missing devices on a 4-disk RAID10 is:
>>
>> 4 / 6 = 66.7%
>>
>> 4 is the number of surviving combinations, order not considered:
>> (1, 3), (1, 4), (2, 3), (2, 4) (or 4C2 - 2).
>>
>> 6 is 4C2.
>>
>> So there is really no need to go RAID1C3 unless you really want
>> guaranteed 2-disk tolerance.
>
> I don't get the point: I started talking about raid6. Raid6 is
> two-failure proof (you need three failures to see the problem... in
> theory).
>
> If P is the probability of a disk failure (with P << 1), the
> likelihood of a RAID6 failure is O(P^3). The same holds for RAID1C3.
>
> Instead, the RAID10 failure likelihood is only a bit less than that
> of a two-disk failure: RAID10 (4 disks) failure is O(0.66 * P^2) ~
> O(P^2).
>
> Because P << 1, P^3 << 0.66 * P^2.

My point here is that, although RAID10 is not guaranteed to survive
losing 2 disks, losing two disks still leaves a high enough chance of
survival.

And since RAID10 keeps only two copies of the data, instead of the
three of RAID1C3, that cost saving can be attractive for a lot of
users.
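
For what it's worth, a brute-force check of the 4/6 figure (assuming
the four disks form the two mirror pairs {1,2} and {3,4}, matching the
combinations listed above):

#include <stdio.h>

int main(void)
{
	int survive = 0, total = 0;

	for (int a = 1; a <= 4; a++) {
		for (int b = a + 1; b <= 4; b++) {
			/* The array dies only if both lost disks sit in
			 * the same mirror pair. */
			int same_pair = (a == 1 && b == 2) ||
					(a == 3 && b == 4);

			total++;
			if (!same_pair)
				survive++;
		}
	}
	printf("%d / %d two-disk losses survive (%.1f%%)\n",
	       survive, total, 100.0 * survive / total);
	return 0;	/* prints 4 / 6 (66.7%) */
}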

Thanks,
Qu

>>
>>> b) another option is to spread the intent bitmap (or the journal)
>>> across *all* disks, where each disk contains only its related data
>>> (if we update only disk #1 and disk #2, we only have to update the
>>> intent bitmap (or the journal) on disk #1 and disk #2)
>>
>> That's my initial per-device reservation method.
>>
>> But for the write-intent tree, I tend not to go that way, and to use
>> an RO-compatible flag instead, as it's much simpler and more
>> backward compatible.
>>
>> Thanks,
>> Qu
>>>
>>>
>>> 2) having a dedicated bg for the intent bitmap (or the journal)
>>> has another big advantage: you don't need to change the meaning of
>>> the raid5/6 bg. This means that an older kernel can read/write a
>>> raid5/6 filesystem: it is sufficient to ignore the intent bitmap
>>> (or the journal)
>>>
>>>
>>>
>>>>
>>>> Furthermore, this even allows us to go with something like a
>>>> bitmap tree for such a write-intent bitmap.
>>>> And as long as the user is not using RAID56 for metadata (maybe
>>>> it's even OK to use RAID56 for metadata), it should be pretty safe
>>>> against most write-hole cases (for metadata and CoW data only
>>>> though; nocow data is still affected).
>>>>
>>>> Thus I believe this can be a valid path to explore, and it can
>>>> even have a higher priority than a full journal.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>
>>>
>>>
>
>

