From: Goffredo Baroncelli <kreijack@libero.it>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, Lukas Straub <lukasstraub2@web.de>
Cc: Martin Raiber <martin@urbackup.org>,
	Paul Jones <paul@pauljones.id.au>,
	Wang Yugui <wangyugui@e16-tech.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
Date: Mon, 6 Jun 2022 20:10:58 +0200	[thread overview]
Message-ID: <128e0119-088b-7a10-c874-551196df4c56@libero.it> (raw)
In-Reply-To: <c4c298bf-ca54-1915-c22f-a6d87fc5a78f@gmx.com>

Hi Qu,

On 06/06/2022 13.21, Qu Wenruo wrote:
> 
> 
> On 2022/6/6 16:16, Qu Wenruo wrote:
>>
>>
> [...]
>>>>
>>>> Hello Qu,
>>>>
>>>> If you don't care about the write-hole, you can also use a dirty bitmap
>>>> like mdraid 5/6 does. There, one bit in the bitmap represents for
>>>> example one gigabyte of the disk that _may_ be dirty, and the bit is
>>>> left
>>>> dirty for a while and doesn't need to be set for each write. Or you
>>>> could do a per-block-group dirty bit.
>>>
>>> That would be a pretty good way for auto scrub after dirty close.
>>>
>>> Currently we have quite some different ideas, but some are pretty
>>> similar but at different side of a spectrum:
>>>
>>>      Easier to implement        ..     Harder to implement
>>> |<- More on mount time scrub   ..     More on journal ->|
>>> |                    |    |    \- Full journal
>>> |                    |    \--- Per bg dirty bitmap
>>> |                    \----------- Per bg dirty flag
>>> \--------------------------------------------------- Per sb dirty flag
>>
>> In fact, I have recently been checking the MD code (including their MD-raid5).
>>
>> It turns out they have a write-intent bitmap, which is almost the per-bg
>> dirty bitmap in the above spectrum.
>>
>> In fact, since btrfs has CoW and checksums for all metadata (and part
>> of its data), btrfs scrub can do a much better job than MD at resilvering
>> the range.
>>
>> Furthermore, we have a pretty good reserved space (1MiB), and a pretty
>> reasonable stripe length (1GiB).
>> This means we only need 32KiB of bitmap for each RAID56 stripe, much
>> smaller than the 1MiB we reserved.
>>
>> I think this can be a pretty reasonable middle ground: faster than a full
>> journal, while the amount to scrub should be small enough to be done at
>> mount time.

Raid5 is "single fault proof". This means that it can sustain only one
failure *at a time*, such as:
1) unavailability of a disk (e.g. a data disk failure)
2) a missing write in the stripe (e.g. an unclean shutdown)

a) Until now (i.e. without the write-intent bitmap), even if these failures
happen on different days (i.e. not at the same time), the result may be a
"write hole".

b) With the write-intent bitmap, the write hole requires that 1) and 2) happen
at the same time. But that would no longer be a "single fault", with only one
exception: when the failures have a common cause (e.g. a power failure which
in turn causes the death of a disk). In that case it has to be considered a
single fault.

But with a battery backup (i.e. no power failure), the likelihood of b)
becomes negligible.

This is to say that a write-intent bitmap would provide a huge improvement
in the resilience of btrfs raid5, and in turn raid6.

My only suggestion is to find a way to store the write-intent bitmap not in
the raid5/6 block group, but in a separate block group, with the appropriate
level of redundancy.

This is for two main reasons:
1) in the future BTRFS may gain the ability to allocate this block group on a
dedicated set of disks. I see two main cases:
a) in the case of raid6, we can store the intent bitmap (or the journal) in a
raid1c3 BG allocated on the faster disks. The con is that each block has to be
written 3x2 times. But if you have a hybrid disk set (some SSDs and some HDDs),
you get a noticeable gain in performance.
b) another option is to spread the intent bitmap (or the journal) across *all*
disks, where each disk contains only the bits for its own data (if we update
only disk #1 and disk #2, we have to update the intent bitmap (or the journal)
only on disk #1 and disk #2)


2) having a dedicated bg for the intent bitmap (or the journal) has another
big advantage: you don't need to change the meaning of the raid5/6 bg. This
means that an older kernel can still read/write a raid5/6 filesystem: it is
sufficient to ignore the intent bitmap (or the journal)



> 
> Furthermore, this even allows us to go with something like a bitmap tree
> for such a write-intent bitmap.
> And as long as the user is not using RAID56 for metadata (maybe it's even
> OK to use RAID56 for metadata), it should be pretty safe against most
> write-hole cases (though only for metadata and CoW data; nocow data is
> still affected).
> 
> Thus I believe this can be a valid path to explore, and it may even have a
> higher priority than the full journal.
> 
> Thanks,
> Qu
> 



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

Thread overview: 37+ messages
2022-05-24  6:13 [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft Qu Wenruo
2022-05-24 11:08 ` kernel test robot
2022-05-24 12:19 ` kernel test robot
2022-05-24 17:02 ` David Sterba
2022-05-24 22:31   ` Qu Wenruo
2022-05-25  9:00   ` Christoph Hellwig
2022-05-25  9:13     ` Qu Wenruo
2022-05-25  9:26       ` Christoph Hellwig
2022-05-25  9:35         ` Qu Wenruo
2022-05-26  9:06           ` waxhead
2022-05-26  9:26             ` Qu Wenruo
2022-05-26 15:30               ` Goffredo Baroncelli
2022-05-26 16:10                 ` David Sterba
2022-06-01  2:06 ` Wang Yugui
2022-06-01  2:13   ` Qu Wenruo
2022-06-01  2:25     ` Wang Yugui
2022-06-01  2:55       ` Qu Wenruo
2022-06-01  9:07         ` Wang Yugui
2022-06-01  9:27           ` Qu Wenruo
2022-06-01  9:56             ` Paul Jones
2022-06-01 10:12               ` Qu Wenruo
2022-06-01 18:49                 ` Martin Raiber
2022-06-01 21:37                   ` Qu Wenruo
2022-06-03  9:32                     ` Lukas Straub
2022-06-03  9:59                       ` Qu Wenruo
2022-06-06  8:16                         ` Qu Wenruo
2022-06-06 11:21                           ` Qu Wenruo
2022-06-06 18:10                             ` Goffredo Baroncelli [this message]
2022-06-07  1:27                               ` Qu Wenruo
2022-06-07 17:36                                 ` Goffredo Baroncelli
2022-06-07 22:14                                   ` Qu Wenruo
2022-06-08 17:26                                     ` Goffredo Baroncelli
2022-06-13  2:27                                       ` Qu Wenruo
2022-06-08 15:17                         ` Lukas Straub
2022-06-08 17:32                           ` Goffredo Baroncelli
2022-06-01 12:21               ` Qu Wenruo
2022-06-01 14:55                 ` Robert Krig
