From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Lukas Straub <lukasstraub2@web.de>
Cc: Martin Raiber <martin@urbackup.org>,
	Paul Jones <paul@pauljones.id.au>,
	Wang Yugui <wangyugui@e16-tech.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
Date: Mon, 6 Jun 2022 16:16:17 +0800
Message-ID: <252577ba-1659-62f8-fc44-fea506eb97b7@gmx.com>
In-Reply-To: <8c318892-0d36-51bb-18e0-a762dd75b723@gmx.com>



On 2022/6/3 17:59, Qu Wenruo wrote:
>
>
> On 2022/6/3 17:32, Lukas Straub wrote:
>> On Thu, 2 Jun 2022 05:37:11 +0800
>> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>> On 2022/6/2 02:49, Martin Raiber wrote:
>>>> On 01.06.2022 12:12 Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2022/6/1 17:56, Paul Jones wrote:
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
>>>>>>> Sent: Wednesday, 1 June 2022 7:27 PM
>>>>>>> To: Wang Yugui <wangyugui@e16-tech.com>
>>>>>>> Cc: linux-btrfs@vger.kernel.org
>>>>>>> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format
>>>>>>> draft
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>>>>> If we save the journal on every RAID56 HDD, it will always be
>>>>>>>>>> very slow, because the journal data is in a different place
>>>>>>>>>> than the normal data, so HDD seeks will always happen?
>>>>>>>>>>
>>>>>>>>>> If we save the journal on a dedicated device, just like
>>>>>>>>>> 'mke2fs -O journal_dev' or 'mkfs.xfs -l logdev', then this
>>>>>>>>>> device just works like an NVDIMM?  We may not need
>>>>>>>>>> RAID56/RAID1 for the journal data.
>>>>>>>>>
>>>>>>>>> That device is a single point of failure. If you lose that
>>>>>>>>> device, the write hole comes back.
>>>>>>>>
>>>>>>>> The HW RAID card has a 'single point of failure' too, such as
>>>>>>>> the NVDIMM inside the HW RAID card.
>>>>>>>>
>>>>>>>> but power-loss frequency > HDD failure frequency > NVDIMM/SSD
>>>>>>>> failure frequency
>>>>>>>
>>>>>>> It's a completely different level.
>>>>>>>
>>>>>>> For btrfs RAID, we have no special treatment for any disk.
>>>>>>> And our RAID focuses on ensuring tolerance to device failures.
>>>>>>>
>>>>>>> In your RAID card case, the failure rate of the card is indeed
>>>>>>> much lower.
>>>>>>> In the journal device case, how do you ensure that the chance of
>>>>>>> losing the journal device is still way lower than that of all
>>>>>>> the other devices?
>>>>>>>
>>>>>>> So this doesn't make sense unless you put the journal on
>>>>>>> something that is definitely not a regular disk.
>>>>>>>
>>>>>>> I don't believe this benefits most users.
>>>>>>> Just consider how many regular people use a dedicated journal
>>>>>>> device for XFS/EXT4 on top of md/dm RAID56.
>>>>>>
>>>>>> A good solid state drive should be far less error prone than
>>>>>> spinning drives, so it would be a good candidate. Not perfect,
>>>>>> but better.
>>>>>>
>>>>>> As an end user, I think focusing on stability and recovery tools
>>>>>> is a better use of time than fixing the write hole, as I wouldn't
>>>>>> even consider using RAID56 in its current state. The write hole
>>>>>> problem can be alleviated by a UPS and by not using RAID56 for a
>>>>>> busy write load. It's still good to brainstorm the issue though,
>>>>>> as it will need solving eventually.
>>>>>
>>>>> In fact, since the write hole is only a problem for power loss
>>>>> (and explicit degraded writes), another solution is to only record
>>>>> whether the fs was gracefully closed.
>>>>>
>>>>> If the fs was not gracefully closed (indicated by a bit in the
>>>>> superblock), then we just trigger a full scrub on all existing
>>>>> RAID56 block groups.
>>>>>
>>>>> This should solve the problem, at the extra cost of a slow scrub
>>>>> after each unclean shutdown.
>>>>>
>>>>> To be extra safe, during that scrub run we really want the user to
>>>>> wait for the scrub to finish.
>>>>>
>>>>> But on the other hand, I totally understand users won't be happy
>>>>> to wait 10+ hours just because of an unclean shutdown...
>>>> Would it be possible to put the stripe offsets/numbers into a
>>>> journal and commit them before the write? Then, during mount, you
>>>> could scrub only those after an unclean shutdown.
>>>
>>> If we go that path, we can already do a full journal, and just
>>> replay that journal without needing a scrub at all.
>>
>> Hello Qu,
>>
>> If you don't care about the write hole, you can also use a dirty
>> bitmap like mdraid 5/6 does. There, one bit in the bitmap represents,
>> for example, one gigabyte of the disk that _may_ be dirty, and the
>> bit is left dirty for a while and doesn't need to be set for each
>> write. Or you could do a per-block-group dirty bit.
>
> That would be a pretty good way to auto-scrub after a dirty close.
>
> Currently we have quite a few different ideas; some are pretty
> similar, just at different points of a spectrum:
>
>      Easier to implement        ..     Harder to implement
> |<- More on mount time scrub   ..     More on journal ->|
> |                    |    |    \- Full journal
> |                    |    \--- Per bg dirty bitmap
> |                    \----------- Per bg dirty flag
> \--------------------------------------------------- Per sb dirty flag

In fact, I have recently been checking the MD code (including their
MD RAID5).

It turns out they have a write-intent bitmap, which is almost the
per-bg dirty bitmap in the above spectrum.

In fact, since btrfs has CoW and checksums for all of its metadata (and
part of its data), btrfs scrub can do a much better job than MD at
resilvering the range.
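
To illustrate the difference with a minimal sketch (toy code, not real
MD or btrfs kernel APIs): MD has no checksums, so it has to treat every
sector in a dirty region as suspect and rewrite its parity, while a
checksum-aware scrub can verify each sector and rebuild only the ones
that actually fail verification.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in: pretend only sector 3 was half-written before the crash. */
static bool toy_csum_ok(uint64_t sector_nr)
{
	return sector_nr != 3;
}

static void toy_rebuild_from_parity(uint64_t sector_nr)
{
	printf("rebuilding sector %llu from parity\n",
	       (unsigned long long)sector_nr);
}

int main(void)
{
	/* Say the write-intent bitmap marks 8 sectors as possibly dirty. */
	for (uint64_t s = 0; s < 8; s++) {
		if (toy_csum_ok(s))
			continue;	/* checksum verified, keep the data */
		toy_rebuild_from_parity(s);
	}
	return 0;
}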

Furthermore, we have a pretty good reserved space (1MiB) and a pretty
reasonable stripe length (1GiB).
This means we only need 32KiB of bitmap for each RAID56 stripe, much
smaller than the 1MiB we reserved.
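
To double-check that figure, a quick sizing sketch (not btrfs code); the
one-bit-per-4KiB-sector granularity is my assumption, not something
spelled out above, but it is the granularity that yields 32KiB:

#include <stdio.h>

int main(void)
{
	unsigned long long bg_size = 1ULL << 30;	/* 1GiB block group */
	unsigned long long sectorsize = 4096;		/* assumed 4KiB */
	unsigned long long bits = bg_size / sectorsize;	/* 262144 bits */
	unsigned long long bytes = bits / 8;		/* 32768 bytes */

	printf("bitmap: %llu bits = %llu KiB per 1GiB block group\n",
	       bits, bytes / 1024);
	return 0;
}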

I think this can be a pretty reasonable middle ground: faster than a
full journal, while the amount to scrub should be small enough to be
done at mount time.
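
As a rough sketch of that mount-time flow (all names and structures
below are made up for illustration, not existing btrfs code): walk each
RAID56 block group's write-intent bitmap and queue a scrub only for the
sectors whose dirty bit is set.

#include <stdint.h>
#include <stdio.h>

#define BG_SIZE		(1ULL << 30)	/* 1GiB block group (assumed) */
#define SECTORSIZE	4096ULL		/* assumed 4KiB sector size */
#define BITMAP_BYTES	(BG_SIZE / SECTORSIZE / 8)	/* 32KiB */

struct toy_block_group {
	uint64_t start;			/* logical start offset */
	uint8_t dirty[BITMAP_BYTES];	/* write-intent bitmap */
};

/* Stand-in for submitting a real scrub request on one dirty sector. */
static void toy_scrub_sector(uint64_t logical)
{
	printf("scrub sector at logical %llu\n", (unsigned long long)logical);
}

/* Walk one block group's bitmap and scrub only the dirty sectors. */
static void toy_scrub_dirty(const struct toy_block_group *bg)
{
	for (uint64_t bit = 0; bit < BG_SIZE / SECTORSIZE; bit++) {
		if (bg->dirty[bit / 8] & (1 << (bit % 8)))
			toy_scrub_sector(bg->start + bit * SECTORSIZE);
	}
}

int main(void)
{
	static struct toy_block_group bg = { .start = 1ULL << 30 };

	/* Pretend a crash left two sectors marked dirty. */
	bg.dirty[0] |= 0x01;	/* first sector in the block group */
	bg.dirty[0] |= 0x80;	/* eighth sector */

	toy_scrub_dirty(&bg);
	return 0;
}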

Thanks,
Qu
>
> In fact, the dirty bitmap is just a simplified version of a journal
> (it only records the metadata, without the data).
> Unlike md/dm RAID56, with btrfs scrub we should be able to fully
> recover the data without problems.
>
> Even with a per-bg dirty bitmap, we still need some extra location to
> record the bitmap. Thus it needs an on-disk format change anyway.
>
> Currently only the per-sb dirty flag may be backward compatible.
>
> And whether we should wait for the scrub to finish before allowing the
> user to do anything with the fs is another concern.
>
> Even using a bitmap, we may have several GiB of data that needs to be
> scrubbed.
> If we wait for the scrub to finish, that's the best and safest way,
> but users won't be happy at all.
>
> If we go the scrub-resume way, it's faster but still leaves a large
> window in which the write hole can reduce our tolerance.
>
> Thanks,
> Qu
>>
>> And while you're at it, add the same mechanism to all the other raid
>> and dup modes to fix the inconsistency of NOCOW files after a crash.
>>
>> Regards,
>> Lukas Straub
>>
>>> Thanks,
>>> Qu
>>>
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> Paul.
>>>>
>>>>
>>
>>
>>
