From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: kreijack@inwind.it, Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Forza <forza@tnonline.net>,
	Chris Murphy <lists@colorremedies.com>,
	Johannes Thumshirn <Johannes.Thumshirn@wdc.com>,
	Qu Wenruo <wqu@suse.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: RAID56 discussion related to RST. (Was "Re: [RFC ONLY 0/8] btrfs: introduce raid-stripe-tree")
Date: Tue, 26 Jul 2022 05:29:44 +0800
Message-ID: <5699b8ca-1309-372e-5952-63daf09d9177@gmx.com>
In-Reply-To: <a3d3d872-f0f8-7cd7-7abd-6f4e8f511b57@inwind.it>



On 2022/7/26 03:58, Goffredo Baroncelli wrote:
> On 25/07/2022 02.00, Zygo Blaxell wrote:
>> On Tue, Jul 19, 2022 at 09:19:21AM +0800, Qu Wenruo wrote:
> [...]
>>
>> I'd agree with that.  e.g. some btrfs equivalent of ZFS raidZ (put parity
>> blocks inline with extents during writes) is not much more complex to
>> implement on btrfs than compression; however, the btrfs kernel code
>> couldn't read compressed data correctly for 12 years out of its 14-year
>> history, and nobody wants to wait another decade or more for raid5
>> to work.
>>
>> It seems to me the biggest problem with write hole fixes is that all
>> the potential fixes have cost tradeoffs, and everybody wants to veto
>> the fix that has a cost they don't like.
>>
>> We could implement multiple fix approaches at the same time, as AFAIK
>> most of the proposed solutions are orthogonal to each other.  e.g. a
>> write-ahead log can safely enable RMW at a higher IO cost, while the
>> allocator could place extents to avoid RMW and thereby avoid the logging
>> cost as much as possible (paid for by a deferred relocation/garbage
>> collection cost), and using both at the same time would combine both
>> benefits.  Both solutions can be used independently for filesystems at
>> extreme ends of the performance/capacity spectrum (if the filesystem is
>> never more than 50% full, then logging is all cost with no gain compared
>> to allocator avoidance of RMW, while a filesystem that is always near
>> full will have to journal writes and also throttle writes on the journal).
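
To make that combination concrete, here is a minimal sketch of how the two
paths could coexist (every name below is hypothetical; none of this is
actual btrfs code):

#include <linux/types.h>

/* Hypothetical sketch: a stripe journal makes RMW safe, while the
 * allocator steers writes toward whole stripes so the journal is
 * rarely needed.  None of these helpers exist in btrfs today. */
struct stripe_write {
	__u64 logical;		/* start of the full stripe */
	__u32 dirty_len;	/* bytes actually being written */
	__u32 stripe_len;	/* data disks * stripe unit */
};

/* Stubs standing in for the real IO paths. */
static int write_full_stripe(struct stripe_write *sw) { return 0; }
static int journal_log_stripe(struct stripe_write *sw) { return 0; }
static int rmw_stripe(struct stripe_write *sw) { return 0; }

static int submit_raid56_write(struct stripe_write *sw)
{
	int ret;

	if (sw->dirty_len == sw->stripe_len) {
		/* Full-stripe write: parity is computed entirely from
		 * new data, so there is no write hole to plug. */
		return write_full_stripe(sw);
	}

	/* Partial stripe: log data + parity first, so a crash in the
	 * middle of the RMW can be replayed instead of leaving stale
	 * parity behind. */
	ret = journal_log_stripe(sw);
	if (ret)
		return ret;
	return rmw_stripe(sw);
}
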
>
> Kudos to Zygo; I have to say that I have never before encountered such a
> clear explanation of the complexity around the btrfs raid5/6 problems and
> the related solutions.
>
>>
>>>> For example a non-striped redundant-n profile as well as a striped
>>>> redundant-n profile.
>>>
>>> A non-striped redundant-n profile is already so complex that I can't
>>> figure out a working idea right now.
>>>
>>> But if there is such a way, I'm pretty happy to consider it.
>>>
>>>>
>>>>>
>>>>> My 2 cents...
>>>>>
>>>>> Regarding the current raid56 support, in order of preference:
>>>>>
>>>>> a. Fix the current bugs, without changing format. Zygo has an
>>>>> extensive list.
>>>>
>>>> I agree that relatively simple fixes should be made. But it seems we
>>>> will need quite a large rewrite to solve all issues? Is there a
>>>> minimum viable option here?
>>>
>>> Nope. Just see my write-intent code; I already have a prototype working
>>> (it just needs new scrub-based recovery code at mount time).
>>>
>>> And based on my write-intent code, I don't think it's that hard to
>>> implement a full journal.
>>
>> FWIW I think we can get a very usable btrfs raid5 with a small format
>> change (add a journal for stripe RMW, though we might disagree about
>> details of how it should be structured and used)...
>
> Again, I have to agree with Zygo. Even though I am fascinated by a solution
> like ZFS's (parity blocks inside the extent), I think that a journal (and a
> write-intent log) is the more pragmatic approach:
> - this kind of solution sits below the btrfs block groups; this avoids
>   adding further pressure on the metadata
> - being below the other btrfs structures, it can be shaped more easily,
>   with less risk of incompatibility
>
> It is true that a ZFS-like solution may be faster in some workloads, but
> I think that these are very few:
>    - for high throughput, you likely write the full stripe, which doesn't
>      need the journal/PPL
>    - for small block updates, a journal is more efficient than rewriting
>      the full stripe (see the worked example below)
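
As a rough worked example (assuming a 3-disk raid5 with a 64KiB stripe
unit, i.e. 128KiB of data plus 64KiB of parity per full stripe): a
journaled 4KiB overwrite reads the old 4KiB data and parity blocks, then
writes the new pair twice, once to the journal and once in place, for
roughly 16KiB written in total. Rewriting the whole stripe instead means
reading the ~124KiB of untouched data and writing the full 192KiB of data
plus parity back out, ignoring metadata updates in both cases.
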
>
>
> I hope that the outcome of Qu's work will be a more robust btrfs raid5
> implementation, which will in turn increase the number of users, and in
> turn increase the pressure to improve this part of btrfs.
>
> My only suggestion is to evaluate whether we need to develop a write-intent
> log first and then a journal, instead of developing the journal alone. I
> think that two disk-format changes are too many.

That won't be a problem.

For write-intent we only need 4K, but during development I have reserved
1MiB for both write-intent and a future journal.

Thus there will be only one format change.

Furthermore, that 1MiB can easily be enlarged for the journal.
And for existing RAID56 users, there will be a pretty quick way to
convert to the new write-intent/journal feature.
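
For illustration, one way such a reserved area could be laid out (a
hypothetical sketch only, not the actual on-disk format of the prototype):

#include <linux/types.h>

#define WI_AREA_SIZE	(1024 * 1024)	/* 1MiB reserved per device */
#define WI_BITMAP_BLOCK	4096		/* write-intent only needs 4K */

/* Hypothetical header for the write-intent block; the prototype's
 * real format may differ. */
struct wi_super {
	__le64 magic;		/* identifies the write-intent area */
	__le64 generation;	/* bumped on every bitmap update */
	__le64 blocks_per_bit;	/* granularity covered by one bit */
	__le32 flags;		/* e.g. bitmap-only vs. full journal */
	__le32 csum;		/* checksum of this 4K block */
	__u8   bitmap[];	/* dirty-stripe bitmap, same 4K block */
};

/* Everything past the first 4K stays reserved, so a later full journal
 * fits in the same 1MiB without a second format change. */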

Thanks,
Qu

>
>
> BR
> G.Baroncelli
>> and fixes to the
>> read-repair and scrub problems.  The read-side problems in btrfs raid5
>> were always much more severe than the write hole.  As soon as a disk
>> goes offline, the read-repair code is unable to read all the surviving
>> data correctly, and the filesystem has to be kept inactive or data on
>> the disks will be gradually corrupted as bad parity gets mixed with data
>> and written back to the filesystem.
>>
>> A few of the problems will require a deeper redesign, but IMHO they're
>> not important problems.  e.g. scrub can't identify which drive is corrupted
>> in all cases, because it has no csum on parity blocks.  The current
>> on-disk format needs every data block in the raid5 stripe to be occupied
>> by a file with a csum so scrub can eliminate every other block as the
>> possible source of mismatched parity.  While this could be fixed by
>> a future new raid5 profile (and/or csum tree) specifically designed
>> to avoid this, it's not something I'd insist on having before deploying
>> a fleet of btrfs raid5 boxes.  Silent corruption failures are so
>> rare on spinning disks that I'd use the feature maybe once a decade.
>> Silent corruption due to a failing or overheating HBA chip will most
>> likely affect multiple disks at once and trash the whole filesystem,
>> so individual drive-level corruption reporting isn't helpful.
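
To make the elimination argument concrete, here is a simplified sketch of
the only blame logic scrub can apply today (the stripe abstraction and
helpers are hypothetical stand-ins for scrub internals):

#include <stdbool.h>

/* Hypothetical, simplified view of one raid5 stripe under scrub. */
struct stripe {
	int nr_data;		/* number of data blocks in the stripe */
	/* ... block contents, csums, recomputed parity ... */
};

/* Stand-ins for the real csum/parity checks. */
static bool block_has_csum(const struct stripe *s, int i) { return true; }
static bool csum_matches(const struct stripe *s, int i) { return true; }
static bool parity_matches(const struct stripe *s) { return true; }

enum blame { BLAME_NONE, BLAME_DATA, BLAME_PARITY, BLAME_UNKNOWN };

static enum blame find_corrupted_block(const struct stripe *s)
{
	for (int i = 0; i < s->nr_data; i++) {
		/* A block without a csum (nodatasum, or simply
		 * unallocated space) can never be ruled out. */
		if (!block_has_csum(s, i))
			return BLAME_UNKNOWN;
		if (!csum_matches(s, i))
			return BLAME_DATA;	/* provably bad data */
	}

	/* Every data block verified OK, yet the recomputed parity
	 * differs: parity has no csum of its own, so it can only be
	 * blamed once every other block has been eliminated. */
	if (!parity_matches(s))
		return BLAME_PARITY;

	return BLAME_NONE;	/* stripe is consistent */
}
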
>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>>> b. Mostly fix the write hole, also without changing the format, by
>>>>> only doing COW with full stripe writes. Yes you could somehow get
>>>>> corrupt parity still and not know it until degraded operation produces
>>>>> a bad reconstruction of data - but checksum will still catch that.
>>>>> This kind of "unreplicated corruption" is not quite the same thing as
>>>>> the write hole, because it isn't pernicious like the write hole.
>>>>
>>>> What is the difference from a)? Is the write hole the worst issue?
>>>> Judging from the #btrfs channel discussions there seem to be other quite
>>>> severe issues, for example real data corruption risks in degraded mode.
>>>>
>>>>> c. A new de-clustered parity raid56 implementation that is not
>>>>> backwards compatible.
>>>>
>>>> Yes. We have a good opportunity to work out something much better
>>>> than current implementations. We could have redundant-n profiles
>>>> that also work with tiered storage like SSD/NVMe, similar to the
>>>> metadata-on-SSD idea.
>>>>
>>>> Variable stripe width has been brought up before, but received cool
>>>> responses. Why is that? IMO it could improve random 4K IOs by doing the
>>>> equivalent of RAID1 instead of RMW, while also closing the write
>>>> hole. Perhaps there is a middle ground to be found?
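
A rough sketch of what that allocator decision could look like (the struct
names, profile helpers, and cutoff policy here are all hypothetical):

#include <linux/types.h>

/* Hypothetical sketch of variable stripe width: sub-stripe writes get
 * a mirrored (RAID1-style) extent, larger writes get parity stripes. */
struct fs_info {
	__u64 nr_data_disks;
	__u64 stripe_unit;
};
struct extent;

/* Stand-ins for the two allocation paths. */
static int alloc_mirrored_extent(struct fs_info *fs, __u64 len,
				 struct extent *out) { return 0; }
static int alloc_parity_striped_extent(struct fs_info *fs, __u64 len,
					struct extent *out) { return 0; }

static int alloc_extent_for_write(struct fs_info *fs, __u64 len,
				  struct extent *out)
{
	__u64 full_stripe = fs->nr_data_disks * fs->stripe_unit;

	if (len < full_stripe) {
		/* Small write: mirror it instead of RMW-ing an existing
		 * parity stripe.  No write hole, at the cost of 2x
		 * space until the extents are compacted/balanced. */
		return alloc_mirrored_extent(fs, len, out);
	}

	/* Large write: hand out whole stripes only; any tail shorter
	 * than a full stripe takes the mirrored path next time. */
	return alloc_parity_striped_extent(fs, len - (len % full_stripe),
					   out);
}
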
>>>>
>>>>
>>>>>
>>>>> Ergo, I think it's best to not break the format twice. Even if a new
>>>>> raid implementation is years off.
>>>>
>>>> I very much agree here. Btrfs already suffers in public opinion from the
>>>> lack of a stable and safe-for-data RAID56, and requiring several
>>>> incompatible changes isn't going to help.
>>>>
>>>> I also think it's important that the 'temporary' changes actually
>>>> lead to a stable filesystem. Because what is the point otherwise?
>>>>
>>>> Thanks
>>>> Forza
>>>>
>>>>>
>>>>> Metadata-centric workloads suck on parity raid anyway. If Btrfs always
>>>>> does full-stripe COW, it won't matter even if the performance is worse,
>>>>> because no one should use parity raid for this workload anyway.
>>>>>
>>>>>
>>>>> --
>>>>> Chris Murphy
>>>>
>>>>
>
