From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: "me@jse.io" <me@jse.io>
Cc: Ochi <ochi@arcor.de>, linux-btrfs@vger.kernel.org
Subject: Re: RAID5 on SSDs - looking for advice
Date: Mon, 6 Feb 2023 11:05:03 +0800
Message-ID: <7074289e-13cd-ced8-a4d8-0d0b23ba177d@gmx.com>
In-Reply-To: <CAFMvigd+j-ARVRepKKrW4KtjfAHGu9gW0YFb6BCegGj5Lj07ew@mail.gmail.com>



On 2023/2/6 10:34, me@jse.io wrote:
> Apologies for the duplicate, I sent the last reply in HTML by mistake.
> Take two lol.
> 
> Given that 6.2 basically has fixes for the RMW at least for RAID5, apart
> from scrub performance deficiencies and the write hole, are there any other
> gotchas to be aware of?

Firstly, 6.2 only handles the RMW better for data.
There is no easy way to properly handle metadata, thus it's still not
recommended to use RAID56 for metadata.

That said, things like parity-update-failure and read-repair-failure
should be fixed by the RMW fixes.

Secondly, the write hole is not yet fixed. The RMW fix greatly
mitigates the problem, but it is not a full fix.

The other ones look like regular scrub interface bugs.

> This mailing list post <
> https://lore.kernel.org/linux-btrfs/20200627030614.GW10769@hungrycats.org/>
> listed several concerning bugs, like "spurious degraded read failure",
> which worries me as I'm hoping to use Btrfs RAID5 for a media server
> pool, and it would be nice to be able to read my data when degraded.
> How many of the bugs listed there have since been fixed or addressed
> by the RMW fixes in 6.2?
> 
> Also, concerning NOCOW (no-csum data): assuming no device failure, if a
> write to a NOCOW range gets out of sync with parity (i.e. due to a
> crash/write hole), will scrub trust the NOCOW data indiscriminately and
> update the parity, or is it ignored, the way NOCOW is basically ignored
> in RAID1?

NOCOW/NOCSUM is not recommended: with or without the RMW fix, we trust
anything we read from disk if there is no csum to verify it against.

Our trust priority is:

Data with csum (no matter whether the check passes or not, as we
re-check after repair), then data without csum (trust it if the read
passes), then parity last.

Thus data without csum can only be repaired if the read itself failed.
And if such no-csum data mismatches the parity, we always update the
parity unconditionally.
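
As a rough illustration of that priority (a toy sketch only, not the
actual btrfs scrub/repair code; all names here are made up):

    /* Toy sketch of the trust priority above -- not real btrfs code. */
    #include <stdbool.h>
    #include <stdio.h>

    struct sector_state {
        bool has_csum;  /* a csum exists for this data sector           */
        bool read_ok;   /* the device returned the sector without error */
        bool csum_ok;   /* csum verified (meaningless when !has_csum)   */
    };

    /* Should this data sector be rebuilt from the remaining stripes? */
    static bool data_needs_rebuild(const struct sector_state *d)
    {
        if (d->has_csum)
            /* Rebuild on read failure or csum mismatch; the rebuilt
             * result is checked against the csum again afterwards. */
            return !d->read_ok || !d->csum_ok;
        /* No csum: a successful read is trusted as-is, so only a
         * failed read can trigger a rebuild. */
        return !d->read_ok;
    }

    /* Parity sits at the bottom: once every data sector is readable
     * (and csum-clean where a csum exists), any mismatching parity is
     * simply recomputed from that data. */
    static bool parity_needs_rewrite(bool data_all_trusted, bool parity_matches)
    {
        return data_all_trusted && !parity_matches;
    }

    int main(void)
    {
        /* A NOCOW sector that read fine but is out of sync with parity: */
        struct sector_state nocow = { .has_csum = false, .read_ok = true };

        printf("rebuild data?   %d\n", data_needs_rebuild(&nocow));        /* 0 */
        printf("rewrite parity? %d\n", parity_needs_rewrite(true, false)); /* 1 */
        return 0;
    }

That last case is exactly why out-of-sync NOCOW data silently wins over
a correct parity during scrub.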

Thanks,
Qu

> 
> 
> On Sun, Oct 9, 2022 at 8:36 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2022/10/9 18:34, Ochi wrote:
>>> Hello,
>>>
>>> I'm currently thinking about migrating my home NAS to SSDs only. As a
>>> compromise between space efficiency and redundancy, I'm thinking about:
>>>
>>> - using RAID5 for data and RAID1 for metadata on a couple of SSDs (3 or
>>> 4 for now, with the option to expand later),
>>
>> Btrfs RAID56 is not safe against the following problems:
>>
>> - Multi-device data sync (aka, write hole)
>>     Every time a power loss happens, some RAID56 writes may get de-
>>     synchronized.
>>
>>     Unlike mdraid, we don't have a journal/bitmap at all for now,
>>     though we already have a PoC write-intent bitmap.
>>
>> - Destructive RMW
>>     This can happen when some of the existing data is corrupted
>>     (caused by the above write hole, or by bitrot).
>>
>>     In that case, if we write into the vertical stripe, we spread the
>>     original corruption further into the P/Q stripes, completely
>>     killing any possibility of recovering the data.
>>
>>     This affects all RAID56 implementations, including mdraid56, but
>>     we're already working on it, doing full verification before an
>>     RMW cycle (see the sketch after this list).
>>
>> - Extra IO for RAID56 scrub.
>>     It will cause at least twice the amount of data read for RAID5,
>>     and three times for RAID6, thus scrubbing the fs can be very slow.
>>
>>     We're aware of this problem, and have one proposal to address it.
>>
>>     You may see some advice to scrub only one device at a time to
>>     speed things up. But the truth is, that causes more IO, and it
>>     will not ensure your data is correct if you scrub just one device.
>>
>>     Thus if you're going to use btrfs RAID56, you not only have to
>>     scrub periodically, but also need to endure the slow scrub
>>     performance for now.
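
To make the destructive RMW above concrete, here is a toy XOR example
(plain C arithmetic only, not btrfs code; the byte values are made up):

    /* Toy RAID5 arithmetic only -- not btrfs code. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* One vertical stripe: two data blocks and their parity. */
        uint8_t d0 = 0xAA, d1 = 0x55;
        uint8_t p  = d0 ^ d1;          /* parity is still correct */

        uint8_t d1_rot = 0x54;         /* d1 silently bit-rotted  */

        /* A sub-stripe write updates d0.  An RMW that rebuilds the
         * parity from the on-disk stripe without verifying it bakes
         * the rotten d1 into the new parity: */
        uint8_t d0_new = 0xA0;
        uint8_t p_new  = d0_new ^ d1_rot;

        /* Reconstructing d1 from d0_new and the new parity now
         * returns the rotten value; the information the old parity
         * held about the good d1 is gone for good. */
        printf("old parity 0x%02X, rebuilt d1 = 0x%02X (good d1 was 0x%02X)\n",
               p, d0_new ^ p_new, d1);
        return 0;
    }

Verifying the existing stripe against its csums before the RMW (what
6.2 now does for data) is what prevents this.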
>>
>>
>>> - using compression to get the most out of the relatively expensive SSD
>>> storage,
>>> - encrypting each drive separately below the FS level using LUKS (with
>>> discard enabled).
>>>
>>> The NAS is regularly backed up to another NAS with spinning disks that
>>> runs a btrfs RAID1 and takes daily snapshots.
>>>
>>> I have a few questions regarding this approach which I hope someone with
>>> more insight into btrfs can answer me:
>>>
>>> 1. Are there any known issues regarding discard/TRIM in a RAID5 setup?
>>
>> Btrfs doesn't support TRIM inside RAID56 block groups at all.
>>
>> TRIM will only work for the unallocated space of each disk, and for
>> the unused space inside the metadata RAID1 block groups.
>>
>>> Is discard implemented on a lower level that is independent of the
>>> actual RAID level used? The very, very old initial merge announcement
>>> [1] stated that discard support was missing back then. Is it implemented
>>> now?
>>>
>>> 2. How is the parity data calculated when compression is in use? Is it
>>> calculated on the data _after_ compression? In particular, is the parity
>>> data expected to have the same size as the _compressed_ data?
>>
>> To your question, P/Q is calculated after compression.
>>
>> Btrfs and mdraid56 both work at the block layer, thus they don't care
>> about the data size of your write (although full-stripe aligned writes
>> are way better for performance).
>>
>> All writes (considering only the real writes that go to the physical
>> disks, i.e. the compressed data) will first be split on the full
>> stripe size, then go down either the full-stripe write path or the
>> sub-stripe (RMW) write path.
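
A minimal sketch of that split, assuming a 64K per-device stripe on a
4-disk RAID5 (the constants and helper names are made up, not the real
btrfs ones):

    /* Sketch of the split described above -- not btrfs code; the
     * constants and helper names are made up. */
    #include <stdint.h>
    #include <stdio.h>

    #define STRIPE_LEN  (64 * 1024)          /* per-device stripe      */
    #define NR_DATA     3                    /* 4-disk RAID5: 3 data+P */
    #define FULL_STRIPE ((uint64_t)STRIPE_LEN * NR_DATA)

    static void submit_write(uint64_t start, uint64_t len, int full)
    {
        printf("%s write: [%llu, +%llu)\n",
               full ? "full-stripe" : "sub-stripe (RMW)",
               (unsigned long long)start, (unsigned long long)len);
    }

    /* Chop a (compressed) write on full-stripe boundaries; only a chunk
     * covering a whole full stripe can skip reading old data + parity. */
    static void split_write(uint64_t start, uint64_t len)
    {
        while (len) {
            uint64_t off  = start % FULL_STRIPE;
            uint64_t part = FULL_STRIPE - off;

            if (part > len)
                part = len;
            submit_write(start, part, off == 0 && part == FULL_STRIPE);
            start += part;
            len   -= part;
        }
    }

    int main(void)
    {
        /* e.g. 500K of compressed data landing 32K into a full stripe */
        split_write(32 * 1024, 500 * 1024);
        return 0;
    }

Only the chunk in the middle avoids reading back old data and parity;
the ragged head and tail go through the RMW path discussed above.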
>>
>>>
>>> 3. Are there any other known issues that come to mind regarding this
>>> particular setup, or do you have any other advice?
>>
>> We recently fixed a bug where read-time repair for compressed data is
>> not really as robust as we thought, e.g. when the corruption in the
>> compressed data is interleaved (like sector 1 corrupted in mirror 1,
>> sector 2 corrupted in mirror 2).
>>
>> In that case, we would consider the full compressed extent corrupted,
>> even though in fact we should be able to repair it.
>>
>> You may want to use a newer kernel with that fix if you're going to
>> use compression.
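
A toy model of the per-sector idea behind that fix (made-up code, not
the actual btrfs read-repair path):

    /* Made-up model, not the real read-repair path. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NR_SECTORS 4
    #define NR_MIRRORS 2

    int main(void)
    {
        /* ok[m][s]: does mirror m hold a good copy of sector s? */
        bool ok[NR_MIRRORS][NR_SECTORS] = {
            { false, true,  true, true },   /* mirror 1: sector 0 bad */
            { true,  false, true, true },   /* mirror 2: sector 1 bad */
        };
        bool repairable = true;

        /* Per-sector salvage: the extent is recoverable as long as
         * every sector has at least one good copy somewhere, even if
         * no single mirror is clean end to end. */
        for (int s = 0; s < NR_SECTORS; s++) {
            bool found = false;

            for (int m = 0; m < NR_MIRRORS; m++)
                found |= ok[m][s];
            repairable &= found;
        }
        printf("%s\n", repairable ? "repairable" : "lost");
        return 0;
    }

Roughly speaking, the fix lets the repair salvage sector by sector like
this, instead of giving up on the whole compressed extent.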
>>
>>>
>>> [1] https://lwn.net/Articles/536038/
>>>
>>> Best regards
>>> Ochi
