From: "me@jse.io" <me@jse.io>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: Ochi <ochi@arcor.de>, linux-btrfs@vger.kernel.org
Subject: Re: RAID5 on SSDs - looking for advice
Date: Sun, 5 Feb 2023 22:34:30 -0400
Message-ID: <CAFMvigd+j-ARVRepKKrW4KtjfAHGu9gW0YFb6BCegGj5Lj07ew@mail.gmail.com>
In-Reply-To: <86f8b839-da7f-aa19-d824-06926db13675@gmx.com>

Apologies for the duplicate; I sent the last reply in HTML by mistake.
Take two, lol.

Given that 6.2 has fixes for the destructive RMW problem, at least for
RAID5, are there any other gotchas to be aware of apart from the scrub
performance deficiencies and the write hole? This mailing list post <
https://lore.kernel.org/linux-btrfs/20200627030614.GW10769@hungrycats.org/>
listed several concerning bugs, like the "spurious degraded read failure".
That one worries me in particular, since I'm hoping to use Btrfs RAID5 for
a media server pool and I'd like the array to still be readable while
degraded. How many of the bugs listed there have since been fixed or
otherwise addressed by the RMW fixes in 6.2?

Also, concerning NOCOW (no-csum) data: assuming no device failure, if a
write to a NOCOW range gets out of sync with parity (e.g. due to a
crash/write hole), will scrub trust the NOCOW data indiscriminately and
rewrite the parity, or is it skipped, much like NOCOW data is essentially
ignored by scrub in RAID1?
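
To make the "out of sync" scenario concrete: as far as I understand,
RAID5 parity is a plain XOR across the data stripes, so with no csum on
NOCOW data a mismatch is undecidable on its own. A toy sketch of that
(generic C, not btrfs code, stripe layout simplified to two data stripes
plus parity):

/* Toy illustration only -- not btrfs code.  Two data stripes plus one
 * parity stripe; parity is the XOR of the data.  A "torn" update writes
 * new NOCOW data without updating parity.  With no csum, a checker can
 * see the stripe is inconsistent, but not which member is stale. */
#include <stdio.h>
#include <stdint.h>

#define STRIPE_LEN 8

static void xor_parity(const uint8_t *d0, const uint8_t *d1, uint8_t *p)
{
        for (int i = 0; i < STRIPE_LEN; i++)
                p[i] = d0[i] ^ d1[i];
}

int main(void)
{
        uint8_t d0[STRIPE_LEN] = "AAAAAAA", d1[STRIPE_LEN] = "BBBBBBB";
        uint8_t p[STRIPE_LEN], expect[STRIPE_LEN];

        xor_parity(d0, d1, p);          /* consistent stripe            */

        d0[0] = 'X';                    /* torn NOCOW write: data hits
                                           disk, crash before parity    */

        xor_parity(d0, d1, expect);     /* what parity should be now    */
        printf("stripe inconsistent: %s\n",
               expect[0] != p[0] ? "yes" : "no");
        /* Without a csum there is no way to tell whether d0 is new and
         * p is stale, or d0 is corrupted and p is correct.             */
        return 0;
}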


On Sun, Oct 9, 2022 at 8:36 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/10/9 18:34, Ochi wrote:
> > Hello,
> >
> > I'm currently thinking about migrating my home NAS to SSDs only. As a
> > compromise between space efficiency and redundancy, I'm thinking about:
> >
> > - using RAID5 for data and RAID1 for metadata on a couple of SSDs (3 or
> > 4 for now, with the option to expand later),
>
> Btrfs RAID56 is not safe against the following problems:
>
> - Multi-device data de-synchronization (aka the write hole)
>    Every time a power loss happens, some RAID56 writes may get de-
>    synchronized.
> 
>    Unlike mdraid, we don't have a journal/bitmap at all for now,
>    although a PoC write-intent bitmap already exists.
>
> - Destructive RMW
>    This can happen when some of the existing data is corrupted (which
>    can be caused by the above write hole, or by bitrot).
> 
>    In that case, a write into the same vertical stripe will spread the
>    original corruption further into the P/Q stripes, completely killing
>    any possibility of recovering the data.
> 
>    This affects all RAID56, including mdraid56, but we're already
>    working on it: doing a full verification before each RMW cycle.
>
> - Extra IO for RAID56 scrub
>    Scrub reads at least twice the amount of data for RAID5, and three
>    times for RAID6, so scrubbing the fs can be very slow.
> 
>    We're aware of this problem, and have a proposal to address it.
> 
>    You may see advice to scrub only one device at a time to speed
>    things up. But in truth that causes more IO, and scrubbing just one
>    device will not ensure your data is correct.
> 
>    Thus if you're going to use btrfs RAID56, you not only have to do
>    periodic scrubs, but also need to endure the slow scrub performance
>    for now.
>
>
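
(Interleaving a note here to check my understanding of the destructive
RMW point above: the danger is that a sub-stripe write recomputes parity
from whatever is on disk. A simplified sketch of the arithmetic follows;
this is generic RAID5 math, not btrfs's actual code path, and the 2+1
layout is just for the example.)

/* Illustration only -- generic RAID5 arithmetic, not btrfs code.
 * A vertical stripe has data D0, D1 and parity P = D0 ^ D1.  D1 has
 * silently bit-rotted on disk.  A sub-stripe write to D0 that rebuilds
 * parity from the on-disk data, without verifying csums first, copies
 * the corruption into P: afterwards parity can only reproduce the
 * corrupted D1, so the good copy is unrecoverable. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint8_t d0 = 0x41, d1_good = 0x42;
        uint8_t parity = d0 ^ d1_good;          /* healthy stripe       */

        uint8_t d1_disk = d1_good ^ 0x10;       /* bitrot corrupts D1   */

        /* Before the write, parity could still repair D1:              */
        printf("repair before RMW: 0x%02x\n", parity ^ d0);    /* 0x42 */

        /* Unverified RMW: write new D0, recompute P from on-disk D1.   */
        uint8_t d0_new = 0x43;
        parity = d0_new ^ d1_disk;

        /* After the write, "repairing" D1 only returns the corruption: */
        printf("repair after RMW:  0x%02x\n", parity ^ d0_new); /* 0x52 */
        return 0;
}

(The 6.2 fix mentioned above, as I understand it, is to verify the
existing data against its csums before the RMW and repair it first.)
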
> > - using compression to get the most out of the relatively expensive SSD
> > storage,
> > - encrypting each drive separately below the FS level using LUKS (with
> > discard enabled).
> >
> > The NAS is regularly backed up to another NAS with spinning disks that
> > runs a btrfs RAID1 and takes daily snapshots.
> >
> > I have a few questions regarding this approach which I hope someone with
> > more insight into btrfs can answer me:
> >
> > 1. Are there any known issues regarding discard/TRIM in a RAID5 setup?
>
> Btrfs doesn't support TRIM inside RAID56 block groups at all.
>
> TRIM will only work on the unallocated space of each disk, and on the
> unused space inside the metadata RAID1 block groups.
>
> > Is discard implemented on a lower level that is independent of the
> > actual RAID level used? The very, very old initial merge announcement
> > [1] stated that discard support was missing back then. Is it implemented
> > now?
> >
> > 2. How is the parity data calculated when compression is in use? Is it
> > calculated on the data _after_ compression? In particular, is the parity
> > data expected to have the same size as the _compressed_ data?
>
> To your question: P/Q is calculated after compression.
> 
> Both btrfs and mdraid56 work at the block layer, so they don't care
> about the data size of your write (although full-stripe-aligned writes
> perform much better).
> 
> All writes (considering only the real writes that go to the physical
> disks, i.e. the compressed data) are first split on full-stripe
> boundaries, then go down either the full-stripe or the sub-stripe
> write path.
>
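
(A quick sketch of how I picture that split, purely illustrative: the
per-device stripe length and the 3-device 2+1 layout below are
assumptions for the example, not something I'm asserting about btrfs
internals. The point is that parity tracks the compressed byte count,
and only the tail that doesn't fill a full stripe needs the RMW path.)

/* Rough sketch, generic C, not btrfs code.  Assumes a 3-device RAID5
 * (2 data + 1 parity) and a 64K per-device stripe length -- both are
 * assumptions for the example.  The compressed bytes are what get laid
 * out, so parity size follows the compressed size, not the logical
 * file size. */
#include <stdio.h>
#include <stddef.h>

#define STRIPE_LEN   (64 * 1024)   /* per-device stripe length, assumed */
#define DATA_STRIPES 2             /* 3-dev RAID5: 2 data + 1 parity    */
#define FULL_STRIPE  (STRIPE_LEN * DATA_STRIPES)

int main(void)
{
        size_t compressed_len = 200 * 1024;  /* e.g. 1M extent -> 200K  */

        size_t full = compressed_len / FULL_STRIPE;  /* full-stripe writes   */
        size_t tail = compressed_len % FULL_STRIPE;  /* sub-stripe remainder */

        printf("full-stripe writes: %zu (parity computed from new data)\n",
               full);
        printf("sub-stripe tail:    %zu bytes (goes down the RMW path)\n",
               tail);
        return 0;
}
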
> >
> > 3. Are there any other known issues that come to mind regarding this
> > particular setup, or do you have any other advice?
>
> We recently fixed a bug where read-time repair for compressed data was
> not as robust as we thought, e.g. when corruption in the compressed data
> is interleaved (sector 1 corrupted in mirror 1, sector 2 corrupted in
> mirror 2).
> 
> In that case we would consider the whole compressed extent corrupted,
> even though in fact it should be repairable.
> 
> You may want to use a newer kernel with that fix if you're going to use
> compression.
>
> >
> > [1] https://lwn.net/Articles/536038/
> >
> > Best regards
> > Ochi
