From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Ochi <ochi@arcor.de>, linux-btrfs@vger.kernel.org
Subject: Re: RAID5 on SSDs - looking for advice
Date: Sun, 9 Oct 2022 19:36:14 +0800
Message-ID: <86f8b839-da7f-aa19-d824-06926db13675@gmx.com>
In-Reply-To: <a502eed4-b164-278a-2e80-b72013bcfc4f@arcor.de>



On 2022/10/9 18:34, Ochi wrote:
> Hello,
>
> I'm currently thinking about migrating my home NAS to SSDs only. As a
> compromise between space efficiency and redundancy, I'm thinking about:
>
> - using RAID5 for data and RAID1 for metadata on a couple of SSDs (3 or
> 4 for now, with the option to expand later),

Btrfs RAID56 is not safe against the following problems:

- Multi-device write desynchronization (aka the write hole)
   Every power loss that interrupts a RAID56 write can leave the data
   and parity of a stripe out of sync.

   Unlike mdraid, we don't have a journal/bitmap at all for now; there
   is only a proof-of-concept write-intent bitmap so far.

- Destructive RMW
   This can happen when some of the existing on-disk data is already
   corrupted (by the above write hole, or by bitrot).

   In that case, a write into the same vertical stripe spreads the
   original corruption into the P/Q stripes, completely killing any
   chance of recovering the data (see the XOR sketch after this list).

   This affects all RAID56 implementations, including mdraid56, but
   we're already working on it, by doing a full verification before
   each RMW cycle.

- Extra IO for RAID56 scrub
   Scrub has to read at least twice the amount of data for RAID5 and
   three times for RAID6, so scrubbing the fs can be very slow.

   We're aware of this problem and have a proposal to address it.

   You may see advice to scrub only one device at a time to speed
   things up. The truth is that this causes even more IO, and
   scrubbing just one device does not ensure your data is correct.

   Thus if you're going to use btrfs RAID56, you not only have to
   scrub periodically, you also need to endure the slow scrub
   performance for now.
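
To make the write hole and the destructive RMW concrete, here is a
minimal userspace sketch of the underlying XOR math (my own toy example,
not btrfs code; the stripe size and names are made up):

/* Minimal RAID5 parity sketch -- just the XOR math, not btrfs code. */
#include <stdint.h>
#include <stdio.h>

#define STRIPE_LEN 8            /* toy stripe size in bytes */

static void xor_into(uint8_t *dst, const uint8_t *src)
{
        for (int i = 0; i < STRIPE_LEN; i++)
                dst[i] ^= src[i];
}

int main(void)
{
        uint8_t d0[STRIPE_LEN] = "AAAAAAA";     /* data stripe on disk 0 */
        uint8_t d1[STRIPE_LEN] = "BBBBBBB";     /* data stripe on disk 1 */
        uint8_t p[STRIPE_LEN]  = {0};           /* parity stripe on disk 2 */

        /* Full-stripe write: P = D0 ^ D1, everything is consistent. */
        xor_into(p, d0);
        xor_into(p, d1);

        /*
         * Sub-stripe (RMW) update of d0:
         *     P_new = P_old ^ D0_old ^ D0_new
         *
         * If D0_old read back from disk is already corrupted (bitrot,
         * or an earlier write hole), the corruption is folded into
         * P_new and the stripe can no longer reconstruct the good
         * data.  If power is lost after D0_new hits the disk but
         * before P_new does, parity no longer matches the data: that
         * is the write hole.
         */
        uint8_t d0_new[STRIPE_LEN] = "CCCCCCC";
        xor_into(p, d0);        /* remove old d0 from parity */
        xor_into(p, d0_new);    /* fold in new d0 */

        /* Reconstruct d1 from d0_new and parity to show they match. */
        uint8_t rec[STRIPE_LEN] = {0};
        xor_into(rec, p);
        xor_into(rec, d0_new);
        printf("recovered d1: %.7s\n", rec);
        return 0;
}

mdraid closes the write hole with its journal/bitmap; btrfs currently
has neither, only the PoC write-intent bitmap mentioned above.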


> - using compression to get the most out of the relatively expensive SSD
> storage,
> - encrypting each drive seperately below the FS level using LUKS (with
> discard enabled).
>
> The NAS is regularly backed up to another NAS with spinning disks that
> runs a btrfs RAID1 and takes daily snapshots.
>
> I have a few questions regarding this approach which I hope someone with
> more insight into btrfs can answer me:
>
> 1. Are there any known issues regarding discard/TRIM in a RAID5 setup?

Btrfs doesn't support TRIM inside RAID56 block groups at all.

TRIM will only work on the unallocated space of each disk, and on the
unused space inside the metadata RAID1 block groups.

> Is discard implemented on a lower level that is independent of the
> actual RAID level used? The very, very old initial merge announcement
> [1] stated that discard support was missing back then. Is it implemented
> now?
>
> 2. How is the parity data calculated when compression is in use? Is it
> calculated on the data _after_ compression? In particular, is the parity
> data expected to have the same size as the _compressed_ data?

To your question, P/Q is calculated after compression.

Btrfs and mdraid56 both work at the block layer, so they don't care
about the size of your writes (although full-stripe aligned writes are
way better for performance).

All writes (considering only the real writes that go to the physical
disks, i.e. the compressed data) are first split at full-stripe
boundaries, then each part goes through either the full-stripe write
path or the sub-stripe write path.
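
As a rough illustration of that split (my own toy sketch, not the actual
btrfs code; the 3-disk geometry, offsets and function names are made up):

/* Toy split of an on-disk (compressed) write at full-stripe
 * boundaries -- illustrative only, not btrfs code. */
#include <stdio.h>

#define STRIPE_LEN      (64 * 1024)             /* per-device stripe */
#define NR_DATA_STRIPES 2                       /* 3-disk RAID5 */
#define FULL_STRIPE_LEN (STRIPE_LEN * NR_DATA_STRIPES)

static void submit_write(unsigned long long start, unsigned long len)
{
        /* A full, aligned stripe can calculate P directly from the new
         * data; anything smaller has to go through read-modify-write. */
        int full = (start % FULL_STRIPE_LEN == 0) && (len == FULL_STRIPE_LEN);
        printf("%-10s start=%llu len=%lu\n",
               full ? "full" : "sub-stripe", start, len);
}

int main(void)
{
        /* Say compression turned a 300 KiB buffered write into 180 KiB
         * of on-disk data starting at logical offset 100 KiB. */
        unsigned long long start = 100 * 1024;
        unsigned long len = 180 * 1024;

        while (len) {
                unsigned long long stripe_start =
                        start - (start % FULL_STRIPE_LEN);
                unsigned long room = stripe_start + FULL_STRIPE_LEN - start;
                unsigned long cur = len < room ? len : room;

                submit_write(start, cur);
                start += cur;
                len -= cur;
        }
        return 0;
}

With the numbers above this produces one sub-stripe write, one
full-stripe write and one trailing sub-stripe write; only the middle
one avoids the RMW path.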

>
> 3. Are there any other known issues that come to mind regarding this
> particular setup, or do you have any other advice?

We recently fixed a bug where read-time repair of compressed data was
not as robust as we thought, e.g. when the corruption is interleaved
across copies (sector 1 corrupted in mirror 1, sector 2 corrupted in
mirror 2).

In that case we used to treat the whole compressed extent as corrupted,
even though it should be repairable by combining the good sectors from
each copy.

You may want to use a newer kernel with that fix if you're going to use
compression.
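
For illustration, the idea behind the fix is roughly the following (a
simplified userspace sketch, not the actual btrfs read-repair code; the
mirror layout and the fake checksum table are made up):

/* Toy sector-by-sector repair across two copies of an extent --
 * illustrative only, not the real read-repair code. */
#include <stdbool.h>
#include <stdio.h>

#define NR_SECTORS 4
#define NR_MIRRORS 2

/* Pretend checksum verification: true means the sector is good. */
static const bool csum_ok[NR_MIRRORS][NR_SECTORS] = {
        { false, true,  true, true },   /* mirror 1: sector 0 corrupted */
        { true,  false, true, true },   /* mirror 2: sector 1 corrupted */
};

int main(void)
{
        /*
         * Old behaviour: a bad sector on each copy of the compressed
         * extent made the whole extent count as lost, even though
         * every sector exists intact on at least one copy.  The fix
         * repairs sector by sector instead.
         */
        for (int s = 0; s < NR_SECTORS; s++) {
                int good = -1;
                for (int m = 0; m < NR_MIRRORS; m++) {
                        if (csum_ok[m][s]) {
                                good = m;
                                break;
                        }
                }
                if (good >= 0)
                        printf("sector %d: take from mirror %d\n", s, good + 1);
                else
                        printf("sector %d: unrecoverable\n", s);
        }
        return 0;
}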

>
> [1] https://lwn.net/Articles/536038/
>
> Best regards
> Ochi
