From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Ochi <ochi@arcor.de>, linux-btrfs@vger.kernel.org
Subject: Re: RAID5 on SSDs - looking for advice
Date: Sun, 9 Oct 2022 19:36:14 +0800
Message-ID: <86f8b839-da7f-aa19-d824-06926db13675@gmx.com>
In-Reply-To: <a502eed4-b164-278a-2e80-b72013bcfc4f@arcor.de>



On 2022/10/9 18:34, Ochi wrote:
> Hello,
>
> I'm currently thinking about migrating my home NAS to SSDs only. As a
> compromise between space efficiency and redundancy, I'm thinking about:
>
> - using RAID5 for data and RAID1 for metadata on a couple of SSDs (3 or
> 4 for now, with the option to expand later),

Btrfs RAID56 is not safe against the following problems:

- Multi-device write desynchronization (aka the write hole)
   Every power loss that interrupts a RAID56 write can leave the data
   and parity of a stripe out of sync.

   Unlike mdraid, we don't have a journal/bitmap at all for now; there
   is only a proof-of-concept write-intent bitmap so far.

- Destructive RMW
   This can happen when some of the existing on-disk data is already
   corrupted (by the above write hole, or by bitrot).

   In that case, a write into the same vertical stripe spreads the
   original corruption into the P/Q stripes, completely killing any
   chance of recovering the data (see the XOR sketch after this list).

   This affects all RAID56 implementations, including mdraid56, but
   we're already working on it, by doing a full verification before
   each RMW cycle.

- Extra IO for RAID56 scrub
   Scrub has to read at least twice the amount of data for RAID5 and
   three times for RAID6, so scrubbing the fs can be very slow.

   We're aware of this problem and have a proposal to address it.

   You may see advice to scrub only one device at a time to speed
   things up. The truth is that this causes even more IO, and
   scrubbing just one device does not ensure your data is correct.

   Thus if you're going to use btrfs RAID56, you not only have to
   scrub periodically, you also need to endure the slow scrub
   performance for now.
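
To make the write hole and the destructive RMW concrete, here is a
minimal userspace sketch of the underlying XOR math (my own toy example,
not btrfs code; the stripe size and names are made up):

/* Minimal RAID5 parity sketch -- just the XOR math, not btrfs code. */
#include <stdint.h>
#include <stdio.h>

#define STRIPE_LEN 8            /* toy stripe size in bytes */

static void xor_into(uint8_t *dst, const uint8_t *src)
{
        for (int i = 0; i < STRIPE_LEN; i++)
                dst[i] ^= src[i];
}

int main(void)
{
        uint8_t d0[STRIPE_LEN] = "AAAAAAA";     /* data stripe on disk 0 */
        uint8_t d1[STRIPE_LEN] = "BBBBBBB";     /* data stripe on disk 1 */
        uint8_t p[STRIPE_LEN]  = {0};           /* parity stripe on disk 2 */

        /* Full-stripe write: P = D0 ^ D1, everything is consistent. */
        xor_into(p, d0);
        xor_into(p, d1);

        /*
         * Sub-stripe (RMW) update of d0:
         *     P_new = P_old ^ D0_old ^ D0_new
         *
         * If D0_old read back from disk is already corrupted (bitrot,
         * or an earlier write hole), the corruption is folded into
         * P_new and the stripe can no longer reconstruct the good
         * data.  If power is lost after D0_new hits the disk but
         * before P_new does, parity no longer matches the data: that
         * is the write hole.
         */
        uint8_t d0_new[STRIPE_LEN] = "CCCCCCC";
        xor_into(p, d0);        /* remove old d0 from parity */
        xor_into(p, d0_new);    /* fold in new d0 */

        /* Reconstruct d1 from d0_new and parity to show they match. */
        uint8_t rec[STRIPE_LEN] = {0};
        xor_into(rec, p);
        xor_into(rec, d0_new);
        printf("recovered d1: %.7s\n", rec);
        return 0;
}

mdraid closes the write hole with its journal/bitmap; btrfs currently
has neither, only the PoC write-intent bitmap mentioned above.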


> - using compression to get the most out of the relatively expensive SSD
> storage,
> - encrypting each drive seperately below the FS level using LUKS (with
> discard enabled).
>
> The NAS is regularly backed up to another NAS with spinning disks that
> runs a btrfs RAID1 and takes daily snapshots.
>
> I have a few questions regarding this approach which I hope someone with
> more insight into btrfs can answer me:
>
> 1. Are there any known issues regarding discard/TRIM in a RAID5 setup?

Btrfs doesn't support TRIM inside RAID56 block groups at all.

TRIM will only work on the unallocated space of each disk, and on the
unused space inside the metadata RAID1 block groups.

> Is discard implemented on a lower level that is independent of the
> actual RAID level used? The very, very old initial merge announcement
> [1] stated that discard support was missing back then. Is it implemented
> now?
>
> 2. How is the parity data calculated when compression is in use? Is it
> calculated on the data _after_ compression? In particular, is the parity
> data expected to have the same size as the _compressed_ data?

To your question, P/Q is calculated after compression.

Btrfs and mdraid56 both work at the block layer, so they don't care
about the size of your writes (although full-stripe aligned writes are
way better for performance).

All writes (considering only the real writes that go to the physical
disks, i.e. the compressed data) are first split at full-stripe
boundaries, then each part goes through either the full-stripe write
path or the sub-stripe write path.
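
As a rough illustration of that split (my own toy sketch, not the actual
btrfs code; the 3-disk geometry, offsets and function names are made up):

/* Toy split of an on-disk (compressed) write at full-stripe
 * boundaries -- illustrative only, not btrfs code. */
#include <stdio.h>

#define STRIPE_LEN      (64 * 1024)             /* per-device stripe */
#define NR_DATA_STRIPES 2                       /* 3-disk RAID5 */
#define FULL_STRIPE_LEN (STRIPE_LEN * NR_DATA_STRIPES)

static void submit_write(unsigned long long start, unsigned long len)
{
        /* A full, aligned stripe can calculate P directly from the new
         * data; anything smaller has to go through read-modify-write. */
        int full = (start % FULL_STRIPE_LEN == 0) && (len == FULL_STRIPE_LEN);
        printf("%-10s start=%llu len=%lu\n",
               full ? "full" : "sub-stripe", start, len);
}

int main(void)
{
        /* Say compression turned a 300 KiB buffered write into 180 KiB
         * of on-disk data starting at logical offset 100 KiB. */
        unsigned long long start = 100 * 1024;
        unsigned long len = 180 * 1024;

        while (len) {
                unsigned long long stripe_start =
                        start - (start % FULL_STRIPE_LEN);
                unsigned long room = stripe_start + FULL_STRIPE_LEN - start;
                unsigned long cur = len < room ? len : room;

                submit_write(start, cur);
                start += cur;
                len -= cur;
        }
        return 0;
}

With the numbers above this produces one sub-stripe write, one
full-stripe write and one trailing sub-stripe write; only the middle
one avoids the RMW path.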

>
> 3. Are there any other known issues that come to mind regarding this
> particular setup, or do you have any other advice?

We recently fixed a bug where read-time repair of compressed data was
not as robust as we thought, e.g. when the corruption is interleaved
across copies (sector 1 corrupted in mirror 1, sector 2 corrupted in
mirror 2).

In that case we used to treat the whole compressed extent as corrupted,
even though it should be repairable by combining the good sectors from
each copy.

You may want to use a newer kernel with that fix if you're going to use
compression.
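
For illustration, the idea behind the fix is roughly the following (a
simplified userspace sketch, not the actual btrfs read-repair code; the
mirror layout and the fake checksum table are made up):

/* Toy sector-by-sector repair across two copies of an extent --
 * illustrative only, not the real read-repair code. */
#include <stdbool.h>
#include <stdio.h>

#define NR_SECTORS 4
#define NR_MIRRORS 2

/* Pretend checksum verification: true means the sector is good. */
static const bool csum_ok[NR_MIRRORS][NR_SECTORS] = {
        { false, true,  true, true },   /* mirror 1: sector 0 corrupted */
        { true,  false, true, true },   /* mirror 2: sector 1 corrupted */
};

int main(void)
{
        /*
         * Old behaviour: a bad sector on each copy of the compressed
         * extent made the whole extent count as lost, even though
         * every sector exists intact on at least one copy.  The fix
         * repairs sector by sector instead.
         */
        for (int s = 0; s < NR_SECTORS; s++) {
                int good = -1;
                for (int m = 0; m < NR_MIRRORS; m++) {
                        if (csum_ok[m][s]) {
                                good = m;
                                break;
                        }
                }
                if (good >= 0)
                        printf("sector %d: take from mirror %d\n", s, good + 1);
                else
                        printf("sector %d: unrecoverable\n", s);
        }
        return 0;
}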

>
> [1] https://lwn.net/Articles/536038/
>
> Best regards
> Ochi
