* RAID5 on SSDs - looking for advice
@ 2022-10-09 10:34 Ochi
  2022-10-09 11:36 ` Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Ochi @ 2022-10-09 10:34 UTC (permalink / raw)
  To: linux-btrfs

Hello,

I'm currently thinking about migrating my home NAS to SSDs only. As a 
compromise between space efficiency and redundancy, I'm thinking about:

- using RAID5 for data and RAID1 for metadata on a couple of SSDs (3 or 
4 for now, with the option to expand later),
- using compression to get the most out of the relatively expensive SSD 
storage,
- encrypting each drive separately below the FS level using LUKS (with
discard enabled).

The NAS is regularly backed up to another NAS with spinning disks that 
runs a btrfs RAID1 and takes daily snapshots.

I have a few questions regarding this approach which I hope someone with 
more insight into btrfs can answer:

1. Are there any known issues regarding discard/TRIM in a RAID5 setup? 
Is discard implemented on a lower level that is independent of the 
actual RAID level used? The very, very old initial merge announcement 
[1] stated that discard support was missing back then. Is it implemented 
now?

2. How is the parity data calculated when compression is in use? Is it 
calculated on the data _after_ compression? In particular, is the parity 
data expected to have the same size as the _compressed_ data?

3. Are there any other known issues that come to mind regarding this 
particular setup, or do you have any other advice?

[1] https://lwn.net/Articles/536038/

Best regards
Ochi


* Re: RAID5 on SSDs - looking for advice
  2022-10-09 10:34 RAID5 on SSDs - looking for advice Ochi
@ 2022-10-09 11:36 ` Qu Wenruo
  2022-10-09 12:56   ` Ochi
                     ` (2 more replies)
  2022-10-09 11:42 ` Roman Mamedov
  2022-10-09 13:44 ` waxhead
  2 siblings, 3 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-10-09 11:36 UTC (permalink / raw)
  To: Ochi, linux-btrfs



On 2022/10/9 18:34, Ochi wrote:
> Hello,
>
> I'm currently thinking about migrating my home NAS to SSDs only. As a
> compromise between space efficiency and redundancy, I'm thinking about:
>
> - using RAID5 for data and RAID1 for metadata on a couple of SSDs (3 or
> 4 for now, with the option to expand later),

Btrfs RAID56 is not safe against the following problems:

- Multi-device data sync (aka, the write hole)
   Every time a power loss happens, some RAID56 writes may get
   de-synchronized.

   Unlike mdraid, we don't have a journal/bitmap at all for now;
   we only have a PoC write-intent bitmap so far.

- Destructive RMW
   This can happen when some of the existing data is corrupted (caused
   by the above write hole, or by bitrot).

   In that case, if we then write into the same vertical stripe, we
   spread the original corruption further into the P/Q stripes,
   completely destroying any chance of recovering the data.

   This affects all RAID56, including mdraid56, but we're already
   working on it: doing a full verification before the RMW cycle.

- Extra IO for RAID56 scrub.
   It causes at least twice the amount of data to be read for RAID5,
   three times for RAID6, thus scrubbing the fs can be very slow.

   We're aware of this problem, and have one proposal to address it.

   You may see some advice to only scrub one device at a time to speed
   things up. But the truth is, that causes more IO, and it will
   not ensure your data is correct if you just scrub one device.

   Thus if you're going to use btrfs RAID56, you not only have to
   scrub periodically, but also need to endure the slow scrub
   performance for now.
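
To illustrate the destructive RMW point above, here is a minimal Python
sketch of a toy 3-disk RAID5 stripe with XOR parity (an illustration
only, not btrfs code):

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A consistent 2-data + 1-parity stripe.
d0, d1 = b"\x11" * 4, b"\x22" * 4
parity = xor(d0, d1)

# Bitrot: d1 is silently corrupted on disk.
d1_on_disk = b"\x99" * 4

# A read-modify-write of d0 that does not verify d1 first folds the
# corruption into the new parity.
new_d0 = b"\x33" * 4
parity = xor(new_d0, d1_on_disk)

# Later the csum mismatch on d1 is noticed and we try to rebuild it
# from the remaining data + parity:
rebuilt_d1 = xor(new_d0, parity)
assert rebuilt_d1 == d1_on_disk   # we only get the corrupted bytes back
assert rebuilt_d1 != d1           # the original d1 is unrecoverable

With the full verification before the RMW cycle that is being worked on,
the bad d1 would be detected and rebuilt from the old data + parity
before the new parity is written.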


> - using compression to get the most out of the relatively expensive SSD
> storage,
> - encrypting each drive seperately below the FS level using LUKS (with
> discard enabled).
>
> The NAS is regularly backed up to another NAS with spinning disks that
> runs a btrfs RAID1 and takes daily snapshots.
>
> I have a few questions regarding this approach which I hope someone with
> more insight into btrfs can answer me:
>
> 1. Are there any known issues regarding discard/TRIM in a RAID5 setup?

Btrfs doesn't support TRIM inside RAID56 block groups at all.

Trim will only work for the unallocated space of each disk, and the
unused space inside the METADATA RAID1 block groups.

> Is discard implemented on a lower level that is independent of the
> actual RAID level used? The very, very old initial merge announcement
> [1] stated that discard support was missing back then. Is it implemented
> now?
>
> 2. How is the parity data calculated when compression is in use? Is it
> calculated on the data _after_ compression? In particular, is the parity
> data expected to have the same size as the _compressed_ data?

To your question: P/Q is calculated after compression.

Btrfs and mdraid56 both work at the block layer, thus they don't care
about the data size of your write (although full-stripe-aligned writes
are way better for performance).

All writes (considering only the real writes that go to the physical
disks, i.e. the compressed data) will first be split at the full stripe
size, then go down either the full-stripe write path or the sub-stripe
write path.
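
As a rough illustration of that split (the stripe geometry here is just
an assumption for the example, not taken from the btrfs code):

# Toy split of a (compressed) write into full-stripe and sub-stripe parts.
STRIPE_LEN  = 64 * 1024           # assumed per-device stripe length
DATA_DEVS   = 2                   # 3-device RAID5: 2 data + 1 parity
FULL_STRIPE = STRIPE_LEN * DATA_DEVS

def split_write(length: int):
    """Return (full-stripe writes, sub-stripe tail in bytes)."""
    return length // FULL_STRIPE, length % FULL_STRIPE

# A 300 KiB compressed extent: two full-stripe writes (parity computed
# purely from the new data) plus a 44 KiB tail that has to go through
# the sub-stripe (read-modify-write) path.
print(split_write(300 * 1024))    # -> (2, 45056)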

>
> 3. Are there any other known issues that come to mind regarding this
> particular setup, or do you have any other advice?

We recently fixed a bug where read-time repair for compressed data was
not really as robust as we thought, e.g. when the corruption in the
compressed data is interleaved (sector 1 corrupted in mirror 1, sector 2
corrupted in mirror 2).

In that case we would consider the whole compressed extent corrupted,
even though we should in fact be able to repair it.

You may want to use a newer kernel with that fix if you're going to use
compression.
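
Roughly, the interleaved case above looks like this (a toy model of
per-sector repair across two copies, not the actual read-repair code):

def repair(mirror1, mirror2, csum_ok):
    """Pick, per sector, whichever copy passes its checksum."""
    out = []
    for i, (s1, s2) in enumerate(zip(mirror1, mirror2)):
        if csum_ok(i, s1):
            out.append(s1)
        elif csum_ok(i, s2):
            out.append(s2)
        else:
            raise IOError(f"sector {i} corrupted in both copies")
    return out

good    = ["A", "B", "C", "D"]
mirror1 = ["A", "X", "C", "D"]   # sector 1 corrupted
mirror2 = ["A", "B", "Y", "D"]   # sector 2 corrupted
ok = lambda i, s: s == good[i]   # stand-in for the per-sector csum check

assert repair(mirror1, mirror2, ok) == good   # repairable sector by sector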

>
> [1] https://lwn.net/Articles/536038/
>
> Best regards
> Ochi


* Re: RAID5 on SSDs - looking for advice
  2022-10-09 10:34 RAID5 on SSDs - looking for advice Ochi
  2022-10-09 11:36 ` Qu Wenruo
@ 2022-10-09 11:42 ` Roman Mamedov
  2022-10-09 13:12   ` Ochi
  2022-10-09 13:44 ` waxhead
  2 siblings, 1 reply; 13+ messages in thread
From: Roman Mamedov @ 2022-10-09 11:42 UTC (permalink / raw)
  To: Ochi; +Cc: linux-btrfs

On Sun, 9 Oct 2022 12:34:57 +0200
Ochi <ochi@arcor.de> wrote:

> 3. Are there any other known issues that come to mind regarding this 
> particular setup, or do you have any other advice?

Keep in mind that Btrfs RAID5/6 are not currently recommended for use:
https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid56-status-and-recommended-practices

If the NAS is backed up anyway, I suggest going with a directory-level
merge of filesystems, such as MergerFS. If one SSD fails, you will only
need to restore the files which happened to be on that one, not redo the
entire thing, as would be the case with RAID0, the Btrfs single profile,
or an LVM-based large block device across all three.

Another alternative is mdadm RAID5 with Btrfs on top. But it feels like
that also has its own corner cases when it comes to sudden power losses,
which may result in the "parent transid failed" condition on the Btrfs
side (not sure if the recent PPL support in mdadm fixes that).

-- 
With respect,
Roman


* Re: RAID5 on SSDs - looking for advice
  2022-10-09 11:36 ` Qu Wenruo
@ 2022-10-09 12:56   ` Ochi
  2022-10-09 13:01     ` Forza
  2022-10-09 14:33   ` Jorge Bastos
  2023-02-06  2:34   ` me
  2 siblings, 1 reply; 13+ messages in thread
From: Ochi @ 2022-10-09 12:56 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 09.10.22 13:36, Qu Wenruo wrote:
> 
> 
> On 2022/10/9 18:34, Ochi wrote:
>> Hello,
>>
>> I'm currently thinking about migrating my home NAS to SSDs only. As a
>> compromise between space efficiency and redundancy, I'm thinking about:
>>
>> - using RAID5 for data and RAID1 for metadata on a couple of SSDs (3 or
>> 4 for now, with the option to expand later),
> 
> Btrfs RAID56 is not safe against the following problems:
> 
> - Multi-device data sync (aka, write hole)
>    Every time a power loss happens, some RAID56 writes may get de-
>    synchronized.
> 
>    Unlike mdraid, we don't have journal/bitmap at all for now.
>    We already have a PoC write-intent bitmap.
> 
> - Destructive RMW
>    This can happen when some of the existing data is corrupted (can be
>    caused by above write-hole, or bitrot.
> 
>    In that case, if we have write into the vertical stripe, we will
>    make the original corruption further spread into the P/Q stripes,
>    completely killing the possibility to recover the data.
> 
>    This is for all RAID56, including mdraid56, but we're already working
>    on this, to do full verification before a RMW cycle.

Especially from the last point (and others below) I gather that RAID56 
is still under quite active development, with known issues being worked 
on beyond just the write hole that I guess many btrfs users have heard 
about in the context of RAID56.

> - Extra IO for RAID56 scrub.
>    It will cause at least twice amount of data read for RAID5, three
>    times for RAID6, thus it can be very slow scrubbing the fs.
> 
>    We're aware of this problem, and have one purposal to address it.
> 
>    You may see some advice to only scrub one device one time to speed
>    things up. But the truth is, it's causing more IO, and it will
>    not ensure your data is correct if you just scrub one device.
> 
>    Thus if you're going to use btrfs RAID56, you have not only to do
>    periodical scrub, but also need to endure the slow scrub performance
>    for now.

Interesting point. I will probably start out with 20-24 TB of raw 
storage space, and scrubbing may actually take a significant amount of 
time even with SATA SSD speeds. If RAID56 makes this even worse, it 
might be an issue to be aware of.
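
A quick back-of-envelope (the throughput and amplification numbers are
assumptions, not measurements):

# Rough scrub-time estimate for the planned pool.
raw_bytes     = 24e12        # ~24 TB of raw space
throughput    = 500e6        # assumed aggregate scrub read speed, bytes/s
amplification = 2            # RAID5 scrub reads ~2x the data (see above)

hours = raw_bytes * amplification / throughput / 3600
print(f"~{hours:.0f} h per scrub")   # -> ~27 h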

>> - using compression to get the most out of the relatively expensive SSD
>> storage,
>> - encrypting each drive seperately below the FS level using LUKS (with
>> discard enabled).
>>
>> The NAS is regularly backed up to another NAS with spinning disks that
>> runs a btrfs RAID1 and takes daily snapshots.
>>
>> I have a few questions regarding this approach which I hope someone with
>> more insight into btrfs can answer me:
>>
>> 1. Are there any known issues regarding discard/TRIM in a RAID5 setup?
> 
> Btrfs doesn't support TRIM inside RAID56 block groups at all.
> 
> Trim will only work for the unallocated space of each disk, and the
> unused space inside the METADATA RAID1 block groups.

Thank you for the insight. I didn't think that the statement from 2013 
might actually still be valid nowadays, but I'm glad I asked. :) I'm not 
sure how important the trim information _actually_ is to the SSDs in the 
end (with regard to the internal implementation of the particular SSDs), 
but I guess it's another aspect to be aware of with RAID56+SSDs.

>> Is discard implemented on a lower level that is independent of the
>> actual RAID level used? The very, very old initial merge announcement
>> [1] stated that discard support was missing back then. Is it implemented
>> now?
>>
>> 2. How is the parity data calculated when compression is in use? Is it
>> calculated on the data _after_ compression? In particular, is the parity
>> data expected to have the same size as the _compressed_ data?
> 
> To your question, P/Q is calculated after compression.
> 
> Btrfs and mdraid56, they work at block layer, thus they don't care the
> data size of your write.(although full-stripe aligned write is way
> better for performance)
> 
> All writes (only considering the real writes which will go to physical
> disks, thus the compressed data) will first be split using full stripe
> size, then go either full-stripe write path or sub-stripe write.
> 
>>
>> 3. Are there any other known issues that come to mind regarding this
>> particular setup, or do you have any other advice?
> 
> We recently fixed a bug that read time repair for compressed data is not
> really as robust as we think.
> E.g. the corruption in compressed data is interleaved (like sector 1 is
> corrupted in mirror 1, sector 2 is corrupted in mirror 2).
> 
> In that case, we will consider the full compressed data as corrupted,
> but in fact we should be able to repair it.

It's always fascinating what kind of corner cases might appear that one 
may or may not have thought about initially. :)

Taking everything into account, maybe I will consider what alternative 
options I have for my particular use case until more issues have been 
ironed out. Maybe RAID5 is overcomplicating things in my case after all. 
A significant amount of the data I'm going to store is pretty static in 
nature, so using single devices and merging them in some way (with 
something like MergerFS, as Roman Mamedov suggested in another reply to 
my original mail; I'll have to take a closer look), together with my 
regular backup, might be another viable option for me that is possibly 
less error-prone.

Thank you!

> You may want to use newer kernel with that fixed if you're going to use
> compression.
> 
>>
>> [1] https://lwn.net/Articles/536038/
>>
>> Best regards
>> Ochi


* Re: RAID5 on SSDs - looking for advice
  2022-10-09 12:56   ` Ochi
@ 2022-10-09 13:01     ` Forza
  2022-10-09 13:16       ` Ochi
  0 siblings, 1 reply; 13+ messages in thread
From: Forza @ 2022-10-09 13:01 UTC (permalink / raw)
  To: Ochi, Qu Wenruo, linux-btrfs



On 2022-10-09 14:56, Ochi wrote:
> On 09.10.22 13:36, Qu Wenruo wrote:
>>
>>
>> On 2022/10/9 18:34, Ochi wrote:
>>> Hello,
>>>
>>> I'm currently thinking about migrating my home NAS to SSDs only. As a
>>> compromise between space efficiency and redundancy, I'm thinking about:
>>>
>>> - using RAID5 for data and RAID1 for metadata on a couple of SSDs (3 or
>>> 4 for now, with the option to expand later),
>>
>> Btrfs RAID56 is not safe against the following problems:
>>
>> - Multi-device data sync (aka, write hole)
>>    Every time a power loss happens, some RAID56 writes may get de-
>>    synchronized.
>>
>>    Unlike mdraid, we don't have journal/bitmap at all for now.
>>    We already have a PoC write-intent bitmap.
>>
>> - Destructive RMW
>>    This can happen when some of the existing data is corrupted (can be
>>    caused by above write-hole, or bitrot.
>>
>>    In that case, if we have write into the vertical stripe, we will
>>    make the original corruption further spread into the P/Q stripes,
>>    completely killing the possibility to recover the data.
>>
>>    This is for all RAID56, including mdraid56, but we're already working
>>    on this, to do full verification before a RMW cycle.
> 
> Especially from the last point (and others below) I understand that 
> RAID56 is still in quite active development with known issues being 
> worked on, and it's not only regarding the write-hole that I guess many 
> btrfs users have heard about in the context of RAID56.
> 
>> - Extra IO for RAID56 scrub.
>>    It will cause at least twice amount of data read for RAID5, three
>>    times for RAID6, thus it can be very slow scrubbing the fs.
>>
>>    We're aware of this problem, and have one purposal to address it.
>>
>>    You may see some advice to only scrub one device one time to speed
>>    things up. But the truth is, it's causing more IO, and it will
>>    not ensure your data is correct if you just scrub one device.
>>
>>    Thus if you're going to use btrfs RAID56, you have not only to do
>>    periodical scrub, but also need to endure the slow scrub performance
>>    for now.
> 
> Interesting point. I will probably start out with 20-24 TB of raw 
> storage space, and scrubbing may actually take a significant amount of 
> time even with SATA SSD speeds. If RAID56 makes this even worse, it 
> might be an issue to be aware of.
> 
>>> - using compression to get the most out of the relatively expensive SSD
>>> storage,
>>> - encrypting each drive seperately below the FS level using LUKS (with
>>> discard enabled).
>>>
>>> The NAS is regularly backed up to another NAS with spinning disks that
>>> runs a btrfs RAID1 and takes daily snapshots.
>>>
>>> I have a few questions regarding this approach which I hope someone with
>>> more insight into btrfs can answer me:
>>>
>>> 1. Are there any known issues regarding discard/TRIM in a RAID5 setup?
>>
>> Btrfs doesn't support TRIM inside RAID56 block groups at all.
>>
>> Trim will only work for the unallocated space of each disk, and the
>> unused space inside the METADATA RAID1 block groups.
> 
> Thank you for the insight. I didn't think that the statement from 2013 
> might actually still be valid nowadays, but I'm glad I asked. :) I'm not 
> sure know how important the trim information _actually_ is for the SSDs 
> at the end (with regards to the internal implementation of the 
> particular SSDs), but I guess it's another aspect to be aware of with 
> RAID56+SSDs.
> 
>>> Is discard implemented on a lower level that is independent of the
>>> actual RAID level used? The very, very old initial merge announcement
>>> [1] stated that discard support was missing back then. Is it implemented
>>> now?
>>>
>>> 2. How is the parity data calculated when compression is in use? Is it
>>> calculated on the data _after_ compression? In particular, is the parity
>>> data expected to have the same size as the _compressed_ data?
>>
>> To your question, P/Q is calculated after compression.
>>
>> Btrfs and mdraid56, they work at block layer, thus they don't care the
>> data size of your write.(although full-stripe aligned write is way
>> better for performance)
>>
>> All writes (only considering the real writes which will go to physical
>> disks, thus the compressed data) will first be split using full stripe
>> size, then go either full-stripe write path or sub-stripe write.
>>
>>>
>>> 3. Are there any other known issues that come to mind regarding this
>>> particular setup, or do you have any other advice?
>>
>> We recently fixed a bug that read time repair for compressed data is not
>> really as robust as we think.
>> E.g. the corruption in compressed data is interleaved (like sector 1 is
>> corrupted in mirror 1, sector 2 is corrupted in mirror 2).
>>
>> In that case, we will consider the full compressed data as corrupted,
>> but in fact we should be able to repair it.
> 
> It's always fascinating what kind of corner cases might appear that one 
> may or may not have thought about initially. :)
> 
> Taking everything into account, maybe I will consider what alternative 
> options I have for my particular use case until more issues have been 
> ironed out. Maybe RAID5 is overcomplicating things in my case after all. 
> A significant amount of the data I'm going to store is pretty static in 
> nature, so maybe using single devices and merging them in some way (with 
> something like MergerFS as Roman Mamedov suggested in another reply to 
> my original mail, but I'll have to take a closer look), together with my 
> regular backup, is another viable option for me that is possibly less 
> error-prone.

How important is up-time for you? If you can manage some hours of 
down-time in case of a hardware error, you might just do SINGLE data and 
RAID1 metadata, and schedule hourly (or more frequent) incremental 
snapshots+backups with btrbk.

> 
> Thank you!
> 
>> You may want to use newer kernel with that fixed if you're going to use
>> compression.
>>
>>>
>>> [1] https://lwn.net/Articles/536038/
>>>
>>> Best regards
>>> Ochi


* Re: RAID5 on SSDs - looking for advice
  2022-10-09 11:42 ` Roman Mamedov
@ 2022-10-09 13:12   ` Ochi
  0 siblings, 0 replies; 13+ messages in thread
From: Ochi @ 2022-10-09 13:12 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-btrfs

On 09.10.22 13:42, Roman Mamedov wrote:
> On Sun, 9 Oct 2022 12:34:57 +0200
> Ochi <ochi@arcor.de> wrote:
> 
>> 3. Are there any other known issues that come to mind regarding this
>> particular setup, or do you have any other advice?
> 
> Keep in mind that Btrfs RAID5/6 are not currently recommended for use:
> https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid56-status-and-recommended-practices
> 
> If the NAS is backed up anyway, I suggest going with directory-level merge of
> filesystems, such as MergerFS. If one SSD fails, you will need to restore only
> the files which happened to be on that one, not redo the entire thing, as
> would be the case with RAID0, Btrfs single profile, or LVM-based large block
> device across all three.

Something like this might actually be a viable option for my use case. A 
significant part of the data is pretty static in nature, so having it 
sit on single drives, together with the backup that I have anyway, could 
work well. I already thought about just using RAID0, but didn't like the 
idea that a single drive failure would be pretty catastrophic (maybe 
RAID0 for data and RAID1 for metadata would at least let the filesystem 
structure survive a single drive failure, but I imagine recovering from 
that scenario would be pretty ugly).

I'll have a look at your suggestion and consider it as an alternative to 
my initial plan. Thank you :)

> Another alternative is mdadm RAID5 with Btrfs on top. But it feels like that
> also has its own corner cases when it comes to sudden power losses, which may
> result in the "parent transid failed" condition from Btrfs-side (not sure if
> the recent PPL in mdadm fixes that).
> 


* Re: RAID5 on SSDs - looking for advice
  2022-10-09 13:01     ` Forza
@ 2022-10-09 13:16       ` Ochi
  0 siblings, 0 replies; 13+ messages in thread
From: Ochi @ 2022-10-09 13:16 UTC (permalink / raw)
  To: Forza, Qu Wenruo, linux-btrfs

On 09.10.22 15:01, Forza wrote:
> 
> 
> On 2022-10-09 14:56, Ochi wrote:
>> On 09.10.22 13:36, Qu Wenruo wrote:
>>>
>>>
>>> On 2022/10/9 18:34, Ochi wrote:
>>>> Hello,
>>>>
>>>> I'm currently thinking about migrating my home NAS to SSDs only. As a
>>>> compromise between space efficiency and redundancy, I'm thinking about:
>>>>
>>>> - using RAID5 for data and RAID1 for metadata on a couple of SSDs (3 or
>>>> 4 for now, with the option to expand later),
>>>
>>> Btrfs RAID56 is not safe against the following problems:
>>>
>>> - Multi-device data sync (aka, write hole)
>>>    Every time a power loss happens, some RAID56 writes may get de-
>>>    synchronized.
>>>
>>>    Unlike mdraid, we don't have journal/bitmap at all for now.
>>>    We already have a PoC write-intent bitmap.
>>>
>>> - Destructive RMW
>>>    This can happen when some of the existing data is corrupted (can be
>>>    caused by above write-hole, or bitrot.
>>>
>>>    In that case, if we have write into the vertical stripe, we will
>>>    make the original corruption further spread into the P/Q stripes,
>>>    completely killing the possibility to recover the data.
>>>
>>>    This is for all RAID56, including mdraid56, but we're already working
>>>    on this, to do full verification before a RMW cycle.
>>
>> Especially from the last point (and others below) I understand that 
>> RAID56 is still in quite active development with known issues being 
>> worked on, and it's not only regarding the write-hole that I guess 
>> many btrfs users have heard about in the context of RAID56.
>>
>>> - Extra IO for RAID56 scrub.
>>>    It will cause at least twice amount of data read for RAID5, three
>>>    times for RAID6, thus it can be very slow scrubbing the fs.
>>>
>>>    We're aware of this problem, and have one purposal to address it.
>>>
>>>    You may see some advice to only scrub one device one time to speed
>>>    things up. But the truth is, it's causing more IO, and it will
>>>    not ensure your data is correct if you just scrub one device.
>>>
>>>    Thus if you're going to use btrfs RAID56, you have not only to do
>>>    periodical scrub, but also need to endure the slow scrub performance
>>>    for now.
>>
>> Interesting point. I will probably start out with 20-24 TB of raw 
>> storage space, and scrubbing may actually take a significant amount of 
>> time even with SATA SSD speeds. If RAID56 makes this even worse, it 
>> might be an issue to be aware of.
>>
>>>> - using compression to get the most out of the relatively expensive SSD
>>>> storage,
>>>> - encrypting each drive seperately below the FS level using LUKS (with
>>>> discard enabled).
>>>>
>>>> The NAS is regularly backed up to another NAS with spinning disks that
>>>> runs a btrfs RAID1 and takes daily snapshots.
>>>>
>>>> I have a few questions regarding this approach which I hope someone 
>>>> with
>>>> more insight into btrfs can answer me:
>>>>
>>>> 1. Are there any known issues regarding discard/TRIM in a RAID5 setup?
>>>
>>> Btrfs doesn't support TRIM inside RAID56 block groups at all.
>>>
>>> Trim will only work for the unallocated space of each disk, and the
>>> unused space inside the METADATA RAID1 block groups.
>>
>> Thank you for the insight. I didn't think that the statement from 2013 
>> might actually still be valid nowadays, but I'm glad I asked. :) I'm 
>> not sure know how important the trim information _actually_ is for the 
>> SSDs at the end (with regards to the internal implementation of the 
>> particular SSDs), but I guess it's another aspect to be aware of with 
>> RAID56+SSDs.
>>
>>>> Is discard implemented on a lower level that is independent of the
>>>> actual RAID level used? The very, very old initial merge announcement
>>>> [1] stated that discard support was missing back then. Is it 
>>>> implemented
>>>> now?
>>>>
>>>> 2. How is the parity data calculated when compression is in use? Is it
>>>> calculated on the data _after_ compression? In particular, is the 
>>>> parity
>>>> data expected to have the same size as the _compressed_ data?
>>>
>>> To your question, P/Q is calculated after compression.
>>>
>>> Btrfs and mdraid56, they work at block layer, thus they don't care the
>>> data size of your write.(although full-stripe aligned write is way
>>> better for performance)
>>>
>>> All writes (only considering the real writes which will go to physical
>>> disks, thus the compressed data) will first be split using full stripe
>>> size, then go either full-stripe write path or sub-stripe write.
>>>
>>>>
>>>> 3. Are there any other known issues that come to mind regarding this
>>>> particular setup, or do you have any other advice?
>>>
>>> We recently fixed a bug that read time repair for compressed data is not
>>> really as robust as we think.
>>> E.g. the corruption in compressed data is interleaved (like sector 1 is
>>> corrupted in mirror 1, sector 2 is corrupted in mirror 2).
>>>
>>> In that case, we will consider the full compressed data as corrupted,
>>> but in fact we should be able to repair it.
>>
>> It's always fascinating what kind of corner cases might appear that 
>> one may or may not have thought about initially. :)
>>
>> Taking everything into account, maybe I will consider what alternative 
>> options I have for my particular use case until more issues have been 
>> ironed out. Maybe RAID5 is overcomplicating things in my case after 
>> all. A significant amount of the data I'm going to store is pretty 
>> static in nature, so maybe using single devices and merging them in 
>> some way (with something like MergerFS as Roman Mamedov suggested in 
>> another reply to my original mail, but I'll have to take a closer 
>> look), together with my regular backup, is another viable option for 
>> me that is possibly less error-prone.
> 
> How important is up-time for you? If you can manage some hours of 
> down-time if there is a H/W error, you might just do SINGLE data and 
> RAID1 metadata, and schedule hourly (or more often) incremental 
> snapshots+backups with btrbk.

I actually thought about something like this as well. What would the 
result of a single drive failure be in that case? Would the filesystem 
structure keep working as usual, while reading a file that happens to 
reside on the failed drive results in a read error? I'm not sure what 
the recovery process would look like in that case; do you have 
experience with such a setup?

>>
>> Thank you!
>>
>>> You may want to use newer kernel with that fixed if you're going to use
>>> compression.
>>>
>>>>
>>>> [1] https://lwn.net/Articles/536038/
>>>>
>>>> Best regards
>>>> Ochi


* Re: RAID5 on SSDs - looking for advice
  2022-10-09 10:34 RAID5 on SSDs - looking for advice Ochi
  2022-10-09 11:36 ` Qu Wenruo
  2022-10-09 11:42 ` Roman Mamedov
@ 2022-10-09 13:44 ` waxhead
  2 siblings, 0 replies; 13+ messages in thread
From: waxhead @ 2022-10-09 13:44 UTC (permalink / raw)
  To: Ochi, linux-btrfs

> Hello,
> 
> I'm currently thinking about migrating my home NAS to SSDs only. As a 
> compromise between space efficiency and redundancy, I'm thinking about:
> 
> - using RAID5 for data and RAID1 for metadata on a couple of SSDs (3 or 
> 4 for now, with the option to expand later),

As a BTRFS user I would personally avoid BTRFS RAID5/6 (for now).

Have you looked at this? https://carfax.org.uk/btrfs-usage/

If you are planning to use only 3 or 4 devices, there is not THAT much 
to gain from running RAID5 instead of RAID1, so it might not be worth 
the risk.

The cost increase of slightly larger-capacity SSDs may be a much better 
option for you, and thanks to BTRFS's brilliant design you can always 
rebalance some years later if RAID5/6 becomes safe at some point.
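
For equal-sized devices the raw numbers look roughly like this
(simplified; the carfax calculator above handles mixed sizes and
metadata properly):

# Simplified usable-capacity comparison for equal-sized devices.
def usable_tb(n_devices: int, size_tb: float, profile: str) -> float:
    if profile == "raid1":
        return n_devices * size_tb / 2     # every block stored twice
    if profile == "raid5":
        return (n_devices - 1) * size_tb   # one device's worth of parity
    raise ValueError(profile)

for n in (3, 4):
    print(n, usable_tb(n, 4, "raid1"), usable_tb(n, 4, "raid5"))
# 3 x 4 TB: 6 TB (RAID1) vs 8 TB (RAID5)
# 4 x 4 TB: 8 TB (RAID1) vs 12 TB (RAID5)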

(Oh, and no matter how robust your filesystem or setup is - if you value 
your data make sure you have tested, working backups! :) )


* Re: RAID5 on SSDs - looking for advice
  2022-10-09 11:36 ` Qu Wenruo
  2022-10-09 12:56   ` Ochi
@ 2022-10-09 14:33   ` Jorge Bastos
  2023-02-06  2:34   ` me
  2 siblings, 0 replies; 13+ messages in thread
From: Jorge Bastos @ 2022-10-09 14:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Ochi, linux-btrfs

On Sun, Oct 9, 2022 at 12:50 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> - Extra IO for RAID56 scrub.
>    It will cause at least twice amount of data read for RAID5, three
>    times for RAID6, thus it can be very slow scrubbing the fs.
>
>    We're aware of this problem, and have one purposal to address it.
>
>    You may see some advice to only scrub one device one time to speed
>    things up. But the truth is, it's causing more IO, and it will
>    not ensure your data is correct if you just scrub one device.
>
>    Thus if you're going to use btrfs RAID56, you have not only to do
>    periodical scrub, but also need to endure the slow scrub performance
>    for now.
>
>

I have a few small btrfs RAID5 pools and just wanted to add that, for
me, scrub speed with SSDs, while not ideal, is still decent: for example,
with 7 devices I get around 500 MB/s. When using disk drives, on the
other hand, it's painfully slow; a pool with 6 drives scrubs at around
60 MB/s.

Jorge


* Re: RAID5 on SSDs - looking for advice
  2022-10-09 11:36 ` Qu Wenruo
  2022-10-09 12:56   ` Ochi
  2022-10-09 14:33   ` Jorge Bastos
@ 2023-02-06  2:34   ` me
  2023-02-06  3:05     ` Qu Wenruo
  2 siblings, 1 reply; 13+ messages in thread
From: me @ 2023-02-06  2:34 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Ochi, linux-btrfs

Apologies for the duplicate; I sent the last reply in HTML by mistake.
Take two, lol.

Given that 6.2 basically has fixes for the RMW, at least for RAID5, and
apart from the scrub performance deficiencies and the write hole, are
there any other gotchas to be aware of? This mailing list post <
https://lore.kernel.org/linux-btrfs/20200627030614.GW10769@hungrycats.org/>
listed several concerning bugs, like "spurious degraded read failure",
which worries me because I'm hoping to use Btrfs RAID5 for a media
server pool and it would be nice to have it be usable when degraded,
i.e. to still be able to read my data. How many of the bugs listed
there have since been fixed or addressed by the RMW fixes in 6.2?

Also, concerning NOCOW (no-csum data), assuming no device failure: if a
write to a NOCOW range gets out of sync with parity (i.e. due to a
crash/write hole), will scrub trust the NOCOW data indiscriminately and
update the parity, or does it get ignored, like how NOCOW is basically
ignored in RAID1?


On Sun, Oct 9, 2022 at 8:36 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/10/9 18:34, Ochi wrote:
> > Hello,
> >
> > I'm currently thinking about migrating my home NAS to SSDs only. As a
> > compromise between space efficiency and redundancy, I'm thinking about:
> >
> > - using RAID5 for data and RAID1 for metadata on a couple of SSDs (3 or
> > 4 for now, with the option to expand later),
>
> Btrfs RAID56 is not safe against the following problems:
>
> - Multi-device data sync (aka, write hole)
>    Every time a power loss happens, some RAID56 writes may get de-
>    synchronized.
>
>    Unlike mdraid, we don't have journal/bitmap at all for now.
>    We already have a PoC write-intent bitmap.
>
> - Destructive RMW
>    This can happen when some of the existing data is corrupted (can be
>    caused by above write-hole, or bitrot.
>
>    In that case, if we have write into the vertical stripe, we will
>    make the original corruption further spread into the P/Q stripes,
>    completely killing the possibility to recover the data.
>
>    This is for all RAID56, including mdraid56, but we're already working
>    on this, to do full verification before a RMW cycle.
>
> - Extra IO for RAID56 scrub.
>    It will cause at least twice amount of data read for RAID5, three
>    times for RAID6, thus it can be very slow scrubbing the fs.
>
>    We're aware of this problem, and have one purposal to address it.
>
>    You may see some advice to only scrub one device one time to speed
>    things up. But the truth is, it's causing more IO, and it will
>    not ensure your data is correct if you just scrub one device.
>
>    Thus if you're going to use btrfs RAID56, you have not only to do
>    periodical scrub, but also need to endure the slow scrub performance
>    for now.
>
>
> > - using compression to get the most out of the relatively expensive SSD
> > storage,
> > - encrypting each drive seperately below the FS level using LUKS (with
> > discard enabled).
> >
> > The NAS is regularly backed up to another NAS with spinning disks that
> > runs a btrfs RAID1 and takes daily snapshots.
> >
> > I have a few questions regarding this approach which I hope someone with
> > more insight into btrfs can answer me:
> >
> > 1. Are there any known issues regarding discard/TRIM in a RAID5 setup?
>
> Btrfs doesn't support TRIM inside RAID56 block groups at all.
>
> Trim will only work for the unallocated space of each disk, and the
> unused space inside the METADATA RAID1 block groups.
>
> > Is discard implemented on a lower level that is independent of the
> > actual RAID level used? The very, very old initial merge announcement
> > [1] stated that discard support was missing back then. Is it implemented
> > now?
> >
> > 2. How is the parity data calculated when compression is in use? Is it
> > calculated on the data _after_ compression? In particular, is the parity
> > data expected to have the same size as the _compressed_ data?
>
> To your question, P/Q is calculated after compression.
>
> Btrfs and mdraid56, they work at block layer, thus they don't care the
> data size of your write.(although full-stripe aligned write is way
> better for performance)
>
> All writes (only considering the real writes which will go to physical
> disks, thus the compressed data) will first be split using full stripe
> size, then go either full-stripe write path or sub-stripe write.
>
> >
> > 3. Are there any other known issues that come to mind regarding this
> > particular setup, or do you have any other advice?
>
> We recently fixed a bug that read time repair for compressed data is not
> really as robust as we think.
> E.g. the corruption in compressed data is interleaved (like sector 1 is
> corrupted in mirror 1, sector 2 is corrupted in mirror 2).
>
> In that case, we will consider the full compressed data as corrupted,
> but in fact we should be able to repair it.
>
> You may want to use newer kernel with that fixed if you're going to use
> compression.
>
> >
> > [1] https://lwn.net/Articles/536038/
> >
> > Best regards
> > Ochi


* Re: RAID5 on SSDs - looking for advice
  2023-02-06  2:34   ` me
@ 2023-02-06  3:05     ` Qu Wenruo
  2023-02-09 23:12       ` me
  0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2023-02-06  3:05 UTC (permalink / raw)
  To: me; +Cc: Ochi, linux-btrfs



On 2023/2/6 10:34, me@jse.io wrote:
> Apologies for the duplicate, I sent the last reply in HTML by mistake.
> Take two lol.
> 
> Given that 6.2 basically has fixes for the RMW at least for RAID5, apart
> from scrub performance deficiencies and the write hole, are there any other
> gotchas to be aware of?

Firstly, 6.2 only handles the RMW better for data.
There is no easy way to properly handle metadata, thus it's still not 
recommended to use RAID56 for metadata.

But still, things like parity update failure and read-repair failure 
should be fixed with the RMW fixes.

Secondly, the write hole is not yet fixed; the RMW fix greatly 
mitigates the problem, but it is not a full fix.

The other ones look like regular scrub interface bugs.

> This mailing list post <
> https://lore.kernel.org/linux-btrfs/20200627030614.GW10769@hungrycats.org/>
> listed several concerning bugs, like "spurious degraded read failure" which
> is a concerning bug for me as I'm hoping to use Btrfs RAID5 for a media
> server pool and it would be nice to have it be usable when degraded
> without. It would be nice to be able to read my data when degraded. How
> many of these bugs listed here have since been fixed or addressed by the
> RMW fixes in 6.2?
> 
> Also concerning NOCOW (nocsum data), assuming no device failure, if a write
> to a NOCOW range gets out of sync with parity (ie, due to a crash/write
> hole) will scrub trust NOCOW data indiscriminately and update the parity,
> or does it get ignored like how NOCOW is basically ignored in RAID1?

NOCOW/NOCSUM is not recommended: with or without the RMW fix, we trust 
anything we read from disk if there is no csum to verify against.

Our trust priority is:

  data with csum (whether the check passes or not, as we re-check after
  repair) > data without csum (if the read passes, trust it) > parity

Thus data without csum can only be repaired if the read itself failed.
And if such data without csum mismatches the parity, we always update 
the parity unconditionally.
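
In other words (a rough sketch of that preference order, not the actual
scrub code):

def scrub_verdict(read_ok: bool, has_csum: bool, csum_ok: bool,
                  parity_matches: bool) -> str:
    """Toy model of the trust order: csum data > csum-less data > parity."""
    if not read_ok:
        return "rebuild from parity"        # only repair path for csum-less data
    if has_csum:
        return "keep data" if csum_ok else "rebuild from parity, re-check csum"
    # NOCOW / csum-less data that read fine is trusted as-is ...
    if not parity_matches:
        return "keep data, rewrite parity"  # ... and the parity is updated
    return "keep data"

# A stale NOCOW block left behind by a crash still reads successfully,
# so scrub keeps it and updates the parity to match it:
print(scrub_verdict(read_ok=True, has_csum=False, csum_ok=False,
                    parity_matches=False))  # -> keep data, rewrite parity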

Thanks,
Qu

> 
> 
> On Sun, Oct 9, 2022 at 8:36 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2022/10/9 18:34, Ochi wrote:
>>> Hello,
>>>
>>> I'm currently thinking about migrating my home NAS to SSDs only. As a
>>> compromise between space efficiency and redundancy, I'm thinking about:
>>>
>>> - using RAID5 for data and RAID1 for metadata on a couple of SSDs (3 or
>>> 4 for now, with the option to expand later),
>>
>> Btrfs RAID56 is not safe against the following problems:
>>
>> - Multi-device data sync (aka, write hole)
>>     Every time a power loss happens, some RAID56 writes may get de-
>>     synchronized.
>>
>>     Unlike mdraid, we don't have journal/bitmap at all for now.
>>     We already have a PoC write-intent bitmap.
>>
>> - Destructive RMW
>>     This can happen when some of the existing data is corrupted (can be
>>     caused by above write-hole, or bitrot.
>>
>>     In that case, if we have write into the vertical stripe, we will
>>     make the original corruption further spread into the P/Q stripes,
>>     completely killing the possibility to recover the data.
>>
>>     This is for all RAID56, including mdraid56, but we're already working
>>     on this, to do full verification before a RMW cycle.
>>
>> - Extra IO for RAID56 scrub.
>>     It will cause at least twice amount of data read for RAID5, three
>>     times for RAID6, thus it can be very slow scrubbing the fs.
>>
>>     We're aware of this problem, and have one purposal to address it.
>>
>>     You may see some advice to only scrub one device one time to speed
>>     things up. But the truth is, it's causing more IO, and it will
>>     not ensure your data is correct if you just scrub one device.
>>
>>     Thus if you're going to use btrfs RAID56, you have not only to do
>>     periodical scrub, but also need to endure the slow scrub performance
>>     for now.
>>
>>
>>> - using compression to get the most out of the relatively expensive SSD
>>> storage,
>>> - encrypting each drive seperately below the FS level using LUKS (with
>>> discard enabled).
>>>
>>> The NAS is regularly backed up to another NAS with spinning disks that
>>> runs a btrfs RAID1 and takes daily snapshots.
>>>
>>> I have a few questions regarding this approach which I hope someone with
>>> more insight into btrfs can answer me:
>>>
>>> 1. Are there any known issues regarding discard/TRIM in a RAID5 setup?
>>
>> Btrfs doesn't support TRIM inside RAID56 block groups at all.
>>
>> Trim will only work for the unallocated space of each disk, and the
>> unused space inside the METADATA RAID1 block groups.
>>
>>> Is discard implemented on a lower level that is independent of the
>>> actual RAID level used? The very, very old initial merge announcement
>>> [1] stated that discard support was missing back then. Is it implemented
>>> now?
>>>
>>> 2. How is the parity data calculated when compression is in use? Is it
>>> calculated on the data _after_ compression? In particular, is the parity
>>> data expected to have the same size as the _compressed_ data?
>>
>> To your question, P/Q is calculated after compression.
>>
>> Btrfs and mdraid56, they work at block layer, thus they don't care the
>> data size of your write.(although full-stripe aligned write is way
>> better for performance)
>>
>> All writes (only considering the real writes which will go to physical
>> disks, thus the compressed data) will first be split using full stripe
>> size, then go either full-stripe write path or sub-stripe write.
>>
>>>
>>> 3. Are there any other known issues that come to mind regarding this
>>> particular setup, or do you have any other advice?
>>
>> We recently fixed a bug that read time repair for compressed data is not
>> really as robust as we think.
>> E.g. the corruption in compressed data is interleaved (like sector 1 is
>> corrupted in mirror 1, sector 2 is corrupted in mirror 2).
>>
>> In that case, we will consider the full compressed data as corrupted,
>> but in fact we should be able to repair it.
>>
>> You may want to use newer kernel with that fixed if you're going to use
>> compression.
>>
>>>
>>> [1] https://lwn.net/Articles/536038/
>>>
>>> Best regards
>>> Ochi


* Re: RAID5 on SSDs - looking for advice
  2023-02-06  3:05     ` Qu Wenruo
@ 2023-02-09 23:12       ` me
  2023-02-09 23:23         ` Remi Gauvin
  0 siblings, 1 reply; 13+ messages in thread
From: me @ 2023-02-09 23:12 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Ochi, linux-btrfs

You know, NOCOW is plagued with issues like this. Imo, it really seems
half-baked, particularly around Btrfs RAID. Not only this, but it also
has a "write hole"-like issue on the other RAID profiles, since writes
are not atomic, there is no bitmap to track dirty blocks until all
redundant copies are written, and there is no way for scrub to resync
correctly in cases where it could. Would it be possible to have a mount
option like nodatacow, but doing the opposite: ignoring the nocow
attribute and performing COW+csumming regardless?

Perhaps extend datacow to work like this: datacow=on (the default) and
datacow=always to prevent NOCOW, sort of like discard and
discard=async? It seems asinine to me that something as critical to the
data integrity that Btrfs is supposed to help protect can be bypassed
from unprivileged userspace with a simple attribute, even against the
admin's intention. It's especially infuriating since so many programs
set it lately without asking (e.g. systemd-tmpfiles or libvirt); if you
use containers and btrfs subvolumes, you have to configure every
container specifically just to prevent this.
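
For what it's worth, here is a rough sketch of auditing a tree for files
that already got the NOCOW attribute behind your back. It just shells
out to lsattr and looks for the 'C' flag; the path is only an example:

import os
import subprocess

def nocow_files(root: str):
    """Yield regular files under root that carry the NOCOW ('C') attribute."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                out = subprocess.run(["lsattr", "-d", path],
                                     capture_output=True, text=True,
                                     check=True).stdout
                flags = out.split()[0]
            except (subprocess.CalledProcessError, IndexError):
                continue          # special files, vanished files, etc.
            if "C" in flags:
                yield path

for p in nocow_files("/var/lib/libvirt/images"):   # example path
    print(p)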

On Sun, Feb 5, 2023 at 11:05 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2023/2/6 10:34, me@jse.io wrote:
> > Apologies for the duplicate, I sent the last reply in HTML by mistake.
> > Take two lol.
> >
> > Given that 6.2 basically has fixes for the RMW at least for RAID5, apart
> > from scrub performance deficiencies and the write hole, are there any other
> > gotchas to be aware of?
>
> Firstly, 6.2 would only handle the RMW better for data.
> There is no way to properly handle metadata easily, thus it's still not
> recommended to use RAID56 for metadata.
>
> But still, things like parity-update-failure, read-repair-failure should
> be fixed with the RMW fixes.
>
> Secondly the write hole is not yet fixed, the RMW fix would greately
> migrate the problem, but not a full fix.
>
> Other ones look like regular scrub interface bugs.
>
> > This mailing list post <
> > https://lore.kernel.org/linux-btrfs/20200627030614.GW10769@hungrycats.org/>
> > listed several concerning bugs, like "spurious degraded read failure" which
> > is a concerning bug for me as I'm hoping to use Btrfs RAID5 for a media
> > server pool and it would be nice to have it be usable when degraded
> > without. It would be nice to be able to read my data when degraded. How
> > many of these bugs listed here have since been fixed or addressed by the
> > RMW fixes in 6.2?
> >
> > Also concerning NOCOW (nocsum data), assuming no device failure, if a write
> > to a NOCOW range gets out of sync with parity (ie, due to a crash/write
> > hole) will scrub trust NOCOW data indiscriminately and update the parity,
> > or does it get ignored like how NOCOW is basically ignored in RAID1?
>
> NOCOW/NOCSUM is not recommended, as even with or without the RMW fix, we
> trust anything we read from disk if there is no csum to verify against.
>
> Our trust priority is:
>
> Data with csum (no matter pass or not, as we would recheck after repair)
>  > Data without csum (read pass, then trust it) > Parity
>
> Thus data without csum can only be repaired if the read itself failed.
> And if such data without csum has mismatch with parity, we always update
> parity unconditionally.
>
> Thanks,
> Qu
>
> >
> >
> > On Sun, Oct 9, 2022 at 8:36 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2022/10/9 18:34, Ochi wrote:
> >>> Hello,
> >>>
> >>> I'm currently thinking about migrating my home NAS to SSDs only. As a
> >>> compromise between space efficiency and redundancy, I'm thinking about:
> >>>
> >>> - using RAID5 for data and RAID1 for metadata on a couple of SSDs (3 or
> >>> 4 for now, with the option to expand later),
> >>
> >> Btrfs RAID56 is not safe against the following problems:
> >>
> >> - Multi-device data sync (aka, write hole)
> >>     Every time a power loss happens, some RAID56 writes may get de-
> >>     synchronized.
> >>
> >>     Unlike mdraid, we don't have journal/bitmap at all for now.
> >>     We already have a PoC write-intent bitmap.
> >>
> >> - Destructive RMW
> >>     This can happen when some of the existing data is corrupted (can be
> >>     caused by above write-hole, or bitrot.
> >>
> >>     In that case, if we have write into the vertical stripe, we will
> >>     make the original corruption further spread into the P/Q stripes,
> >>     completely killing the possibility to recover the data.
> >>
> >>     This is for all RAID56, including mdraid56, but we're already working
> >>     on this, to do full verification before a RMW cycle.
> >>
> >> - Extra IO for RAID56 scrub.
> >>     It will cause at least twice amount of data read for RAID5, three
> >>     times for RAID6, thus it can be very slow scrubbing the fs.
> >>
> >>     We're aware of this problem, and have one purposal to address it.
> >>
> >>     You may see some advice to only scrub one device one time to speed
> >>     things up. But the truth is, it's causing more IO, and it will
> >>     not ensure your data is correct if you just scrub one device.
> >>
> >>     Thus if you're going to use btrfs RAID56, you have not only to do
> >>     periodical scrub, but also need to endure the slow scrub performance
> >>     for now.
> >>
> >>
> >>> - using compression to get the most out of the relatively expensive SSD
> >>> storage,
> >>> - encrypting each drive seperately below the FS level using LUKS (with
> >>> discard enabled).
> >>>
> >>> The NAS is regularly backed up to another NAS with spinning disks that
> >>> runs a btrfs RAID1 and takes daily snapshots.
> >>>
> >>> I have a few questions regarding this approach which I hope someone with
> >>> more insight into btrfs can answer me:
> >>>
> >>> 1. Are there any known issues regarding discard/TRIM in a RAID5 setup?
> >>
> >> Btrfs doesn't support TRIM inside RAID56 block groups at all.
> >>
> >> Trim will only work for the unallocated space of each disk, and the
> >> unused space inside the METADATA RAID1 block groups.
> >>
> >>> Is discard implemented on a lower level that is independent of the
> >>> actual RAID level used? The very, very old initial merge announcement
> >>> [1] stated that discard support was missing back then. Is it implemented
> >>> now?
> >>>
> >>> 2. How is the parity data calculated when compression is in use? Is it
> >>> calculated on the data _after_ compression? In particular, is the parity
> >>> data expected to have the same size as the _compressed_ data?
> >>
> >> To your question, P/Q is calculated after compression.
> >>
> >> Btrfs and mdraid56, they work at block layer, thus they don't care the
> >> data size of your write.(although full-stripe aligned write is way
> >> better for performance)
> >>
> >> All writes (only considering the real writes which will go to physical
> >> disks, thus the compressed data) will first be split using full stripe
> >> size, then go either full-stripe write path or sub-stripe write.
> >>
> >>>
> >>> 3. Are there any other known issues that come to mind regarding this
> >>> particular setup, or do you have any other advice?
> >>
> >> We recently fixed a bug that read time repair for compressed data is not
> >> really as robust as we think.
> >> E.g. the corruption in compressed data is interleaved (like sector 1 is
> >> corrupted in mirror 1, sector 2 is corrupted in mirror 2).
> >>
> >> In that case, we will consider the full compressed data as corrupted,
> >> but in fact we should be able to repair it.
> >>
> >> You may want to use newer kernel with that fixed if you're going to use
> >> compression.
> >>
> >>>
> >>> [1] https://lwn.net/Articles/536038/
> >>>
> >>> Best regards
> >>> Ochi


* Re: RAID5 on SSDs - looking for advice
  2023-02-09 23:12       ` me
@ 2023-02-09 23:23         ` Remi Gauvin
  0 siblings, 0 replies; 13+ messages in thread
From: Remi Gauvin @ 2023-02-09 23:23 UTC (permalink / raw)
  To: linux-btrfs

On 2023-02-09 6:12 p.m., me@jse.io wrote:
> You know, NOCOW is plagued with issues like this. Imo, it really seems
> half baked, particularly around Btrfs RAID. Not only this, but the
> fact it has a "write hole" like issue on other RAID profiles since
> writes are not atomic, there is no bitmap to track dirty blocks until
> all redundant copies are written, and no way for scrub to resync
> correctly in cases where we could. Would it be possible to have a
> mount option like nodatacow, but does the opposite: it would ignore
> the nocow attribute and perform COW+csuming regardless?
> 
> Perhaps extend datacow to work like this: datacow=on (the default) and
> datacow=always to prevent NOCOW, sort of like discard and
> discard=async? It seems asinine to me that something as critical to
> data integrity which Btrfs is supposed to help protect can be bypassed
> in unprivileged userspace all with a simple attribute, even against
> the admins intention. It's especially infuriating since so many
> programs do it lately without (ie systemd-tmpfiles, or libvirt), if
> you use containers and btrfs subvolumes, then you gotta configure
> every container specifically just to prevent this.


I think an option to force COW would be good, but in my opinion,
mirrored RAID should, by itself, force COW regardless of any option.
BTRFS RAID is... ridiculous with NoCOW. RAID that results in
inconsistent copies that can't even be synchronized with a scrub is
pathological.



end of thread, other threads:[~2023-02-09 23:23 UTC | newest]

Thread overview: 13+ messages
-- links below jump to the message on this page --
2022-10-09 10:34 RAID5 on SSDs - looking for advice Ochi
2022-10-09 11:36 ` Qu Wenruo
2022-10-09 12:56   ` Ochi
2022-10-09 13:01     ` Forza
2022-10-09 13:16       ` Ochi
2022-10-09 14:33   ` Jorge Bastos
2023-02-06  2:34   ` me
2023-02-06  3:05     ` Qu Wenruo
2023-02-09 23:12       ` me
2023-02-09 23:23         ` Remi Gauvin
2022-10-09 11:42 ` Roman Mamedov
2022-10-09 13:12   ` Ochi
2022-10-09 13:44 ` waxhead
