* using raid56 on a private machine
@ 2020-10-05 16:59 cryptearth
  2020-10-05 17:57 ` Goffredo Baroncelli
  0 siblings, 1 reply; 8+ messages in thread
From: cryptearth @ 2020-10-05 16:59 UTC (permalink / raw)
  To: linux-btrfs

Hello there,

as I plan to use an 8-drive RAID6 with BtrFS I'd like to ask about the 
current status of BtrFS RAID5/6 support, or whether I should go with a 
more traditional mdadm array.

The general status page on the btrfs wiki shows "unstable" for RAID5/6, 
and its specific pages mention some issues marked as "not production 
ready". It also says not to use it for metadata but only for the actual 
data.

I plan to use it for my own personal system at home - and I do 
understand that RAID is no replacement for a backup - but I'd rather ask 
upfront whether it's ready to use before I run into issues using it.
I had already planned to go with a more "traditional" mdadm setup and 
just format the resulting volume with ext4, but when I asked about that, 
many people actually suggested using a modern filesystem like BtrFS or 
ZFS instead of "old school RAID".

Do you have any advice for me on using BtrFS with RAID6 vs mdadm or ZFS?

I also don't really understand the difference between metadata, data, 
and system, and why it matters. When I set up a volume and only define 
RAID6 for the data, metadata and system default to RAID1 - but doesn't 
this mean that the important metadata is only stored on two drives 
instead of spread across all drives like in a regular RAID6? This would 
somewhat negate the benefit of RAID6 of withstanding a double failure, 
like a second drive failing while rebuilding the first failed one.

Any information appreciated.


Greetings from Germany,

Matt


* Re: using raid56 on a private machine
  2020-10-05 16:59 using raid56 on a private machine cryptearth
@ 2020-10-05 17:57 ` Goffredo Baroncelli
  2020-10-05 19:22   ` cryptearth
  2020-10-06  1:24   ` Zygo Blaxell
  0 siblings, 2 replies; 8+ messages in thread
From: Goffredo Baroncelli @ 2020-10-05 17:57 UTC (permalink / raw)
  To: cryptearth, linux-btrfs; +Cc: Josef Bacik, Zygo Blaxell

On 10/5/20 6:59 PM, cryptearth wrote:
> Hello there,
> 
> as I plan to use a 8 drive RAID6 with BtrFS I'd like to ask about the current status of BtrFS RAID5/6 support or if I should go with a more traditional mdadm array.
> 
> The general status page on the btrfs wiki shows "unstable" for RAID5/6, and it's specific pages mentions some issue marked as "not production ready". It also says to not use it for the metadata but only for the actual data.
> 
> I plan to use it for my own personal system at home - and I do understand that RAID is no replacement for a backup, but I'd rather like to ask upfront if it's ready to use before I encounter issues when I use it.
> I already had the plan about using a more "traditional" mdadm setup and just format the resulting volume with ext4, but as I asked about that many actually suggested to me to rather use modern filesystems like BtrFS or ZFS instead of "old school RAID".
> 
> Do you have any help for me about using BtrFS with RAID6 vs mdadm or ZFS?

Zygo collected some useful information about RAID5/6:

https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/

However, more recently Josef (one of the main developers) declared that BTRFS with RAID5/6 has "...some dark and scary corners..."

https://lore.kernel.org/linux-btrfs/bf9594ea55ce40af80548888070427ad97daf78a.1598374255.git.josef@toxicpanda.com/

> 
> I also don't really understand why and what's the difference between metadata, data, and system.
> When I set up a volume only define RAID6 for the data it sets metadata and systemdata default to RAID1, but doesn't this mean that those important metadata are only stored on two drives instead of spread accross all drives like in a regular RAID6? This would somewhat negate the benefit of RAID6 to withstand a double failure like a 2nd drive fail while rebuilding the first failed one.

Correct. In fact Zygo suggested to use RAID6 + RAID1C3.
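
For illustration, creating a filesystem with that layout could look
roughly like the sketch below. It assumes a kernel and btrfs-progs new
enough to know the raid1c3 profile (5.4 or later); the device names and
mount point are placeholders.

    # /dev/sd[b-i] and /mnt/array are placeholders for the 8 drives
    # data striped with double parity, metadata kept as three copies
    mkfs.btrfs -d raid6 -m raid1c3 /dev/sd[b-i]
    mount /dev/sdb /mnt/array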

I have only a few suggestions:
1) don't store valuable data on BTRFS with the raid5/6 profile. Use it if you want to experiment and want to help the development of BTRFS, but be ready to face the loss of all data (very unlikely, but the bigger the filesystem, the more difficult restoring the data becomes in case of problems).
2) don't fill the filesystem more than 70-80%. If you go beyond this limit, the likelihood of hitting the "dark and scary corners" quickly increases.
3) run scrub periodically and after a power failure; better yet, use an uninterruptible power supply (this is true for all RAID, even the MD one).
4) I don't have any data to support this, but as an occasional reader of this mailing list I have the feeling that combining BTRFS with LUKS (or bcache) raises the likelihood of a problem.
5) be aware that an 8-disk raid raises the likelihood of a failure by roughly an order of magnitude compared to a single disk! RAID6 (or any other RAID) mitigates that, in the sense that it creates a time window in which maintenance (e.g. a disk replacement) is possible before data is lost.
6) leave room in the disk array for an additional disk (to use when a disk replacement is needed)
7) avoid USB disks, because they are not reliable


> 
> Any information appreciated.
> 
> 
> Greetings from Germany,
> 
> Matt


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: using raid56 on a private machine
  2020-10-05 17:57 ` Goffredo Baroncelli
@ 2020-10-05 19:22   ` cryptearth
  2020-10-06  1:24   ` Zygo Blaxell
  1 sibling, 0 replies; 8+ messages in thread
From: cryptearth @ 2020-10-05 19:22 UTC (permalink / raw)
  To: linux-btrfs

Hello Goffredo,

thank you for the quick reply, I didn't expect one this late.

Thanks for the provided information.
I'm a bit surprised by the current status of RAID5/6 support, as I 
thought it had already advanced to at least a "reliable" state where 
data recovery is as straightforward as with a "regular" array when a 
drive fails. Reading that there is still a risk of catastrophic total 
data loss is not what I was prepared for.

When I tried to set up a test array in a VM I was unable to set raid1c3/4 
for metadata - I got an error that it's an unknown profile. I'm using 
OpenSUSE Leap 15.2 running kernel 5.3.18. According to zypper the btrfs 
package version is 4.19.1 - so I'm at the state of December 2018. 
According to the changelog, raid1c3/4 were added with 5.4 in December 
2019. As I use Linux only on my server, and OpenSUSE specifically for 
being easy to use thanks to its overall "control center" YaST, I'm 
unsure whether it's possible to simply "upgrade" to a more recent version 
the way one installs a newer version of some software on Windows. Maybe a 
different distribution with a more recent kernel and packages would fit 
my requirements better than relying on the rather outdated packages the 
SuSE devs put together, although it would be a rather steep learning 
curve and most likely a lot of copy'n'paste from various wikis.

As for your suggestions: some good advice in there - but some of it is 
currently just not feasible (like running a UPS). As for the increased 
risk of failure due to the number of drives: I'm currently running a 
5-drive RAID5 on a so-called "fakeraid" provided by the chipset of my 
motherboard (specifically AMD SB950 / 990FX - AM3+ platform) which relies 
on a Windows-7-only driver from AMD (don't ask me why - I'm not able to 
get it running with Windows 10) - and I've already had at least 3 or 
even 4 failures (can't remember exactly right now). The provided software 
has the advantage of active drive monitoring: when something goes wrong, 
the failing drive is taken offline and the array is set to a critical 
mode. As it's only RAID5, each rebuild is kind of a gamble, hoping not to 
encounter another failure before it finishes - that's why I plan to 
change over to RAID6. Going up to 8 drives is just an unrelated choice, 
as that's the number of drive bays my case has and many HBAs seem to 
offer ports in powers of 2 (2, 4, 8, 16, and so on). Do you see any 
problem with that? Or would running a RAID6 with only 6 drives be any 
more beneficial than an 8-drive array?

Although this is the btrfs list - and it seems btrfs is not really ready 
for running in RAID5/6-like modes - ZFS is also quite widespread: would 
it be a better idea to use that instead?


Greetings,

Matt

On 10/5/20 7:57 PM, Goffredo Baroncelli wrote:
> On 10/5/20 6:59 PM, cryptearth wrote:
>> Hello there,
>>
>> as I plan to use a 8 drive RAID6 with BtrFS I'd like to ask about the 
>> current status of BtrFS RAID5/6 support or if I should go with a more 
>> traditional mdadm array.
>>
>> The general status page on the btrfs wiki shows "unstable" for 
>> RAID5/6, and it's specific pages mentions some issue marked as "not 
>> production ready". It also says to not use it for the metadata but 
>> only for the actual data.
>>
>> I plan to use it for my own personal system at home - and I do 
>> understand that RAID is no replacement for a backup, but I'd rather 
>> like to ask upfront if it's ready to use before I encounter issues 
>> when I use it.
>> I already had the plan about using a more "traditional" mdadm setup 
>> and just format the resulting volume with ext4, but as I asked about 
>> that many actually suggested to me to rather use modern filesystems 
>> like BtrFS or ZFS instead of "old school RAID".
>>
>> Do you have any help for me about using BtrFS with RAID6 vs mdadm or 
>> ZFS?
>
> Zygo collected some useful information about RAID5/6:
>
> https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/ 
>
>
> However more recently Josef (one of the main developers), declared 
> that BTRFS with RAID5/6 has  "...some dark and scary corners..."
>
> https://lore.kernel.org/linux-btrfs/bf9594ea55ce40af80548888070427ad97daf78a.1598374255.git.josef@toxicpanda.com/ 
>
>
>>
>> I also don't really understand why and what's the difference between 
>> metadata, data, and system.
>> When I set up a volume only define RAID6 for the data it sets 
>> metadata and systemdata default to RAID1, but doesn't this mean that 
>> those important metadata are only stored on two drives instead of 
>> spread accross all drives like in a regular RAID6? This would 
>> somewhat negate the benefit of RAID6 to withstand a double failure 
>> like a 2nd drive fail while rebuilding the first failed one.
>
> Correct. In fact Zygo suggested to user RAID6 + RAID1C3.
>
> I have only few suggestions:
> 1) don't store valuable data on BTRFS with raid5/6 profile. Use it if 
> you want to experiment and want to help the development of BTRFS. But 
> be ready to face the lost of all data. (very unlikely, but more the 
> size of the filesystem is big, more difficult is a restore of the data 
> in case of problem).
> 2) doesn't fill the filesystem more than 70-80%. If you go further 
> this limit the likelihood to catch the "dark and scary corners" 
> quickly increases.
> 3) run scrub periodically and after a power failure ; better to use an 
> uninterruptible power supply (this is true for all the RAID, even the 
> MD one).
> 4) I don't have any data to support this; but as occasional reader of 
> this mailing list I have the feeling that combing BTRFS with LUCKS(or 
> bcache) raises the likelihood of a problem.
> 5) pay attention that having an 8 disks raid, raises the likelihood of 
> a failure of about an order of magnitude more than a single disk ! 
> RAID6 (or any other RAID) mitigates that, in the sense that it creates 
> a time window where it is possible to make maintenance (e.g. a disk 
> replacement) before the lost of data.
> 6) leave the room in the disks array for an additional disk (to use 
> when a disk replacement is needed)
> 7) avoid the USB disks, because these are not reliable
>
>
>>
>> Any information appreciated.
>>
>>
>> Greetings from Germany,
>>
>> Matt
>
>



* Re: using raid56 on a private machine
  2020-10-05 17:57 ` Goffredo Baroncelli
  2020-10-05 19:22   ` cryptearth
@ 2020-10-06  1:24   ` Zygo Blaxell
  2020-10-06  5:50     ` cryptearth
  2020-10-06 17:12     ` Goffredo Baroncelli
  1 sibling, 2 replies; 8+ messages in thread
From: Zygo Blaxell @ 2020-10-06  1:24 UTC (permalink / raw)
  To: kreijack; +Cc: cryptearth, linux-btrfs, Josef Bacik

On Mon, Oct 05, 2020 at 07:57:51PM +0200, Goffredo Baroncelli wrote:
> On 10/5/20 6:59 PM, cryptearth wrote:
> > Hello there,
> > 
> > as I plan to use a 8 drive RAID6 with BtrFS I'd like to ask about
> > the current status of BtrFS RAID5/6 support or if I should go with a
> > more traditional mdadm array.

Definitely do not use a single mdadm raid6 array with btrfs.  It is
equivalent to running btrfs with raid6 metadata:  mdadm cannot recover
from data corruption on the disks, and btrfs cannot recover from
write hole issues in degraded mode.  Any failure messier than a total
instantaneous disk failure will probably break the filesystem.

> > The general status page on the btrfs wiki shows "unstable" for
> > RAID5/6, and it's specific pages mentions some issue marked as "not
> > production ready". It also says to not use it for the metadata but
> > only for the actual data.

That's correct.  Very briefly, the issues are:

	1.  Reads don't work properly in degraded mode.

	2.  The admin tools are incomplete.

	3.  The diagnostic tools are broken.

	4.  It is not possible to recover from all theoretically
	recoverable failure events.

Items 1 and 4 make raid5/6 unusable for metadata (total filesystem loss
is likely).  Use raid1 or raid1c3 for metadata instead.  This is likely
a good idea even if all the known issues are fixed--metadata access
patterns don't perform well with raid5/6, and the most likely proposals
to solve the raid5/6 problems will require raid1/raid1c3 metadata to
store an update journal.

If your application can tolerate small data losses correlated with disk
failures (i.e. you can restore a file from backup every year if required,
and you have no requirement for data availability while replacing disks)
then you can use raid5 now; otherwise, btrfs raid5/6 is not ready yet.

> > I plan to use it for my own personal system at home - and I do
> > understand that RAID is no replacement for a backup, but I'd rather
> > like to ask upfront if it's ready to use before I encounter issues
> > when I use it.
> > I already had the plan about using a more "traditional" mdadm setup
> > and just format the resulting volume with ext4, but as I asked about
> > that many actually suggested to me to rather use modern filesystems
> > like BtrFS or ZFS instead of "old school RAID".

Indeed, old school raid maximizes your probability of silent data loss by
allowing multiple disks to inject silent data loss failures and firmware
bug effects.

btrfs and ZFS store their own data integrity information, so they can
reliably identify failures on the drives.  If redundant storage is used,
they can recover automatically from failures the drives can't or won't
report.

> > Do you have any help for me about using BtrFS with RAID6 vs mdadm or ZFS?
> 
> Zygo collected some useful information about RAID5/6:
> 
> https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/
>
> However more recently Josef (one of the main developers), declared
> that BTRFS with RAID5/6 has  "...some dark and scary corners..."
> 
> https://lore.kernel.org/linux-btrfs/bf9594ea55ce40af80548888070427ad97daf78a.1598374255.git.josef@toxicpanda.com/

I think my list is a little more...concrete.  ;)

> > I also don't really understand why and what's the difference between
> > metadata, data, and system.
> > When I set up a volume only define RAID6 for the data it sets
> > metadata and systemdata default to RAID1, but doesn't this mean that
> > those important metadata are only stored on two drives instead of
> > spread accross all drives like in a regular RAID6? This would somewhat
> > negate the benefit of RAID6 to withstand a double failure like a 2nd
> > drive fail while rebuilding the first failed one.
> 
> Correct. In fact Zygo suggested to user RAID6 + RAID1C3.
> 
> I have only few suggestions:
> 1) don't store valuable data on BTRFS with raid5/6 profile. Use it if
> you want to experiment and want to help the development of BTRFS. But
> be ready to face the lost of all data. (very unlikely, but more the
> size of the filesystem is big, more difficult is a restore of the data
> in case of problem).

Losing all of the data seems unlikely given the bugs that exist so far.
The known issues are related to availability (it crashes a lot and
isn't fully usable in degraded mode) and small amounts of data loss
(like 5 blocks per TB).

The above assumes you never use raid5 or raid6 for btrfs metadata.  Using
raid5 or raid6 for metadata can result in total loss of the filesystem,
but you can use raid1 or raid1c3 for metadata instead.
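
As a quick sanity check (a sketch only - the mount point is a
placeholder), the profiles actually in use can be listed with:

    # prints one line per block group type, along the lines of
    # "Data, RAID6: ...", "Metadata, RAID1C3: ...", "System, RAID1C3: ..."
    btrfs filesystem df /mnt/array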

> 2) doesn't fill the filesystem more than 70-80%. If you go further
> this limit the likelihood to catch the "dark and scary corners"
> quickly increases.

Can you elaborate on that?  I run a lot of btrfs filesystems at 99%
capacity, some of the bigger ones even higher.  If there were issues at
80% I expect I would have noticed them.  There were some performance
issues with full filesystems on kernels using space_cache=v1, but
space_cache=v2 went upstream 4 years ago, and other significant
performance problems were fixed a year before that.

The last few GB is a bit of a performance disaster and there are
some other gotchas, but that's an absolute number, not a percentage.

Never balance metadata.  That is a ticket to a dark and scary corner.
Make sure you don't do it, and that you don't accidentally install a
cron job that does it.
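
For comparison, a data-only balance restricted with the -d filter leaves
metadata block groups untouched - a minimal sketch, with a placeholder
mount point:

    # compact only data block groups that are less than half full;
    # no -m/-s filter is given, so metadata and system chunks are left alone
    btrfs balance start -dusage=50 /mnt/array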

> 3) run scrub periodically and after a power failure ; better to use
> an uninterruptible power supply (this is true for all the RAID, even
> the MD one).

scrub also provides early warning of disk failure, and detects disks
that are silently corrupting your data.  It should be run not less than
once a month, though you can skip months where you've already run a
full-filesystem read for other reasons (e.g. replacing a failed disk).
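
A minimal way to automate that is a monthly cron entry - a sketch only,
with the schedule and mount point as placeholders (/etc/cron.d format):

    # run a scrub on the 1st of every month at 03:00; -B keeps it in the
    # foreground so cron mails the final statistics when it finishes
    0 3 1 * *  root  /usr/bin/btrfs scrub start -B /mnt/array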

> 4) I don't have any data to support this; but as occasional reader of
> this mailing list I have the feeling that combing BTRFS with LUCKS(or
> bcache) raises the likelihood of a problem.

I haven't seen that correlation.  All of my machines run at least one
btrfs on luks (dm-crypt).  The larger ones use lvmcache.  I've also run
bcache on test machines doing power-fail tests.

That said, there are additional hardware failure risks involved in
caching (more storage hardware components = more failures) and the
system must be designed to tolerate and recover from these failures.

When cache disks fail, just uncache and run scrub to repair.  btrfs
checksums will validate the data on the backing HDD (which will be badly
corrupted after a cache SSD failure) and will restore missing data from
other drives in the array.

It's definitely possible to configure bcache or lvmcache incorrectly,
and then you will have severe problems.  Each HDD must have a separate
dedicated SSD.  No sharing between cache devices is permitted.  They must
use separate cache pools.  If one SSD is used to cache two or more HDDs
and the SSD fails, it will behave the same as a multi-disk failure and
probably destroy the filesystem.  So don't do that.
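
With bcache, the one-SSD-per-HDD rule above would look something like
the following sketch (device names are placeholders; make-bcache comes
from bcache-tools):

    # each backing HDD gets its own dedicated cache SSD
    make-bcache -B /dev/sdb -C /dev/nvme0n1
    make-bcache -B /dev/sdc -C /dev/nvme1n1
    # btrfs is then created on the resulting /dev/bcache* devices,
    # not on the raw HDDs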

Note that firmware in the SSDs used for caching must respect write
ordering, or the cache will do severe damage to the filesystem on
just about every power failure.  It's a good idea to test hardware
in a separate system through a few power failures under load before
deploying them in production.  Most devices are OK, but a few percent
of models out there have problems so severe they'll damage a filesystem
in a single-digit number of power loss events.  It's fairly common to
encounter users who have lost a btrfs on their first or second power
failure with a problematic drive.  If you're stuck with one of these
disks, you can disable write caching and still use it, but there will
be added write latency, and in the long run it's better to upgrade to
a better disk model.
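
Disabling the drive's volatile write cache can be done per device - a
sketch for SATA drives, with a placeholder device name:

    # turn off the on-drive write cache, then verify the setting
    hdparm -W 0 /dev/sdb
    hdparm -W /dev/sdb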

> 5) pay attention that having an 8 disks raid, raises the likelihood of a
> failure of about an order of magnitude more than a single disk ! RAID6
> (or any other RAID) mitigates that, in the sense that it creates a
> time window where it is possible to make maintenance (e.g. a disk
> replacement) before the lost of data.
> 6) leave the room in the disks array for an additional disk (to use
> when a disk replacement is needed)
> 7) avoid the USB disks, because these are not reliable
> 
> 
> > 
> > Any information appreciated.
> > 
> > 
> > Greetings from Germany,
> > 
> > Matt
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 


* Re: using raid56 on a private machine
  2020-10-06  1:24   ` Zygo Blaxell
@ 2020-10-06  5:50     ` cryptearth
  2020-10-06 19:31       ` Zygo Blaxell
  2020-10-06 17:12     ` Goffredo Baroncelli
  1 sibling, 1 reply; 8+ messages in thread
From: cryptearth @ 2020-10-06  5:50 UTC (permalink / raw)
  To: linux-btrfs

Hello Zygo,

that's quite a lot of information I wasn't aware of.

// In advance: Sorry for the wall of text. That mail got a bit longer 
than I thought.

I guess one point I still have to get my head around is meta-blocks vs 
data-blocks: I don't even know if and how my current raid is capable of 
detecting other types of errors than instant failures, like corruption 
of structural metadata or of the actual data blocks themselves. Up until 
now I never encountered any data corruption, neither read nor write 
issues. I was always able to correctly read all data exactly as I had 
written it.

There's only one application that uses rather big files (only in the 
range of 1 GB - 3 GB) which keeps somehow corrupting itself (it's GTA V). 
The files that fail are files which, at least in my eyes, should only be 
opened for reading during runtime (as they contain assets like models 
and textures), but are actually opened in read/write mode, so I suspect 
that for some odd reason the game itself keeps writing to those files 
and thereby keeps corrupting itself. Other big files, like other kinds 
of images (all about 4 GB and more) or database files, which I also 
often read from and write to, never had any of these issues - but I 
guess that's just GTA V at its best; aside from some other rather 
strange CRM I had to use, it's one of the worst pieces of modern 
software I know.
The types of errors I encountered, and which led to me replacing the 
drives marked as failed, were like this: the monitoring software of this 
AMD fakeraid at some point pops up one of those notifications telling me 
that one of the drives failed, was set offline, and the raid was set to 
critical. Looking into the logs, it only says that some operation 
<hex-code> failed at some address <another hex-code> on some drive (port 
number) and that the BSL (bad sector list) was updated. This comes up a 
few times, followed by the line about that drive going offline - so it's 
a burst error. But even Google couldn't tell me what those operation 
codes mean. So I don't know whether a drive failed because of a read or 
write error, some parity or checksum issue, or some other reason. All 
the information I get is that there's a burst error, the drive is marked 
as bad, some list is updated and the array is set to a degraded state.
But unlike what I've tested so far with BtrFS, it's not like the array 
goes offline or isn't available anymore after a reboot. I can keep using 
it, and, as this RAID5 at least seems to always calculate and check some 
sort of checksum anyway, there isn't even a performance penalty. To get 
it running again, all I have to do is shut down the system, replace the 
drive, boot up again (that's due to the hardware - it doesn't support 
hotplug) and hit rebuild in the raid control panel - which takes only a 
couple of hours with my 3TB drives.
But, as already mentioned, since this is only RAID5, each rebuild is a 
gamble, hoping no other drive fails until the rebuild is finished. If 
another drive went bad during a rebuild, the data would most likely be 
lost. And recovery would be even harder as it's AMD's proprietary stuff - 
and from what I was able to find, AMD denied help even to businesses, 
let alone me as a private person. All I could do would be to replace the 
board with a compatible one, but I don't even know whether it just has 
to have an SB950 chipset or whether it has to be the exact same board. 
The "BIOS-level" interface seems to be implemented as an option ROM of 
its own - so it shouldn't depend on the specific board.

Anyway, long story short: I don't want to wait until that catastrophe 
occurs, but rather prevent it by changing my setup. Instead of relying 
on something fused onto the motherboard, my friend suggested using a 
simple "dumb" HBA and doing all the work in software, like Linux mdadm 
or BtrFS/ZFS. As I learned over the past few days while reading up on 
BtrFS' RAID-like capabilities, RAID isn't as simple as I thought until 
now: one (or more) drives can actually return corrupted data instead of 
just failing, and a typical hardware RAID controller or many "rather 
simple" software raid implementations can't tell the difference between 
the actual data and the bad data. As it was explained in some talk: some 
implementations work in a way that, if a data block becomes corrupted 
and the parity check fails, the parity - which could actually be used to 
recover the correct data - is thrown away and overwritten with new 
parity re-calculated from the corrupted data block. Hence filesystems 
like BtrFS and ZFS are recommended, as they have additional information 
like per-block checksums to tell whether the parity calculation failed 
or a data block became corrupted.

As ZFS seems to have some not-so-clear licensing issues preventing it 
from being included in the kernel, I took a look at BtrFS - which 
doesn't quite seem to fit my needs. Sure, I could go with RAID 1+0 - but 
that would still result in only about 12TB of usable space while 
actually throwing in 3 more 3TB drives, and I actually planned to 
increase the usable size of my array instead of just bumping its 
redundancy. As for metadata: I've read up on the RAID1 and RAID1c3/4 
profiles. Although RAID1c3 is recommended for a RAID6 (it would store 3 
copies, so there should be at least one copy left even after a double 
failure), is using RAID1c4 also an option? I wouldn't mind giving up a 
bit of the available space for an extra metadata copy if it helps 
prevent data loss in the case of a drive failure.

You also wrote to never balance metadata. But how would I restore the 
lost metadata after a drive replacement if I only re-balance the data 
blocks? Does it get updated and re-distributed in the background while 
the data blocks are restored during a rebuild? Or is this more like 
"redundancy builds up again over time through the regular algorithms"? 
I may still have something wrong in my understanding of "using multiple 
disks in one array", so currently I would expect that all data gets 
rebuilt - metadata as well - but I guess BtrFS works differently here?

Yes, I can tolerate loss of data, as I do have an extra backup of my 
important data, and as I only use it personally, I guess any data lost 
by not having a proper backup is on me anyway. But seeing BtrFS and ZFS 
used in arrays of 45 drives or so for crucial data with 
high-availability requirements, I'd like to find a way to set up my 
array RAID6-style, so it can withstand a double failure but is also 
still available during a rebuild. And although BtrFS looked promising 
during my tests, when I was able to mount and use an array with 
WinBtrFS - which would also solve my additional quest of finding some 
way to use the same volume on both Linux and Windows - it doesn't seem 
to be ready for my plan yet, or at least not with the knowledge I have 
so far. I'm open to any suggestions and explanations, as I obviously 
still have quite a lot to learn - and, if I do set up a BtrFS volume, 
I'll likely need some help doing it "the right way" for what I'd like.

Thanks to everyone, and sorry again for the rather long mail.

Matt

On 10/6/20 3:24 AM, Zygo Blaxell wrote:
> On Mon, Oct 05, 2020 at 07:57:51PM +0200, Goffredo Baroncelli wrote:
>> On 10/5/20 6:59 PM, cryptearth wrote:
>>> Hello there,
>>>
>>> as I plan to use a 8 drive RAID6 with BtrFS I'd like to ask about
>>> the current status of BtrFS RAID5/6 support or if I should go with a
>>> more traditional mdadm array.
> Definitely do not use a single mdadm raid6 array with btrfs.  It is
> equivalent to running btrfs with raid6 metadata:  mdadm cannot recover
> from data corruption on the disks, and btrfs cannot recover from
> write hole issues in degraded mode.  Any failure messier than a total
> instantaneous disk failure will probably break the filesystem.
>
>>> The general status page on the btrfs wiki shows "unstable" for
>>> RAID5/6, and it's specific pages mentions some issue marked as "not
>>> production ready". It also says to not use it for the metadata but
>>> only for the actual data.
> That's correct.  Very briefly, the issues are:
>
> 	1.  Reads don't work properly in degraded mode.
>
> 	2.  The admin tools are incomplete.
>
> 	3.  The diagnostic tools are broken.
>
> 	4.  It is not possible to recover from all theoretically
> 	recoverable failure events.
>
> Items 1 and 4 make raid5/6 unusable for metadata (total filesystem loss
> is likely).  Use raid1 or raid1c3 for metadata instead.  This is likely
> a good idea even if all the known issues are fixed--metadata access
> patterns don't perform well with raid5/6, and the most likely proposals
> to solve the raid5/6 problems will require raid1/raid1c3 metadata to
> store an update journal.
>
> If your application can tolerate small data losses correlated with disk
> failures (i.e. you can restore a file from backup every year if required,
> and you have no requirement for data availability while replacing disks)
> then you can use raid5 now; otherwise, btrfs raid5/6 is not ready yet.
>
>>> I plan to use it for my own personal system at home - and I do
>>> understand that RAID is no replacement for a backup, but I'd rather
>>> like to ask upfront if it's ready to use before I encounter issues
>>> when I use it.
>>> I already had the plan about using a more "traditional" mdadm setup
>>> and just format the resulting volume with ext4, but as I asked about
>>> that many actually suggested to me to rather use modern filesystems
>>> like BtrFS or ZFS instead of "old school RAID".
> Indeed, old school raid maximizes your probability of silent data loss by
> allowing multiple disks in inject silent data loss failures and firmware
> bug effects.
>
> btrfs and ZFS store their own data integrity information, so they can
> reliably identify failures on the drives.  If redundant storage is used,
> they can recover automatically from failures the drives can't or won't
> report.
>
>>> Do you have any help for me about using BtrFS with RAID6 vs mdadm or ZFS?
>> Zygo collected some useful information about RAID5/6:
>>
>> https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/
>>
>> However more recently Josef (one of the main developers), declared
>> that BTRFS with RAID5/6 has  "...some dark and scary corners..."
>>
>> https://lore.kernel.org/linux-btrfs/bf9594ea55ce40af80548888070427ad97daf78a.1598374255.git.josef@toxicpanda.com/
> I think my list is a little more...concrete.  ;)
>
>>> I also don't really understand why and what's the difference between
>>> metadata, data, and system.
>>> When I set up a volume only define RAID6 for the data it sets
>>> metadata and systemdata default to RAID1, but doesn't this mean that
>>> those important metadata are only stored on two drives instead of
>>> spread accross all drives like in a regular RAID6? This would somewhat
>>> negate the benefit of RAID6 to withstand a double failure like a 2nd
>>> drive fail while rebuilding the first failed one.
>> Correct. In fact Zygo suggested to user RAID6 + RAID1C3.
>>
>> I have only few suggestions:
>> 1) don't store valuable data on BTRFS with raid5/6 profile. Use it if
>> you want to experiment and want to help the development of BTRFS. But
>> be ready to face the lost of all data. (very unlikely, but more the
>> size of the filesystem is big, more difficult is a restore of the data
>> in case of problem).
> Losing all of the data seems unlikely given the bugs that exist so far.
> The known issues are related to availability (it crashes a lot and
> isn't fully usable in degraded mode) and small amounts of data loss
> (like 5 blocks per TB).
>
> The above assumes you never use raid5 or raid6 for btrfs metadata.  Using
> raid5 or raid6 for metadata can result in total loss of the filesystem,
> but you can use raid1 or raid1c3 for metadata instead.
>
>> 2) doesn't fill the filesystem more than 70-80%. If you go further
>> this limit the likelihood to catch the "dark and scary corners"
>> quickly increases.
> Can you elaborate on that?  I run a lot of btrfs filesystems at 99%
> capacity, some of the bigger ones even higher.  If there were issues at
> 80% I expect I would have noticed them.  There were some performance
> issues with full filesystems on kernels using space_cache=v1, but
> space_cache=v2 went upstream 4 years ago, and other significant
> performance problems a year before that.
>
> The last few GB is a bit of a performance disaster and there are
> some other gotchas, but that's an absolute number, not a percentage.
>
> Never balance metadata.  That is a ticket to a dark and scary corner.
> Make sure you don't do it, and that you don't accidentally install a
> cron job that does it.
>
>> 3) run scrub periodically and after a power failure ; better to use
>> an uninterruptible power supply (this is true for all the RAID, even
>> the MD one).
> scrub also provides early warning of disk failure, and detects disks
> that are silently corrupting your data.  It should be run not less than
> once a month, though you can skip months where you've already run a
> full-filesystem read for other reasons (e.g. replacing a failed disk).
>
>> 4) I don't have any data to support this; but as occasional reader of
>> this mailing list I have the feeling that combing BTRFS with LUCKS(or
>> bcache) raises the likelihood of a problem.
> I haven't seen that correlation.  All of my machines run at least one
> btrfs on luks (dm-crypt).  The larger ones use lvmcache.  I've also run
> bcache on test machines doing power-fail tests.
>
> That said, there are additional hardware failure risks involved in
> caching (more storage hardware components = more failures) and the
> system must be designed to tolerate and recover from these failures.
>
> When cache disks fail, just uncache and run scrub to repair.  btrfs
> checksums will validate the data on the backing HDD (which will be badly
> corrupted after a cache SSD failure) and will restore missing data from
> other drives in the array.
>
> It's definitely possible to configure bcache or lvmcache incorrectly,
> and then you will have severe problems.  Each HDD must have a separate
> dedicated SSD.  No sharing between cache devices is permitted.  They must
> use separate cache pools.  If one SSD is used to cache two or more HDDs
> and the SSD fails, it will behave the same as a multi-disk failure and
> probably destroy the filesystem.  So don't do that.
>
> Note that firmware in the SSDs used for caching must respect write
> ordering, or the cache will do severe damage to the filesystem on
> just about every power failure.  It's a good idea to test hardware
> in a separate system through a few power failures under load before
> deploying them in production.  Most devices are OK, but a few percent
> of models out there have problems so severe they'll damage a filesystem
> in a single-digit number of power loss events.  It's fairly common to
> encounter users who have lost a btrfs on their first or second power
> failure with a problematic drive.  If you're stuck with one of these
> disks, you can disable write caching and still use it, but there will
> be added write latency, and in the long run it's better to upgrade to
> a better disk model.
>
>> 5) pay attention that having an 8 disks raid, raises the likelihood of a
>> failure of about an order of magnitude more than a single disk ! RAID6
>> (or any other RAID) mitigates that, in the sense that it creates a
>> time window where it is possible to make maintenance (e.g. a disk
>> replacement) before the lost of data.
>> 6) leave the room in the disks array for an additional disk (to use
>> when a disk replacement is needed)
>> 7) avoid the USB disks, because these are not reliable
>>
>>
>>> Any information appreciated.
>>>
>>>
>>> Greetings from Germany,
>>>
>>> Matt
>>
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>>



* Re: using raid56 on a private machine
  2020-10-06  1:24   ` Zygo Blaxell
  2020-10-06  5:50     ` cryptearth
@ 2020-10-06 17:12     ` Goffredo Baroncelli
  2020-10-06 20:07       ` Zygo Blaxell
  1 sibling, 1 reply; 8+ messages in thread
From: Goffredo Baroncelli @ 2020-10-06 17:12 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: cryptearth, linux-btrfs, Josef Bacik

On 10/6/20 3:24 AM, Zygo Blaxell wrote:
> On Mon, Oct 05, 2020 at 07:57:51PM +0200, Goffredo Baroncelli wrote:
[...]

>>
>> I have only few suggestions:
>> 1) don't store valuable data on BTRFS with raid5/6 profile. Use it if
>> you want to experiment and want to help the development of BTRFS. But
>> be ready to face the lost of all data. (very unlikely, but more the
>> size of the filesystem is big, more difficult is a restore of the data
>> in case of problem).
> 
> Losing all of the data seems unlikely given the bugs that exist so far.
> The known issues are related to availability (it crashes a lot and
> isn't fully usable in degraded mode) and small amounts of data loss
> (like 5 blocks per TB).

From what I read on the mailing list, when a problem is too complex to
solve - to the point that the filesystem has to be re-formatted - quite
often the main issue is not "extracting" the data, but the availability
of additional space to "store" the data.

> 
> The above assumes you never use raid5 or raid6 for btrfs metadata.  Using
> raid5 or raid6 for metadata can result in total loss of the filesystem,
> but you can use raid1 or raid1c3 for metadata instead.
> 
>> 2) doesn't fill the filesystem more than 70-80%. If you go further
>> this limit the likelihood to catch the "dark and scary corners"
>> quickly increases.
> 
> Can you elaborate on that?  I run a lot of btrfs filesystems at 99%
> capacity, some of the bigger ones even higher.  If there were issues at
> 80% I expect I would have noticed them.  There were some performance
> issues with full filesystems on kernels using space_cache=v1, but
> space_cache=v2 went upstream 4 years ago, and other significant
> performance problems a year before that.

My suggestion was more about having enough space to not stress the
filesystem than "if you go beyond this limit you will have problems".

A BTRFS problem that confuses users is that you can have free space but
still be unable to allocate a new metadata chunk.

See
https://lore.kernel.org/linux-btrfs/6e6565b2-58c6-c8c1-62d0-6e8357e41a42@gmx.com/T/#t


Having the filesystem filled to 99% means that you have to check the
filesystem carefully (and balance it) to avoid scenarios like this.

On the other hand, 1% of 1TB (a small filesystem by today's standards) is
about 10GB, which everybody should consider enough...
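
For anyone who wants to check this on their own filesystem, the space
still available for new chunks is visible with (mount point is a
placeholder):

    # the "Device unallocated" lines show what is left for new
    # data/metadata chunks, per device and in total
    btrfs filesystem usage /mnt/array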

  
> The last few GB is a bit of a performance disaster and there are
> some other gotchas, but that's an absolute number, not a percentage.

True, it is sufficient to have a few GB free (i.e. not allocated to
chunks) on *enough* disks...

However, these requirements are a bit complex for a new BTRFS user to
understand.

> 
> Never balance metadata.  That is a ticket to a dark and scary corner.
> Make sure you don't do it, and that you don't accidentally install a
> cron job that does it.
> 
>> 3) run scrub periodically and after a power failure ; better to use
>> an uninterruptible power supply (this is true for all the RAID, even
>> the MD one).
> 
> scrub also provides early warning of disk failure, and detects disks
> that are silently corrupting your data.  It should be run not less than
> once a month, though you can skip months where you've already run a
> full-filesystem read for other reasons (e.g. replacing a failed disk).
> 
>> 4) I don't have any data to support this; but as occasional reader of
>> this mailing list I have the feeling that combing BTRFS with LUCKS(or
>> bcache) raises the likelihood of a problem.

> I haven't seen that correlation.  All of my machines run at least one
> btrfs on luks (dm-crypt).  The larger ones use lvmcache.  I've also run
> bcache on test machines doing power-fail tests.


> 
> That said, there are additional hardware failure risks involved in
> caching (more storage hardware components = more failures) and the
> system must be designed to tolerate and recover from these failures.
> 
> When cache disks fail, just uncache and run scrub to repair.  btrfs
> checksums will validate the data on the backing HDD (which will be badly
> corrupted after a cache SSD failure) and will restore missing data from
> other drives in the array.
> 
> It's definitely possible to configure bcache or lvmcache incorrectly,
> and then you will have severe problems.  Each HDD must have a separate
> dedicated SSD.  No sharing between cache devices is permitted.  They must
> use separate cache pools.  If one SSD is used to cache two or more HDDs
> and the SSD fails, it will behave the same as a multi-disk failure and
> probably destroy the filesystem.  So don't do that.
> 
> Note that firmware in the SSDs used for caching must respect write
> ordering, or the cache will do severe damage to the filesystem on
> just about every power failure.  It's a good idea to test hardware
> in a separate system through a few power failures under load before
> deploying them in production.  Most devices are OK, but a few percent
> of models out there have problems so severe they'll damage a filesystem
> in a single-digit number of power loss events.  It's fairly common to
> encounter users who have lost a btrfs on their first or second power
> failure with a problematic drive.  If you're stuck with one of these
> disks, you can disable write caching and still use it, but there will
> be added write latency, and in the long run it's better to upgrade to
> a better disk model.
> 
>> 5) pay attention that having an 8 disks raid, raises the likelihood of a
>> failure of about an order of magnitude more than a single disk ! RAID6
>> (or any other RAID) mitigates that, in the sense that it creates a
>> time window where it is possible to make maintenance (e.g. a disk
>> replacement) before the lost of data.
>> 6) leave the room in the disks array for an additional disk (to use
>> when a disk replacement is needed)
>> 7) avoid the USB disks, because these are not reliable
>>
>>
>>>
>>> Any information appreciated.
>>>
>>>
>>> Greetings from Germany,
>>>
>>> Matt
>>
>>
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>>


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: using raid56 on a private machine
  2020-10-06  5:50     ` cryptearth
@ 2020-10-06 19:31       ` Zygo Blaxell
  0 siblings, 0 replies; 8+ messages in thread
From: Zygo Blaxell @ 2020-10-06 19:31 UTC (permalink / raw)
  To: cryptearth; +Cc: linux-btrfs

On Tue, Oct 06, 2020 at 07:50:18AM +0200, cryptearth wrote:
> Hello Zygo,
> 
> that's quite a lot of information I wasn't aware of.
> 
> // In advance: Sorry for the wall of text. That mail got a bit longer than I
> thought.
> 
> I guess one point I still have to get my head around is about meta-blocks vs
> data-blocks: I don't even know if and how my current raid is capable of
> detecting other types of errors than instant failures, like corruption of
> structual meta-data or the actual data-blocks itself. Up until now, I never
> encountered any data corruption, neither read nor write issues. I always was
> able to correctly read all data the exact same as I've written them. There's
> only one application that uses rather big files (only in the range of
> 1gb-3gb) which keeps somehow corrupting itself (it's GTA V), but as the
> files that fail are files which, at least in my eyes, are only should be
> opened to read during runtime (as they contain assets like models and
> textures), but are actually opened in read/write mode I suspect that for
> some odd reason the game itself keeps writing data to those data and by it
> keep corrupting itself. Other big file, like other types of images (all
> about 4gb and more) or database files, which I also often read from and
> write to, never got any of these issues - but I guess that's just GTA V at
> its best - aside from some other rather strange CRM I had to use it's one of
> the worst pieces of modern software I know.
> The types of errors I encountered and which led to me replacing the drives
> makred as failed were about this: The monitoring software of this amd
> fakeraid at some point pops up one of those notifications telling me that
> one of the drives failed, was set to offline, and the raid was set to
> critical. Looking into the logs it only says that some operation <hex-code>
> failed at some address <another hex-code> on some drive (port number) and
> that the BSL (bad sector list) was updated. This comes up a few times and
> then this line about that drive going offline - so, it's a burst error. But:
> Even using Google didn't got me what those operation code mean. So, I don't
> know if a drive failed for some read or write error, some parity or checksum
> issue, or for whatever reason. All information I get is that there's a burst
> error, the drive is marked as bad, some list is updated and the array is set
> to a degraded state.
> But: Otherwise to what I tested so far with BtrFS it's not like the array
> goes offline or isn't available after reboot anymore. I can keep using it,
> and, as RAID5, at least this one, seem to always calculate and check at
> least some sort of checksum, there isn't even any performance penalty. To
> get it running again all I have to do is to shutdown the system, replace the
> drive, boot up again (that's caused by the hardware - it doesn't support
> hotplug) and hit rebuild in the raid control panel - which takes only a
> couple of hours with my 3TB drives.
> But, as already mentioned, as this is only RAID5 each rebuild is like
> gambling and hoping for no other drive fails until the rebuild is finished.
> If another drive would go bad during rebuild the data would most likely be
> lost. And recovery would even be harder as it's AMDs proprietary stuff - and
> from what I was able to find AMD denied help even to businesses - let alone
> me as a private person. All I could do would be to replace the board with
> some compatible one, but I don't even know if it just has to have a SB950
> chipset, or if it has to be the exact same board. The "bios-level" interface
> seem to be implemented as an option rom on its own - so it shouldn't depend
> on the specific board.
> 
> Anyway, long story short: I don'T want to wait until that catastrophe
> occurs, but rather want to prevent it by change my setup. Instead of rely
> something fused onto the motherboard my friend suggest to use a simple
> "dumb" HBA and do all the stuff in software, like Linux mdadm or BtrFS/ZFS.
> As I learned over the past few days while learning about BtrFS' RAID-like
> capabilities RAID isn't as simple as I thought until now but can actually
> suffer from one (or more) drive return corrupted data instead of just fail,
> and a typical hardware RAID controller or many "rather simple" software raid
> implementations can't tell the difference between the actual data and some
> not so good ones. As it was explained in some talk: Some implementations
> work in a way that if a data-block becomes corrupted which result in a fail
> of parity check the parity, which could actually be used to recover the
> correct data, is thrown away and overriden with some corrupted data
> re-calculated with the corrupted data-block. Hence using special filesystems
> like BtrFS and ZFS is recommended as they have additional information like
> per-block checksums to actually tell if the checksum calculation failed or
> if some data-block became corrupted.
> 
> As ZFS seem to have some not so clear license related stuff preventing it
> from get included into the kernel I took a look at BtrFS - which doesn't
> seem to fit my needs. Sure, I could go with a RAID 1+0 - but this still
> would result in only about 12TB useable space while actually throwing in 3
> more 3TB drives, but I actually planed to increase the useable size of my
> array instead of just bump it's redundancy. As for metadata: I've read up
> about the RAID1 and RAID3/4 profiles: And although RAID1c3 is recommended
> for a RAID6 (which would store 3 copies so there should be at least one copy
> left even in a double failure) is using a RAID1c4 also an option? I wouldn't
> mind to give up a bit of the available space to an extra metadata copy if it
> helps to prevent data loss in the case of a drive failure.
> 
> You also wrote to never balance metadata. But how should I restore the lost
> metadata after a drive replacement if I only re-balance the data-blocks? 

Replace and balance are distinct operations.

Replace reconstructs the contents of a missing drive on a new blank drive.
It is fast and works for all btrfs raid levels.

Balance redistributes existing data onto drives that have more space,
and reduces free space fragmentation.  This is good for data as it
makes future allocations faster and more contiguous; however, there
is no similar benefit for metadata.  Balance is bad for metadata as it
reduces space previously allocated for metadata to a minimum, and it's
possible to run out of metadata space before metadata can expand again.
This can be very difficult to recover from, and it's much easier to
simply never balance metadata to avoid the issue in the first place.

> Do
> they get updated and re-distributed in the background while restoring the
> data-blocks during a rebuild? 

'btrfs replace' reads mirror/parity disks and writes the contents of
the missing disk sequentially onto a new device.  There is no change in
distribution of the data possible with replace--all of the optimizations
in replace rely on the missing disk and the new replacement disk having
identical physical layout.  The new disk must be equal or larger in size
than the missing disk it replaces.  You start the replace process
manually when the new drive is installed; it does a lot of IO (technically
in the background, but fairly heavy for interactive use) until it is done.
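
In command form, replacing a failed member looks roughly like this - a
sketch, with the devid and device/mount names as placeholders:

    # find the devid of the missing disk
    btrfs filesystem show /mnt/array
    # rebuild its contents onto the new blank drive (devid 3 is just an
    # example)
    btrfs replace start 3 /dev/sdj /mnt/array
    # watch progress
    btrfs replace status /mnt/array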

You can also add a disk and remove a missing disk (the 'dev add' and 'dev
remove' operations) and then run balance to fill all the disks fairly,
which is more flexible, e.g. you can add 2 small disks to replace 1 large
disk.  This can be 10-100x slower than replace, and doesn't work properly
for raid5/6 in degraded mode yet (though it can be used for raid5 when all
disks are healthy and you just want the array to be bigger or smaller).
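
The add-and-remove route would, with placeholder names, be something
like the following sketch (keeping in mind the caveat above about
raid5/6 in degraded mode):

    # add the new disk, remove the missing one, then spread data more
    # evenly onto all members; the balance is data-only (no -m filter)
    btrfs device add /dev/sdj /mnt/array
    btrfs device remove missing /mnt/array
    btrfs balance start -dusage=75 /mnt/array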

> Or is this more like: "redundancy builds up
> again over time by the regular algorithms"? I may still have something wrong
> in my understanding about "using multiple disks in one array" so currently I
> would suspect that all data are rebuild - also metadata - but I guess BtrFS
> works different on this topic?
> 
> Yes, I can tolerate loss of data as I do have an extra backup of my
> important data, and as I only use it for my personal use I guess any data
> lost by not having a proper backup of them is on me anyway, but seeing BtrFS
> and ZFS used in like 45 drive arrays for crucial data with requirement for
> high availability I'd like to find a solution I can set up my array in a way
> like RAID6, so it can withstand a double failure, but which is also still
> available during a rebuild. And although during my tests BtrFS showed
> promissing when I was able to mount and use an array with WinBtrFS, which
> would also solve my additional quest for come up with some way of use the
> same volume on both Linux and Windows, it doesn't seem to be ready for my
> plan yet, or at least not with the knowledge I got so far. I'm open for any
> suggestions and explanations as I obvious still have quite a lot to learn -
> and, if I may set up a BtrFS volume, likely to require some help doing it
> "the right way" for what I'd like.
> 
> Thanks to anyone, and sorry again for that rather long mail
> 
> Matt
> 
> Am 06.10.2020 um 03:24 schrieb Zygo Blaxell:
> > On Mon, Oct 05, 2020 at 07:57:51PM +0200, Goffredo Baroncelli wrote:
> > > On 10/5/20 6:59 PM, cryptearth wrote:
> > > > Hello there,
> > > > 
> > > > as I plan to use a 8 drive RAID6 with BtrFS I'd like to ask about
> > > > the current status of BtrFS RAID5/6 support or if I should go with a
> > > > more traditional mdadm array.
> > Definitely do not use a single mdadm raid6 array with btrfs.  It is
> > equivalent to running btrfs with raid6 metadata:  mdadm cannot recover
> > from data corruption on the disks, and btrfs cannot recover from
> > write hole issues in degraded mode.  Any failure messier than a total
> > instantaneous disk failure will probably break the filesystem.
> > 
> > > > The general status page on the btrfs wiki shows "unstable" for
> > > > RAID5/6, and it's specific pages mentions some issue marked as "not
> > > > production ready". It also says to not use it for the metadata but
> > > > only for the actual data.
> > That's correct.  Very briefly, the issues are:
> > 
> > 	1.  Reads don't work properly in degraded mode.
> > 
> > 	2.  The admin tools are incomplete.
> > 
> > 	3.  The diagnostic tools are broken.
> > 
> > 	4.  It is not possible to recover from all theoretically
> > 	recoverable failure events.
> > 
> > Items 1 and 4 make raid5/6 unusable for metadata (total filesystem loss
> > is likely).  Use raid1 or raid1c3 for metadata instead.  This is likely
> > a good idea even if all the known issues are fixed--metadata access
> > patterns don't perform well with raid5/6, and the most likely proposals
> > to solve the raid5/6 problems will require raid1/raid1c3 metadata to
> > store an update journal.
> > 
> > If your application can tolerate small data losses correlated with disk
> > failures (i.e. you can restore a file from backup every year if required,
> > and you have no requirement for data availability while replacing disks)
> > then you can use raid5 now; otherwise, btrfs raid5/6 is not ready yet.
> > 
> > > > I plan to use it for my own personal system at home - and I do
> > > > understand that RAID is no replacement for a backup, but I'd rather
> > > > like to ask upfront if it's ready to use before I encounter issues
> > > > when I use it.
> > > > I already had the plan about using a more "traditional" mdadm setup
> > > > and just format the resulting volume with ext4, but as I asked about
> > > > that many actually suggested to me to rather use modern filesystems
> > > > like BtrFS or ZFS instead of "old school RAID".
> > Indeed, old school raid maximizes your probability of silent data loss by
> > allowing multiple disks in inject silent data loss failures and firmware
> > bug effects.
> > 
> > btrfs and ZFS store their own data integrity information, so they can
> > reliably identify failures on the drives.  If redundant storage is used,
> > they can recover automatically from failures the drives can't or won't
> > report.
> > 
> > > > Do you have any help for me about using BtrFS with RAID6 vs mdadm or ZFS?
> > > Zygo collected some useful information about RAID5/6:
> > > 
> > > https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/
> > > 
> > > However more recently Josef (one of the main developers), declared
> > > that BTRFS with RAID5/6 has  "...some dark and scary corners..."
> > > 
> > > https://lore.kernel.org/linux-btrfs/bf9594ea55ce40af80548888070427ad97daf78a.1598374255.git.josef@toxicpanda.com/
> > I think my list is a little more...concrete.  ;)
> > 
> > > > I also don't really understand why and what's the difference between
> > > > metadata, data, and system.
> > > > When I set up a volume only define RAID6 for the data it sets
> > > > metadata and systemdata default to RAID1, but doesn't this mean that
> > > > those important metadata are only stored on two drives instead of
> > > > spread accross all drives like in a regular RAID6? This would somewhat
> > > > negate the benefit of RAID6 to withstand a double failure like a 2nd
> > > > drive fail while rebuilding the first failed one.
> > > Correct. In fact Zygo suggested to use RAID6 + RAID1C3.
> > > 
> > > I have only a few suggestions:
> > > 1) don't store valuable data on BTRFS with a raid5/6 profile. Use it
> > > if you want to experiment and want to help the development of BTRFS,
> > > but be ready to face the loss of all data (very unlikely, but the
> > > bigger the filesystem, the more difficult a restore of the data in
> > > case of problems).
> > Losing all of the data seems unlikely given the bugs that exist so far.
> > The known issues are related to availability (it crashes a lot and
> > isn't fully usable in degraded mode) and small amounts of data loss
> > (like 5 blocks per TB).
> > 
> > The above assumes you never use raid5 or raid6 for btrfs metadata.  Using
> > raid5 or raid6 for metadata can result in total loss of the filesystem,
> > but you can use raid1 or raid1c3 for metadata instead.
> > 
> > > 2) don't fill the filesystem more than 70-80%. If you go beyond this
> > > limit, the likelihood of hitting the "dark and scary corners" quickly
> > > increases.
> > Can you elaborate on that?  I run a lot of btrfs filesystems at 99%
> > capacity, some of the bigger ones even higher.  If there were issues at
> > 80% I expect I would have noticed them.  There were some performance
> > issues with full filesystems on kernels using space_cache=v1, but
> > space_cache=v2 went upstream 4 years ago, and other significant
> > performance problems a year before that.
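> > 
> > If anyone wants to check which implementation a mounted filesystem
> > is using, a quick sketch (mount point and device are hypothetical;
> > the v2 option sticks after the first mount that sets it):
> > 
> > 	# see whether space_cache or space_cache=v2 is in effect
> > 	grep btrfs /proc/mounts
> > 	# switch an existing filesystem to the v2 free space tree
> > 	umount /mnt
> > 	mount -o space_cache=v2 /dev/sda /mnt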
> > 
> > The last few GB is a bit of a performance disaster and there are
> > some other gotchas, but that's an absolute number, not a percentage.
> > 
> > Never balance metadata.  That is a ticket to a dark and scary corner.
> > Make sure you don't do it, and that you don't accidentally install a
> > cron job that does it.
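> > 
> > If some periodic balancing is wanted anyway, a data-only filtered
> > balance is the usual shape of it -- a sketch, with the usage
> > threshold and mount point picked arbitrarily:
> > 
> > 	# compact only data block groups that are less than half full;
> > 	# no -m filter is given, so metadata block groups are untouched
> > 	btrfs balance start -dusage=50 /mnt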
> > 
> > > 3) run scrub periodically and after a power failure; better yet, use
> > > an uninterruptible power supply (this is true for all RAID setups,
> > > even the MD one).
> > scrub also provides early warning of disk failure, and detects disks
> > that are silently corrupting your data.  It should be run not less than
> > once a month, though you can skip months where you've already run a
> > full-filesystem read for other reasons (e.g. replacing a failed disk).
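> > 
> > A sketch of such a routine (mount point hypothetical):
> > 
> > 	# start a scrub in the background, then check on it later
> > 	btrfs scrub start /mnt
> > 	btrfs scrub status /mnt
> > 	# an /etc/crontab entry for 03:00 on the 1st of each month:
> > 	# 0 3 1 * * root /usr/bin/btrfs scrub start -B /mnt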
> > 
> > > 4) I don't have any data to support this, but as an occasional reader
> > > of this mailing list I have the feeling that combining BTRFS with LUKS
> > > (or bcache) raises the likelihood of a problem.
> > I haven't seen that correlation.  All of my machines run at least one
> > btrfs on luks (dm-crypt).  The larger ones use lvmcache.  I've also run
> > bcache on test machines doing power-fail tests.
> > 
> > That said, there are additional hardware failure risks involved in
> > caching (more storage hardware components = more failures) and the
> > system must be designed to tolerate and recover from these failures.
> > 
> > When cache disks fail, just uncache and run scrub to repair.  btrfs
> > checksums will validate the data on the backing HDD (which will be badly
> > corrupted after a cache SSD failure) and will restore missing data from
> > other drives in the array.
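> > 
> > For lvmcache that recovery is roughly the following (VG/LV names are
> > hypothetical; bcache has its own detach procedure, and --force may
> > be needed if the cache PV is completely gone):
> > 
> > 	# drop the dead cache and fall back to the bare HDD
> > 	lvconvert --uncache vg_d1/d1
> > 	# let btrfs find and rebuild whatever the cache never wrote back
> > 	btrfs scrub start -B /mnt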
> > 
> > It's definitely possible to configure bcache or lvmcache incorrectly,
> > and then you will have severe problems.  Each HDD must have a separate
> > dedicated SSD.  No sharing between cache devices is permitted.  They must
> > use separate cache pools.  If one SSD is used to cache two or more HDDs
> > and the SSD fails, it will behave the same as a multi-disk failure and
> > probably destroy the filesystem.  So don't do that.
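> > 
> > A minimal sketch of that "one SSD per HDD, separate pools" layout
> > with lvmcache (names and sizes are all hypothetical):
> > 
> > 	# each HDD is paired with its own SSD in its own VG, so a dead
> > 	# SSD only ever looks like a single-device failure to btrfs
> > 	vgcreate vg_d1 /dev/sda /dev/nvme0n1
> > 	lvcreate -n d1 -l 100%PVS vg_d1 /dev/sda
> > 	lvcreate --type cache-pool -n d1pool -L 100G vg_d1 /dev/nvme0n1
> > 	lvconvert --type cache --cachepool vg_d1/d1pool vg_d1/d1
> > 	# repeat with vg_d2 = /dev/sdb + /dev/nvme1n1 and so on; btrfs
> > 	# is then created on top of /dev/vg_d1/d1, /dev/vg_d2/d1, ...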
> > 
> > Note that firmware in the SSDs used for caching must respect write
> > ordering, or the cache will do severe damage to the filesystem on
> > just about every power failure.  It's a good idea to test hardware
> > in a separate system through a few power failures under load before
> > deploying them in production.  Most devices are OK, but a few percent
> > of models out there have problems so severe they'll damage a filesystem
> > in a single-digit number of power loss events.  It's fairly common to
> > encounter users who have lost a btrfs on their first or second power
> > failure with a problematic drive.  If you're stuck with one of these
> > disks, you can disable write caching and still use it, but there will
> > be added write latency, and in the long run it's better to upgrade to
> > a better disk model.
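> > 
> > Turning off the volatile write cache on a suspect SATA drive is a
> > one-liner (device name hypothetical; the setting usually does not
> > survive a power cycle, so it has to be reapplied from a udev rule or
> > boot script, and sdparm is the rough equivalent for SAS):
> > 
> > 	# disable the drive's write cache, then confirm the setting
> > 	hdparm -W 0 /dev/sdX
> > 	hdparm -W /dev/sdX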
> > 
> > > 5) be aware that an 8-disk raid raises the likelihood of a failure by
> > > about an order of magnitude compared to a single disk! RAID6 (or any
> > > other RAID) mitigates that, in the sense that it creates a time window
> > > in which maintenance (e.g. a disk replacement) is possible before data
> > > is lost.
> > > 6) leave room in the disk array for an additional disk (to use when a
> > > disk replacement is needed)
> > > 7) avoid USB disks, because they are not reliable
> > > 
> > > 
> > > > Any information appreciated.
> > > > 
> > > > 
> > > > Greetings from Germany,
> > > > 
> > > > Matt
> > > 
> > > -- 
> > > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> > > Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> > > 
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: using raid56 on a private machine
  2020-10-06 17:12     ` Goffredo Baroncelli
@ 2020-10-06 20:07       ` Zygo Blaxell
  0 siblings, 0 replies; 8+ messages in thread
From: Zygo Blaxell @ 2020-10-06 20:07 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: cryptearth, linux-btrfs, Josef Bacik

On Tue, Oct 06, 2020 at 07:12:04PM +0200, Goffredo Baroncelli wrote:
> On 10/6/20 3:24 AM, Zygo Blaxell wrote:
> > On Mon, Oct 05, 2020 at 07:57:51PM +0200, Goffredo Baroncelli wrote:
> [...]
> 
> > > 
> > > I have only a few suggestions:
> > > 1) don't store valuable data on BTRFS with a raid5/6 profile. Use it
> > > if you want to experiment and want to help the development of BTRFS,
> > > but be ready to face the loss of all data (very unlikely, but the
> > > bigger the filesystem, the more difficult a restore of the data in
> > > case of problems).
> > 
> > Losing all of the data seems unlikely given the bugs that exist so far.
> > The known issues are related to availability (it crashes a lot and
> > isn't fully usable in degraded mode) and small amounts of data loss
> > (like 5 blocks per TB).
> 
> From what I read on the mailing list, when a problem is complex enough
> that the filesystem has to be reformatted, quite often the main issue is
> not to "extract" the data, but the availability of additional space to
> "store" it.
> 
> > 
> > The above assumes you never use raid5 or raid6 for btrfs metadata.  Using
> > raid5 or raid6 for metadata can result in total loss of the filesystem,
> > but you can use raid1 or raid1c3 for metadata instead.
> > 
> > > 2) don't fill the filesystem more than 70-80%. If you go beyond this
> > > limit, the likelihood of hitting the "dark and scary corners" quickly
> > > increases.
> > 
> > Can you elaborate on that?  I run a lot of btrfs filesystems at 99%
> > capacity, some of the bigger ones even higher.  If there were issues at
> > 80% I expect I would have noticed them.  There were some performance
> > issues with full filesystems on kernels using space_cache=v1, but
> > space_cache=v2 went upstream 4 years ago, and other significant
> > performance problems a year before that.
> 
> My suggestion was more about leaving enough space to not stress the
> filesystem than "if you go beyond this limit you'll have problems".
> 
> A problem of BTRFS that confuses users is that you can have space, but
> you can't allocate a new metadata chunk.
> 
> See
> https://lore.kernel.org/linux-btrfs/6e6565b2-58c6-c8c1-62d0-6e8357e41a42@gmx.com/T/#t
> 
> 
> Having the filesystem filled to 99% means that you have to check the
> filesystem carefully (and balance it) to avoid scenarios like this.

Nah.  Never balance metadata, and let the filesystem fill up.  As long
as the largest disks are sufficient for the raid profile, and there
isn't a radical change in usage after it's mostly full (i.e. a sudden
increase in snapshots or dedupe for the first time after the filesystem
has been filled), it'll be fine with no balances at all.

If the largest disks are the wrong sizes (e.g.  you're using raid6 and
the 2 largest disks are larger than the third), then you'll hit ENOSPC
at some point (which might be 80% full, or 10% full, depending on disk
sizes and profile), and it's inevitable, you'll hit ENOSPC no matter
what you do.

Arguably the tools could detect and warn about that case.  It's not
exactly hard to predict the failure.  A few lines of 'btrfs fi usage'
output and some counting are enough to spot it.
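
For what it's worth, the whole check is something like this (mount
point hypothetical); the point is to compare per-device unallocated
space against what the chosen profiles need:

	# shows allocated vs. unallocated space for every device
	btrfs filesystem usage /mnt
	# raid6 data chunks and raid1c3 metadata chunks can only be
	# allocated while enough *different* devices still have
	# unallocated space, no matter how large the total is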

> On the other side, 1% of 1TB (a small filesystem by today's standards)
> is about 10GB, which everybody should consider enough...

Most of the ENOSPC paint-into-the-corner cases I've seen (assuming it's
possible to allocate all the space with the chosen disk sizes and raid
profile) are on drives below 1TB, where GB-sized metadata allocations take
up bigger percentages of free space.  btrfs allocates space in absolute
units (1GB per chunk), so it's the absolute amount of free space that
causes the problems.  A 20GB filesystem can run into problems that need
workarounds at 50% full, while a 20TB filesystem can go all the way up
to 99.95% without issue.

> > The last few GB is a bit of a performance disaster and there are
> > some other gotchas, but that's an absolute number, not a percentage.
> 
> True, it is sufficient to have a few GB free (i.e. not allocated to
> chunks) on *enough* disks...
> 
> However these requirements are a bit complex for new BTRFS users to
> understand.

True, more tool support for the provably bad cases would be good here,
as well as a reliable estimate for how much space remains available for
metadata.  Some of the issues are difficult to boil down to numbers,
though; they are about _shape_.

> > Never balance metadata.  That is a ticket to a dark and scary corner.
> > Make sure you don't do it, and that you don't accidentally install a
> > cron job that does it.
> > 
> > > 3) run scrub periodically and after a power failure; better yet, use
> > > an uninterruptible power supply (this is true for all RAID setups,
> > > even the MD one).
> > 
> > scrub also provides early warning of disk failure, and detects disks
> > that are silently corrupting your data.  It should be run not less than
> > once a month, though you can skip months where you've already run a
> > full-filesystem read for other reasons (e.g. replacing a failed disk).
> > 
> > > 4) I don't have any data to support this, but as an occasional reader
> > > of this mailing list I have the feeling that combining BTRFS with LUKS
> > > (or bcache) raises the likelihood of a problem.
> 
> > I haven't seen that correlation.  All of my machines run at least one
> > btrfs on luks (dm-crypt).  The larger ones use lvmcache.  I've also run
> > bcache on test machines doing power-fail tests.
> 
> 
> > 
> > That said, there are additional hardware failure risks involved in
> > caching (more storage hardware components = more failures) and the
> > system must be designed to tolerate and recover from these failures.
> > 
> > When cache disks fail, just uncache and run scrub to repair.  btrfs
> > checksums will validate the data on the backing HDD (which will be badly
> > corrupted after a cache SSD failure) and will restore missing data from
> > other drives in the array.
> > 
> > It's definitely possible to configure bcache or lvmcache incorrectly,
> > and then you will have severe problems.  Each HDD must have a separate
> > dedicated SSD.  No sharing between cache devices is permitted.  They must
> > use separate cache pools.  If one SSD is used to cache two or more HDDs
> > and the SSD fails, it will behave the same as a multi-disk failure and
> > probably destroy the filesystem.  So don't do that.
> > 
> > Note that firmware in the SSDs used for caching must respect write
> > ordering, or the cache will do severe damage to the filesystem on
> > just about every power failure.  It's a good idea to test hardware
> > in a separate system through a few power failures under load before
> > deploying them in production.  Most devices are OK, but a few percent
> > of models out there have problems so severe they'll damage a filesystem
> > in a single-digit number of power loss events.  It's fairly common to
> > encounter users who have lost a btrfs on their first or second power
> > failure with a problematic drive.  If you're stuck with one of these
> > disks, you can disable write caching and still use it, but there will
> > be added write latency, and in the long run it's better to upgrade to
> > a better disk model.
> > 
> > > 5) be aware that an 8-disk raid raises the likelihood of a failure by
> > > about an order of magnitude compared to a single disk! RAID6 (or any
> > > other RAID) mitigates that, in the sense that it creates a time window
> > > in which maintenance (e.g. a disk replacement) is possible before data
> > > is lost.
> > > 6) leave room in the disk array for an additional disk (to use when a
> > > disk replacement is needed)
> > > 7) avoid USB disks, because they are not reliable
> > > 
> > > 
> > > > 
> > > > Any information appreciated.
> > > > 
> > > > 
> > > > Greetings from Germany,
> > > > 
> > > > Matt
> > > 
> > > 
> > > -- 
> > > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> > > Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> > > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-10-06 20:07 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-05 16:59 using raid56 on a private machine cryptearth
2020-10-05 17:57 ` Goffredo Baroncelli
2020-10-05 19:22   ` cryptearth
2020-10-06  1:24   ` Zygo Blaxell
2020-10-06  5:50     ` cryptearth
2020-10-06 19:31       ` Zygo Blaxell
2020-10-06 17:12     ` Goffredo Baroncelli
2020-10-06 20:07       ` Zygo Blaxell
