* Re: List of known BTRFS Raid 5/6 Bugs?
@ 2018-09-07 13:58 Stefan K
  2018-09-08  8:40 ` Duncan
  0 siblings, 1 reply; 21+ messages in thread
From: Stefan K @ 2018-09-07 13:58 UTC (permalink / raw)
  To: linux-btrfs

sorry to disturb this discussion,

are there any plans/dates to fix the raid5/6 issue? Is somebody working on this issue? This is one of the most important things for a fileserver for me; with a raid1 config I lose too much disk space.

best regards
Stefan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-09-07 13:58 List of known BTRFS Raid 5/6 Bugs? Stefan K
@ 2018-09-08  8:40 ` Duncan
  2018-09-11 11:29   ` Stefan K
  0 siblings, 1 reply; 21+ messages in thread
From: Duncan @ 2018-09-08  8:40 UTC (permalink / raw)
  To: linux-btrfs

Stefan K posted on Fri, 07 Sep 2018 15:58:36 +0200 as excerpted:

> sorry to disturb this discussion,
> 
> are there any plans/dates to fix the raid5/6 issue? Is somebody working
> on this issue? This is one of the most important things for a fileserver
> for me; with a raid1 config I lose too much disk space.

There's a more technically complete discussion of this in at least two 
earlier threads you can find on the list archive, if you're interested, 
but here's the basics (well, extended basics...) from a btrfs-using-
sysadmin perspective.

"The raid5/6 issue" can refer to at least three conceptually separate 
issues, with different states of solution maturity:

1) Now generally historic bugs in btrfs scrub, etc, that are fixed (thus 
the historic) in current kernels and tools.  Unfortunately these will 
still affect many users of longer-term stale^H^Hble distros who don't 
update from other sources for some time: because the raid56 feature 
wasn't yet stable at the lock-in time for whatever versions they 
stabilized on, they're not likely to get the fixes, as it's new-feature 
material.

If you're using a current kernel and tools, however, this issue is 
fixed.  You can look on the wiki for the specific versions, but with 
4.18 the current latest stable kernel, it and 4.17 (plus the matching 
tools versions, since the version numbers are synced) are the two latest 
release series, and the two latest release series are what's best 
supported and considered "current" on this list.

Also see...

2) General feature maturity:  While raid56 mode should be /reasonably/ 
stable now, it remains one of the newer features and simply hasn't yet 
had the testing of time that tends to flush out the smaller and corner-
case bugs -- testing that more mature features such as raid1 have now 
had the benefit of.

There's nothing to do for this but test, report any bugs you find, and 
wait for the maturity that time brings.

Of course this is one of several reasons we so strongly emphasize and 
recommend "current" on this list, because even for reasonably stable and 
mature features such as raid1, btrfs itself remains new enough that they 
still occasionally get latent bugs found and fixed, and while /some/ of 
those fixes get backported to LTS kernels (with even less chance for 
distros to backport tools fixes), not all of them do and even when they 
do, current still gets the fixes first.

3) The remaining issue is the infamous parity-raid write-hole that 
affects all parity-raid implementations (not just btrfs) unless they take 
specific steps to work around the issue.

The first thing to point out here again is that it's not btrfs-specific.  
Between that and the fact that it *ONLY* affects parity-raid operating in 
degraded mode *WITH* an ungraceful-shutdown recovery situation, it could 
be argued not to be a btrfs issue at all, but rather one inherent to 
parity-raid mode and considered an acceptable risk to those choosing 
parity-raid because it's only a factor when operating degraded, if an 
ungraceful shutdown does occur.

But btrfs' COW nature, along with a couple of technical implementation 
factors (the read-modify-write cycle for incomplete stripe widths and how 
that risks existing metadata when new metadata is written), does amplify 
the risk somewhat compared to the same write-hole issue as seen in 
various other parity-raid implementations that don't take write-hole 
avoidance countermeasures.


So what can be done right now?

As it happens there is a mitigation the admin can currently take -- btrfs 
allows specifying data and metadata modes separately, and even where 
raid1 loses too much space to be used for both, it's possible to specify 
data as raid5/6 and metadata as raid1.  While btrfs raid1 only covers 
loss of a single device, it doesn't have the parity-raid write-hole as 
it's not parity-raid.  For most use-cases at least, specifying raid1 for 
metadata only, with raid5 for data, should strictly limit both risks: 
the parity-raid write-hole is limited to data, which in most cases will 
be full-stripe writes and thus not subject to the problem, and the 
size-doubling of raid1 is limited to metadata.
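
As a rough sketch (device names and mount point are placeholders, so 
substitute your own), that's either done at mkfs time:

  mkfs.btrfs -d raid5 -m raid1 /dev/sdX /dev/sdY /dev/sdZ

or on an existing filesystem with a convert-filtered balance:

  btrfs balance start -dconvert=raid5 -mconvert=raid1 /mountpoint

(Substitute raid6 if that's what you're after.)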

Meanwhile, for a sysadmin properly following the sysadmin's first rule 
of backups -- that the true value of data isn't defined by arbitrary 
claims, but by the number of backups it is considered worth the 
time/trouble/resources to have of that data -- this is arguably just a 
known parity-raid risk, specifically limited to the corner-case of 
having an ungraceful shutdown *WHILE* already operating degraded.  As 
such, it can be managed along with all the other known risks to the 
data, including admin fat-fingering, the risk that more devices will go 
out than the array can tolerate, the risk of general bugs affecting the 
filesystem or other storage-function related code, etc.

IOW, in the context of the admin's first rule of backups, no matter the 
issue, raid56 write hole or whatever other issue of the many possible, 
loss of data can *never* be a particularly big deal, because by 
definition, in *all* cases, what was of most value was saved: either the 
data, if it was defined as valuable enough to have a backup, or the 
time/trouble/resources that would have otherwise gone into making that 
backup, if the data wasn't worth backing up.

(One nice thing about this rule is that it covers the loss of whatever 
number of backups along with the working copy just as well as it does 
loss of just the working copy.  No matter the number of backups, the 
value of the data is either worth having one more backup, just in case, 
or it's not.  Similarly, the rule covers the age of the backup and 
updates nicely as well, as that's just a subset of the original problem, 
with the value of the data in the delta between the last backup and the 
working copy now being the deciding factor: either the risk of losing it 
is worth updating the backup, or not.  Same rule, applied to a data 
subset.)

So from an admin's perspective, in practice, while not entirely stable 
and mature yet, and with the already-degraded-plus-crash corner-case 
risk that's known to apply to parity-raid unless mitigation steps are 
taken, btrfs raid56 mode should now be within the acceptable risk range, 
one already well covered by the risk mitigation of following an 
appropriate backup policy, optionally combined with the partial 
write-hole-mitigation strategy of doing data as raid5/6, with metadata 
as raid1.


OK, but what is being done to better mitigate the parity-raid write-hole 
problem for the future, and when might we be able to use that mitigation?

There are a number of possible mitigation strategies, and there's 
actually code being written using one of them right now, tho it'll be (at 
least) a few kernel cycles until it's considered complete and stable 
enough for mainline, and as mentioned in #2 above, even after that it'll 
take some time to mature to reasonable stability.

The strategy being taken is partial-stripe-write logging.  Full stripe 
writes aren't affected by the write hole and (AFAIK) won't be logged, but 
partial stripe writes are read-modify-write and thus write-hole 
susceptible, and will be logged.  That means small files and 
modifications to existing files, the ends of large files, and much of the 
metadata, will be written twice, first to the log, then to the final 
location.  In the event of a crash, on reboot and mount, anything in the 
log can be replayed, thus preventing the write hole.

As for the log, it'll be written using a new 3/4-way-mirroring mode, 
basically raid1 but mirrored more than the two ways current btrfs raid1 
is limited to (even with more than two devices in the filesystem), thus 
handling the loss of multiple devices.

That's actually what's being developed ATM, the 3/4-way-mirroring mode, 
which will be available for other uses as well.

Actually, that's what I'm personally excited about, as years ago, when I 
first looked into btrfs, I was running older devices in mdraid's raid1 
mode, which does N-way mirroring.  I liked the btrfs data checksumming 
and scrubbing ability, but with the older devices I didn't trust having 
just two-way-mirroring and wanted at least 3-way-mirroring, so back at 
that time I skipped btrfs and stayed with mdraid.  Later I upgraded to 
ssds and decided btrfs-raid1's two-way-mirroring was sufficient, but when 
one of the ssds went bad prematurely and needed replacing, I would sure 
have felt a bit better before I got the replacement done if I'd still 
had good two-way-mirroring even with the bad device.

So I'm still interested in 3-way-mirroring and would probably use it for 
some things now, were it available and "stabilish", and I'm eager to see 
that code merged, not for the parity-raid logging it'll also be used for, 
but for the reliability of 3-way-mirroring.  Tho I'll probably wait at 
least 2-5 kernel cycles after introduction and see how it stabilizes 
before actually considering it stable enough to use myself: even tho I 
do follow the backups policy above, just because I'm not considering the 
updated-data delta worth an updated backup yet doesn't mean I want to 
unnecessarily risk having to redo the work since the last backup.  Which 
means choosing the newer 3-way-mirroring over the more stable and mature 
existing raid1 2-way-mirroring isn't going to be worth it to me until 
the 3-way-mirroring has had at least a /few/ kernel cycles to stabilize.

And I'd recommend the same caution with the new raid5/6 logging mode 
built on top of that multi-way-mirroring, once it's merged as well.  
Don't just jump on it immediately after merge unless you're deliberately 
doing so to help test for bugs and get them fixed and the feature 
stabilized as soon as possible.  Wait a few kernel cycles, follow the 
list to see how the feature's stability is coming, and /then/ use it, 
after factoring the additional risk of what will then still be a new and 
less mature feature into your backup risk profile, of course.

Time?  Not a dev but following the list and obviously following the new 3-
way-mirroring, I'd say probably not 4.20 (5.0?) for the new mirroring 
modes, so 4.21/5.1 more reasonably likely (if all goes well, could be 
longer), probably another couple cycles (if all goes well) after that for 
the parity-raid logging code built on top of the new mirroring modes, so 
perhaps a year (~5 kernel cycles) to introduction for it.  Then wait 
however many cycles until you think it has stabilized.  Call that another 
year.  So say about 10 kernel cycles or two years.  It could be a bit 
less than that, say 5-7 cycles, if things go well and you take it before 
I'd really consider it stable enough to recommend, but given the 
historically much longer than predicted development and stabilization 
times for raid56 already, it could just as easily end up double that, 4-5 
years out, too.

But raid56 logging mode for write-hole mitigation is indeed actively 
being worked on right now.  That's what we know at this time.

And even before that, right now, raid56 mode should already be reasonably 
usable, especially if you do data raid5/6 and metadata raid1, as long as 
your backup policy and practice is equally reasonable.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-09-08  8:40 ` Duncan
@ 2018-09-11 11:29   ` Stefan K
  2018-09-12  1:57     ` Duncan
  0 siblings, 1 reply; 21+ messages in thread
From: Stefan K @ 2018-09-11 11:29 UTC (permalink / raw)
  To: linux-btrfs

wow, holy shit, thanks for this extended answer!

> The first thing to point out here again is that it's not btrfs-specific.  
so that means every RAID implementation (with parity) has such a bug? I looked around a bit, and it looks like ZFS doesn't have a write hole. And does it _only_ happen when the server has an ungraceful shutdown, caused by a power outage? So that means if I run btrfs raid5/6 and have no power outages, I have no problems?

>  it's possible to specify data as raid5/6 and metadata as raid1
does someone have this in production? ZFS btw has 2 copies of metadata by default; maybe that would also be an option for btrfs?
in this case you think 'btrfs fi balance start -mconvert=raid1 -dconvert=raid5 /path' is safe at the moment?

> That means small files and modifications to existing files, the ends of large files, and much of the 
> metadata, will be written twice, first to the log, then to the final location. 
that sounds like the performance will go down? As far as I can see btrfs can't beat ext4 or zfs as it is, and now they will make it even slower?

thanks in advance!

best regards
Stefan



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-09-11 11:29   ` Stefan K
@ 2018-09-12  1:57     ` Duncan
  0 siblings, 0 replies; 21+ messages in thread
From: Duncan @ 2018-09-12  1:57 UTC (permalink / raw)
  To: linux-btrfs

Stefan K posted on Tue, 11 Sep 2018 13:29:38 +0200 as excerpted:

> wow, holy shit, thanks for this extended answer!
> 
>> The first thing to point out here again is that it's not
>> btrfs-specific.
> so that means every RAID implementation (with parity) has such a bug?
> I looked around a bit, and it looks like ZFS doesn't have a write hole.

Every parity-raid implementation that doesn't contain specific write-hole 
workarounds, yes, but some already have workarounds built-in, as btrfs 
will after the planned code is written/tested/merged/tested-more-broadly.

https://www.google.com/search?q=parity-raid+write-hole [1]

As an example, back some years ago when I was doing raid6 on mdraid, it 
had the write-hole problem and I remember reading about it at the time.  
However, right on the first page of hits for the above search...

LWN: A journal for MD/RAID5 : https://lwn.net/Articles/665299/

Seems md/raid5's write hole was (optionally) closed in kernel 4.4 with an 
optional journal device... preferably a fast ssd or nvram, to avoid 
performance issues, and mirrored, to avoid the journal itself being a 
single point of failure.
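
(For reference, the md side looks roughly like the following -- a sketch 
only, with made-up device names, so check mdadm(8) for your version:

  mdadm --create /dev/md0 --level=5 --raid-devices=4 \
        /dev/sd[b-e] --write-journal /dev/nvme0n1p1

The journal device absorbs the small-write logging, which is why they 
suggest ssd/nvram for it.)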

For me zfs is strictly an arm's-length thing, because if Oracle wanted 
to they could easily resolve the licensing situation, as they own the 
code, but they haven't, which at this point can only be deliberate, and 
as a result I simply don't touch it.  That isn't to say I don't 
recommend it for those comfortable with, or simply willing to overlook, 
the licensing issues, however, because zfs remains the more mature 
option for many of the same feature points that btrfs offers at a lower 
maturity level.

But while I keep zfs at personal arm's length, from what I've picked up 
I /believe/ zfs gets around the write-hole by doing strict copy-on-write 
combined with variable-length stripes -- unlike current btrfs, a stripe 
isn't always written as widely as possible, so for instance in a 
20-device raid5-alike it can do a 3-device or possibly even 2-device 
"stripe", which, being entirely copy-on-write, avoids the 
read-modify-write cycle on existing data that, unless mitigated, creates 
the parity-raid write-hole.

Variable-length stripes are actually one of the possible longer-term 
solutions already discussed for btrfs as well, but the logging/journalling 
solution seems to be what they've decided to implement first, and there 
are other tradeoffs to it (as discussed elsewhere).  Of course, because 
as I've already explained I'm interested in the 3/4-way-mirroring option 
that would be used for the journal but would also be available to expand 
the current 2-way-raid1 mode to additional mirroring, this is absolutely 
fine with me! =:^)

> And does it _only_ happen when the server has an ungraceful shutdown,
> caused by a power outage? So that means if I run btrfs raid5/6 and have
> no power outages, I have no problems?

Sort-of yes?

Keep in mind that a power outage isn't the /only/ way to have an 
ungraceful shutdown, just one of the most common.  A kernel crash or 
lockup (common examples include video and occasionally network driver 
bugs, due to the direct access to hardware and memory those drivers get) 
can trigger an "ungraceful shutdown" as well, altho with care (basically 
always trying to ssh in for a remote shutdown if possible and/or using 
alt-sysrq-reisub sequences on apparent lockups) it's often possible to 
keep those from being /entirely/ ungraceful at the hardware level.  So 
it's not /quite/ as bad as an abrupt power outage, or perhaps even worse 
a brownout, which doesn't kill writes entirely but can at least 
theoretically trigger garbage scribbling in random device blocks.

So yes, sort-of, but it's not just power outages.
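
(If you want the alt-sysrq-reisub escape hatch to actually work, make 
sure magic-sysrq is enabled first -- distro defaults vary, so just a 
sketch:

  echo 1 > /proc/sys/kernel/sysrq

or persistently via kernel.sysrq=1 in sysctl.d.  Then on an apparent 
lockup, alt-sysrq r, e, i, s, u, b in sequence, with a few seconds 
between each, terminates processes, syncs, and remounts read-only before 
rebooting, which is far kinder to the filesystem than the power button.)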

>>  it's possible to specify data as raid5/6 and metadata as raid1

> does someone have this in production?

I'm sure people do.  (As I said I'm a raid1 guy here, even 3-way-
mirroring for some things were it possible, so no parity-raid at all for 
me personally.)

On btrfs, it is in fact the multi-device default and thus quite common to 
have data and metadata as different profiles.  The multi-device default 
for metadata if not specified is raid1, with single profile data.  So if 
you just specify raid5/6 data and don't specify metadata at all, you'll 
get exactly what was mentioned, raid5/6 data as specified, raid1 metadata 
as the unspecified multi-device default.

So were I to guess, I'd guess that a lot of people who weren't paying 
attention when setting up, but who say they have raid5/6, actually only 
have it for data, having not specified anything for metadata, so they 
got raid1 for that.
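
(Checking what you actually have is easy enough -- with /path being 
wherever the filesystem is mounted:

  btrfs filesystem df /path

and look at the profile shown on the Data and Metadata lines.  btrfs 
filesystem usage shows much the same per device, modulo the raid56 
warnings it still prints, as seen elsewhere in this thread.)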


> ZFS btw has 2 copies of metadata by
> default; maybe that would also be an option for btrfs?

It actually sounds like they do hybrid raid, then, not just pure parity-
raid, but mirroring the metadata as well.  That would be in accord with 
a couple of things I'd read about zfs but hadn't quite pursued to the 
logical conclusion, and it would match what btrfs, as already available, 
does with raid5/6 data and raid1 metadata.

> in this case you think 'btrfs fi balance start -mconvert=raid1
> -dconvert=raid5 /path ' is safe at the moment?

Provided you have backups in accordance with the "if it's more valuable 
than the time/trouble/resources for the backup, it's backed up" rule, and 
on current kernels, yes.
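
(If you do run it, a rough sketch of the sequence, with /path again 
being your mount point:

  btrfs balance start -mconvert=raid1 -dconvert=raid5 /path
  btrfs balance status /path      # from another terminal, to watch it
  btrfs filesystem df /path       # confirm the profiles afterward

A convert balance rewrites everything, so expect it to take quite a 
while on several TiB of data.)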

>> That means small files and modifications to existing files, the ends of
>> large files, and much of the metadata, will be written twice, first to
>> the log, then to the final location.

> that sounds like the performance will go down? As far as I can see btrfs
> can't beat ext4 or zfs as it is, and now they will make it even slower?

That's the effect journaling[2] partial-stripe-writes will have, yes.

However, parity-raid /always/ has a write-performance tradeoff -- well, 
either that or a space/organization tradeoff if it does less than 
full-width stripes.  Traditional parity-raid /already/ has the 
read-modify-write problem for partial-stripe-width writes (and for 
non-traditional solutions such as zfs, partial-width stripe writes are a 
space-layout-efficiency problem instead), so lower small-write 
performance is already a tradeoff you're making by choosing parity-raid 
in the first place, and journaling only accentuates it a bit, as the 
price paid for closing the write hole.

The performance issue was a big part of the reason I ended up switching 
from parity-raid to raid1, back in the day on mdraid.  And it turned out 
I was /much/ happier with raid1, which had much better performance than I 
had thought it would (the mdraid raid1 scheduler is recognized for its 
high-efficiency read-scheduling and for parallel-write scheduling, so 
write latency is only about the same as writes to a single device, while 
many or large reads are smart-scheduled to parallelize across all 
mirrors).

(The other part of the reason I switched back to raid1 on mdraid was 
because I had rather naively thought on parity-raid the parity would be 
cross-checked in the standard read path, giving me integrity checking as 
well.  Turns out that's not the case; parity is only used for rebuilds in 
case of device-loss and isn't checked for normal reads, a great 
disappointment.  That's actually why I'm so looking forward to btrfs 3-
and-4-way mirroring, because btrfs already has full checksumming and 
routine checking on read, for data and metadata integrity, but currently 
only has two-way-mirroring, so if you're down a device and the copy on 
the remaining device is bad, you're just out of luck, whereas 3-way-
mirroring would let a device be bad and still give me a backup if one of 
the two remaining copies ended up failing checksum verification.  4-way-
mirroring would obviously add yet another copy, but 3-way is the sweet-
spot for me.)

Performance is why mdraid recommends putting the journal on a faster 
device (or better yet mirrored ones, avoiding the single point of failure 
of a single journal device), ssd or nvram, turning a slow-down into a 
speedup due to the write-cache.  But btrfs doesn't have device-purpose 
specification like that built in yet, so it's either all devices, or use 
something like bcache with an ssd as the front device.  (The ssd used as 
the bcache front device can be partitioned to allow a single ssd to 
cache multiple slower backing devices.)
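
(Very roughly, and only as a sketch with made-up device names -- see the 
bcache documentation for the real procedure:

  make-bcache -B /dev/sdX        # slower backing device
  make-bcache -C /dev/nvme0n1p1  # ssd/nvme cache device
  echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

then put btrfs on the resulting /dev/bcacheN devices instead of the raw 
ones, one backing device per slow disk, all attached to the same cache 
set.)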

OTOH, as stated it's only smaller less-than-stripe-width writes that will 
be affected.  As soon as you're writing more than stripe width, as with 
large files for data or for metadata when copying whole subdir trees, 
most of it will be full-stripe-writes and thus shouldn't have to (I'm not 
sure how it's actually going to be implemented) be logged/journalled.


Meanwhile, at least one of the other alternatives -- less-than-full-width 
stripes, or writing partly empty full-width stripes as necessary, of 
course with COW so read-modify-write is entirely avoided -- will likely 
eventually be available on btrfs as well.  But those have their own 
tradeoffs: faster initially than the logged/journalled solution, but 
less efficient initial space utilization, with a clean-up balance likely 
required periodically to rewrite all the short stripes (either less than 
full width or partially empty) to full width.  So all possibilities have 
their tradeoffs; none is a "magic" solution that entirely does away with 
the problems inherent to parity-raid without tradeoffs of /some/ sort.

But zfs is already (optionally? I don't know) using these tradeoffs, and 
on mdraid there are options, and people often aren't even aware of the 
tradeoffs they're taking on those solutions, so... I suppose when it's 
all said and done the only people aware of the issues on btrfs are likely 
going to be the highly technical and case-optimizer crowds, too.  
Everyone else will probably just use the defaults and not even be aware 
of the tradeoffs they're making by doing so, as is already the case on 
mdraid and zfs.

---
[1] As I'm no longer running either mdraid or parity-raid, I've not 
followed this extremely closely, but writing this actually spurred me to 
google the problem and see when and how mdraid fixed it.  So the links 
are from that. =:^)

[2] Journalling/journaling, one or two Ls?  The spellcheck flags both and 
last I tried googling it the answer was inconclusive.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-08-16 19:38       ` erenthetitan
@ 2018-08-17  8:33         ` Menion
  0 siblings, 0 replies; 21+ messages in thread
From: Menion @ 2018-08-17  8:33 UTC (permalink / raw)
  To: erenthetitan; +Cc: Zygo Blaxell, linux-btrfs

OK, but I cannot guarantee that I won't need to cancel the scrub during the process.
As said, this is domestic storage, and when a scrub is running the
performance hit is big enough to prevent smooth streaming of HD and 4k
movies.
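
(For reference, when I have to interrupt it I just run

  btrfs scrub cancel /media/storage/das1

and later

  btrfs scrub resume /media/storage/das1

which is where the resumed totals below come from.)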
On Thu, 16 Aug 2018 at 21:38, <erenthetitan@mail.de> wrote:
>
> Could you show scrub status -d, then start a new scrub (all drives) and show scrub status -d again? This may help us diagnose the problem.
>
> On 15-Aug-2018 09:27:40 +0200, menion@gmail.com wrote:
> > I needed to resume the scrub two times after an unclean shutdown (I was
> > cooking and using too much electricity) and two times after a manual
> > cancel, because I wanted to watch a 4k movie and the array
> > performances were not enough with scrub active.
> > Each time I resumed it, I checked also the status, and the total
> > number of data scrubbed was keep counting (never started from zero)
> > On Wed, 15 Aug 2018 at 05:33, Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> > > > Hi
> > > > Well, I think it is worth to give more details on the array.
> > > > the array is built with 5x8TB HDDs in an external USB3.0 to SATAIII enclosure
> > > > The enclosure is cheap JMicron based Chinese stuff (from Orico).
> > > > There is one USB3.0 link for all 5 HDDs with a SATAIII 3.0Gb
> > > > multiplexer behind it. So you cannot expect peak performance, which is
> > > > not the goal of this array (domestic data storage).
> > > > Also the USB to SATA firmware is buggy, so UAS operations are not
> > > > stable; it runs in BOT mode.
> > > > Having said so, the scrub has been started (and resumed) on the array
> > > > mount point:
> > > >
> > > > sudo btrfs scrub start(resume) /media/storage/das1
> > >
> > > So is 2.59TB the amount scrubbed _since resume_? If you run a complete
> > > scrub end to end without cancelling or rebooting in between, what is
> > > the size on all disks (btrfs scrub status -d)?
> > >
> > > even if, reading the documentation, I understand that it is the same
> > > whether you invoke it on the mountpoint or on one of the HDDs in the array.
> > > > In the end, especially for a RAID5 array, does it really make sense to
> > > > scrub only one disk in the array???
> > >
> > > You would set up a shell for-loop and scrub each disk of the array
> > > in turn. Each scrub would correct errors on a single device.
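> > >
> > > Something along these lines, for example -- adjust the device list to
> > > match the btrfs fi show output, and note -B keeps each scrub in the
> > > foreground so they run one at a time (just a sketch, untested here):
> > >
> > >   for dev in /dev/sd{a..e}; do btrfs scrub start -B "$dev"; done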
> > >
> > > There was a bug in btrfs scrub where scrubbing the filesystem would
> > > create one thread for each disk, and the threads would issue commands
> > > to all disks and compete with each other for IO, resulting in terrible
> > > performance on most non-SSD hardware. By scrubbing disks one at a time,
> > > there are no competing threads, so the scrub runs many times faster.
> > > With this bug the total time to scrub all disks individually is usually
> > > less than the time to scrub the entire filesystem at once, especially
> > > on HDD (and even if it's not faster, one-at-a-time disk scrubs are
> > > much kinder to any other process trying to use the filesystem at the
> > > same time).
> > >
> > > It appears this bug is not fixed, based on some timing results I am
> > > getting from a test array. iostat shows 10x more reads than writes on
> > > all disks even when all blocks on one disk are corrupted and the scrub
> > > is given only a single disk to process (that should result in roughly
> > > equal reads on all disks slightly above the number of writes on the
> > > corrupted disk).
> > >
> > > This is where my earlier caveat about performance comes from. Many parts
> > > of btrfs raid5 are somewhere between slower and *much* slower than
> > > comparable software raid5 implementations. Some of that is by design:
> > > btrfs must be at least 1% slower than mdadm because btrfs needs to read
> > > metadata to verify data block csums in scrub, and the difference would
> > > be much larger in practice due to HDD seek times, but 500%-900% overhead
> > > still seems high especially when compared to btrfs raid1 that has the
> > > same metadata csum reading issue without the huge performance gap.
> > >
> > > It seems like btrfs raid5 could still use a thorough profiling to figure
> > > out where it's spending all its IO.
> > >
> > > > Regarding the data usage, here you have the current figures:
> > > >
> > > > menion@Menionubuntu:~$ sudo btrfs fi show
> > > > [sudo] password for menion:
> > > > Label: none uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
> > > > Total devices 1 FS bytes used 11.44GiB
> > > > devid 1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3
> > > >
> > > > Label: none uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > > > Total devices 5 FS bytes used 6.57TiB
> > > > devid 1 size 7.28TiB used 1.64TiB path /dev/sda
> > > > devid 2 size 7.28TiB used 1.64TiB path /dev/sdb
> > > > devid 3 size 7.28TiB used 1.64TiB path /dev/sdc
> > > > devid 4 size 7.28TiB used 1.64TiB path /dev/sdd
> > > > devid 5 size 7.28TiB used 1.64TiB path /dev/sde
> > > >
> > > > menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1
> > > > Data, RAID5: total=6.57TiB, used=6.56TiB
> > > > System, RAID5: total=12.75MiB, used=416.00KiB
> > > > Metadata, RAID5: total=9.00GiB, used=8.16GiB
> > > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > > menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1
> > > > WARNING: RAID56 detected, not implemented
> > > > WARNING: RAID56 detected, not implemented
> > > > WARNING: RAID56 detected, not implemented
> > > > Overall:
> > > > Device size: 36.39TiB
> > > > Device allocated: 0.00B
> > > > Device unallocated: 36.39TiB
> > > > Device missing: 0.00B
> > > > Used: 0.00B
> > > > Free (estimated): 0.00B (min: 8.00EiB)
> > > > Data ratio: 0.00
> > > > Metadata ratio: 0.00
> > > > Global reserve: 512.00MiB (used: 32.00KiB)
> > > >
> > > > Data,RAID5: Size:6.57TiB, Used:6.56TiB
> > > > /dev/sda 1.64TiB
> > > > /dev/sdb 1.64TiB
> > > > /dev/sdc 1.64TiB
> > > > /dev/sdd 1.64TiB
> > > > /dev/sde 1.64TiB
> > > >
> > > > Metadata,RAID5: Size:9.00GiB, Used:8.16GiB
> > > > /dev/sda 2.25GiB
> > > > /dev/sdb 2.25GiB
> > > > /dev/sdc 2.25GiB
> > > > /dev/sdd 2.25GiB
> > > > /dev/sde 2.25GiB
> > > >
> > > > System,RAID5: Size:12.75MiB, Used:416.00KiB
> > > > /dev/sda 3.19MiB
> > > > /dev/sdb 3.19MiB
> > > > /dev/sdc 3.19MiB
> > > > /dev/sdd 3.19MiB
> > > > /dev/sde 3.19MiB
> > > >
> > > > Unallocated:
> > > > /dev/sda 5.63TiB
> > > > /dev/sdb 5.63TiB
> > > > /dev/sdc 5.63TiB
> > > > /dev/sdd 5.63TiB
> > > > /dev/sde 5.63TiB
> > > > menion@Menionubuntu:~$
> > > > menion@Menionubuntu:~$ sf -h
> > > > The program 'sf' is currently not installed. You can install it by typing:
> > > > sudo apt install ruby-sprite-factory
> > > > menion@Menionubuntu:~$ df -h
> > > > Filesystem Size Used Avail Use% Mounted on
> > > > udev 934M 0 934M 0% /dev
> > > > tmpfs 193M 22M 171M 12% /run
> > > > /dev/mmcblk0p3 28G 12G 15G 44% /
> > > > tmpfs 962M 0 962M 0% /dev/shm
> > > > tmpfs 5,0M 0 5,0M 0% /run/lock
> > > > tmpfs 962M 0 962M 0% /sys/fs/cgroup
> > > > /dev/mmcblk0p1 188M 3,4M 184M 2% /boot/efi
> > > > /dev/mmcblk0p3 28G 12G 15G 44% /home
> > > > /dev/sda 37T 6,6T 29T 19% /media/storage/das1
> > > > tmpfs 193M 0 193M 0% /run/user/1000
> > > > menion@Menionubuntu:~$ btrfs --version
> > > > btrfs-progs v4.17
> > > >
> > > > So I don't fully understand where the scrub data size comes from
> > > > On Mon, 13 Aug 2018 at 23:56, <erenthetitan@mail.de> wrote:
> > > > >
> > > > > Running time of 55:06:35 indicates that the counter is right, it is not enough time to scrub the entire array using hdd.
> > > > >
> > > > > 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub start /dev/sdx1" only scrubs the selected partition,
> > > > > whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array.
> > > > >
> > > > > Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics and post the output.
> > > > > For live statistics, use "sudo watch -n 1".
> > > > >
> > > > > By the way:
> > > > > 0 errors despite multiple unclean shutdowns? I assumed that the write hole would corrupt parity the first time around; was I wrong?
> > > > >
> > > > > On 13-Aug-2018 09:20:36 +0200, menion@gmail.com wrote:
> > > > > > Hi
> > > > > > I have a BTRFS RAID5 array built on 5x8TB HDDs, filled with, well :),
> > > > > > there are contradicting opinions among the, well, "several" ways to check
> > > > > > the used space on a BTRFS RAID5 array, but it should be around 8TB of
> > > > > > data.
> > > > > > This array is running on kernel 4.17.3 and it definitely experienced
> > > > > > power loss while data was being written.
> > > > > > I can say that it went through at least a dozen unclean shutdowns.
> > > > > > So following this thread I started my first scrub on the array, and
> > > > > > this is the outcome (after having resumed it 4 times, two after a
> > > > > > power loss...):
> > > > > >
> > > > > > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> > > > > > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > > > > > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> > > > > > total bytes scrubbed: 2.59TiB with 0 errors
> > > > > >
> > > > > > So, there are 0 errors, but I don't understand why it says 2.59TiB of
> > > > > > scrubbed data. Is it possible that this value is also crap, like the
> > > > > > non-zero counters for RAID5 arrays?
> > > > > > On Sat, 11 Aug 2018 at 17:29, Zygo Blaxell
> > > > > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > > > > >
> > > > > > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erenthetitan@mail.de wrote:
> > > > > > > > I guess that covers most topics, two last questions:
> > > > > > > >
> > > > > > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> > > > > > >
> > > > > > > Not really. It changes the probability distribution (you get an extra
> > > > > > > chance to recover using a parity block in some cases), but there are
> > > > > > > still cases where data gets lost that didn't need to be.
> > > > > > >
> > > > > > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> > > > > > >
> > > > > > > There may be benefits of raid5 metadata, but they are small compared to
> > > > > > > the risks.
> > > > > > >
> > > > > > > In some configurations it may not be possible to allocate the last
> > > > > > > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > > > > > > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > > > > > > N is an odd number there could be one chunk left over in the array that
> > > > > > > is unusable. Most users will find this irrelevant because a large disk
> > > > > > > array that is filled to the last GB will become quite slow due to long
> > > > > > > free space search and seek times--you really want to keep usage below 95%,
> > > > > > > maybe 98% at most, and that means the last GB will never be needed.
> > > > > > >
> > > > > > > Reading raid5 metadata could theoretically be faster than raid1, but that
> > > > > > > depends on a lot of variables, so you can't assume it as a rule of thumb.
> > > > > > >
> > > > > > > Raid6 metadata is more interesting because it's the only currently
> > > > > > > supported way to get 2-disk failure tolerance in btrfs. Unfortunately
> > > > > > > that benefit is rather limited due to the write hole bug.
> > > > > > >
> > > > > > > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > > > > > > or 4 mirror copies instead of just 2). This would be much better for
> > > > > > > metadata than raid6--more flexible, more robust, and my guess is that
> > > > > > > it will be faster as well (no need for RMW updates or journal seeks).
> > > > > > >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-08-15  7:27     ` Menion
@ 2018-08-16 19:38       ` erenthetitan
  2018-08-17  8:33         ` Menion
  0 siblings, 1 reply; 21+ messages in thread
From: erenthetitan @ 2018-08-16 19:38 UTC (permalink / raw)
  To: menion, ce3g8jdj; +Cc: linux-btrfs

Could you show scrub status -d, then start a new scrub (all drives) and show scrub status -d again? This may help us diagnose the problem.

On 15-Aug-2018 09:27:40 +0200, menion@gmail.com wrote:
> I needed to resume the scrub two times after an unclean shutdown (I was
> cooking and using too much electricity) and two times after a manual
> cancel, because I wanted to watch a 4k movie and the array
> performances were not enough with scrub active.
> Each time I resumed it, I checked also the status, and the total
> number of data scrubbed was keep counting (never started from zero)
> On Wed, 15 Aug 2018 at 05:33, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> > > Hi
> > > Well, I think it is worth to give more details on the array.
> > > the array is built with 5x8TB HDDs in an external USB3.0 to SATAIII enclosure
> > > The enclosure is cheap JMicron based Chinese stuff (from Orico).
> > > There is one USB3.0 link for all 5 HDDs with a SATAIII 3.0Gb
> > > multiplexer behind it. So you cannot expect peak performance, which is
> > > not the goal of this array (domestic data storage).
> > > Also the USB to SATA firmware is buggy, so UAS operations are not
> > > stable; it runs in BOT mode.
> > > Having said so, the scrub has been started (and resumed) on the array
> > > mount point:
> > >
> > > sudo btrfs scrub start(resume) /media/storage/das1
> >
> > So is 2.59TB the amount scrubbed _since resume_? If you run a complete
> > scrub end to end without cancelling or rebooting in between, what is
> > the size on all disks (btrfs scrub status -d)?
> >
> > > even if, reading the documentation, I understand that it is the same
> > > whether you invoke it on the mountpoint or on one of the HDDs in the array.
> > > In the end, especially for a RAID5 array, does it really make sense to
> > > scrub only one disk in the array???
> >
> > You would set up a shell for-loop and scrub each disk of the array
> > in turn. Each scrub would correct errors on a single device.
> >
> > There was a bug in btrfs scrub where scrubbing the filesystem would
> > create one thread for each disk, and the threads would issue commands
> > to all disks and compete with each other for IO, resulting in terrible
> > performance on most non-SSD hardware. By scrubbing disks one at a time,
> > there are no competing threads, so the scrub runs many times faster.
> > With this bug the total time to scrub all disks individually is usually
> > less than the time to scrub the entire filesystem at once, especially
> > on HDD (and even if it's not faster, one-at-a-time disk scrubs are
> > much kinder to any other process trying to use the filesystem at the
> > same time).
> >
> > It appears this bug is not fixed, based on some timing results I am
> > getting from a test array. iostat shows 10x more reads than writes on
> > all disks even when all blocks on one disk are corrupted and the scrub
> > is given only a single disk to process (that should result in roughly
> > equal reads on all disks slightly above the number of writes on the
> > corrupted disk).
> >
> > This is where my earlier caveat about performance comes from. Many parts
> > of btrfs raid5 are somewhere between slower and *much* slower than
> > comparable software raid5 implementations. Some of that is by design:
> > btrfs must be at least 1% slower than mdadm because btrfs needs to read
> > metadata to verify data block csums in scrub, and the difference would
> > be much larger in practice due to HDD seek times, but 500%-900% overhead
> > still seems high especially when compared to btrfs raid1 that has the
> > same metadata csum reading issue without the huge performance gap.
> >
> > It seems like btrfs raid5 could still use a thorough profiling to figure
> > out where it's spending all its IO.
> >
> > > Regarding the data usage, here you have the current figures:
> > >
> > > menion@Menionubuntu:~$ sudo btrfs fi show
> > > [sudo] password for menion:
> > > Label: none uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
> > > Total devices 1 FS bytes used 11.44GiB
> > > devid 1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3
> > >
> > > Label: none uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > > Total devices 5 FS bytes used 6.57TiB
> > > devid 1 size 7.28TiB used 1.64TiB path /dev/sda
> > > devid 2 size 7.28TiB used 1.64TiB path /dev/sdb
> > > devid 3 size 7.28TiB used 1.64TiB path /dev/sdc
> > > devid 4 size 7.28TiB used 1.64TiB path /dev/sdd
> > > devid 5 size 7.28TiB used 1.64TiB path /dev/sde
> > >
> > > menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1
> > > Data, RAID5: total=6.57TiB, used=6.56TiB
> > > System, RAID5: total=12.75MiB, used=416.00KiB
> > > Metadata, RAID5: total=9.00GiB, used=8.16GiB
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1
> > > WARNING: RAID56 detected, not implemented
> > > WARNING: RAID56 detected, not implemented
> > > WARNING: RAID56 detected, not implemented
> > > Overall:
> > > Device size: 36.39TiB
> > > Device allocated: 0.00B
> > > Device unallocated: 36.39TiB
> > > Device missing: 0.00B
> > > Used: 0.00B
> > > Free (estimated): 0.00B (min: 8.00EiB)
> > > Data ratio: 0.00
> > > Metadata ratio: 0.00
> > > Global reserve: 512.00MiB (used: 32.00KiB)
> > >
> > > Data,RAID5: Size:6.57TiB, Used:6.56TiB
> > > /dev/sda 1.64TiB
> > > /dev/sdb 1.64TiB
> > > /dev/sdc 1.64TiB
> > > /dev/sdd 1.64TiB
> > > /dev/sde 1.64TiB
> > >
> > > Metadata,RAID5: Size:9.00GiB, Used:8.16GiB
> > > /dev/sda 2.25GiB
> > > /dev/sdb 2.25GiB
> > > /dev/sdc 2.25GiB
> > > /dev/sdd 2.25GiB
> > > /dev/sde 2.25GiB
> > >
> > > System,RAID5: Size:12.75MiB, Used:416.00KiB
> > > /dev/sda 3.19MiB
> > > /dev/sdb 3.19MiB
> > > /dev/sdc 3.19MiB
> > > /dev/sdd 3.19MiB
> > > /dev/sde 3.19MiB
> > >
> > > Unallocated:
> > > /dev/sda 5.63TiB
> > > /dev/sdb 5.63TiB
> > > /dev/sdc 5.63TiB
> > > /dev/sdd 5.63TiB
> > > /dev/sde 5.63TiB
> > > menion@Menionubuntu:~$
> > > menion@Menionubuntu:~$ sf -h
> > > The program 'sf' is currently not installed. You can install it by typing:
> > > sudo apt install ruby-sprite-factory
> > > menion@Menionubuntu:~$ df -h
> > > Filesystem Size Used Avail Use% Mounted on
> > > udev 934M 0 934M 0% /dev
> > > tmpfs 193M 22M 171M 12% /run
> > > /dev/mmcblk0p3 28G 12G 15G 44% /
> > > tmpfs 962M 0 962M 0% /dev/shm
> > > tmpfs 5,0M 0 5,0M 0% /run/lock
> > > tmpfs 962M 0 962M 0% /sys/fs/cgroup
> > > /dev/mmcblk0p1 188M 3,4M 184M 2% /boot/efi
> > > /dev/mmcblk0p3 28G 12G 15G 44% /home
> > > /dev/sda 37T 6,6T 29T 19% /media/storage/das1
> > > tmpfs 193M 0 193M 0% /run/user/1000
> > > menion@Menionubuntu:~$ btrfs --version
> > > btrfs-progs v4.17
> > >
> > > So I don't fully understand where the scrub data size comes from
> > > On Mon, 13 Aug 2018 at 23:56, <erenthetitan@mail.de> wrote:
> > > >
> > > > Running time of 55:06:35 indicates that the counter is right, it is not enough time to scrub the entire array using hdd.
> > > >
> > > > 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub start /dev/sdx1" only scrubs the selected partition,
> > > > whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array.
> > > >
> > > > Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics and post the output.
> > > > For live statistics, use "sudo watch -n 1".
> > > >
> > > > By the way:
> > > > 0 errors despite multiple unclean shutdowns? I assumed that the write hole would corrupt parity the first time around, was i wrong?
> > > >
> > > > Am 13-Aug-2018 09:20:36 +0200 schrieb menion@gmail.com:
> > > > > Hi
> > > > > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> > > > > there are contradicting opinions by the, well, "several" ways to check
> > > > > the used space on a BTRFS RAID5 array, but I should be aroud 8TB of
> > > > > data.
> > > > > This array is running on kernel 4.17.3 and it definitely experienced
> > > > > power loss while data was being written.
> > > > > I can say that it wen through at least a dozen of unclear shutdown
> > > > > So following this thread I started my first scrub on the array. and
> > > > > this is the outcome (after having resumed it 4 times, two after a
> > > > > power loss...):
> > > > >
> > > > > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> > > > > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > > > > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> > > > > total bytes scrubbed: 2.59TiB with 0 errors
> > > > >
> > > > > So, there are 0 errors, but I don't understand why it says 2.59TiB of
> > > > > scrubbed data. Is it possible that also this values is crap, as the
> > > > > non zero counters for RAID5 array?
> > > > > Il giorno sab 11 ago 2018 alle ore 17:29 Zygo Blaxell
> > > > > <ce3g8jdj@umail.furryterror.org> ha scritto:
> > > > > >
> > > > > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erenthetitan@mail.de wrote:
> > > > > > > I guess that covers most topics, two last questions:
> > > > > > >
> > > > > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> > > > > >
> > > > > > Not really. It changes the probability distribution (you get an extra
> > > > > > chance to recover using a parity block in some cases), but there are
> > > > > > still cases where data gets lost that didn't need to be.
> > > > > >
> > > > > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> > > > > >
> > > > > > There may be benefits of raid5 metadata, but they are small compared to
> > > > > > the risks.
> > > > > >
> > > > > > In some configurations it may not be possible to allocate the last
> > > > > > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > > > > > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > > > > > N is an odd number there could be one chunk left over in the array that
> > > > > > is unusable. Most users will find this irrelevant because a large disk
> > > > > > array that is filled to the last GB will become quite slow due to long
> > > > > > free space search and seek times--you really want to keep usage below 95%,
> > > > > > maybe 98% at most, and that means the last GB will never be needed.
> > > > > >
> > > > > > Reading raid5 metadata could theoretically be faster than raid1, but that
> > > > > > depends on a lot of variables, so you can't assume it as a rule of thumb.
> > > > > >
> > > > > > Raid6 metadata is more interesting because it's the only currently
> > > > > > supported way to get 2-disk failure tolerance in btrfs. Unfortunately
> > > > > > that benefit is rather limited due to the write hole bug.
> > > > > >
> > > > > > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > > > > > or 4 mirror copies instead of just 2). This would be much better for
> > > > > > metadata than raid6--more flexible, more robust, and my guess is that
> > > > > > it will be faster as well (no need for RMW updates or journal seeks).
> > > > > >
> > > > > > > -------------------------------------------------------------------------------------------------
> > > > > > > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
> > > > > > >
> > > >
> > > >
> > > > -------------------------------------------------------------------------------------------------
> > > > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT

-------------------------------------------------------------------------------------------------
FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-08-15  3:33   ` Zygo Blaxell
@ 2018-08-15  7:27     ` Menion
  2018-08-16 19:38       ` erenthetitan
  0 siblings, 1 reply; 21+ messages in thread
From: Menion @ 2018-08-15  7:27 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: erenthetitan, linux-btrfs

I needed to resume the scrub twice after an unclean shutdown (I was
cooking and using too much electricity) and twice after a manual
cancel, because I wanted to watch a 4K movie and the array's
performance was not sufficient with the scrub active.
Each time I resumed it, I also checked the status, and the total
amount of data scrubbed kept counting up (it never started from zero).
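For reference, the pause/resume cycle described above presumably maps onto
these scrub subcommands (mount point taken from this thread):

    sudo btrfs scrub cancel /media/storage/das1   # stop the running scrub; progress is saved
    sudo btrfs scrub resume /media/storage/das1   # continue from where it left off
    sudo btrfs scrub status /media/storage/das1   # check the running totals
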
On Wed, 15 Aug 2018 at 05:33, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> > Hi
> > Well, I think it is worth to give more details on the array.
> > the array is built with 5x8TB HDD in an esternal USB3.0 to SATAIII enclosure
> > The enclosure is a cheap JMicron based chinese stuff (from Orico).
> > There is one USB3.0 link for all the 5 HDD with a SATAIII 3.0Gb
> > multiplexer behind it. So you cannot expect peak performance, which is
> > not the goal of this array (domestic data storage).
> > Also the USB to SATA firmware is buggy, so UAS operations are not
> > stable, it run in BOT mode.
> > Having said so, the scrub has been started (and resumed) on the array
> > mount point:
> >
> > sudo btrfs scrub start(resume) /media/storage/das1
>
> So is 2.59TB the amount scrubbed _since resume_?  If you run a complete
> scrub end to end without cancelling or rebooting in between, what is
> the size on all disks (btrfs scrub status -d)?
>
> > even if reading the documentation I understand that it is the same
> > invoking it on mountpoint or one of the HDD in the array.
> > In the end, especially for a RAID5 array, does it really make sense to
> > scrub only one disk in the array???
>
> You would set up a shell for-loop and scrub each disk of the array
> in turn.  Each scrub would correct errors on a single device.
>
> There was a bug in btrfs scrub where scrubbing the filesystem would
> create one thread for each disk, and the threads would issue commands
> to all disks and compete with each other for IO, resulting in terrible
> performance on most non-SSD hardware.  By scrubbing disks one at a time,
> there are no competing threads, so the scrub runs many times faster.
> With this bug the total time to scrub all disks individually is usually
> less than the time to scrub the entire filesystem at once, especially
> on HDD (and even if it's not faster, one-at-a-time disk scrubs are
> much kinder to any other process trying to use the filesystem at the
> same time).
>
> It appears this bug is not fixed, based on some timing results I am
> getting from a test array.  iostat shows 10x more reads than writes on
> all disks even when all blocks on one disk are corrupted and the scrub
> is given only a single disk to process (that should result in roughly
> equal reads on all disks slightly above the number of writes on the
> corrupted disk).
>
> This is where my earlier caveat about performance comes from.  Many parts
> of btrfs raid5 are somewhere between slower and *much* slower than
> comparable software raid5 implementations.  Some of that is by design:
> btrfs must be at least 1% slower than mdadm because btrfs needs to read
> metadata to verify data block csums in scrub, and the difference would
> be much larger in practice due to HDD seek times, but 500%-900% overhead
> still seems high especially when compared to btrfs raid1 that has the
> same metadata csum reading issue without the huge performance gap.
>
> It seems like btrfs raid5 could still use a thorough profiling to figure
> out where it's spending all its IO.
>
> > Regarding the data usage, here you have the current figures:
> >
> > menion@Menionubuntu:~$ sudo btrfs fi show
> > [sudo] password for menion:
> > Label: none  uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
> > Total devices 1 FS bytes used 11.44GiB
> > devid    1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3
> >
> > Label: none  uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > Total devices 5 FS bytes used 6.57TiB
> > devid    1 size 7.28TiB used 1.64TiB path /dev/sda
> > devid    2 size 7.28TiB used 1.64TiB path /dev/sdb
> > devid    3 size 7.28TiB used 1.64TiB path /dev/sdc
> > devid    4 size 7.28TiB used 1.64TiB path /dev/sdd
> > devid    5 size 7.28TiB used 1.64TiB path /dev/sde
> >
> > menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1
> > Data, RAID5: total=6.57TiB, used=6.56TiB
> > System, RAID5: total=12.75MiB, used=416.00KiB
> > Metadata, RAID5: total=9.00GiB, used=8.16GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> > menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1
> > WARNING: RAID56 detected, not implemented
> > WARNING: RAID56 detected, not implemented
> > WARNING: RAID56 detected, not implemented
> > Overall:
> >     Device size:   36.39TiB
> >     Device allocated:      0.00B
> >     Device unallocated:   36.39TiB
> >     Device missing:      0.00B
> >     Used:      0.00B
> >     Free (estimated):      0.00B (min: 8.00EiB)
> >     Data ratio:       0.00
> >     Metadata ratio:       0.00
> >     Global reserve: 512.00MiB (used: 32.00KiB)
> >
> > Data,RAID5: Size:6.57TiB, Used:6.56TiB
> >    /dev/sda    1.64TiB
> >    /dev/sdb    1.64TiB
> >    /dev/sdc    1.64TiB
> >    /dev/sdd    1.64TiB
> >    /dev/sde    1.64TiB
> >
> > Metadata,RAID5: Size:9.00GiB, Used:8.16GiB
> >    /dev/sda    2.25GiB
> >    /dev/sdb    2.25GiB
> >    /dev/sdc    2.25GiB
> >    /dev/sdd    2.25GiB
> >    /dev/sde    2.25GiB
> >
> > System,RAID5: Size:12.75MiB, Used:416.00KiB
> >    /dev/sda    3.19MiB
> >    /dev/sdb    3.19MiB
> >    /dev/sdc    3.19MiB
> >    /dev/sdd    3.19MiB
> >    /dev/sde    3.19MiB
> >
> > Unallocated:
> >    /dev/sda    5.63TiB
> >    /dev/sdb    5.63TiB
> >    /dev/sdc    5.63TiB
> >    /dev/sdd    5.63TiB
> >    /dev/sde    5.63TiB
> > menion@Menionubuntu:~$
> > menion@Menionubuntu:~$ sf -h
> > The program 'sf' is currently not installed. You can install it by typing:
> > sudo apt install ruby-sprite-factory
> > menion@Menionubuntu:~$ df -h
> > Filesystem      Size  Used Avail Use% Mounted on
> > udev            934M     0  934M   0% /dev
> > tmpfs           193M   22M  171M  12% /run
> > /dev/mmcblk0p3   28G   12G   15G  44% /
> > tmpfs           962M     0  962M   0% /dev/shm
> > tmpfs           5,0M     0  5,0M   0% /run/lock
> > tmpfs           962M     0  962M   0% /sys/fs/cgroup
> > /dev/mmcblk0p1  188M  3,4M  184M   2% /boot/efi
> > /dev/mmcblk0p3   28G   12G   15G  44% /home
> > /dev/sda         37T  6,6T   29T  19% /media/storage/das1
> > tmpfs           193M     0  193M   0% /run/user/1000
> > menion@Menionubuntu:~$ btrfs --version
> > btrfs-progs v4.17
> >
> > So I don't fully understand where the scrub data size comes from
> > Il giorno lun 13 ago 2018 alle ore 23:56 <erenthetitan@mail.de> ha scritto:
> > >
> > > Running time of 55:06:35 indicates that the counter is right, it is not enough time to scrub the entire array using hdd.
> > >
> > > 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub start /dev/sdx1" only scrubs the selected partition,
> > > whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array.
> > >
> > > Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics and post the output.
> > > For live statistics, use "sudo watch -n 1".
> > >
> > > By the way:
> > > 0 errors despite multiple unclean shutdowns? I assumed that the write hole would corrupt parity the first time around, was i wrong?
> > >
> > > Am 13-Aug-2018 09:20:36 +0200 schrieb menion@gmail.com:
> > > > Hi
> > > > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> > > > there are contradicting opinions by the, well, "several" ways to check
> > > > the used space on a BTRFS RAID5 array, but I should be aroud 8TB of
> > > > data.
> > > > This array is running on kernel 4.17.3 and it definitely experienced
> > > > power loss while data was being written.
> > > > I can say that it wen through at least a dozen of unclear shutdown
> > > > So following this thread I started my first scrub on the array. and
> > > > this is the outcome (after having resumed it 4 times, two after a
> > > > power loss...):
> > > >
> > > > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> > > > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > > > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> > > > total bytes scrubbed: 2.59TiB with 0 errors
> > > >
> > > > So, there are 0 errors, but I don't understand why it says 2.59TiB of
> > > > scrubbed data. Is it possible that also this values is crap, as the
> > > > non zero counters for RAID5 array?
> > > > Il giorno sab 11 ago 2018 alle ore 17:29 Zygo Blaxell
> > > > <ce3g8jdj@umail.furryterror.org> ha scritto:
> > > > >
> > > > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erenthetitan@mail.de wrote:
> > > > > > I guess that covers most topics, two last questions:
> > > > > >
> > > > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> > > > >
> > > > > Not really. It changes the probability distribution (you get an extra
> > > > > chance to recover using a parity block in some cases), but there are
> > > > > still cases where data gets lost that didn't need to be.
> > > > >
> > > > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> > > > >
> > > > > There may be benefits of raid5 metadata, but they are small compared to
> > > > > the risks.
> > > > >
> > > > > In some configurations it may not be possible to allocate the last
> > > > > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > > > > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > > > > N is an odd number there could be one chunk left over in the array that
> > > > > is unusable. Most users will find this irrelevant because a large disk
> > > > > array that is filled to the last GB will become quite slow due to long
> > > > > free space search and seek times--you really want to keep usage below 95%,
> > > > > maybe 98% at most, and that means the last GB will never be needed.
> > > > >
> > > > > Reading raid5 metadata could theoretically be faster than raid1, but that
> > > > > depends on a lot of variables, so you can't assume it as a rule of thumb.
> > > > >
> > > > > Raid6 metadata is more interesting because it's the only currently
> > > > > supported way to get 2-disk failure tolerance in btrfs. Unfortunately
> > > > > that benefit is rather limited due to the write hole bug.
> > > > >
> > > > > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > > > > or 4 mirror copies instead of just 2). This would be much better for
> > > > > metadata than raid6--more flexible, more robust, and my guess is that
> > > > > it will be faster as well (no need for RMW updates or journal seeks).
> > > > >
> > > > > > -------------------------------------------------------------------------------------------------
> > > > > > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
> > > > > >
> > >
> > >
> > > -------------------------------------------------------------------------------------------------
> > > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-08-14  7:32 ` Menion
@ 2018-08-15  3:33   ` Zygo Blaxell
  2018-08-15  7:27     ` Menion
  0 siblings, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2018-08-15  3:33 UTC (permalink / raw)
  To: Menion; +Cc: erenthetitan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 10541 bytes --]

On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> Hi
> Well, I think it is worth to give more details on the array.
> the array is built with 5x8TB HDD in an esternal USB3.0 to SATAIII enclosure
> The enclosure is a cheap JMicron based chinese stuff (from Orico).
> There is one USB3.0 link for all the 5 HDD with a SATAIII 3.0Gb
> multiplexer behind it. So you cannot expect peak performance, which is
> not the goal of this array (domestic data storage).
> Also the USB to SATA firmware is buggy, so UAS operations are not
> stable, it run in BOT mode.
> Having said so, the scrub has been started (and resumed) on the array
> mount point:
> 
> sudo btrfs scrub start(resume) /media/storage/das1

So is 2.59TB the amount scrubbed _since resume_?  If you run a complete
scrub end to end without cancelling or rebooting in between, what is
the size on all disks (btrfs scrub status -d)?

> even if reading the documentation I understand that it is the same
> invoking it on mountpoint or one of the HDD in the array.
> In the end, especially for a RAID5 array, does it really make sense to
> scrub only one disk in the array???

You would set up a shell for-loop and scrub each disk of the array
in turn.  Each scrub would correct errors on a single device.
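A minimal sketch of such a loop, assuming the five member devices shown
later in this thread (/dev/sda through /dev/sde):

    for dev in /dev/sd{a,b,c,d,e}; do
        sudo btrfs scrub start -B "$dev"   # -B: stay in the foreground until this device is done
    done

Because -B waits for completion, only one device is being scrubbed at any
given time.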

There was a bug in btrfs scrub where scrubbing the filesystem would
create one thread for each disk, and the threads would issue commands
to all disks and compete with each other for IO, resulting in terrible
performance on most non-SSD hardware.  By scrubbing disks one at a time,
there are no competing threads, so the scrub runs many times faster.
With this bug the total time to scrub all disks individually is usually
less than the time to scrub the entire filesystem at once, especially
on HDD (and even if it's not faster, one-at-a-time disk scrubs are
much kinder to any other process trying to use the filesystem at the
same time).

It appears this bug is not fixed, based on some timing results I am
getting from a test array.  iostat shows 10x more reads than writes on
all disks even when all blocks on one disk are corrupted and the scrub
is given only a single disk to process (that should result in roughly
equal reads on all disks slightly above the number of writes on the
corrupted disk).
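(For anyone reproducing this, the read/write imbalance can be watched with
something like "iostat -dx 5", i.e. extended per-device statistics refreshed
every 5 seconds; the exact invocation behind the numbers above is not stated
here, so this is only an assumption about how to observe them.)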

This is where my earlier caveat about performance comes from.  Many parts
of btrfs raid5 are somewhere between slower and *much* slower than
comparable software raid5 implementations.  Some of that is by design:
btrfs must be at least 1% slower than mdadm because btrfs needs to read
metadata to verify data block csums in scrub, and the difference would
be much larger in practice due to HDD seek times, but 500%-900% overhead
still seems high especially when compared to btrfs raid1 that has the
same metadata csum reading issue without the huge performance gap.

It seems like btrfs raid5 could still use a thorough profiling to figure
out where it's spending all its IO.

> Regarding the data usage, here you have the current figures:
> 
> menion@Menionubuntu:~$ sudo btrfs fi show
> [sudo] password for menion:
> Label: none  uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
> Total devices 1 FS bytes used 11.44GiB
> devid    1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3
> 
> Label: none  uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> Total devices 5 FS bytes used 6.57TiB
> devid    1 size 7.28TiB used 1.64TiB path /dev/sda
> devid    2 size 7.28TiB used 1.64TiB path /dev/sdb
> devid    3 size 7.28TiB used 1.64TiB path /dev/sdc
> devid    4 size 7.28TiB used 1.64TiB path /dev/sdd
> devid    5 size 7.28TiB used 1.64TiB path /dev/sde
> 
> menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1
> Data, RAID5: total=6.57TiB, used=6.56TiB
> System, RAID5: total=12.75MiB, used=416.00KiB
> Metadata, RAID5: total=9.00GiB, used=8.16GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> Overall:
>     Device size:   36.39TiB
>     Device allocated:      0.00B
>     Device unallocated:   36.39TiB
>     Device missing:      0.00B
>     Used:      0.00B
>     Free (estimated):      0.00B (min: 8.00EiB)
>     Data ratio:       0.00
>     Metadata ratio:       0.00
>     Global reserve: 512.00MiB (used: 32.00KiB)
> 
> Data,RAID5: Size:6.57TiB, Used:6.56TiB
>    /dev/sda    1.64TiB
>    /dev/sdb    1.64TiB
>    /dev/sdc    1.64TiB
>    /dev/sdd    1.64TiB
>    /dev/sde    1.64TiB
> 
> Metadata,RAID5: Size:9.00GiB, Used:8.16GiB
>    /dev/sda    2.25GiB
>    /dev/sdb    2.25GiB
>    /dev/sdc    2.25GiB
>    /dev/sdd    2.25GiB
>    /dev/sde    2.25GiB
> 
> System,RAID5: Size:12.75MiB, Used:416.00KiB
>    /dev/sda    3.19MiB
>    /dev/sdb    3.19MiB
>    /dev/sdc    3.19MiB
>    /dev/sdd    3.19MiB
>    /dev/sde    3.19MiB
> 
> Unallocated:
>    /dev/sda    5.63TiB
>    /dev/sdb    5.63TiB
>    /dev/sdc    5.63TiB
>    /dev/sdd    5.63TiB
>    /dev/sde    5.63TiB
> menion@Menionubuntu:~$
> menion@Menionubuntu:~$ sf -h
> The program 'sf' is currently not installed. You can install it by typing:
> sudo apt install ruby-sprite-factory
> menion@Menionubuntu:~$ df -h
> Filesystem      Size  Used Avail Use% Mounted on
> udev            934M     0  934M   0% /dev
> tmpfs           193M   22M  171M  12% /run
> /dev/mmcblk0p3   28G   12G   15G  44% /
> tmpfs           962M     0  962M   0% /dev/shm
> tmpfs           5,0M     0  5,0M   0% /run/lock
> tmpfs           962M     0  962M   0% /sys/fs/cgroup
> /dev/mmcblk0p1  188M  3,4M  184M   2% /boot/efi
> /dev/mmcblk0p3   28G   12G   15G  44% /home
> /dev/sda         37T  6,6T   29T  19% /media/storage/das1
> tmpfs           193M     0  193M   0% /run/user/1000
> menion@Menionubuntu:~$ btrfs --version
> btrfs-progs v4.17
> 
> So I don't fully understand where the scrub data size comes from
> Il giorno lun 13 ago 2018 alle ore 23:56 <erenthetitan@mail.de> ha scritto:
> >
> > Running time of 55:06:35 indicates that the counter is right, it is not enough time to scrub the entire array using hdd.
> >
> > 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub start /dev/sdx1" only scrubs the selected partition,
> > whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array.
> >
> > Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics and post the output.
> > For live statistics, use "sudo watch -n 1".
> >
> > By the way:
> > 0 errors despite multiple unclean shutdowns? I assumed that the write hole would corrupt parity the first time around, was i wrong?
> >
> > Am 13-Aug-2018 09:20:36 +0200 schrieb menion@gmail.com:
> > > Hi
> > > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> > > there are contradicting opinions by the, well, "several" ways to check
> > > the used space on a BTRFS RAID5 array, but I should be aroud 8TB of
> > > data.
> > > This array is running on kernel 4.17.3 and it definitely experienced
> > > power loss while data was being written.
> > > I can say that it wen through at least a dozen of unclear shutdown
> > > So following this thread I started my first scrub on the array. and
> > > this is the outcome (after having resumed it 4 times, two after a
> > > power loss...):
> > >
> > > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> > > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> > > total bytes scrubbed: 2.59TiB with 0 errors
> > >
> > > So, there are 0 errors, but I don't understand why it says 2.59TiB of
> > > scrubbed data. Is it possible that also this values is crap, as the
> > > non zero counters for RAID5 array?
> > > Il giorno sab 11 ago 2018 alle ore 17:29 Zygo Blaxell
> > > <ce3g8jdj@umail.furryterror.org> ha scritto:
> > > >
> > > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erenthetitan@mail.de wrote:
> > > > > I guess that covers most topics, two last questions:
> > > > >
> > > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> > > >
> > > > Not really. It changes the probability distribution (you get an extra
> > > > chance to recover using a parity block in some cases), but there are
> > > > still cases where data gets lost that didn't need to be.
> > > >
> > > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> > > >
> > > > There may be benefits of raid5 metadata, but they are small compared to
> > > > the risks.
> > > >
> > > > In some configurations it may not be possible to allocate the last
> > > > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > > > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > > > N is an odd number there could be one chunk left over in the array that
> > > > is unusable. Most users will find this irrelevant because a large disk
> > > > array that is filled to the last GB will become quite slow due to long
> > > > free space search and seek times--you really want to keep usage below 95%,
> > > > maybe 98% at most, and that means the last GB will never be needed.
> > > >
> > > > Reading raid5 metadata could theoretically be faster than raid1, but that
> > > > depends on a lot of variables, so you can't assume it as a rule of thumb.
> > > >
> > > > Raid6 metadata is more interesting because it's the only currently
> > > > supported way to get 2-disk failure tolerance in btrfs. Unfortunately
> > > > that benefit is rather limited due to the write hole bug.
> > > >
> > > > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > > > or 4 mirror copies instead of just 2). This would be much better for
> > > > metadata than raid6--more flexible, more robust, and my guess is that
> > > > it will be faster as well (no need for RMW updates or journal seeks).
> > > >
> > > > > -------------------------------------------------------------------------------------------------
> > > > > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
> > > > >
> >
> >
> > -------------------------------------------------------------------------------------------------
> > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-08-13 21:56 erenthetitan
  2018-08-14  4:09 ` Zygo Blaxell
@ 2018-08-14  7:32 ` Menion
  2018-08-15  3:33   ` Zygo Blaxell
  1 sibling, 1 reply; 21+ messages in thread
From: Menion @ 2018-08-14  7:32 UTC (permalink / raw)
  To: erenthetitan; +Cc: Zygo Blaxell, linux-btrfs

Hi
Well, I think it is worth giving more details on the array.
The array is built with 5x8TB HDDs in an external USB3.0 to SATAIII enclosure.
The enclosure is a cheap JMicron-based Chinese unit (from Orico).
There is one USB3.0 link for all 5 HDDs, with a SATAIII 3.0Gb
multiplexer behind it, so you cannot expect peak performance, which is
not the goal of this array (domestic data storage).
Also, the USB-to-SATA firmware is buggy, so UAS operations are not
stable and it runs in BOT mode.
That said, the scrub has been started (and resumed) on the array
mount point:

sudo btrfs scrub start(resume) /media/storage/das1

even if, from reading the documentation, I understand that invoking it
on the mount point or on one of the HDDs in the array is the same thing.
In the end, especially for a RAID5 array, does it really make sense to
scrub only one disk in the array???
Regarding the data usage, here are the current figures:

menion@Menionubuntu:~$ sudo btrfs fi show
[sudo] password for menion:
Label: none  uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
Total devices 1 FS bytes used 11.44GiB
devid    1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3

Label: none  uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
Total devices 5 FS bytes used 6.57TiB
devid    1 size 7.28TiB used 1.64TiB path /dev/sda
devid    2 size 7.28TiB used 1.64TiB path /dev/sdb
devid    3 size 7.28TiB used 1.64TiB path /dev/sdc
devid    4 size 7.28TiB used 1.64TiB path /dev/sdd
devid    5 size 7.28TiB used 1.64TiB path /dev/sde

menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1
Data, RAID5: total=6.57TiB, used=6.56TiB
System, RAID5: total=12.75MiB, used=416.00KiB
Metadata, RAID5: total=9.00GiB, used=8.16GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
    Device size:   36.39TiB
    Device allocated:      0.00B
    Device unallocated:   36.39TiB
    Device missing:      0.00B
    Used:      0.00B
    Free (estimated):      0.00B (min: 8.00EiB)
    Data ratio:       0.00
    Metadata ratio:       0.00
    Global reserve: 512.00MiB (used: 32.00KiB)

Data,RAID5: Size:6.57TiB, Used:6.56TiB
   /dev/sda    1.64TiB
   /dev/sdb    1.64TiB
   /dev/sdc    1.64TiB
   /dev/sdd    1.64TiB
   /dev/sde    1.64TiB

Metadata,RAID5: Size:9.00GiB, Used:8.16GiB
   /dev/sda    2.25GiB
   /dev/sdb    2.25GiB
   /dev/sdc    2.25GiB
   /dev/sdd    2.25GiB
   /dev/sde    2.25GiB

System,RAID5: Size:12.75MiB, Used:416.00KiB
   /dev/sda    3.19MiB
   /dev/sdb    3.19MiB
   /dev/sdc    3.19MiB
   /dev/sdd    3.19MiB
   /dev/sde    3.19MiB

Unallocated:
   /dev/sda    5.63TiB
   /dev/sdb    5.63TiB
   /dev/sdc    5.63TiB
   /dev/sdd    5.63TiB
   /dev/sde    5.63TiB
menion@Menionubuntu:~$
menion@Menionubuntu:~$ sf -h
The program 'sf' is currently not installed. You can install it by typing:
sudo apt install ruby-sprite-factory
menion@Menionubuntu:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            934M     0  934M   0% /dev
tmpfs           193M   22M  171M  12% /run
/dev/mmcblk0p3   28G   12G   15G  44% /
tmpfs           962M     0  962M   0% /dev/shm
tmpfs           5,0M     0  5,0M   0% /run/lock
tmpfs           962M     0  962M   0% /sys/fs/cgroup
/dev/mmcblk0p1  188M  3,4M  184M   2% /boot/efi
/dev/mmcblk0p3   28G   12G   15G  44% /home
/dev/sda         37T  6,6T   29T  19% /media/storage/das1
tmpfs           193M     0  193M   0% /run/user/1000
menion@Menionubuntu:~$ btrfs --version
btrfs-progs v4.17

So I don't fully understand where the scrub data size comes from
On Mon, 13 Aug 2018 at 23:56, <erenthetitan@mail.de> wrote:
>
> Running time of 55:06:35 indicates that the counter is right, it is not enough time to scrub the entire array using hdd.
>
> 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub start /dev/sdx1" only scrubs the selected partition,
> whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array.
>
> Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics and post the output.
> For live statistics, use "sudo watch -n 1".
>
> By the way:
> 0 errors despite multiple unclean shutdowns? I assumed that the write hole would corrupt parity the first time around, was i wrong?
>
> Am 13-Aug-2018 09:20:36 +0200 schrieb menion@gmail.com:
> > Hi
> > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> > there are contradicting opinions by the, well, "several" ways to check
> > the used space on a BTRFS RAID5 array, but I should be aroud 8TB of
> > data.
> > This array is running on kernel 4.17.3 and it definitely experienced
> > power loss while data was being written.
> > I can say that it wen through at least a dozen of unclear shutdown
> > So following this thread I started my first scrub on the array. and
> > this is the outcome (after having resumed it 4 times, two after a
> > power loss...):
> >
> > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> > total bytes scrubbed: 2.59TiB with 0 errors
> >
> > So, there are 0 errors, but I don't understand why it says 2.59TiB of
> > scrubbed data. Is it possible that also this values is crap, as the
> > non zero counters for RAID5 array?
> > Il giorno sab 11 ago 2018 alle ore 17:29 Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> ha scritto:
> > >
> > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erenthetitan@mail.de wrote:
> > > > I guess that covers most topics, two last questions:
> > > >
> > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> > >
> > > Not really. It changes the probability distribution (you get an extra
> > > chance to recover using a parity block in some cases), but there are
> > > still cases where data gets lost that didn't need to be.
> > >
> > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> > >
> > > There may be benefits of raid5 metadata, but they are small compared to
> > > the risks.
> > >
> > > In some configurations it may not be possible to allocate the last
> > > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > > N is an odd number there could be one chunk left over in the array that
> > > is unusable. Most users will find this irrelevant because a large disk
> > > array that is filled to the last GB will become quite slow due to long
> > > free space search and seek times--you really want to keep usage below 95%,
> > > maybe 98% at most, and that means the last GB will never be needed.
> > >
> > > Reading raid5 metadata could theoretically be faster than raid1, but that
> > > depends on a lot of variables, so you can't assume it as a rule of thumb.
> > >
> > > Raid6 metadata is more interesting because it's the only currently
> > > supported way to get 2-disk failure tolerance in btrfs. Unfortunately
> > > that benefit is rather limited due to the write hole bug.
> > >
> > > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > > or 4 mirror copies instead of just 2). This would be much better for
> > > metadata than raid6--more flexible, more robust, and my guess is that
> > > it will be faster as well (no need for RMW updates or journal seeks).
> > >
> > > > -------------------------------------------------------------------------------------------------
> > > > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
> > > >
>
>
> -------------------------------------------------------------------------------------------------
> FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-08-13 21:56 erenthetitan
@ 2018-08-14  4:09 ` Zygo Blaxell
  2018-08-14  7:32 ` Menion
  1 sibling, 0 replies; 21+ messages in thread
From: Zygo Blaxell @ 2018-08-14  4:09 UTC (permalink / raw)
  To: erenthetitan; +Cc: menion, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5356 bytes --]

On Mon, Aug 13, 2018 at 11:56:05PM +0200, erenthetitan@mail.de wrote:
> Running time of 55:06:35 indicates that the counter is right, it is
> not enough time to scrub the entire array using hdd.
> 
> 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub
> start /dev/sdx1" only scrubs the selected partition,
> whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array.
> 
> Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics
> and post the output.
> For live statistics, use "sudo watch -n 1".
> 
> By the way:
> 0 errors despite multiple unclean shutdowns? I assumed that the write
> hole would corrupt parity the first time around, was i wrong?

You won't see the write hole from just a power failure.  You need a
power failure *and* a disk failure, and writes need to be happening at
the moment power fails.

Write hole breaks parity.  Scrub silently(!) fixes parity.  Scrub reads
the parity block and compares it to the computed parity, and if it's
wrong, scrub writes the computed parity back.  Normal RAID5 reads with
all disks online read only the data blocks, so they won't read the parity
block and won't detect wrong parity.

I did a couple of order-of-magnitude estimations of how likely a power
failure is to trash a btrfs RAID system and got a probability between 3%
and 30% per power failure if there were writes active at the time, and
a disk failed to join the array after boot.  That was based on 5 disks
having 31 writes queued with one of the disks being significantly slower
than the others (as failing disks often are) with continuous write load.

If you have a power failure on an array that isn't writing anything at
the time, nothing happens.

> 
> Am 13-Aug-2018 09:20:36 +0200 schrieb menion@gmail.com: 
> > Hi
> > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> > there are contradicting opinions by the, well, "several" ways to check
> > the used space on a BTRFS RAID5 array, but I should be aroud 8TB of
> > data.
> > This array is running on kernel 4.17.3 and it definitely experienced
> > power loss while data was being written.
> > I can say that it wen through at least a dozen of unclear shutdown
> > So following this thread I started my first scrub on the array. and
> > this is the outcome (after having resumed it 4 times, two after a
> > power loss...):
> > 
> > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> > total bytes scrubbed: 2.59TiB with 0 errors
> > 
> > So, there are 0 errors, but I don't understand why it says 2.59TiB of
> > scrubbed data. Is it possible that also this values is crap, as the
> > non zero counters for RAID5 array?
> > Il giorno sab 11 ago 2018 alle ore 17:29 Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> ha scritto:
> > >
> > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erenthetitan@mail.de wrote:
> > > > I guess that covers most topics, two last questions:
> > > >
> > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> > >
> > > Not really. It changes the probability distribution (you get an extra
> > > chance to recover using a parity block in some cases), but there are
> > > still cases where data gets lost that didn't need to be.
> > >
> > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> > >
> > > There may be benefits of raid5 metadata, but they are small compared to
> > > the risks.
> > >
> > > In some configurations it may not be possible to allocate the last
> > > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > > N is an odd number there could be one chunk left over in the array that
> > > is unusable. Most users will find this irrelevant because a large disk
> > > array that is filled to the last GB will become quite slow due to long
> > > free space search and seek times--you really want to keep usage below 95%,
> > > maybe 98% at most, and that means the last GB will never be needed.
> > >
> > > Reading raid5 metadata could theoretically be faster than raid1, but that
> > > depends on a lot of variables, so you can't assume it as a rule of thumb.
> > >
> > > Raid6 metadata is more interesting because it's the only currently
> > > supported way to get 2-disk failure tolerance in btrfs. Unfortunately
> > > that benefit is rather limited due to the write hole bug.
> > >
> > > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > > or 4 mirror copies instead of just 2). This would be much better for
> > > metadata than raid6--more flexible, more robust, and my guess is that
> > > it will be faster as well (no need for RMW updates or journal seeks).
> > >
> > > > -------------------------------------------------------------------------------------------------
> > > > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
> > > >
> 
> 
> -------------------------------------------------------------------------------------------------
> FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-08-13  7:20       ` Menion
@ 2018-08-14  3:49         ` Zygo Blaxell
  0 siblings, 0 replies; 21+ messages in thread
From: Zygo Blaxell @ 2018-08-14  3:49 UTC (permalink / raw)
  To: Menion; +Cc: erenthetitan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3907 bytes --]

On Mon, Aug 13, 2018 at 09:20:22AM +0200, Menion wrote:
> Hi
> I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> there are contradicting opinions by the, well, "several" ways to check
> the used space on a BTRFS RAID5 array, but I should be aroud 8TB of
> data.
> This array is running on kernel 4.17.3 and it definitely experienced
> power loss while data was being written.
> I can say that it wen through at least a dozen of unclear shutdown
> So following this thread I started my first scrub on the array. and
> this is the outcome (after having resumed it 4 times, two after a
> power loss...):
> 
> menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
>         scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
>         total bytes scrubbed: 2.59TiB with 0 errors
> 
> So, there are 0 errors, but I don't understand why it says 2.59TiB of
> scrubbed data. Is it possible that also this values is crap, as the
> non zero counters for RAID5 array?

I just tested a quick scrub with injected errors on 4.18.0 and it looks
like the garbage values are finally fixed (yay!).
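For anyone wanting to repeat that kind of test, one common way to inject
errors (not necessarily what was done here, and all names below are
placeholders for a throwaway test filesystem) is to overwrite a region of
one member device well past the superblocks and then scrub:

    # DANGER: destroys data on the chosen device -- scratch filesystems only
    sudo umount /mnt/scratch
    sudo dd if=/dev/urandom of=/dev/sdX bs=1M seek=2048 count=64 conv=fsync
    sudo mount /dev/sdY /mnt/scratch     # any member device of the test array
    sudo btrfs scrub start -Bd /mnt/scratch
    # if the overwritten region held allocated data, scrub should report
    # (and, with raid5, repair) checksum errors on /dev/sdX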

I never saw invalid values for 'total bytes' from raid5; however, scrub
has (had?) trouble resuming, especially if the system was rebooted between
cancel and resume, but sometimes simply because the scrub had been
suspended too long (maybe if there are changes to the chunk tree...?).

55 hours for 2600 GB is just under 50GB per hour, which doesn't sound
too unreasonable for btrfs, though it is known to be a bit slow compared
to other raid5 implementations.
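As a quick sanity check on that figure: 2.59 TiB is roughly 2650 GiB, and
55:06:35 is about 55.1 hours, so the average rate was about 2650 / 55.1 ≈
48 GiB per hour, or roughly 13.7 MiB/s of scrubbed data.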

> Il giorno sab 11 ago 2018 alle ore 17:29 Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> ha scritto:
> >
> > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erenthetitan@mail.de wrote:
> > > I guess that covers most topics, two last questions:
> > >
> > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> >
> > Not really.  It changes the probability distribution (you get an extra
> > chance to recover using a parity block in some cases), but there are
> > still cases where data gets lost that didn't need to be.
> >
> > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> >
> > There may be benefits of raid5 metadata, but they are small compared to
> > the risks.
> >
> > In some configurations it may not be possible to allocate the last
> > gigabyte of space.  raid1 will allocate 1GB chunks from 2 disks at a
> > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > N is an odd number there could be one chunk left over in the array that
> > is unusable.  Most users will find this irrelevant because a large disk
> > array that is filled to the last GB will become quite slow due to long
> > free space search and seek times--you really want to keep usage below 95%,
> > maybe 98% at most, and that means the last GB will never be needed.
> >
> > Reading raid5 metadata could theoretically be faster than raid1, but that
> > depends on a lot of variables, so you can't assume it as a rule of thumb.
> >
> > Raid6 metadata is more interesting because it's the only currently
> > supported way to get 2-disk failure tolerance in btrfs.  Unfortunately
> > that benefit is rather limited due to the write hole bug.
> >
> > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > or 4 mirror copies instead of just 2).  This would be much better for
> > metadata than raid6--more flexible, more robust, and my guess is that
> > it will be faster as well (no need for RMW updates or journal seeks).
> >
> > > -------------------------------------------------------------------------------------------------
> > > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
> > >
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
@ 2018-08-13 21:56 erenthetitan
  2018-08-14  4:09 ` Zygo Blaxell
  2018-08-14  7:32 ` Menion
  0 siblings, 2 replies; 21+ messages in thread
From: erenthetitan @ 2018-08-13 21:56 UTC (permalink / raw)
  To: menion, ce3g8jdj; +Cc: linux-btrfs

A running time of 55:06:35 indicates that the counter is right; that is not enough time to scrub the entire array on HDDs.

2TiB might be right if you only scrubbed one disk: "sudo btrfs scrub start /dev/sdx1" only scrubs the selected partition,
whereas "sudo btrfs scrub start /media/storage/das1" scrubs the whole array.

Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics and post the output.
For live statistics, use "sudo watch -n 1".
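A full invocation would presumably look something like this, using the
mount point from earlier in the thread:

    sudo btrfs scrub status -d /media/storage/das1              # per-device totals
    sudo watch -n 1 btrfs scrub status -d /media/storage/das1   # refresh every second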

By the way:
0 errors despite multiple unclean shutdowns? I assumed that the write hole would corrupt parity the first time around, was I wrong?

On 13-Aug-2018 at 09:20:36 +0200, menion@gmail.com wrote:
> Hi
> I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :),
> there are contradicting opinions by the, well, "several" ways to check
> the used space on a BTRFS RAID5 array, but I should be aroud 8TB of
> data.
> This array is running on kernel 4.17.3 and it definitely experienced
> power loss while data was being written.
> I can say that it wen through at least a dozen of unclear shutdown
> So following this thread I started my first scrub on the array. and
> this is the outcome (after having resumed it 4 times, two after a
> power loss...):
> 
> menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
> scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
> total bytes scrubbed: 2.59TiB with 0 errors
> 
> So, there are 0 errors, but I don't understand why it says 2.59TiB of
> scrubbed data. Is it possible that also this values is crap, as the
> non zero counters for RAID5 array?
> Il giorno sab 11 ago 2018 alle ore 17:29 Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> ha scritto:
> >
> > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erenthetitan@mail.de wrote:
> > > I guess that covers most topics, two last questions:
> > >
> > > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
> >
> > Not really. It changes the probability distribution (you get an extra
> > chance to recover using a parity block in some cases), but there are
> > still cases where data gets lost that didn't need to be.
> >
> > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
> >
> > There may be benefits of raid5 metadata, but they are small compared to
> > the risks.
> >
> > In some configurations it may not be possible to allocate the last
> > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a
> > time while raid5 will allocate 1GB chunks from N disks at a time, and if
> > N is an odd number there could be one chunk left over in the array that
> > is unusable. Most users will find this irrelevant because a large disk
> > array that is filled to the last GB will become quite slow due to long
> > free space search and seek times--you really want to keep usage below 95%,
> > maybe 98% at most, and that means the last GB will never be needed.
> >
> > Reading raid5 metadata could theoretically be faster than raid1, but that
> > depends on a lot of variables, so you can't assume it as a rule of thumb.
> >
> > Raid6 metadata is more interesting because it's the only currently
> > supported way to get 2-disk failure tolerance in btrfs. Unfortunately
> > that benefit is rather limited due to the write hole bug.
> >
> > There are patches floating around that implement multi-disk raid1 (i.e. 3
> > or 4 mirror copies instead of just 2). This would be much better for
> > metadata than raid6--more flexible, more robust, and my guess is that
> > it will be faster as well (no need for RMW updates or journal seeks).
> >
> > > -------------------------------------------------------------------------------------------------
> > > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
> > >


-------------------------------------------------------------------------------------------------
FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-08-11 15:25     ` Zygo Blaxell
@ 2018-08-13  7:20       ` Menion
  2018-08-14  3:49         ` Zygo Blaxell
  0 siblings, 1 reply; 21+ messages in thread
From: Menion @ 2018-08-13  7:20 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: erenthetitan, linux-btrfs

Hi
I have a BTRFS RAID5 array built on 5x8TB HDDs and filled with, well :),
quite a lot of data; there are contradicting figures from the, well,
"several" ways to check the used space on a BTRFS RAID5 array, but I
should be at around 8TB of data.
This array is running on kernel 4.17.3 and it definitely experienced
power loss while data was being written.
I can say that it went through at least a dozen unclean shutdowns.
So, following this thread, I started my first scrub on the array, and
this is the outcome (after having resumed it 4 times, two of them after
a power loss...):

menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
        scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
        total bytes scrubbed: 2.59TiB with 0 errors

So, there are 0 errors, but I don't understand why it says 2.59TiB of
scrubbed data. Is it possible that this value is also garbage, like the
non-zero counters for RAID5 arrays?
On Sat, 11 Aug 2018 at 17:29, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Sat, Aug 11, 2018 at 08:27:04AM +0200, erenthetitan@mail.de wrote:
> > I guess that covers most topics, two last questions:
> >
> > Will the write hole behave differently on Raid 6 compared to Raid 5 ?
>
> Not really.  It changes the probability distribution (you get an extra
> chance to recover using a parity block in some cases), but there are
> still cases where data gets lost that didn't need to be.
>
> > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ?
>
> There may be benefits of raid5 metadata, but they are small compared to
> the risks.
>
> In some configurations it may not be possible to allocate the last
> gigabyte of space.  raid1 will allocate 1GB chunks from 2 disks at a
> time while raid5 will allocate 1GB chunks from N disks at a time, and if
> N is an odd number there could be one chunk left over in the array that
> is unusable.  Most users will find this irrelevant because a large disk
> array that is filled to the last GB will become quite slow due to long
> free space search and seek times--you really want to keep usage below 95%,
> maybe 98% at most, and that means the last GB will never be needed.
>
> Reading raid5 metadata could theoretically be faster than raid1, but that
> depends on a lot of variables, so you can't assume it as a rule of thumb.
>
> Raid6 metadata is more interesting because it's the only currently
> supported way to get 2-disk failure tolerance in btrfs.  Unfortunately
> that benefit is rather limited due to the write hole bug.
>
> There are patches floating around that implement multi-disk raid1 (i.e. 3
> or 4 mirror copies instead of just 2).  This would be much better for
> metadata than raid6--more flexible, more robust, and my guess is that
> it will be faster as well (no need for RMW updates or journal seeks).
>
> > -------------------------------------------------------------------------------------------------
> > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-08-11  6:27   ` erenthetitan
@ 2018-08-11 15:25     ` Zygo Blaxell
  2018-08-13  7:20       ` Menion
  0 siblings, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2018-08-11 15:25 UTC (permalink / raw)
  To: erenthetitan; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1968 bytes --]

On Sat, Aug 11, 2018 at 08:27:04AM +0200, erenthetitan@mail.de wrote:
> I guess that covers most topics, two last questions:
> 
> Will the write hole behave differently on Raid 6 compared to Raid 5 ?

Not really.  It changes the probability distribution (you get an extra
chance to recover using a parity block in some cases), but there are
still cases where data gets lost that didn't need to be.

> Is there any benefit of running Raid 5 Metadata compared to Raid 1 ? 

There may be benefits of raid5 metadata, but they are small compared to
the risks.

In some configurations it may not be possible to allocate the last
gigabyte of space.  raid1 will allocate 1GB chunks from 2 disks at a
time while raid5 will allocate 1GB chunks from N disks at a time, and if
N is an odd number there could be one chunk left over in the array that
is unusable.  Most users will find this irrelevant because a large disk
array that is filled to the last GB will become quite slow due to long
free space search and seek times--you really want to keep usage below 95%,
maybe 98% at most, and that means the last GB will never be needed.

Reading raid5 metadata could theoretically be faster than raid1, but that
depends on a lot of variables, so you can't assume it as a rule of thumb.

Raid6 metadata is more interesting because it's the only currently
supported way to get 2-disk failure tolerance in btrfs.  Unfortunately
that benefit is rather limited due to the write hole bug.

There are patches floating around that implement multi-disk raid1 (i.e. 3
or 4 mirror copies instead of just 2).  This would be much better for
metadata than raid6--more flexible, more robust, and my guess is that
it will be faster as well (no need for RMW updates or journal seeks).

> -------------------------------------------------------------------------------------------------
> FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-08-11  5:49 ` Zygo Blaxell
@ 2018-08-11  6:27   ` erenthetitan
  2018-08-11 15:25     ` Zygo Blaxell
  0 siblings, 1 reply; 21+ messages in thread
From: erenthetitan @ 2018-08-11  6:27 UTC (permalink / raw)
  To: ce3g8jdj; +Cc: linux-btrfs

I guess that covers most topics; two last questions:

Will the write hole behave differently on Raid 6 compared to Raid 5?
Is there any benefit of running Raid 5 metadata compared to Raid 1?
-------------------------------------------------------------------------------------------------
FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-08-11  2:18 erenthetitan
@ 2018-08-11  5:49 ` Zygo Blaxell
  2018-08-11  6:27   ` erenthetitan
  0 siblings, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2018-08-11  5:49 UTC (permalink / raw)
  To: erenthetitan; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 13176 bytes --]

On Sat, Aug 11, 2018 at 04:18:35AM +0200, erenthetitan@mail.de wrote:
> Write hole:
> 
> 
> > The data will be readable until one of the data blocks becomes
> > inaccessible (bad sector or failed disk). This is because it is only the
> > parity block that is corrupted (old data blocks are still not modified
> > due to btrfs CoW), and the parity block is only required when recovering
> > from a disk failure.
> 
> I am unsure about your meaning. 
> Assuming you perform an unclean shutdown (e.g. crash), and after restart
> perform a scrub, with no additional error (bad sector, bit-rot) before
> or after the crash:
> will you lose data?

No, the parity blocks will be ignored and RAID5 will act like slow RAID0
if no other errors occur.

> Will you be able to mount the filesystem like normal? 

Yes.

> Additionally, will the crash create additional errors like bad
> sectors and/or bit-rot aside from the parity-block corruption?

No, only parity-block corruptions should occur.

> It's actually part of my first mail, where the btrfs Raid5/6 page
> assumes no data damage while the spinics comment implies the opposite.

The above assumes no drive failures or data corruption; however, if this
were the case, you could use RAID0 instead of RAID5.

The only reason to use RAID5 is to handle cases where at least one block
(or an entire disk) fails, so the behavior of RAID5 when all disks are
working is almost irrelevant.

A drive failure could occur at any time, so even if you mount successfully,
if a disk fails immediately after, any stripes affected by write hole will
be unrecoverably corrupted.

> The write hole does not seem as dangerous if you could simply scrub
> to repair damage (on smaller disks, that is, where scrub doesn't take
> enough time for additional errors to occur)

Scrub can repair parity damage on normal data and metadata--it recomputes
parity from data if the data passes a CRC check.

No repair is possible for data in nodatasum files--the parity can be
recomputed, but there is no way to determine if the result is correct.

Metadata is always checksummed and transid verified; alas, there isn't
an easy way to get btrfs to perform an urgent scrub on metadata only.

> > Put another way: if all disks are online then RAID5/6 behaves like a slow
> > RAID0, and RAID0 does not have the partial stripe update problem because
> > all of the data blocks in RAID0 are independent. It is only when a disk
> > fails in RAID5/6 that the parity block is combined with data blocks, so
> > it is only in this case that the write hole bug can result in lost data.
> 
> So data will not be lost if no drive has failed?

Correct, but the array will have reduced failure tolerance, and RAID5
only matters when a drive has failed.  It is effectively operating in
degraded mode on parts of the array affected by write hole, and no single
disk failure can be tolerated there.

It is possible to recover the parity by performing an immediate scrub
after reboot, but this cannot be as effective as a proper RAID5 update
journal which avoids making the parity bad in the first place.
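
A minimal sketch of that immediate-scrub workflow, assuming the array
is mounted at /mnt/array (placeholder):

    # run the scrub in the foreground and wait for it to finish
    btrfs scrub start -B /mnt/array

    # or, for a background scrub, check progress and the per-device
    # error counters afterwards
    btrfs scrub status /mnt/array
    btrfs dev stats /mnt/array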

> > > > If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable
> > > > to the write hole, but data is. In this configuration you can determine
> > > > with high confidence which files you need to restore from backup, and
> > > > the filesystem will remain writable to replace the restored data, because
> > > > raid1 does not have the write hole bug.
> 
> In regards to my earlier questions, what would change if I do -draid5 -mraid1?

Metadata would be using RAID1 which is not subject to the RAID5 write
hole issue.  It is much more tolerant of unclean shutdowns especially
in degraded mode.

Data in RAID5 may be damaged when the array is in degraded mode and
a write hole occurs (in either order as long as both occur).  Due to
RAID1 metadata, the filesystem will continue to operate properly,
allowing the damaged data to be overwritten or deleted.
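
For reference, -draid5 -mraid1 is just the usual profile options at
mkfs time, or a profile conversion on an existing filesystem.  Device
names and mount point below are placeholders:

    mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd

    # or convert an existing filesystem in place:
    btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt/array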

> Lost Writes:
> 
> > Hotplugging causes an effect (lost writes) which can behave similarly
> > to the write hole bug in some instances. The similarity ends there.
> 
> Are we speaking about the same problem that is causing transid mismatch? 

Transid mismatch is usually caused by lost writes, by any mechanism
that prevents a write from being completed after the disk reports that
it was completed.

Drives may report that data is "in stable storage", i.e. the drive
believes it can complete the write in the future even if power is lost
now because the drive or controller has capacitors or NVRAM or similar.
If the drive is reset by the SATA host because of a cable disconnect
event, the drive may forget that it has promised to do writes in the
future.  Drives may simply lie, and claim that data has been written to
disk when the data is actually in volatile RAM and will disappear in a
power failure.

btrfs uses a transaction mechanism and CoW metadata to handle lost writes
within an interrupted transaction.  Incomplete data is simply discarded on
next mount.  A transid mismatch is caused by a write that was lost _after_
the transaction commit write is reported completed by the disk firmware.

Transid mismatch is a serious problem as it means that disks or other
lower layers of the storage stack are injecting errors and violating
the data integrity requirements of the filesystem.  It is worse than
a csum error as csum errors are usually caused by random media faults
(with low correlated failure probability) while transid mismatches are
usually caused by firmware or controller problems (with high correlated
failure probability).  If you have multiple identical disks and the
transid mismatches are caused by a firmware bug, then each identical disk
may inject an identical error causing unrecoverable filesystem errors.

> > They are really two distinct categories of problem. Temporary connection
> > loss can do bad things to all RAID profiles on btrfs (not just RAID5/6)
> > and the btrfs requirements for handling connection loss and write holes
> > are very different.
> 
> What kind of bad things? Will scrub (1/10, 5/6) detect and repair it?

Scrub does not handle all the cases.  Scrub relies on CRC to detect data
errors, which causes two problems:  scrub cannot handle nodatasum files
because those have no CRC, and CRC32 has a non-zero false acceptance rate.

Lost writes should be fixed by performing a replace operation on the
disconnected disk (i.e. it should be treated like the drive failed and
was replaced with a new blank disk).  The replace operation informs btrfs
which version of a nodatasum file it can consider to be correct (i.e. the
one stored on disks that did not disconnect).  Scrub can only detect that
two copies of a nodatasum file are different, it has no way to choose one.

Mature RAID implementations like mdadm have optimizations for this
case that rebuild only areas of the disk that were modified just before
disconnection.  btrfs has no such optimization so it can only replace
the entire disk at once.
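
A sketch of that recovery path, assuming the disk that dropped out had
devid 3 and /dev/sde is the blank replacement (both placeholders):

    btrfs replace start 3 /dev/sde /mnt/array
    btrfs replace status /mnt/array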

> > > > Hot-unplugging a device can cause many lost write events at once, and
> > > > each lost write event is very bad.
> 
> > Transid mismatch is btrfs detecting data
> > that was previously silently corrupted by some component outside of btrfs.
> > 
> > btrfs can't prevent disks from silently corrupting data. It can only
> > try to detect and repair the damage after the damage has occurred.
> 
> Aside from the chance that all copies of data are corrupted, is there any way scrubbing could fail?

There is a small chance of undetected errors due to the limitations
of CRC32.
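
Back-of-envelope arithmetic behind the "16 TB" figure mentioned
elsewhere in this thread, assuming 4 KiB blocks and roughly a 2^-32
chance that a random corruption still passes CRC32 (bash):

    echo "$(( (2**32 * 4096) / 2**40 )) TiB of corrupted blocks per undetected error"
    # prints: 16 TiB of corrupted blocks per undetected error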

> > Normally RAID1/5/6/10 or DUP profiles are used for btrfs metadata, so any
> > transid mismatches can be recovered by reading up-to-date data from the
> > other mirror copy of the metadata, or by reconstructing the data with
> > parity blocks in the RAID 5/6 case. It is only after this recovery
> > mechanism fails (i.e. too many disks have a failure or corruption at
> > the same time on the same sectors) that the filesystem is ended.
> 
> Does this mean that transid mismatch is harmless unless both copies
> are hit at once (And in case of Raid 6 all three)?

It's not entirely harmless because it is a form of data corruption error.
A disk failure could occur before the corrupted data is recovered,
and should that occur it would be a multiple failure that breaks the
filesystem.

If you find that a disk in your array produces multiple transid
failures, you should treat it like any other failing disk, and replace
it immediately to avoid risk of future data loss.
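
A rough way to keep an eye on this (mount point is a placeholder);
generation_errs is the counter that transid mismatches increment:

    # show only devices with non-zero error counters
    btrfs dev stats /mnt/array | grep -v ' 0$'

    # and look for the corresponding kernel messages
    dmesg | grep -i 'parent transid verify failed'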

> Old hardware:
> 
> > > > It's fun and/or scary to put known good and bad hardware in the same
> > > > RAID1 array and watch btrfs autocorrecting the bad data after every
> > > > other power failure; however, the bad hardware is clearly not sufficient
> > > > to implement any sort of reliable data persistence, and arrays with bad
> > > > hardware in them will eventually fail.
> 
> I am having a hard time wrapping my head around this statement.
> If Btrfs can repair corrupted data and Raid 6 allows two disk failures
> at once without data loss, is using old disks even with a high average
> error count not still pretty much safe?
> You would simply have to repeat the scrubbing process more often to
> make sure that not enough data is corrupted to break redundancy.

Old disks and disks that have bad firmware behave differently.  An old
disk fails in multiple random and unpredictable ways, while a disk with
bad firmware always fails the same way every time (until it eventually
becomes an old disk, and starts failing in random ways as well).

This kind of array has an elevated probability of failure and is not
"safe".  A small error rate is enough to make the probability of
concurrent failure so high that you would not be able to scrub often
enough even if you were running scrubs continuously.

Such devices should only be used for development and testing of failure
recovery algorithms.

> > > > I have one test case where I write millions of errors into a raid5/6 and
> > > > the filesystem recovers every single one transparently while verifying
> > > > SHA1 hashes of test data. After years of rebuilding busted ext3 on
> > > > mdadm-raid5 filesystems, watching btrfs do it all automatically is
> > > > just...beautiful.
> 
> Once again, if Btrfs is THIS good at repairing data, then is old
> hardware,

I make arrays that combine hardware of different ages.  The failure rates
increase as hardware ages, so you don't want an array of all old drives.
e.g. in a 5-disk array replace the oldest disk every year.

Once disks have been in service for many years, they become very fragile
and can be broken by simply turning them sideways.  You can have at
most one such disk in a RAID5 array (or two in RAID6, but that seems
unnecessarily risky).

> hotplugging and maybe even (depending on whether I understood
> your point) write hole really dangerous? Are there bugs that could
> destroy the data or filesystem without corrupting all copies of data
> (Or all copies at once)? 

There are always bugs, but they get harder and harder to trigger over
time.  At some point the probability of hitting a kernel bug becomes
lower than the probability of failures due to other causes, at which
point improving the reliability of the software has no further impact
on the reliability of the overall system.

Are there still bugs?  Probably, but except for write hole they are
getting harder to hit.  Which will happen first:  you hit one of the
bugs, or you get a bad batch of drives that all fail in the same hour,
or your RAM goes bad and corrupts everything your CPU touches?

> Assuming Raid 6, corrupted data would not
> break redundancy and repeated scrubbing would fix any upcoming issue.

If a drive is allowed to continually inject small errors into the array,
it introduces a higher risk of data loss over time than if that drive is
immediately replaced with a properly functioning unit.  You can never
make the failure rate zero, but you can keep it as low as possible by
proactively eliminating misbehaving hardware.

If a drive is replaced with all disks online and reasonably healthy then
btrfs can recover data from read errors that occur during replacement.
If you wait for a drive to fail completely before replacement, the
replaced disk will be reconstructed while the array is in degraded mode,
and any other errors that occur during that process are not recoverable.

Old disks can also do nasty things, e.g. run 99% slower than normal,
or lock up the IO bus during read errors.  These events may not corrupt
any data, but an unexpected watchdog reboot or random performance issues
can still ruin your day.

> -------------------------------------------------------------------------------------------------
> FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
@ 2018-08-11  2:18 erenthetitan
  2018-08-11  5:49 ` Zygo Blaxell
  0 siblings, 1 reply; 21+ messages in thread
From: erenthetitan @ 2018-08-11  2:18 UTC (permalink / raw)
  To: ce3g8jdj, linux-btrfs

Write hole:


> The data will be readable until one of the data blocks becomes
> inaccessible (bad sector or failed disk). This is because it is only the
> parity block that is corrupted (old data blocks are still not modified
> due to btrfs CoW), and the parity block is only required when recovering
> from a disk failure.

I am unsure about your meaning.
Assuming you perform an unclean shutdown (e.g. crash), and after restart perform a scrub, with no additional error (bad sector, bit-rot) before or after the crash:
will you lose data? Will you be able to mount the filesystem like normal? Additionally, will the crash create additional errors like bad sectors and/or bit-rot aside from the parity-block corruption?
It's actually part of my first mail, where the btrfs Raid5/6 page assumes no data damage while the spinics comment implies the opposite.
The write hole does not seem as dangerous if you could simply scrub to repair damage (on smaller disks, that is, where scrub doesn't take enough time for additional errors to occur)

> Put another way: if all disks are online then RAID5/6 behaves like a slow
> RAID0, and RAID0 does not have the partial stripe update problem because
> all of the data blocks in RAID0 are independent. It is only when a disk
> fails in RAID5/6 that the parity block is combined with data blocks, so
> it is only in this case that the write hole bug can result in lost data.

So data will not be lost if no drive has failed?

> > > If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable
> > > to the write hole, but data is. In this configuration you can determine
> > > with high confidence which files you need to restore from backup, and
> > > the filesystem will remain writable to replace the restored data, because
> > > raid1 does not have the write hole bug.

In regards to my earlier questions, what would change if I do -draid5 -mraid1?


Lost Writes:


> Hotplugging causes an effect (lost writes) which can behave similarly
> to the write hole bug in some instances. The similarity ends there.

Are we speaking about the same problem that is causing transid mismatch? 

> They are really two distinct categories of problem. Temporary connection
> loss can do bad things to all RAID profiles on btrfs (not just RAID5/6)
> and the btrfs requirements for handling connection loss and write holes
> are very different.

What kind of bad things? Will scrub (1/10, 5/6) detect and repair it?

> > > Hot-unplugging a device can cause many lost write events at once, and
> > > each lost write event is very bad.

> Transid mismatch is btrfs detecting data
> that was previously silently corrupted by some component outside of btrfs.
> 
> btrfs can't prevent disks from silently corrupting data. It can only
> try to detect and repair the damage after the damage has occurred.

Aside from the chance that all copies of data are corrupted, is there any way scrubbing could fail?

> Normally RAID1/5/6/10 or DUP profiles are used for btrfs metadata, so any
> transid mismatches can be recovered by reading up-to-date data from the
> other mirror copy of the metadata, or by reconstructing the data with
> parity blocks in the RAID 5/6 case. It is only after this recovery
> mechanism fails (i.e. too many disks have a failure or corruption at
> the same time on the same sectors) that the filesystem is ended.

Does this mean that transid mismatch is harmless unless both copies are hit at once (And in case of Raid 6 all three)?


Old hardware:


> > > It's fun and/or scary to put known good and bad hardware in the same
> > > RAID1 array and watch btrfs autocorrecting the bad data after every
> > > other power failure; however, the bad hardware is clearly not sufficient
> > > to implement any sort of reliable data persistence, and arrays with bad
> > > hardware in them will eventually fail.

I am having a hard time wrapping my head around this statement.
If Btrfs can repair corrupted data and Raid 6 allows two disk failures at once without data loss, is using old disks even with a high average error count not still pretty much safe?
You would simply have to repeat the scrubbing process more often to make sure that not enough data is corrupted to break redundancy.

> > > I have one test case where I write millions of errors into a raid5/6 and
> > > the filesystem recovers every single one transparently while verifying
> > > SHA1 hashes of test data. After years of rebuilding busted ext3 on
> > > mdadm-raid5 filesystems, watching btrfs do it all automatically is
> > > just...beautiful.

Once again, if Btrfs is THIS good at repairing data, then is old hardware, hotplugging and maybe even (depending on whether I understood your point) write hole really dangerous? Are there bugs that could destroy the data or filesystem without corrupting all copies of data (Or all copies at once)? Assuming Raid 6, corrupted data would not break redundancy and repeated scrubbing would fix any upcoming issue.
-------------------------------------------------------------------------------------------------
FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
@ 2018-08-11  0:45 Zygo Blaxell
  0 siblings, 0 replies; 21+ messages in thread
From: Zygo Blaxell @ 2018-08-11  0:45 UTC (permalink / raw)
  To: linux-btrfs; +Cc: erenthetitan

[-- Attachment #1: Type: text/plain, Size: 23547 bytes --]

On Fri, Aug 10, 2018 at 06:55:58PM +0200, erenthetitan@mail.de wrote:
> Did i get you right?
> Please correct me if i am wrong:
> 
> Scrubbing seems to have been fixed, you only have to run it once.

Yes.

There is one minor bug remaining here:  when scrub detects an error
on any disk in a raid5/6 array, the error counts are garbage (random
numbers on all the disks).  You will need to inspect btrfs dev stats
or the kernel log messages to learn which disks are injecting errors.

This does not impair the scrubbing function, only the detailed statistics
report (scrub status -d).

If there are no errors, scrub correctly reports 0 for all error counts.
Only raid5/6 is affected this way--other RAID profiles produce correct
scrub statistics.
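
A sketch of that cross-check (mount point is a placeholder):

    btrfs scrub status -d /mnt/array   # per-device counts: unreliable on raid5/6
    btrfs dev stats /mnt/array         # per-device counters: reliable
    dmesg | grep -i btrfs | grep -iE 'csum|corrupt'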

> Hotplugging (temporary connection loss) is affected by the write hole
> bug, and will create undetectable errors every 16 TB (crc32 limitation).

Hotplugging causes an effect (lost writes) which can behave similarly
to the write hole bug in some instances.  The similarity ends there.

They are really two distinct categories of problem.  Temporary connection
loss can do bad things to all RAID profiles on btrfs (not just RAID5/6)
and the btrfs requirements for handling connection loss and write holes
are very different.

> The write Hole Bug can affect both old and new data. 

Normally, only old data can be affected by the write hole bug.

The "new" data is not committed before the power failure (otherwise we
would call it "old" data), so any corrupted new data will be inaccessible
as a result of the power failure.  The filesystem will roll back to the
last complete committed data tree (discarding all new and modified data
blocks), then replay the fsync log (which repeats and completes some
writes that occurred since the last commit).  This process eliminates
new data from the filesystem whether the new data was corrupted by the
write hole or not.

Only corruptions that affect old data will remain, because old data is
not overwritten by data saved in the fsync log, and old data is not part
of the incomplete data tree that is rolled back after power failure.

Exception:  new data in nodatasum files can also be corrupted, but since
nodatasum disables all data integrity or recovery features it's hard to
define what "corrupted" means for a nodatasum file.
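
If you want to know which files fall under that nodatasum exception:
files created with chattr +C (NOCOW, which implies nodatasum) carry
the 'C' attribute, so something like the following lists them (path is
a placeholder; filenames with spaces need more care, and this will not
catch files that are nodatasum only because of mount options):

    lsattr -R /mnt/array 2>/dev/null | awk '$1 ~ /C/ {print $2}'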

> Reason: BTRFS saves data in fixed size stripes, if the write operation
> fails midway, the stripe is lost.
> This does not matter much for Raid 1/10, data always uses a full stripe,
> and stripes are copied on write. Only new data could be lost.

This is incorrect.  Btrfs saves data in variable-sized extents (between
1 and 32768 4K data blocks) and btrfs has no concept of stripes outside of
its raid layer.  Stripes are never copied.

In RAID 1/10/DUP all data blocks are fully independent of each other,
i.e. writing to any block on these RAID profiles does not corrupt data in
any other block.  As a result these RAID profiles do not allow old data
to be corrupted by partially completed writes of new data.

There is striping in some profiles, but it is only used for performance
in these cases, and has no effect on data recovery.

> However, for some reason Raid 5/6 works with partial stripes, meaning
> that data is stored in stripes not completely filled by prior data,

In RAID 5/6 each data block is related to all other data blocks in the
same stripe with the parity block(s).  If any individual data block in the
stripe is updated, the parity block(s) must also be updated atomically,
or the wrong data will be reconstructed during RAID5/6 recovery.

Because btrfs does nothing to prevent it, some writes will occur
to RAID5/6 stripes that are already partially occupied by old data.
btrfs also does nothing to ensure that parity block updates are atomic,
so btrfs has the write hole bug as a result.
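
A toy illustration of why the non-atomic parity update matters, using
one-byte "blocks" and shell arithmetic (this is only the XOR-parity
idea, not btrfs code):

    d0_old=0xAA; d1_old=0xBB
    parity=$(( d0_old ^ d1_old ))   # parity of the committed stripe

    d1_new=0xCC                     # new block written into the stripe...
    # ...but power is lost before the parity block is rewritten.

    # Later the disk holding d0 fails; raid5 rebuilds d0 from d1 + parity:
    printf 'rebuilt d0 = 0x%X, expected 0x%X\n' \
        $(( d1_new ^ parity )) $(( d0_old ))
    # rebuilt d0 = 0xDD, expected 0xAA -- the old, untouched data is lost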

> and stripes are removed on write.

Stripes are never removed...?  A stripe is just a group of disk blocks
divided on 64K boundaries, same as mdadm and many hardware RAID5/6
implementations.

> Result: If the operation fails midway, the stripe is lost as is all
> data previously stored in it.

You can only lose as many data blocks in each stripe as there are parity
disks (i.e. raid5 can lose 0 or 1 block, while raid6 can lose 0, 1, or 2
blocks); however, multiple writes can be lost affecting multiple stripes
in a single power loss event.  Losing even 1 block is often too much.  ;)

The data will be readable until one of the data blocks becomes
inaccessible (bad sector or failed disk).  This is because it is only the
parity block that is corrupted (old data blocks are still not modified
due to btrfs CoW), and the parity block is only required when recovering
from a disk failure.

Put another way:  if all disks are online then RAID5/6 behaves like a slow
RAID0, and RAID0 does not have the partial stripe update problem because
all of the data blocks in RAID0 are independent.  It is only when a disk
fails in RAID5/6 that the parity block is combined with data blocks, so
it is only in this case that the write hole bug can result in lost data.

> Transid Mismatch can silently corrupt data.

This is the wrong way around.  Transid mismatch is btrfs detecting data
that was previously silently corrupted by some component outside of btrfs.

btrfs can't prevent disks from silently corrupting data.  It can only
try to detect and repair the damage after the damage has occurred.

> Reason: It is a separate metadata failure that is triggered by lost or
> incomplete writes, writes that are lost somewhere during transmission.
> It can happen to all BTRFS configurations and is not triggered by the
> write hole.
> It could happen due to brownout (temporary undersupply of voltage),
> faulty cables, faulty RAM, faulty disk cache, faulty disks in general.
> 
> Both bugs could damage metadata and trigger the following:
> Data will be lost (0 to 100% unreadable), the filesystem will be readonly.
> Reason: BTRFS saves metadata as a tree structure. The closer the error
> to the root, the more data cannot be read.
> 
> Transid Mismatch can happen up to once every 3 months per device,
> depending on the drive hardware!

It can happen much more often than that on a disk that is truly failing
(as opposed to one that merely has firmware bugs).  I've had RAID1 arrays
where transid failures from one failing disk were repaired thousands of
times over a period of several hours, stopping only when the bad disk
was replaced.

> Question: Does this not make transid mismatch way more dangerous than
> the write hole? 

*Unrecoverable* transid mismatch is fatal.  A btrfs that uses 'single' or
'raid0' profiles for metadata will be unable to recover from even minor
failures.  A single bit error in metadata could end the filesystem.

Normally RAID1/5/6/10 or DUP profiles are used for btrfs metadata, so any
transid mismatches can be recovered by reading up-to-date data from the
other mirror copy of the metadata, or by reconstructing the data with
parity blocks in the RAID 5/6 case.  It is only after this recovery
mechanism fails (i.e. too many disks have a failure or corruption at
the same time on the same sectors) that the filesystem is ended.
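
To check which profiles a given filesystem is actually using for data
and metadata (mount point is a placeholder):

    btrfs filesystem df /mnt/array
    # look for the "Data, ..." and "Metadata, ..." lines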

This is the same as any other RAID implementation:  if there are failures
on too many disks, data will be lost.

> What would happen to other filesystems, like ext4?

In the best case ext4 silently corrupts data.  In the worst cases (if
all the ext2/3 legacy features are turned off, so there are no fixed
locations on disk for block group and extent structures), the filesystem
can be severely damaged, possibly beyond the ability of tools to usefully
recover.  "Recovery" by e2fsck may remove bad metadata until there is no
data left on the filesystem, or the entire filesystem becomes nameless
lost+found soup.

ext2 and ext3 and some configurations of ext4 are more resilient to
lost writes because metadata is always overwritten in the same place,
metadata changes slowly over time, and minor inconsistencies in metadata
can often be ignored in practice.  This means that data integrity on
these filesystems relies more on luck than anything else.

btrfs is nothing like that:  metadata is (almost) never written to the
same location on disk twice, all metadata pages have transid stamps and
checksums to detect errors in the disk layer, and btrfs verifies metadata
and refuses to process data that it does not deem to be entirely correct.

> On 10-Aug-2018 09:12:21 +0200, ce3g8jdj@umail.furryterror.org wrote:
> > On Fri, Aug 10, 2018 at 03:40:23AM +0200, erenthetitan@mail.de wrote:
> > > I am searching for more information regarding possible bugs related to
> > > BTRFS Raid 5/6. All sites i could find are incomplete and information
> > > contradicts itself:
> > >
> > > The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56)
> > > warns of the write hole bug, stating that your data remains safe
> > > (except data written during power loss, obviously) upon unclean shutdown
> > > unless your data gets corrupted by further issues like bit-rot, drive
> > > failure etc.
> > 
> > The raid5/6 write hole bug exists on btrfs (as of 4.14-4.16) and there are
> > no mitigations to prevent or avoid it in mainline kernels.
> > 
> > The write hole results from allowing a mixture of old (committed) and
> > new (uncommitted) writes to the same RAID5/6 stripe (i.e. a group of
> > blocks consisting of one related data or parity block from each disk
> > in the array, such that writes to any of the data blocks affect the
> > correctness of the parity block and vice versa). If the writes were
> > not completed and one or more of the data blocks are not online, the
> > data blocks reconstructed by the raid5/6 algorithm will be corrupt.
> > 
> > If all disks are online, the write hole does not immediately
> > damage user-visible data as the old data blocks can still be read
> > directly; however, should a drive failure occur later, old data may
> > not be recoverable because the parity block will not be correct for
> > reconstructing the missing data block. A scrub can fix write hole
> > errors if all disks are online, and a scrub should be performed after
> > any unclean shutdown to recompute parity data.
> > 
> > The write hole always puts both old and new data at risk of damage;
> > however, due to btrfs's copy-on-write behavior, only the old damaged
> > data can be observed after power loss. The damaged new data will have
> > no references to it written to the disk due to the power failure, so
> > there is no way to observe the new damaged data using the filesystem.
> > Not every interrupted write causes damage to old data, but some will.
> > 
> > Two possible mitigations for the write hole are:
> > 
> > - modify the btrfs allocator to prevent writes to partially filled
> > raid5/6 stripes (similar to what the ssd mount option does, except
> > with the correct parameters to match RAID5/6 stripe boundaries),
> > and advise users to run btrfs balance much more often to reclaim
> > free space in partially occupied raid stripes
> > 
> > - add a stripe write journal to the raid5/6 layer (either in
> > btrfs itself, or in a lower RAID5 layer).
> > 
> > There are assorted other ideas (e.g. copy the RAID-Z approach from zfs
> > to btrfs or dramatically increase the btrfs block size) that also solve
> > the write hole problem but are somewhat more invasive and less practical
> > for btrfs.
> > 
> > Note that the write hole also affects btrfs on top of other similar
> > raid5/6 implementations (e.g. mdadm raid5 without stripe journal).
> > The btrfs CoW layer does not understand how to allocate data to avoid RMW
> > raid5 stripe updates without corrupting existing committed data, and this
> > limitation applies to every combination of unjournalled raid5/6 and btrfs.
> > 
> > > The Wiki Gotchas Page (https://btrfs.wiki.kernel.org/index.php/Gotchas)
> > > warns of possible incorrigible "transid" mismatch, not stating which
> > > versions are affected or what transid mismatch means for your data. It
> > > does not mention the write hole at all.
> > 
> > Neither raid5 nor write hole are required to produce a transid mismatch
> > failure. transid mismatch usually occurs due to a lost write. Write hole
> > is a specific case of lost write, but write hole does not usually produce
> > transid failures (it produces header or csum failures instead).
> > 
> > During real disk failure events, multiple distinct failure modes can
> > occur concurrently. i.e. both transid failure and write hole can occur
> > at different places in the same filesystem as a result of attempting to
> > use a failing disk over a long period of time.
> > 
> > A transid verify failure is metadata damage. It will make the filesystem
> > readonly and make some data inaccessible as described below.
> > 
> > > This Mail Archive
> > > (https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg55161.html)
> > > states that scrubbing BTRFS Raid 5/6 will always repair Data Corruption,
> > > but may corrupt your Metadata while trying to do so - meaning you have
> > > to scrub twice in a row to ensure data integrity.
> > 
> > Simple corruption (without write hole errors) is fixed by scrubbing
> > as of the last...at least six months? Kernel v4.14.xx and later can
> > definitely do it these days. Both data and metadata.
> > 
> > If the metadata is damaged in any way (corruption, write hole, or transid
> > verify failure) on btrfs and btrfs cannot use the raid profile for
> > metadata to recover the damaged data, the filesystem is usually forever
> > readonly, and anywhere from 0 to 100% of the filesystem may be readable
> > depending on where in the metadata tree structure the error occurs (the
> > closer to the root, the more data is lost). This is the same for dup,
> > raid1, raid5, raid6, and raid10 profiles. raid0 and single profiles are
> > not a good idea for metadata if you want a filesystem that can persist
> > across reboots (some use cases don't require persistence, so they can
> > use -msingle/-mraid0 btrfs as a large-scale tmpfs).
> > 
> > For all metadata raid profiles, recovery can fail due to risks including
> > RAM corruption, multiple drives having defects in the same locations,
> > or multiple drives with identically-behaving firmware bugs. For raid5/6
> > metadata there is the *additional* risk of the write hole bug preventing
> > recovery of metadata.
> > 
> > If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable
> > to the write hole, but data is. In this configuration you can determine
> > with high confidence which files you need to restore from backup, and
> > the filesystem will remain writable to replace the restored data, because
> > raid1 does not have the write hole bug.
> > 
> > More than one scrub for a single write hole event won't help (and never
> > did). If the first scrub doesn't fix all the errors then your kernel
> > probably also has a race condition bug or regression that will permanently
> > corrupt the data (this was true in 2016 when the referenced mailing
> > list post was written).
> > 
> > Current kernels don't have such bugs--if the first scrub can correct
> > the data, it does, and if the first scrub can't correct the data then
> > all future scrubs will produce identical results.
> > 
> > Older kernels (2016) had problems reconstructing data during read()
> > operations but could fix data during scrub or balance operations.
> > These bugs, as far as I am able to test, have been fixed by v4.17 and
> > backported to v4.14.
> > 
> > > The Bugzilla Entry
> > > (https://bugzilla.kernel.org/buglist.cgi?component=btrfs) contains
> > > mostly unanswered bugs, which may or may not still count (2013 - 2018).
> > 
> > I find that any open bug over three years old on b.k.o can be safely
> > ignored because it has either already been fixed or there is not enough
> > information provided to understand what is going on.
> > 
> > > This Spinics Discussion
> > > (https://www.spinics.net/lists/linux-btrfs/msg76471.html) states
> > > that the write hole can even damage old data eg. data that was not
> > > accessed during unclean shutdown, the opposite of what the Raid5/6
> > > Status Page states!
> > 
> > Correct, write hole can *only* damage old data as described above.
> > 
> > > This Spinics comment
> > > (https://www.spinics.net/lists/linux-btrfs/msg76412.html) informs that
> > > hot-plugging a device will trigger the write hole. Accessed data will
> > > therefore be corrupted. In case the earlier statement about old data
> > > corruption is true, random data could be permamently lost. This is even
> > > more dangerous if you are connecting your devices via USB, as USB can
> > > unconnect due to external influence, eg. touching the cables, shaking...
> > 
> > Hot-unplugging a device can cause many lost write events at once, and
> > each lost write event is very bad.
> > 
> > btrfs does not reject and resynchronize a device from a raid array if a
> > write to the device fails (unlike every other working RAID implementation
> > on Earth...). If the device reconnects, btrfs will read a mixture of
> > old and new data and rely on checksums to determine which blocks are
> > out of date (as opposed to treating the departed disk as entirely out
> > of date and initiating a disk replace operation when it reconnects).
> > 
> > A scrub after a momentary disconnect can reconstruct most missing data,
> > but not all. CRC32 lets one error through per 16 TB of corrupted blocks,
> > and all nodatasum/nodatacow files modified while a drive was offline
> > will be corrupted without detection or recovery by btrfs.
> > 
> > Device replace is currently the best recovery option from this kind
> > of failure. Ideally btrfs would implement something like mdadm write
> > intent bitmaps so only those block groups that were modified while
> > the device as offline would be replaced, but this is the btrfs we want
> > not the btrfs we have.
> > 
> > > Lastly, this Superuser question
> > > (https://superuser.com/questions/1325245/btrfs-transid-failure#1344494)
> > > assumes that the transid mismatch bug could toggle your system
> > > unmountable. While it might be possible to restore your data using
> > > sudo BTRFS Restore, it is still unknown how the transid mismatch is
> > > even toggled, meaning that your file system could fail at any time!
> > 
> > Note that transid failure risk applies to all btrfs configurations.
> > It is not specific to raid5/6. The write hole errors from raid5/6 will
> > typically produce a header or csum failure (from reading garbage) not a
> > transid failure (from reading an old, valid, but deleted metadata block).
> > 
> > transid mismatch is pretty simple: one of your disk drives, or some
> > caching or translation layer between btrfs and your disk drives, dropped
> > a write (or, less likely, read from or wrote to the wrong sector address).
> > btrfs detects this by embedding transids into all data structures where
> > one object points to another object in a different block.
> > 
> > transid mismatch is also hard: you then have to figure out which layer
> > of your possibly quite complicated RAID setup is doing that, and make
> > it stop. This process almost never involves btrfs. Sometimes it's the
> > bottom layer (i.e. the drives themselves) but the more layers you add,
> > the more candidates need to be eliminated before the cause can be found.
> > Sometimes it's a *power supply* (i.e. the drive controller CPU browns
> > out and forgets it was writing something or corrupts its embedded RAM).
> > Sometimes it's host RAM going bad, corrupting and breaking everything
> > it touches.
> > 
> > I have a variety of test setups and the correlation between hardware
> > model (especially drive model, but also some SATA controller models)
> > and total filesystem loss due to transid verify failure is very strong.
> > Out of 10 drive models from 5 vendors, 2 models can't keep a filesystem
> > intact for more than a few months, while the other models average 3 years
> > old and still hold the first btrfs filesystem they were formatted with.
> > 
> > Disabling drive write caching sometimes helps, but some hardware eats
> > a filesystem every few months no matter what settings I change. If the
> > problem is a broken SATA controller or cable then changing drive settings
> > won't help.
> > 
> > It's fun and/or scary to put known good and bad hardware in the same
> > RAID1 array and watch btrfs autocorrecting the bad data after every
> > other power failure; however, the bad hardware is clearly not sufficient
> > to implement any sort of reliable data persistence, and arrays with bad
> > hardware in them will eventually fail.
> > 
> > The bad drives can still contribute to society as media cache servers or
> > point-of-sale terminals where the only response to any data integrity
> > issue is a full reformat and image reinstall. This seems to be the
> > target market that low-end consumer drives are aiming for, as they seem
> > to be useless for anything else.
> > 
> > Adopt a zero-tolerance policy for drive resets after the array is
> > mounted and active. A drive reset means a potential lost write leading
> > to a transid verify failure. Swap out both drive and SATA cable the
> > first time a reset occurs during a read or write operation, and consider
> > swapping out SATA controller, changing drive model, and upgrading power
> > supply if it happens twice.
> > 
> > > Do you know of any comprehensive and complete Bug list?
> > 
> > ...related to raid5/6:
> > 
> > - no write hole mitigation (at least two viable strategies
> > available)
> > 
> > - no device bouncing mitigation (mdadm had this working 20
> > years ago)
> > 
> > - probably slower than it could be
> > 
> > - no recovery strategy other than raid (btrfs check --repair is
> > useless on non-trivial filesytems, and a single-bit uncorrected
> > metadata error makes the filesystem unusable)
> > 
> > > Do you know more about the stated Bugs?
> > >
> > > Do you know further Bugs that are not addressed in any of these sites?
> > 
> > My testing on raid5/6 filesystems is producing pretty favorable results
> > these days. There do not seem to be many bugs left.
> > 
> > I have one test case where I write millions of errors into a raid5/6 and
> > the filesystem recovers every single one transparently while verifying
> > SHA1 hashes of test data. After years of rebuilding busted ext3 on
> > mdadm-raid5 filesystems, watching btrfs do it all automatically is
> > just...beautiful.
> > 
> > I think once the write hole and device bouncing mitigations are in place,
> > I'll start looking at migrating -draid1/-mraid1 setups to -draid5/-mraid1,
> > assuming the performance isn't too painful.
> 
> -------------------------------------------------------------------------------------------------
> FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
@ 2018-08-10 23:32 erenthetitan
  0 siblings, 0 replies; 21+ messages in thread
From: erenthetitan @ 2018-08-10 23:32 UTC (permalink / raw)
  To: linux-btrfs

Did i get you right?
Please correct me if i am wrong:

Scrubbing seems to have been fixed, you only have to run it once.

Hotplugging (temporary connection loss) is affected by the write hole bug, and will create undetectable errors every 16 TB (crc32 limitation).

The write Hole Bug can affect both old and new data.
Reason: BTRFS saves data in fixed size stripes, if the write operation fails midway, the stripe is lost.
This does not matter much for Raid 1/10, data always uses a full stripe, and stripes are copied on write. Only new data could be lost.
However, for some reason Raid 5/6 works with partial stripes, meaning that data is stored in stripes not completely filled by prior data, and stripes are removed on write.
Result: If the operation fails midway, the stripe is lost as is all data previously stored in it.

Transid Mismatch can silently corrupt data.
Reason: It is a separate metadata failure that is triggered by lost or incomplete writes, writes that are lost somewhere during transmission.
It can happen to all BTRFS configurations and is not triggered by the write hole.
It could happen due to brownout (temporary undersupply of voltage), faulty cables, faulty RAM, faulty disk cache, faulty disks in general.

Both bugs could damage metadata and trigger the following:
Data will be lost (0 to 100% unreadable), the filesystem will be readonly.
Reason: BTRFS saves metadata as a tree structure. The closer the error to the root, the more data cannot be read.

Transid Mismatch can happen up to once every 3 months per device,
depending on the drive hardware!

Question: Does this not make transid mismatch way more dangerous than
the write hole? What would happen to other filesystems, like ext4?

On 10-Aug-2018 09:12:21 +0200, ce3g8jdj@umail.furryterror.org wrote:
> > On Fri, Aug 10, 2018 at 03:40:23AM +0200, erenthetitan@mail.de wrote:
> > > I am searching for more information regarding possible bugs related to
> > > BTRFS Raid 5/6. All sites i could find are incomplete and information
> > > contradicts itself:
> > >
> > > The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56)
> > > warns of the write hole bug, stating that your data remains safe
> > > (except data written during power loss, obviously) upon unclean shutdown
> > > unless your data gets corrupted by further issues like bit-rot, drive
> > > failure etc.
> > 
> > The raid5/6 write hole bug exists on btrfs (as of 4.14-4.16) and there are
> > no mitigations to prevent or avoid it in mainline kernels.
> > 
> > The write hole results from allowing a mixture of old (committed) and
> > new (uncommitted) writes to the same RAID5/6 stripe (i.e. a group of
> > blocks consisting of one related data or parity block from each disk
> > in the array, such that writes to any of the data blocks affect the
> > correctness of the parity block and vice versa). If the writes were
> > not completed and one or more of the data blocks are not online, the
> > data blocks reconstructed by the raid5/6 algorithm will be corrupt.
> > 
> > If all disks are online, the write hole does not immediately
> > damage user-visible data as the old data blocks can still be read
> > directly; however, should a drive failure occur later, old data may
> > not be recoverable because the parity block will not be correct for
> > reconstructing the missing data block. A scrub can fix write hole
> > errors if all disks are online, and a scrub should be performed after
> > any unclean shutdown to recompute parity data.
> > 
> > The write hole always puts both old and new data at risk of damage;
> > however, due to btrfs's copy-on-write behavior, only the old damaged
> > data can be observed after power loss. The damaged new data will have
> > no references to it written to the disk due to the power failure, so
> > there is no way to observe the new damaged data using the filesystem.
> > Not every interrupted write causes damage to old data, but some will.
> > 
> > Two possible mitigations for the write hole are:
> > 
> > - modify the btrfs allocator to prevent writes to partially filled
> > raid5/6 stripes (similar to what the ssd mount option does, except
> > with the correct parameters to match RAID5/6 stripe boundaries),
> > and advise users to run btrfs balance much more often to reclaim
> > free space in partially occupied raid stripes
> > 
> > - add a stripe write journal to the raid5/6 layer (either in
> > btrfs itself, or in a lower RAID5 layer).
> > 
> > There are assorted other ideas (e.g. copy the RAID-Z approach from zfs
> > to btrfs or dramatically increase the btrfs block size) that also solve
> > the write hole problem but are somewhat more invasive and less practical
> > for btrfs.
> > 
> > Note that the write hole also affects btrfs on top of other similar
> > raid5/6 implementations (e.g. mdadm raid5 without stripe journal).
> > The btrfs CoW layer does not understand how to allocate data to avoid RMW
> > raid5 stripe updates without corrupting existing committed data, and this
> > limitation applies to every combination of unjournalled raid5/6 and btrfs.
> > 
> > > The Wiki Gotchas Page (https://btrfs.wiki.kernel.org/index.php/Gotchas)
> > > warns of possible incorrigible "transid" mismatch, not stating which
> > > versions are affected or what transid mismatch means for your data. It
> > > does not mention the write hole at all.
> > 
> > Neither raid5 nor write hole are required to produce a transid mismatch
> > failure. transid mismatch usually occurs due to a lost write. Write hole
> > is a specific case of lost write, but write hole does not usually produce
> > transid failures (it produces header or csum failures instead).
> > 
> > During real disk failure events, multiple distinct failure modes can
> > occur concurrently. i.e. both transid failure and write hole can occur
> > at different places in the same filesystem as a result of attempting to
> > use a failing disk over a long period of time.
> > 
> > A transid verify failure is metadata damage. It will make the filesystem
> > readonly and make some data inaccessible as described below.
> > 
> > > This Mail Archive
> > > (https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg55161.html)
> > > states that scrubbing BTRFS Raid 5/6 will always repair Data Corruption,
> > > but may corrupt your Metadata while trying to do so - meaning you have
> > > to scrub twice in a row to ensure data integrity.
> > 
> > Simple corruption (without write hole errors) is fixed by scrubbing
> > as of the last...at least six months? Kernel v4.14.xx and later can
> > definitely do it these days. Both data and metadata.
> > 
> > If the metadata is damaged in any way (corruption, write hole, or transid
> > verify failure) on btrfs and btrfs cannot use the raid profile for
> > metadata to recover the damaged data, the filesystem is usually forever
> > readonly, and anywhere from 0 to 100% of the filesystem may be readable
> > depending on where in the metadata tree structure the error occurs (the
> > closer to the root, the more data is lost). This is the same for dup,
> > raid1, raid5, raid6, and raid10 profiles. raid0 and single profiles are
> > not a good idea for metadata if you want a filesystem that can persist
> > across reboots (some use cases don't require persistence, so they can
> > use -msingle/-mraid0 btrfs as a large-scale tmpfs).
> > 
> > For all metadata raid profiles, recovery can fail due to risks including
> > RAM corruption, multiple drives having defects in the same locations,
> > or multiple drives with identically-behaving firmware bugs. For raid5/6
> > metadata there is the *additional* risk of the write hole bug preventing
> > recovery of metadata.
> > 
> > If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable
> > to the write hole, but data is. In this configuration you can determine
> > with high confidence which files you need to restore from backup, and
> > the filesystem will remain writable to replace the restored data, because
> > raid1 does not have the write hole bug.
> > 
> > More than one scrub for a single write hole event won't help (and never
> > did). If the first scrub doesn't fix all the errors then your kernel
> > probably also has a race condition bug or regression that will permanently
> > corrupt the data (this was true in 2016 when the referenced mailing
> > list post was written).
> > 
> > Current kernels don't have such bugs--if the first scrub can correct
> > the data, it does, and if the first scrub can't correct the data then
> > all future scrubs will produce identical results.
> > 
> > Older kernels (2016) had problems reconstructing data during read()
> > operations but could fix data during scrub or balance operations.
> > These bugs, as far as I am able to test, have been fixed by v4.17 and
> > backported to v4.14.
> > 
> > > The Bugzilla Entry
> > > (https://bugzilla.kernel.org/buglist.cgi?component=btrfs) contains
> > > mostly unanswered bugs, which may or may not still count (2013 - 2018).
> > 
> > I find that any open bug over three years old on b.k.o can be safely
> > ignored because it has either already been fixed or there is not enough
> > information provided to understand what is going on.
> > 
> > > This Spinics Discussion
> > > (https://www.spinics.net/lists/linux-btrfs/msg76471.html) states
> > > that the write hole can even damage old data eg. data that was not
> > > accessed during unclean shutdown, the opposite of what the Raid5/6
> > > Status Page states!
> > 
> > Correct, write hole can *only* damage old data as described above.
> > 
> > > This Spinics comment
> > > (https://www.spinics.net/lists/linux-btrfs/msg76412.html) informs that
> > > hot-plugging a device will trigger the write hole. Accessed data will
> > > therefore be corrupted. In case the earlier statement about old data
> > > corruption is true, random data could be permamently lost. This is even
> > > more dangerous if you are connecting your devices via USB, as USB can
> > > unconnect due to external influence, eg. touching the cables, shaking...
> > 
> > Hot-unplugging a device can cause many lost write events at once, and
> > each lost write event is very bad.
> > 
> > btrfs does not reject and resynchronize a device from a raid array if a
> > write to the device fails (unlike every other working RAID implementation
> > on Earth...). If the device reconnects, btrfs will read a mixture of
> > old and new data and rely on checksums to determine which blocks are
> > out of date (as opposed to treating the departed disk as entirely out
> > of date and initiating a disk replace operation when it reconnects).
> > 
> > A scrub after a momentary disconnect can reconstruct most missing data,
> > but not all. CRC32 lets one error through per 16 TB of corrupted blocks,
> > and all nodatasum/nodatacow files modified while a drive was offline
> > will be corrupted without detection or recovery by btrfs.
> > 
> > Device replace is currently the best recovery option from this kind
> > of failure. Ideally btrfs would implement something like mdadm write
> > intent bitmaps so only those block groups that were modified while
> > the device as offline would be replaced, but this is the btrfs we want
> > not the btrfs we have.
> > 
> > > Lastly, this Superuser question
> > > (https://superuser.com/questions/1325245/btrfs-transid-failure#1344494)
> > > assumes that the transid mismatch bug could toggle your system
> > > unmountable. While it might be possible to restore your data using
> > > sudo BTRFS Restore, it is still unknown how the transid mismatch is
> > > even toggled, meaning that your file system could fail at any time!
> > 
> > Note that transid failure risk applies to all btrfs configurations.
> > It is not specific to raid5/6. The write hole errors from raid5/6 will
> > typically produce a header or csum failure (from reading garbage) not a
> > transid failure (from reading an old, valid, but deleted metadata block).
> > 
> > transid mismatch is pretty simple: one of your disk drives, or some
> > caching or translation layer between btrfs and your disk drives, dropped
> > a write (or, less likely, read from or wrote to the wrong sector address).
> > btrfs detects this by embedding transids into all data structures where
> > one object points to another object in a different block.
> > 
> > transid mismatch is also hard: you then have to figure out which layer
> > of your possibly quite complicated RAID setup is doing that, and make
> > it stop. This process almost never involves btrfs. Sometimes it's the
> > bottom layer (i.e. the drives themselves) but the more layers you add,
> > the more candidates need to be eliminated before the cause can be found.
> > Sometimes it's a *power supply* (i.e. the drive controller CPU browns
> > out and forgets it was writing something or corrupts its embedded RAM).
> > Sometimes it's host RAM going bad, corrupting and breaking everything
> > it touches.
> > 
> > I have a variety of test setups and the correlation between hardware
> > model (especially drive model, but also some SATA controller models)
> > and total filesystem loss due to transid verify failure is very strong.
> > Out of 10 drive models from 5 vendors, 2 models can't keep a filesystem
> > intact for more than a few months, while the other models average 3 years
> > old and still hold the first btrfs filesystem they were formatted with.
> > 
> > Disabling drive write caching sometimes helps, but some hardware eats
> > a filesystem every few months no matter what settings I change. If the
> > problem is a broken SATA controller or cable then changing drive settings
> > won't help.
> > 
> > It's fun and/or scary to put known good and bad hardware in the same
> > RAID1 array and watch btrfs autocorrecting the bad data after every
> > other power failure; however, the bad hardware is clearly not sufficient
> > to implement any sort of reliable data persistence, and arrays with bad
> > hardware in them will eventually fail.
> > 
> > The bad drives can still contribute to society as media cache servers or
> > point-of-sale terminals where the only response to any data integrity
> > issue is a full reformat and image reinstall. This seems to be the
> > target market that low-end consumer drives are aiming for, as they seem
> > to be useless for anything else.
> > 
> > Adopt a zero-tolerance policy for drive resets after the array is
> > mounted and active. A drive reset means a potential lost write leading
> > to a transid verify failure. Swap out both drive and SATA cable the
> > first time a reset occurs during a read or write operation, and consider
> > swapping out SATA controller, changing drive model, and upgrading power
> > supply if it happens twice.
> > 
> > > Do you know of any comprehensive and complete Bug list?
> > 
> > ...related to raid5/6:
> > 
> > - no write hole mitigation (at least two viable strategies
> > available)
> > 
> > - no device bouncing mitigation (mdadm had this working 20
> > years ago)
> > 
> > - probably slower than it could be
> > 
> > - no recovery strategy other than raid (btrfs check --repair is
> > useless on non-trivial filesystems, and a single-bit uncorrected
> > metadata error makes the filesystem unusable)
> > 
> > > Do you know more about the stated Bugs?
> > >
> > > Do you know further Bugs that are not addressed in any of these sites?
> > 
> > My testing on raid5/6 filesystems is producing pretty favorable results
> > these days. There do not seem to be many bugs left.
> > 
> > I have one test case where I write millions of errors into a raid5/6 and
> > the filesystem recovers every single one transparently while verifying
> > SHA1 hashes of test data. After years of rebuilding busted ext3 on
> > mdadm-raid5 filesystems, watching btrfs do it all automatically is
> > just...beautiful.
> > 
> > I think once the write hole and device bouncing mitigations are in place,
> > I'll start looking at migrating -draid1/-mraid1 setups to -draid5/-mraid1,
> > assuming the performance isn't too painful.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: List of known BTRFS Raid 5/6 Bugs?
  2018-08-10  1:40 erenthetitan
@ 2018-08-10  7:12 ` Zygo Blaxell
  0 siblings, 0 replies; 21+ messages in thread
From: Zygo Blaxell @ 2018-08-10  7:12 UTC (permalink / raw)
  To: erenthetitan; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 13787 bytes --]

On Fri, Aug 10, 2018 at 03:40:23AM +0200, erenthetitan@mail.de wrote:
> I am searching for more information regarding possible bugs related to
> BTRFS Raid 5/6. All sites I could find are incomplete and information
> contradicts itself:
>
> The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56)
> warns of the write hole bug, stating that your data remains safe
> (except data written during power loss, obviously) upon unclean shutdown
> unless your data gets corrupted by further issues like bit-rot, drive
> failure etc.

The raid5/6 write hole bug exists on btrfs (as of 4.14-4.16) and there are
no mitigations to prevent or avoid it in mainline kernels.

The write hole results from allowing a mixture of old (committed) and
new (uncommitted) writes to the same RAID5/6 stripe (i.e. a group of
blocks consisting of one related data or parity block from each disk
in the array, such that writes to any of the data blocks affect the
correctness of the parity block and vice versa).  If the writes were
not completed and one or more of the data blocks are not online, the
data blocks reconstructed by the raid5/6 algorithm will be corrupt.
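
To make the failure mode concrete, here is a minimal toy sketch (plain
shell arithmetic, single-byte "blocks", nothing to do with real btrfs
code) of how an interrupted read-modify-write corrupts a block that was
never rewritten:

        # Toy 3-disk RAID5 stripe: two data "blocks" and one parity
        # block, reduced to single bytes for illustration.
        d1=$(( 0xAA ))              # old, committed data
        d2=$(( 0x55 ))              # old, committed data
        p=$((  d1 ^ d2 ))           # parity consistent with both: 0xFF

        # An in-place RMW update of d2 starts: the new d2 reaches the
        # disk, but power is lost before the matching parity is written.
        d2=$(( 0x0F ))              # new data landed; p is now stale

        # Later the disk holding d1 fails and d1 must be rebuilt:
        d1_rebuilt=$(( p ^ d2 ))
        printf 'original d1=0x%02X  rebuilt d1=0x%02X\n' 0xAA "$d1_rebuilt"
        # prints 0xAA vs 0xF0: the old, committed block d1 comes back
        # corrupt even though d1 itself was never touched by the write.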

If all disks are online, the write hole does not immediately
damage user-visible data as the old data blocks can still be read
directly; however, should a drive failure occur later, old data may
not be recoverable because the parity block will not be correct for
reconstructing the missing data block.  A scrub can fix write hole
errors if all disks are online, and a scrub should be performed after
any unclean shutdown to recompute parity data.
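
For reference, the usual commands look something like this (the mount
point is a placeholder):

        btrfs scrub start -Bd /mnt      # -B: run in foreground, -d: per-device stats
        btrfs scrub status /mnt         # progress/results of a background scrub
        btrfs dev stats /mnt            # cumulative per-device error counters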

The write hole always puts both old and new data at risk of damage;
however, due to btrfs's copy-on-write behavior, only the old damaged
data can be observed after power loss.  The damaged new data will have
no references to it written to the disk due to the power failure, so
there is no way to observe the new damaged data using the filesystem.
Not every interrupted write causes damage to old data, but some will.

Two possible mitigations for the write hole are:

	- modify the btrfs allocator to prevent writes to partially filled
	raid5/6 stripes (similar to what the ssd mount option does, except
	with the correct parameters to match RAID5/6 stripe boundaries),
	and advise users to run btrfs balance much more often to reclaim
	free space in partially occupied raid stripes

	- add a stripe write journal to the raid5/6 layer (either in
	btrfs itself, or in a lower RAID5 layer).

There are assorted other ideas (e.g. copy the RAID-Z approach from zfs
to btrfs or dramatically increase the btrfs block size) that also solve
the write hole problem but are somewhat more invasive and less practical
for btrfs.

Note that the write hole also affects btrfs on top of other similar
raid5/6 implementations (e.g. mdadm raid5 without stripe journal).
The btrfs CoW layer does not understand how to allocate data to avoid RMW
raid5 stripe updates without corrupting existing committed data, and this
limitation applies to every combination of unjournalled raid5/6 and btrfs.

> The Wiki Gotchas Page (https://btrfs.wiki.kernel.org/index.php/Gotchas)
> warns of possible uncorrectable "transid" mismatch, not stating which
> versions are affected or what transid mismatch means for your data. It
> does not mention the write hole at all.

Neither raid5 nor write hole are required to produce a transid mismatch
failure.  transid mismatch usually occurs due to a lost write.  Write hole
is a specific case of lost write, but write hole does not usually produce
transid failures (it produces header or csum failures instead).

During real disk failure events, multiple distinct failure modes can
occur concurrently.  i.e. both transid failure and write hole can occur
at different places in the same filesystem as a result of attempting to
use a failing disk over a long period of time.

A transid verify failure is metadata damage.  It will make the filesystem
readonly and make some data inaccessible as described below.

> This Mail Archive
> (https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg55161.html)
> states that scrubbing BTRFS Raid 5/6 will always repair Data Corruption,
> but may corrupt your Metadata while trying to do so - meaning you have
> to scrub twice in a row to ensure data integrity.

Simple corruption (without write hole errors) is fixed by scrubbing
as of the last...at least six months?  Kernel v4.14.xx and later can
definitely do it these days.  Both data and metadata.

If the metadata is damaged in any way (corruption, write hole, or transid
verify failure) on btrfs and btrfs cannot use the raid profile for
metadata to recover the damaged data, the filesystem is usually forever
readonly, and anywhere from 0 to 100% of the filesystem may be readable
depending on where in the metadata tree structure the error occurs (the
closer to the root, the more data is lost).  This is the same for dup,
raid1, raid5, raid6, and raid10 profiles.  raid0 and single profiles are
not a good idea for metadata if you want a filesystem that can persist
across reboots (some use cases don't require persistence, so they can
use -msingle/-mraid0 btrfs as a large-scale tmpfs).

For all metadata raid profiles, recovery can fail due to risks including
RAM corruption, multiple drives having defects in the same locations,
or multiple drives with identically-behaving firmware bugs.  For raid5/6
metadata there is the *additional* risk of the write hole bug preventing
recovery of metadata.

If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable
to the write hole, but data is.  In this configuration you can determine
with high confidence which files you need to restore from backup, and
the filesystem will remain writable to replace the restored data, because
raid1 does not have the write hole bug.
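
For completeness, that layout is just the normal profile options at mkfs
time; device names and mount point below are placeholders:

        mkfs.btrfs -d raid5 -m raid1 /dev/sdX /dev/sdY /dev/sdZ
        mount /dev/sdX /mnt
        btrfs filesystem usage /mnt     # should report Data,RAID5 and Metadata,RAID1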

More than one scrub for a single write hole event won't help (and never
did).  If the first scrub doesn't fix all the errors then your kernel
probably also has a race condition bug or regression that will permanently
corrupt the data (this was true in 2016 when the referenced mailing
list post was written).

Current kernels don't have such bugs--if the first scrub can correct
the data, it does, and if the first scrub can't correct the data then
all future scrubs will produce identical results.

Older kernels (2016) had problems reconstructing data during read()
operations but could fix data during scrub or balance operations.
These bugs, as far as I am able to test, have been fixed by v4.17 and
backported to v4.14.

> The Bugzilla Entry
> (https://bugzilla.kernel.org/buglist.cgi?component=btrfs) contains
> mostly unanswered bugs, which may or may not still apply (2013 - 2018).

I find that any open bug over three years old on b.k.o can be safely
ignored because it has either already been fixed or there is not enough
information provided to understand what is going on.

> This Spinics Discussion
> (https://www.spinics.net/lists/linux-btrfs/msg76471.html) states
> that the write hole can even damage old data, e.g. data that was not
> accessed during unclean shutdown, the opposite of what the Raid5/6
> Status Page states!

Correct, write hole can *only* damage old data as described above.

> This Spinics comment
> (https://www.spinics.net/lists/linux-btrfs/msg76412.html) informs that
> hot-plugging a device will trigger the write hole. Accessed data will
> therefore be corrupted.  In case the earlier statement about old data
> corruption is true, random data could be permanently lost.  This is even
> more dangerous if you are connecting your devices via USB, as USB can
> disconnect due to external influence, e.g. touching the cables, shaking...

Hot-unplugging a device can cause many lost write events at once, and
each lost write event is very bad.

btrfs does not reject and resynchronize a device from a raid array if a
write to the device fails (unlike every other working RAID implementation
on Earth...).  If the device reconnects, btrfs will read a mixture of
old and new data and rely on checksums to determine which blocks are
out of date (as opposed to treating the departed disk as entirely out
of date and initiating a disk replace operation when it reconnects).

A scrub after a momentary disconnect can reconstruct most missing data,
but not all.  CRC32 lets one error through per 16 TB of corrupted blocks,
and all nodatasum/nodatacow files modified while a drive was offline
will be corrupted without detection or recovery by btrfs.
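
The 16 TB figure is only the back-of-the-envelope collision rate for a
32-bit checksum over 4 KiB blocks, roughly:

        # assuming 4 KiB blocks and ~uniform crc32c collisions:
        block_size=4096
        crc_space=$(( 1 << 32 ))                    # 2^32 possible checksum values
        bytes_per_miss=$(( crc_space * block_size ))
        echo "$(( bytes_per_miss >> 40 )) TiB of corrupt blocks per expected undetected error"
        # prints 16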

Device replace is currently the best recovery option from this kind
of failure.  Ideally btrfs would implement something like mdadm write
intent bitmaps so only those block groups that were modified while
the device was offline would be replaced, but this is the btrfs we want,
not the btrfs we have.
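
In practice that means something along these lines (device names and
mount point are placeholders):

        # replace the suspect member with a fresh device:
        btrfs replace start /dev/sdX /dev/sdY /mnt
        btrfs replace status /mnt
        # add -r to avoid reading from the old device unless necessary:
        # btrfs replace start -r /dev/sdX /dev/sdY /mnt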

> Lastly, this Superuser question
> (https://superuser.com/questions/1325245/btrfs-transid-failure#1344494)
> assumes that the transid mismatch bug could render your system
> unmountable.  While it might be possible to restore your data using
> sudo btrfs restore, it is still unknown how the transid mismatch is
> even triggered, meaning that your file system could fail at any time!

Note that transid failure risk applies to all btrfs configurations.
It is not specific to raid5/6.  The write hole errors from raid5/6 will
typically produce a header or csum failure (from reading garbage) not a
transid failure (from reading an old, valid, but deleted metadata block).

transid mismatch is pretty simple:  one of your disk drives, or some
caching or translation layer between btrfs and your disk drives, dropped
a write (or, less likely, read from or wrote to the wrong sector address).
btrfs detects this by embedding transids into all data structures where
one object points to another object in a different block.
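
If you want to see the counters yourself, they show up as "generation"
fields in, e.g., a superblock dump (/dev/sdX is a placeholder); a member
whose generation lags the others is one hint that it has been dropping
writes:

        btrfs inspect-internal dump-super /dev/sdX | grep -i generation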

transid mismatch is also hard:  you then have to figure out which layer
of your possibly quite complicated RAID setup is doing that, and make
it stop.  This process almost never involves btrfs.  Sometimes it's the
bottom layer (i.e. the drives themselves) but the more layers you add,
the more candidates need to be eliminated before the cause can be found.
Sometimes it's a *power supply* (i.e. the drive controller CPU browns
out and forgets it was writing something or corrupts its embedded RAM).
Sometimes it's host RAM going bad, corrupting and breaking everything
it touches.

I have a variety of test setups and the correlation between hardware
model (especially drive model, but also some SATA controller models)
and total filesystem loss due to transid verify failure is very strong.
Out of 10 drive models from 5 vendors, 2 models can't keep a filesystem
intact for more than a few months, while the other models average 3 years
old and still hold the first btrfs filesystem they were formatted with.

Disabling drive write caching sometimes helps, but some hardware eats
a filesystem every few months no matter what settings I change.  If the
problem is a broken SATA controller or cable then changing drive settings
won't help.
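
Where it does help, it's a one-liner per drive, e.g. (placeholder device
name; the setting may need to be reapplied after a power cycle):

        hdparm -W 0 /dev/sdX        # disable the drive's volatile write cache
        hdparm -W /dev/sdX          # query the current setting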

It's fun and/or scary to put known good and bad hardware in the same
RAID1 array and watch btrfs autocorrecting the bad data after every
other power failure; however, the bad hardware is clearly not sufficient
to implement any sort of reliable data persistence, and arrays with bad
hardware in them will eventually fail.

The bad drives can still contribute to society as media cache servers or
point-of-sale terminals where the only response to any data integrity
issue is a full reformat and image reinstall.  This seems to be the
target market that low-end consumer drives are aiming for, as they seem
to be useless for anything else.

Adopt a zero-tolerance policy for drive resets after the array is
mounted and active.  A drive reset means a potential lost write leading
to a transid verify failure.  Swap out both drive and SATA cable the
first time a reset occurs during a read or write operation, and consider
swapping out SATA controller, changing drive model, and upgrading power
supply if it happens twice.
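
A rough way to watch for resets is to grep the kernel log; the exact
messages vary by driver, so treat these patterns as examples only:

        dmesg -T | grep -Ei 'hard resetting link|link is slow to respond'
        journalctl -k | grep -Ei 'ata[0-9]+.*reset'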

> Do you know of any comprehensive and complete Bug list?

...related to raid5/6:

	- no write hole mitigation (at least two viable strategies
	available)

	- no device bouncing mitigation (mdadm had this working 20
	years ago)

	- probably slower than it could be

	- no recovery strategy other than raid (btrfs check --repair is
	useless on non-trivial filesystems, and a single-bit uncorrected
	metadata error makes the filesystem unusable)

> Do you know more about the stated Bugs?
>
> Do you know further Bugs that are not addressed in any of these sites?

My testing on raid5/6 filesystems is producing pretty favorable results
these days.  There do not seem to be many bugs left.

I have one test case where I write millions of errors into a raid5/6 and
the filesystem recovers every single one transparently while verifying
SHA1 hashes of test data.  After years of rebuilding busted ext3 on
mdadm-raid5 filesystems, watching btrfs do it all automatically is
just...beautiful.
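
A stripped-down sketch of that kind of test is below; it is destructive
and for scratch arrays only, the device, mount point, and checksum file
are placeholders, and the real harness does considerably more:

        # corrupt a region of one member device behind btrfs's back
        dd if=/dev/urandom of=/dev/sdX bs=1M count=100 seek=2048 conv=notrunc

        # let btrfs repair from parity/mirror copies, then verify file contents
        btrfs scrub start -Bd /mnt/test
        sha1sum -c /root/testdata.sha1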

I think once the write hole and device bouncing mitigations are in place,
I'll start looking at migrating -draid1/-mraid1 setups to -draid5/-mraid1,
assuming the performance isn't too painful.
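
The migration itself would just be a convert filter on balance, something
like (placeholder mount point):

        # data converts to raid5; metadata is already raid1 and can stay put
        btrfs balance start -dconvert=raid5 /mnt
        btrfs balance status /mnt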


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* List of known BTRFS Raid 5/6 Bugs?
@ 2018-08-10  1:40 erenthetitan
  2018-08-10  7:12 ` Zygo Blaxell
  0 siblings, 1 reply; 21+ messages in thread
From: erenthetitan @ 2018-08-10  1:40 UTC (permalink / raw)
  To: linux-btrfs

I am searching for more information regarding possible bugs related to BTRFS Raid 5/6. All sites I could find are incomplete and information contradicts itself:

The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56)
warns of the write hole bug, stating that your data remains safe (except data written during power loss, obviously) upon unclean shutdown unless your data gets corrupted by further issues like bit-rot, drive failure etc.

The Wiki Gotchas Page (https://btrfs.wiki.kernel.org/index.php/Gotchas)
warns of possible uncorrectable "transid" mismatch, not stating which versions are affected or what transid mismatch means for your data. It does not mention the write hole at all.

This Mail Archive (https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg55161.html)
states that scrubbing BTRFS Raid 5/6 will always repair Data Corruption, but may corrupt your Metadata while trying to do so - meaning you have to scrub twice in a row to ensure data integrity.

The Bugzilla Entry (https://bugzilla.kernel.org/buglist.cgi?component=btrfs)
contains mostly unanswered bugs, which may or may not still apply (2013 - 2018).

This Spinics Discussion (https://www.spinics.net/lists/linux-btrfs/msg76471.html)
states that the write hole can even damage old data, e.g. data that was not accessed during unclean shutdown, the opposite of what the Raid5/6 Status Page states!

This Spinics comment (https://www.spinics.net/lists/linux-btrfs/msg76412.html)
informs that hot-plugging a device will trigger the write hole. Accessed data will therefore be corrupted.
In case the earlier statement about old data corruption is true, random data could be permanently lost.
This is even more dangerous if you are connecting your devices via USB, as USB can disconnect due to external influence, e.g. touching the cables, shaking...

Lastly, this Superuser question (https://superuser.com/questions/1325245/btrfs-transid-failure#1344494)
assumes that the transid mismatch bug could render your system unmountable.
While it might be possible to restore your data using sudo btrfs restore, it is still unknown how the transid mismatch is even triggered, meaning that your file system could fail at any time!

Do you know of any comprehensive and complete Bug list?

Do you know more about the stated Bugs?

Do you know further Bugs that are not addressed in any of these sites?


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-09-12  7:02 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-07 13:58 List of known BTRFS Raid 5/6 Bugs? Stefan K
2018-09-08  8:40 ` Duncan
2018-09-11 11:29   ` Stefan K
2018-09-12  1:57     ` Duncan
  -- strict thread matches above, loose matches on Subject: below --
2018-08-13 21:56 erenthetitan
2018-08-14  4:09 ` Zygo Blaxell
2018-08-14  7:32 ` Menion
2018-08-15  3:33   ` Zygo Blaxell
2018-08-15  7:27     ` Menion
2018-08-16 19:38       ` erenthetitan
2018-08-17  8:33         ` Menion
2018-08-11  2:18 erenthetitan
2018-08-11  5:49 ` Zygo Blaxell
2018-08-11  6:27   ` erenthetitan
2018-08-11 15:25     ` Zygo Blaxell
2018-08-13  7:20       ` Menion
2018-08-14  3:49         ` Zygo Blaxell
2018-08-11  0:45 Zygo Blaxell
2018-08-10 23:32 erenthetitan
2018-08-10  1:40 erenthetitan
2018-08-10  7:12 ` Zygo Blaxell
