* MD RAID 5/6 vs BTRFS RAID 5/6
@ 2019-10-16 15:40 Edmund Urbani
  2019-10-16 19:42 ` Zygo Blaxell
  2019-10-17  4:07 ` Jon Ander MB
  0 siblings, 2 replies; 12+ messages in thread
From: Edmund Urbani @ 2019-10-16 15:40 UTC (permalink / raw)
  To: linux-btrfs

Hello everyone,

Having recovered most of my data from my btrfs RAID-6, I have now migrated to
mdadm RAID (with btrfs on top). I am considering switching back to btrfs RAID
some day, once I feel more confident about its maturity.

I put some thought into the pros and cons of this choice that I would like to share:

btrfs RAID-5/6:

- RAID write hole issue still unsolved (assuming
https://btrfs.wiki.kernel.org/index.php/RAID56 is up-to-date)
+ can detect and fix bit rot
+ flexibility (resizing / reshaping)
- maturity? (I had a hard time recovering my data after removal of a drive that
had developed some bad blocks. That's not what I would expect from a RAID-6
setup. To be fair I should point out that I was running kernel 4.14 at the time
and did not do regular scrubbing.)

btrfs on MD RAID 5/6:

+ options to mitigate RAID write hole
- bitrot can only be detected but not fixed
+ mature and proven RAID implementation (based on personal experience of
replacing plenty of drives over the years without data loss)

I would be interested in getting your feedback on this comparison. Do you agree
with my observations? Did I miss anything you would consider important?

Regards,
 Edmund






* Re: MD RAID 5/6 vs BTRFS RAID 5/6
  2019-10-16 15:40 MD RAID 5/6 vs BTRFS RAID 5/6 Edmund Urbani
@ 2019-10-16 19:42 ` Zygo Blaxell
  2019-10-21 15:27   ` Edmund Urbani
  2019-10-17  4:07 ` Jon Ander MB
  1 sibling, 1 reply; 12+ messages in thread
From: Zygo Blaxell @ 2019-10-16 19:42 UTC (permalink / raw)
  To: Edmund Urbani; +Cc: linux-btrfs

On Wed, Oct 16, 2019 at 05:40:10PM +0200, Edmund Urbani wrote:
> Hello everyone,
> 
> Having recovered most of my data from my btrfs RAID-6, I have now migrated to
> mdadm RAID (with btrfs on top). I am considering switching back to btrfs RAID
> some day, once I feel more confident about its maturity.
> 
> I put some thought into the pros and cons of this choice that I would like to share:
> 
> btrfs RAID-5/6:
> 
> - RAID write hole issue still unsolved (assuming
> https://btrfs.wiki.kernel.org/index.php/RAID56 is up-to-date)
> + can detect and fix bit rot
> + flexibility (resizing / reshaping)
> - maturity? (I had a hard time recovering my data after removal of a drive that
> had developed some bad blocks. That's not what I would expect from a RAID-6
> setup. To be fair I should point out that I was running kernel 4.14 at the time
> and did not do regular scrubbing.)

That only really started working (including data reconstruction after
corruption events) around 4.15 or 4.16.  On later kernels, one can
destroy one byte from every block on two disks in a raid6 array and
still recover everything.  This is a somewhat stronger requirement than
degraded mode with two disks missing, since a pass on this test requires
btrfs to prove every individual block is bad using csums and correct the
data without errors.  With degraded mode btrfs always knows two blocks
are missing, so the pass requirement is weaker.
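
A scaled-down version of that kind of test can be reproduced with loop
devices; a rough sketch (paths and sizes are placeholders, it assumes
loop0-3 are free, and it only corrupts a region of two members rather
than every block):

    # a small 4-member raid6 btrfs on loop devices
    truncate -s 1G /tmp/d1 /tmp/d2 /tmp/d3 /tmp/d4
    for f in /tmp/d1 /tmp/d2 /tmp/d3 /tmp/d4; do losetup -f "$f"; done
    mkfs.btrfs -f -d raid6 -m raid6 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
    mkdir -p /mnt/test && mount /dev/loop0 /mnt/test
    cp -a /usr/share/doc /mnt/test/        # some test data
    umount /mnt/test

    # corrupt a region on two different members, away from the superblocks
    dd if=/dev/urandom of=/dev/loop0 bs=1M seek=128 count=64 oflag=direct
    dd if=/dev/urandom of=/dev/loop1 bs=1M seek=256 count=64 oflag=direct

    # remount; scrub should detect the bad csums and rebuild from parity
    mount /dev/loop0 /mnt/test
    btrfs scrub start -B /mnt/test
    btrfs scrub status /mnt/test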

The one thing that doesn't work yet is powering the system off or resetting it
while writing, then dropping two disks, then reading all the data without
errors.  If a single write hole event occurs, some data may be lost,
from a few blocks to the entire filesystem, depending on which blocks
are affected.

> btrfs on MD RAID 5/6:
> 
> + options to mitigate RAID write hole

Those options _must_ be used with btrfs, or the exact same write hole
issue will occur.  They must be used with other filesystems too, but
other filesystems will tolerate metadata corruption while btrfs will not.

The write hole issue is caused by the interaction between
committed/uncommitted data boundaries and RAID stripe boundaries--the
boundaries must be respected at every layer, or btrfs CoW data integrity
doesn't work when there are disk failures.  The reason btrfs
currently fails is that the boundaries at the different layers are
not respected--committed and uncommitted data are mixed up inside single
RAID stripes.  mdadm and btrfs raid6 use _identical_ stripe boundaries
(btrfs raid6 is a simplified copy of mdadm raid6) and write hole causes
the same failures in both configurations if the mdadm mitigations are
not enabled.
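
For reference, the mdadm mitigations look roughly like this (placeholder
devices; the journal device should be fast and power-loss safe):

    # option 1: a write journal on a separate SSD/NVMe partition
    mdadm --create /dev/md0 --level=6 --raid-devices=4 \
          --write-journal /dev/nvme0n1p1 /dev/sdb /dev/sdc /dev/sdd /dev/sde

    # option 2: partial parity log (raid5 only, no extra device needed)
    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          --consistency-policy=ppl /dev/sdb /dev/sdc /dev/sdd /dev/sde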

> - bitrot can only be detected but not fixed
> + mature and proven RAID implementation (based on personal experience of
> replacing plenty of drives over the years without data loss)

I've replaced enough drives to lose data on everything, including mdadm.
mdadm is more mature and proven than cheap hard disk firmware or
bleeding-edge LVM/device-mapper, but that's a low bar.

> I would be interested in getting your feedback on this comparison. Do you agree
> with my observations? Did I miss anything you would consider important?

Single-data, dup-metadata btrfs on mdadm raid6 + write hole mitigation.
Nothing less for raid6.

If you use single metadata, you have no way to recover the filesystem if
there is a bitflip in a metadata block on one of the drives.  So always
use -mdup on a btrfs filesystem on top of one mdadm device regardless
of mdadm raid level.  Use -mraid1 if on two or more mdadm devices.
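
For example (placeholder devices):

    # btrfs on top of a single mdadm array device
    mkfs.btrfs -d single -m dup /dev/md0

    # btrfs spanning two or more mdadm devices
    mkfs.btrfs -d single -m raid1 /dev/md0 /dev/md1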

For raid5 I'd choose btrfs -draid5 -mraid1 over mdadm raid5
sometimes--even with the write hole, I'd expect smaller average data
losses than mdadm raid5 + write hole mitigation due to the way disk
failure modes are distributed.  Bit flips and silent corruption (that
mdadm cannot repair) are much more common than bad sectors (that mdadm
can repair) in some drive model families.
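
For that option the mkfs line would look something like this (again,
placeholder devices), plus a prompt scrub after any unclean shutdown:

    # native btrfs: raid5 data, raid1 metadata, across three example disks
    mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd

    # after any crash or power loss, scrub while the array is still complete
    btrfs scrub start -B /mnt/array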

> Regards,
>  Edmund
> 
> 
> 
> 
> 

* Re: MD RAID 5/6 vs BTRFS RAID 5/6
  2019-10-16 15:40 MD RAID 5/6 vs BTRFS RAID 5/6 Edmund Urbani
  2019-10-16 19:42 ` Zygo Blaxell
@ 2019-10-17  4:07 ` Jon Ander MB
  2019-10-17 15:57   ` Chris Murphy
  1 sibling, 1 reply; 12+ messages in thread
From: Jon Ander MB @ 2019-10-17  4:07 UTC (permalink / raw)
  To: Edmund Urbani; +Cc: linux-btrfs

It would be interesting to know the pros and cons of this setup that
you are suggesting vs zfs.
+zfs detects and corrects bitrot (
http://www.zfsnas.com/2015/05/24/testing-bit-rot/ )
+zfs has working raid56
-modules out of kernel for license incompatibilities (a big minus)

BTRFS can detect bitrot but... are we sure it can fix it? (can't seem
to find any conclusive doc about it right now)

I'm one of those that is waiting for the write hole bug to be fixed in
order to use raid5 on my home setup. It's a shame it's taking so long.

Regards




-- 
--- Jon Ander Monleón Besteiro ---


* Re: MD RAID 5/6 vs BTRFS RAID 5/6
  2019-10-17  4:07 ` Jon Ander MB
@ 2019-10-17 15:57   ` Chris Murphy
  2019-10-17 18:23     ` Graham Cobb
                       ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Chris Murphy @ 2019-10-17 15:57 UTC (permalink / raw)
  To: Btrfs BTRFS; +Cc: Qu Wenruo

On Wed, Oct 16, 2019 at 10:07 PM Jon Ander MB <jonandermonleon@gmail.com> wrote:
>
> It would be interesting to know the pros and cons of this setup that
> you are suggesting vs zfs.
> +zfs detects and corrects bitrot (
> http://www.zfsnas.com/2015/05/24/testing-bit-rot/ )
> +zfs has working raid56
> -modules out of kernel for license incompatibilities (a big minus)
>
> BTRFS can detect bitrot but... are we sure it can fix it? (can't seem
> to find any conclusive doc about it right now)

Yes. Active fixups with scrub since 3.19. Passive fixups since 4.12.

> I'm one of those that is waiting for the write hole bug to be fixed in
> order to use raid5 on my home setup. It's a shame it's taking so long.

For what it's worth, the write hole is considered to be rare.
https://lwn.net/Articles/665299/

Further, the write hole means a) parity is corrupt or stale compared
to data stripe elements, which is caused by a crash or powerloss during
writes, and b) subsequently there is a missing device or bad sector in
the same stripe as the corrupt/stale parity stripe element. The effect
of b) is that reconstruction from parity is necessary, and the effect
of a) is that it's reconstructed incorrectly, thus corruption. But
Btrfs detects this corruption, whether it's metadata or data. The
corruption isn't propagated in any case. But it makes the filesystem
fragile if this happens with metadata. Any parity stripe element
staleness likely results in significantly bad reconstruction in this
case and just can't be worked around; even btrfs check probably can't
fix it. If the write hole problem happens in a data block group, the
result is EIO. But the good news is that this isn't going to result in
silent data or filesystem metadata corruption. For sure you'll know
about it.

This is why scrub after a crash or powerloss with raid56 is important,
while the array is still whole (not degraded). The two problems with
that are:

a) the scrub isn't initiated automatically, nor is it obvious to the
user it's necessary
b) the scrub can take a long time; Btrfs has no partial scrubbing.

mdadm arrays, by contrast, offer a write-intent bitmap to know which
blocks to partially scrub, and to trigger it automatically following a
crash or powerloss.
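
Concretely, that means something like this after an unclean shutdown
(mountpoint and device are placeholders):

    # whole-filesystem scrub while the array is still complete
    btrfs scrub start -Bd /mnt/array
    btrfs scrub status /mnt/array

    # mdadm side: a write-intent bitmap limits the post-crash resync
    mdadm --grow /dev/md0 --bitmap=internal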

It seems Btrfs already has enough on-disk metadata to infer a
functional equivalent to the write intent bitmap, via transid. Just
scrub the last ~50 generations the next time it's mounted. Either do
this every time a Btrfs raid56 is mounted. Or create some flag that
allows Btrfs to know if the filesystem was not cleanly shut down. It's
possible 50 generations could be a lot of data, but since it's an
online scrub triggered after mount, it wouldn't add much to mount
times. I'm also picking 50 generations arbitrarily; there's no basis
for that number.

The above doesn't cover the case where a partial stripe write (which
leads to the write hole problem), a crash or powerloss, and one or more
device failures all happen at the same time. In that case there's no
time for a partial scrub to fix the problem leading to the write hole.
So even if the corruption is detected, it's too late to fix it. But at
least an automatic partial scrub, even degraded, would mean the user
is alerted to the uncorrectable problem before they get too far along.


-- 
Chris Murphy


* Re: MD RAID 5/6 vs BTRFS RAID 5/6
  2019-10-17 15:57   ` Chris Murphy
@ 2019-10-17 18:23     ` Graham Cobb
  2019-10-20 21:41       ` Chris Murphy
  2019-10-18 22:19     ` Supercilious Dude
       [not found]     ` <CAGmvKk4wENpDqLFZG+D8_zzjhXokjMfdbmgTKTL49EFcfdVEtQ@mail.gmail.com>
  2 siblings, 1 reply; 12+ messages in thread
From: Graham Cobb @ 2019-10-17 18:23 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS; +Cc: Qu Wenruo

On 17/10/2019 16:57, Chris Murphy wrote:
> On Wed, Oct 16, 2019 at 10:07 PM Jon Ander MB <jonandermonleon@gmail.com> wrote:
>>
>> It would be interesting to know the pros and cons of this setup that
>> you are suggesting vs zfs.
>> +zfs detects and corrects bitrot (
>> http://www.zfsnas.com/2015/05/24/testing-bit-rot/ )
>> +zfs has working raid56
>> -modules out of kernel for license incompatibilities (a big minus)
>>
>> BTRFS can detect bitrot but... are we sure it can fix it? (can't seem
>> to find any conclusive doc about it right now)
> 
> Yes. Active fixups with scrub since 3.19. Passive fixups since 4.12.

Presumably this is dependent on checksums? So neither detection nor
fixup happens for NOCOW files? Even a scrub won't notice because scrub
doesn't attempt to compare both copies unless the first copy has a bad
checksum -- is that correct?

> 
>> I'm one of those that is waiting for the write hole bug to be fixed in
>> order to use raid5 on my home setup. It's a shame it's taking so long.
> 
> For what it's worth, the write hole is considered to be rare.
> https://lwn.net/Articles/665299/
> 
> Further, the write hole means a) parity is corrupt or stale compared
> to data stripe elements which is caused by a crash or powerloss during
> writes, and b) subsequently there is a missing device or bad sector in
> the same stripe as the corrupt/stale parity stripe element. The effect
> of b) is that reconstruction from parity is necessary, and the effect
> of a) is that it's reconstructed incorrectly, thus corruption. But
> Btrfs detects this corruption, whether it's metadata or data. The
> corruption isn't propagated in any case. But it makes the filesystem
> fragile if this happens with metadata. Any parity stripe element
> staleness likely results in significantly bad reconstruction in this
> case, and just can't be worked around, even btrfs check probably can't
> fix it. If the write hole problem happens with data block group, then
> EIO. But the good news is that this isn't going to result in silent
> data or file system metadata corruption. For sure you'll know about
> it.

If I understand correctly, metadata always has checksums so that is true
for filesystem structure. But for no-checksum files (such as nocow
files) the corruption will be silent, won't it?

Graham


* Re: MD RAID 5/6 vs BTRFS RAID 5/6
  2019-10-17 15:57   ` Chris Murphy
  2019-10-17 18:23     ` Graham Cobb
@ 2019-10-18 22:19     ` Supercilious Dude
       [not found]     ` <CAGmvKk4wENpDqLFZG+D8_zzjhXokjMfdbmgTKTL49EFcfdVEtQ@mail.gmail.com>
  2 siblings, 0 replies; 12+ messages in thread
From: Supercilious Dude @ 2019-10-18 22:19 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS, Qu Wenruo

It would be useful to have the ability to scrub only the metadata.
In many cases the data is so large that a full scrub is not feasible.
In my "little" test system of 34TB a full scrub takes many hours and
the IOPS saturate the disks to the extent that the volume is unusable
due to the high latencies. Ideally there would be a way to rate limit
the scrub operation I/Os so that it can happen in the background
without impacting the normal workload.
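
The closest existing knob I've found is the scrub I/O-priority option,
though as far as I can tell it only helps when the I/O scheduler
actually honours ioprio, so it's a partial workaround at best:

    # run the scrub threads in the idle I/O class (mountpoint is an example)
    btrfs scrub start -c 3 /mnt/bigvolume
    # -c is the ioprio class (1 realtime, 2 best-effort, 3 idle);
    # -n 0-7 picks the level within the class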



* Re: MD RAID 5/6 vs BTRFS RAID 5/6
  2019-10-17 18:23     ` Graham Cobb
@ 2019-10-20 21:41       ` Chris Murphy
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Murphy @ 2019-10-20 21:41 UTC (permalink / raw)
  To: Graham Cobb; +Cc: Chris Murphy, Btrfs BTRFS, Qu Wenruo

On Thu, Oct 17, 2019 at 8:23 PM Graham Cobb <g.btrfs@cobb.uk.net> wrote:
>
> On 17/10/2019 16:57, Chris Murphy wrote:
> > On Wed, Oct 16, 2019 at 10:07 PM Jon Ander MB <jonandermonleon@gmail.com> wrote:
> >>
> >> It would be interesting to know the pros and cons of this setup that
> >> you are suggesting vs zfs.
> >> +zfs detects and corrects bitrot (
> >> http://www.zfsnas.com/2015/05/24/testing-bit-rot/ )
> >> +zfs has working raid56
> >> -modules out of kernel for license incompatibilities (a big minus)
> >>
> >> BTRFS can detect bitrot but... are we sure it can fix it? (can't seem
> >> to find any conclusive doc about it right now)
> >
> > Yes. Active fixups with scrub since 3.19. Passive fixups since 4.12.
>
> Presumably this is dependent on checksums? So neither detection nor
> fixup happen for NOCOW files? Even a scrub won't notice because scrub
> doesn't attempt to compare both copies unless the first copy has a bad
> checksum -- is that correct?

On a normal (passive) read, corruption can't be detected for nocow
files, since nocow means nodatasum. If the problem happens in metadata,
it's detected because metadata is always cow and always has a csum.

I'm not sure what the scrub behavior is for nocow. There's enough
information to detect a mismatch in normal (not degraded) operation,
but I don't know whether Btrfs scrub warns about this case.


> If I understand correctly, metadata always has checksums so that is true
> for filesystem structure. But for no-checksum files (such as nocow
> files) the corruption will be silent, won't it?

Corruption is always silent for nocow data. Same as any other
filesystem, it's up to the application layer to detect it.
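
For reference, nocow is the +C attribute, so it's easy to see which
files have opted out of csums (paths here are just examples):

    # new files created under this directory inherit nocow (hence no csums)
    chattr +C /var/lib/libvirt/images
    # +C must be set while a file is still empty (or inherited at creation)
    lsattr -d /var/lib/libvirt/images
    lsattr /var/lib/libvirt/images/disk.img      # look for the 'C' flag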


-- 
Chris Murphy


* Re: MD RAID 5/6 vs BTRFS RAID 5/6
       [not found]     ` <CAGmvKk4wENpDqLFZG+D8_zzjhXokjMfdbmgTKTL49EFcfdVEtQ@mail.gmail.com>
@ 2019-10-20 21:43       ` Chris Murphy
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Murphy @ 2019-10-20 21:43 UTC (permalink / raw)
  To: Supercilious Dude; +Cc: Chris Murphy, Btrfs BTRFS, Qu Wenruo

On Sat, Oct 19, 2019 at 12:18 AM Supercilious Dude
<supercilious.dude@gmail.com> wrote:
>
> It would be useful to have the ability to scrub only the metadata. In many cases the data is so large that a full scrub is not feasible. In my "little" test system of 34TB a full scrub takes many hours and the IOPS saturate the disks to the extent that the volume is unusable due to the high latencies. Ideally there should be a way to rate limit the scrub operation so that it can happen in the background without impacting the normal workload.

In effect a 'btrfs check' is a read-only scrub of metadata, since all
of the metadata has to be read for it. Of course it's more expensive
than just confirming checksums are OK, because it also runs a bunch of
sanity and logical tests that take much longer.
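
Roughly (device and mountpoint are placeholders):

    # offline, read-only metadata check; the filesystem must be unmounted
    btrfs check --readonly /dev/sdb

    # whereas scrub runs online and reads both data and metadata
    btrfs scrub start -B /mnt/array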


-- 
Chris Murphy


* Re: MD RAID 5/6 vs BTRFS RAID 5/6
  2019-10-16 19:42 ` Zygo Blaxell
@ 2019-10-21 15:27   ` Edmund Urbani
  2019-10-21 19:34     ` Zygo Blaxell
  0 siblings, 1 reply; 12+ messages in thread
From: Edmund Urbani @ 2019-10-21 15:27 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On 10/16/19 9:42 PM, Zygo Blaxell wrote:
>
> For raid5 I'd choose btrfs -draid5 -mraid1 over mdadm raid5
> sometimes--even with the write hole, I'd expect smaller average data
> losses than mdadm raid5 + write hole mitigation due to the way disk
> failure modes are distributed.  

What about the write hole and RAID-1? I understand the write hole is most
commonly associated with RAID-5, but it is also a problem with other RAID levels.

You would need to scrub after a power failure to be sure that no metadata gets
corrupted even with RAID-1. Otherwise you might still have inconsistent copies
and the problem may only become apparent when one drive fails. Can we be certain
that scrubbing is always able to fix such inconsistencies with RAID-1?

Regards,
 Edmund



* Re: MD RAID 5/6 vs BTRFS RAID 5/6
  2019-10-21 15:27   ` Edmund Urbani
@ 2019-10-21 19:34     ` Zygo Blaxell
  2019-10-23 16:32       ` Edmund Urbani
  0 siblings, 1 reply; 12+ messages in thread
From: Zygo Blaxell @ 2019-10-21 19:34 UTC (permalink / raw)
  To: Edmund Urbani; +Cc: linux-btrfs

On Mon, Oct 21, 2019 at 05:27:54PM +0200, Edmund Urbani wrote:
> On 10/16/19 9:42 PM, Zygo Blaxell wrote:
> >
> > For raid5 I'd choose btrfs -draid5 -mraid1 over mdadm raid5
> > sometimes--even with the write hole, I'd expect smaller average data
> > losses than mdadm raid5 + write hole mitigation due to the way disk
> > failure modes are distributed.  
> 
> What about the write hole and RAID-1? I understand the write hole is most
> commonly associated with RAID-5, but it is also a problem with other RAID levels.

Filesystem tree updates are atomic on btrfs.  Everything persistent on
btrfs is part of a committed tree.  The current writes in progress are
initially stored in an uncommitted tree, which consists of blocks that
are isolated from any committed tree block.  The algorithm relies on
two things:

	Isolation:  every write to any uncommitted data block must not
	affect the correctness of any data in any committed data block.

	Ordering:  a commit completes all uncommitted tree updates on
	all disks in any order, then updates superblocks to point to the
	updated tree roots.  A barrier is used to separate these phases
	of updates across disks.

Isolation and ordering make each transaction atomic.  If either
requirement is not implemented correctly, data or metadata may be
corrupted.  If metadata is corrupted, the filesystem can be destroyed.

Transactions close write holes in all btrfs RAID profiles except 5
and 6.  mdadm RAID levels 0, 1, 10, and linear have isolation properties
sufficient for btrfs.  mdadm RAID 4/5/6 work only if the mdadm write hole
mitigation feature is enabled to provide isolation.  All mdadm and btrfs
RAID profiles provide sufficient ordering.

The problem on mdadm/btrfs raid5/6 is that committed data blocks are not
fully isolated from uncommitted data blocks when they share a parity block
in the RAID5/6 layer (wherever that layer is).  This is why the problem
only affects raid5/6 on btrfs (and why it also applies to mdadm raid5/6).
In this case, writing to any single block within a stripe makes the parity
block inconsistent with previously committed data blocks.  This violates
the isolation requirement for CoW transactions: a committed data block
is related to uncommitted data blocks in the same stripe through the
parity block and the RAID5/6 data/parity equation.

The isolation failure affects only parity blocks.  You could kill
power all day long and not lose any committed data on any btrfs raid
profile--as long as none of the disks fail and each disk's firmware
implements write barriers correctly or write cache is disabled (sadly,
even in 2019, a few drive models still don't have working barriers).
btrfs on raid5/6 is as robust as raid0 if you ignore the parity blocks.

> You would need to scrub after a power failure to be sure that no meta data gets
> corrupted even with RAID-1. Otherwise you might still have inconsistent copies
> and the problem may only become apparent when one drive fails. 

There are no inconsistencies expected with RAID1 unless the hardware is
already failing (or there's a software/firmware bug).

Scrub after power failure is required only if raid5/6 is used.
All committed data and metadata blocks will be consistent on all
RAID profiles--even in raid5/6, only parity blocks are inconsistent.
Uncommitted data returns to free space when the filesystem is mounted,
so the consistency of uncommitted data blocks doesn't matter.

Parity block updates are not handled by the btrfs transaction update
mechanism.  The scrub after power failure is required specifically
to repair inconsistent parity blocks.  This should happen as soon as
possible after a crash, while all the data blocks are still readable.

> Can we be certain that scrubbing is always able to fix such
> inconsistencies...

Scrub can reliably repair raid5/6 parity blocks assuming all data blocks
are still correct and readable.

> ...with RAID-1?

We cannot be certain that scrub fixes inconsistencies in the general case.
If the hardware is failing, scrub relies on crc32c to detect missing
writes.  Error detection and repair is highly likely, but not certain.

Collisions between old/bad data crc32c and correct data crc32c are
not detected.  nodatasum files (which have no csums) are corrupted.
Metadata has a greater chance of successful error detection because more
fields in metadata are checked than just CRC.

If all the errors are confined to a single drive, device 'replace' is more
appropriate than 'scrub'; however, it can be hard to determine if errors
are confined to a single disk without running 'scrub' first to find out.
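
In command terms that would be something like (placeholder devices and
mountpoint):

    # run scrub first and see where the errors are
    btrfs scrub start -Bd /mnt/array
    btrfs device stats /mnt/array

    # if they're confined to one device, replace it directly
    btrfs replace start /dev/sdc /dev/sdf /mnt/array
    btrfs replace status /mnt/array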

> Regards,
>  Edmund
> 
> 

* Re: MD RAID 5/6 vs BTRFS RAID 5/6
  2019-10-21 19:34     ` Zygo Blaxell
@ 2019-10-23 16:32       ` Edmund Urbani
  2019-10-26  0:01         ` Zygo Blaxell
  0 siblings, 1 reply; 12+ messages in thread
From: Edmund Urbani @ 2019-10-23 16:32 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On 10/21/19 9:34 PM, Zygo Blaxell wrote:
> On Mon, Oct 21, 2019 at 05:27:54PM +0200, Edmund Urbani wrote:
>> On 10/16/19 9:42 PM, Zygo Blaxell wrote:
>>> For raid5 I'd choose btrfs -draid5 -mraid1 over mdadm raid5
>>> sometimes--even with the write hole, I'd expect smaller average data
>>> losses than mdadm raid5 + write hole mitigation due to the way disk
>>> failure modes are distributed.  
>> What about the write hole and RAID-1? I understand the write hole is most
>> commonly associated with RAID-5, but it is also a problem with other RAID levels.
> Filesystem tree updates are atomic on btrfs.  Everything persistent on
> btrfs is part of a committed tree.  The current writes in progress are
> initially stored in an uncommitted tree, which consists of blocks that
> are isolated from any committed tree block.  The algorithm relies on
> two things:
>
> 	Isolation:  every write to any uncommitted data block must not
> 	affect the correctness of any data in any committed data block.
>
> 	Ordering:  a commit completes all uncommitted tree updates on
> 	all disks in any order, then updates superblocks to point to the
> 	updated tree roots.  A barrier is used to separate these phases
> 	of updates across disks.
>
> Isolation and ordering make each transaction atomic.  If either
> requirement is not implemented correctly, data or metadata may be
> corrupted.  If metadata is corrupted, the filesystem can be destroyed.

OK, the ordering enforced with the barrier ensures that all uncommitted data is
persisted before the superblocks are updated. But a power loss, for example,
could still cause the superblock to be updated on only one of two RAID-1 drives.
I assume that is not an issue because mismatching superblocks can be easily
detected (and automatically fixed?) on mount. Otherwise you could still end up
with two RAID-1 disks that each seem consistent in and of themselves but hold
different states (until the superblocks are overwritten on both).

> The isolation failure affects only parity blocks.  You could kill
> power all day long and not lose any committed data on any btrfs raid
> profile--as long as none of the disks fail and each disk's firmware
> implements write barriers correctly or write cache is disabled (sadly,
> even in 2019, a few drive models still don't have working barriers).
> btrfs on raid5/6 is as robust as raid0 if you ignore the parity blocks.
I hope my WD Reds implement write barriers correctly. Does anyone know for certain?





* Re: MD RAID 5/6 vs BTRFS RAID 5/6
  2019-10-23 16:32       ` Edmund Urbani
@ 2019-10-26  0:01         ` Zygo Blaxell
  0 siblings, 0 replies; 12+ messages in thread
From: Zygo Blaxell @ 2019-10-26  0:01 UTC (permalink / raw)
  To: Edmund Urbani; +Cc: linux-btrfs

On Wed, Oct 23, 2019 at 06:32:04PM +0200, Edmund Urbani wrote:
> On 10/21/19 9:34 PM, Zygo Blaxell wrote:
> > On Mon, Oct 21, 2019 at 05:27:54PM +0200, Edmund Urbani wrote:
> >> On 10/16/19 9:42 PM, Zygo Blaxell wrote:
> >>> For raid5 I'd choose btrfs -draid5 -mraid1 over mdadm raid5
> >>> sometimes--even with the write hole, I'd expect smaller average data
> >>> losses than mdadm raid5 + write hole mitigation due to the way disk
> >>> failure modes are distributed.  
> >> What about the write hole and RAID-1? I understand the write hole is most
> >> commonly associated with RAID-5, but it is also a problem with other RAID levels.
> > Filesystem tree updates are atomic on btrfs.  Everything persistent on
> > btrfs is part of a committed tree.  The current writes in progress are
> > initially stored in an uncommitted tree, which consists of blocks that
> > are isolated from any committed tree block.  The algorithm relies on
> > two things:
> >
> > 	Isolation:  every write to any uncommitted data block must not
> > 	affect the correctness of any data in any committed data block.
> >
> > 	Ordering:  a commit completes all uncommitted tree updates on
> > 	all disks in any order, then updates superblocks to point to the
> > 	updated tree roots.  A barrier is used to separate these phases
> > 	of updates across disks.
> >
> > Isolation and ordering make each transaction atomic.  If either
> > requirement is not implemented correctly, data or metadata may be
> > corrupted.  If metadata is corrupted, the filesystem can be destroyed.
> 
> Ok, the ordering enforced with the barrier ensures that all uncommitted data is
> persisted before the superblocks are updated. But eg. a power loss could still
> cause the superblock to be updated on only 1 of 2 RAID-1 drives. But I assume
> that is not an issue because mismatching superblocks can be easily detected
> (automatically fixed?) on mount. Otherwise you could still end up with 2 RAID-1
> disks that seem consistent in and of themselves but that hold a different state
> (until the superblocks are overwritten on both). 

During the entire time interval between the first and last superblock
update, both the old and new filesystem tree roots are already completely
written out on all disks.  Either is the correct state of the filesystem.
fsync() and similar functions only require that the function not return
to userspace until after the new state is persisted--they don't specify
what happens if the power fails before fsync() returns.  In btrfs,
metadata trees will be fully updated or fully rolled back.

If an array is fully online before and after a power failure, the worst
possible superblock inconsistency is that some superblocks point to the
tree root for commit N and the others point to N+1 (there's a very small
window this, superblocks are almost always consistent).  If N is chosen
during mount, transaction N+1 is overwritten by a new transaction N+1
after the filesystem is mounted.  If N+1 is chosen during mount then
the filesystem simply proceeds to transaction N+2.

If you split a RAID1 pair after a power failure and mount each mirror
drive on two separate machines, you could see different filesystem
contents on the two machines.  One disk may present the contents
for transid N, the other N+1.  It is not a good idea to recombine the
disks if the separated mirrors are both mounted read-write.  Both disks
will contain data that passes transid and csum consistency checks, but
reflect the contents of different transaction histories.  Choose one
disk to keep, and wipe the other before reinserting it into the array.
mdadm does this better--there are event counts and timestamps that can
more reliably reject inconsistent disks.
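
If you need to check by hand, both the btrfs superblock generation and
the mdadm event count are visible from userspace, e.g. (placeholder
devices):

    # btrfs: compare the generation field across the mirror members
    btrfs inspect-internal dump-super /dev/sdb | grep '^generation'
    btrfs inspect-internal dump-super /dev/sdc | grep '^generation'

    # mdadm: compare event counts and update times across members
    mdadm --examine /dev/sdb1 /dev/sdc1 | grep -E 'Events|Update Time'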

> > The isolation failure affects only parity blocks.  You could kill
> > power all day long and not lose any committed data on any btrfs raid
> > profile--as long as none of the disks fail and each disk's firmware
> > implements write barriers correctly or write cache is disabled (sadly,
> > even in 2019, a few drive models still don't have working barriers).
> > btrfs on raid5/6 is as robust as raid0 if you ignore the parity blocks.
> I hope my WD Reds implement write barriers correctly. Does anyone know for certain?

Some WD Red and Green models definitely do not have correct write barrier
behavior.  Some WD Black models are OK until they have bad sectors,
then during sector reallocation events they discard the contents of the
write cache, corrupting the filesystem.

This seems to affect older models more than newer ones, but drives with
bad firmware can sit in sales channels for years before they reach end
consumers.  Also when a drive is failing its write caching correctness may
change, turning a trivially repairable bad sector event into irreparable
filesystem loss for single-disk filesystems.

When in doubt, disable write cache (hdparm -W0) at boot and after any
SATA bus reset (bus resets revert to the default and re-enable the
write cache).
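
Something like this in a boot script is enough for the simple case (the
device list is a placeholder; a udev rule is more robust but longer):

    # disable the volatile write cache on every array member
    for dev in /dev/sd[b-e]; do
        hdparm -W0 "$dev"
    done
    hdparm -W /dev/sdb    # verify: should report "write-caching = 0 (off)"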

> 
> 
> 

end of thread

Thread overview: 12+ messages
2019-10-16 15:40 MD RAID 5/6 vs BTRFS RAID 5/6 Edmund Urbani
2019-10-16 19:42 ` Zygo Blaxell
2019-10-21 15:27   ` Edmund Urbani
2019-10-21 19:34     ` Zygo Blaxell
2019-10-23 16:32       ` Edmund Urbani
2019-10-26  0:01         ` Zygo Blaxell
2019-10-17  4:07 ` Jon Ander MB
2019-10-17 15:57   ` Chris Murphy
2019-10-17 18:23     ` Graham Cobb
2019-10-20 21:41       ` Chris Murphy
2019-10-18 22:19     ` Supercilious Dude
     [not found]     ` <CAGmvKk4wENpDqLFZG+D8_zzjhXokjMfdbmgTKTL49EFcfdVEtQ@mail.gmail.com>
2019-10-20 21:43       ` Chris Murphy
