All of lore.kernel.org
 help / color / mirror / Atom feed
* Loss of connection to Half of the drives
@ 2015-12-22 19:12 Dave S
  2015-12-22 20:02 ` Chris Murphy
  0 siblings, 1 reply; 15+ messages in thread
From: Dave S @ 2015-12-22 19:12 UTC (permalink / raw)
  To: linux-btrfs

Hi Everyone,

I've been testing btrfs by simulating typical real-world failure
scenarios, and I've encountered one that I'm having trouble recovering
from without resorting to btrfs restore.

If anyone has any advice it'd be much appreciated.  Thanks.

Some background:

I have 2 separate disk drawers (on 2 different SAS controllers) and
I'm using 10 disks in each drawer, in a 20-disk btrfs raid10
configuration -- the metadata profile is the default.

The scenario that I'm testing is to start a heavy write to the
filesystem and then pull one of the SAS cables so that half of the disks
suddenly disappear from the system.  Let's face it, this is something
that can happen in a real system: one power supply shorts out and
trips the breaker... power fails on the non-UPS power supply and the
UPS-backed supply fails when it suddenly has to handle the entire
load... etc.

I suppose what I would expect to happen is that the filesystem would
lock up to prevent metadata problems like split-brain.  Granted,
writes could continue to segments on unaffected disks.  Wouldn't the
generation numbers allow btrfs to sort out which is old and which is
new at mount time, resolving the difference with a simple balance
operation?

When I try to mount, it gives the following:
# mount LABEL=scratch2 /scratch2
mount: wrong fs type, bad option, bad superblock on /dev/sdal,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

When I try a btrfs rescue super-recover, I get a baffling fsid
mismatch and a segfault:
# ./btrfs rescue super-recover /dev/sdc
Make sure this is a btrfs disk otherwise the tool will destroy other
fs, Are you sure? [y/N]: y
parent transid verify failed on 20971520 wanted 8 found 4
parent transid verify failed on 20971520 wanted 8 found 4
parent transid verify failed on 20971520 wanted 8 found 4
parent transid verify failed on 20971520 wanted 8 found 4
Ignoring transid failure
fsid mismatch, want=bff1bc57-d5aa-48f0-ae2d-6c130b49e87b,
have=1a5d52ef-882f-4f33-8463-4d8647878626
Couldn't read tree root
Failed to recover bad superblocks
Segmentation fault


I get the above with both the CentOS 7 RPM-installed btrfs-tools and
the devel version I compiled (the mkfs was done with version 3.19.1):

Installed: 3.19.1
Devel: 4.3.1

# ./btrfs rescue super-recover /dev/sdc
Make sure this is a btrfs disk otherwise the tool will destroy other
fs, Are you sure? [y/N]: y
parent transid verify failed on 20971520 wanted 8 found 4
parent transid verify failed on 20971520 wanted 8 found 4
parent transid verify failed on 20971520 wanted 8 found 4
parent transid verify failed on 20971520 wanted 8 found 4
Ignoring transid failure
fsid mismatch, want=bff1bc57-d5aa-48f0-ae2d-6c130b49e87b,
have=1a5d52ef-882f-4f33-8463-4d8647878626
Couldn't read tree root
Failed to recover bad superblocks
Segmentation fault

To ensure it wasn't picking up a fsid from an old test I dd'd out the
3 superblock locations on each disk and re-ran the test.  Same
results.
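
For reference, the zeroing was roughly along these lines on each member
device (sdc standing in for each one), assuming the usual btrfs
superblock copies at 64KiB, 64MiB and 256GiB:

for off in 65536 67108864 274877906944; do
    dd if=/dev/zero of=/dev/sdc bs=4096 seek=$((off / 4096)) count=1
done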

One curiosity is that the write that is happening when I pull the SAS
cable continues uninterrupted -- I have left it running for a few minutes
and it doesn't seem to stop.  At that point, I stop the write, unmount the
FS, power off the host, reconnect the cable, and boot back up.  Now I
can't mount the FS.

I've tried this a few times with different "repair" commands.  None
seem to clear up the metadata disagreement.  I've tried various
combinations of:
btrfs check
btrfs-zero-log
btrfs rescue chunk-recover <- finishes successfully but still can't mount
btrfs rescue super-recover <- see above
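
(The invocations were roughly along these lines, with /dev/sdc standing
in for whichever member device I pointed them at; btrfs check was run
without --repair.)

btrfs check /dev/sdc
btrfs-zero-log /dev/sdc
btrfs rescue chunk-recover /dev/sdc
btrfs rescue super-recover /dev/sdc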


The info requested from the wiki page (Note that the drives removed by
the SAS cable pull are the /dev/sda? ones):

# uname -a
Linux ceph2 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015
x86_64 x86_64 x86_64 GNU/Linux

# btrfs fi show
Label: 'scratch2'  uuid: bff1bc57-d5aa-48f0-ae2d-6c130b49e87b
Total devices 20 FS bytes used 30.78GiB
devid    1 size 931.51GiB used 5.02GiB path /dev/sdc
devid    2 size 931.51GiB used 5.00GiB path /dev/sdd
devid    3 size 931.51GiB used 5.00GiB path /dev/sde
devid    4 size 931.51GiB used 5.00GiB path /dev/sdf
devid    5 size 931.51GiB used 5.00GiB path /dev/sdg
devid    6 size 931.51GiB used 5.00GiB path /dev/sdh
devid    7 size 931.51GiB used 5.00GiB path /dev/sdi
devid    8 size 931.51GiB used 5.00GiB path /dev/sdj
devid    9 size 931.51GiB used 5.00GiB path /dev/sdk
devid   10 size 931.51GiB used 5.00GiB path /dev/sdl
devid   11 size 931.51GiB used 3.00GiB path /dev/sdag
devid   12 size 931.51GiB used 3.00GiB path /dev/sdah
devid   13 size 931.51GiB used 3.00GiB path /dev/sdai
devid   14 size 931.51GiB used 3.00GiB path /dev/sdaj
devid   15 size 931.51GiB used 3.00GiB path /dev/sdak
devid   16 size 931.51GiB used 3.00GiB path /dev/sdal
devid   17 size 931.51GiB used 4.00GiB path /dev/sdam
devid   18 size 931.51GiB used 4.00GiB path /dev/sdan
devid   19 size 931.51GiB used 3.01GiB path /dev/sdao
devid   20 size 931.51GiB used 3.01GiB path /dev/sdap

btrfs-progs v3.19.1

# dmesg|grep -i btrfs
[   15.938327] Btrfs loaded
[   15.938826] BTRFS: device label scratch2 devid 15 transid 7 /dev/sdak
[   15.939138] BTRFS: device label scratch2 devid 14 transid 7 /dev/sdaj
[   15.939163] BTRFS: device label scratch2 devid 11 transid 7 /dev/sdag
[   15.939960] BTRFS: device label scratch2 devid 19 transid 7 /dev/sdao
[   15.940056] BTRFS: device label scratch2 devid 18 transid 7 /dev/sdan
[   15.941805] BTRFS: device label scratch2 devid 4 transid 9 /dev/sdf
[   15.944312] BTRFS: device label scratch2 devid 10 transid 9 /dev/sdl
[   15.952288] BTRFS: device label scratch2 devid 7 transid 9 /dev/sdi
[   15.962451] BTRFS: device label scratch2 devid 2 transid 9 /dev/sdd
[   15.962995] BTRFS: device label scratch2 devid 3 transid 9 /dev/sde
[   15.963431] BTRFS: device label scratch2 devid 13 transid 7 /dev/sdai
[   15.974559] BTRFS: device label scratch2 devid 5 transid 9 /dev/sdg
[   15.979969] BTRFS: device label scratch2 devid 16 transid 7 /dev/sdal
[   15.981962] BTRFS: device label scratch2 devid 8 transid 9 /dev/sdj
[   15.988912] BTRFS: device label scratch2 devid 1 transid 9 /dev/sdc
[   15.998077] BTRFS: device label scratch2 devid 20 transid 7 /dev/sdap
[   16.001058] BTRFS: device label scratch2 devid 12 transid 7 /dev/sdah
[   16.008946] BTRFS: device label scratch2 devid 9 transid 9 /dev/sdk
[   16.010560] BTRFS: device label scratch2 devid 17 transid 7 /dev/sdam
[   16.011840] BTRFS: device label scratch2 devid 6 transid 9 /dev/sdh
[  120.971556] BTRFS info (device sdh): disk space caching is enabled
[  120.971560] BTRFS: has skinny extents
[  120.974091] BTRFS: failed to read chunk root on sdh
[  120.982591] BTRFS: open_ctree failed
[  142.749246] btrfs[2551]: segfault at 100108 ip 000000000044c503 sp
00007fff8f261290 error 6 in btrfs[400000+83000]
[  190.570074] btrfs[2614]: segfault at 100108 ip 000000000044c503 sp
00007fff8f11aef0 error 6 in btrfs[400000+83000]
[  215.734281] btrfs[2656]: segfault at 100108 ip 000000000044f5a3 sp
00007fff4cfd23e0 error 6 in btrfs[400000+88000]
[ 2545.896233] btrfs[4576]: segfault at 100108 ip 000000000044c503 sp
00007ffffc919400 error 6 in btrfs[400000+83000]
[ 3000.106228] BTRFS info (device sdh): disk space caching is enabled
[ 3000.106233] BTRFS: has skinny extents
[ 3000.148162] BTRFS: failed to read chunk root on sdh
[ 3000.168649] BTRFS: open_ctree failed
[ 3071.430479] btrfs[4701]: segfault at 100108 ip 000000000046e06a sp
00007fff44539688 error 6 in btrfs[400000+c7000]
[ 3147.025032] btrfs[4755]: segfault at 100108 ip 000000000044c503 sp
00007fffa79bbf40 error 6 in btrfs[400000+83000]


Sincerely
-Dave

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Loss of connection to Half of the drives
  2015-12-22 19:12 Loss of connection to Half of the drives Dave S
@ 2015-12-22 20:02 ` Chris Murphy
  2015-12-22 23:56   ` Donald Pearson
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Murphy @ 2015-12-22 20:02 UTC (permalink / raw)
  To: Dave S; +Cc: Btrfs BTRFS

On Tue, Dec 22, 2015 at 12:12 PM, Dave S <bigdave.schulz@gmail.com> wrote:
> To ensure it wasn't picking up a fsid from an old test I dd'd out the
> 3 superblock locations on each disk and re-ran the test.  Same
> results.

The fsid is strewn throughout the fs metadata, not just in the
superblocks, so it might be finding a stale fsid.  I find dmcrypt
useful for quickly reformatting a drive: just luksFormat it and reuse
the same passphrase each time.  In your case this is a PITA because of
how many drives you have.  A pile of SEDs would make this much easier,
with crypto erase and no unlock passphrase.
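
Roughly like this per device, as a sketch (sdc just stands in for each
member; you'd then mkfs on the /dev/mapper devices instead of the raw
drives):

cryptsetup luksFormat /dev/sdc         # reuse the same passphrase each time
cryptsetup luksOpen /dev/sdc sdc_crypt
mkfs.btrfs -d raid10 /dev/mapper/*_crypt   # once all members are opened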

Of course, stale metadata shouldn't confuse the filesystem, so it's
possible you've found a bug.  With the kernel you have, it's hard to say
exactly what equivalent upstream btrfs kernel code it contains, though --
maybe through 3.19?  But even from 3.19 to 4.4 there have been a lot of
code changes.

> One curiosity is that the write that is happening when I pull the SAS
> cable continues uninterrupted -- I have left it for a few minutes and
> it doesn't seem to stop.

Right.  As far as I know, btrfs has no understanding of failed devices
at all: no notion of when a device should be ignored and the volume go
degraded, and not even of when there are too many failures and the
volume needs to go read-only.

Also understand that with Btrfs RAID 10 you can't reliably lose more
than 1 drive.  It's not like a strict raid1+0 where you can lose all of
the "copy 1" *OR* "copy 2" mirrors.  With a two-drive failure, maybe
it'll let you mount -o degraded, but there's no way to know in advance
(either by the user or the fs) whether both copies of some metadata
happen to live on those two particular drives that are now gone.  So
really there's more than one kind of degraded in such a case: 1 device
lost is degraded with all data intact; 2+ devices lost is degraded with
increasing chances of data loss for each lost drive.  And you have fully
50% of the drives lost in this test case, so the volume has imploded and
just doesn't know what to do.




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Loss of connection to Half of the drives
  2015-12-22 20:02 ` Chris Murphy
@ 2015-12-22 23:56   ` Donald Pearson
  2015-12-23  4:13     ` Duncan
  0 siblings, 1 reply; 15+ messages in thread
From: Donald Pearson @ 2015-12-22 23:56 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Dave S, Btrfs BTRFS

>
> Also understand with Brfs RAID 10 you can't lose more than 1 drive
> reliably. It's not like a strict raid1+0 where you can lose all of the
> "copy 1" *OR* "copy 2" mirrors.

Pardon my pea brain but this sounds like a pretty bad design flaw?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Loss of connection to Half of the drives
  2015-12-22 23:56   ` Donald Pearson
@ 2015-12-23  4:13     ` Duncan
  2015-12-23 15:53       ` Donald Pearson
  0 siblings, 1 reply; 15+ messages in thread
From: Duncan @ 2015-12-23  4:13 UTC (permalink / raw)
  To: linux-btrfs

Donald Pearson posted on Tue, 22 Dec 2015 17:56:29 -0600 as excerpted:


>> Also understand with Brfs RAID 10 you can't lose more than 1 drive
>> reliably. It's not like a strict raid1+0 where you can lose all of the
>> "copy 1" *OR* "copy 2" mirrors.
> 
> Pardon my pea brain but this sounds like a pretty bad design flaw?

It's not a design flaw, it's EUNIMPLEMENTED.  Btrfs raid1, unlike say 
mdraid1 (and now various hardware raid vendors), implements exactly two 
copy raid1 -- each chunk is mirrored to exactly two devices.  And btrfs 
raid10, because it builds on btrfs raid1, is likewise exactly two copies.

With raid1 on two devices, where those two copies go is fixed: one to 
each device.  With raid1 on more than two devices, the current chunk-
allocator will allocate one copy each to the two devices with the most 
free space left, so that if the devices are all the same size, they'll 
all be used to about the same level and will run out of space at about 
the same time.  (If they're not the same size, with one much larger than 
the others, the largest will get one copy every time, with the other 
copy going to the second largest, or to each of the others in turn once 
the remaining free space evens out.)

Similarly with raid10, except each strip is two-way mirrored and a 
stripe is created over the mirrors.

And because the raid is managed and allocated per-chunk, drop more than a 
single device, and it's very likely you _will_ be dropping both copies of 
_some_ chunks on raid1, and some strips of chunks on raid10, making them 
entirely unavailable.

In that case you _might_ be able to mount degraded,ro, but you won't be 
able to mount writable.

The other btrfs-only alternative at this point would be btrfs raid6, 
which should let you drop TWO devices before data is simply missing and 
unrecreatable from parity.  But btrfs raid6 is far newer and less mature 
than either raid1 or raid10, and running the truly latest versions, up 
to v4.4 or so (which is actually soon to be released), is very strongly 
recommended, as older versions WILL quite likely have issues.  As it 
happens, kernel v4.4 is an LTS series, so the timing for btrfs raid5 and 
raid6 there is quite nice: 4.4 should see them finally reasonably 
stable, and being LTS, it should continue to be supported for quite some 
time.
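
For concreteness, and only as a sketch given the maturity caveats above 
(device names borrowed from the OP's listing), that would look something 
like:

mkfs.btrfs -m raid6 -d raid6 -L scratch2 /dev/sd[c-l] /dev/sda[g-p]

or, converting an existing filesystem in place:

btrfs balance start -mconvert=raid6 -dconvert=raid6 /scratch2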

(The current btrfs list recommendation in general is to stay within two 
LTS versions in order to avoid getting /too/ far behind, as while 
stabilizing, btrfs isn't entirely stable and mature yet, and further back 
than that it simply gets unrealistic to support very well.  That's 3.18 
and 4.1 currently, with 3.18 soon to drop as 4.4 is soon to release as 
the next LTS.  But as btrfs stabilizes further, it's somewhat likely that 
4.1, or at least 4.4, will continue to be reasonably supported beyond the 
second-LTS-back phase, perhaps to the third, and sometime after that, 
support will probably last more or less as long as the LTS stable branch 
continues getting updates.)

But even btrfs raid6 only lets you drop two devices before general data 
loss occurs.

The other alternative, as regularly used and recommended by one regular 
poster here, would be btrfs raid1 on top of mdraid0 or possibly mdraid10 
or whatever.  The same general principle would apply to btrfs raid5 and 
raid6 as they mature, on top of mdraidN, with the important point being 
that the btrfs level has redundancy, raid1/10/5/6, since it has real-time 
data and metadata checksumming and integrity management features that are 
lacking in mdraid.  By putting the btrfs raid with either redundancy or 
parity on top, you get the benefit of actual error recovery that would be 
lacking if it was btrfs raid0 on top.

That would let you manage loss of one entire set of the underlying mdraid 
devices, one copy of the overlying btrfs raid1/10 or one strip/parity of 
btrfs raid5, which could then be rebuilt from the other two, while 
maintaining btrfs data and metadata integrity as one copy (or stripe-
minus-one-plus-one-parity) would always exist.  With btrfs raid6, it 
would of course let you lose two of the underlying sets of devices 
composing the btrfs raid6.

In the precise scenario the OP posted, that would work well, since in the 
huge numbers of devices going offline case, it'd always be complete sets 
of devices, corresponding to one of the underlying mdraidNs, because the 
scenario is that set getting unplugged or whatever.

Of course in the more general case of N random devices going offline, 
with the N devices coming from any of the underlying mdraidNs, it could 
still result in not all data being available to the btrfs raid level, 
but except for mdraid0, the chances of that happening are still 
relatively low, and even with mdraid0, they're still within reason, if 
not /as/ low.  But that general scenario isn't what was posted; the 
posted scenario was entire specific sets going offline, and such a setup 
could handle that quite well indeed.


Meanwhile, I /did/ say EUNIMPLEMENTED.  N-way-mirroring has long been on 
the roadmap for implementation shortly after raid56 mode, which was 
finally nominally complete in 3.19, and is reasonably stabilized in 4.4, 
so based on the roadmap, N-way-mirroring should be one of the next major 
features to appear.  That would let you do 3-way-mirroring, 4-way-
mirroring, etc, which would then give you loss of N-1 devices before risk 
of data loss.  That has certainly been my most hotly anticipated feature 
since 3.5 or so, when I first looked at btrfs raid1 and found it only had 
2-way-mirroring, but saw N-way-mirroring roadmapped for after raid56, 
which at the time was /supposed/ to be introduced in 3.6, two and a half 
years before it was actually fully implemented in 3.19.

That's N-way-mirroring in the raid1 context, of course.  In the raid10 
context, it would then obviously translate into being able to specify at 
least one of the stripe width or the number of mirrors, with the other 
either determined from the first and the number of devices present, or 
also specifiable at the same time.

And of course N-way-mirroring in the raid10 context would be the most 
direct solution to the current discussion... were it available now, or 
were this discussion happening in the future when it is.  Lacking it as 
a current solution, the closest direct options allowing loss of one 
device on a many-device btrfs are btrfs raid1/5/10, with btrfs raid6 
allowing a two-device drop.  The nearest comparable solution isn't quite 
as direct: a btrfs raid1/5/10 (or btrfs raid6 for double set loss) on 
top of mdraidN.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Loss of connection to Half of the drives
  2015-12-23  4:13     ` Duncan
@ 2015-12-23 15:53       ` Donald Pearson
  2015-12-23 18:20         ` Goffredo Baroncelli
  2015-12-24  1:21         ` Duncan
  0 siblings, 2 replies; 15+ messages in thread
From: Donald Pearson @ 2015-12-23 15:53 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

On Tue, Dec 22, 2015 at 10:13 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Donald Pearson posted on Tue, 22 Dec 2015 17:56:29 -0600 as excerpted:
>
>
>>> Also understand with Brfs RAID 10 you can't lose more than 1 drive
>>> reliably. It's not like a strict raid1+0 where you can lose all of the
>>> "copy 1" *OR* "copy 2" mirrors.
>>
>> Pardon my pea brain but this sounds like a pretty bad design flaw?
>
> It's not a design flaw, it's EUNIMPLEMENTED.  Btrfs raid1, unlike say
> mdraid1 (and now various hardware raid vendors), implements exactly two
> copy raid1 -- each chunk is mirrored to exactly two devices.  And btrfs
> raid10, because it builds on btrfs raid1, is likewise exactly two copies.
>
> With raid1 on two devices, where those two copies go is defined, one to
> each device.  With raid1 on more than two devices, the current chunk-
> allocator will allocate one copy each to the two devices with the most
> free space left, so that if the devices are all the same size, they'll
> all be used to about the same level and will run out of space at about
> the same time.  (If they're not the same size, with one much larger than
> the others, it'll get one copy all the time, with the other copy going to
> the second largest or to each in turn once remaining empty sizes even
> out.)
>
> Similarly with raid10, except each strip is two-way mirrored and a stripe
> created of the mirrors.
>
> And because the raid is managed and allocated per-chunk, drop more than a
> single device, and it's very likely you _will_ be dropping both copies of
> _some_ chunks on raid1, and some strips of chunks on raid10, making them
> entirely unavailable.
>
> In that case you _might_ be able to mount degraded,ro, but you won't be
> able to mount writable.
>
> The other btrfs-only alternative at this point would be btrfs raid6,
> which should let you drop TWO devices before data is simply missing and
> unrecreatable from parity.  But btrfs raid6 is far newer and less mature
> than either raid1 or raid10, and running the truly latest versions is
> very strongly recommended upto v4.4 or so, which is actually soon to be
> released now, as older versions WILL quite likely have issues.  As it
> happens, kernel v4.4 is an LTS series, so the timing for btrfs raid5 and
> raid6 there is quite nice, as 4.4 should see them finally reasonably
> stable, and being LTS, should continue to be supported for quite some
> time.
>
> (The current btrfs list recommendation in general is to stay within two
> LTS versions in ordered to avoid getting /too/ far behind, as while
> stabilizing, btrfs isn't entirely stable and mature yet, and further back
> then that simply gets unrealistic to support very well.  That's 3.18 and
> 4.1 currently, with 3.18 being soon to drop as 4.4 is soon to release as
> the next LTS.  But as btrfs stabilizes further, it's somewhat likely that
> 4.1 or at least 4.4, will continue to be reasonably supported beyond the
> second LTS back phase, perhaps to the third, and sometime after that,
> support will probably last more or less as long as the LTS stable branch
> continues getting updates.)
>
> But even btrfs raid6 only lets you drop two devices before general data
> loss occurs.
>
> The other alternative, as regularly used and recommended by one regular
> poster here, would be btrfs raid1 on top of mdraid0 or possibly mdraid10
> or whatever.  The same general principle would apply to btrfs raid5 and
> raid6 as they mature, on top of mdraidN, with the important point being
> that the btrfs level has redundancy, raid1/10/5/6, since it has real-time
> data and metadata checksumming and integrity management features that are
> lacking in mdraid.  By putting the btrfs raid with either redundancy or
> parity on top, you get the benefit of actual error recovery that would be
> lacking if it was btrfs raid0 on top.
>
> That would let you manage loss of one entire set of the underlying mdraid
> devices, one copy of the overlying btrfs raid1/10 or one strip/parity of
> btrfs raid5, which could then be rebuilt from the other two, while
> maintaining btrfs data and metadata integrity as one copy (or stripe-
> minus-one-plus-one-parity) would always exist.  With btrfs raid6, it
> would of course let you lose two of the underlying sets of devices
> composing the btrfs raid6.
>
> In the precise scenario the OP posted, that would work well, since in the
> huge numbers of devices going offline case, it'd always be complete sets
> of devices, corresponding to one of the underlying mdraidNs, because the
> scenario is that set getting unplugged or whatever.
>
> Of course in the more general random N devices going offline case, with
> the N devices coming from any of the underlying mdraidNs, it could still
> result in not all data being available to the btrfs raid level, but
> except for mdraid0, the chances of it happening are still relatively low,
> and with mdraid0, they're still within reason, if not /as/ low.  But that
> general scenario isn't what was posted; the posted scenario was entire
> specific sets going offline, and that such a setup could handle quite
> well indeed.
>
>
> Meanwhile, I /did/ say EUNIMPLEMENTED.  N-way-mirroring has long been on
> the roadmap for implementation shortly after raid56 mode, which was
> finally nominally complete in 3.19, and is reasonably stabilized in 4.4,
> so based on the roadmap, N-way-mirroring should be one of the next major
> features to appear.  That would let you do 3-way-mirroring, 4-way-
> mirroring, etc, which would then give you loss of N-1 devices before risk
> of data loss.  That has certainly been my most hotly anticipated feature
> since 3.5 or so, when I first looked at btrfs raid1 and found it only had
> 2-way-mirroring, but saw N-way-mirroring roadmapped for after raid56,
> which at the time was /supposed/ to be introduced in 3.6, two and a half
> years before it was actually fully implemented in 3.19.
>
> Of course N-way-mirroring in the raid1 context.  In the raid10 context,
> it would then obviously translate into being able to specify at least one
> of the stripe width or number of mirrors, with the other one either
> determined based on the first and the number of devices present, or also
> specifiable at the same time.
>
> And of course N-way-mirroring in the raid10 context would be the most
> direct solution to the current discussion... were it available currently
> or were this current discussion in the future when it was available.  But
> lacking it as a current solution, the closest direct solutions allowing
> loss-of-one device on a many-device btrfs are btrfs raid1/5/10, with
> btrfs raid6 allowing a two-device drop.  But the nearest comparable
> solution isn't quite as direct, a btrfs raid1/5/10 (or btrfs raid6 for
> double set loss), on top of mdraidN.
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Thanks for that description, but what I'm reading is pretty bad, so
maybe I'm just not comprehending how it isn't pretty bad.

I don't think the n-way mirroring is going to solve the problem in the
context of the current discussion.  For the sake of this example I'm
going to assume that current Raid10 uses the equivalent of N-way
mirroring where N=2 (it may actually be considered N=1 but it isn't
really important for the discussion).

With N-way mirroring you can safely drop N-1 drives without concern of
data loss.  In the context of this discussion let's say you have a 20
drive array and we're going to drop half of those drives because of a
controller failure.  Where N=2 I can't drop more than 1 drive without
rolling the dice.  Where N=10 I can't drop more than 9 drives without
rolling the dice, and because dropping a controller is going to drop
10 drives I need to use 11-way mirroring.

Additionally real Raid10 will run circles around what BTRFS is doing
in terms of performance.  In the 20 drive array you're striping across
10 drives, in BTRFS right now you're striping across 2 no matter what.
So not only do I lose in terms of resilience I lose in terms of
performance.  I assume that N-way-mirroring used with BTRFS Raid10
will also increase the stripe width so that will level out the
performance but you're always going to be short a drive for equal
resilience.

And finally, the elephant in the room that comes with the necessary
11-way mirroring is the usable capacity of that 20-drive array.
Remember, pea brain, so my math may be wrong in application and
calculation, but if it's made of 1T drives for 20T raw, there is only
1.82T usable (20 / 11), and even if I'm completely off in that figure
the point still stands that such a high level of mirroring is going to
excessively consume drive space.

If I were to suggest implementing BTRFS Raid10 professionally and then
explained these circumstances, I'd get laughed out of the data center.

What Raid10 is and means is well defined; what BTRFS is implementing
and calling Raid10 is not Raid10, and it's somewhat irresponsible not
to distinguish it by a different name.  If it's going to continue this
way it really should be called something else, much like Sun called
their parity scheme in ZFS "Raid-Z".

All that said, I completely understand that with traditional Raid10
you can lose 2 drives and lose data (you just have to lose both members
of a mirrored pair), and of course resiliency is not a substitute for
backups.  However, the reason Raid10 is what gets used in the real world
for business-critical storage is that it's (relatively) fast, you
can align your hardware redundancy with your data redundancy, and a
2:1 cost of raw to usable storage is acceptable to the bean counters.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Loss of connection to Half of the drives
  2015-12-23 15:53       ` Donald Pearson
@ 2015-12-23 18:20         ` Goffredo Baroncelli
  2015-12-23 22:15           ` Donald Pearson
  2015-12-24  1:29           ` Duncan
  2015-12-24  1:21         ` Duncan
  1 sibling, 2 replies; 15+ messages in thread
From: Goffredo Baroncelli @ 2015-12-23 18:20 UTC (permalink / raw)
  To: Donald Pearson, Duncan; +Cc: Btrfs BTRFS

On 2015-12-23 16:53, Donald Pearson wrote:
[...]
> 
> Additionally real Raid10 will run circles around what BTRFS is doing
> in terms of performance.  In the 20 drive array you're striping across
> 10 drives, in BTRFS right now you're striping across 2 no matter what.
> So not only do I lose in terms of resilience I lose in terms of
> performance.  I assume that N-way-mirroring used with BTRFS Raid10
> will also increase the stripe width so that will level out the
> performance but you're always going to be short a drive for equal
> resilience.

In the case of RAID10, to the best of my knowledge, BTRFS allocates each CHUNK across *all* the available devices. It uses the usual RAID0 (== striping) over a RAID1 (mirroring).

What you are describing is BTRFS RAID1, i.e. LINEAR over RAID1: each chunk is allocated on *two*, and only *two*, different disks from the disk pool; the disks chosen are the ones with the most free space. Each chunk may be allocated on a different *pair* of disks.

> And finally the elephant in the room that comes with the necessary
> 11-way mirroring is that the usable capacity of that 20 drive array.
> Remember, pea brain so my math may be wrong in application and
> calculation but if it's made of 1T drives for 20T raw, there is only
> 1.82T usable (20 / 11) and if I'm completely off in that figure the
> point is still that such a high level of mirroring is going to
> excessively consume drive space.

Duncan talked about N-way mirroring where each disk contains a copy of the same data. Nobody talked about N-way mirroring where N is less than the number of available disks.

To be honest, some patches appeared in the past to implement a generalized RAID-NxM, where N is the total number of disks and M is the number of redundancy disks, i.e. the filesystem could tolerate the loss of M disks (see http://www.spinics.net/lists/linux-btrfs/msg29245.html).

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Loss of connection to Half of the drives
  2015-12-23 18:20         ` Goffredo Baroncelli
@ 2015-12-23 22:15           ` Donald Pearson
  2015-12-23 23:13             ` Chris Murphy
  2015-12-24  1:29           ` Duncan
  1 sibling, 1 reply; 15+ messages in thread
From: Donald Pearson @ 2015-12-23 22:15 UTC (permalink / raw)
  To: Btrfs BTRFS

On Wed, Dec 23, 2015 at 12:20 PM, Goffredo Baroncelli
<kreijack@inwind.it> wrote:
> On 2015-12-23 16:53, Donald Pearson wrote:
> [...]
>>
>> Additionally real Raid10 will run circles around what BTRFS is doing
>> in terms of performance.  In the 20 drive array you're striping across
>> 10 drives, in BTRFS right now you're striping across 2 no matter what.
>> So not only do I lose in terms of resilience I lose in terms of
>> performance.  I assume that N-way-mirroring used with BTRFS Raid10
>> will also increase the stripe width so that will level out the
>> performance but you're always going to be short a drive for equal
>> resilience.
>
> In case of RAID10,on the best of my knowledge, BTRFS allocate each CHUNK across *all* the available devices. It uses the usual RAID0 (==striping) over a RAID1 (mirroring).
>
> What you are describing is the BTRFS RAID1; i.e. LINEAR over a RAID1:each chunk is allocated in *two*, only *two* different disks from the disks pool; the disks are the ones with the largest free space. Each chunk may be allocated on a different *pair* of disks.
>

Okay, so however the chunk is divided up, 2 copies of each chunk
division are written somewhere.  So I misunderstood; thanks for
clearing it up!

>> And finally the elephant in the room that comes with the necessary
>> 11-way mirroring is that the usable capacity of that 20 drive array.
>> Remember, pea brain so my math may be wrong in application and
>> calculation but if it's made of 1T drives for 20T raw, there is only
>> 1.82T usable (20 / 11) and if I'm completely off in that figure the
>> point is still that such a high level of mirroring is going to
>> excessively consume drive space.
>
> Ducan talked about a N-way mirroring, where each disks contains a copy of the same data. Nobody talked about N-way mirroring where N is less than the number of the available disks.
>

Well that was certainly implied as the unimplemented solution to
dropping half the drives that the OP tested.  N-way mirroring where N
= the number of drives is just Raid1 on crack and not the Raid10
use-case that the OP is asking about.

> To be honest in the past appeared some patches to implement a generalized RAID-NxM raid, where N are the total disk, M are the redundancy disks: i.e. the filesystem could allow a drop of M disks (see http://www.spinics.net/lists/linux-btrfs/msg29245.html).
>
> BR
> G.Baroncelli
>
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Yeah that whole thing is pretty upsetting.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Loss of connection to Half of the drives
  2015-12-23 22:15           ` Donald Pearson
@ 2015-12-23 23:13             ` Chris Murphy
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Murphy @ 2015-12-23 23:13 UTC (permalink / raw)
  To: Btrfs BTRFS

On Wed, Dec 23, 2015 at 3:15 PM, Donald Pearson
<donaldwhpearson@gmail.com> wrote:
> On Wed, Dec 23, 2015 at 12:20 PM, Goffredo Baroncelli

>> Ducan talked about a N-way mirroring, where each disks contains a copy of the same data. Nobody talked about N-way mirroring where N is less than the number of the available disks.
>>
>
> Well that was certainly implied as the unimplemented solution to
> dropping half the drives that the OP tested.  N-way mirroring where N
> = the number of drives is just Raid1 on crack and not the Raid10
> use-case that the OP is asking about.

How does the OP's use case normally get implemented? For separate
controllers, this would need to be software raid10, but you'd need a
way to specify the drive pairings. How does mdadm create -l raid10
enable that? Or to make absolutely certain, do you put them all in a
container and then first create -l raid1, and then second create -l
raid0?

In any case, what you get is drive level granularity for mirroring. A
drive has an exact (excluding layout options, but still data exact)
copy. That's not true with Btrfs where the granularity is the data
chunk (1+GiB). A given drive's chunks will definitely have copies on
multiple drives rather than on a single drive. And those multiple
drives will variably be on both sides of a controller or drive
make/model division.

One of the major differences of Btrfs with all profiles is that it
deals with different sized devices elegantly. That's because of the
chunk level granularity.

So I think that having mirrors of drives rather than chunks would mean
we'd have to have exact-size drive pairings.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Loss of connection to Half of the drives
  2015-12-23 15:53       ` Donald Pearson
  2015-12-23 18:20         ` Goffredo Baroncelli
@ 2015-12-24  1:21         ` Duncan
  2015-12-24 16:19           ` Donald Pearson
  1 sibling, 1 reply; 15+ messages in thread
From: Duncan @ 2015-12-24  1:21 UTC (permalink / raw)
  To: linux-btrfs

Donald Pearson posted on Wed, 23 Dec 2015 09:53:41 -0600 as excerpted:

> Additionally real Raid10 will run circles around what BTRFS is doing in
> terms of performance.  In the 20 drive array you're striping across 10
> drives, in BTRFS right now you're striping across 2 no matter what. So
> not only do I lose in terms of resilience I lose in terms of
> performance.  I assume that N-way-mirroring used with BTRFS Raid10 will
> also increase the stripe width so that will level out the performance
> but you're always going to be short a drive for equal resilience.

No, with btrfs raid10, you're /mirroring/ across two drives no matter 
what.  With 20 devices, you're /striping/ across 10 two-way mirrors.  
It's the same as a standard raid10, in that regard.  

Tho it's a bit different in that the mix of devices forming the above can 
differ among different chunks.  IOW, the first chunk might be mirrored a/
b c/d e/f g/h i/j k/l m/n o/p q/r s/t, with the stripe across each mirror-
pair, but the next chunk might be mirrored a/l g/o f/k b/n c/d e/s j/q h/t 
i/p m/r (I think I got each letter once...), and striped across those pairs.

So you get the same performance as a normal raid10 (well, to the extent 
that btrfs has been optimized, which in large part it hasn't been, yet), 
but as should always be the case in a raid10, randomized loss of more 
than a single device can mean data loss.

But, because each chunk's pair assignment is more or less randomized, you 
can't do with btrfs raid10 what a conventional raid10 lets you do: map 
all of one mirror set to one cabinet and all of the second mirror set to 
another cabinet, so you can reliably lose an entire cabinet and be fine, 
since it's known to correspond exactly to a single mirror set.  There's 
no way to specify individual chunk mirroring, and what might be 
precisely one mirror set for one chunk is very likely to be both copies 
of some mirrors and no copies of other mirrors for another chunk.

What I was suggesting as a solution (sketched below) was a setup that:
(a) has btrfs raid1 at the top level
(b) has a pair of mdraidNs underneath, in this case a pair of 10-device 
mdraid10s.
(c) has the pair of mdraidNs each presented to btrfs as one of its raid1 
mirrors.
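
A rough, untested sketch of that layout, with drive names borrowed from 
the OP's listing (substitute --level=0 if you'd rather run plain mdraid0 
under the btrfs raid1):

mdadm --create /dev/md/drawer0 --level=10 --raid-devices=10 /dev/sd[c-l]
mdadm --create /dev/md/drawer1 --level=10 --raid-devices=10 /dev/sda[g-p]
mkfs.btrfs -m raid1 -d raid1 -L scratch2 /dev/md/drawer0 /dev/md/drawer1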

While this is actually raid01, not raid10, in this case it makes more 
sense than a mixed raid10, because by doing it that way, you'd:
1) keep btrfs' data integrity and error correction at the top level, as 
it could pull from the second copy if the first failed checksum.
2) be able to stick each underlying mdraid array in its own cabinet, so 
loss of the entire cabinet wouldn't be data loss, only redundancy loss.

(Reversing that, btrfs raid0 on top of mdraid1, would lose btrfs' ability 
to correct checksum errors: at the btrfs level it'd be non-redundant, 
and mdraid1 doesn't have checksumming, so it couldn't provide the same 
data integrity service.  Without checksumming and the ability to pull 
from the other copy on error, you could scrub the mdraid1 to make its 
mirrors identical again, but you'd be just as likely to copy the bad one 
over the good one as the reverse.  Thus, btrfs really needs to be the 
raid1 layer unless you simply don't care about data integrity, and 
because btrfs is the filesystem layer, it has to be the top layer, so 
you're left doing a raid01 instead of the raid10 that's ordinarily 
preferred for locality of rebuild, absent other factors like this data 
integrity one.)

And what btrfs N-way-mirroring will provide, in the longer term once 
btrfs gets that feature and it stabilizes to usability, is the ability to 
actually have three cabinets, and sustain the loss of two, or four 
cabinets, and sustain the loss of three, etc.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Loss of connection to Half of the drives
  2015-12-23 18:20         ` Goffredo Baroncelli
  2015-12-23 22:15           ` Donald Pearson
@ 2015-12-24  1:29           ` Duncan
  1 sibling, 0 replies; 15+ messages in thread
From: Duncan @ 2015-12-24  1:29 UTC (permalink / raw)
  To: linux-btrfs

Goffredo Baroncelli posted on Wed, 23 Dec 2015 19:20:32 +0100 as
excerpted:

> Ducan talked about a N-way mirroring, where each disks contains a copy
> of the same data. Nobody talked about N-way mirroring where N is less
> than the number of the available disks.

Well, to be fair, I did /try/ to talk about raid10 in the context of N-
way-mirroring, as *one*future*option*, which would let you do say 3-way-
mirroring, 2-way-striping, using six devices, giving you that choice in 
addition to the current 3-way-striping, 2-way-mirroring, that's the only 
current choice for btrfs raid10 with six devices, since it's limited to 
two-way-mirroring.

But obviously I was more confusing than clear, since you apparently 
didn't see that bit at all, and he saw it, but apparently ended up more 
confused than helped by it, possibly due to trying to apply that 
discussion to a larger scope than the limited one-future-option scope 
that I had originally intended.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Loss of connection to Half of the drives
  2015-12-24  1:21         ` Duncan
@ 2015-12-24 16:19           ` Donald Pearson
  2015-12-24 20:57             ` Chris Murphy
  0 siblings, 1 reply; 15+ messages in thread
From: Donald Pearson @ 2015-12-24 16:19 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

On Wed, Dec 23, 2015 at 7:21 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Donald Pearson posted on Wed, 23 Dec 2015 09:53:41 -0600 as excerpted:
>
>> Additionally real Raid10 will run circles around what BTRFS is doing in
>> terms of performance.  In the 20 drive array you're striping across 10
>> drives, in BTRFS right now you're striping across 2 no matter what. So
>> not only do I lose in terms of resilience I lose in terms of
>> performance.  I assume that N-way-mirroring used with BTRFS Raid10 will
>> also increase the stripe width so that will level out the performance
>> but you're always going to be short a drive for equal resilience.
>
> No, with btrfs raid10, you're /mirroring/ across two drives no matter
> what.  With 20 devices, you're /striping/ across 10 two-way mirrors.
> It's the same as a standard raid10, in that regard.
>
> Tho it's a bit different in that the mix of devices forming the above can
> differ among different chunks.  IOW, the first chunk might be mirrored a/
> b c/d e/f g/h i/j k/l m/n o/p q/r s/t, with the stripe across each mirror-
> pair, but the chunk might be mirrored a/l g/o f/k b/n c/d e/s j/q h/t i/p
> m/r (I think I got each letter once...), and striped across those pairs.
>
> So you get the same performance as a normal raid10 (well, to the extent
> that btrfs has been optimized, which in large part it hasn't been, yet),
> but as should always be the case in a raid10, randomized loss of more
> than a single device can mean data loss.
>
> But, because each chunk pair assignment is more or less randomized,
> unlike a conventional raid10 which lets you map all of one mirror set to
> one cabinet and all of the second mirror set to another cabinet, so you
> can reliably lose an entire cabinet and be fine since it's known to
> correspond exactly to a single mirror set, you can't do that with btrfs
> raid10, because there's no way to specify individual chunk mirroring and
> what might be precisely one mirror set with one chunk, is very likely to
> be both copies of some mirrors and no copies of other mirrors, with
> another chunk.

Understood.  I was definitely confused on how it worked earlier.  What
I thought I read was really bizarre.

>
> What I was suggesting as a solution was a setup that:
> (a) has btrfs raid1 at the top level
> (b) has a pair of mdraidNs underneath, in this case a pair of 10-device
> mdraid10s.
> (c) has the pair of mdraidNs each presented to btrfs as one of its raid1
> mirrors.
>
> While this is actually raid01, not raid10, in this case it makes more
> sense than a mixed raid10, because by doing it that way, you'd:
> 1) keep btrfs' data integrity and error correction at the top level, as
> it could pull from the second copy if the first failed checksum.
> 2) be able to stick each mdraid0 in its own cabinet, so loss of the
> entire cabinet wouldn't be data loss, only redundancy loss.
>
> (Reversing that, btrfs raid0 on top of mdraid1, would lose btrfs' ability
> to correct checksum errors as at the btrfs level, it'd be non-redundant,
> and mdraid1 doesn't have checksumming, so it couldn't provide the same
> data integrity service.  Without checksumming and pull from the other
> copy in case of error, you could scrub the mdraid1 to make its mirrors
> identical again, but you'd be just as likely to copy the bad one to the
> good one as the reverse.  Thus, btrfs really needs to be the raid1 layer
> unless you simply don't care about data integrity, and because btrfs is
> the filesystem layer, it has to be the top layer, so you're left doing a
> raid01 instead of the raid10 that's ordinarily preferred due to locality
> of a rebuild, absent other factors like this data integrity factor.)
>

Got it.  I'm not the biggest fan of mixing mdraid with btrfs raid in
order to work around deficiencies.  Hopefully in the future btrfs will
allow me to select my mirror groups.

The trouble with a mirror of stripes is that you take a nasty hit to
your fault tolerance for dropped drives.  With Raid01, dropping just 1
drive from each cabinet will fail the entire array, because there is
only one mirror group.  So now it's a choice between fault tolerance
against dropped drives and fault tolerance against file-level errors.

So we're in this position of forced compromise where I have to decide
between a pure and simpler btrfs raidx configuration that gives up
controller tolerance, or a more convoluted hybrid of mdraid + btrfs,
which then forces me into choosing between Raid10, where I can suffer
more drive failures but lose btrfs' checksumming, and Raid01, where I'm
more vulnerable to drive failure but get to benefit from the
checksumming.

All this makes me ask why?  Why implement Raid10 in this non-standard
fashion and create this mess of compromise?  It's frustrating on the
user side and makes admins look at alternatives.  All this is because
I can't define what the mirrored pairs (or beyond in the future) are,
just to gain elegance in supporting different sized drives?  That can
be done at the stripe level, it doesn't need to be done at the mirror
level, and if it were done at the stripe level this issue wouldn't
exist.

> And what btrfs N-way-mirroring will provide, in the longer term once
> btrfs gets that feature and it stabilizes to usability, is the ability to
> actually have three cabinets, and sustain the loss of two, or four
> cabinets, and sustain the loss of three, etc.
>

I get it, but this really isn't compelling.  This can't be done without
using a hybrid of mdraid + btrfs; I can already do this in a raid 1+0
arrangement, I just don't benefit from checksumming.  All
N-way-mirroring is going to give me is the ability to do it in a 0+1
arrangement, which means my filesystem made of 3 trays of 30 drives
total will fail with just the failure of 1 drive in each tray, and
that's not acceptable.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Loss of connection to Half of the drives
  2015-12-24 16:19           ` Donald Pearson
@ 2015-12-24 20:57             ` Chris Murphy
  2015-12-25  0:23               ` Duncan
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Murphy @ 2015-12-24 20:57 UTC (permalink / raw)
  To: Donald Pearson; +Cc: Duncan, Btrfs BTRFS

On Thu, Dec 24, 2015 at 9:19 AM, Donald Pearson
<donaldwhpearson@gmail.com> wrote:

> Got it.  I'm not the biggest fan of mixing mdraid with btrfs raid in
> order to work around deficiencies.  Hopefully in the future btrfs will
> allow me to select my mirror groups.

As far as I know, mdadm -l raid10 works this same way; you don't have
control over this.  But what you can do with mdadm is create the
mirrored pairs first, and then stripe those arrays.  I don't know if
that's better/easier/necessary to do with an mdadm container.
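
Something like this, I think (an untested sketch, drive names borrowed
from the OP's listing, one drive from each controller per pair):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc /dev/sdag
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdd /dev/sdah
# ...and so on for the remaining eight pairs...
mdadm --create /dev/md10 --level=0 --raid-devices=10 /dev/md0 /dev/md1 \
      /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6 /dev/md7 /dev/md8 /dev/md9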



> The trouble with a mirror of stripes is you take a nasty impact to
> your fault tolerance for dropping drives.  With Raid01 dropping just 1
> drive from each cabinet will fail the entire array because there is
> only one mirror group.  So now it's a choice between fault tolerance
> of dropping drives or fault tolerance of file-level errors.

Right.  It's an open question whether btrfs raid10 is more like raid01
because of this.  Where it's more like raid10 than 01 is rebuild: with
01, when one drive dies, the entire raid0 array it's in dies and has to
be rebuilt, which is not the case for btrfs.  So it has characteristics
of raid10 and raid01 depending on the mode and context.

The thing is, the trend in building storage stacks, because drive
capacities are so huge but their performance hasn't scaled at the same
rate, is to build more arrays with fewer drives and pool the arrays
with something like ceph or glusterfs.

While the controller tolerance is a legit concern, is it more or less
likely to have a controller problem than it is a power supply problem?
Or something with that particular system that just craps out rather
than the array attached to it?


> All this makes me ask why?  Why implement Raid10 in this non-standard
> fashion and create this mess of compromise?

Because it was a straightforward extension of how the file system
already behaves. To implement drive-based copies rather than chunk-based
copies is a totally different strategy that actually negates how
btrfs does allocation, and would require things like logically
checking that mirrored pairs are the same size (+/- maybe 1%), similar
to mdadm.

And keep in mind that surviving multiple device failures with raid10 is
not guaranteed; it's not as if any additional failure is OK.  It just
depends on aviation's equivalent of "big sky theory" for air traffic
separation.  Yes, the probability of mirror A's two drives dying is next
to zero, but it's not zero.  If you're building arrays depending on it
being zero, well, that's not a good idea.  The way to look at it is more
as a bonus of uptime, rather than depending on it in the design.  You
design for its scalable performance, which it does have.



>  It's frustrating on the
> user side and makes admins look at alternatives.  All this is because
> I can't define what the mirrored pairs (or beyond in the future) are,
> just to gain elegance in supporting different sized drives?  That can
> be done at the stripe level, it doesn't need to be done at the mirror
> level, and if it were done at the stripe level this issue wouldn't
> exist.

Whether the granularity for mirroring shifts from chunks to drives or
to stripes doesn't matter. A mirrored pair will have to be the same
size, or the bullseye simply gets bigger, from one drive to two or
more.


> I get it but this really isn't compelling.  This can't be done without
> using a hybrid of mdraid + btrfs; I can already do this in a raid 1+0
> arrangement I just don't benefit from checksumming.  All
> N-way-mirroring is going to give me is the ability to do it in a 0+1
> arrangement which means my filesystem made of 3 trays of 30 drives
> total will be failed with just the failure of 1 drive in each tray and
> that's not acceptable.

OK, so in that case you can't use Btrfs alone to get the fault
tolerance you need.  There are other things I'd think an admin would
want in a Btrfs-only solution that Btrfs doesn't have, like a faulty
state for devices and notifications for that state change.  This isn't
the only one; it's just rather a gotcha if you come with the
expectation of raid10 being almost certainly capable of tolerating a
2-disk failure.  So I do kinda wonder if it ought to be called raid01,
even though that's misleading too, but at least not in a way that
causes an overestimation of data availability.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Loss of connection to Half of the drives
  2015-12-24 20:57             ` Chris Murphy
@ 2015-12-25  0:23               ` Duncan
  2015-12-26  6:12                 ` David Schulz
  0 siblings, 1 reply; 15+ messages in thread
From: Duncan @ 2015-12-25  0:23 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Thu, 24 Dec 2015 13:57:35 -0700 as excerpted:

>> All this makes me ask why?  Why implement Raid10 in this non-standard
>> fashion and create this mess of compromise?
> 
> Because it was a straightforward extension of how the file system
> already behaves. To implement drive based copies rather than chunk based
> copies is a totally different strategy that actually negates how btrfs
> does allocation, and would require things like logically checking for
> mirrored pairs being the same size +/- maybe 1% similar to mdadm.
> 
> And keep in mind the raid10 multiple device failure is not fixed, not
> just any additional failure is OK. It just depends on aviation's
> equivalent of "big sky theory" for air traffic separation. Yes the
> probability of mirror A's two drives dying is next to zero, but it's not
> zero. If you're building arrays depending on it being zero, well that's
> not a good idea. The way to look at it is more of a bonus of uptime,
> rather than depending on it in design. You design for it's scaleable
> performance, which it does have.

This.

Raid10 doesn't guard against any random two devices going down, let alone 
a random half of all devices, and anyone running a raid10 with the 
assumption that it does is simply asking for trouble.

What it /does/ do, in the device-scope raid10 case, is minimize the 
/chance/ that two devices down will take out the entire array, 
particularly on big raid10 arrays, because the chances of any random two 
devices being the two devices mirroring the same content goes down as the 
number of total devices goes up.

But as Chris Murphy says, btrfs is inherently chunk-scope, not drive-
scope.  In fact, that's a very large part of its multi-device flexibility 
in the first place.  And raid10 functionality was a straightforward 
extension of the existing raid1 and raid0 functionality, simply combining 
them into one at the same filesystem level with comparatively little 
extra code.  And that, again, was due to the incredible flexibility that 
chunk-scope granularity exposes.

Of course one drawback is that with chunk-scope allocation, the per-
device allocation of successive chunks is likely to vary, so you lose 
the low device-scope chance of two random devices taking the entire 
array down: the chances of those two random devices containing /both/ 
mirrors of _some_ chunk-strips are much higher than with device-scope 
allocation, where both copies of a mirror live on a fixed pair.  But 
that's a deliberate tradeoff that allowed striped-mirrors raid10 
functionality to be exposed in the first place, and as Chris and I are 
both saying, any admin relying on chance to cover his *** in the two-
device failure case on a raid10 is already asking for trouble.

But there are known workarounds for that problem: the layers-on-top-of-
layers scenario, raid0+1 or raid1+0, each with its own advantages and 
disadvantages.  Of course btrfs, arguably being a layering violation 
incorporating both filesystem and block-level layers (tho it's done with 
specific advantages in mind), does by definition of implementation have 
to be the top layer, which imposes some limits if other btrfs features 
such as checksumming and data integrity are wanted.  But it remains 
simply a question of matching the tradeoffs the technology makes against 
the ones you're willing to make, within the limitations of the available 
tradeoffs pool, of course.


Meanwhile, there has been discussion of enhancements to the chunk 
allocator that would let you pick allocation schemes.  Presumably, this 
would include the ability to nail down mirror allocation to specific 
devices, which seems to be the requested feature here.  However, while 
definitely possible within the flexible framework btrfs' chunk-scope 
allocation provides, to my knowledge at least, this isn't anywhere on the 
existing near- or intermediate-term roadmap, so implementation by current 
developers is likely out beyond the five-year time frame, along with a 
lot of other such features, making it effectively "bluesky": possible, 
and it would be nice, but with no near- or intermediate-term plans.  Tho 
if someone with that itch to scratch appears with the patches ready to 
go, who moreover is willing to join the btrfs team and help maintain them 
longer term, then assuming there's no huge personality clash, the feature 
could be implemented rather sooner, perhaps with initial implementation 
in a year or two and relative stability in two to three.

In that regard, it's more ENOTIMPLEMENTED than EBLACKLISTED.  There are 
all sorts of features that /could/ be implemented, and this one simply 
hasn't been a priority for the existing developers, given the other 
features they've found more pressing.  But it may indeed eventually 
come, five or ten years out, sooner if a suitable developer with 
suitable interest and social compatibility with the existing devs is 
found to champion the cause.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: Loss of connection to Half of the drives
  2015-12-25  0:23               ` Duncan
@ 2015-12-26  6:12                 ` David Schulz
  2015-12-26 18:49                   ` Chris Murphy
  0 siblings, 1 reply; 15+ messages in thread
From: David Schulz @ 2015-12-26  6:12 UTC (permalink / raw)
  To: Duncan, linux-btrfs


Hi Everyone,

I suppose I have an answer to my initial question.  Thanks for all the discussion.  I'd just like to stress how important it is, in my opinion, for btrfs to recognize that drives are missing or dead and to halt all operations that would advance the metadata when a portion of the drives is temporarily disconnected, even if a separate tool is then required to restore consistency after this sort of failure.

I mentioned the btrfs rescue command with the mismatching fsid message.  After dd'ing /dev/zero to all but the boot drive, the fsid mismatch went away, but the tool still segfaults on the filesystem after losing 1/2 of the drives, so at best, the fsid mismatch error was just cosmetic.

-Dave



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Loss of connection to Half of the drives
  2015-12-26  6:12                 ` David Schulz
@ 2015-12-26 18:49                   ` Chris Murphy
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Murphy @ 2015-12-26 18:49 UTC (permalink / raw)
  To: David Schulz, Duncan, linux-btrfs

On Fri, Dec 25, 2015, 11:28 PM David Schulz <dschulz@ucalgary.ca> wrote:
>
>
> I mentioned the btrfs rescue command with the mismatching fsid message.  After dd'ing /dev/zero to all but the boot drive, the fsid mismatch went away, but the tool still segfaults on the filesystem after losing 1/2 of the drives, so at best, the fsid mismatch error was just cosmetic.
>
>

Are you running btrfs rescue on all drives? Or just 1/2?

Because while rescue shouldn't crash, I also don't expect it to scrape
any files. It's probably worth submitting a bug and a strace.


---
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-12-26 18:49 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-22 19:12 Loss of connection to Half of the drives Dave S
2015-12-22 20:02 ` Chris Murphy
2015-12-22 23:56   ` Donald Pearson
2015-12-23  4:13     ` Duncan
2015-12-23 15:53       ` Donald Pearson
2015-12-23 18:20         ` Goffredo Baroncelli
2015-12-23 22:15           ` Donald Pearson
2015-12-23 23:13             ` Chris Murphy
2015-12-24  1:29           ` Duncan
2015-12-24  1:21         ` Duncan
2015-12-24 16:19           ` Donald Pearson
2015-12-24 20:57             ` Chris Murphy
2015-12-25  0:23               ` Duncan
2015-12-26  6:12                 ` David Schulz
2015-12-26 18:49                   ` Chris Murphy
