* I need to P. are we almost there yet?
@ 2014-12-29 18:56 sys.syphus
  2014-12-29 19:00 ` sys.syphus
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: sys.syphus @ 2014-12-29 18:56 UTC (permalink / raw)
  To: linux-btrfs

Specifically, (P)arity. Very specifically, n+2. When will raid5 & raid6
be at least as safe to run as raid1 currently is? I don't like the
idea of being 2 bad drives away from total catastrophe.

(And yes, I back up; it just wouldn't be fun to go down that route.)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-29 18:56 I need to P. are we almost there yet? sys.syphus
@ 2014-12-29 19:00 ` sys.syphus
  2014-12-29 19:04   ` Hugo Mills
  2014-12-29 21:16   ` Chris Murphy
  2014-12-29 21:13 ` Chris Murphy
  2015-01-03 11:34 ` Bob Marley
  2 siblings, 2 replies; 29+ messages in thread
From: sys.syphus @ 2014-12-29 19:00 UTC (permalink / raw)
  To: linux-btrfs

Oh, and sorry to bump myself, but is raid10 *ever* more redundant in
btrfs-speak than raid1? I currently use raid1, but I know that in
mdadm-speak raid10 means you can lose 2 drives assuming they aren't the
"wrong ones". Is it safe to say that with btrfs raid10 you can only
lose one no matter what?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-29 19:00 ` sys.syphus
@ 2014-12-29 19:04   ` Hugo Mills
  2014-12-29 20:25     ` sys.syphus
  2014-12-29 21:16   ` Chris Murphy
  1 sibling, 1 reply; 29+ messages in thread
From: Hugo Mills @ 2014-12-29 19:04 UTC (permalink / raw)
  To: sys.syphus; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 858 bytes --]

On Mon, Dec 29, 2014 at 01:00:05PM -0600, sys.syphus wrote:
> oh, and sorry to bump myself. but is raid10 *ever* more redundant in
> btrfs-speak than raid1? I currently use raid1 but i know in mdadm
> speak raid10 means you can lose 2 drives assuming they aren't the
> "wrong ones", is it safe to say with btrfs / raid 10 you can only lose
> one no matter what?

   I think that with an even number of identical-sized devices, you
get the same "guarantees" (well, behaviour) as you would with
traditional RAID-10.

   I may be wrong about that -- do test before relying on it. The FS
probably won't like losing two devices, though, even if the remaining
data is actually enough to reconstruct the FS.

   Hugo.

-- 
Hugo Mills             | I can resist everything except temptation
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: 65E74AC0          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-29 19:04   ` Hugo Mills
@ 2014-12-29 20:25     ` sys.syphus
  2014-12-29 21:50       ` Hugo Mills
  0 siblings, 1 reply; 29+ messages in thread
From: sys.syphus @ 2014-12-29 20:25 UTC (permalink / raw)
  To: Hugo Mills, sys.syphus, linux-btrfs

So am I to read that as saying btrfs redundancy isn't really functional?
If I yank a member of my raid1 out in live "prod", is it going to take
a dump on my data?

On Mon, Dec 29, 2014 at 1:04 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Mon, Dec 29, 2014 at 01:00:05PM -0600, sys.syphus wrote:
>> oh, and sorry to bump myself. but is raid10 *ever* more redundant in
>> btrfs-speak than raid1? I currently use raid1 but i know in mdadm
>> speak raid10 means you can lose 2 drives assuming they aren't the
>> "wrong ones", is it safe to say with btrfs / raid 10 you can only lose
>> one no matter what?
>
>    I think that with an even number of identical-sized devices, you
> get the same "guarantees" (well, behaviour) as you would with
> traditional RAID-10.
>
>    I may be wrong about that -- do test before relying on it. The FS
> probably won't like losing two devices, though, even if the remaining
> data is actually enough to reconstruct the FS.
>
>    Hugo.
>
> --
> Hugo Mills             | I can resist everything except temptation
> hugo@... carfax.org.uk |
> http://carfax.org.uk/  |
> PGP: 65E74AC0          |

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-29 18:56 I need to P. are we almost there yet? sys.syphus
  2014-12-29 19:00 ` sys.syphus
@ 2014-12-29 21:13 ` Chris Murphy
  2015-01-03 11:34 ` Bob Marley
  2 siblings, 0 replies; 29+ messages in thread
From: Chris Murphy @ 2014-12-29 21:13 UTC (permalink / raw)
  Cc: Btrfs BTRFS

By asking the question this way, I don't think you understand how
Btrfs development works. But if you check out the git pull for 3.19
you'll see a bunch of patches that pretty much close the feature
parity (no pun intended) gap between raid56 and raid0,1,10. But it is
an rc, it still needs testing, and even once 3.19 becomes a stable
kernel it's new enough code that there can always be edge cases. And
raid1 has been tested in Btrfs for how many years now? So if you want
the same amount of testing for raid6, add however many years that has
been to the date 3.19 is released.

Chris Murphy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-29 19:00 ` sys.syphus
  2014-12-29 19:04   ` Hugo Mills
@ 2014-12-29 21:16   ` Chris Murphy
  2014-12-30  0:20     ` ashford
  1 sibling, 1 reply; 29+ messages in thread
From: Chris Murphy @ 2014-12-29 21:16 UTC (permalink / raw)
  To: sys.syphus; +Cc: Btrfs BTRFS

On Mon, Dec 29, 2014 at 12:00 PM, sys.syphus <syssyphus@gmail.com> wrote:
> oh, and sorry to bump myself. but is raid10 *ever* more redundant in
> btrfs-speak than raid1? I currently use raid1 but i know in mdadm
> speak raid10 means you can lose 2 drives assuming they aren't the
> "wrong ones", is it safe to say with btrfs / raid 10 you can only lose
> one no matter what?

It's only one for sure in any case, even with conventional raid10.
Whether your data dodges a bullet depends on which 2 you lose.
Obviously you can't lose a drive and its mirror, ever, or the array
collapses.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-29 20:25     ` sys.syphus
@ 2014-12-29 21:50       ` Hugo Mills
  0 siblings, 0 replies; 29+ messages in thread
From: Hugo Mills @ 2014-12-29 21:50 UTC (permalink / raw)
  To: sys.syphus; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1947 bytes --]

On Mon, Dec 29, 2014 at 02:25:14PM -0600, sys.syphus wrote:
> so am I to read that as if btrfs redundancy isn't really functional?
> if i yank a member of my raid 1 out in live "prod" is it going to take
> a dump on my data?

   Eh? Where did that conclusion come from? I said nothing at all
about RAID-1, only RAID-10.

   So, to clarify:

   In the general case, you can safely lose one device from a btrfs
RAID-10. Also in the general case, losing a second device will break
the filesystem (with very high probability).

   In the case I gave below, with an even number of equal-sized
devices, losing a second device *may* still leave the data recoverable
with sufficient effort, but the FS in general will probably not be
mountable with two missing devices.

   So, btrfs RAID-10 offers the same *guarantees* as traditional
RAID-10. It's just generally less favourable in the probabilities of
the failure modes beyond that guarantee.

   Hugo.

> On Mon, Dec 29, 2014 at 1:04 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> > On Mon, Dec 29, 2014 at 01:00:05PM -0600, sys.syphus wrote:
> >> oh, and sorry to bump myself. but is raid10 *ever* more redundant in
> >> btrfs-speak than raid1? I currently use raid1 but i know in mdadm
> >> speak raid10 means you can lose 2 drives assuming they aren't the
> >> "wrong ones", is it safe to say with btrfs / raid 10 you can only lose
> >> one no matter what?
> >
> >    I think that with an even number of identical-sized devices, you
> > get the same "guarantees" (well, behaviour) as you would with
> > traditional RAID-10.
> >
> >    I may be wrong about that -- do test before relying on it. The FS
> > probably won't like losing two devices, though, even if the remaining
> > data is actually enough to reconstruct the FS.
> >
> >    Hugo.
> >

-- 
Hugo Mills             | emacs: Eighty Megabytes And Constantly Swapping.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: 65E74AC0          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-29 21:16   ` Chris Murphy
@ 2014-12-30  0:20     ` ashford
       [not found]       ` <CALBWd85UsSih24RhwpmDeMjuMWCKj9dGeuZes5POj6qEFkiz2w@mail.gmail.com>
  2014-12-30 21:44       ` Phillip Susi
  0 siblings, 2 replies; 29+ messages in thread
From: ashford @ 2014-12-30  0:20 UTC (permalink / raw)
  To: Chris Murphy; +Cc: sys.syphus, Btrfs BTRFS

> On Mon, Dec 29, 2014 at 12:00 PM, sys.syphus <syssyphus@gmail.com> wrote:
>> oh, and sorry to bump myself. but is raid10 *ever* more redundant in
>> btrfs-speak than raid1? I currently use raid1 but i know in mdadm
>> speak raid10 means you can lose 2 drives assuming they aren't the
>> "wrong ones", is it safe to say with btrfs / raid 10 you can only lose
>> one no matter what?
>
> It's only for sure one in any case even with conventional raid10. It
> just depends on which 2 you lose that depends whether your data has
> dodged a bullet. Obviously you can't lose a drive and its mirror,
> ever, or the array collapses.

Just some background data on traditional RAID, and the chances of survival
with a 2-drive failure.

In traditional RAID-10, the chances of surviving a 2-drive failure are 66%
on a 4-drive array, and approach 100% as the number of drives in the
array increases.

In traditional RAID-0+1 (which used to be common in low-end fake-RAID
cards), the chances of surviving a 2-drive failure are 33% on a 4-drive
array, and approach 50% as the number of drives in the array increases.

In traditional RAID-1E, the chances of surviving a 2-drive failure are 66%
on a 4-drive array, and approach 100% as the number of drives in the
array increases.  This is the same as for RAID-10.  RAID-1E allows an odd
number of disks to be actively used in the array.
https://en.wikipedia.org/wiki/File:RAID_1E.png
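
(If anyone wants to sanity-check those figures, here is a minimal
brute-force sketch in Python. It is purely illustrative: it assumes
independent, equally likely drive failures and the idealised
pair-mirror layouts described above, nothing vendor-specific.)

# Survival odds for a 2-drive failure on idealised RAID-10
# (stripe over mirror pairs) and naive RAID-0+1 (mirror of stripes).
from fractions import Fraction
from itertools import combinations

def raid10_survival(n_pairs):
    # Drives 2i and 2i+1 form a mirror pair; the array dies only if
    # both members of some pair fail.
    drives = range(2 * n_pairs)
    pairs = list(combinations(drives, 2))
    fatal = sum(1 for a, b in pairs if a // 2 == b // 2)
    return 1 - Fraction(fatal, len(pairs))

def raid01_survival(n_per_stripe):
    # Two stripes of n drives each, mirrored.  A naive implementation
    # offlines a whole stripe on the first failure, so it survives a
    # second failure only if that failure hits the already-dead stripe.
    drives = range(2 * n_per_stripe)
    pairs = list(combinations(drives, 2))
    same_stripe = sum(1 for a, b in pairs
                      if a // n_per_stripe == b // n_per_stripe)
    return Fraction(same_stripe, len(pairs))

for d in (4, 8, 16, 64):
    s10 = float(raid10_survival(d // 2))
    s01 = float(raid01_survival(d // 2))
    print("%2d drives: raid10 %.1f%%, raid0+1 %.1f%%" % (d, 100 * s10, 100 * s01))

That reproduces the 66% / 33% figures for 4 drives and the 100% / 50%
limits as the drive count grows.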

I'm wondering which of the above the BTRFS implementation most closely
resembles.

> So if you want the same amount of raid6 testing by time it would be
> however many years that's been from the time 3.19 is released.

I don't believe that's correct.  Over those several years, quite a few
tests for corner cases have been developed.  I expect that those tests are
used for regression testing of each release to ensure that old bugs aren't
inadvertently reintroduced.  Furthermore, I expect that a large number of
those corner case tests can be easily modified to test RAID-5 and RAID-6. 
In reality, I expect the stability (i.e. similar to RAID-10 currently) of
RAID-5/6 code in BTRFS will be achieved rather quickly (only a year or
two).

I expect that the difficult part will be to optimize the performance of
BTRFS.  Hopefully those tests (and others, yet to be developed) will be
able to keep it stable while the code is optimized for performance.

Peter Ashford


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Fwd: I need to P. are we almost there yet?
       [not found]       ` <CALBWd85UsSih24RhwpmDeMjuMWCKj9dGeuZes5POj6qEFkiz2w@mail.gmail.com>
@ 2014-12-30 17:09         ` Jose Manuel Perez Bethencourt
  0 siblings, 0 replies; 29+ messages in thread
From: Jose Manuel Perez Bethencourt @ 2014-12-30 17:09 UTC (permalink / raw)
  To: Btrfs BTRFS

I think you are missing crucial info on the on-disk layout that BTRFS
implements. While traditional RAID1 has a rigid layout, with fixed and
easily predictable locations for all data (exactly two specific
disks), BTRFS allocates chunks as needed on ANY two disks. Please
research this to understand the problem fully; it is the key to your
question.

I mean that with RAID1 you know your data is on disks 1 and 2, and if
one of those fails you have a surviving mirror. Two disk failures with
RAID10 are no problem when they hit different mirror pairs.

With BTRFS you cannot guarantee that a simultaneous two-disk failure
won't affect chunks that have their two mirrors precisely on those two
disks... even though there is a good chance that the chunks are
mirrored on other drives. The probability of surviving is greater with
a greater number of disks, but we are talking about worst-case
scenarios and guarantees. There will be, eventually, chunks that are
using those two disks for their mirrors...

Take into account that traditional RAID10 has a higher probability of
surviving, but the worst case is exactly the same: the simultaneous
failure of both disks in one mirror pair.

Please think about this as well: "simultaneous" should be read as
"within a rebuild window". In a hardware RAID, the HBA is expected to
kick off a rebuild as soon as you replace the failing disk (zero delay
if you have a hot spare). In BTRFS you are expected to first notice
the problem and then replace and scrub or rebalance. Any second
failure before the rebuild completes will be fatal to some extent.

I would also rule out raid5, as you would have a complete failure with
two simultaneous disk failures, be it the traditional or the btrfs
implementation.

You should aim for RAID6 at minimum on hardware implementations, or
the equivalent on btrfs, so as to withstand a two-disk failure. Some
people are pushing for a triple "mirror", but that's expensive in
"wasted" disk space (although implementations like Ceph are good
IMHO). Better are generalized forms of parity that extend to more than
two parity "disks", if you want maximum storage capacity (but probably
slow writes).

Jose Manuel Perez Bethencourt

>
> > On Mon, Dec 29, 2014 at 12:00 PM, sys.syphus <syssyphus@gmail.com> wrote:
> >> oh, and sorry to bump myself. but is raid10 *ever* more redundant in
> >> btrfs-speak than raid1? I currently use raid1 but i know in mdadm
> >> speak raid10 means you can lose 2 drives assuming they aren't the
> >> "wrong ones", is it safe to say with btrfs / raid 10 you can only lose
> >> one no matter what?
> >
> > It's only for sure one in any case even with conventional raid10. It
> > just depends on which 2 you lose that depends whether your data has
> > dodged a bullet. Obviously you can't lose a drive and its mirror,
> > ever, or the array collapses.
>
> Just some background data on traditional RAID, and the chances of survival
> with a 2-drive failure.
>
> In traditional RAID-10, the chances of surviving a 2-drive failure is 66%
> on a 4-drive array, and approaches 100% as the number of drives in the
> array increase.
>
> In traditional RAID-0+1 (used to be common in low-end fake-RAID cards),
> the chances of surviving a 2-drive failure is 33% on a 4-drive array, and
> approaches 50% as the number of drives in the array increase.
>
> In traditional RAID-1E, the chances of surviving a 2-drive failure is 66%
> on a 4-drive array, and approaches 100% as the number of drives in the
> array increase.  This is the same as for RAID-10.  RAID-1E allows an odd
> number of disks to be actively used in the array.
> https://en.wikipedia.org/wiki/File:RAID_1E.png
>
> I'm wondering which of the above the BTRFS implementation most closely
> resembles.
>
> > So if you want the same amount of raid6 testing by time it would be
> > however many years that's been from the time 3.19 is released.
>
> I don't believe that's correct.  Over those several years, quite a few
> tests for corner cases have been developed.  I expect that those tests are
> used for regression testing of each release to ensure that old bugs aren't
> inadvertently reintroduced.  Furthermore, I expect that a large number of
> those corner case tests can be easily modified to test RAID-5 and RAID-6.
> In reality, I expect the stability (i.e. similar to RAID-10 currently) of
> RAID-5/6 code in BTRFS will be achieved rather quickly (only a year or
> two).
>
> I expect that the difficult part will be to optimize the performance of
> BTRFS.  Hopefully those tests (and others, yet to be developed) will be
> able to keep it stable while the code is optimized for performance.
>
> Peter Ashford
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-30  0:20     ` ashford
       [not found]       ` <CALBWd85UsSih24RhwpmDeMjuMWCKj9dGeuZes5POj6qEFkiz2w@mail.gmail.com>
@ 2014-12-30 21:44       ` Phillip Susi
  2014-12-30 23:17         ` ashford
  1 sibling, 1 reply; 29+ messages in thread
From: Phillip Susi @ 2014-12-30 21:44 UTC (permalink / raw)
  To: ashford, Chris Murphy; +Cc: sys.syphus, Btrfs BTRFS

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 12/29/2014 7:20 PM, ashford@whisperpc.com wrote:
> Just some background data on traditional RAID, and the chances of
> survival with a 2-drive failure.
> 
> In traditional RAID-10, the chances of surviving a 2-drive failure
> is 66% on a 4-drive array, and approaches 100% as the number of
> drives in the array increase.
> 
> In traditional RAID-0+1 (used to be common in low-end fake-RAID
> cards), the chances of surviving a 2-drive failure is 33% on a
> 4-drive array, and approaches 50% as the number of drives in the
> array increase.

In terms of data layout, there is really no difference between raid-10
( or raid1+0 ) and raid0+1, aside from the designation you assign to
each drive.  With a dumb implementation of 0+1, any single drive
failure offlines the entire stripe, discarding the remaining good
disks in it; that gives the probability you describe, since the only
remaining failure(s) that do not also take out the mirror are drives
in the same stripe as the original.  This, however, is only a
deficiency of the implementation, not the data layout, as all of the
data on the first failed drive could be recovered from a drive in the
second stripe, so long as the second drive to fail was any drive other
than the one holding the duplicate data of the first.

This is partly why I agree with linux mdadm that raid10 is *not*
simply raid1+0; the latter is just a naive, degenerate implementation
of the former.

> In traditional RAID-1E, the chances of surviving a 2-drive failure
> is 66% on a 4-drive array, and approaches 100% as the number of
> drives in the array increase.  This is the same as for RAID-10.
> RAID-1E allows an odd number of disks to be actively used in the
> array.

What some vendors have called "1E" is simply raid10 in mdadm's
default "near" layout.  I prefer the higher-performance "offset"
layout myself.

> I'm wondering which of the above the BTRFS implementation most
> closely resembles.

Unfortunately, btrfs just uses the naive raid1+0, so no 2 or 3 disk
raid10 arrays, and no higher performing offset layout.
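
(For reference, here is a rough picture of where md puts the two copies
of each chunk on a 3-device raid10 in the "near" and "offset" layouts
mentioned above. The Python below is purely illustrative and is not
mdadm's real allocation code; the numbers are logical chunk numbers.)

# Illustrative only: approximate copy placement for md raid10 with
# 2 copies on 3 devices, "near" vs "offset" layout.
def near_layout(n_dev, n_rows, copies=2):
    # Each chunk is written 'copies' times into consecutive slots,
    # filling the devices row by row.
    cells = [None] * (n_dev * n_rows)
    chunk = pos = 0
    while pos < len(cells):
        for _ in range(copies):
            if pos < len(cells):
                cells[pos] = chunk
                pos += 1
        chunk += 1
    return [cells[r * n_dev:(r + 1) * n_dev] for r in range(n_rows)]

def offset_layout(n_dev, n_rows, copies=2):
    # A whole stripe is written, then repeated with the devices
    # rotated by one for each additional copy.
    rows, chunk = [], 0
    while len(rows) < n_rows:
        stripe = list(range(chunk, chunk + n_dev))
        chunk += n_dev
        for c in range(copies):
            rows.append(stripe[-c:] + stripe[:-c] if c else stripe[:])
    return rows[:n_rows]

for name, rows in (("near", near_layout(3, 4)), ("offset", offset_layout(3, 4))):
    print(name)
    for row in rows:
        print("  dev0=%s dev1=%s dev2=%s" % tuple(row))

Either way every chunk lives on two different devices, which is why a
3-disk md raid10 can survive a single drive failure.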


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)

iQEcBAEBAgAGBQJUoxyuAAoJENRVrw2cjl5R72oH/1nypXV72Bk4PBeaGAwH7559
lL6JH80216lbhv8hHopIeXKe7uqPGFAE5F1ArChIi08HA+CqKr5cfPNzJPlobyFj
KNLzeXi+wnJO2mbvWnnJak83GVmvpBnYvS+22RCweDELCb3pulybleJnN4yVSL25
WpVfUGnAg5lQJdX2l6THeClWX6V47NKqD6iXbt9+jyADCK2yk/5+TVbS8tixFUtj
PBxe+XGNrkTREnPAAFy6BgwO2vCD92F6+mm/lHJ0fg7gOm41UE09gzabsCGQ9LFA
kk99c9WAnJdkTqUJVw49MEwmmhs/2gluKWTeaHONpBePoFIpQEjHI89TqBsKhY4=
=+oed
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-30 21:44       ` Phillip Susi
@ 2014-12-30 23:17         ` ashford
  2014-12-31  2:45           ` Phillip Susi
  0 siblings, 1 reply; 29+ messages in thread
From: ashford @ 2014-12-30 23:17 UTC (permalink / raw)
  To: Phillip Susi, Jose Manuel Perez Bethencourt
  Cc: ashford, Chris Murphy, sys.syphus, Btrfs BTRFS

> Phillip Susi wrote:
>
>> I'm wondering which of the above the BTRFS implementation most
>> closely resembles.
>
> Unfortunately, btrfs just uses the naive raid1+0, so no 2 or 3 disk
> raid10 arrays, and no higher performing offset layout.

> Jose Manuel Perez Bethencourt wrote:
>
> I think you are missing crucial info on the layout on disk that BTRFS
> implements. While a traditional RAID1 has a rigid layout that has
> fixed and easily predictable locations for all data (exactly on two
> specific disks), BTRFS allocs chunks as needed on ANY two disks.
> Please research into this to understand the problem fully, this is the
> key to your question.

There is a HUGE difference here.  In the first case, the data will have a
>50% chance of surviving a 2-drive failure.  In the second case, the data
will have an effectively 0% chance of surviving a 2-drive failure.  I
don't believe I need to mention which of the above is more reliable, or
which I would prefer.

I believe that someone who understands the code in depth (and that may
well be one of the people above) should determine exactly how BTRFS
implements RAID-10.

Thank you.

Peter Ashford


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-30 23:17         ` ashford
@ 2014-12-31  2:45           ` Phillip Susi
  2014-12-31 17:27             ` ashford
  0 siblings, 1 reply; 29+ messages in thread
From: Phillip Susi @ 2014-12-31  2:45 UTC (permalink / raw)
  To: ashford, Jose Manuel Perez Bethencourt
  Cc: Chris Murphy, sys.syphus, Btrfs BTRFS

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 12/30/2014 06:17 PM, ashford@whisperpc.com wrote:
> I believe that someone who understands the code in depth (and that
> may also be one of the people above) determine exactly how BTRFS
> implements RAID-10.

I am such a person.  I had a similar question a year or two ago (
specifically about raid10  ) so I both experimented and read the code
myself to find out.  I was disappointed to find that it won't do
raid10 on 3 disks since the chunk metadata describes raid10 as a
stripe layered on top of a mirror.

Jose's point was also a good one though: one chunk may decide to
mirror on disks A and B, so it could recover from a failure of A and
C, but a different chunk could choose to mirror on disks A and C, and
that chunk would be lost if A and C fail.  It would probably be nice
if the chunk allocator tried to be more deterministic about that.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBCgAGBQJUo2M8AAoJENRVrw2cjl5RihoH/1ulWpEK6lPaYhBSBbmWQyGu
obJZBTbeMgBAfO9VMq9X2laUfmEprwYi8FuKnCwVgA1KyftFsaJngckqMoTtpwdI
IXx2X2++MjZBkFBUFRhGlSQcbDgeB/RbBx+Vtxi2dNq3/WgZyHRfIJT1moRrxY0V
UTH1kI7JsWg4blpdm+xW4o7UKds7JKHr5Th1PUH9SmJOdsBe2efIFQyC7hyuSQs0
gBUQzxmo3HcRzBtJwJjKRICU16VBN0NW7w3m/y6K1yIlkGi4U7MZgzMSUJw/BiMT
tGX48AhBH3D3R2sjmF2aO5suPaHEVYoZuqhKevKZfTGS7izSYA74LqrGHkq5QBk=
=ESya
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-31  2:45           ` Phillip Susi
@ 2014-12-31 17:27             ` ashford
  2014-12-31 23:38               ` Phillip Susi
                                 ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: ashford @ 2014-12-31 17:27 UTC (permalink / raw)
  To: Phillip Susi
  Cc: ashford, Jose Manuel Perez Bethencourt, Chris Murphy, sys.syphus,
	Btrfs BTRFS

Phillip

> I had a similar question a year or two ago (
> specifically about raid10  ) so I both experimented and read the code
> myself to find out.  I was disappointed to find that it won't do
> raid10 on 3 disks since the chunk metadata describes raid10 as a
> stripe layered on top of a mirror.
>
> Jose's point was also a good one though; one chunk may decide to
> mirror disks A and B, so a failure of A and C it could recover from,
> but a different chunk could choose to mirror on disks A and C, so that
> chunk would be lost if A and C fail.  It would probably be nice if the
> chunk allocator tried to be more deterministic about that.

I see this as a CRITICAL design flaw.  The reason for calling it CRITICAL
is that System Administrators have been trained for >20 years that RAID-10
can usually handle a dual-disk failure, but the BTRFS implementation has
effectively ZERO chance of doing so.

According to every description of RAID-10 I've ever seen (including
documentation from MaxStrat), RAID-10 stripes mirrored pairs/sets of
disks.  The device-level description is a critical component of what makes
an array "RAID-10", and is the reason for many of the attributes of
RAID-10.  This is NOT what BTRFS has implemented.

While BTRFS may be distributing the chunks according to a RAID-10
methodology, that is NOT what the industry considers to be RAID-10.  While
the current methodology has the data replication of RAID-10, and it may
have the performance of RAID-10, it absolutely DOES NOT have the
robustness or uptime benefits that are expected of RAID-10.

In order to remove this potentially catastrophic confusion, BTRFS should
either call their "RAID-10" implementation something else, or they should
adhere to the long-established definition of RAID-10.

Peter Ashford


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-31 17:27             ` ashford
@ 2014-12-31 23:38               ` Phillip Susi
  2015-01-01  1:26               ` Chris Samuel
  2015-01-02 13:42               ` Austin S Hemmelgarn
  2 siblings, 0 replies; 29+ messages in thread
From: Phillip Susi @ 2014-12-31 23:38 UTC (permalink / raw)
  To: ashford
  Cc: Jose Manuel Perez Bethencourt, Chris Murphy, sys.syphus, Btrfs BTRFS

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 12/31/2014 12:27 PM, ashford@whisperpc.com wrote:
> I see this as a CRITICAL design flaw.  The reason for calling it
> CRITICAL is that System Administrators have been trained for >20
> years that RAID-10 can usually handle a dual-disk failure, but the
> BTRFS implementation has effectively ZERO chance of doing so.

Sure, but you never *count* on that second failure since it is a (
relatively even ) probability game.

> In order to remove this potentially catestrophic confusion, BTRFS
> should either call their "RAID-10" implementation something else,
> or they should adhere to the long-established definition of
> RAID-10.

Personally I'd prefer it follow the way mdadm does it, which is much
better than what the rest of the industry calls raid-10, namely a
naive raid-0 on top of raid-1.  I'm very happy with my 3-disk
offset-layout raid-10, which gets the sequential read throughput of a
3-disk raid-0 while still being able to handle a single drive failure.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBCgAGBQJUpIkQAAoJENRVrw2cjl5R/wAIAKNU0NfEFGXLQK0lB1kZMJQt
WrBjKih8xG2WIYAqYCHoNTWJtmCZOCHtltt+OsUb8Pa8u075ALtEQBRNminLLuqV
LjREOyOzvzaDfNSEhptdBZ4YazqFt6UChWtu7RWhMtb7u61pmqMJatDhxLe+2CF9
YQE3qgLfP+PAMIGO/xN5m+hYba4hbF/MoqQ/XN7Z1VWvT9FNR7Dn8frflpmI2Cyh
iAravNS78hUjbxTtNz1qVXLosDVsjyZpz9UY9occNJ/vlF/GMd5q2c8xXkDTczGB
O9B55OXGzfmzPZzlNJ2MyBLgwQx/huPH8RiyuuIdy3AVubc/pXuAZQqaydf/lQg=
=qwW+
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-31 17:27             ` ashford
  2014-12-31 23:38               ` Phillip Susi
@ 2015-01-01  1:26               ` Chris Samuel
  2015-01-01 20:12                 ` Roger Binns
  2015-01-02 13:42               ` Austin S Hemmelgarn
  2 siblings, 1 reply; 29+ messages in thread
From: Chris Samuel @ 2015-01-01  1:26 UTC (permalink / raw)
  To: linux-btrfs

On Wed, 31 Dec 2014 09:27:14 AM ashford@whisperpc.com wrote:

> I see this as a CRITICAL design flaw.  The reason for calling it CRITICAL
> is that System Administrators have been trained for >20 years that RAID-10
> can usually handle a dual-disk failure, but the BTRFS implementation has
> effectively ZERO chance of doing so.

I suspect this is a knock-on effect of the fact that (unless this has
changed recently & IIRC) RAID-1 with btrfs only mirrors data over two
drives, no matter how many you add to an array.

-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2015-01-01  1:26               ` Chris Samuel
@ 2015-01-01 20:12                 ` Roger Binns
  2015-01-02  3:47                   ` Duncan
  0 siblings, 1 reply; 29+ messages in thread
From: Roger Binns @ 2015-01-01 20:12 UTC (permalink / raw)
  To: linux-btrfs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 12/31/2014 05:26 PM, Chris Samuel wrote:
> I suspect this is a knock-on effect of the fact that (unless this
> has changed recently & IIRC) RAID-1 with btrfs will only mirrors
> data over two drives, no matter how many you add to an array.

I wish btrfs wouldn't use the old-school micro-managing storage
terminology (or would keep it only as aliases) and instead let you set
the goals.  What people really mean is that they want their data to
survive the failure of N drives; exactly how that is done doesn't
matter.  It would also be nice for this to be settable as an xattr on
files and directories.

Roger

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iEYEARECAAYFAlSlqi8ACgkQmOOfHg372QTtrACeKT9OfzEtJyucEDNfeisfAw9z
Ao8AoIEevlY7MEyBHFBqyCCE1LJXGDw9
=zUJs
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2015-01-01 20:12                 ` Roger Binns
@ 2015-01-02  3:47                   ` Duncan
  0 siblings, 0 replies; 29+ messages in thread
From: Duncan @ 2015-01-02  3:47 UTC (permalink / raw)
  To: linux-btrfs

Roger Binns posted on Thu, 01 Jan 2015 12:12:31 -0800 as excerpted:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On 12/31/2014 05:26 PM, Chris Samuel wrote:
>> I suspect this is a knock-on effect of the fact that (unless this has
>> changed recently & IIRC) RAID-1 with btrfs will only mirrors data over
>> two drives, no matter how many you add to an array.

It hasn't changed yet, but now that raid56 support is basically complete 
(with 3.19, other than bugs of course, it'll be another kernel cycle or 
two before I'd rely on it), that's next up on the raid-features roadmap. 
=:^)

I know, as that's my most hotly anticipated roadmapped btrfs feature
yet to hit, and I've been waiting for it, patiently only because I
didn't have much choice, for a couple of years now.

> I wish btrfs wouldn't use the old school micro-managing storage
> terminology (or only as aliases) and instead let you set the goals. What
> people really mean is that they want their data to survive the failure
> of N drives - exactly how that is done doesn't matter.  It would also be
> nice to be settable as an xattr on files and directories.

Actually, a more flexible terminology has been discussed, and /might/ 
actually be introduced either along with or prior to the multi-way-
mirroring feature (depending on how long the latter takes to develop, I'd 
guess).  The suggested terminology would basically treat number of data 
strips, mirrors, parity, hot-spares, etc, each on its own separate axis, 
with parity levels ultimately extended well beyond 2 (aka raid6) as well 
-- I think to something like a dozen or 16.

Obviously if it's introduced before N-way-mirroring, N-way-parity, etc, 
it would only support the current feature set for now, and would just be 
a different way of configuring mkfs as well as displaying the current 
layouts in btrfs filesystem df and usage.

Hugo's the guy who has proposed that, and has been doing the preliminary 
patch development.

Meanwhile, ultimately the ability to configure all this at least by 
subvolume is planned, and once it's actually possible to set it on less 
than a full filesystem basis, setting it by individual xattr has been 
discussed as well.  I think the latter depends on the sorts of issues 
they run into in actual implementation.

Finally, btrfs is already taking the xattr/property route with this sort 
of attribute.  The basic infrastructure for that went in a couple kernel 
cycles ago, and can be seen and worked with using the btrfs property 
command.  So the basic property/xattr infrastructure is already there, 
and the ability to configure redundancy per subvolume already built into 
the original btrfs design and roadmapped altho it's not yet implemented, 
which means it's actually quite likely to eventually be configurable by 
file via xattr/properties as well -- emphasis on /eventually/, as these 
features /do/ tend to take rather longer to actually develop and 
stabilize than originally predicted.  The raid56 code is a good example, 
as it was originally slated for kernel cycle 3.6 or so, IIRC, but it took 
it over two years to cook and we're finally getting it in 3.19!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-31 17:27             ` ashford
  2014-12-31 23:38               ` Phillip Susi
  2015-01-01  1:26               ` Chris Samuel
@ 2015-01-02 13:42               ` Austin S Hemmelgarn
  2015-01-02 17:45                 ` Brendan Hide
  2 siblings, 1 reply; 29+ messages in thread
From: Austin S Hemmelgarn @ 2015-01-02 13:42 UTC (permalink / raw)
  To: ashford, Phillip Susi
  Cc: Jose Manuel Perez Bethencourt, Chris Murphy, sys.syphus, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 1668 bytes --]

On 2014-12-31 12:27, ashford@whisperpc.com wrote:
> Phillip
>
>> I had a similar question a year or two ago (
>> specifically about raid10  ) so I both experimented and read the code
>> myself to find out.  I was disappointed to find that it won't do
>> raid10 on 3 disks since the chunk metadata describes raid10 as a
>> stripe layered on top of a mirror.
>>
>> Jose's point was also a good one though; one chunk may decide to
>> mirror disks A and B, so a failure of A and C it could recover from,
>> but a different chunk could choose to mirror on disks A and C, so that
>> chunk would be lost if A and C fail.  It would probably be nice if the
>> chunk allocator tried to be more deterministic about that.
>
> I see this as a CRITICAL design flaw.  The reason for calling it CRITICAL
> is that System Administrators have been trained for >20 years that RAID-10
> can usually handle a dual-disk failure, but the BTRFS implementation has
> effectively ZERO chance of doing so.
No, some rather simple math will tell you that a 4 disk BTRFS filesystem 
in raid10 mode has exactly a 50% chance of surviving a dual disk 
failure, and that as the number of disks goes up, the chance of survival 
will asymptotically approach 100% (but never reach it).
This is the case for _every_ RAID-10 implementation that I have ever 
seen, including hardware raid controllers; the only real difference is 
in the stripe length (usually 512 bytes * half the number of disks for 
hardware raid, 4k * half the number of disks for software raid, and the 
filesystem block size (default is 16k in current versions) * half the 
number of disks for BTRFS).



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2455 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2015-01-02 13:42               ` Austin S Hemmelgarn
@ 2015-01-02 17:45                 ` Brendan Hide
  2015-01-02 19:41                   ` Austin S Hemmelgarn
  0 siblings, 1 reply; 29+ messages in thread
From: Brendan Hide @ 2015-01-02 17:45 UTC (permalink / raw)
  To: Austin S Hemmelgarn, ashford, Phillip Susi
  Cc: Jose Manuel Perez Bethencourt, Chris Murphy, sys.syphus, Btrfs BTRFS

On 2015/01/02 15:42, Austin S Hemmelgarn wrote:
> On 2014-12-31 12:27, ashford@whisperpc.com wrote:
>> I see this as a CRITICAL design flaw.  The reason for calling it 
>> CRITICAL
>> is that System Administrators have been trained for >20 years that 
>> RAID-10
>> can usually handle a dual-disk failure, but the BTRFS implementation has
>> effectively ZERO chance of doing so.
> No, some rather simple math
That's the problem. The math isn't as simple as you'd expect:

The example below is probably a pathological case - but here goes. Let's 
say in this 4-disk example that chunks are striped as d1,d2,d1,d2 where 
d1 is the first bit of data and d2 is the second:
Chunk 1 might be striped across disks A,B,C,D d1,d2,d1,d2
Chunk 2 might be striped across disks B,C,A,D d3,d4,d3,d4
Chunk 3 might be striped across disks D,A,C,B d5,d6,d5,d6
Chunk 4 might be striped across disks A,C,B,D d7,d8,d7,d8
Chunk 5 might be striped across disks A,C,D,B d9,d10,d9,d10

Lose any two disks and you have a 50% chance on *each* chunk to have 
lost that chunk. With traditional RAID10 you have a 50% chance of losing 
the array entirely. With btrfs, the more data you have stored, the 
closer the chances get to 100% of losing *some* data in a 2-disk failure.

In the above example, losing A and B means you lose d3, d6, and d7 
(which ends up being 60% of all chunks).
Losing A and C means you lose d1 (20% of all chunks).
Losing A and D means you lose d9 (20% of all chunks).
Losing B and C means you lose d10 (20% of all chunks).
Losing B and D means you lose d2 (20% of all chunks).
Losing C and D means you lose d4,d5, AND d8 (60% of all chunks)

The above skewed example has an average of 40% of all chunks failed. As 
you add more data and randomise the allocation, this will approach 50% - 
BUT the chances of losing *some* data are already clearly shown to be 
very close to 100%.
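
(To put rough numbers on that, here is a toy Monte Carlo in Python. It
is only a model: it assumes four equal disks and that each chunk
independently picks one of the three possible mirror pairings, which is
a simplification of the real allocator, but it shows how quickly "some
data lost" approaches certainty as the chunk count grows.)

# Toy model of btrfs-style per-chunk mirror placement on 4 disks:
# estimate the chance that a random 2-disk failure destroys at least
# one chunk, as a function of how many chunks have been allocated.
import random

PAIRINGS = [({0, 1}, {2, 3}), ({0, 2}, {1, 3}), ({0, 3}, {1, 2})]

def p_some_data_lost(n_chunks, trials=20000):
    lost = 0
    for _ in range(trials):
        failed = set(random.sample(range(4), 2))   # the two dead disks
        # a chunk is gone if the dead disks form one of its mirror pairs
        if any(failed in random.choice(PAIRINGS) for _ in range(n_chunks)):
            lost += 1
    return lost / trials

for n in (10, 50, 200):
    print("%3d chunks: ~%.1f%% chance of losing some data"
          % (n, 100 * p_some_data_lost(n)))

With a fixed traditional pairing, the equivalent probability does not
grow with the amount of data stored.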

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2015-01-02 17:45                 ` Brendan Hide
@ 2015-01-02 19:41                   ` Austin S Hemmelgarn
  0 siblings, 0 replies; 29+ messages in thread
From: Austin S Hemmelgarn @ 2015-01-02 19:41 UTC (permalink / raw)
  To: Brendan Hide, ashford, Phillip Susi
  Cc: Jose Manuel Perez Bethencourt, Chris Murphy, sys.syphus, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 2485 bytes --]

On 2015-01-02 12:45, Brendan Hide wrote:
> On 2015/01/02 15:42, Austin S Hemmelgarn wrote:
>> On 2014-12-31 12:27, ashford@whisperpc.com wrote:
>>> I see this as a CRITICAL design flaw.  The reason for calling it
>>> CRITICAL
>>> is that System Administrators have been trained for >20 years that
>>> RAID-10
>>> can usually handle a dual-disk failure, but the BTRFS implementation has
>>> effectively ZERO chance of doing so.
>> No, some rather simple math
> That's the problem. The math isn't as simple as you'd expect:
>
> The example below is probably a pathological case - but here goes. Let's
> say in this 4-disk example that chunks are striped as d1,d2,d1,d2 where
> d1 is the first bit of data and d2 is the second:
> Chunk 1 might be striped across disks A,B,C,D d1,d2,d1,d2
> Chunk 2 might be striped across disks B,C,A,D d3,d4,d3,d4
> Chunk 3 might be striped across disks D,A,C,B d5,d6,d5,d6
> Chunk 4 might be striped across disks A,C,B,D d7,d8,d7,d8
> Chunk 5 might be striped across disks A,C,D,B d9,d10,d9,d10
>
> Lose any two disks and you have a 50% chance on *each* chunk to have
> lost that chunk. With traditional RAID10 you have a 50% chance of losing
> the array entirely. With btrfs, the more data you have stored, the
> chances get closer to 100% of losing *some* data in a 2-disk failure.
>
> In the above example, losing A and B means you lose d3, d6, and d7
> (which ends up being 60% of all chunks).
> Losing A and C means you lose d1 (20% of all chunks).
> Losing A and D means you lose d9 (20% of all chunks).
> Losing B and C means you lose d10 (20% of all chunks).
> Losing B and D means you lose d2 (20% of all chunks).
> Losing C and D means you lose d4,d5, AND d8 (60% of all chunks)
>
> The above skewed example has an average of 40% of all chunks failed. As
> you add more data and randomise the allocation, this will approach 50% -
> BUT, the chances of losing *some* data is already clearly shown to be
> very close to 100%.
>
OK, I forgot about the randomization effect that the chunk allocation 
and freeing has.  We really should slap a *BIG* warning label on that 
(and ideally find some better way to do it so it's more reliable).

As an aside, I've found that a BTRFS raid1 set on top of 2 LVM/MD RAID0 
sets is actually faster than using a BTRFS raid10 set with the same 
number of disks (how much faster is workload dependent), and provides 
better guarantees than a BTRFS raid10 set.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2455 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2014-12-29 18:56 I need to P. are we almost there yet? sys.syphus
  2014-12-29 19:00 ` sys.syphus
  2014-12-29 21:13 ` Chris Murphy
@ 2015-01-03 11:34 ` Bob Marley
  2015-01-03 13:11   ` Duncan
  2 siblings, 1 reply; 29+ messages in thread
From: Bob Marley @ 2015-01-03 11:34 UTC (permalink / raw)
  To: sys.syphus, linux-btrfs

On 29/12/2014 19:56, sys.syphus wrote:
> specifically (P)arity. very specifically n+2. when will raid5 & raid6
> be at least as safe to run as raid1 currently is? I don't like the
> idea of being 2 bad drives away from total catastrophe.
>
> (and yes i backup, it just wouldn't be fun to go down that route.)

What about using btrfs on top of MD raid?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2015-01-03 11:34 ` Bob Marley
@ 2015-01-03 13:11   ` Duncan
  2015-01-03 18:53     ` Bob Marley
                       ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Duncan @ 2015-01-03 13:11 UTC (permalink / raw)
  To: linux-btrfs

Bob Marley posted on Sat, 03 Jan 2015 12:34:41 +0100 as excerpted:

> On 29/12/2014 19:56, sys.syphus wrote:
>> specifically (P)arity. very specifically n+2. when will raid5 & raid6
>> be at least as safe to run as raid1 currently is? I don't like the idea
>> of being 2 bad drives away from total catastrophe.
>>
>> (and yes i backup, it just wouldn't be fun to go down that route.)
> 
> What about using btrfs on top of MD raid?

The problem with that is data integrity.  mdraid doesn't have it.  btrfs 
does.

If you present a single mdraid device to btrfs and run single mode on it, 
and one copy on the mdraid is corrupt, mdraid may well simply present it 
as it does no integrity checking.  btrfs will catch and reject that, but 
because it sees a single device, it'll think the entire thing is corrupt.

If you present multiple devices to btrfs and run btrfs raid1 mode, it'll 
have a second copy to check, but if a bad copy exists on each side and 
that's the copy mdraid hands btrfs, again, btrfs will reject it, having 
no idea there's actually a good copy on the mdraid underneath; the mdraid 
simply didn't happen to pick that copy to present.

And mdraid-5/6 doesn't make things any better, because unless there's a 
problem, mdraid will simply read and present the data, ignoring the 
parity with which it could probably correct the bad data (at least with 
raid6).

The only way to get truly verified data with triple-redundancy or 2X 
parity or better is when btrfs handles it, as it keeps and actually 
checks checksums to verify.
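
(A schematic sketch of the difference, in Python rather than anything
resembling the real code paths, and using crc32 where btrfs actually
uses crc32c:)

# Why btrfs-managed mirrors can self-heal while checksumming on top of
# a single md device can only detect: repair needs a second reachable copy.
import zlib

def read_block(copies, stored_csum):
    # copies = the candidate blocks btrfs can reach for one logical block.
    # With btrfs raid1 there are two; on top of md, btrfs sees only one,
    # whichever copy md happened to return.
    for data in copies:
        if zlib.crc32(data) == stored_csum:
            return data     # good copy found (btrfs would also rewrite the bad one)
    raise IOError("checksum mismatch and no good copy reachable")

good = b"important data"
bad  = b"important dat\x00"
csum = zlib.crc32(good)

print(read_block([bad, good], csum))   # btrfs raid1: transparently repaired
try:
    read_block([bad], csum)            # btrfs on md: detected, but just EIO
except IOError as err:
    print("EIO:", err)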

But btrfs raid56 mode should be complete with kernel 3.19 and presumably 
btrfs-progs 3.19 tho I'd give it a kernel or two to mature to be sure.
N-way-mirroring (my particular hotly awaited feature) is next up, but 
given the time raid56 took, I don't think anybody's predicting when it'll 
be actually in-tree and ready for use.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2015-01-03 13:11   ` Duncan
@ 2015-01-03 18:53     ` Bob Marley
  2015-01-03 19:03       ` sys.syphus
  2015-01-03 18:55     ` sys.syphus
  2015-01-03 21:58     ` Roman Mamedov
  2 siblings, 1 reply; 29+ messages in thread
From: Bob Marley @ 2015-01-03 18:53 UTC (permalink / raw)
  To: linux-btrfs

On 03/01/2015 14:11, Duncan wrote:
> Bob Marley posted on Sat, 03 Jan 2015 12:34:41 +0100 as excerpted:
>
>> On 29/12/2014 19:56, sys.syphus wrote:
>>> specifically (P)arity. very specifically n+2. when will raid5 & raid6
>>> be at least as safe to run as raid1 currently is? I don't like the idea
>>> of being 2 bad drives away from total catastrophe.
>>>
>>> (and yes i backup, it just wouldn't be fun to go down that route.)
>> What about using btrfs on top of MD raid?
> The problem with that is data integrity.  mdraid doesn't have it.  btrfs
> does.
>
> If you present a single mdraid device to btrfs and run single mode on it,
> and one copy on the mdraid is corrupt, mdraid may well simply present it
> as it does no integrity checking.  btrfs will catch and reject that, but
> because it sees a single device, it'll think the entire thing is corrupt.

Which is really not bad, considering the chance that something gets
corrupted. It is already an exceedingly rare event. Detection without
correction can be more than enough. Things in computing have always
worked without even the detection feature.
Most likely your bank account and mine are held in databases which sit
on filesystems or block devices that do not even have the corruption
detection feature.
And, last but not least, as of now a btrfs bug is more likely than a
hard disk's silent data corruption.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2015-01-03 13:11   ` Duncan
  2015-01-03 18:53     ` Bob Marley
@ 2015-01-03 18:55     ` sys.syphus
  2015-01-04  3:22       ` Duncan
  2015-01-03 21:58     ` Roman Mamedov
  2 siblings, 1 reply; 29+ messages in thread
From: sys.syphus @ 2015-01-03 18:55 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

>
> But btrfs raid56 mode should be complete with kernel 3.19 and presumably
> btrfs-progs 3.19 tho I'd give it a kernel or two to mature to be sure.
> N-way-mirroring (my particular hotly awaited feature) is next up, but
> given the time raid56 took, I don't think anybody's predicting when it'll
> be actually in-tree and ready for use.
>

Is that the feature where you say I want x copies of this file and y
copies of this other file? I.e. raid at the file level, with the
ability to adjust redundancy per file?

I wonder if there is any sort of band-aid you can put on top of btrfs
to give some of this redundancy. Things like git annex exist, but I
don't love its bugs and oddball choice of programming language.

Do you guys use any other open source tools on top of btrfs to help
manage your data? (i.e. git annex; camlistore)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2015-01-03 18:53     ` Bob Marley
@ 2015-01-03 19:03       ` sys.syphus
  0 siblings, 0 replies; 29+ messages in thread
From: sys.syphus @ 2015-01-03 19:03 UTC (permalink / raw)
  To: Bob Marley; +Cc: linux-btrfs

>
> Which is really not bad, considering the chance that something gets corrupt.
> Already it is an exceedingly rare event. Detection without correction can be
> more than enough. Since always things have worked in the computer science
> field without even the detection feature.
> Most likely even your bank account and mine are held in databases which are
> located in filesystems or blockdevices which do not even have the corruption
> detection feature.
> And, last but not least, as of now a btrfs bug is more likely than hard
> disks' silent data corruption.
>
>

I think that's dangerous thinking, and it's what has gotten us here.

The whole point of zfs / btrfs is that, due to the current size of
storage, what was previously unlikely is now a statistical certainty.
In short, Murphy's law.

We are now using green drives and s3 fuse and shitty flash media; the
era of trusting the block device is over.
^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2015-01-03 13:11   ` Duncan
  2015-01-03 18:53     ` Bob Marley
  2015-01-03 18:55     ` sys.syphus
@ 2015-01-03 21:58     ` Roman Mamedov
  2015-01-04  3:24       ` Duncan
  2 siblings, 1 reply; 29+ messages in thread
From: Roman Mamedov @ 2015-01-03 21:58 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Sat, 3 Jan 2015 13:11:57 +0000 (UTC)
Duncan <1i5t5.duncan@cox.net> wrote:

> > What about using btrfs on top of MD raid?
> 
> The problem with that is data integrity.  mdraid doesn't have it.  btrfs 
> does.

Most importantly, however, you aren't any worse off with Btrfs on top of MD
than with Btrfs on a single device, or with Ext4/XFS/JFS/etc on top of MD.

Sure you don't get checksum-based recovery from partial corruption of a RAID,
but you do get other features of Btrfs, such as robust snapshot support,
ability to online-resize up and down, compression, and actually, checksum
verification: even if it won't be able to recover from a corruption, at least
it will warn you of it (and you could recover from backups), while other FSes
will pass through the corrupted data silently.

So until Btrfs multi-device support is feature-complete (and yes, that
includes performance-wise), running Btrfs in single-device mode on top of
MD RAID is arguably the best way to use Btrfs in a RAID setup.

(Personally I am running Btrfs on top of 7x2TB MD RAID6, 3x2TB MD RAID5 and
2x2TB MD RAID1).

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2015-01-03 18:55     ` sys.syphus
@ 2015-01-04  3:22       ` Duncan
  2015-01-04  3:54         ` Hugo Mills
  0 siblings, 1 reply; 29+ messages in thread
From: Duncan @ 2015-01-04  3:22 UTC (permalink / raw)
  To: linux-btrfs

sys.syphus posted on Sat, 03 Jan 2015 12:55:27 -0600 as excerpted:

>> But btrfs raid56 mode should be complete with kernel 3.19 and
>> presumably btrfs-progs 3.19 tho I'd give it a kernel or two to mature
>> to be sure. N-way-mirroring (my particular hotly awaited feature) is
>> next up, but given the time raid56 took, I don't think anybody's
>> predicting when it'll be actually in-tree and ready for use.
>>
>>
> is that the feature where you say i want x copies of this file and y
> copies of this other file? e.g. raid at the file level, with the ability
> to adjust redundancy by file?

Per-file isn't available yet, tho at least per-subvolume is roadmapped, 
and now that we have the properties framework working via xattr for files 
as well, at least in theory, there is AFAIK no reason to limit it to per-
subvolume, as per-file should be about as easy once the code that 
currently limits it to per-filesystem is rewritten.

But actually fully working per-filesystem raid56 is enough for a lot of 
people, and actually working per-filesystem N-way-mirroring is what I'm 
after, since I already set up multiple filesystems in order to keep my 
data eggs from all being in the same filesystem basket.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2015-01-03 21:58     ` Roman Mamedov
@ 2015-01-04  3:24       ` Duncan
  0 siblings, 0 replies; 29+ messages in thread
From: Duncan @ 2015-01-04  3:24 UTC (permalink / raw)
  To: linux-btrfs

Roman Mamedov posted on Sun, 04 Jan 2015 02:58:35 +0500 as excerpted:

> On Sat, 3 Jan 2015 13:11:57 +0000 (UTC)
> Duncan <1i5t5.duncan@cox.net> wrote:
> 
>> > What about using btrfs on top of MD raid?
>> 
>> The problem with that is data integrity.  mdraid doesn't have it. 
>> btrfs does.
> 
> Most importantly however, you aren't any worse off with Btrfs on top of
> MD, than with Btrfs on a single device, or with Ext4/XFS/JFS/etc on top
> of MD.

Good point! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: I need to P. are we almost there yet?
  2015-01-04  3:22       ` Duncan
@ 2015-01-04  3:54         ` Hugo Mills
  0 siblings, 0 replies; 29+ messages in thread
From: Hugo Mills @ 2015-01-04  3:54 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2742 bytes --]

On Sun, Jan 04, 2015 at 03:22:53AM +0000, Duncan wrote:
> sys.syphus posted on Sat, 03 Jan 2015 12:55:27 -0600 as excerpted:
> 
> >> But btrfs raid56 mode should be complete with kernel 3.19 and
> >> presumably btrfs-progs 3.19 tho I'd give it a kernel or two to mature
> >> to be sure. N-way-mirroring (my particular hotly awaited feature) is
> >> next up, but given the time raid56 took, I don't think anybody's
> >> predicting when it'll be actually in-tree and ready for use.
> >>
> >>
> > is that the feature where you say i want x copies of this file and y
> > copies of this other file? e.g. raid at the file level, with the ability
> > to adjust redundancy by file?
> 
> Per-file isn't available yet, tho at least per-subvolume is roadmapped, 
> and now that we have the properties framework working via xattr for files 
> as well, at least in theory, there is AFAIK no reason to limit it to per-
> subvolume, as per-file should be about as easy once the code that 
> currently limits it to per-filesystem is rewritten.

   "roadmapped" --> "fond wish".

   Also, per-file is a bit bloody awkward to get working. Having sat
and thought about it hard for a while, I'm not convinced that it would
actually be worth the implementation effort.

   Certainly, nobody should be thinking about having (say) a different
RAID config for every file -- that way lies madness. I would expect,
at most, "small integers" (<=3) of different profiles for data in any
given filesystem, with the majority of data being of one particular
profile. Anything trying to get more spohisticated than that is likely
asking for intractable space-allocation problems. Think, "requiring
regular full-balance operations".

   The behaviour of the chunk allocator in the presence of merely two
allocation profiles (data/metadata) is awkward enough. Introducing
more of them is something that will require a separate research
programme to understand fully.

   I will probably have an opportunity to discuss the basics of
multiple allocation schemes with someone more qualified than I am on
Tuesday, but I doubt that we'll reach any firm conclusion for many
months at best (if ever). The formal maths involved gets quite nasty,
quite quickly.

   Hugo.

> But actually fully working per-filesystem raid56 is enough for a lot of 
> people, and actually working per-filesystem N-way-mirroring is what I'm 
> after, since I already setup multiple filesystems in ordered to keep my 
> data eggs from all being in the same filesystem basket.
> 

-- 
Hugo Mills             | If it's December 1941 in Casablanca, what time is it
hugo@... carfax.org.uk | in New York?
http://carfax.org.uk/  |
PGP: 65E74AC0          |                               Rick Blaine, Casablanca

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2015-01-04  3:54 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-29 18:56 I need to P. are we almost there yet? sys.syphus
2014-12-29 19:00 ` sys.syphus
2014-12-29 19:04   ` Hugo Mills
2014-12-29 20:25     ` sys.syphus
2014-12-29 21:50       ` Hugo Mills
2014-12-29 21:16   ` Chris Murphy
2014-12-30  0:20     ` ashford
     [not found]       ` <CALBWd85UsSih24RhwpmDeMjuMWCKj9dGeuZes5POj6qEFkiz2w@mail.gmail.com>
2014-12-30 17:09         ` Fwd: " Jose Manuel Perez Bethencourt
2014-12-30 21:44       ` Phillip Susi
2014-12-30 23:17         ` ashford
2014-12-31  2:45           ` Phillip Susi
2014-12-31 17:27             ` ashford
2014-12-31 23:38               ` Phillip Susi
2015-01-01  1:26               ` Chris Samuel
2015-01-01 20:12                 ` Roger Binns
2015-01-02  3:47                   ` Duncan
2015-01-02 13:42               ` Austin S Hemmelgarn
2015-01-02 17:45                 ` Brendan Hide
2015-01-02 19:41                   ` Austin S Hemmelgarn
2014-12-29 21:13 ` Chris Murphy
2015-01-03 11:34 ` Bob Marley
2015-01-03 13:11   ` Duncan
2015-01-03 18:53     ` Bob Marley
2015-01-03 19:03       ` sys.syphus
2015-01-03 18:55     ` sys.syphus
2015-01-04  3:22       ` Duncan
2015-01-04  3:54         ` Hugo Mills
2015-01-03 21:58     ` Roman Mamedov
2015-01-04  3:24       ` Duncan
