* Recommended why to use btrfs for production?
@ 2016-06-03  9:49 Martin
  2016-06-03  9:53 ` Marc Haber
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Martin @ 2016-06-03  9:49 UTC (permalink / raw)
  To: linux-btrfs

Hello,

We would like to use urBackup to make laptop backups, and they mention
btrfs as an option.

https://www.urbackup.org/administration_manual.html#x1-8400010.6

So if we go with btrfs and we need 100TB usable space in raid6, and to
have it replicated each night to another btrfs server for "backup" of
the backup, how should we then install btrfs?

E.g. should we use the latest Fedora, CentOS, Ubuntu, or Ubuntu LTS, or
should we compile the kernel ourselves?

And a bonus question: how stable is raid6, including detecting and
replacing failed drives?

-RC


* Re: Recommended why to use btrfs for production?
  2016-06-03  9:49 Recommended why to use btrfs for production? Martin
@ 2016-06-03  9:53 ` Marc Haber
  2016-06-03  9:57   ` Martin
  2016-06-03 10:01 ` Hans van Kranenburg
  2016-06-03 12:55 ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 28+ messages in thread
From: Marc Haber @ 2016-06-03  9:53 UTC (permalink / raw)
  To: linux-btrfs

On Fri, Jun 03, 2016 at 11:49:09AM +0200, Martin wrote:
> We would like to use urBackup to make laptop backups, and they mention
> btrfs as an option.
> 
> https://www.urbackup.org/administration_manual.html#x1-8400010.6
> 
> So if we go with btrfs and we need 100TB usable space in raid6, and to
> have it replicated each night to another btrfs server for "backup" of
> the backup, how should we then install btrfs?

Do you plan to use Snapshots? How many of them?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


* Re: Recommended why to use btrfs for production?
  2016-06-03  9:53 ` Marc Haber
@ 2016-06-03  9:57   ` Martin
  0 siblings, 0 replies; 28+ messages in thread
From: Martin @ 2016-06-03  9:57 UTC (permalink / raw)
  To: Marc Haber; +Cc: linux-btrfs

> Do you plan to use Snapshots? How many of them?

Yes, a minimum of 7, one for each day of the week.

Nice to have would be 4 extra, one for each week of the month, and then
12 more, one for each month of the year.
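
For illustration only, a daily snapshot-plus-replication cycle along
those lines could look roughly like the sketch below.  The subvolume
layout under /backups, the host name backup2 and the 7-day retention
are placeholders, not anything from this thread:

  # take today's read-only snapshot of the backup subvolume
  TODAY=$(date +%F)
  YESTERDAY=$(date -d yesterday +%F)
  btrfs subvolume snapshot -r /backups/current /backups/snap/$TODAY

  # replicate it incrementally to the second btrfs server
  btrfs send -p /backups/snap/$YESTERDAY /backups/snap/$TODAY \
      | ssh backup2 btrfs receive /backups/snap

  # keep only the last 7 daily snapshots
  for s in $(ls -d /backups/snap/* | head -n -7); do
      btrfs subvolume delete "$s"
  done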


* Re: Recommended why to use btrfs for production?
  2016-06-03  9:49 Recommended why to use btrfs for production? Martin
  2016-06-03  9:53 ` Marc Haber
@ 2016-06-03 10:01 ` Hans van Kranenburg
  2016-06-03 10:15   ` Martin
  2016-06-03 12:55 ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 28+ messages in thread
From: Hans van Kranenburg @ 2016-06-03 10:01 UTC (permalink / raw)
  To: Martin, linux-btrfs

Hi Martin,

On 06/03/2016 11:49 AM, Martin wrote:
>
> We would like to use urBackup to make laptop backups, and they mention
> btrfs as an option.
>
> [...]
>
> And a bonus question: How stable is raid6 and detecting and replacing
> failed drives?

Before trying RAID5/6 in production, be sure to read posts like these:

http://www.spinics.net/lists/linux-btrfs/msg55642.html

o/

Hans van Kranenburg


* Re: Recommended why to use btrfs for production?
  2016-06-03 10:01 ` Hans van Kranenburg
@ 2016-06-03 10:15   ` Martin
  0 siblings, 0 replies; 28+ messages in thread
From: Martin @ 2016-06-03 10:15 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: linux-btrfs

> Before trying RAID5/6 in production, be sure to read posts like these:
>
> http://www.spinics.net/lists/linux-btrfs/msg55642.html

Very interesting post, and a very recent one too.

If I decide to try raid6 (with everything of course replicated each
day, as a bit of a safety net) and disks begin to fail, how much help
am I likely to get from this list with recovery?


* Re: Recommended why to use btrfs for production?
  2016-06-03  9:49 Recommended why to use btrfs for production? Martin
  2016-06-03  9:53 ` Marc Haber
  2016-06-03 10:01 ` Hans van Kranenburg
@ 2016-06-03 12:55 ` Austin S. Hemmelgarn
  2016-06-03 13:31   ` Martin
  2016-06-03 14:05   ` Chris Murphy
  2 siblings, 2 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-06-03 12:55 UTC (permalink / raw)
  To: Martin, linux-btrfs

On 2016-06-03 05:49, Martin wrote:
> Hello,
>
> We would like to use urBackup to make laptop backups, and they mention
> btrfs as an option.
>
> https://www.urbackup.org/administration_manual.html#x1-8400010.6
>
> So if we go with btrfs and we need 100TB usable space in raid6, and to
> have it replicated each night to another btrfs server for "backup" of
> the backup, how should we then install btrfs?
>
> E.g. Should we use the latest Fedora, CentOS, Ubuntu, Ubuntu LTS, or
> should we compile the kernel our self?
In general, avoid Ubuntu LTS versions when dealing with BTRFS, as well 
as most enterprise distros; they all tend to back-port patches instead 
of using newer kernels, which means it's functionally impossible to 
provide good support for them here (because we can't know for sure what 
exactly they've back-ported).  I'd suggest building your own kernel if 
possible, with Arch Linux being a close second (they follow upstream 
very closely), followed by Fedora and non-LTS Ubuntu.
>
> And a bonus question: How stable is raid6 and detecting and replacing
> failed drives?
Do not use BTRFS raid6 mode in production; it has at least 2 known 
serious bugs that may cause complete loss of the array due to a disk 
failure.  Both of these issues have as-yet-unknown trigger conditions, 
although they do seem to occur more frequently with larger arrays.

That said, there are other options.  If you have enough disks, you can 
run BTRFS raid1 on top of LVM or MD RAID5 or RAID6, which provides you 
with the benefits of both.

Alternatively, you could use BTRFS raid1 on top of LVM or MD RAID1, 
which actually gets relatively decent performance and can provide even 
better guarantees than RAID6 would (depending on how you set it up, you 
can lose a lot more disks safely).  If you go this way, I'd suggest 
setting up disks in pairs at the lower level, and then just let BTRFS 
handle spanning the data across disks (BTRFS raid1 mode keeps exactly 
two copies of each block).  While this is not quite as efficient as just 
doing LVM based RAID6 with a traditional FS on top, it's also a lot 
easier to handle reshaping the array on-line because of the device 
management in BTRFS itself.
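
As a rough sketch of that layout (BTRFS raid1 over MD RAID1 pairs), 
with device names and the mount point as placeholders only:

  # pair the disks up at the MD level, one RAID1 per pair
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd

  # let btrfs keep two copies of every block across the pairs
  mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
  mount /dev/md0 /mnt/backups

Adding another pair later is then just another mdadm --create plus a 
btrfs device add.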


* Re: Recommended why to use btrfs for production?
  2016-06-03 12:55 ` Austin S. Hemmelgarn
@ 2016-06-03 13:31   ` Martin
  2016-06-03 13:47     ` Julian Taylor
  2016-06-03 14:21     ` Austin S. Hemmelgarn
  2016-06-03 14:05   ` Chris Murphy
  1 sibling, 2 replies; 28+ messages in thread
From: Martin @ 2016-06-03 13:31 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

> In general, avoid Ubuntu LTS versions when dealing with BTRFS, as well as
> most enterprise distros, they all tend to back-port patches instead of using
> newer kernels, which means it's functionally impossible to provide good
> support for them here (because we can't know for sure what exactly they've
> back-ported).  I'd suggest building your own kernel if possible, with Arch
> Linux being a close second (they follow upstream very closely), followed by
> Fedora and non-LTS Ubuntu.

Then I would build my own, if that is the preferred option.

> Do not use BTRFS raid6 mode in production, it has at least 2 known serious
> bugs that may cause complete loss of the array due to a disk failure.  Both
> of these issues have as of yet unknown trigger conditions, although they do
> seem to occur more frequently with larger arrays.

Ok. No raid6.

> That said, there are other options.  If you have enough disks, you can run
> BTRFS raid1 on top of LVM or MD RAID5 or RAID6, which provides you with the
> benefits of both.
>
> Alternatively, you could use BTRFS raid1 on top of LVM or MD RAID1, which
> actually gets relatively decent performance and can provide even better
> guarantees than RAID6 would (depending on how you set it up, you can lose a
> lot more disks safely).  If you go this way, I'd suggest setting up disks in
> pairs at the lower level, and then just let BTRFS handle spanning the data
> across disks (BTRFS raid1 mode keeps exactly two copies of each block).
> While this is not quite as efficient as just doing LVM based RAID6 with a
> traditional FS on top, it's also a lot easier to handle reshaping the array
> on-line because of the device management in BTRFS itself.

Right now I only have 10TB of backup data, but this will grow when
urbackup is rolled out. So maybe I could get away with plain btrfs
raid10 for the first year, and then re-balance to raid6 when the two
bugs have been found...

Is the failed disk handling in btrfs raid10 considered stable?


* Re: Recommended why to use btrfs for production?
  2016-06-03 13:31   ` Martin
@ 2016-06-03 13:47     ` Julian Taylor
  2016-06-03 14:21     ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 28+ messages in thread
From: Julian Taylor @ 2016-06-03 13:47 UTC (permalink / raw)
  To: Martin; +Cc: linux-btrfs

On 06/03/2016 03:31 PM, Martin wrote:
>> In general, avoid Ubuntu LTS versions when dealing with BTRFS, as well as
>> most enterprise distros, they all tend to back-port patches instead of using
>> newer kernels, which means it's functionally impossible to provide good
>> support for them here (because we can't know for sure what exactly they've
>> back-ported).  I'd suggest building your own kernel if possible, with Arch
>> Linux being a close second (they follow upstream very closely), followed by
>> Fedora and non-LTS Ubuntu.
>
> Then I would build my own, if that is the preferred option.
>

Ubuntu also provides newer kernels for their LTS via the Hardware 
Enablement Stack:

https://wiki.ubuntu.com/Kernel/LTSEnablementStack

So if you can live with roughly a 6-month time lag, and with the 
shorter support period of the non-LTS versions of those kernels, that 
is a good option.  As you can see, 16.04 currently provides 4.4, and 
the next update will likely be 4.8.


* Re: Recommended why to use btrfs for production?
  2016-06-03 12:55 ` Austin S. Hemmelgarn
  2016-06-03 13:31   ` Martin
@ 2016-06-03 14:05   ` Chris Murphy
  2016-06-03 14:11     ` Martin
  2016-06-05 10:45     ` Mladen Milinkovic
  1 sibling, 2 replies; 28+ messages in thread
From: Chris Murphy @ 2016-06-03 14:05 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Martin, Btrfs BTRFS

On Fri, Jun 3, 2016 at 6:55 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

>
> That said, there are other options.  If you have enough disks, you can run
> BTRFS raid1 on top of LVM or MD RAID5 or RAID6, which provides you with the
> benefits of both.

There is a trade-off. mdadm or lvm raid5/raid6 is more mature and
stable, but it's more maintenance: you have a btrfs scrub as well as
the md scrub. Btrfs on md/lvm raid56 will detect mismatches but won't
be able to fix them, because from its perspective there's no
redundancy (except possibly for metadata), so the repair has to happen
on the mdadm/lvm side.

Make certain the kernel command timer value is greater than the drive's
error recovery timeout. The former is found in sysfs, per block
device; the latter can be queried and set with smartctl. The wrong
configuration is common (it's actually the default) when using
consumer drives, and inevitably leads to problems, even the loss of
the entire array. It really is a terrible default.
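
For reference, checking (and, where supported, setting) the two values 
might look like this, with sda as a placeholder device:

  # kernel command timer for the block device, in seconds
  cat /sys/block/sda/device/timeout

  # drive error recovery timeout (SCT ERC), reported in tenths of a second
  smartctl -l scterc /dev/sda

  # e.g. cap error recovery at 7.0 s for reads and writes, comfortably
  # below the 30 s kernel default (only if the drive supports SCT ERC)
  smartctl -l scterc,70,70 /dev/sda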


-- 
Chris Murphy


* Re: Recommended why to use btrfs for production?
  2016-06-03 14:05   ` Chris Murphy
@ 2016-06-03 14:11     ` Martin
  2016-06-03 15:33       ` Austin S. Hemmelgarn
  2016-06-04  1:34       ` Chris Murphy
  2016-06-05 10:45     ` Mladen Milinkovic
  1 sibling, 2 replies; 28+ messages in thread
From: Martin @ 2016-06-03 14:11 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS

> Make certain the kernel command timer value is greater than the driver
> error recovery timeout. The former is found in sysfs, per block
> device, the latter can be get and set with smartctl. Wrong
> configuration is common (it's actually the default) when using
> consumer drives, and inevitably leads to problems, even the loss of
> the entire array. It really is a terrible default.

Are nearline SAS drives considered consumer drives?


* Re: Recommended why to use btrfs for production?
  2016-06-03 13:31   ` Martin
  2016-06-03 13:47     ` Julian Taylor
@ 2016-06-03 14:21     ` Austin S. Hemmelgarn
  2016-06-03 14:39       ` Martin
                         ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-06-03 14:21 UTC (permalink / raw)
  To: Martin; +Cc: linux-btrfs

On 2016-06-03 09:31, Martin wrote:
>> In general, avoid Ubuntu LTS versions when dealing with BTRFS, as well as
>> most enterprise distros, they all tend to back-port patches instead of using
>> newer kernels, which means it's functionally impossible to provide good
>> support for them here (because we can't know for sure what exactly they've
>> back-ported).  I'd suggest building your own kernel if possible, with Arch
>> Linux being a close second (they follow upstream very closely), followed by
>> Fedora and non-LTS Ubuntu.
>
> Then I would build my own, if that is the preferred option.
If you do go this route, make sure to keep an eye on the mailing list, 
as this is usually where any bugs get reported.  New bugs have 
thankfully been decreasing in number each release, but they do still 
happen, and it's important to know what to avoid and what to look out 
for when dealing with something under such active development.
>
>> Do not use BTRFS raid6 mode in production, it has at least 2 known serious
>> bugs that may cause complete loss of the array due to a disk failure.  Both
>> of these issues have as of yet unknown trigger conditions, although they do
>> seem to occur more frequently with larger arrays.
>
> Ok. No raid6.
>
>> That said, there are other options.  If you have enough disks, you can run
>> BTRFS raid1 on top of LVM or MD RAID5 or RAID6, which provides you with the
>> benefits of both.
>>
>> Alternatively, you could use BTRFS raid1 on top of LVM or MD RAID1, which
>> actually gets relatively decent performance and can provide even better
>> guarantees than RAID6 would (depending on how you set it up, you can lose a
>> lot more disks safely).  If you go this way, I'd suggest setting up disks in
>> pairs at the lower level, and then just let BTRFS handle spanning the data
>> across disks (BTRFS raid1 mode keeps exactly two copies of each block).
>> While this is not quite as efficient as just doing LVM based RAID6 with a
>> traditional FS on top, it's also a lot easier to handle reshaping the array
>> on-line because of the device management in BTRFS itself.
>
> Right now I only have 10TB of backup data, but this is grow when
> urbackup is roled out. So maybe I could get a way with plain btrfs
> raid10 for the first year, and then re-balance to raid6 when the two
> bugs have been found...
>
> is the failed disk handling in btrfs raid10 considered stable?
>
I would say it is, but I also don't have quite as much experience with 
it as with BTRFS raid1 mode.  The one thing I do know for certain about 
it is that even if it theoretically could recover from two failed disks 
(i.e., if they're from different positions in the striping of each 
mirror), there is no code to actually do so, so make sure you replace 
any failed disks as soon as possible (or at least balance the array so 
that you don't have a missing device anymore).

Most of my systems where I would run raid10 mode are set up as BTRFS 
raid1 on top of two LVM based RAID0 volumes, as this gets measurably 
better performance than BTRFS raid10 mode at the moment (I see roughly a 
10-20% difference on my home server system), and provides the same data 
safety guarantees as well.  It's worth noting for such a setup that the 
current default block size in BTRFS is 16k except on very small 
filesystems, so you may want a larger stripe size than you would on a 
traditional filesystem.
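
A rough sketch of that layout, assuming four disks split into two 
volume groups of two (names, sizes and the 256 KiB stripe size are just 
placeholders):

  # one striped (RAID0) LV per half of the disks
  lvcreate -i 2 -I 256 -l 100%FREE -n r0a vg_a
  lvcreate -i 2 -I 256 -l 100%FREE -n r0b vg_b

  # btrfs raid1 across the two striped volumes
  mkfs.btrfs -d raid1 -m raid1 /dev/vg_a/r0a /dev/vg_b/r0b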

As far as BTRFS raid10 mode in general, there are a few things that are 
important to remember about it:
1. It stores exactly two copies of everything; any extra disks just add 
to the stripe length on each copy.
2. Because each stripe has the same number of disks as its mirrored 
partner, the total number of disks in any chunk allocation will always 
be even, which means that if you're using an odd number of disks, there 
will always be one left out of every chunk.  This has limited impact on 
actual performance usually, but can cause confusing results if you have 
differently sized disks.
3. BTRFS (whether using raid10, raid0, or even raid5/6) will always try 
to use as many devices as possible for a stripe.  As a result of this, 
the moment you add a new disk, the total length of all new stripes will 
adjust to fit the new configuration.  If you want maximal performance 
when adding new disks, make sure to balance the rest of the filesystem 
afterwards, otherwise any existing stripes will just stay the same size 
(see the sketch below).
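
A minimal sketch of that add-then-rebalance step (device name and mount 
point are placeholders):

  btrfs device add /dev/sde /mnt/backups
  # restripe existing chunks across all devices, including the new one
  btrfs balance start /mnt/backups
  # check how the chunks are spread afterwards
  btrfs filesystem usage /mnt/backups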


* Re: Recommended why to use btrfs for production?
  2016-06-03 14:21     ` Austin S. Hemmelgarn
@ 2016-06-03 14:39       ` Martin
  2016-06-03 19:09       ` Christoph Anton Mitterer
  2016-06-09  6:16       ` Duncan
  2 siblings, 0 replies; 28+ messages in thread
From: Martin @ 2016-06-03 14:39 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Btrfs BTRFS

> I would say it is, but I also don't have quite as much experience with it as
> with BTRFS raid1 mode.  The one thing I do know for certain about it is that
> even if it theoretically could recover from two failed disks (ie, if they're
> from different positions in the striping of each mirror), there is no code
> to actually do so, so make sure you replace any failed disks as soon as
> possible (or at least balance the array so that you don't have a missing
> device anymore).

Ok, so that really speaks for raid1...

> Most of my systems where I would run raid10 mode are set up as BTRFS raid1
> on top of two LVM based RAID0 volumes, as this gets measurably better
> performance than BTRFS raid10 mode at the moment (I see roughly a 10-20%
> difference on my home server system), and provides the same data safety
> guarantees as well.  It's worth noting for such a setup that the current
> default block size in BTRFS is 16k except on very small filesystems, so you
> may want a larger stripe size than you would on a traditional filesystem.
>
> As far as BTRFS raid10 mode in general, there are a few things that are
> important to remember about it:
> 1. It stores exactly two copies of everything, any extra disks just add to
> the stripe length on each copy.
> 2. Because each stripe has the same number of disks as it's mirrored
> partner, the total number of disks in any chunk allocation will always be
> even, which means that if your using an odd number of disks, there will
> always be one left out of every chunk.  This has limited impact on actual
> performance usually, but can cause confusing results if you have differently
> sized disks.
> 3. BTRFS (whether using raid10, raid0, or even raid5/6) will always try to
> use as many devices as possible for a stripe.  As a result of this, the
> moment you add a new disk, the total length of all new stripes will adjust
> to fit the new configuration.  If you want maximal performance when adding
> new disks, make sure to balance the rest of the filesystem afterwards,
> otherwise any existing stripes will just stay the same size.

Those are very good things to know!


* Re: Recommended why to use btrfs for production?
  2016-06-03 14:11     ` Martin
@ 2016-06-03 15:33       ` Austin S. Hemmelgarn
  2016-06-04  0:48         ` Nicholas D Steeves
  2016-06-04  1:34       ` Chris Murphy
  1 sibling, 1 reply; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-06-03 15:33 UTC (permalink / raw)
  To: Martin, Chris Murphy; +Cc: Btrfs BTRFS

On 2016-06-03 10:11, Martin wrote:
>> Make certain the kernel command timer value is greater than the driver
>> error recovery timeout. The former is found in sysfs, per block
>> device, the latter can be get and set with smartctl. Wrong
>> configuration is common (it's actually the default) when using
>> consumer drives, and inevitably leads to problems, even the loss of
>> the entire array. It really is a terrible default.
>
> Are nearline SAS drives considered consumer drives?
>
If it's a SAS drive, then no, especially when you start talking about 
things marketed as 'nearline'.  Additionally, SCT ERC is entirely a SATA 
thing; I forget what the equivalent in SCSI (and by extension SAS) terms 
is, but I'm pretty sure that the kernel handles things differently there.


* Re: Recommended why to use btrfs for production?
  2016-06-03 14:21     ` Austin S. Hemmelgarn
  2016-06-03 14:39       ` Martin
@ 2016-06-03 19:09       ` Christoph Anton Mitterer
  2016-06-09  6:16       ` Duncan
  2 siblings, 0 replies; 28+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-03 19:09 UTC (permalink / raw)
  To: linux-btrfs

Hey.

Does anyone know whether the write hole issues have been fixed already?
https://btrfs.wiki.kernel.org/index.php/RAID56 still mentions it.

Cheers,
Chris.



* Re: Recommended why to use btrfs for production?
  2016-06-03 15:33       ` Austin S. Hemmelgarn
@ 2016-06-04  0:48         ` Nicholas D Steeves
  2016-06-04  1:48           ` Chris Murphy
  0 siblings, 1 reply; 28+ messages in thread
From: Nicholas D Steeves @ 2016-06-04  0:48 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Martin, Chris Murphy, Btrfs BTRFS

On 3 June 2016 at 11:33, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
> On 2016-06-03 10:11, Martin wrote:
>>>
>>> Make certain the kernel command timer value is greater than the driver
>>> error recovery timeout. The former is found in sysfs, per block
>>> device, the latter can be get and set with smartctl. Wrong
>>> configuration is common (it's actually the default) when using
>>> consumer drives, and inevitably leads to problems, even the loss of
>>> the entire array. It really is a terrible default.
>>
>>
>> Are nearline SAS drives considered consumer drives?
>>
> If it's a SAS drive, then no, especially when you start talking about things
> marketed as 'nearline'.  Additionally, SCT ERC is entirely a SATA thing, I
> forget what the equivalent in SCSI (and by extension SAS) terms is, but I'm
> pretty sure that the kernel handles things differently there.

For the purposes of BTRFS RAID1: for drives that ship with SCT ERC of
7sec, is the default kernel command timeout of 30sec appropriate, or
should it be reduced?  For SATA drives that do not support SCT ERC, is
it true that 120sec is a sane value?  I forget where I got this value
of 120sec; it might have been this list, it might have been an mdadm
bug report.  Also, in terms of tuning, I've been unable to find out
whether the ideal kernel timeout value changes depending on RAID
type... is that a factor in selecting a sane kernel timeout value?

Kind regards,
Nicholas


* Re: Recommended why to use btrfs for production?
  2016-06-03 14:11     ` Martin
  2016-06-03 15:33       ` Austin S. Hemmelgarn
@ 2016-06-04  1:34       ` Chris Murphy
  1 sibling, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2016-06-04  1:34 UTC (permalink / raw)
  To: Martin; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS

On Fri, Jun 3, 2016 at 8:11 AM, Martin <rc6encrypted@gmail.com> wrote:
>> Make certain the kernel command timer value is greater than the driver
>> error recovery timeout. The former is found in sysfs, per block
>> device, the latter can be get and set with smartctl. Wrong
>> configuration is common (it's actually the default) when using
>> consumer drives, and inevitably leads to problems, even the loss of
>> the entire array. It really is a terrible default.
>
> Are nearline SAS drives considered consumer drives?

No, they should have a configurable SCT ERC setting, accessible with
smartctl. Many, possibly most, consumer drives now do not support it,
so often the only workable way to use them in any kind of
multiple-device scenario other than linear/concat or raid0 is to
significantly increase the SCSI command timer, upwards of 2 or 3
minutes. So if your use case cannot tolerate such delays, then the
drives must be disqualified.



-- 
Chris Murphy


* Re: Recommended why to use btrfs for production?
  2016-06-04  0:48         ` Nicholas D Steeves
@ 2016-06-04  1:48           ` Chris Murphy
  2016-06-06 13:29             ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2016-06-04  1:48 UTC (permalink / raw)
  To: Nicholas D Steeves
  Cc: Austin S. Hemmelgarn, Martin, Chris Murphy, Btrfs BTRFS

On Fri, Jun 3, 2016 at 6:48 PM, Nicholas D Steeves <nsteeves@gmail.com> wrote:
> On 3 June 2016 at 11:33, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>> On 2016-06-03 10:11, Martin wrote:
>>>>
>>>> Make certain the kernel command timer value is greater than the driver
>>>> error recovery timeout. The former is found in sysfs, per block
>>>> device, the latter can be get and set with smartctl. Wrong
>>>> configuration is common (it's actually the default) when using
>>>> consumer drives, and inevitably leads to problems, even the loss of
>>>> the entire array. It really is a terrible default.
>>>
>>>
>>> Are nearline SAS drives considered consumer drives?
>>>
>> If it's a SAS drive, then no, especially when you start talking about things
>> marketed as 'nearline'.  Additionally, SCT ERC is entirely a SATA thing, I
>> forget what the equivalent in SCSI (and by extension SAS) terms is, but I'm
>> pretty sure that the kernel handles things differently there.
>
> For the purposes of BTRFS RAID1: For drives that ship with SCT ERC of
> 7sec, is the default kernel command timeout of 30sec appropriate, or
> should it be reduced?

It's fine. But it depends on your use case: if it can tolerate a rare
hang of more than 7 but less than 30 seconds, and you're prepared to
start investigating the cause, then I'd leave it alone. If the use case
prefers resetting the drive when it stops responding, then you'd go
with something shorter.

I'm fairly certain SAS's command queue doesn't get obliterated by such
a link reset, just the hung command, whereas on SATA drives all
information in the queue is lost. So resets on SATA are a much bigger
penalty, if I have the correct understanding.


>  For SATA drives that do not support SC TERC, is
> it true that 120sec is a sane value?  I forget where I got this value
> of 120sec;

It's a good question. It's not well documented and is not defined in
the SATA spec, so it's probably make/model specific. The linux-raid@
list probably has the most information on this, just because their
users get nailed by this problem often. And the recommendation does
seem to vary around 120 to 180. That is of course a maximum; the drive
could give up much sooner. But what you don't want is for the drive to
be in recovery for a bad sector when the command timer does a link
reset, losing all of what the drive was doing: all of which is
replaceable except really one thing, which is what sector was having
the problem. And right now there's no report from the drive for slow
sectors. It only reports failed reads, and it's that failed read error
that includes the sector, so that the raid mechanism can figure out
what data is missing, reconstruct it from the mirror or parity, and
then fix the bad sector by writing to it.

> it might have been this list, it might have been an mdadm
> bug report.  Also, in terms of tuning, I've been unable to find
> whether the ideal kernel timeout value changes depending on RAID
> type...is that a factor in selecting a sane kernel timeout value?

No. It's strictly a value to make certain you get read errors from the
drive rather than link resets.

And that's why I think it's a bad default, because it totally thwarts
attempts by manufacturers to recover marginal sectors, even in the
single disk case.


-- 
Chris Murphy


* Re: Recommended why to use btrfs for production?
  2016-06-03 14:05   ` Chris Murphy
  2016-06-03 14:11     ` Martin
@ 2016-06-05 10:45     ` Mladen Milinkovic
  2016-06-05 16:33       ` James Johnston
  2016-06-06  1:47       ` Chris Murphy
  1 sibling, 2 replies; 28+ messages in thread
From: Mladen Milinkovic @ 2016-06-05 10:45 UTC (permalink / raw)
  To: Chris Murphy, Austin S. Hemmelgarn; +Cc: Martin, Btrfs BTRFS

On 06/03/2016 04:05 PM, Chris Murphy wrote:
> Make certain the kernel command timer value is greater than the driver
> error recovery timeout. The former is found in sysfs, per block
> device, the latter can be get and set with smartctl. Wrong
> configuration is common (it's actually the default) when using
> consumer drives, and inevitably leads to problems, even the loss of
> the entire array. It really is a terrible default.

Since it's the first time I've heard of this, I did some googling.

Here's a nice article about these timeouts:
http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-timeouts/comment-page-1/

And some udev rules that should apply this automatically:
http://comments.gmane.org/gmane.linux.raid/48193

Cheers

-- 
Mladen Milinkovic
GPG: EF9D9B26



* RE: Recommended why to use btrfs for production?
  2016-06-05 10:45     ` Mladen Milinkovic
@ 2016-06-05 16:33       ` James Johnston
  2016-06-05 18:20         ` Andrei Borzenkov
  2016-06-06  1:47       ` Chris Murphy
  1 sibling, 1 reply; 28+ messages in thread
From: James Johnston @ 2016-06-05 16:33 UTC (permalink / raw)
  To: 'Mladen Milinkovic', 'Chris Murphy',
	'Austin S. Hemmelgarn'
  Cc: 'Martin', 'Btrfs BTRFS'

On 06/05/2016 10:46 AM, Mladen Milinkovic wrote:
> On 06/03/2016 04:05 PM, Chris Murphy wrote:
> > Make certain the kernel command timer value is greater than the driver
> > error recovery timeout. The former is found in sysfs, per block
> > device, the latter can be get and set with smartctl. Wrong
> > configuration is common (it's actually the default) when using
> > consumer drives, and inevitably leads to problems, even the loss of
> > the entire array. It really is a terrible default.
> 
> Since it's first time i've heard of this I did some googling.
> 
> Here's some nice article about these timeouts:
> http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-
> timeouts/comment-page-1/
> 
> And some udev rules that should apply this automatically:
> http://comments.gmane.org/gmane.linux.raid/48193

I think the first link there is a good one.  On my system:

/sys/block/sdX/device/timeout

defaults to 30 seconds: long enough for a drive with a short TLER
setting, but too short for a consumer drive.

There is a Red Hat link on setting up a udev rule for it here:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Online_Storage_Reconfiguration_Guide/task_controlling-scsi-command-timer-onlining-devices.html

I thought it looked a little funny, so I combined the above with one of the
VMware udev rules pre-installed on my Ubuntu system and came up with this:

# Change 180 below to a timeout of your choosing:
ACTION=="add|change", SUBSYSTEMS=="scsi", ATTRS{type}=="0|7|14", \
RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"

Now my attached drives automatically get this timeout without any scripting
or manual setting of the timeout.
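
To apply the rule to drives that are already attached (without waiting 
for a reboot or re-plug), and to check the result, something like this 
should do:

  udevadm control --reload
  udevadm trigger --subsystem-match=block --action=change
  cat /sys/block/sd*/device/timeout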

James




* Re: Recommended why to use btrfs for production?
  2016-06-05 16:33       ` James Johnston
@ 2016-06-05 18:20         ` Andrei Borzenkov
  0 siblings, 0 replies; 28+ messages in thread
From: Andrei Borzenkov @ 2016-06-05 18:20 UTC (permalink / raw)
  To: James Johnston, 'Mladen Milinkovic',
	'Chris Murphy', 'Austin S. Hemmelgarn'
  Cc: 'Martin', 'Btrfs BTRFS'

05.06.2016 19:33, James Johnston wrote:
> On 06/05/2016 10:46 AM, Mladen Milinkovic wrote:
>> On 06/03/2016 04:05 PM, Chris Murphy wrote:
>>> Make certain the kernel command timer value is greater than the driver
>>> error recovery timeout. The former is found in sysfs, per block
>>> device, the latter can be get and set with smartctl. Wrong
>>> configuration is common (it's actually the default) when using
>>> consumer drives, and inevitably leads to problems, even the loss of
>>> the entire array. It really is a terrible default.
>>
>> Since it's first time i've heard of this I did some googling.
>>
>> Here's some nice article about these timeouts:
>> http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-
>> timeouts/comment-page-1/
>>
>> And some udev rules that should apply this automatically:
>> http://comments.gmane.org/gmane.linux.raid/48193
> 
> I think the first link there is a good one.  On my system:
> 
> /sys/block/sdX/device/timeout
> 
> defaults to 30 seconds - long enough for a drive with short TLER setting
> but too short for a consumer drive.
> 
> There is a Red Hat link on setting up a udev rule for it here:
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Online_Storage_Reconfiguration_Guide/task_controlling-scsi-command-timer-onlining-devices.html
> 
> I thought it looked a little funny, so I combined the above with one of the
> VMware udev rules pre-installed on my Ubuntu system and came up with this:
> 
> # Update timeout from 180 to one of your choosing:
> ACTION=="add|change", SUBSYSTEMS=="scsi", ATTRS{type}=="0|7|14", \
> RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"
> 

The last line should actually be

ATTR{device/timeout}="100"

to avoid spawning an extra process for every device.
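
For completeness, the combined rule with that change applied would then 
read roughly as follows (untested sketch, keeping the 180-second example 
value):

  ACTION=="add|change", SUBSYSTEMS=="scsi", ATTRS{type}=="0|7|14", \
    ATTR{device/timeout}="180"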


* Re: Recommended why to use btrfs for production?
  2016-06-05 10:45     ` Mladen Milinkovic
  2016-06-05 16:33       ` James Johnston
@ 2016-06-06  1:47       ` Chris Murphy
  2016-06-06  2:40         ` James Johnston
  1 sibling, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2016-06-06  1:47 UTC (permalink / raw)
  To: Mladen Milinkovic; +Cc: Chris Murphy, Austin S. Hemmelgarn, Martin, Btrfs BTRFS

On Sun, Jun 5, 2016 at 4:45 AM, Mladen Milinkovic <maxrd2@smoothware.net> wrote:
> On 06/03/2016 04:05 PM, Chris Murphy wrote:
>> Make certain the kernel command timer value is greater than the driver
>> error recovery timeout. The former is found in sysfs, per block
>> device, the latter can be get and set with smartctl. Wrong
>> configuration is common (it's actually the default) when using
>> consumer drives, and inevitably leads to problems, even the loss of
>> the entire array. It really is a terrible default.
>
> Since it's first time i've heard of this I did some googling.
>
> Here's some nice article about these timeouts:
> http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-timeouts/comment-page-1/
>
> And some udev rules that should apply this automatically:
> http://comments.gmane.org/gmane.linux.raid/48193

Yes, it's a constant problem that pops up on the linux-raid list.
Sometimes the list is quiet on this issue, but it really seems to come
up about once a week. From last week...

http://www.spinics.net/lists/raid/msg52447.html

And you wouldn't know it, because the subject is "raid 5 crashed", so
you wouldn't think: oh, bad sectors are accumulating because they're
not getting fixed up, and they're not getting fixed up because the
kernel command timer is resetting the link, preventing the drive from
reporting a read error and the associated sector LBA. It starts with
that; then you get a single disk failure, and now, when doing a
rebuild, you hit a bad sector on an otherwise good drive, which in
effect is like a second drive failure, and the raid5 implodes.
It's fixable, sometimes, but really tedious.



-- 
Chris Murphy


* RE: Recommended why to use btrfs for production?
  2016-06-06  1:47       ` Chris Murphy
@ 2016-06-06  2:40         ` James Johnston
  2016-06-06 13:36           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 28+ messages in thread
From: James Johnston @ 2016-06-06  2:40 UTC (permalink / raw)
  To: 'Chris Murphy', 'Mladen Milinkovic'
  Cc: 'Austin S. Hemmelgarn', 'Martin', 'Btrfs BTRFS'

On 06/06/2016 at 01:47, Chris Murphy wrote:
> On Sun, Jun 5, 2016 at 4:45 AM, Mladen Milinkovic <maxrd2@smoothware.net> wrote:
> > On 06/03/2016 04:05 PM, Chris Murphy wrote:
> >> Make certain the kernel command timer value is greater than the driver
> >> error recovery timeout. The former is found in sysfs, per block
> >> device, the latter can be get and set with smartctl. Wrong
> >> configuration is common (it's actually the default) when using
> >> consumer drives, and inevitably leads to problems, even the loss of
> >> the entire array. It really is a terrible default.
> >
> > Since it's first time i've heard of this I did some googling.
> >
> > Here's some nice article about these timeouts:
> > http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-
> timeouts/comment-page-1/
> >
> > And some udev rules that should apply this automatically:
> > http://comments.gmane.org/gmane.linux.raid/48193
> 
> Yes it's a constant problem that pops up on the linux-raid list.
> Sometimes the list is quiet on this issue but it really seems like
> it's once a week. From last week...
> 
> http://www.spinics.net/lists/raid/msg52447.html

It seems like it would be useful if the distributions or the kernel could
automatically set the kernel timeout to an appropriate value.  If the TLER can
indeed be queried via smartctl, then it would be easy to read it automatically
and then calculate a suitable timeout.  A RAID-oriented drive would end up
keeping the current 30 seconds, while if the query fails or the drive just
doesn't support TLER, you'd assume a consumer drive and set the timeout to 180
seconds.

That way, zero user configuration would be needed in the common case.  Or is it
not that simple?

James




* Re: Recommended why to use btrfs for production?
  2016-06-04  1:48           ` Chris Murphy
@ 2016-06-06 13:29             ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-06-06 13:29 UTC (permalink / raw)
  To: Chris Murphy, Nicholas D Steeves; +Cc: Martin, Btrfs BTRFS

On 2016-06-03 21:48, Chris Murphy wrote:
> On Fri, Jun 3, 2016 at 6:48 PM, Nicholas D Steeves <nsteeves@gmail.com> wrote:
>> On 3 June 2016 at 11:33, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>>> On 2016-06-03 10:11, Martin wrote:
>>>>>
>>>>> Make certain the kernel command timer value is greater than the driver
>>>>> error recovery timeout. The former is found in sysfs, per block
>>>>> device, the latter can be get and set with smartctl. Wrong
>>>>> configuration is common (it's actually the default) when using
>>>>> consumer drives, and inevitably leads to problems, even the loss of
>>>>> the entire array. It really is a terrible default.
>>>>
>>>>
>>>> Are nearline SAS drives considered consumer drives?
>>>>
>>> If it's a SAS drive, then no, especially when you start talking about things
>>> marketed as 'nearline'.  Additionally, SCT ERC is entirely a SATA thing, I
>>> forget what the equivalent in SCSI (and by extension SAS) terms is, but I'm
>>> pretty sure that the kernel handles things differently there.
>>
>> For the purposes of BTRFS RAID1: For drives that ship with SCT ERC of
>> 7sec, is the default kernel command timeout of 30sec appropriate, or
>> should it be reduced?
>
> It's fine. But it depends on your use case, if it can tolerate a rare
>> 7 second < 30 second hang, and you're prepared to start
> investigating the cause then I'd leave it alone. If the use case
> prefers resetting the drive when it stops responding, then you'd go
> with something shorter.
>
> I'm fairly certain SAS's command queue doesn't get obliterated with
> such a link reset, just the hung command; where SATA drives all
> information in the queue is lost. So resets on SATA are a much bigger
> penalty if I have the correct understanding.
There's also more involved with an ATA link reset, because AHCI 
controllers aren't MP-safe, so there's a global lock that has to be held 
while talking to them.  Because of this, a link reset on an ATA drive 
(be it SATA or PATA) will cause performance degradation for all other 
devices on that controller until the reset is complete.
>
>
>>  For SATA drives that do not support SC TERC, is
>> it true that 120sec is a sane value?  I forget where I got this value
>> of 120sec;
>
> It's a good question. It's not well documented, is not defined in the
> SATA spec, so it's probably make/model specific. The linux-raid@ list
> probably has the most information on this just because their users get
> nailed by this problem often. And the recommendation does seem to vary
> around 120 to 180. That is of course a maximum. The drive could give
> up much sooner. But what you don't want is for the drive to be in
> recovery for a bad sector, and the command timer does a link reset,
> losing all of what the drive was doing: all of which is replaceable
> except really one thing which is what sector was having the problem.
> And right now there's no report of the drive for slow sectors. It only
> reports failed reads, and it's that failed read error that includes
> the sector, so that the raid mechanism can figure out what data is
> missing, recongistruct from mirror or parity, and then fix the bad
> sector by writing to it.
FWIW, I usually go with 150 on the Seagate 'Desktop' drives I use.  I've 
seen some cheap Hitachi and Toshiba disks that need it as high as 300 
though to work right.
>
>> it might have been this list, it might have been an mdadm
>> bug report.  Also, in terms of tuning, I've been unable to find
>> whether the ideal kernel timeout value changes depending on RAID
>> type...is that a factor in selecting a sane kernel timeout value?
>
> No. It's strictly a value to make certain you get read errors from the
> drive rather than link resets.
You have to factor in how the controller handles things too.  Some of 
them will retry just like a desktop drive, and you need to account for that.
>
> And that's why I think it's a bad default, because it totally thwarts
> attempts by manufacturers to recover marginal sectors, even in the
> single disk case.
That's debatable; by attempting to recover the bad sector, they're 
slowing down the whole system.  The likelihood of recovering a bad 
sector essentially falls off linearly the longer you try, and not 
having the ability to choose when to report an error is the bigger 
issue here.



* Re: Recommended why to use btrfs for production?
  2016-06-06  2:40         ` James Johnston
@ 2016-06-06 13:36           ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-06-06 13:36 UTC (permalink / raw)
  To: James Johnston, 'Chris Murphy', 'Mladen Milinkovic'
  Cc: 'Martin', 'Btrfs BTRFS'

On 2016-06-05 22:40, James Johnston wrote:
> On 06/06/2016 at 01:47, Chris Murphy wrote:
>> On Sun, Jun 5, 2016 at 4:45 AM, Mladen Milinkovic <maxrd2@smoothware.net> wrote:
>>> On 06/03/2016 04:05 PM, Chris Murphy wrote:
>>>> Make certain the kernel command timer value is greater than the driver
>>>> error recovery timeout. The former is found in sysfs, per block
>>>> device, the latter can be get and set with smartctl. Wrong
>>>> configuration is common (it's actually the default) when using
>>>> consumer drives, and inevitably leads to problems, even the loss of
>>>> the entire array. It really is a terrible default.
>>>
>>> Since it's first time i've heard of this I did some googling.
>>>
>>> Here's some nice article about these timeouts:
>>> http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-
>> timeouts/comment-page-1/
>>>
>>> And some udev rules that should apply this automatically:
>>> http://comments.gmane.org/gmane.linux.raid/48193
>>
>> Yes it's a constant problem that pops up on the linux-raid list.
>> Sometimes the list is quiet on this issue but it really seems like
>> it's once a week. From last week...
>>
>> http://www.spinics.net/lists/raid/msg52447.html
>
> It seems like it would be useful if the distributions or the kernel could
> automatically set the kernel timeout to an appropriate value.  If the TLER can be
> indeed be queried via smartctl, then it would be easy to automatically read it,
> and then calculate a suitable timeout.  A RAID-oriented drive would end up leaving
> the current 30 seconds, while if it can't successfully query for TLER or the drive
> just doesn't support it, then assume a consumer drive and set timeout for 180
> seconds.
>
> That way, zero user configuration would be needed in the common case.  Or is it
> not that simple?
Strictly speaking, it's policy, and therefore shouldn't be in the 
kernel.  It's not hard to write a script to handle this though; both 
hdparm and smartctl can set the SCT ERC value, and will report an error 
if it fails, so you can try to set the value you want (I personally 
would go with 10 seconds instead of 7), and if that fails, bump the 
kernel command timeout.
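
A minimal sketch of such a script, using the 10 seconds mentioned above 
and assuming smartctl returns a non-zero exit status when the drive 
rejects the SCT ERC command (the device loop and the 180 s fallback are 
likewise assumptions):

  #!/bin/sh
  for dev in /sys/block/sd?; do
      disk=/dev/${dev##*/}
      # try to cap the drive's error recovery at 10 seconds...
      if ! smartctl -q errorsonly -l scterc,100,100 "$disk"; then
          # ...and if the drive won't accept it, raise the kernel command timeout
          echo 180 > "$dev/device/timeout"
      fi
  done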


* Re: Recommended why to use btrfs for production?
  2016-06-03 14:21     ` Austin S. Hemmelgarn
  2016-06-03 14:39       ` Martin
  2016-06-03 19:09       ` Christoph Anton Mitterer
@ 2016-06-09  6:16       ` Duncan
  2016-06-09 11:38         ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 28+ messages in thread
From: Duncan @ 2016-06-09  6:16 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Fri, 03 Jun 2016 10:21:12 -0400 as
excerpted:

> As far as BTRFS raid10 mode in general, there are a few things that are 
> important to remember about it:
> 1. It stores exactly two copies of everything, any extra disks just add 
> to the stripe length on each copy.

I'll add one more, potentially very important, related to this one:

Btrfs raid mode (any of them) works in relation to individual chunks, 
*NOT* individual devices.

What that means for btrfs raid10, in combination with the above 
exactly-two-copies rule, is that it works rather differently than a 
standard raid10, which can tolerate the loss of two devices as long as 
they're from the same mirror set, as the other mirror set will then 
still be whole.  Because with btrfs raid10 the mirror sets are dynamic 
per-chunk, the loss of a second device all but assures loss of data, 
because the very likely true assumption is that both mirror sets will 
be affected for some chunks, but not others.

By using a layered approach, btrfs raid1 on top (for its error correction 
from the other copy feature) of a pair of mdraid0s, you force one of the 
btrfs raid1 copies to each of the mdraid0s, thus making allocation more 
deterministic than btrfs raid10, so you can again tolerate the loss of 
two devices, as long as they're from the same underlying mdraid0.

(Traditionally, raid1 on top of raid0 is called raid01, and is 
discouraged compared to raid10, raid0 on top of raid1, because device 
failure and replacement with the latter triggers a much more localized 
rebuild than with the former: just the pair of devices in the raid1 
when it's closest to the physical devices, versus the whole array, one 
raid0 copied to the other, when the raid1 is on top.  However, btrfs 
raid1's data integrity and error repair from the good mirror feature is 
generally considered to be useful enough to be worth the 
rebuild-inefficiency of the raid01 design.)

So in regard to failure tolerance, btrfs raid10 is far closer to 
traditional raid5: loss of a single device is tolerated, while loss of 
a second before a repair is complete generally means data loss -- 
there's not the chance of it being on the same mirror set to save you 
that traditional raid10 has.

Similarly, btrfs raid10 doesn't have the cleanly separate pair of 
mirrored raid0 arrays that traditional raid10 does, and thus can't 
tolerate losing, say, the connection or power to one entire device 
bank (as long as that bank is all one mirror set) the way traditional 
raid10 can.

And again, doing the layered thing with btrfs raid1 on top and mdraid0 
(or whatever else) underneath gets that back for you, if you set it up 
that way, of course.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Recommended why to use btrfs for production?
  2016-06-09  6:16       ` Duncan
@ 2016-06-09 11:38         ` Austin S. Hemmelgarn
  2016-06-09 17:39           ` Chris Murphy
  0 siblings, 1 reply; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-06-09 11:38 UTC (permalink / raw)
  To: linux-btrfs

On 2016-06-09 02:16, Duncan wrote:
> Austin S. Hemmelgarn posted on Fri, 03 Jun 2016 10:21:12 -0400 as
> excerpted:
>
>> As far as BTRFS raid10 mode in general, there are a few things that are
>> important to remember about it:
>> 1. It stores exactly two copies of everything, any extra disks just add
>> to the stripe length on each copy.
>
> I'll add one more, potentially very important, related to this one:
>
> Btrfs raid mode (any of them) works in relation to individual chunks,
> *NOT* individual devices.
>
> What that means for btrfs raid10 in combination with the above exactly
> two copies rule, is that it works rather differently than a standard
> raid10, which can tolerate loss of two devices as long as they're from
> the same mirror set, as the other mirror set will then still be whole.
> Because with btrfs raid10 the mirror sets are dynamic per-chunk, loss of
> a second device close to assures loss of data, because the very likely
> true assumption is that both mirror sets will be affected for some
> chunks, but not others.
Actually, that's not _quite_ the case.  Assuming that you have an even 
number of devices, BTRFS raid10 will currently always span all the 
available devices with two striped copies of the data (if there's an odd 
number, it spans one less than the total, and rotates which one gets 
left out of each chunk).  This means that as long as all the devices are 
the same size and you have stripes that are the full width of the 
array (you can end up with shorter ones if you have run in degraded mode 
or expanded the array), your probability of data loss per-chunk goes 
down as you add more devices (because the probability of a two device 
failure affecting both copies of a stripe in a given chunk decreases), 
but goes up as you add more chunks (because you then have to apply that 
probability for each individual chunk).  Once you've lost one disk, the 
probability that losing another will compromise a specific chunk is:
1/(N - 1)
Where N is the total number of devices.
The probability that it will compromise _any_ chunk is roughly:
1 - (1 - 1/(N - 1))^C
Where C is the total number of chunks (treating the per-chunk pairings 
as independent).
BTRFS raid1 mode actually has the exact same probabilities, but they 
apply even if you have an odd number of disks.
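
As a rough worked example of those numbers (illustrative figures only): 
with N = 8 devices and one already failed, a given chunk survives a 
second failure with probability 6/7, about 0.86; but across, say, 
C = 1000 chunks, the chance that no chunk is hit is roughly (6/7)^1000, 
which is effectively zero, so a second failure almost certainly loses 
some data somewhere.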
>
> By using a layered approach, btrfs raid1 on top (for its error correction
> from the other copy feature) of a pair of mdraid0s, you force one of the
> btrfs raid1 copies to each of the mdraid0s, thus making allocation more
> deterministic than btrfs raid10, and can thus again tolerate loss of two
> devices, as long as they're from the same underlying mdraid0.
>
> (Traditionally, raid1 on top of raid0 is called raid01, and is
> discouraged compared to raid10, raid0 on top of raid1, because device
> failure and replacement with the latter triggers a much more localized
> rebuild than the former, across the pair of devices in the raid1 when
> it's closest to the physical devices, across the whole array, one raid0
> to the other, when the raid1 is on top.  However, btrfs raid1's data
> integrity and error repair from the good mirror feature is generally
> considered to be useful enough to be worth the rebuild-inefficiency of
> the raid01 design.)
>
> So in regard to failure tolerance, btrfs raid10 is far closer to
> traditional raid5, loss of a single device is tolerated, loss of a second
> before a repair is complete generally means data loss -- there's not the
> chance of it being on the same mirror set to save you that traditional
> raid10 has.
>
> Similarly, btrfs raid10 doesn't have the cleanly separate pair of mirrors
> on raid0 arrays that traditional raid10 does, thus doesn't have the fault
> tolerance of losing say the connection or power to one entire device
> bank, as long as it's all one mirror set, that traditional raid10 has.
>
> And again, doing the layered thing with btrfs raid1 on top and mdraid0
> (or whatever else) underneath gets that back for you, if you set it up
> that way, of course.
And will get you better performance than just BTRFS most of the time too.



* Re: Recommended why to use btrfs for production?
  2016-06-09 11:38         ` Austin S. Hemmelgarn
@ 2016-06-09 17:39           ` Chris Murphy
  2016-06-09 19:57             ` Duncan
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2016-06-09 17:39 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Btrfs BTRFS

On Thu, Jun 9, 2016 at 5:38 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-06-09 02:16, Duncan wrote:
>>
>> Austin S. Hemmelgarn posted on Fri, 03 Jun 2016 10:21:12 -0400 as
>> excerpted:
>>
>>> As far as BTRFS raid10 mode in general, there are a few things that are
>>> important to remember about it:
>>> 1. It stores exactly two copies of everything, any extra disks just add
>>> to the stripe length on each copy.
>>
>>
>> I'll add one more, potentially very important, related to this one:
>>
>> Btrfs raid mode (any of them) works in relation to individual chunks,
>> *NOT* individual devices.
>>
>> What that means for btrfs raid10 in combination with the above exactly
>> two copies rule, is that it works rather differently than a standard
>> raid10, which can tolerate loss of two devices as long as they're from
>> the same mirror set, as the other mirror set will then still be whole.
>> Because with btrfs raid10 the mirror sets are dynamic per-chunk, loss of
>> a second device close to assures loss of data, because the very likely
>> true assumption is that both mirror sets will be affected for some
>> chunks, but not others.
>
> Actually, that's not _quite_ the case.  Assuming that you have an even
> number of devices, BTRFS raid10 will currently always span all the available
> devices with two striped copies of the data (if there's an odd number, it
> spans one less than the total, and rotates which one gets left out of each
> chunk).  This means that as long as all the devices are the same size and
> you have have stripes that are the full width of the array (you can end up
> with shorter ones if you have run in degraded mode or expanded the array),
> your probability of data loss per-chunk goes down as you add more devices
> (because the probability of a two device failure affecting both copies of a
> stripe in a given chunk decreases), but goes up as you add more chunks
> (because you then have to apply that probability for each individual chunk).
> Once you've lost one disk, the probability that losing another will
> compromise a specific chunk is:
> 1/(N - 1)
> Where N is the total number of devices.
> The probability that it will compromise _any_ chunk is:
> 1 - (1 - 1/(N - 1))^C
> Where C is the total number of chunks
> BTRFS raid1 mode actually has the exact same probabilities, but they apply
> even if you have an odd number of disks.

Yeah but somewhere there's a chunk that's likely affected by two
losses, with a probability much higher than for conventional raid10
where such a loss is very binary: if the loss is a mirrored pair, the
whole array and filesystem implodes; if the loss does not affect an
entire mirrored pair, the whole array survives.

The thing with Btrfs raid 10 is you can't really tell in advance to
what degree you have loss. It's not a binary condition, it has a gray
area where a lot of data can still be retrieved, but the instant you
hit missing data it's a loss, and if you hit missing metadata then the
fs will either go read only or crash, it just can't continue. So that
"walking on egg shells" behavior in a 2+ drive loss is really
different from a conventional raid10 where it's either gonna
completely work or completely fail.

-- 
Chris Murphy


* Re: Recommended why to use btrfs for production?
  2016-06-09 17:39           ` Chris Murphy
@ 2016-06-09 19:57             ` Duncan
  0 siblings, 0 replies; 28+ messages in thread
From: Duncan @ 2016-06-09 19:57 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Thu, 09 Jun 2016 11:39:23 -0600 as excerpted:

> Yeah but somewhere there's a chunk that's likely affected by two losses,
> with a probability much higher than for conventional raid10 where such a
> loss is very binary: if the loss is a mirrored pair, the whole array and
> filesystem implodes; if the loss does not affect an entire mirrored
> pair, the whole array survives.
> 
> The thing with Btrfs raid 10 is you can't really tell in advance to what
> degree you have loss. It's not a binary condition, it has a gray area
> where a lot of data can still be retrieved, but the instant you hit
> missing data it's a loss, and if you hit missing metadata then the fs
> will either go read only or crash, it just can't continue. So that
> "walking on egg shells" behavior in a 2+ drive loss is really different
> from a conventional raid10 where it's either gonna completely work or
> completely fail.

Yes, thanks, CMurphy.  That's exactly what I was trying to explain. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


