* How does btrfs handle bad blocks in raid1?
@ 2014-01-09 10:26 Clemens Eisserer
  2014-01-09 10:42 ` Hugo Mills
  0 siblings, 1 reply; 45+ messages in thread
From: Clemens Eisserer @ 2014-01-09 10:26 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I am running write-intensive (well sort of, one write every 10s)
workloads on cheap flash media which proved to be horribly unreliable.
A 32GB microSDHC card reported bad blocks after 4 days, while a usb
pen drive returns bogus data without any warning at all.

So I wonder, how would btrfs behave in raid1 on two such devices?
Would it simply mark bad blocks as "bad" and continue to be
operational, or will it bail out when some block can not be
read/written anymore on one of the two devices?

Thank you in advance, Clemens

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 10:26 How does btrfs handle bad blocks in raid1? Clemens Eisserer
@ 2014-01-09 10:42 ` Hugo Mills
  2014-01-09 12:41   ` Duncan
  2014-01-09 18:40   ` Chris Murphy
  0 siblings, 2 replies; 45+ messages in thread
From: Hugo Mills @ 2014-01-09 10:42 UTC (permalink / raw)
  To: Clemens Eisserer; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1750 bytes --]

On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
> Hi,
> 
> I am running write-intensive (well sort of, one write every 10s)
> workloads on cheap flash media which proved to be horribly unreliable.
> A 32GB microSDHC card reported bad blocks after 4 days, while a usb
> pen drive returns bogus data without any warning at all.
> 
> So I wonder, how would btrfs behave in raid1 on two such devices?
> Would it simply mark bad blocks as "bad" and continue to be
> operational, or will it bail out when some block can not be
> read/written anymore on one of the two devices?

   If a block is read and fails its checksum, then the other copy (in
RAID-1) is checked and used if it's good. The bad copy is rewritten to
use the good data.

   If the block is bad such that writing to it won't fix it, then
there's probably two cases: the device returns an IO error, in which
case I suspect (but can't be sure) that the FS will go read-only. Or
the device silently fails the write and claims success, in which case
you're back to the situation above of the block failing its checksum.

   There's no marking of bad blocks right now, and I don't know of
anyone working on the feature, so the FS will probably keep going back
to the bad blocks as it makes CoW copies for modification.
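
   For what it's worth, a regular scrub exercises that same
read-and-repair path proactively, so on media you don't trust it's
probably worth scheduling one. Something along these lines should do
it (the mount point is just a placeholder):

   btrfs scrub start /mnt/flash
   btrfs scrub status /mnt/flash
   btrfs device stats /mnt/flash   # per-device error counters (read/write/corruption)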

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Trouble rather the tiger in his lair than the sage amongst ---    
        his books for to you kingdoms and their armies are mighty        
        and enduring,  but to him they are but toys of the moment        
              to be overturned by the flicking of a finger.              

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 10:42 ` Hugo Mills
@ 2014-01-09 12:41   ` Duncan
  2014-01-09 12:52     ` Austin S Hemmelgarn
                       ` (2 more replies)
  2014-01-09 18:40   ` Chris Murphy
  1 sibling, 3 replies; 45+ messages in thread
From: Duncan @ 2014-01-09 12:41 UTC (permalink / raw)
  To: linux-btrfs

Hugo Mills posted on Thu, 09 Jan 2014 10:42:47 +0000 as excerpted:

> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>> Hi,
>> 
>> I am running write-intensive (well sort of, one write every 10s)
>> workloads on cheap flash media which proved to be horribly unreliable.
>> A 32GB microSDHC card reported bad blocks after 4 days, while a usb pen
>> drive returns bogus data without any warning at all.
>> 
>> So I wonder, how would btrfs behave in raid1 on two such devices? Would
>> it simply mark bad blocks as "bad" and continue to be operational, or
>> will it bail out when some block can not be read/written anymore on one
>> of the two devices?
> 
> If a block is read and fails its checksum, then the other copy (in
> RAID-1) is checked and used if it's good. The bad copy is rewritten to
> use the good data.

This is why I'm (semi-impatiently, but not being a coder, I have little 
choice, and I do see advances happening) so looking forward to the 
planned N-way-mirroring, aka true-raid-1, feature, as opposed to btrfs' 
current 2-way-only mirroring.  Having checksumming is good, and a second 
copy in case one fails the checksum is nice, but what if they BOTH do?
I'd love to have the choice of (at least) three-way-mirroring, as for me 
that seems the best practical hassle/cost vs. risk balance I could get, 
but it's not yet possible. =:^(

For (at least) a year now, the roadmap has had N-way-mirroring on the list 
for after raid5/6 as they want to build on its features, but (like much 
of the btrfs work) raid5/6 took about three kernels longer to introduce 
than originally thought, and even when introduced, the raid5/6 feature 
lacked some critical parts (like scrub) and wasn't considered real-world 
usable as integrity over a crash and/or device failure, the primary 
feature of raid5/6, couldn't be assured.  That itself was about three 
kernels ago now, and the raid5/6 functionality remains partial -- it 
writes the data and parities as it should, but scrub and recovery remain 
only partially coded, so it looks like that'll /still/ be a few more 
kernels before that's fully implemented and most bugs worked out, with 
very likely a similar story to play out for N-way-mirroring after that, 
thus placing it late this year for introduction and early next for 
actually usable stability.

But it remains on the roadmap and btrfs should have it... eventually.  
Meanwhile, I keep telling myself that this is filesystem code which a LOT 
of folks including me stake the survival of their data on, and I along 
with all the others definitely prefer it done CORRECTLY, even if it takes 
TEN years longer than intended, than have it sloppily and unreliably 
implemented sooner.

But it's still hard to wait, when sometimes I begin to think of it like 
that carrot suspended in front of the donkey, never to actually be 
reached.  Except... I *DO* see changes, and after originally taking off 
for a few months after my original btrfs investigation, finding it 
unusable in its then-current state, upon coming back about 5 months 
later, actual usability and stability on current features had improved to 
the point that I'm actually using it now, so there's certainly progress 
being made, and the fact that I'm actually using it now attests to that 
progress *NOT* being a simple illusion.  So it'll come, even if it /does/ 
sometimes seem it's Duke-Nukem-Forever.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 12:41   ` Duncan
@ 2014-01-09 12:52     ` Austin S Hemmelgarn
  2014-01-09 15:15       ` Duncan
  2014-01-09 17:31       ` Chris Murphy
  2014-01-09 14:58     ` Chris Mason
  2014-01-09 18:08     ` Chris Murphy
  2 siblings, 2 replies; 45+ messages in thread
From: Austin S Hemmelgarn @ 2014-01-09 12:52 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 2014-01-09 07:41, Duncan wrote:
> Hugo Mills posted on Thu, 09 Jan 2014 10:42:47 +0000 as excerpted:
> 
>> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>>> Hi,
>>>
>>> I am running write-intensive (well sort of, one write every 10s)
>>> workloads on cheap flash media which proved to be horribly unreliable.
>>> A 32GB microSDHC card reported bad blocks after 4 days, while a usb pen
>>> drive returns bogus data without any warning at all.
>>>
>>> So I wonder, how would btrfs behave in raid1 on two such devices? Would
>>> it simply mark bad blocks as "bad" and continue to be operational, or
>>> will it bail out when some block can not be read/written anymore on one
>>> of the two devices?
>>
>> If a block is read and fails its checksum, then the other copy (in
>> RAID-1) is checked and used if it's good. The bad copy is rewritten to
>> use the good data.
> 
> This is why I'm (semi-impatiently, but not being a coder, I have little 
> choice, and I do see advances happening) so looking forward to the 
> planned N-way-mirroring, aka true-raid-1, feature, as opposed to btrfs' 
> current 2-way-only mirroring.  Having checksumming is good, and a second 
> copy in case one fails the checksum is nice, but what if they BOTH do?
> I'd love to have the choice of (at least) three-way-mirroring, as for me 
> that seems the best practical hassle/cost vs. risk balance I could get, 
> but it's not yet possible. =:^(
> 
> For (at least) year now, the roadmap has had N-way-mirroring on the list 
> for after raid5/6 as they want to build on its features, but (like much 
> of the btrfs work) raid5/6 took about three kernels longer to introduce 
> than originally thought, and even when introduced, the raid5/6 feature 
> lacked some critical parts (like scrub) and wasn't considered real-world 
> usable as integrity over a crash and/or device failure, the primary 
> feature of raid5/6, couldn't be assured.  That itself was about three 
> kernels ago now, and the raid5/6 functionality remains partial -- it 
> writes the data and parities as it should, but scrub and recovery remain 
> only partially coded, so it looks like that'll /still/ be a few more 
> kernels before that's fully implemented and most bugs worked out, with 
> very likely a similar story to play out for N-way-mirroring after that, 
> thus placing it late this year for introduction and early next for 
> actually usable stability.
> 
> But it remains on the roadmap and btrfs should have it... eventually.  
> Meanwhile, I keep telling myself that this is filesystem code which a LOT 
> of folks including me stake the survival of their data on, and I along 
> with all the others definitely prefer it done CORRECTLY, even if it takes 
> TEN years longer than intended, than have it sloppily and unreliably 
> implemented sooner.
> 
> But it's still hard to wait, when sometimes I begin to think of it like 
> that carrot suspended in front of the donkey, never to actually be 
> reached.  Except... I *DO* see changes, and after originally taking off 
> for a few months after my original btrfs investigation, finding it 
> unusable in its then-current state, upon coming back about 5 months 
> later, actual usability and stability on current features had improved to 
> the point that I'm actually using it now, so there's certainly progress 
> being made, and the fact that I'm actually using it now attests to that 
> progress *NOT* being a simple illusion.  So it'll come, even if it /does/ 
> sometimes seem it's Duke-Nukem-Forever.
> 
Just a thought: you might consider running btrfs on top of LVM in the
interim. It isn't quite as efficient as btrfs by itself, but it does
allow N-way mirroring (and the efficiency is much better now that they
have switched to RAID1 as the default mirroring backend).
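
As a rough sketch (volume group name and size made up), a 3-copy LV
with btrfs on top would look something like:

  lvcreate --type raid1 -m 2 -L 100G -n data3way vg0
  mkfs.btrfs /dev/vg0/data3way
  mount /dev/vg0/data3way /mnt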

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 12:41   ` Duncan
  2014-01-09 12:52     ` Austin S Hemmelgarn
@ 2014-01-09 14:58     ` Chris Mason
  2014-01-09 18:08     ` Chris Murphy
  2 siblings, 0 replies; 45+ messages in thread
From: Chris Mason @ 2014-01-09 14:58 UTC (permalink / raw)
  To: 1i5t5.duncan; +Cc: linux-btrfs

On Thu, 2014-01-09 at 12:41 +0000, Duncan wrote:
> Hugo Mills posted on Thu, 09 Jan 2014 10:42:47 +0000 as excerpted:
> 
> > On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
> >> Hi,
> >> 
> >> I am running write-intensive (well sort of, one write every 10s)
> >> workloads on cheap flash media which proved to be horribly unreliable.
> >> A 32GB microSDHC card reported bad blocks after 4 days, while a usb pen
> >> drive returns bogus data without any warning at all.
> >> 
> >> So I wonder, how would btrfs behave in raid1 on two such devices? Would
> >> it simply mark bad blocks as "bad" and continue to be operational, or
> >> will it bail out when some block can not be read/written anymore on one
> >> of the two devices?
> > 
> > If a block is read and fails its checksum, then the other copy (in
> > RAID-1) is checked and used if it's good. The bad copy is rewritten to
> > use the good data.
> 
> This is why I'm (semi-impatiently, but not being a coder, I have little 
> choice, and I do see advances happening) so looking forward to the 
> planned N-way-mirroring, aka true-raid-1, feature, as opposed to btrfs' 
> current 2-way-only mirroring.  Having checksumming is good, and a second 
> copy in case one fails the checksum is nice, but what if they BOTH do?
> I'd love to have the choice of (at least) three-way-mirroring, as for me 
> that seems the best practical hassle/cost vs. risk balance I could get, 
> but it's not yet possible. =:^(
> 
> For (at least) year now, the roadmap has had N-way-mirroring on the list 
> for after raid5/6 as they want to build on its features, but (like much 
> of the btrfs work) raid5/6 took about three kernels longer to introduce 
> than originally thought, and even when introduced, the raid5/6 feature 
> lacked some critical parts (like scrub) and wasn't considered real-world 
> usable as integrity over a crash and/or device failure, the primary 
> feature of raid5/6, couldn't be assured.  

I'm frustrated too that I haven't pushed this out yet.  I've been trying
different methods to keep the performance up and in the end tried to do
pile on too many other features in the patches.  So, I'm breaking it up
a bit and reworking things for faster release.

-chris


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 12:52     ` Austin S Hemmelgarn
@ 2014-01-09 15:15       ` Duncan
  2014-01-09 16:49         ` George Eleftheriou
  2014-01-09 17:31       ` Chris Murphy
  1 sibling, 1 reply; 45+ messages in thread
From: Duncan @ 2014-01-09 15:15 UTC (permalink / raw)
  To: linux-btrfs

Austin S Hemmelgarn posted on Thu, 09 Jan 2014 07:52:44 -0500 as
excerpted:

> On 2014-01-09 07:41, Duncan wrote:
>> Hugo Mills posted on Thu, 09 Jan 2014 10:42:47 +0000 as excerpted:
>> 
>>> If a [btrfs ]block is read and fails its checksum, then the other
>>> copy (in RAID-1) is checked and used if it's good. The bad copy is
>>> rewritten to use the good data.
>> 
>> This is why I'm so looking forward to the planned N-way-mirroring,
>> aka true-raid-1, feature, as opposed to btrfs' current 2-way-only
>> mirroring.  Having checksumming is good, and a second copy in case
>> one fails the checksum is nice, but what if they BOTH do? I'd love
>> to have the choice of (at least) three-way-mirroring, as for me
>> that seems the best practical hassle/cost vs. risk balance I
>> could get, but it's not yet possible. =:^(
>> 
> Just a thought, you might consider running btrfs on top of LVM in the
> interim, it isn't quite as efficient as btrfs by itself, but it does
> allow N-way mirroring (and the efficiency is much better now that they
> have switched to RAID1 as the default mirroring backend)

Except... AFAIK LVM is like mdraid in that regard -- no checksums, 
leaving the software entirely at the mercy of the hardware's ability to 
detect and properly report failure.

In fact, it's exactly as bad as that, since while both lvm and mdraid 
offer N-way-mirroring, they generally only fetch a single unchecksummed 
copy from whatever mirror they happen to choose to request it from, and 
use whatever they get without even a comparison againt the other copies 
to see if they match or majority vote on which is the valid copy if 
something doesn't match.  The ONLY way they know there's an error (unless 
the hardware reports one) at all is if a deliberate scrub is done.
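
(And even that scrub is something you have to trigger yourself, with
the array name adjusted to taste, roughly:

echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt

...and when a mismatch /is/ found, all mdraid can do is count it, since
without checksums it has no way of knowing which copy was the good one.)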

And the raid5/6 parity-checking isn't any better, as while those parities 
are written, they're never checked or otherwise actually used except in 
recovery.  Normal read operation is just like raid0; only the device(s) 
containing the data itself is(are) read, no parity/checksum checking at 
all, even tho the trouble was taken to calculate and write it out.  When 
I had mdraid6 deployed and realized that, I switched back to raid1 (which 
would have been raid10 on a larger system), because while I considered 
the raid6 performance costs worth it for parity checking, they most 
definitely weren't once I realized all those calculations and writes were 
for nothing unless an actual device died, and raid1 gave me THAT level of 
protection at far better performance.

Which means neither lvm nor mdraid solve the problem at all.  Even btrfs 
on top of them won't solve the problem, while adding all sorts of 
complexity, because btrfs still has only the two-way check, and if one 
device gets corrupted in the underlying mirrors but another actually 
returns the data, btrfs will be entirely oblivious.

What one /could/ in theory do at the moment, altho it's hardly worth it 
due to the complexity[1] and the fact that btrfs itself is still a 
relatively immature filesystem under heavy development, and thus not 
suited to being part of such extreme solutions yet, is layered raid1 
btrfs on loopback over raid1 btrfs, say four devices, separate on-the-
hardware-device raid1 btrfs on two pairs, with a single huge loopback-
file on each lower-level btrfs, with raid1 btrfs layered on top of the 
loopback devices, too, manually creating an effective 4-device btrfs 
raid11.  Or use btrf raid10 at one or the other level and make it an 8-
device btrfs raid101 or raid110.  Tho as I said btrfs maturity level in 
general is a mismatch for such extreme measures, at present.  But in 
theory...
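
(Sketched out with made-up device names and sizes, the 4-device 
version of that would look roughly like so:

mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb
mkfs.btrfs -d raid1 -m raid1 /dev/sdc /dev/sdd
mount /dev/sda /mnt/lower1
mount /dev/sdc /mnt/lower2
truncate -s 500G /mnt/lower1/backing.img
truncate -s 500G /mnt/lower2/backing.img
losetup /dev/loop0 /mnt/lower1/backing.img
losetup /dev/loop1 /mnt/lower2/backing.img
mkfs.btrfs -d raid1 -m raid1 /dev/loop0 /dev/loop1
btrfs device scan /dev/loop0 /dev/loop1
mount /dev/loop0 /mnt/top

...with every caveat above about performance, complexity and btrfs 
maturity applying, of course.)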

Zfs is arguably a more practically viable solution as it's mature and 
ready for deployment today, tho there's legal/license issues with the 
Linux kernel module and the usual userspace performance issues (tho the 
btrfs-on-loopback-on-btrfs solution above wouldn't be performance issue 
free either) with the fuse alternative.

I'm sure that's why a lot of folks needing multi-mirror checksum-verified 
reliability remain on Solaris/OpenIndiana/ZFS-on-BSD, as Linux simply 
doesn't /have/ a solution for that yet.  Btrfs /will/ have it, but as I 
explained, it's taking awhile.

---
[1] Complexity: Complexity can be the PRIMARY failure factor when an 
admin must understand enough about the layout to reliably manage recovery 
when they're already under the extreme pressure of a disaster recovery 
situation.  If complexity in even an otherwise 100% reliable solution is 
high enough an admin isn't confident of his ability to manage it, then 
the admin themself becomes the weak link in the reliability chain!!  
That's the reason I tried and ultimately dropped lvm over mdraid here, 
since I couldn't be confident in my ability to understand both well 
enough to recover from disaster without admin error.  Thus, higher 
complexity really *IS* a SERIOUS negative in this sort of discussion, 
since it can be *THE* failure factor!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 15:15       ` Duncan
@ 2014-01-09 16:49         ` George Eleftheriou
  2014-01-09 17:09           ` Hugo Mills
                             ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: George Eleftheriou @ 2014-01-09 16:49 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

Duncan,

As a silent reader of this list (for almost a year)...
As an anonymous supporter of the BAARF (Battle Against Any RAID
Four/Five/Six/ Z etc...) initiative...

I can only break my silence and applaud your frequent interventions
referring to N-Way mirroring (searching the list for the string
"n-way" brings up almost exclusively your posts, at least in recent
times).

Because that's what I'm also eager to see implemented in BTRFS and
somehow felt disappointed that it wasn't given priority over the
parity solutions...

I currently use ZFS on Linux in a 4-disk RAID10 (performing pretty
well, by the way), but I'm stuck on the 3.11 kernel because of DKMS
issues and unable to share over SMB or NFS because of some bugs.

I'm really looking forward to the day that typing:

mkfs.btrfs -d raid10 -m raid10  /dev/sd[abcd]

will do exactly what it is expected to do: a true RAID10, resilient to
a 2-disk failure. Simple and beautiful.

We're almost there...

Best regards to all BTRFS developers/contributors

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 16:49         ` George Eleftheriou
@ 2014-01-09 17:09           ` Hugo Mills
  2014-01-09 17:34             ` George Eleftheriou
  2014-01-09 17:29           ` Chris Murphy
  2014-01-10 15:27           ` Duncan
  2 siblings, 1 reply; 45+ messages in thread
From: Hugo Mills @ 2014-01-09 17:09 UTC (permalink / raw)
  To: George Eleftheriou; +Cc: Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2404 bytes --]

On Thu, Jan 09, 2014 at 05:49:48PM +0100, George Eleftheriou wrote:
> Duncan,
> 
> As a silent reader of this list (for almost a year)...
> As an anonymous supporter of the BAARF (Battle Against Any RAID
> Four/Five/Six/ Z etc...) initiative...
> 
> I can only break my silence and applaud your frequent interventions
> referring to N-Way mirroring (searching the list for the string
> "n-way" brings up almost exclusively your posts, at least in recent
> times).
> 
> Because that's what I' m also eager to see implemented in BTRFS and
> somehow felt disappointed that it wasn't given priority over the
> parity solutions...
> 
> I currently use ZFS on Linux in a 4-disk RAID10 (performing pretty
> good by the way) being stuck with the 3.11 kernel because of DKMS
> issues and not being able to share by SMB or NFS because of some bugs.
> 
> I'm really looking forward to the day that typing:
> 
> mkfs.btrfs -d raid10 -m raid10  /dev/sd[abcd]
> 
> will do exactly what is expected to do. A true RAID10 resilient in 2
> disks' failure. Simple and beautiful.

   RAID-10 isn't guaranteed to be robust against two devices failing.
Not just the btrfs implementation -- any RAID-10 will die if the wrong
two devices fail. In the simplest case:

A }
B } Mirrored    }
                }
C }             }
D } Mirrored    } Striped
                }
E }             }
F } Mirrored    }

   If A and B both die, then you're stuffed. (For the four-disk case,
just remove E and F from the diagram).

   If you want to talk odds, then that's OK, I'll admit that btrfs
doesn't necessarily do as well there(*) as the scheme above. But
claiming that RAID-10 (with 2-way mirroring) is guaranteed to survive
an arbitrary 2-device failure is incorrect.

   Hugo.

(*) Actually, I suspect that with even numbers of equal-sized disks,
it'll do just as well, but I'm not willing to guarantee that behaviour
without hacking up the allocator a bit to add the capability.

> We're almost there...
> 
> Best regards to all BTRFS developers/contributors

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- But people have always eaten people,  / what else is there to ---  
         eat?  / If the Juju had meant us not to eat people / he         
                     wouldn't have made us of meat.                      

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 16:49         ` George Eleftheriou
  2014-01-09 17:09           ` Hugo Mills
@ 2014-01-09 17:29           ` Chris Murphy
  2014-01-09 18:00             ` George Eleftheriou
  2014-01-10 15:27           ` Duncan
  2 siblings, 1 reply; 45+ messages in thread
From: Chris Murphy @ 2014-01-09 17:29 UTC (permalink / raw)
  To: Btrfs BTRFS


On Jan 9, 2014, at 9:49 AM, George Eleftheriou <eleftg@gmail.com> wrote:
> 
> I'm really looking forward to the day that typing:
> 
> mkfs.btrfs -d raid10 -m raid10  /dev/sd[abcd]
> 
> will do exactly what is expected to do. A true RAID10 resilient in 2
> disks' failure. Simple and beautiful.


How is resilience to a 2-disk failure possible with a four-disk raid10?

Chris Murphy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 12:52     ` Austin S Hemmelgarn
  2014-01-09 15:15       ` Duncan
@ 2014-01-09 17:31       ` Chris Murphy
  2014-01-09 18:20         ` Austin S Hemmelgarn
  1 sibling, 1 reply; 45+ messages in thread
From: Chris Murphy @ 2014-01-09 17:31 UTC (permalink / raw)
  To: Btrfs BTRFS


On Jan 9, 2014, at 5:52 AM, Austin S Hemmelgarn <ahferroin7@gmail.com> wrote:
> Just a thought, you might consider running btrfs on top of LVM in the
> interim, it isn't quite as efficient as btrfs by itself, but it does
> allow N-way mirroring (and the efficiency is much better now that they
> have switched to RAID1 as the default mirroring backend)

The problem is that in case of mismatches, it's ambiguous which copies are correct.


Chris Murphy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
@ 2014-01-09 17:34             ` George Eleftheriou
  2014-01-09 17:43               ` Hugo Mills
  0 siblings, 1 reply; 45+ messages in thread
From: George Eleftheriou @ 2014-01-09 17:34 UTC (permalink / raw)
  To: Hugo Mills, linux-btrfs

> claiming that RAID-10 (with 2-way mirroring) is guaranteed to survive
> an arbitrary 2-device failure is incorrect.

Yes, you are right.  I didn't mean "any 2 devices". I should have
added "from different mirrors" :)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 17:34             ` George Eleftheriou
@ 2014-01-09 17:43               ` Hugo Mills
  2014-01-09 18:40                 ` George Eleftheriou
  0 siblings, 1 reply; 45+ messages in thread
From: Hugo Mills @ 2014-01-09 17:43 UTC (permalink / raw)
  To: George Eleftheriou; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1769 bytes --]

On Thu, Jan 09, 2014 at 06:34:23PM +0100, George Eleftheriou wrote:
> > claiming that RAID-10 (with 2-way mirroring) is guaranteed to survive
> > an arbitrary 2-device failure is incorrect.
> 
> Yes, you are right.  I didn't mean "any 2 devices". I should have
> added "from different mirrors" :)

   If you have an even number of devices and all the devices are the
same size, then:

 * the block group allocator will use all the devices each time
 * the amount of space on each device will always be the same

If the sort in the allocator is stable and resolves ties in free space
by using the device ID number, the above properties should guarantee
that the allocation is stable, so each new block group will have the
same functional chunk on the same device, and you get your wish.

   It's been a few months since I looked at that code, but I don't
recall seeing anything directly contradictory to the above
assumptions.

   Of course, if you have an odd number of devices, the allocator will
omit a different device on each block group, and you lose the ability
to survive (some) two-device failures. I suspect that the odds of
surviving a two-device failure are still non-zero, but less than if
you had an even number of devices. I'm not about to attempt an
ab-initio computation of the probabilities, but it shouldn't be too
hard to do either a monte-carlo simulation or a simple brute-force
enumeration of the possibilities for a given configuration.
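
   (For the even, fixed-pairing case, the brute-force enumeration is
only a few lines of shell -- device labels and pairings made up to
match the six-device diagram from my earlier mail:

   pair() { case $1 in A|B) echo AB;; C|D) echo CD;; E|F) echo EF;; esac; }
   for d1 in A B C D E F; do
     for d2 in A B C D E F; do
       [ "$d1" \< "$d2" ] || continue
       if [ "$(pair $d1)" = "$(pair $d2)" ]; then
         echo "lose $d1+$d2: both copies of some chunks gone -- fatal"
       else
         echo "lose $d1+$d2: survivable"
       fi
     done
   done

   The odd-device-count case is the more interesting one, since the
pairings there shift from block group to block group.)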

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- <Diablo-D3> My code is never released,  it escapes from the ---   
          git repo and kills a few beta testers on the way out           

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 17:29           ` Chris Murphy
@ 2014-01-09 18:00             ` George Eleftheriou
  0 siblings, 0 replies; 45+ messages in thread
From: George Eleftheriou @ 2014-01-09 18:00 UTC (permalink / raw)
  To: linux-btrfs

> How is a resilient 2 disk failure possible with four disk raid10?

         ______________   RAID0 (stripe)
        |              |
     ___|___        ___|___   RAID1 (mirrors)
    |       |      |       |
    A       B      C       D

Losing A+C / A+D / B+C / B+D is survivable.
Losing A+B or C+D is catastrophic.

Sorry, it's my fault. In my urge to praise Duncan's promotion of n-way
mirroring I created a misunderstanding.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 12:41   ` Duncan
  2014-01-09 12:52     ` Austin S Hemmelgarn
  2014-01-09 14:58     ` Chris Mason
@ 2014-01-09 18:08     ` Chris Murphy
  2014-01-09 18:22       ` Austin S Hemmelgarn
  2 siblings, 1 reply; 45+ messages in thread
From: Chris Murphy @ 2014-01-09 18:08 UTC (permalink / raw)
  To: Btrfs BTRFS


On Jan 9, 2014, at 5:41 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Having checksumming is good, and a second 
> copy in case one fails the checksum is nice, but what if they BOTH do?
> I'd love to have the choice of (at least) three-way-mirroring, as for me 
> that seems the best practical hassle/cost vs. risk balance I could get, 
> but it's not yet possible. =:^(

I'm on the fence on n-way. 

HDDs get bigger at a faster rate than their performance improves, so rebuild times keep getting longer. For cases where the data is really important, backup-restore doesn't provide the necessary uptime, and minimum single-drive performance is needed, it can make sense to want three copies.

But what's the probability of both drives in a mirrored raid set dying, compared to something else in the storage stack dying? I think at 3 copies, you've got other risks that the 3rd copy doesn't manage, like a power supply, controller card, or logic board dying.


Chris Murphy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 17:31       ` Chris Murphy
@ 2014-01-09 18:20         ` Austin S Hemmelgarn
  0 siblings, 0 replies; 45+ messages in thread
From: Austin S Hemmelgarn @ 2014-01-09 18:20 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS

On 2014-01-09 12:31, Chris Murphy wrote:
> 
> On Jan 9, 2014, at 5:52 AM, Austin S Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> Just a thought, you might consider running btrfs on top of LVM in
>> the interim, it isn't quite as efficient as btrfs by itself, but
>> it does allow N-way mirroring (and the efficiency is much better
>> now that they have switched to RAID1 as the default mirroring
>> backend)
> 
> The problem that in case of mismatches, it's ambiguous which are
> correct.
> 
At the moment that is correct.  I've been planning for some time now to
write a patch so that the RAID1 implementation on more than 2 devices
checks what the majority of other devices say about the block, and
then updates all of them with the majority.  Barring a manufacturing
defect or firmware bug, any group of three or more disks is
statistically very unlikely to have a read error at the same place on
each disk until they have accumulated enough bad sectors that they are
totally unusable, so this would allow recovery in a non-degraded RAID1
array in most cases.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 18:08     ` Chris Murphy
@ 2014-01-09 18:22       ` Austin S Hemmelgarn
  2014-01-09 18:52         ` Chris Murphy
  0 siblings, 1 reply; 45+ messages in thread
From: Austin S Hemmelgarn @ 2014-01-09 18:22 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS

On 2014-01-09 13:08, Chris Murphy wrote:
> 
> On Jan 9, 2014, at 5:41 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Having checksumming is good, and a second 
>> copy in case one fails the checksum is nice, but what if they BOTH do?
>> I'd love to have the choice of (at least) three-way-mirroring, as for me 
>> that seems the best practical hassle/cost vs. risk balance I could get, 
>> but it's not yet possible. =:^(
> 
> I'm on the fence on n-way. 
> 
> HDDs get bigger at a faster rate than their performance improves, so rebuild times keep getting higher. For cases where the data is really important, backup-restore doesn't provide the necessary uptime, and minimum single drive performance is needed, it can make sense to want three copies.
> 
> But what's the probability of both drives in a mirrored raid set dying, compared to something else in the storage stack dying? I think at 3 copies, you've got other risks that the 3rd copy doesn't manage, like a power supply, controller card, or logic board dying.
> 
The risk isn't as much both drives dying at the same time as one dying
during a rebuild of the array, which is more and more likely as drives
get bigger and bigger.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 10:42 ` Hugo Mills
  2014-01-09 12:41   ` Duncan
@ 2014-01-09 18:40   ` Chris Murphy
  2014-01-09 19:13     ` Kyle Gates
  1 sibling, 1 reply; 45+ messages in thread
From: Chris Murphy @ 2014-01-09 18:40 UTC (permalink / raw)
  To: Btrfs BTRFS; +Cc: Hugo Mills


On Jan 9, 2014, at 3:42 AM, Hugo Mills <hugo@carfax.org.uk> wrote:

> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>> Hi,
>> 
>> I am running write-intensive (well sort of, one write every 10s)
>> workloads on cheap flash media which proved to be horribly unreliable.
>> A 32GB microSDHC card reported bad blocks after 4 days, while a usb
>> pen drive returns bogus data without any warning at all.
>> 
>> So I wonder, how would btrfs behave in raid1 on two such devices?
>> Would it simply mark bad blocks as "bad" and continue to be
>> operational, or will it bail out when some block can not be
>> read/written anymore on one of the two devices?
> 
>   If a block is read and fails its checksum, then the other copy (in
> RAID-1) is checked and used if it's good. The bad copy is rewritten to
> use the good data.
> 
>   If the block is bad such that writing to it won't fix it, then
> there's probably two cases: the device returns an IO error, in which
> case I suspect (but can't be sure) that the FS will go read-only. Or
> the device silently fails the write and claims success, in which case
> you're back to the situation above of the block failing its checksum.

In a normally operating drive, when the drive firmware locates a physical sector with persistent write failures, it's dereferenced. So the LBA points to a reserve physical sector, and the original physical sector can no longer be accessed by LBA. If all of the reserve sectors get used up, the next persistent write failure will result in a write error reported to libata; this will appear in dmesg, and should be treated as the drive no longer being in normal operation. Such a drive is useful for storage developers, but not for production usage.

>   There's no marking of bad blocks right now, and I don't know of
> anyone working on the feature, so the FS will probably keep going back
> to the bad blocks as it makes CoW copies for modification.

This is maybe relevant:
https://www.kernel.org/doc/htmldocs/libata/ataExceptions.html

"READ and WRITE commands report CHS or LBA of the first failed sector but ATA/ATAPI standard specifies that the amount of transferred data on error completion is indeterminate, so we cannot assume that sectors preceding the failed sector have been transferred and thus cannot complete those sectors successfully as SCSI does."

If I understand that correctly, Btrfs really ought to either punt the device, or make the whole volume read-only. For production use, going read-only very well could mean data loss, even while preserving the state of the file system. Eventually I'd rather see the offending device ejected from the volume, and for the volume to remain rw,degraded.


Chris Murphy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 17:43               ` Hugo Mills
@ 2014-01-09 18:40                 ` George Eleftheriou
  0 siblings, 0 replies; 45+ messages in thread
From: George Eleftheriou @ 2014-01-09 18:40 UTC (permalink / raw)
  To: Hugo Mills, linux-btrfs

Thanks Hugo,

Since:

-- I keep daily backups
-- all 4 devices are of the same size

I think I can test it (as soon as I have some time to spend in the
transition to BTRFS) and verify your assumptions (...and get my wish)



>    If you have an even number of devices and all the devices are the
> same size, then:
>
>  * the block group allocator will use all the devices each time
>  * the amount of space on each device will always be the same
>
> If the sort in the allocator is stable and resolves ties in free space
> by using the device ID number, the above properties should guarantee
> that the allocation is stable, so each new block group will have the
> same functional chunk on the same device, and you get your wish.
>
>    It's been a few months since I looked at that code, but I don't
> recall seeing anything directly contradictory to the above
> assumptions.
>
>    Of course, if you have an odd number of devices, the allocator will
> omit a different device on each block group, and you lose the ability
> to survive (some) two-device failures. I suspect that the odds of
> surviving a two-device failure are still non-zero, but less than if
> you had an even number of devices. I'm not about to attempt an
> ab-initio computation of the probabilities, but it shouldn't be too
> hard to do either a monte-carlo simulation or a simple brute-force
> enumeration of the possibilities for a given configuration.
>
>    Hugo.
>
> --
> === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
>   PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
>    --- <Diablo-D3> My code is never released,  it escapes from the ---
>           git repo and kills a few beta testers on the way out

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 18:22       ` Austin S Hemmelgarn
@ 2014-01-09 18:52         ` Chris Murphy
  2014-01-10 17:03           ` Duncan
  0 siblings, 1 reply; 45+ messages in thread
From: Chris Murphy @ 2014-01-09 18:52 UTC (permalink / raw)
  To: Btrfs BTRFS


On Jan 9, 2014, at 11:22 AM, Austin S Hemmelgarn <ahferroin7@gmail.com> wrote:

> On 2014-01-09 13:08, Chris Murphy wrote:
>> 
>> On Jan 9, 2014, at 5:41 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>>> Having checksumming is good, and a second 
>>> copy in case one fails the checksum is nice, but what if they BOTH do?
>>> I'd love to have the choice of (at least) three-way-mirroring, as for me 
>>> that seems the best practical hassle/cost vs. risk balance I could get, 
>>> but it's not yet possible. =:^(
>> 
>> I'm on the fence on n-way. 
>> 
>> HDDs get bigger at a faster rate than their performance improves, so rebuild times keep getting higher. For cases where the data is really important, backup-restore doesn't provide the necessary uptime, and minimum single drive performance is needed, it can make sense to want three copies.
>> 
>> But what's the probability of both drives in a mirrored raid set dying, compared to something else in the storage stack dying? I think at 3 copies, you've got other risks that the 3rd copy doesn't manage, like a power supply, controller card, or logic board dying.
>> 
> The risk isn't as much both drives dying at the same time as one dying
> during a rebuild of the array, which is more and more likely as drives
> get bigger and bigger.

Understood. I'm considering a 2nd drive dying during rebuild (from a 1st drive dying) as essentially simultaneous failures. And in the case of raid10, the chance that the 2nd drive to fail is the lonesome surviving drive of the already-degraded mirrored set is statistically very small. The next drive to fail is going to be some other drive in the array, which still has a mirror.

I'm not saying there's no value in n-way. I'm just saying adding more redundancy only solves one particular failure vector that's still probably less likely than losing a power supply or a controller, or even user-induced data loss that ends up affecting all three copies anyway.

And yes, it's easier to just add drives and make 3 copies than it is to set up a cluster. But that's the trade-off when drives are so dense that the rebuild times prompt adding even more high-density drives to solve the problem.




Chris Murphy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: How does btrfs handle bad blocks in raid1?
  2014-01-09 18:40   ` Chris Murphy
@ 2014-01-09 19:13     ` Kyle Gates
  2014-01-09 19:31       ` Chris Murphy
  0 siblings, 1 reply; 45+ messages in thread
From: Kyle Gates @ 2014-01-09 19:13 UTC (permalink / raw)
  To: linux-btrfs

On Thu, 9 Jan 2014 11:40:20 -0700 Chris Murphy wrote:
>
> On Jan 9, 2014, at 3:42 AM, Hugo Mills wrote:
>
>> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>>> Hi,
>>>
>>> I am running write-intensive (well sort of, one write every 10s)
>>> workloads on cheap flash media which proved to be horribly unreliable.
>>> A 32GB microSDHC card reported bad blocks after 4 days, while a usb
>>> pen drive returns bogus data without any warning at all.
>>>
>>> So I wonder, how would btrfs behave in raid1 on two such devices?
>>> Would it simply mark bad blocks as "bad" and continue to be
>>> operational, or will it bail out when some block can not be
>>> read/written anymore on one of the two devices?
>>
>> If a block is read and fails its checksum, then the other copy (in
>> RAID-1) is checked and used if it's good. The bad copy is rewritten to
>> use the good data.
>>
>> If the block is bad such that writing to it won't fix it, then
>> there's probably two cases: the device returns an IO error, in which
>> case I suspect (but can't be sure) that the FS will go read-only. Or
>> the device silently fails the write and claims success, in which case
>> you're back to the situation above of the block failing its checksum.
>
> In a normally operating drive, when the drive firmware locates a physical sector with persistent write failures, it's dereferenced. So the LBA points to a reserve physical sector, the originally can't be accessed by LBA. If all of the reserve sectors get used up, the next persistent write failure will result in a write error reported to libata and this will appear in dmesg, and should be treated as the drive being no longer in normal operation. It's a drive useful for storage developers, but not for production usage.
>
>> There's no marking of bad blocks right now, and I don't know of
>> anyone working on the feature, so the FS will probably keep going back
>> to the bad blocks as it makes CoW copies for modification.
>
> This is maybe relevant:
> https://www.kernel.org/doc/htmldocs/libata/ataExceptions.html
>
> "READ and WRITE commands report CHS or LBA of the first failed sector but ATA/ATAPI standard specifies that the amount of transferred data on error completion is indeterminate, so we cannot assume that sectors preceding the failed sector have been transferred and thus cannot complete those sectors successfully as SCSI does."
>
> If I understand that correctly, Btrfs really ought to either punt the device, or make the whole volume read-only. For production use, going read-only very well could mean data loss, even while preserving the state of the file system. Eventually I'd rather see the offending device ejected from the volume, and for the volume to remain rw,degraded.

I would like to see btrfs hold onto the device in a read-only state, as is done during a device replace operation. New writes would maintain the raid level but go out to the remaining devices, and only go full filesystem read-only if the minimum number of writable devices is not met. Once a new device is added in, the replace operation could commence and drop the bad device when complete.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 19:13     ` Kyle Gates
@ 2014-01-09 19:31       ` Chris Murphy
  2014-01-09 23:24         ` George Mitchell
  0 siblings, 1 reply; 45+ messages in thread
From: Chris Murphy @ 2014-01-09 19:31 UTC (permalink / raw)
  To: Btrfs BTRFS


On Jan 9, 2014, at 12:13 PM, Kyle Gates <kylegates@hotmail.com> wrote:

> On Thu, 9 Jan 2014 11:40:20 -0700 Chris Murphy wrote:
>> 
>> On Jan 9, 2014, at 3:42 AM, Hugo Mills wrote:
>> 
>>> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>>>> Hi,
>>>> 
>>>> I am running write-intensive (well sort of, one write every 10s)
>>>> workloads on cheap flash media which proved to be horribly unreliable.
>>>> A 32GB microSDHC card reported bad blocks after 4 days, while a usb
>>>> pen drive returns bogus data without any warning at all.
>>>> 
>>>> So I wonder, how would btrfs behave in raid1 on two such devices?
>>>> Would it simply mark bad blocks as "bad" and continue to be
>>>> operational, or will it bail out when some block can not be
>>>> read/written anymore on one of the two devices?
>>> 
>>> If a block is read and fails its checksum, then the other copy (in
>>> RAID-1) is checked and used if it's good. The bad copy is rewritten to
>>> use the good data.
>>> 
>>> If the block is bad such that writing to it won't fix it, then
>>> there's probably two cases: the device returns an IO error, in which
>>> case I suspect (but can't be sure) that the FS will go read-only. Or
>>> the device silently fails the write and claims success, in which case
>>> you're back to the situation above of the block failing its checksum.
>> 
>> In a normally operating drive, when the drive firmware locates a physical sector with persistent write failures, it's dereferenced. So the LBA points to a reserve physical sector, the originally can't be accessed by LBA. If all of the reserve sectors get used up, the next persistent write failure will result in a write error reported to libata and this will appear in dmesg, and should be treated as the drive being no longer in normal operation. It's a drive useful for storage developers, but not for production usage.
>> 
>>> There's no marking of bad blocks right now, and I don't know of
>>> anyone working on the feature, so the FS will probably keep going back
>>> to the bad blocks as it makes CoW copies for modification.
>> 
>> This is maybe relevant:
>> https://www.kernel.org/doc/htmldocs/libata/ataExceptions.html
>> 
>> "READ and WRITE commands report CHS or LBA of the first failed sector but ATA/ATAPI standard specifies that the amount of transferred data on error completion is indeterminate, so we cannot assume that sectors preceding the failed sector have been transferred and thus cannot complete those sectors successfully as SCSI does."
>> 
>> If I understand that correctly, Btrfs really ought to either punt the device, or make the whole volume read-only. For production use, going read-only very well could mean data loss, even while preserving the state of the file system. Eventually I'd rather see the offending device ejected from the volume, and for the volume to remain rw,degraded.
> 
> I would like to see btrfs hold onto the device in a read-only state like is done during a device replace operation. New writes would maintain the raid level but go out to the remaining devices and only go full filesystem read-only if the minimum number of writable devices is not met. Once a new device is added in, the replace operation could commence and drop the bad device when complete. 	

Sure, it's a fine optimization for a bad device to be read-only while the volume is still rw, if that's possible.

Chris Murphy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 19:31       ` Chris Murphy
@ 2014-01-09 23:24         ` George Mitchell
  2014-01-10  0:08           ` Clemens Eisserer
  0 siblings, 1 reply; 45+ messages in thread
From: George Mitchell @ 2014-01-09 23:24 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS

I really suspect a lot of bad block issues can be avoided by monitoring 
SMART data.  SMART is working very well for me with btrfs formatted 
drives.  SMART will detect when sectors silently fail and as those 
failures accumulate, SMART will warn in an obvious way that the drive in 
question is at end of life.  So I think the whole bad block issue should 
ideally be handled at a lower level than the filesystem with modern hard drives.
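
For anyone wanting to do the same, something as simple as the following
(device name adjusted, with smartd left running to do the ongoing
warning part) covers most of it:

smartctl -H /dev/sda
smartctl -A /dev/sda | grep -i -E 'reallocated|pending|uncorrect'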

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 23:24         ` George Mitchell
@ 2014-01-10  0:08           ` Clemens Eisserer
  2014-01-10  0:46             ` George Mitchell
  0 siblings, 1 reply; 45+ messages in thread
From: Clemens Eisserer @ 2014-01-10  0:08 UTC (permalink / raw)
  To: linux-btrfs

Hi George,

> I really suspect a lot of bad block issues can be avoided by monitoring
> SMART data.  SMART is working very well for me with btrfs formatted drives.
> SMART will detect when sectors silently fail and as those failures
> accumulate, SMART will warn in an obvious way that the drive in question is
> at end of life.  So I think the whole bad block issue should ideally be
> handled at a lower level than filesystem with modern hard drives.

At least my original request was about cheap flash media, where you
don't have the luxury of "trusting" the hardware to behave properly.
In fact, it might be beneficial for an SD card to not report ECC
errors - most likely the user won't notice a small glitch playing
back music - but he definitely will notice when the smartphone reports
read errors and stops playback, which will cause that card to be RMA'd.

Also, wouldn't your argument also be valid for checksums - why
checksum in software, when in theory the drive + controllers should do
it anyway ;)

Regards, Clemens

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-10  0:08           ` Clemens Eisserer
@ 2014-01-10  0:46             ` George Mitchell
  0 siblings, 0 replies; 45+ messages in thread
From: George Mitchell @ 2014-01-10  0:46 UTC (permalink / raw)
  To: Clemens Eisserer, linux-btrfs

Hello Clemens,

On 01/09/2014 04:08 PM, Clemens Eisserer wrote:
> Hi George,
>
>> I really suspect a lot of bad block issues can be avoided by monitoring
>> SMART data.  SMART is working very well for me with btrfs formatted drives.
>> SMART will detect when sectors silently fail and as those failures
>> accumulate, SMART will warn in an obvious way that the drive in question is
>> at end of life.  So I think the whole bad block issue should ideally be
>> handled at a lower level than filesystem with modern hard drives.
> At least my original request was about cheap flash media, where you
> don't have the luxury that you can "trust" the hardware behaving
> properly. In fact, it might be benefitial for a SD card to not report
> ECC errors - most likely the user won't notice a small glitch playing
> back music - but he definitively will when the smartphone reports read
> errors and stopping playback which will cause that card to be RMAd.
>
> Also, wouldn't your argument be also valid for checksums - why
> checksum in software, when in theory the drive + controllers should do
> it anyway ;)
>
> Regards, Clemens
>
It would certainly be a vast improvement if flash media had some of the 
sanity checking capability that conventional media has, but to say that 
these sorts of problems with flash media are legendary would almost be 
an understatement.

As for checksums, I view them more as a tool to detect data decay as 
opposed to checking for failed writes.  Of course that data decay might 
well result in failed writes when btrfs scrub tries to correct it.  At 
that point I would prefer that the drive, even flash media type, would 
catch and resolve write failures.  If it doesn't happen at the hardware 
layer, according to how I understand Hugo's answer, btrfs, at least for 
now, is not capable of it.  I believe it is true that filesystems have 
historically done bad-block handling, but I do think it is now moving to 
the hardware layer, which is probably the best place for it to be, and 
the flash drive industry needs to solve this problem at the 
hardware/firmware level.  That is my opinion anyway.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 16:49         ` George Eleftheriou
  2014-01-09 17:09           ` Hugo Mills
  2014-01-09 17:29           ` Chris Murphy
@ 2014-01-10 15:27           ` Duncan
  2014-01-10 15:46             ` George Mitchell
  2 siblings, 1 reply; 45+ messages in thread
From: Duncan @ 2014-01-10 15:27 UTC (permalink / raw)
  To: linux-btrfs

George Eleftheriou posted on Thu, 09 Jan 2014 17:49:48 +0100 as excerpted:

> I'm really looking forward to the day that typing:
> 
> mkfs.btrfs -d raid10 -m raid10  /dev/sd[abcd]
> 
> will do exactly what is expected to do. A true RAID10 resilient in 2
> disks' failure. Simple and beautiful.
> 
> We're almost there...

I see the further discussion, but three comments:

1) (As should be obvious by now, but as the saying goes...)

I want N-way-mirroring so bad I can taste it!  =:^)

2) Assuming a guaranteed 2-device-drop safe 3(+)-way-mirroring 
possibility, the above mkfs.btrfs would by the same assumption of 
necessity be a bit more complicated than that (and would require six 
devices of the same size for simplest conceptual formulation, not the 
four shown above).

Because at that point, a distinction between these two possibilities for 
a 6-device raid10 would need to be made:

* Two-way raid1/mirror on the devices, three-way raid0/stripe on top.

This is the current default and only choice, as discussed elsewhere in 
the subthread.  The three-way-stripe is 3X fast (ideal, probably more 
like 2X fast in practice, allowing for overhead), while the 2-way-mirror 
provides guaranteed 1-device-drop safety, with a possibility to lose two 
devices and recover, or not, depending on which two they are.

For maximum backward compatibility with what we have now, since it /is/ 
what we have now, that's likely what you'd still get with this:

mkfs.btrfs -d raid10 -m raid10 /dev/sd[abcdef]

... but it'd only guarantee single-device-drop safety.

The alternative, which I want so bad I can taste it, would be:

* Three-way raid1/mirror on the devices, two-way raid0/stripe on top.

That would sacrifice the 3X speed reducing it to 2X (ideal, probably 1.5X 
in practice due to overhead), but the 3-way-mirror would provide *BOTH* 
guaranteed 2-device-drop safety, *AND* guaranteed checksummed 3-way 
individual-btrfs-node integrity-checked mirroring, such that should any 
two of the three mirrors fail checksum, there'd still be that third copy.

What would the mkfs.btrfs command look like for that?  I've no insight on 
exactly how they plan to implement it, but here's one possible idea:

mkfs.btrfs -d raid10.3 -m raid10.3 /dev/sd[abcdef]

The ".3" bit would indicate three-way-mirroring instead of the default 2-
way-mirroring.  It has the advantage of relative brevity, but isn't 
entirely intuitive.

Another possibility would be a more explicit two-component mode-spec, 
like this:

mkfs.btrfs -d mirror3 (-d) raid10, -m mirror3 (-m) raid10 /dev/sd[abcdef]

(Whether the second -d/-m specifier was required to be there, optional, 
or could not be there, would depend on how they set up the parser.  
Another option would be a no-space comma separator: -d mirror3,raid10
-m mirror3,raid10 .)

This is more verbose but MUCH clearer, and as such I believe would be 
preferred to the dot-format, since after all, mkfs isn't something most 
people do a lot of, so clarity should be preferred to brevity.  And I'd 
predict the no-space-comma-separator, since that format's least 
complicated in terms of shell parsing, and is already familiar from usage 
in fstab, among other places.

Oh, that would taste SOOO good! =:^)

3) Just for clarity in case anyone were to get mixed up, those devices 
can be partitions (or for that matter, mdraids or whatever) too.  They 
don't have to be actual whole physical devices.  So /dev/sd[abcdef]5 , 
for instance, would work too.  That's actually what I'm already doing 
here, although obviously not with the n-way-mirroring I so want, as it's not 
available yet.

(This comment specifically included since the fact that multi-device 
btrfs could be on partition-devices wasn't clear to at least one list 
poster, not that long ago.  So just to make it explicitly clear to 
anybody stumbling on this post from google or whatever...)
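
A concrete illustration of points 2 and 3 with today's tools (device and 
mount-point names here are only examples, and of course this gives the 
current 2-way-mirrored raid10, so only single-device-drop is guaranteed):

mkfs.btrfs -d raid10 -m raid10 /dev/sd[abcdef]5
mount /dev/sda5 /mnt
btrfs filesystem df /mnt    # expect lines like "Data, RAID10: total=..., used=..."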

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-10 15:27           ` Duncan
@ 2014-01-10 15:46             ` George Mitchell
  0 siblings, 0 replies; 45+ messages in thread
From: George Mitchell @ 2014-01-10 15:46 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 01/10/2014 07:27 AM, Duncan wrote:
> George Eleftheriou posted on Thu, 09 Jan 2014 17:49:48 +0100 as excerpted:
>
>> I'm really looking forward to the day that typing:
>>
>> mkfs.btrfs -d raid10 -m raid10  /dev/sd[abcd]
>>
>> will do exactly what it is expected to do: a true RAID10, resilient to a
>> 2-disk failure. Simple and beautiful.
>>
>> We're almost there...
> I see the further discussion, but three comments:
>
> 1) (As should be obvious by now, but as the saying goes...)
>
> I want N-way-mirroring so bad I can taste it!  =:^)
>
> 2) Assuming a guaranteed 2-device-drop safe 3(+)-way-mirroring
> possibility, the above mkfs.btrfs would by the same assumption of
> necessity be a bit more complicated than that (and would require six
> devices of the same size for simplest conceptual formulation, not the
> four shown above).
>
> Because at that point, a distinction between these two possibilities for
> a 6-device raid10 would need to be made:
>
> * Two-way raid1/mirror on the devices, three-way raid0/stripe on top.
>
> This is the current default and only choice, as discussed elsewhere in
> the subthread.  The three-way-stripe is 3X fast (ideal, probably more
> like 2X fast in practice, allowing for overhead), while the 2-way-mirror
> provides guaranteed 1-device-drop safety, with a possibility to lose two
> devices and recover, or not, depending on which two they are.
>
> For maximum backward compatibility with what we have now, since it /is/
> what we have now, that's likely what you'd still get with this:
>
> mkfs.btrfs -d raid10 -m raid10 /dev/sd[abcdef]
>
> ... but it'd only guarantee single-device-drop safety.
>
> The alternative, which I want so bad I can taste it, would be:
>
> * Three-way raid1/mirror on the devices, two-way raid0/stripe on top.
>
> That would sacrifice the 3X speed reducing it to 2X (ideal, probably 1.5X
> in practice due to overhead), but the 3-way-mirror would provide *BOTH*
> guaranteed 2-device-drop safety, *AND* guaranteed checksummed 3-way
> individual-btrfs-node integrity-checked mirroring, such that should any
> two of the three mirrors fail checksum, there'd still be that third copy.
>
> What would the mkfs.btrfs command look like for that?  I've no insight on
> exactly how they plan to implement it, but here's one possible idea:
>
> mkfs.btrfs -d raid10.3 -m raid10.3 /dev/sd[abcdef]
>
> The ".3" bit would indicate three-way-mirroring instead of the default 2-
> way-mirroring.  It has the advantage of relative brevity, but isn't
> entirely intuitive.
>
> Another possibility would be a more explicit two-component mode-spec,
> like this:
>
> mkfs.btrfs -d mirror3 (-d) raid10, -m mirror3 (-m) raid10 /dev/sd[abcdef]
>
> (Whether the second -d/-m specifier was required to be there, optional,
> or could not be there, would depend on how they set up the parser.
> Another option would be a no-space comma separator: -d mirror3,raid10
> -m mirror3,raid10 .)
>
> This is more verbose but MUCH clearer, and as such I believe would be
> preferred to the dot-format, since after all, mkfs isn't something most
> people do a lot of, so clarity should be preferred to brevity.  And I'd
> predict the no-space-comma-separator, since that format's least
> complicated in terms of shell parsing, and is already familiar from usage
> in fstab, among other places.
>
> Oh, that would taste SOOO good! =:^)
>
> 3) Just for clarity in case anyone were to get mixed up, those devices
> can be partitions (or for that matter, mdraids or whatever) too.  They
> don't have to be actual whole physical devices.  So /dev/sd[abcdef]5 ,
> for instance, would work too.  That's actually what I'm already doing
> here, although obviously not with the n-way-mirroring I so want, as it's not
> available yet.
>
> (This comment specifically included since the fact that multi-device
> btrfs could be on partition-devices wasn't clear to at least one list
> poster, not that long ago.  So just to make it explicitly clear to
> anybody stumbling on this post from google or whatever...)
>
Duncan, you are describing exactly the sort of ROBUST RAID product I 
would like to see btrfs become.  In this world of ridiculously 
inexpensive hard drives I don't think we should ever have to risk ending 
up in a degraded state, at least certainly not for long, but not ever 
would be ideal.  We should never end up being in a panic to change out a 
drive and facing additional panic as to whether a rebuild is going to 
succeed or fall on its face.  Those days should be over forever, 
barring, of course, a direct nuclear hit.  - George

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-09 18:52         ` Chris Murphy
@ 2014-01-10 17:03           ` Duncan
  0 siblings, 0 replies; 45+ messages in thread
From: Duncan @ 2014-01-10 17:03 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Thu, 09 Jan 2014 11:52:08 -0700 as excerpted:

> Understood. I'm considering a 2nd drive dying during rebuild (from a 1st
> drive dying) as essentially simultaneous failures. And in the case of
> raid10, the likelihood of a 2nd drive failure being the lonesome drive
> in a mirrored set is statistically very low. The next drive to fail
> is going to be some other drive in the array, which still has a mirror.

While still statistically unlikely, the likelihood of that critical 
second device[1] in a mirror-pair on a raid10 dying isn't /as/ unlikely 
as you might think -- it's actually more likely than that of any one of 
the still mirrored devices failing, for example.

The reason is that as soon as one of the devices in a mirror-pair fails, 
the other one is suddenly doing double the work it was previously, and 
twice the work any of the other still-paired devices in the array is doing!  
And as any human who has tried to pull an 80-hour-work-week can attest, 
double the work is *NOT* simply double the stress!

If both devices in the pair are from the same manufacturing run and were 
installed at the same time and run under exactly the same conditions, as 
quite likely unless deliberately guarded against, chances are rather 
higher than you'd like that by the time one fails, suddenly piling twice 
the workload on the OTHER one isn't going to end well, especially under 
the increased workload of a recovery after a replacement device has been 
added.
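
A quick back-of-the-envelope illustration, under the admittedly simplistic 
assumption that doubling the workload simply doubles the failure rate during 
the rebuild window, for a 6-device raid10 with one device already gone:

awk 'BEGIN {
  partner = 2; others = 4;    # relative failure rates: lone partner vs the four still-paired devices
  printf "P(next failure is the lone partner)      = %.2f\n", partner / (partner + others);
  printf "P(next failure is a given paired device) = %.2f\n", 1 / (partner + others);
}'

That prints 0.33 versus 0.17 -- the one device you cannot afford to lose is 
twice as likely as any other single device to be the next casualty.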

That's the well known but all too infrequently considered trap of both 
raid5 and 2-way-mirrored raid1, thus the reason many admins are so 
reluctant to trust them and prefer N-way-mirroring/parity, with N bumped 
upward as necessary to suit the level of device-failure paranoia.

For me, that cost/benefit/paranoia balance tends toward N=3 for 
mirroring, N=2 for parity (since parity parallels mirror redundancy 
count, not mirror total count). =:^)

---
[1] I'm trying to train myself to use "device" in most cases where I 
formerly used "drive", since "device" is generally technically correct 
even if it's a logical/virtual device such as an mdraid device or even 
simply a partition on a physical device, while "drive" may well be 
technically incorrect, since both virtual devices such as mdraid and 
partitions, and physical devices such as SSDs, are arguably not "drives" 
at all.  But it's definitely a process I'm still in the middle of.  It's 
not a formed habit yet and if I'm not thinking about that when I choose my 
term...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 21:14       ` Chris Murphy
  2014-01-14 21:48         ` George Mitchell
  2014-01-14 21:48         ` George Mitchell
@ 2014-01-14 22:14         ` George Mitchell
  2 siblings, 0 replies; 45+ messages in thread
From: George Mitchell @ 2014-01-14 22:14 UTC (permalink / raw)
  To: Btrfs BTRFS

On 01/14/2014 01:14 PM, Chris Murphy wrote:
>
>> And the key to monitoring hard drive health, in my opinion, is SMART and what we are lacking at this point is a SMART capability to provide visual notifications to the user when any hard drive starts to seriously degrade or suddenly fails.
> Gnome does this:
> https://mail.gnome.org/archives/commits-list/2012-November/msg03124.html
>
> The problem is that something around 40% of failures come with absolutely no advance warning by SMART. So yes it's better than nothing but we're still rather likely to not get sufficient warning.
>
>
>
Well, I *think* I found the answer to this one.

http://forum.kde.org/viewtopic.php?f=66&t=99555

And note the response to the poll question.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 21:37           ` Chris Murphy
  2014-01-14 21:45             ` Chris Murphy
@ 2014-01-14 21:54             ` Roman Mamedov
  1 sibling, 0 replies; 45+ messages in thread
From: Roman Mamedov @ 2014-01-14 21:54 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 2105 bytes --]

On Tue, 14 Jan 2014 14:37:46 -0700
Chris Murphy <lists@colorremedies.com> wrote:

> Reserve sectors are fundamental to ECC. If there are no more reserves, the
> status should be a failed drive, it can no longer do its own relocation of
> data experiencing transient read errors in this case.

With the Reallocated sector count being low in that case, I assume the drive
*had* a lot of reserve space, but due to a buggy firmware didn't "see the need
to use it" just yet for this particular sector.

It's not a question of what is "fundamental" or right; I am describing
observed behavior in the real world - and yes, of course, that behavior is
incorrect and probably an indication of buggy, peculiar firmware.

> I'm considering persistent write failure as a result of no more reserve
> sectors being available.

The write doesn't fail; it succeeds, but you still can't read back the bad block.
And the "reallocated sector count" does not increase.

> Well, not totally useless, if it flags the user with an hour's notice in Gnome,

If we're again talking the user GUI-level indicators, then an increase in
"Reallocated sector count", "Current pending sectors" or "Reported
uncorrectable" should also be a reason for such GUI notice to appear. And
AFAIK that's how smartd howto/manual recommends configuring it (i.e. an E-Mail
on any change in those critical attributes).
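
For anyone wanting to wire that up, a sketch of the sort of smartd.conf line 
meant here (the mail address is an example, and the exact directives are 
documented in smartd.conf(5)):

# -a enables the usual health/attribute/error-log checks; the -R entries also
# report any change in the raw values of the critical attributes (5 reallocated,
# 187 reported uncorrectable, 197 pending, 198 offline uncorrectable);
# -m mails the warnings, -M diminishing resends them at growing intervals
DEVICESCAN -a -R 5 -R 187 -R 197 -R 198 -m admin@example.com -M diminishing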

> >> a way to send a command to the firmware to persistently increase the reserve
> >> sectors at the expense of available space - in effect it reduces the LBA
> >> count by e.g. 10MB, thereby increasing the reserve pool by 10MB.
> > 
> > Yes please that, and also a pony. :)
> 
> That seems a lot easier to implement than anything else being discussed. 

Oh really? Pushing that feature into multiple competing vendors' HDD firmware
across diverse models, product lines, interfaces and revisions is *easier* than
mainlining a patch into the Linux kernel?... My point was that this would have
been a good feature, but there's no chance we're going to see it realized.

-- 
With respect,
Roman

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 21:14       ` Chris Murphy
  2014-01-14 21:48         ` George Mitchell
@ 2014-01-14 21:48         ` George Mitchell
  2014-01-14 22:14         ` George Mitchell
  2 siblings, 0 replies; 45+ messages in thread
From: George Mitchell @ 2014-01-14 21:48 UTC (permalink / raw)
  To: Btrfs BTRFS

On 01/14/2014 01:14 PM, Chris Murphy wrote:
> On Jan 14, 2014, at 1:29 PM, George Mitchell <george@chinilu.com> wrote:
>>> And the key to monitoring hard drive health, in my opinion, is SMART and what we are lacking at this point is a SMART capability to provide visual notifications to the user when any hard drive starts to seriously degrade or suddenly fails.
> Gnome does this:
> https://mail.gnome.org/archives/commits-list/2012-November/msg03124.html
>
> The problem is that something around 40% of failures come with absolutely no advance warning by SMART. So yes it's better than nothing but we're still rather likely to not get sufficient warning.
>
The journal already marks serious problems and shows them in red type.  I 
agree that the role of bringing this information to the user belongs to the 
desktop, in my case KDE, which is not providing me with a solution to this.  
I am glad that at least Gnome has taken care of this, thanks probably to 
Red Hat.
>>   If SMART were capable of launching pop up warnings, btrfs would not have to worry so much about arrays going simplex undetected.
> I don't see a tie-in between Btrfs and SMART. Btrfs's behavior in the face of SMART indicating e.g. a high number of reallocated sectors in the past hour shouldn't change. Only once the drive reports read or write failures would Btrfs need to change its behavior. SMART preferably should flag the user with some kind of warning.
Well OK, I get that.
>
>>   And it should really be the user's responsibility to be running SMART and providing a sufficient number of drives AND sufficient additional free space to accommodate potential drive failure and still retain the desired level of redundancy in their RAID arrays.  That is where I stand on this.
> I'd say the OS should do it. With linux distros, that's the desktop. I don't think users should have to be configuring SMART at all.
>
The problem is that some users might be running systems that, for 
whatever reason, are not compatible with SMART and distros are loath to 
automatically enable functions that might result in DOAs.  I myself have 
had multiple cases of locked up systems before due to SMART issues.  But 
I do think the distros need to make SMART configuration a whole lot 
easier than it currently is.
> Chris Murphy
>
>
>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 21:37           ` Chris Murphy
@ 2014-01-14 21:45             ` Chris Murphy
  2014-01-14 21:54             ` Roman Mamedov
  1 sibling, 0 replies; 45+ messages in thread
From: Chris Murphy @ 2014-01-14 21:45 UTC (permalink / raw)
  To: Btrfs BTRFS


On Jan 14, 2014, at 2:37 PM, Chris Murphy <lists@colorremedies.com> wrote:

> I've seen that happen on OS X Server (client doesn't produce SMART warnings in user space).

Oops. It does, just not automatically; it seems you have to go look for this in Disk Utility.

Chris Murphy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 21:19         ` Roman Mamedov
@ 2014-01-14 21:37           ` Chris Murphy
  2014-01-14 21:45             ` Chris Murphy
  2014-01-14 21:54             ` Roman Mamedov
  0 siblings, 2 replies; 45+ messages in thread
From: Chris Murphy @ 2014-01-14 21:37 UTC (permalink / raw)
  To: Btrfs BTRFS


On Jan 14, 2014, at 2:19 PM, Roman Mamedov <rm@romanrm.net> wrote:

> On Tue, 14 Jan 2014 14:05:11 -0700
> Chris Murphy <lists@colorremedies.com> wrote:
> 
>> 
>> On Jan 14, 2014, at 12:37 PM, Roman Mamedov <rm@romanrm.net> wrote:
>>> 
>>> I vaguely remember having some drives that were not able to remap a single
>>> block on write, but doing that successfully if I overwrote a sizable area
>>> around (and including) that block, or overwrite the whole drive. And after
>>> that they worked without issue not exhibiting further bad blocks.
>> 
>> Presumably the SMART self-assessment for this drive was FAIL? 
> 
> No of course not, why?

Reserve sectors are fundamental to ECC. If there are no more reserves, the status should be a failed drive, it can no longer do its own relocation of data experiencing transient read errors in this case.

> SMART goes to FAIL only if one of the attributes falls
> below threshold, in this case that would be the Reallocated Sector Count having
> too many sectors. But nope, it was either zero or in the single digits.

It sounds like we aren't talking about the same thing. I'm considering persistent write failure as a result of no more reserve sectors being available.

> I don't ever remember seeing a SMART FAIL drive that would function in any
> usual sense of that word.

Oh, I have. For a week a drive was reporting imminent failure before we got around to replacing it. It still hadn't actually failed at that point.


>> And if so what's the point of the work around when we only have a pass/fail
>> level granularity for drive health?
> 
> Not sure what you're referring to here. As said above, the FAIL/PASS status is
> largely useless, and the more important indicators are the values and dynamics
> in Reallocated sector count, Current pending sectors, Reported uncorrectable,
> etc.

Well, not totally useless, if it flags the user with an hour's notice in Gnome, they can do some minimal backup. I've seen that happen on OS X Server (client doesn't produce SMART warnings in user space).


> 
>> a way to send a command to the firmware to persistently increase the reserve
> sectors at the expense of available space - in effect it reduces the LBA
>> count by e.g. 10MB, thereby increasing the reserve pool by 10MB.
> 
> Yes please that, and also a pony. :)

That seems a lot easier to implement than anything else being discussed. 

Chris Murphy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 21:00       ` Roman Mamedov
  2014-01-14 21:06         ` Hugo Mills
  2014-01-14 21:27         ` George Mitchell
@ 2014-01-14 21:28         ` George Mitchell
  2 siblings, 0 replies; 45+ messages in thread
From: George Mitchell @ 2014-01-14 21:28 UTC (permalink / raw)
  Cc: Btrfs BTRFS

On 01/14/2014 01:00 PM, Roman Mamedov wrote:
> On Tue, 14 Jan 2014 12:29:28 -0800
> George Mitchell <george@chinilu.com> wrote:
>
>> what we are lacking at this point is a SMART capability to provide
>> visual notifications to the user when any hard drive starts to seriously
>> degrade or suddenly fails.
> You can configure smartd (from smartmontools) to send you E-Mails on any
> change of the monitored SMART attributes.
Well, sorry, email is nice, but it doesn't work for me as a desktop 
user.  1) I don't want to have to hand-configure an email system like 
sendmail to generate outgoing emails.  2) I don't want to have to go 
through and hand-configure /etc/smartd.conf.  3) I don't want my email 
clogged up with SMART messages.
>
>> If SMART were capable of launching pop up warnings
> And I'm sure there are a number of GUI tools out there for just about any OS,
> which can do just that
Really?  And for Linux, which are they?  I know it can be done because various 
administrative tools provided by my distro routinely flash up status 
notices on my screen.  I really like that.  It lets me know what my 
system is doing.  journalctl provides warnings in red type when a drive is 
failing.  That should be thrown up on the screen and stay there till I 
dismiss it.  To me that's a no-brainer.
>
>> btrfs would not have to worry so much about arrays going simplex undetected.
> That said, do not fall into a false sense of security relying on proprietary,
> barely if ever updated after the device has been shipped, and often very
> peculiar-behaving SMART routines inside the black-box HDD firmware as your
> most important data safeguard.
>
> Of course SMART must be checked and monitored, but don't delude yourself into
> thinking it will always warn you of anything going wrong well in advance of
> failure, or even at all.

Well, since I began using SMART I have had a total of two drive failures 
so far, and both of them generated warnings and the drives in question 
were still operating normally when I discarded them.  Plenty of 
opportunity to retrieve the data and wipe the drives.  I'm sure SMART 
does fail to give advance notification in some instances, but even when 
it does, at least it will certainly warn me IMMEDIATELY that the drive 
is gone after the fact.
>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 21:06         ` Hugo Mills
@ 2014-01-14 21:27           ` Chris Murphy
  0 siblings, 0 replies; 45+ messages in thread
From: Chris Murphy @ 2014-01-14 21:27 UTC (permalink / raw)
  To: Btrfs BTRFS


On Jan 14, 2014, at 2:06 PM, Hugo Mills <hugo@carfax.org.uk> wrote:

> On Wed, Jan 15, 2014 at 03:00:21AM +0600, Roman Mamedov wrote:
>> That said, do not fall into a false sense of security relying on proprietary,
>> barely if ever updated after the device has been shipped, and often very
>> peculiar-behaving SMART routines inside the black-box HDD firmware as your
>> most important data safeguard.
>> 
>> Of course SMART must be checked and monitored, but don't delude yourself into
>> thinking it will always warn you of anything going wrong well in advance of
>> failure, or even at all.
> 
>   The famous paper from Google a few years back suggested that SMART
> was a useful predictor of failure in something like 20% of failures.

https://static.googleusercontent.com/media/research.google.com/en/us/archive/disk_failures.pdf


"Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals,"

and

"even when we add all remaining SMART parameters (except temperature) we still find that over 36% of all failed drives had zero
counts on all variables."

For those that do report counts, they still weren't able to come up with a good predictor of failures based on any combination of the current attributes.

So there's a really good chance the drive will fail without warning. This study doesn't report on the accuracy of the health self-assessment, i.e. the PASS/FAIL state for the drive.


Chris Murphy

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 21:05       ` Chris Murphy
@ 2014-01-14 21:19         ` Roman Mamedov
  2014-01-14 21:37           ` Chris Murphy
  0 siblings, 1 reply; 45+ messages in thread
From: Roman Mamedov @ 2014-01-14 21:19 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 1765 bytes --]

On Tue, 14 Jan 2014 14:05:11 -0700
Chris Murphy <lists@colorremedies.com> wrote:

> 
> On Jan 14, 2014, at 12:37 PM, Roman Mamedov <rm@romanrm.net> wrote:
> > 
> > I vaguely remember having some drives that were not able to remap a single
> > block on write, but doing that successfully if I overwrote a sizable area
> > around (and including) that block, or overwrote the whole drive. And after
> > that they worked without issue not exhibiting further bad blocks.
> 
> Presumably the SMART self-assessment for this drive was FAIL? 

No of course not, why? SMART goes to FAIL only if one of the attributes falls
below threshold, in this case that would be the Reallocated Sector Count having
too many sectors. But nope, it was either zero or in the single digits.

I don't ever remember seeing a SMART FAIL drive that would function in any
usual sense of that word. Whereas such peculiarities (bad sectors /
unremappable sectors / not wanting to remap until you overwrite a large area,
perhaps even multiple times at that), all with SMART = PASS, are seen left,
right and center.

> And if so what's the point of the work around when we only have a pass/fail
> level granularity for drive health?

Not sure what you're referring to here. As said above, the FAIL/PASS status is
largely useless, and the more important indicators are the values and dynamics
in Reallocated sector count, Current pending sectors, Reported uncorrectable,
etc.
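
For example (the device name is an example, the attribute names are as 
smartctl prints them):

smartctl -A /dev/sda | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Reported_Uncorrect|Offline_Uncorrectable'

It's the raw values in the last column, and especially whether they keep 
growing, that are worth watching.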

> a way to send a command to the firmware to persistently increase the reserve
> sectors at the expense of available space - in effect it reduces the LBA
> count by e.g. 10MB, thereby increasing the reserve pool by 10MB.

Yes please that, and also a pony. :)

-- 
With respect,
Roman

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 20:29     ` George Mitchell
  2014-01-14 21:00       ` Roman Mamedov
@ 2014-01-14 21:14       ` Chris Murphy
  2014-01-14 21:48         ` George Mitchell
                           ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Chris Murphy @ 2014-01-14 21:14 UTC (permalink / raw)
  To: Btrfs BTRFS


On Jan 14, 2014, at 1:29 PM, George Mitchell <george@chinilu.com> wrote:
>> 
> Chris,  Please don't misunderstand me.  I am not advocating that btrfs or any other filesystem should be dealing with bad blocks.  I believe very strongly that if the drive firmware can't deal with that transparently the drive is, indeed, toast, and should be tossed.

OK.


> And the key to monitoring hard drive health, in my opinion, is SMART and what we are lacking at this point is a SMART capability to provide visual notifications to the user when any hard drive starts to seriously degrade or suddenly fails.

Gnome does this:
https://mail.gnome.org/archives/commits-list/2012-November/msg03124.html

The problem is that something around 40% of failures come with absolutely no advance warning by SMART. So yes it's better than nothing but we're still rather likely to not get sufficient warning.


>  If SMART were capable of launching pop up warnings, btrfs would not have to worry so much about arrays going simplex undetected.

I don't see a tie-in between Btrfs and SMART. Btrfs's behavior in the face of SMART indicating e.g. a high number of reallocated sectors in the past hour shouldn't change. Only once the drive reports read or write failures would Btrfs need to change its behavior. SMART preferably should flag the user with some kind of warning.

>   And it should really be the user's responsibility to be running SMART and providing a sufficient number of drives AND sufficient additional free space to accommodate potential drive failure and still retain the desired level of redundancy in their RAID arrays.  That is where I stand on this.

I'd say the OS should do it. With linux distros, that's the desktop. I don't think users should have to be configuring SMART at all.


Chris Murphy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 21:00       ` Roman Mamedov
@ 2014-01-14 21:06         ` Hugo Mills
  2014-01-14 21:27           ` Chris Murphy
  2014-01-14 21:27         ` George Mitchell
  2014-01-14 21:28         ` George Mitchell
  2 siblings, 1 reply; 45+ messages in thread
From: Hugo Mills @ 2014-01-14 21:06 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: george, Chris Murphy, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 965 bytes --]

On Wed, Jan 15, 2014 at 03:00:21AM +0600, Roman Mamedov wrote:
> That said, do not fall into a false sense of security relying on proprietary,
> barely if ever updated after the device has been shipped, and often very
> peculiar-behaving SMART routines inside the black-box HDD firmware as your
> most important data safeguard.
> 
> Of course SMART must be checked and monitored, but don't delude yourself into
> thinking it will always warn you of anything going wrong well in advance of
> failure, or even at all.

   The famous paper from Google a few years back suggested that SMART
was a useful predictor of failure in something like 20% of failures.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
              --- Python is executable pseudocode; perl ---              
                        is executable line-noise.                        

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 19:37     ` Roman Mamedov
@ 2014-01-14 21:05       ` Chris Murphy
  2014-01-14 21:19         ` Roman Mamedov
  0 siblings, 1 reply; 45+ messages in thread
From: Chris Murphy @ 2014-01-14 21:05 UTC (permalink / raw)
  To: Btrfs BTRFS


On Jan 14, 2014, at 12:37 PM, Roman Mamedov <rm@romanrm.net> wrote:
> 
> I vaguely remember having some drives that were not able to remap a single
> block on write, but doing that successfully if I overwrote a sizable area
> around (and including) that block, or overwrote the whole drive. And after
> that they worked without issue, not exhibiting further bad blocks.

Presumably the SMART self-assessment for this drive was FAIL? And if so what's the point of the work around when we only have a pass/fail level granularity for drive health?

> 
> Or for example consider the 4k sector drives. If even any portion of the
> physical 4k sector is corrupt, some of the eight 512 virtual blocks will be
> unreadable; but the thing is, writing to ANY of them individually will fail,
> because the drive's internal r-m-w will fail to obtain all the pieces of the
> 4k sector from disk to overwrite it.
> 
> So in my opinion one (and perhaps one of the easier) things to consider here,
> would be to try being "generous" in recovery-overwrite, say, rewrite a
> whole 1MB-sized region centered at the unreadable sector.

Not if the SMART status is a FAIL. And if it's not a fail then that sounds like a firmware bug. A better feature would be a way to send a command to the firmware to persistently increase the reserve sectors at the expense of available space - in effect it reduces the LBA count by e.g. 10MB, thereby increasing the reserve pool by 10MB. Changing this setting in either direction would obviously mean data loss; but at least it's managed by firmware which transports the accumulated knowledge. Otherwise these block files are going to be backed up, or subject to btrfs send. And then restored invariably in the wrong location on the same drive, or unnecessarily restored onto a new drive, so then you need exception code if you don't want that behavior. I think it's trouble.


Chris Murphy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 20:29     ` George Mitchell
@ 2014-01-14 21:00       ` Roman Mamedov
  2014-01-14 21:06         ` Hugo Mills
                           ` (2 more replies)
  2014-01-14 21:14       ` Chris Murphy
  1 sibling, 3 replies; 45+ messages in thread
From: Roman Mamedov @ 2014-01-14 21:00 UTC (permalink / raw)
  To: george; +Cc: Chris Murphy, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 1099 bytes --]

On Tue, 14 Jan 2014 12:29:28 -0800
George Mitchell <george@chinilu.com> wrote:

> what we are lacking at this point is a SMART capability to provide 
> visual notifications to the user when any hard drive starts to seriously 
> degrade or suddenly fails.

You can configure smartd (from smartmontools) to send you E-Mails on any
change of the monitored SMART attributes.

> If SMART were capable of launching pop up warnings

And I'm sure there are a number of GUI tools out there for just about any OS,
which can do just that

> btrfs would not have to worry so much about arrays going simplex undetected.

That said, do not fall into a false sense of security relying on proprietary,
barely if ever updated after the device has been shipped, and often very
peculiar-behaving SMART routines inside the black-box HDD firmware as your
most important data safeguard.

Of course SMART must be checked and monitored, but don't delude yourself into
thinking it will always warn you of anything going wrong well in advance of
failure, or even at all.

-- 
With respect,
Roman

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 19:13   ` Chris Murphy
  2014-01-14 19:37     ` Roman Mamedov
@ 2014-01-14 20:29     ` George Mitchell
  2014-01-14 21:00       ` Roman Mamedov
  2014-01-14 21:14       ` Chris Murphy
  1 sibling, 2 replies; 45+ messages in thread
From: George Mitchell @ 2014-01-14 20:29 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS

On 01/14/2014 11:13 AM, Chris Murphy wrote:
> On Jan 9, 2014, at 6:31 PM, George Mitchell <george@chinilu.com> wrote:
>> Jim, my point was that IF the drive does not successfully resolve the bad block issue and btrfs takes a write failure every time it attempts to overwrite the bad data, it is not going to remap that data, but rather it is going to fail the drive.
> If the drive doesn't resolve a bad block on write, then the drive is toast. That's how md handles it. That's even how manufacturers handle it. The point at which write failures occur means there are no reserve sectors left, or the head itself is having problems writing data to even good sectors. Either way, the drive isn't reliable for rw purposes and coming up with a bunch of code to fix bad drives isn't worth development time in my opinion. Such a drive is vaguely interesting for test purposes, however, because even though the drive is toast, we'd like the system to remain stable with it connected first and foremost. And maybe we'd want it as a source during rebuild/replacement.
>
>>   In other words, if the drive has a bad sector which it has not done anything about at the drive level, btrfs will not remap the sector.  It will, rather, fail the drive. Is that not correct?
> I've skimmed for this in the code, but haven't found it, so I'm not sure what the handling is. It's probably easier to take a drive I don't care about, use hdparm to cause a sector to be flagged as bad, and see how Btrfs handles it. (The sector hdparm flags as bad should be clearable again, but I'd rather not screw up a drive I like.)
>
> Chris Murphy
>
>
>
Chris,  Please don't misunderstand me.  I am not advocating that btrfs 
or any other filesystem should be dealing with bad blocks.  I believe 
very strongly that if the drive firmware can't deal with that 
transparently the drive is, indeed, toast, and should be tossed.  And 
the key to monitoring hard drive health, in my opinion, is SMART and 
what we are lacking at this point is a SMART capability to provide 
visual notifications to the user when any hard drive starts to seriously 
degrade or suddenly fails.  This would ideally be mediated by the journal 
daemon, which desperately needs to be enhanced to provide visual and 
ideally audible pop up warnings to the user in such cases.  It would be 
nice to have those notifications from btrfs as well, also mediated by 
the journal daemon, but this is really a SMART specialty and SMART should 
be our first line defense.  Where we need btrfs to move is toward automated 
resiliency, automatically dropping the bad drive(s) and automatically 
following up with a rebalance and return to sanity.  If SMART were 
capable of launching pop up warnings, btrfs would not have to worry so 
much about arrays going simplex undetected.  And it should really be 
the user's responsibility to be running SMART and providing a sufficient 
number of drives AND sufficient additional free space to accommodate 
potential drive failure and still retain the desired level of redundancy 
in their RAID arrays.  That is where I stand on this.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-14 19:13   ` Chris Murphy
@ 2014-01-14 19:37     ` Roman Mamedov
  2014-01-14 21:05       ` Chris Murphy
  2014-01-14 20:29     ` George Mitchell
  1 sibling, 1 reply; 45+ messages in thread
From: Roman Mamedov @ 2014-01-14 19:37 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 1389 bytes --]

On Tue, 14 Jan 2014 12:13:09 -0700
Chris Murphy <lists@colorremedies.com> wrote:

> 
> On Jan 9, 2014, at 6:31 PM, George Mitchell <george@chinilu.com> wrote:
> >> 
> > Jim, my point was that IF the drive does not successfully resolve the bad block issue and btrfs takes a write failure every time it attempts to overwrite the bad data, it is not going to remap that data, but rather it is going to fail the drive.
> 
> If the drive doesn't resolve a bad block on write, then the drive is toast.

I vaguely remember having some drives that were not able to remap a single
block on write, but doing that successfully if I overwrote a sizable area
around (and including) that block, or overwrote the whole drive. And after
that they worked without issue, not exhibiting further bad blocks.

Or for example consider the 4k sector drives. If even any portion of the
physical 4k sector is corrupt, some of the eight 512-byte virtual blocks will be
unreadable; but the thing is, writing to ANY of them individually will fail,
because the drive's internal r-m-w will fail to obtain all the pieces of the
4k sector from disk to overwrite it.

So in my opinion one (and perhaps one of the easier) things to consider here,
would be to try being "generous" in recovery-overwrite, say, rewrite a
whole 1MB-sized region centered at the unreadable sector.
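
Done by hand, that idea looks roughly like this (a sketch only -- BAD is the 
failing LBA from the kernel log, the device name is an example, and anything 
still readable in that region is destroyed, so it only makes sense where the 
filesystem can rewrite the data afterwards, as btrfs raid1 could):

BAD=123456789    # example LBA
dd if=/dev/zero of=/dev/sdX bs=512 seek=$((BAD - 1024)) count=2048 oflag=direct
# 2048 x 512-byte sectors = 1MB, starting 1024 sectors before the bad one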

-- 
With respect,
Roman

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
  2014-01-10  1:31 ` George Mitchell
@ 2014-01-14 19:13   ` Chris Murphy
  2014-01-14 19:37     ` Roman Mamedov
  2014-01-14 20:29     ` George Mitchell
  0 siblings, 2 replies; 45+ messages in thread
From: Chris Murphy @ 2014-01-14 19:13 UTC (permalink / raw)
  To: Btrfs BTRFS


On Jan 9, 2014, at 6:31 PM, George Mitchell <george@chinilu.com> wrote:
>> 
> Jim, my point was that IF the drive does not successfully resolve the bad block issue and btrfs takes a write failure every time it attempts to overwrite the bad data, it is not going to remap that data, but rather it is going to fail the drive.

If the drive doesn't resolve a bad block on write, then the drive is toast. That's how md handles it. That's even how manufacturers handle it. The point at which write failures occur means there are no reserve sectors left, or the head itself is having problems writing data to even good sectors. Either way, the drive isn't reliable for rw purposes and coming up with a bunch of code to fix bad drives isn't worth development time in my opinion. Such a drive is vaguely interesting for test purposes, however, because even though the drive is toast, we'd like the system to remain stable with it connected first and foremost. And maybe we'd want it as a source during rebuild/replacement.

>  In other words, if the drive has a bad sector which it has not done anything about at the drive level, btrfs will not remap the sector.  It will, rather, fail the drive. Is that not correct?

I've skimmed for this in the code, but haven't found it, so I'm not sure what the handling is. It's probably easier to take a drive I don't care about, use hdparm to cause a sector to be flagged as bad, and see how Btrfs handles it. (The sector hdparm flags as bad should be clearable again, but I'd rather not screw up a drive I like.)
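
For anyone with a sacrificial drive, the hdparm options in question are 
roughly these (destructive -- the LBA and device name are only examples):

hdparm --yes-i-know-what-i-am-doing --make-bad-sector 123456 /dev/sdX
# ...exercise btrfs against the now-unreadable sector, then clear it again
# by rewriting it:
hdparm --yes-i-know-what-i-am-doing --repair-sector 123456 /dev/sdX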

Chris Murphy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: How does btrfs handle bad blocks in raid1?
       [not found] <201401100106.s0A16CNd016476@atl4mhib27.myregisteredsite.com>
@ 2014-01-10  1:31 ` George Mitchell
  2014-01-14 19:13   ` Chris Murphy
  0 siblings, 1 reply; 45+ messages in thread
From: George Mitchell @ 2014-01-10  1:31 UTC (permalink / raw)
  To: Jim Salter; +Cc: Clemens Eisserer, linux-btrfs

On 01/09/2014 05:06 PM, Jim Salter wrote:
> On Jan 9, 2014 7:46 PM, George Mitchell <george@chinilu.com> wrote:
>> I would prefer that the drive, even flash media type, would
>> catch and resolve write failures.  If it doesn't happen at the hardware
>> layer, according to how I understand Hugo's answer, btrfs, at least for
>> now, is not capable of it.
> Not sure what you mean by this. If a bit flips on a btrfs-raid1 block, btrfs will detect it. Then it checks the mirror's copy of that block. It returns the good copy, then immediately writes the good copy over the bad copy.
>
> I know this because I tested it directly just last week by flipping a bit in an offline btrfs filesystem manually. When I brought the volume back online and read the file containing the bit I flipped, it operated exactly as described, and logged its actions in kern.log. :-)
Jim, my point was that IF the drive does not successfully resolve the 
bad block issue and btrfs takes a write failure every time it attempts 
to overwrite the bad data, it is not going to remap that data, but 
rather it is going to fail the drive.  In other words, if the drive has 
a bad sector which it has not done anything about at the drive level, 
btrfs will not remap the sector.  It will, rather, fail the drive. Is 
that not correct?
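
For reference, the test Jim describes in the quote above can be reproduced 
without risking real disks, roughly like this (file names, sizes and the 
exact log wording are approximate):

truncate -s 1G disk0.img disk1.img
losetup /dev/loop0 disk0.img
losetup /dev/loop1 disk1.img
mkfs.btrfs -d raid1 -m raid1 /dev/loop0 /dev/loop1
mount /dev/loop0 /mnt
( echo "known contents"; head -c 16K /dev/zero ) > /mnt/testfile  # large enough to be a real data extent
umount /mnt
# find one on-disk copy of the data and corrupt a single byte of it
off=$(grep -abo "known contents" /dev/loop1 | head -n1 | cut -d: -f1)
printf '\377' | dd of=/dev/loop1 bs=1 seek="$off" count=1 conv=notrunc
mount /dev/loop0 /mnt
cat /mnt/testfile > /dev/null   # a plain read may or may not hit the bad copy
btrfs scrub start -B /mnt       # scrub checks both copies and repairs the bad one
dmesg | tail                    # the checksum error and the fixup are logged here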

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2014-01-14 22:13 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-09 10:26 How does btrfs handle bad blocks in raid1? Clemens Eisserer
2014-01-09 10:42 ` Hugo Mills
2014-01-09 12:41   ` Duncan
2014-01-09 12:52     ` Austin S Hemmelgarn
2014-01-09 15:15       ` Duncan
2014-01-09 16:49         ` George Eleftheriou
2014-01-09 17:09           ` Hugo Mills
2014-01-09 17:34             ` George Eleftheriou
2014-01-09 17:43               ` Hugo Mills
2014-01-09 18:40                 ` George Eleftheriou
2014-01-09 17:29           ` Chris Murphy
2014-01-09 18:00             ` George Eleftheriou
2014-01-10 15:27           ` Duncan
2014-01-10 15:46             ` George Mitchell
2014-01-09 17:31       ` Chris Murphy
2014-01-09 18:20         ` Austin S Hemmelgarn
2014-01-09 14:58     ` Chris Mason
2014-01-09 18:08     ` Chris Murphy
2014-01-09 18:22       ` Austin S Hemmelgarn
2014-01-09 18:52         ` Chris Murphy
2014-01-10 17:03           ` Duncan
2014-01-09 18:40   ` Chris Murphy
2014-01-09 19:13     ` Kyle Gates
2014-01-09 19:31       ` Chris Murphy
2014-01-09 23:24         ` George Mitchell
2014-01-10  0:08           ` Clemens Eisserer
2014-01-10  0:46             ` George Mitchell
     [not found] <201401100106.s0A16CNd016476@atl4mhib27.myregisteredsite.com>
2014-01-10  1:31 ` George Mitchell
2014-01-14 19:13   ` Chris Murphy
2014-01-14 19:37     ` Roman Mamedov
2014-01-14 21:05       ` Chris Murphy
2014-01-14 21:19         ` Roman Mamedov
2014-01-14 21:37           ` Chris Murphy
2014-01-14 21:45             ` Chris Murphy
2014-01-14 21:54             ` Roman Mamedov
2014-01-14 20:29     ` George Mitchell
2014-01-14 21:00       ` Roman Mamedov
2014-01-14 21:06         ` Hugo Mills
2014-01-14 21:27           ` Chris Murphy
2014-01-14 21:28         ` George Mitchell
2014-01-14 21:14       ` Chris Murphy
2014-01-14 21:48         ` George Mitchell
2014-01-14 22:14         ` George Mitchell
