* Exactly what is wrong with RAID5/6
@ 2017-06-20 22:57 waxhead
  2017-06-20 23:25 ` Hugo Mills
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: waxhead @ 2017-06-20 22:57 UTC (permalink / raw)
  To: linux-btrfs

I am trying to piece together the actual status of the RAID5/6 bit of BTRFS.
The wiki refers to kernel 3.19, which was released in February 2015, so 
I assume that the information there is a tad outdated (the last update 
on the wiki page was July 2016).
https://btrfs.wiki.kernel.org/index.php/RAID56

Now there are four problems listed

1. Parity may be inconsistent after a crash (the "write hole")
Is this still true? If yes, would this not apply to RAID1 / RAID10 as 
well? How was it solved there, and why can't that be done for RAID5/6?

2. Parity data is not checksummed
Why is this a problem? Does it have to do with the design of BTRFS somehow?
Parity is, after all, just data, and BTRFS does checksum data, so what 
is the reason this is a problem?

3. No support for discard? (possibly -- needs confirmation with cmason)
Does this really matter that much? Is there an update on this?

4. The algorithm uses as many devices as are available: No support for a 
fixed-width stripe.
What is the plan for this one? There were patches on the mailing list by 
the SnapRAID author to support up to 6 parity devices. Will the 
(re?)design of btrfs raid5/6 support a scheme that allows for multiple 
parity devices?

I do have a few other questions as well...

5. BTRFS still (as of kernel 4.9) does not seem to use the device ID to 
communicate with devices.

If, on a multi-device filesystem, you yank out a device, for example 
/dev/sdg, and it reappears as, say, /dev/sdx, btrfs will still happily 
try to write to /dev/sdg even if btrfs fi sh /mnt shows the correct 
device ID. What is the status of getting BTRFS to properly understand 
that a device is missing?

6. RAID1 always needs to be able to make two copies. E.g. if you have 
three disks you can lose one and it should still work. What about 
RAID10? If you have, for example, a 6-disk RAID10 array, lose one disk 
and reboot (due to #5 above), will RAID10 recognize that the array is 
now a 5-disk array and stripe+mirror over 2 disks (or possibly 2.5 
disks?) instead of 3? In other words, will it work as long as it can 
create a RAID10 profile, which requires a minimum of four disks?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-20 22:57 Exactly what is wrong with RAID5/6 waxhead
@ 2017-06-20 23:25 ` Hugo Mills
  2017-06-21  3:48   ` Chris Murphy
  2017-06-21  8:45 ` Qu Wenruo
  2017-06-23 17:25 ` Michał Sokołowski
  2 siblings, 1 reply; 23+ messages in thread
From: Hugo Mills @ 2017-06-20 23:25 UTC (permalink / raw)
  To: waxhead; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5501 bytes --]

On Wed, Jun 21, 2017 at 12:57:19AM +0200, waxhead wrote:
> I am trying to piece together the actual status of the RAID5/6 bit of BTRFS.
> The wiki refer to kernel 3.19 which was released in February 2015 so
> I assume that the information there is a tad outdated (the last
> update on the wiki page was July 2016)
> https://btrfs.wiki.kernel.org/index.php/RAID56
> 
> Now there are four problems listed
> 
> 1. Parity may be inconsistent after a crash (the "write hole")
> Is this still true, if yes - would not this apply for RAID1 / RAID10
> as well? How was it solved there , and why can't that be done for
> RAID5/6

   Yes, it's still true, and it's specific to parity RAID, not the
other RAID levels. The issue is (I think) that if you write one block,
that block is replaced, but then the other blocks in the stripe need
to be read for the parity block to be recalculated, before the new
parity can be written. There's a read-modify-write cycle involved
which isn't inherent for the non-parity RAID levels (which would just
overwrite both copies).
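
   As a purely illustrative sketch (Python, single-byte strips, XOR
parity only; nothing here corresponds to the actual kernel code), the
read-modify-write update and the window it opens look roughly like this:

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A 3-device RAID5 stripe: two data strips and one parity strip.
d0, d1 = b"\xaa", b"\x0f"
parity = xor(d0, d1)                       # consistent stripe

# Overwriting d0 needs a read-modify-write of the parity:
new_d0 = b"\x55"
new_parity = xor(xor(parity, d0), new_d0)  # P' = P ^ D0_old ^ D0_new

# The write hole: if a crash lands after new_d0 is on disk but before
# new_parity is, the surviving stripe is (new_d0, d1, parity) and the
# parity no longer describes the data sitting next to it.
assert xor(new_d0, d1) == new_parity       # what the parity should be
assert xor(new_d0, d1) != parity           # what is actually on disk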

   One of the proposed solutions for dealing with the write hole in
btrfs's parity RAID is to ensure that any new writes are written to a
completely new stripe. The problem is that this introduces a whole new
level of fragmentation if the FS has lots of small writes (because
your write unit is limited to a complete stripe, even for a single
byte update).

   There are probably others here who can explain this better. :)

> 2. Parity data is not checksummed
> Why is this a problem? Does it have to do with the design of BTRFS somehow?
> Parity is after all just data, BTRFS does checksum data so what is
> the reason this is a problem?

   It increases the number of unrecoverable (or not-guaranteed-
recoverable) cases. btrfs's csums are based on individual blocks on
individual devices -- each item of data is independently checksummed
(even if it's a copy of something else). On parity RAID
configurations, if you have a device failure, you've lost a piece of
the parity-protected data. To repair it, you have to recover from n-1
data blocks (which are checksummed), and one parity block (which
isn't). This means that if the parity block happens to have an error
on it, you can't recover cleanly from the device loss, *and you can't
know that an error has happened*.
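
   To make the recovery path concrete, here is a minimal sketch
(Python, single-byte strips, XOR parity; illustrative only, not btrfs's
code): the lost strip is rebuilt as the XOR of the surviving data
strips and the parity, so any silent error in the un-checksummed parity
flows straight into the rebuilt data:

from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data = [b"\x11", b"\x22", b"\x33"]          # three checksummed data strips
parity = reduce(xor, data)                  # P = D0 ^ D1 ^ D2, no csum

# The device holding D1 dies; rebuild it from D0, D2 and P.
rebuilt = reduce(xor, [data[0], data[2], parity])
assert rebuilt == data[1]

# If the stored parity had silently gone bad, the rebuild is wrong,
# and only a csum check on the *reconstructed data* can reveal that.
bad_parity = b"\xff"
assert reduce(xor, [data[0], data[2], bad_parity]) != data[1]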

> 3. No support for discard? (possibly -- needs confirmation with cmason)
> Does this matter that much really?, is there an update on this?
> 
> 4. The algorithm uses as many devices as are available: No support
> for a fixed-width stripe.
> What is the plan for this one? There was patches on the mailing list
> by the SnapRAID author to support up to 6 parity devices. Will the
> (re?) resign of btrfs raid5/6 support a scheme that allows for
> multiple parity devices?

   That's a problem because it limits the practical number of devices
you can use. When the stripe size gets too large, you're having to
read/modify/(re)write every device on an update, even for very small
updates -- as the ratio of read size to update size goes up, the FS
has increasingly bad performance. Your personal limits of what's
acceptable will vary, but I'd be surprised to find anyone with, say,
40 parity RAID devices who finds their performance acceptable. Limit
the stripe width, and you can limit the performance degradation from
lots of devices.

   Even with a limited stripe width, however, you're still looking at
decreasing reliability as the number of devices increases...

   It shouldn't be *massively* hard to implement, but there's a load
of opportunities around managing RAID options in general that would
probably need to be addressed at the same time (e.g. per-subvol RAID
settings, more general RAID parameterisation). It's going to need some
fairly major properties handling, plus rewriting the chunk allocator
and pushing the allocator decisions quite a way up from where they're
currently made.

> I do have a few other questions as well...
> 
> 5. BTRFS does still (kernel 4.9) not seem to use the device ID to
> communicate with devices.
> 
> If you on a multi device filesystem yank out a device, for example
> /dev/sdg and it reappear as /dev/sdx for example btrfs will still
> happily try to write to /dev/sdg even if btrfs fi sh /mnt shows the
> correct device ID. What is the status for getting BTRFS to properly
> understand that a device is missing?

   I don't know about this one.

> 6. RAID1 needs to be able to make two copies always. E.g. if you
> have three disks you can loose one and it should still work. What
> about RAID10 ? If you have for example 6 disk RAID10 array, loose
> one disk and reboots (due to #5 above). Will RAID10 recognize that
> the array now is a 5 disk array and stripe+mirror over 2 disks (or
> possibly 2.5 disks?) instead of 3? In other words, will it work as
> long as it can create a RAID10 profile that requires a minimum of
> four disks?

   Yes. RAID-10 will work on any number of devices (>=4), not just an
even number. Obviously, if you have a 6-device array and lose one,
you'll need to deal with the loss of redundancy -- either add a new
device and rebalance, or replace the missing device with a new one, or
(space permitting) rebalance with existing devices.

   Hugo.

-- 
Hugo Mills             | Let me past! There's been a major scientific
hugo@... carfax.org.uk | break-in!
http://carfax.org.uk/  | Through! Break-through!
PGP: E2AB1DE4          |                                          Ford Prefect

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-20 23:25 ` Hugo Mills
@ 2017-06-21  3:48   ` Chris Murphy
  2017-06-21  6:51     ` Marat Khalili
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2017-06-21  3:48 UTC (permalink / raw)
  To: Hugo Mills, waxhead, Btrfs BTRFS

On Tue, Jun 20, 2017 at 5:25 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Wed, Jun 21, 2017 at 12:57:19AM +0200, waxhead wrote:
>> I am trying to piece together the actual status of the RAID5/6 bit of BTRFS.
>> The wiki refer to kernel 3.19 which was released in February 2015 so
>> I assume that the information there is a tad outdated (the last
>> update on the wiki page was July 2016)
>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>
>> Now there are four problems listed
>>
>> 1. Parity may be inconsistent after a crash (the "write hole")
>> Is this still true, if yes - would not this apply for RAID1 / RAID10
>> as well? How was it solved there , and why can't that be done for
>> RAID5/6
>
>    Yes, it's still true, and it's specific to parity RAID, not the
> other RAID levels. The issue is (I think) that if you write one block,
> that block is replaced, but then the other blocks in the stripe need
> to be read for the parity block to be recalculated, before the new
> parity can be written. There's a read-modify-write cycle involved
> which isn't inherent for the non-parity RAID levels (which would just
> overwrite both copies).

Yeah, there's an LWN article by Neil Brown about how hitting the write
hole is almost impossible in practice. But nevertheless the md devs
implemented a journal to close the write hole.

Also, on Btrfs, while the write hole can manifest on disk, it does get
detected on a subsequent read. That is, a bad reconstruction of data
from parity will not match the data csum, and you'll get EIO and a path
to the bad file.
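
A small sketch of that detection path (Python; zlib.crc32 stands in for
btrfs's crc32c, and the stored csum is assumed to come from the csum
tree -- illustrative only):

import errno
import zlib

def return_reconstructed(reconstructed: bytes, stored_csum: int) -> bytes:
    # A rebuild that doesn't match the extent csum is not handed to
    # the reader; it surfaces as EIO instead.
    if zlib.crc32(reconstructed) != stored_csum:
        raise OSError(errno.EIO, "csum mismatch on reconstructed data")
    return reconstructed

good = b"block contents"
stored = zlib.crc32(good)                  # written at commit time
assert return_reconstructed(good, stored) == good
try:
    return_reconstructed(b"block contentz", stored)   # bad rebuild
except OSError as e:
    assert e.errno == errno.EIO            # detected, not propagated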

What is really not good, though, I think, is metadata raid56. If that
gets hosed, the whole fs is going to face-plant. And we've seen some
evidence of this. So I really think the wiki should make it clearer to
just not use raid56 for metadata.


>
>    One of the proposed solutions for dealing with the write hole in
> btrfs's parity RAID is to ensure that any new writes are written to a
> completely new stripe. The problem is that this introduces a whole new
> level of fragmentation if the FS has lots of small writes (because
> your write unit is limited to a complete stripe, even for a single
> byte update).

Another possibility is to ensure a new write is written to a new *not*
full stripe, i.e. dynamic stripe size. So if the modification is a 50K
file on a 4 disk raid5; instead of writing 3 64K data strips + 1 64K
parity strip (a full stripe write); write out 1 64K data strip + 1 64K
parity strip. In effect, a 4 disk raid5 would quickly get not just 3
data + 1 parity strip Btrfs block groups; but 1 data + 1 parity, and 2
data + 1 parity chunks, and direct those writes to the proper chunk
based on size. Anyway that's beyond my ability to assess how much
allocator work that is. Balance I'd expect to rewrite everything to
max data strips possible; the optimization would only apply to normal
operation COW.
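
A rough sketch of that allocator decision (Python; the 64K strip size,
the 4-disk raid5 and the set of widths are assumptions taken from the
example above, not anything btrfs actually does today):

STRIP = 64 * 1024
DATA_WIDTHS = [1, 2, 3]     # data strips per stripe: 1+1, 2+1, 3+1 chunks

def pick_data_width(write_size: int) -> int:
    # Direct the write to the narrowest block group whose data strips
    # can hold it; only larger writes use the full-width stripes.
    for width in DATA_WIDTHS:
        if write_size <= width * STRIP:
            return width
    return DATA_WIDTHS[-1]

assert pick_data_width(50 * 1024) == 1        # 50K: 1 data + 1 parity
assert pick_data_width(100 * 1024) == 2       # 2 data + 1 parity
assert pick_data_width(10 * 1024 * 1024) == 3 # full-width stripes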

Also, ZFS has a functional equivalent, a variable stripe size for
raid, so it's always doing COW writes for raid56, no RMW.


>
>    There are probably others here who can explain this better. :)
>
>> 2. Parity data is not checksummed
>> Why is this a problem? Does it have to do with the design of BTRFS somehow?
>> Parity is after all just data, BTRFS does checksum data so what is
>> the reason this is a problem?
>
>    It increases the number of unrecoverable (or not-guaranteed-
> recoverable) cases. btrfs's csums are based on individual blocks on
> individual devices -- each item of data is independently checksummed
> (even if it's a copy of something else). On parity RAID
> configurations, if you have a device failure, you've lost a piece of
> the parity-protected data. To repair it, you have to recover from n-1
> data blocks (which are checksummed), and one parity block (which
> isn't). This means that if the parity block happens to have an error
> on it, you can't recover cleanly from the device loss, *and you can't
> know that an error has happened*.

Uhh, no. I've done quite a number of tests, and absolutely, if the
parity is corrupt and you therefore get a bad reconstruction, you
definitely get a csum mismatch and EIO. Corrupt data does not propagate
upward.

The csums are in the csum tree, which is part of the metadata block
groups. If those are raid56 and there's a loss of data, you're now at
pretty high risk, because you can get a bad reconstruction which btrfs
will recognize but be unable to recover from, and the fs should go
read-only. We've seen that on the list.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21  3:48   ` Chris Murphy
@ 2017-06-21  6:51     ` Marat Khalili
  2017-06-21  7:31       ` Peter Grandi
                         ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Marat Khalili @ 2017-06-21  6:51 UTC (permalink / raw)
  To: Btrfs BTRFS

On 21/06/17 06:48, Chris Murphy wrote:
> Another possibility is to ensure a new write is written to a new *not*
> full stripe, i.e. dynamic stripe size. So if the modification is a 50K
> file on a 4 disk raid5; instead of writing 3 64K data strips + 1 64K
> parity strip (a full stripe write); write out 1 64K data strip + 1 64K
> parity strip. In effect, a 4 disk raid5 would quickly get not just 3
> data + 1 parity strip Btrfs block groups; but 1 data + 1 parity, and 2
> data + 1 parity chunks, and direct those write to the proper chunk
> based on size. Anyway that's beyond my ability to assess how much
> allocator work that is. Balance I'd expect to rewrite everything to
> max data strips possible; the optimization would only apply to normal
> operation COW.
This will make some filesystems mostly RAID1, negating all space savings 
of RAID5, won't it?

Isn't it easier to recalculate the parity block using the previous 
state of the two rewritten strips, parity and data? I don't understand 
all the performance implications, but it might scale better with the 
number of devices.

--

With Best Regards,
Marat Khalili

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21  6:51     ` Marat Khalili
@ 2017-06-21  7:31       ` Peter Grandi
  2017-06-21 17:13       ` Andrei Borzenkov
  2017-06-21 18:43       ` Chris Murphy
  2 siblings, 0 replies; 23+ messages in thread
From: Peter Grandi @ 2017-06-21  7:31 UTC (permalink / raw)
  To: Btrfs BTRFS

> [ ... ] This will make some filesystems mostly RAID1, negating
> all space savings of RAID5, won't it? [ ... ]

RAID5/RAID6/... don't merely save space; more precisely, they
trade lower resilience and a more anisotropic and smaller
performance envelope for lower redundancy (= space savings).

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-20 22:57 Exactly what is wrong with RAID5/6 waxhead
  2017-06-20 23:25 ` Hugo Mills
@ 2017-06-21  8:45 ` Qu Wenruo
  2017-06-21 12:43   ` Christoph Anton Mitterer
                     ` (2 more replies)
  2017-06-23 17:25 ` Michał Sokołowski
  2 siblings, 3 replies; 23+ messages in thread
From: Qu Wenruo @ 2017-06-21  8:45 UTC (permalink / raw)
  To: waxhead, linux-btrfs



At 06/21/2017 06:57 AM, waxhead wrote:
> I am trying to piece together the actual status of the RAID5/6 bit of 
> BTRFS.
> The wiki refer to kernel 3.19 which was released in February 2015 so I 
> assume that the information there is a tad outdated (the last update on 
> the wiki page was July 2016)
> https://btrfs.wiki.kernel.org/index.php/RAID56
> 
> Now there are four problems listed
> 
> 1. Parity may be inconsistent after a crash (the "write hole")
> Is this still true, if yes - would not this apply for RAID1 / RAID10 as 
> well? How was it solved there , and why can't that be done for RAID5/6

Unlike the pure stripe method, a fully functional RAID5/6 should be 
written in full-stripe units, each made up of N data stripes and the 
correct P/Q.

Here is an example to show how the write sequence affects the usability 
of RAID5/6.

Existing full stripe:
X = Used space (Extent allocated)
O = Unused space
Data 1   |XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|

When some new extent is allocated in the data 1 stripe, if we write
data directly into that region and then crash,
the result will be:

Data 1   |XXXXXX|XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|

The parity stripe is not updated. That is fine as far as correctness 
goes, since the data itself is still correct, but it reduces 
resilience: in this case, if we lose the device containing the data 2 
stripe, we can't recover the correct data of data 2.

Although personally I don't think it's a big problem yet.

Someone has an idea to modify the extent allocator to handle it, but 
anyway I don't consider it worthwhile.

> 
> 2. Parity data is not checksummed
> Why is this a problem? Does it have to do with the design of BTRFS somehow?
> Parity is after all just data, BTRFS does checksum data so what is the 
> reason this is a problem?

Because that's one proposed solution to the above problem.

And no, parity is not data.

Parity/mirror/stripe handling is done at the btrfs chunk level, which 
presents a nice, easy-to-understand linear logical space to the higher 
levels.

For example:
If, in the btrfs logical space, 0~1G is mapped to a RAID5 chunk with 3 
devices, the higher level only needs to tell the btrfs chunk layer how 
many bytes it wants to read and where the read starts.

If one device is missing, the chunk layer tries to rebuild the data 
using parity, so the higher layers don't need to care what the profile 
is or whether there is parity at all.

So parity can't be addressed from the btrfs logical space; that is to 
say, there is no possible position to record a csum for it in the 
current btrfs design.
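
A toy model of that mapping (Python; 3-device RAID5 chunk, 64K strips, 
and a simple rotating-parity layout chosen for illustration rather than 
btrfs's real one) shows why parity has no logical address to attach a 
csum to:

STRIP = 64 * 1024
NDEV = 3
DATA_PER_STRIPE = NDEV - 1        # RAID5: one parity strip per stripe

def map_logical(offset: int):
    # Logical space only covers the data strips; parity strips are
    # reachable only from inside the chunk layer itself.
    stripe_nr, off_in_stripe = divmod(offset, DATA_PER_STRIPE * STRIP)
    data_index, off_in_strip = divmod(off_in_stripe, STRIP)
    parity_dev = (NDEV - 1 - stripe_nr) % NDEV      # rotating parity
    data_devs = [d for d in range(NDEV) if d != parity_dev]
    return data_devs[data_index], stripe_nr * STRIP + off_in_strip

print(map_logical(0))             # (device, physical offset) of logical 0
print(map_logical(130 * 1024))    # lands in the second full stripe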



> 
> 3. No support for discard? (possibly -- needs confirmation with cmason)
> Does this matter that much really?, is there an update on this?

I'm not familiar with this one, though.

> 
> 4. The algorithm uses as many devices as are available: No support for a 
> fixed-width stripe.
> What is the plan for this one? There was patches on the mailing list by 
> the SnapRAID author to support up to 6 parity devices. Will the (re?) 
> resign of btrfs raid5/6 support a scheme that allows for multiple parity 
> devices?

Considering that the current maintainers seem to be focusing on bug 
fixes, not new features, I'm not confident about such a new feature.

> 
> I do have a few other questions as well...
> 
> 5. BTRFS does still (kernel 4.9) not seem to use the device ID to 
> communicate with devices.

Btrfs always uses the device ID to build up its device mapping.
And for any multi-device implementation (LVM, mdadm) it's never a good 
idea to use the device path.

> 
> If you on a multi device filesystem yank out a device, for example 
> /dev/sdg and it reappear as /dev/sdx for example btrfs will still 
> happily try to write to /dev/sdg even if btrfs fi sh /mnt shows the 
> correct device ID. What is the status for getting BTRFS to properly 
> understand that a device is missing?

It's that btrfs doesn't support a runtime switch from a missing device 
to a re-appeared device.

Most missing-device detection is done at btrfs device scan time.
Runtime detection is not that good yet, but Anand Jain is introducing 
some nice infrastructure as a basis to enhance it.

> 
> 6. RAID1 needs to be able to make two copies always. E.g. if you have 
> three disks you can loose one and it should still work. What about 
> RAID10 ? If you have for example 6 disk RAID10 array, loose one disk and 
> reboots (due to #5 above). Will RAID10 recognize that the array now is a 
> 5 disk array and stripe+mirror over 2 disks (or possibly 2.5 disks?) 
> instead of 3? In other words, will it work as long as it can create a 
> RAID10 profile that requires a minimum of four disks?

At least after a reboot, btrfs still knows it's a fs on 6 disks; 
although one is lost, it will still create new chunks using all 6 disks.

Thanks,
Qu




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21  8:45 ` Qu Wenruo
@ 2017-06-21 12:43   ` Christoph Anton Mitterer
  2017-06-21 13:41     ` Austin S. Hemmelgarn
  2017-06-21 17:03   ` Goffredo Baroncelli
  2017-06-21 18:24   ` Chris Murphy
  2 siblings, 1 reply; 23+ messages in thread
From: Christoph Anton Mitterer @ 2017-06-21 12:43 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 421 bytes --]

On Wed, 2017-06-21 at 16:45 +0800, Qu Wenruo wrote:
> Btrfs is always using device ID to build up its device mapping.
> And for any multi-device implementation (LVM, mdadm) it's never a
> good 
> idea to use device path.

Isn't it rather the other way round? Using the ID is bad? Don't you
remember our discussion about using leaked UUIDs (or accidental
collisions) for all kinds of attacks?


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21 12:43   ` Christoph Anton Mitterer
@ 2017-06-21 13:41     ` Austin S. Hemmelgarn
  2017-06-21 17:20       ` Andrei Borzenkov
  0 siblings, 1 reply; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2017-06-21 13:41 UTC (permalink / raw)
  To: Christoph Anton Mitterer, Qu Wenruo, linux-btrfs

On 2017-06-21 08:43, Christoph Anton Mitterer wrote:
> On Wed, 2017-06-21 at 16:45 +0800, Qu Wenruo wrote:
>> Btrfs is always using device ID to build up its device mapping.
>> And for any multi-device implementation (LVM, mdadm) it's never a
>> good
>> idea to use device path.
> 
> Isn't it rather the other way round? Using the ID is bad? Don't you
> remember our discussion about using leaked UUIDs (or accidental
> collisions) for all kinds of attacks?
Both are bad for different reasons.  For the particular case of sanely 
handling transient storage failures (device disappears then reappears), 
you can't do it with a path in /dev (which is what most people mean when 
they say device path), and depending on how the hardware failed and the 
specifics of the firmware, you may not be able to do it with a 
hardware-level device path, but you can do it with a device ID assuming 
you sanely verify the ID.  Right now, BTRFS is not sanely checking the 
ID (it only verifies the UUIDs in the FS itself; it should also be 
checking hardware-level identifiers like WWN).
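
A minimal sketch of what "sanely verifying the ID" could look like 
(Python; the fields and values are made up for illustration, this is 
not an existing btrfs check):

from dataclasses import dataclass

@dataclass(frozen=True)
class DeviceIdentity:
    fs_uuid: str      # filesystem-level identity
    devid: int        # btrfs device ID within the filesystem
    wwn: str          # hardware-level identifier (e.g. the WWN)

def same_device(recorded: DeviceIdentity, observed: DeviceIdentity) -> bool:
    # Accept a reappearing device only if both the filesystem-level
    # and the hardware-level identity match what was last recorded.
    return (recorded.fs_uuid == observed.fs_uuid
            and recorded.devid == observed.devid
            and recorded.wwn == observed.wwn)

recorded = DeviceIdentity("c0ffee00-0000-4000-8000-000000000001", 3,
                          "0x5000c500a1b2c3d4")
lookalike = DeviceIdentity(recorded.fs_uuid, recorded.devid,
                           "0x5000c500deadbeef")  # cloned UUID, other disk
assert not same_device(recorded, lookalike)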

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21  8:45 ` Qu Wenruo
  2017-06-21 12:43   ` Christoph Anton Mitterer
@ 2017-06-21 17:03   ` Goffredo Baroncelli
  2017-06-22  2:05     ` Qu Wenruo
  2017-06-21 18:24   ` Chris Murphy
  2 siblings, 1 reply; 23+ messages in thread
From: Goffredo Baroncelli @ 2017-06-21 17:03 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: waxhead, linux-btrfs

Hi Qu,

On 2017-06-21 10:45, Qu Wenruo wrote:
> At 06/21/2017 06:57 AM, waxhead wrote:
>> I am trying to piece together the actual status of the RAID5/6 bit of BTRFS.
>> The wiki refer to kernel 3.19 which was released in February 2015 so I assume
>> that the information there is a tad outdated (the last update on the wiki page was July 2016)
>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>
>> Now there are four problems listed
>>
>> 1. Parity may be inconsistent after a crash (the "write hole")
>> Is this still true, if yes - would not this apply for RAID1 / 
>> RAID10 as well? How was it solved there , and why can't that be done for RAID5/6
> 
> Unlike pure stripe method, one fully functional RAID5/6 should be written in full stripe behavior,
>  which is made up by N data stripes and correct P/Q.
> 
> Given one example to show how write sequence affects the usability of RAID5/6.
> 
> Existing full stripe:
> X = Used space (Extent allocated)
> O = Unused space
> Data 1   |XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
> 
> When some new extent is allocated to data 1 stripe, if we write
> data directly into that region, and crashed.
> The result will be:
> 
> Data 1   |XXXXXX|XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
> 
> Parity stripe is not updated, although it's fine since data is still correct, this reduces the 
> usability, as in this case, if we lost device containing data 2 stripe, we can't 
> recover correct data of data 2.
> 
> Although personally I don't think it's a big problem yet.
> 
> Someone has idea to modify extent allocator to handle it, but anyway I don't consider it's worthy.
> 
>>
>> 2. Parity data is not checksummed
>> Why is this a problem? Does it have to do with the design of BTRFS somehow?
>> Parity is after all just data, BTRFS does checksum data so what is the reason this is a problem?
> 
> Because that's one solution to solve above problem.

In what way could that be a solution for the write hole? If the parity 
is wrong AND you lose a disk, then even with a checksum of the parity 
you are in no position to rebuild the missing data. And if you rebuild 
wrong data, the data checksum will highlight it anyway. So adding a 
checksum to the parity should not solve any issue.

A possible "mitigation" is to track in an "intent log" all the 
non-full-stripe writes during a transaction. If a power failure aborts 
a transaction, on the next mount a scrub process is started to correct 
the parities only in the stripes tracked before.

A full solution is to journal all the non-full-stripe writes, as MD does.
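
For illustration, a minimal sketch of such an intent log (Python; 
entirely hypothetical, nothing here is an existing btrfs structure):

class IntentLog:
    """Tracks stripes with a partial-stripe (RMW) write in flight."""

    def __init__(self):
        self.pending = set()

    def record(self, stripe_nr: int):
        self.pending.add(stripe_nr)      # must be persisted before the RMW

    def clear(self, stripe_nr: int):
        self.pending.discard(stripe_nr)  # once data AND parity are on disk

    def stripes_to_scrub(self):
        return sorted(self.pending)      # consulted on the next mount

log = IntentLog()
log.record(42)                   # partial-stripe write begins...
# -- power failure here --
assert log.stripes_to_scrub() == [42]    # only stripe 42 needs a scrub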


BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21  6:51     ` Marat Khalili
  2017-06-21  7:31       ` Peter Grandi
@ 2017-06-21 17:13       ` Andrei Borzenkov
  2017-06-21 18:43       ` Chris Murphy
  2 siblings, 0 replies; 23+ messages in thread
From: Andrei Borzenkov @ 2017-06-21 17:13 UTC (permalink / raw)
  To: Marat Khalili, Btrfs BTRFS

21.06.2017 09:51, Marat Khalili wrote:
> On 21/06/17 06:48, Chris Murphy wrote:
>> Another possibility is to ensure a new write is written to a new *not*
>> full stripe, i.e. dynamic stripe size. So if the modification is a 50K
>> file on a 4 disk raid5; instead of writing 3 64K data strips + 1 64K
>> parity strip (a full stripe write); write out 1 64K data strip + 1 64K
>> parity strip. In effect, a 4 disk raid5 would quickly get not just 3
>> data + 1 parity strip Btrfs block groups; but 1 data + 1 parity, and 2
>> data + 1 parity chunks, and direct those write to the proper chunk
>> based on size. Anyway that's beyond my ability to assess how much
>> allocator work that is. Balance I'd expect to rewrite everything to
>> max data strips possible; the optimization would only apply to normal
>> operation COW.
> This will make some filesystems mostly RAID1, negating all space savings
> of RAID5, won't it?
> 
> Isn't it easier to recalculate parity block based using previous state
> of two rewritten strips, parity and data? I don't understand all
> performance implications, but it might scale better with number of devices.
> 

That's what it effectively does today; the problem is that the RAID[56]
layer is below the btrfs allocator, so the same stripe may be shared by
different transactions. This defeats the very idea of redirect-on-write,
where data on disk is assumed never to be changed by subsequent
modifications.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21 13:41     ` Austin S. Hemmelgarn
@ 2017-06-21 17:20       ` Andrei Borzenkov
  2017-06-21 17:30         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 23+ messages in thread
From: Andrei Borzenkov @ 2017-06-21 17:20 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Christoph Anton Mitterer, Qu Wenruo, linux-btrfs

21.06.2017 16:41, Austin S. Hemmelgarn wrote:
> On 2017-06-21 08:43, Christoph Anton Mitterer wrote:
>> On Wed, 2017-06-21 at 16:45 +0800, Qu Wenruo wrote:
>>> Btrfs is always using device ID to build up its device mapping.
>>> And for any multi-device implementation (LVM, mdadm) it's never a
>>> good
>>> idea to use device path.
>>
>> Isn't it rather the other way round? Using the ID is bad? Don't you
>> remember our discussion about using leaked UUIDs (or accidental
>> collisions) for all kinds of attacks?
> Both are bad for different reasons.  For the particular case of sanely
> handling transient storage failures (device disappears then reappears),
> you can't do it with a path in /dev (which is what most people mean when
> they say device path), and depending on how the hardware failed and the
> specifics of the firmware, you may not be able to do it with a
> hardware-level device path, but you can do it with a device ID assuming
> you sanely verify the ID.  Right now, BTRFS is not sanely checking the
> ID (it only verifies the UUID's in the FS itself, it should also be
> checking hardware-level identifiers like WWN).

Which is not enough either; if a device dropped off the array and
reappeared later, we need to be able to declare it stale, even if it has
exactly the same UUID and WWN and whatever other hardware identifier is
used. So we need some generation number to be able to do that.
Incidentally, MD does have one and compares generation numbers to decide
whether a device can be assimilated.
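
A minimal sketch of that staleness test (Python; the field names are
made up, this is not btrfs's on-disk format):

def can_assimilate(array_generation: int, device_generation: int) -> bool:
    # A member that missed even one committed generation is stale: it
    # may describe trees the rest of the array has moved past, so it
    # needs a re-sync before it can be trusted again, even if every
    # UUID/WWN-style identifier matches.
    return device_generation >= array_generation

assert can_assimilate(array_generation=1042, device_generation=1042)
assert not can_assimilate(array_generation=1042, device_generation=987)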

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21 17:20       ` Andrei Borzenkov
@ 2017-06-21 17:30         ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2017-06-21 17:30 UTC (permalink / raw)
  To: Andrei Borzenkov, Christoph Anton Mitterer, Qu Wenruo, linux-btrfs

On 2017-06-21 13:20, Andrei Borzenkov wrote:
> 21.06.2017 16:41, Austin S. Hemmelgarn wrote:
>> On 2017-06-21 08:43, Christoph Anton Mitterer wrote:
>>> On Wed, 2017-06-21 at 16:45 +0800, Qu Wenruo wrote:
>>>> Btrfs is always using device ID to build up its device mapping.
>>>> And for any multi-device implementation (LVM, mdadm) it's never a
>>>> good
>>>> idea to use device path.
>>>
>>> Isn't it rather the other way round? Using the ID is bad? Don't you
>>> remember our discussion about using leaked UUIDs (or accidental
>>> collisions) for all kinds of attacks?
>> Both are bad for different reasons.  For the particular case of sanely
>> handling transient storage failures (device disappears then reappears),
>> you can't do it with a path in /dev (which is what most people mean when
>> they say device path), and depending on how the hardware failed and the
>> specifics of the firmware, you may not be able to do it with a
>> hardware-level device path, but you can do it with a device ID assuming
>> you sanely verify the ID.  Right now, BTRFS is not sanely checking the
>> ID (it only verifies the UUID's in the FS itself, it should also be
>> checking hardware-level identifiers like WWN).
> 
> Which is not enough too; if device dropped off array and reappeared
> later we need to be able to declare it stale, even if it has exactly the
> same UUID and WWN and whatever hardware identifier is used. So we need
> some generation number to be able to do it. Incidentally MD does have
> them and compares generation numbers to decide whether device can be
> assimilated.
> 
I was not disputing that aspect, just the method of verifying that the 
device that reappeared is the same one that disappeared.  Beyond the 
requirement to properly re-sync (we would also need to do some kind of 
sanity check on the generation number too, otherwise we end up with the 
possibility of a partial write there nuking the whole FS when the device 
reconnects), verifying some level of hardware identification covers the 
security and data-safety issues that Christoph is referring to 
sufficiently for the common cases (the biggest being USB-attached 
devices with BTRFS volumes on them).

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21  8:45 ` Qu Wenruo
  2017-06-21 12:43   ` Christoph Anton Mitterer
  2017-06-21 17:03   ` Goffredo Baroncelli
@ 2017-06-21 18:24   ` Chris Murphy
  2017-06-21 20:12     ` Goffredo Baroncelli
  2017-06-22  2:12     ` Qu Wenruo
  2 siblings, 2 replies; 23+ messages in thread
From: Chris Murphy @ 2017-06-21 18:24 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: waxhead, Btrfs BTRFS

On Wed, Jun 21, 2017 at 2:45 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:

> Unlike pure stripe method, one fully functional RAID5/6 should be written in
> full stripe behavior, which is made up by N data stripes and correct P/Q.
>
> Given one example to show how write sequence affects the usability of
> RAID5/6.
>
> Existing full stripe:
> X = Used space (Extent allocated)
> O = Unused space
> Data 1   |XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>
> When some new extent is allocated to data 1 stripe, if we write
> data directly into that region, and crashed.
> The result will be:
>
> Data 1   |XXXXXX|XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>
> Parity stripe is not updated, although it's fine since data is still
> correct, this reduces the usability, as in this case, if we lost device
> containing data 2 stripe, we can't recover correct data of data 2.
>
> Although personally I don't think it's a big problem yet.
>
> Someone has idea to modify extent allocator to handle it, but anyway I don't
> consider it's worthy.


If there is parity corruption and there is a lost device (or a bad
sector causing a lost data strip), that is in effect two failures, and
no raid5 recovers from that; you have to have raid6. However, I don't
know whether Btrfs raid6 can even recover from it. If there is a single
device failure, with a missing data strip, you have both P and Q.
Typically raid6 implementations use P first, and only use Q if P is not
available. Is Btrfs raid6 the same? And if reconstruction from P fails
to match the data csum, does Btrfs retry using Q? Probably not, is my
guess.
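
A sketch of the retry loop being asked about (Python; the P- and
Q-based rebuilds are just passed in as candidate byte strings, the real
Galois-field math is omitted, and zlib.crc32 stands in for crc32c):

import errno
import zlib

def recover(candidates, stored_csum: int) -> bytes:
    # Try each reconstruction (e.g. "from P", then "from Q") and accept
    # the first one whose csum matches; give up with EIO otherwise.
    for attempt in candidates:
        if zlib.crc32(attempt) == stored_csum:
            return attempt
    raise OSError(errno.EIO, "no reconstruction matched the data csum")

original = b"lost data strip"
stored = zlib.crc32(original)
from_p = b"garbage from a corrupt P strip"   # bad parity -> bad rebuild
from_q = original                            # Q-based rebuild is good
assert recover([from_p, from_q], stored) == original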

I think that is a valid problem calling for a solution on Btrfs, given
its mandate. It is no worse than other raid6 implementations, though,
which would reconstruct from bad P and give no warning, leaving it up
to application layers to deal with the problem.

I have no idea how ZFS RAIDZ2 and RAIDZ3 handle this same scenario.



>
>>
>> 2. Parity data is not checksummed
>> Why is this a problem? Does it have to do with the design of BTRFS
>> somehow?
>> Parity is after all just data, BTRFS does checksum data so what is the
>> reason this is a problem?
>
>
> Because that's one solution to solve above problem.
>
> And no, parity is not data.

A parity strip is differentiated from a data strip, and by itself parity
is meaningless. But parity plus n-1 data strips is an encoded form of
the missing data strip, and is therefore an encoded copy of the data.
We kind of have to treat the parity as fractionally important compared
to data, just like each mirror copy has some fractional value. You
don't have to have both of them, but you do have to have at least one
of them.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21  6:51     ` Marat Khalili
  2017-06-21  7:31       ` Peter Grandi
  2017-06-21 17:13       ` Andrei Borzenkov
@ 2017-06-21 18:43       ` Chris Murphy
  2 siblings, 0 replies; 23+ messages in thread
From: Chris Murphy @ 2017-06-21 18:43 UTC (permalink / raw)
  To: Marat Khalili; +Cc: Btrfs BTRFS

On Wed, Jun 21, 2017 at 12:51 AM, Marat Khalili <mkh@rqc.ru> wrote:
> On 21/06/17 06:48, Chris Murphy wrote:
>>
>> Another possibility is to ensure a new write is written to a new *not*
>> full stripe, i.e. dynamic stripe size. So if the modification is a 50K
>> file on a 4 disk raid5; instead of writing 3 64K data strips + 1 64K
>> parity strip (a full stripe write); write out 1 64K data strip + 1 64K
>> parity strip. In effect, a 4 disk raid5 would quickly get not just 3
>> data + 1 parity strip Btrfs block groups; but 1 data + 1 parity, and 2
>> data + 1 parity chunks, and direct those write to the proper chunk
>> based on size. Anyway that's beyond my ability to assess how much
>> allocator work that is. Balance I'd expect to rewrite everything to
>> max data strips possible; the optimization would only apply to normal
>> operation COW..

> This will make some filesystems mostly RAID1, negating all space savings of
> RAID5, won't it?

No. It'd only apply to partial stripe writes, typically small files.
But small-file, metadata-centric workloads suck for raid5 anyway, and
should use raid1. So making the implementation more like raid1 than
raid5 for the RMW case is, I think, still better than Btrfs raid56 RMW
writes in effect being no-COW.


> Isn't it easier to recalculate parity block based using previous state of
> two rewritten strips, parity and data? I don't understand all performance
> implications, but it might scale better with number of devices.

The problem is atomicity. Either the data strip or the parity strip is
overwritten first, and before the other is committed, the file system
is not merely inconsistent, it's basically lying: there's no way to
know for sure after the fact whether the data or the parity was properly
written. And even the metadata is inconsistent, because it can only
describe the unmodified state and the successfully modified state,
whereas a third state, "partially modified", is possible, with no way to
really fix it.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21 18:24   ` Chris Murphy
@ 2017-06-21 20:12     ` Goffredo Baroncelli
  2017-06-21 23:19       ` Chris Murphy
  2017-06-22  2:12     ` Qu Wenruo
  1 sibling, 1 reply; 23+ messages in thread
From: Goffredo Baroncelli @ 2017-06-21 20:12 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, waxhead, Btrfs BTRFS

On 2017-06-21 20:24, Chris Murphy wrote:
> On Wed, Jun 21, 2017 at 2:45 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> 
>> Unlike pure stripe method, one fully functional RAID5/6 should be written in
>> full stripe behavior, which is made up by N data stripes and correct P/Q.
>>
>> Given one example to show how write sequence affects the usability of
>> RAID5/6.
>>
>> Existing full stripe:
>> X = Used space (Extent allocated)
>> O = Unused space
>> Data 1   |XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>>
>> When some new extent is allocated to data 1 stripe, if we write
>> data directly into that region, and crashed.
>> The result will be:
>>
>> Data 1   |XXXXXX|XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>>
>> Parity stripe is not updated, although it's fine since data is still
>> correct, this reduces the usability, as in this case, if we lost device
>> containing data 2 stripe, we can't recover correct data of data 2.
>>
>> Although personally I don't think it's a big problem yet.
>>
>> Someone has idea to modify extent allocator to handle it, but anyway I don't
>> consider it's worthy.
> 
> 
> If there is parity corruption and there is a lost device (or bad
> sector causing lost data strip), that is in effect two failures and no
> raid5 recovers, you have to have raid6. 

Generally speaking, when you write "two failures" this means two failures at the same time. But the write hole hurts even if these two failures are not at the same time:

Event #1: power failure between the data stripe write and the parity stripe write. The stripe is incoherent.
Event #2: a disk fails: if you try to rebuild the data from the remaining data and the parity, you get wrong data.

The likelihood of these two events happening at the same time (a power failure and, on the next boot, a failing disk) is quite low. But over the life of a filesystem, these two events will likely both happen.

However, BTRFS has an advantage: a simple scrub may (fingers crossed) recover from event #1.

> However, I don't know whether
> Btrfs raid6 can even recover from it? If there is a single device
> failure, with a missing data strip, you have both P&Q. Typically raid6
> implementations use P first, and only use Q if P is not available. Is
> Btrfs raid6 the same? And if reconstruction from P fails to match data
> csum, does Btrfs retry using Q? Probably not is my guess.

It could, and in any case it is only an "implementation detail" :-)
> 
> I think that is a valid problem calling for a solution on Btrfs, given
> its mandate. It is no worse than other raid6 implementations though
> which would reconstruct from bad P, and give no warning, leaving it up
> to application layers to deal with the problem.
> 
> I have no idea how ZFS RAIDZ2 and RAIDZ3 handle this same scenario.

If I understood correctly, ZFS has a variable stripe size. In BTRFS this could be implemented fairly easily: it would be sufficient to have different block groups with different numbers of disks.

If a filesystem is composed of 5 disks, it will contain:

1 BG RAID1 for writes up to 64k
1 BG RAID5 (3 disks) for writes up to 128k
1 BG RAID5 (4 disks) for writes up to 192k
1 BG RAID5 (5 disks) for all larger writes

From time to time the filesystem would need a re-balance in order to empty the smaller block groups.


Another option could be to track the stripes involved in an RMW cycle (i.e. all the writes smaller than a stripe, which in a COW filesystem are supposed to be few) in an "intent log", and scrub all these stripes if a power failure happens.




> 
> 
> 
>>
>>>
>>> 2. Parity data is not checksummed
>>> Why is this a problem? Does it have to do with the design of BTRFS
>>> somehow?
>>> Parity is after all just data, BTRFS does checksum data so what is the
>>> reason this is a problem?
>>
>>
>> Because that's one solution to solve above problem.
>>
>> And no, parity is not data.
> 
> Parity strip is differentiated from data strip, and by itself parity
> is meaningless. But parity plus n-1 data strips is an encoded form of
> the missing data strip, and is therefore an encoded copy of the data.
> We kinda have to treat the parity as fractionally important compared
> to data; just like each mirror copy has some fractional value. You
> don't have to have both of them, but you do have to have at least one
> of them.
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21 20:12     ` Goffredo Baroncelli
@ 2017-06-21 23:19       ` Chris Murphy
  0 siblings, 0 replies; 23+ messages in thread
From: Chris Murphy @ 2017-06-21 23:19 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: Chris Murphy, Qu Wenruo, waxhead, Btrfs BTRFS

On Wed, Jun 21, 2017 at 2:12 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:


>
> Generally speaking, when you write "two failure" this means two failure at the same time. But the write hole happens even if these two failures are not at the same time:
>
> Event #1: power failure between the data stripe write and the parity stripe write. The stripe is incoherent.
> Event #2: a disk is failing: if you try to read the data from the remaining data and the parity you have wrong data.
>
> The likelihood of these two event at the same time (power failure and  in the next boot a disk is failing) is quite low. But in the life of a filesystem, these two event likely happens.
>
> However BTRFS has an advantage: a simple scrub may (crossing finger) recover from event #1.

Event #3: the stripe is read, missing a data strip due to event #2,
and is wrongly reconstructed due to event #1; Btrfs computes crc32c on
the reconstructed data and compares it to the extent csum, which then
fails, and EIO happens.

Btrfs is susceptible to the write hole happening on disk. But it's
still detected and corrupt data isn't propagated upward.




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21 17:03   ` Goffredo Baroncelli
@ 2017-06-22  2:05     ` Qu Wenruo
  0 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2017-06-22  2:05 UTC (permalink / raw)
  To: kreijack; +Cc: waxhead, linux-btrfs



At 06/22/2017 01:03 AM, Goffredo Baroncelli wrote:
> Hi Qu,
> 
> On 2017-06-21 10:45, Qu Wenruo wrote:
>> At 06/21/2017 06:57 AM, waxhead wrote:
>>> I am trying to piece together the actual status of the RAID5/6 bit of BTRFS.
>>> The wiki refer to kernel 3.19 which was released in February 2015 so I assume
>>> that the information there is a tad outdated (the last update on the wiki page was July 2016)
>>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>>
>>> Now there are four problems listed
>>>
>>> 1. Parity may be inconsistent after a crash (the "write hole")
>>> Is this still true, if yes - would not this apply for RAID1 /
>>> RAID10 as well? How was it solved there , and why can't that be done for RAID5/6
>>
>> Unlike pure stripe method, one fully functional RAID5/6 should be written in full stripe behavior,
>>   which is made up by N data stripes and correct P/Q.
>>
>> Given one example to show how write sequence affects the usability of RAID5/6.
>>
>> Existing full stripe:
>> X = Used space (Extent allocated)
>> O = Unused space
>> Data 1   |XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>>
>> When some new extent is allocated to data 1 stripe, if we write
>> data directly into that region, and crashed.
>> The result will be:
>>
>> Data 1   |XXXXXX|XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>>
>> Parity stripe is not updated, although it's fine since data is still correct, this reduces the
>> usability, as in this case, if we lost device containing data 2 stripe, we can't
>> recover correct data of data 2.
>>
>> Although personally I don't think it's a big problem yet.
>>
>> Someone has idea to modify extent allocator to handle it, but anyway I don't consider it's worthy.
>>
>>>
>>> 2. Parity data is not checksummed
>>> Why is this a problem? Does it have to do with the design of BTRFS somehow?
>>> Parity is after all just data, BTRFS does checksum data so what is the reason this is a problem?
>>
>> Because that's one solution to solve above problem.
> 
> In what it could be a solution for the write hole ?

Not my idea, so I don't know why this is considered a solution either.

I prefer to lower the priority of such a case, as we have more work to do.

Thanks,
Qu

> If a parity is wrong AND you lost a disk, even having a checksum of the parity, you are not in position to rebuild the missing data. And if you rebuild wrong data, anyway the checksum highlights it. So adding the checksum to the parity should not solve any issue.
> 
> A possible "mitigation", is to track in a "intent log" all the not "full stripe writes" during a transaction. If a power failure aborts a transaction, in the next mount a scrub process is started to correct the parities only in the stripes tracked before.
> 
> A solution, is to journal all the not "full stripe writes", as MD does.
> 
> 
> BR
> G.Baroncelli
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-21 18:24   ` Chris Murphy
  2017-06-21 20:12     ` Goffredo Baroncelli
@ 2017-06-22  2:12     ` Qu Wenruo
  2017-06-22  2:43       ` Chris Murphy
  2017-06-22  5:15       ` Goffredo Baroncelli
  1 sibling, 2 replies; 23+ messages in thread
From: Qu Wenruo @ 2017-06-22  2:12 UTC (permalink / raw)
  To: Chris Murphy; +Cc: waxhead, Btrfs BTRFS



At 06/22/2017 02:24 AM, Chris Murphy wrote:
> On Wed, Jun 21, 2017 at 2:45 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> 
>> Unlike pure stripe method, one fully functional RAID5/6 should be written in
>> full stripe behavior, which is made up by N data stripes and correct P/Q.
>>
>> Given one example to show how write sequence affects the usability of
>> RAID5/6.
>>
>> Existing full stripe:
>> X = Used space (Extent allocated)
>> O = Unused space
>> Data 1   |XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>>
>> When some new extent is allocated to data 1 stripe, if we write
>> data directly into that region, and crashed.
>> The result will be:
>>
>> Data 1   |XXXXXX|XXXXXX|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Data 2   |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO|
>> Parity   |WWWWWW|ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ|
>>
>> Parity stripe is not updated, although it's fine since data is still
>> correct, this reduces the usability, as in this case, if we lost device
>> containing data 2 stripe, we can't recover correct data of data 2.
>>
>> Although personally I don't think it's a big problem yet.
>>
>> Someone has idea to modify extent allocator to handle it, but anyway I don't
>> consider it's worthy.
> 
> 
> If there is parity corruption and there is a lost device (or bad
> sector causing lost data strip), that is in effect two failures and no
> raid5 recovers, you have to have raid6. However, I don't know whether
> Btrfs raid6 can even recover from it? If there is a single device
> failure, with a missing data strip, you have both P&Q. Typically raid6
> implementations use P first, and only use Q if P is not available. Is
> Btrfs raid6 the same? And if reconstruction from P fails to match data
> csum, does Btrfs retry using Q? Probably not is my guess.

Well, in fact, thanks to data csum and btrfs metadata CoW, there is 
quite a high chance that we won't cause any data damage.

For the example I gave above, no data damage at all.

First the data is written and then the power is lost; data is always 
written before metadata, so after the power loss the superblock is 
still using the old tree roots.

So no one is really referencing that newly written data.
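
A much-simplified sketch of that ordering (Python; real commits involve 
barriers, multiple trees and superblock copies, none of which is 
modelled here):

class Disk:
    def __init__(self):
        self.superblock_root = "root_A"   # last committed tree root
        self.blocks = {}

    def write(self, addr, payload):
        self.blocks[addr] = payload

def commit(disk: Disk, new_data: dict, crash_before_super: bool):
    for addr, payload in new_data.items():
        disk.write(addr, payload)                  # 1. data, to new locations
    disk.write("root_B", "CoW metadata -> new data")  # 2. new tree root
    if crash_before_super:
        return                                     # crash: root_A still rules
    disk.superblock_root = "root_B"                # 3. the actual commit point

disk = Disk()
commit(disk, {"new_extent": b"freshly written"}, crash_before_super=True)
assert disk.superblock_root == "root_A"   # nothing references the new extent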

And in that case, even if the device of data stripe 2 is missing, btrfs 
doesn't really need to use parity to rebuild it, as btrfs knows there is 
no extent in that stripe, and the data csum matches for data stripe 1.
No need to use parity at all.

So that's why I think the write hole is not an urgent case to handle 
right now.

Thanks,
Qu
> 
> I think that is a valid problem calling for a solution on Btrfs, given
> its mandate. It is no worse than other raid6 implementations though
> which would reconstruct from bad P, and give no warning, leaving it up
> to application layers to deal with the problem.
> 
> I have no idea how ZFS RAIDZ2 and RAIDZ3 handle this same scenario.
> 
> 
> 
>>
>>>
>>> 2. Parity data is not checksummed
>>> Why is this a problem? Does it have to do with the design of BTRFS
>>> somehow?
>>> Parity is after all just data, BTRFS does checksum data so what is the
>>> reason this is a problem?
>>
>>
>> Because that's one solution to solve above problem.
>>
>> And no, parity is not data.
> 
> Parity strip is differentiated from data strip, and by itself parity
> is meaningless. But parity plus n-1 data strips is an encoded form of
> the missing data strip, and is therefore an encoded copy of the data.
> We kinda have to treat the parity as fractionally important compared
> to data; just like each mirror copy has some fractional value. You
> don't have to have both of them, but you do have to have at least one
> of them.
> 
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-22  2:12     ` Qu Wenruo
@ 2017-06-22  2:43       ` Chris Murphy
  2017-06-22  3:55         ` Qu Wenruo
  2017-06-22  5:15       ` Goffredo Baroncelli
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2017-06-22  2:43 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Murphy, waxhead, Btrfs BTRFS

On Wed, Jun 21, 2017 at 8:12 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:

>
> Well, in fact, thanks to data csum and btrfs metadata CoW, there is quite a
> high chance that we won't cause any data damage.

But we have examples where data is not COWed and we see a partial
stripe overwrite. And if that is interrupted, it's clear that both the
old and the new metadata pointing to that stripe are wrong. There are
way more cases where we see csum errors on Btrfs raid56 after crashes,
and there are no bad devices.



>
> For the example I gave above, no data damage at all.
>
> First the data is written and power loss, and data is always written before
> metadata, so that's to say, after power loss, superblock is still using the
> old tree roots.
>
> So no one is really using that newly written data.

OK, but that assumes that the newly written data is always COWed, which
on Btrfs raid56 is not certain; there's a bunch of RMW code which
suggests overwrites are possible.

And for raid56 metadata it suggests RMW could happen for metadata also.

There's fairly strong anecdotal evidence that people have fewer
problems with Btrfs raid5 when raid5 applies to data block groups and
metadata block groups use some other non-parity-based profile like
raid1.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-22  2:43       ` Chris Murphy
@ 2017-06-22  3:55         ` Qu Wenruo
  0 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2017-06-22  3:55 UTC (permalink / raw)
  To: Chris Murphy; +Cc: waxhead, Btrfs BTRFS



At 06/22/2017 10:43 AM, Chris Murphy wrote:
> On Wed, Jun 21, 2017 at 8:12 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> 
>>
>> Well, in fact, thanks to data csum and btrfs metadata CoW, there is quite a
>> high chance that we won't cause any data damage.
> 
> But we have examples where data does not COW, we see a partial stripe
> overwrite. And if that is interrupted it's clear that both old and new
> metadata pointing to that stripe is wrong. There are way more problems
> where we see csum errors on Btrfs raid56 after crashes, and there are
> no bad devices.

First, if it's interrupted, there is no new metadata, as metadata is 
always updated after data.

And metadata is always updated CoW, so if the data write is interrupted, 
we are still at the previous transaction.


And in that case, no COW means no csum.
Btrfs won't check correctness due to the lack of csum.

So for the nodatacow case, btrfs won't detect the corruption; users take 
the responsibility for keeping their data correct.

> 
> 
> 
>>
>> For the example I gave above, no data damage at all.
>>
>> First the data is written and power loss, and data is always written before
>> metadata, so that's to say, after power loss, superblock is still using the
>> old tree roots.
>>
>> So no one is really using that newly written data.
> 
> OK, but that assumes the newly written data is always CoW'd, which on
> Btrfs raid56 is not certain; there's a bunch of RMW code which
> suggests overwrites are possible.

RMW is mainly there to update P/Q: even if we only update data stripe 1, 
we still need data stripe 2 to calculate the new P/Q.
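Roughly, for single parity there are two equivalent ways to produce
the new P when only one data strip changes, and both need extra reads
before the partial write. A toy XOR sketch in Python (for Q the
arithmetic is over GF(2^8), but the shape is the same):

  d1_old, d2 = b"\x01\x02", b"\x0f\x0f"
  p_old = bytes(a ^ b for a, b in zip(d1_old, d2))

  d1_new = b"\xff\x00"

  # (a) reconstruct-write: read the untouched strip(s), recompute parity
  p_a = bytes(a ^ b for a, b in zip(d1_new, d2))

  # (b) read-modify-write: read old data + old parity, fold the change in
  p_b = bytes(a ^ b ^ c for a, b, c in zip(p_old, d1_old, d1_new))

  assert p_a == p_b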

> And for raid56 metadata, it suggests RMW could happen as well.

As long as we have P/Q, RMW must be used.

The root problem is that we need cross-device FUA to ensure the full 
stripe is written correctly.
Or we could go with the extent allocator modification, to ensure we only 
write into vertical stripes that contain no used data.
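A sketch of what that allocator restriction could look like (an
assumption about the idea, not existing btrfs code):

  def pick_stripe_for_new_write(stripes):
      # stripes: one list per full stripe, each entry a per-strip
      # "holds live data" flag
      for idx, stripe in enumerate(stripes):
          if not any(stripe):     # completely unused stripe: full-stripe write,
              return idx          # no RMW, so a crash only touches unreferenced data
      return None                 # otherwise we are back to RMW on a partially used stripe

  print(pick_stripe_for_new_write([[True, False], [False, False]]))   # -> 1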


So anyway, RAID5/6 is only designed to handle missing devices, not power 
loss.
IIRC an mdadm RAID5/6 array needs to be scrubbed (resynced) each time a 
power loss is detected.
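(In practice that means something like "echo check >
/sys/block/md0/md/sync_action" on the md side, or simply letting md
resync automatically after an unclean shutdown; the btrfs-side
equivalent command is "btrfs scrub start <mountpoint>". md0 and the
mount point are placeholders.)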

Thanks,
Qu

> 
> There's fairly strong anecdotal evidence that people have fewer
> problems with Btrfs raid5 when raid5 applies to data block groups, and
> metadata block groups use some other, non-parity-based profile like
> raid1.
> 
> 
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-22  2:12     ` Qu Wenruo
  2017-06-22  2:43       ` Chris Murphy
@ 2017-06-22  5:15       ` Goffredo Baroncelli
  1 sibling, 0 replies; 23+ messages in thread
From: Goffredo Baroncelli @ 2017-06-22  5:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Murphy, waxhead, Btrfs BTRFS

On 2017-06-22 04:12, Qu Wenruo wrote:
> 
> And in that case even device of data stripe 2 is missing, btrfs don't really need to use parity to rebuild it, as btrfs knows there is no extent in that stripe, and data csum matches for data stripe 1.

You are assuming that there is no data on disk2. That is likely, due to the CoW nature of BTRFS, but it is not always true.

Anyway, the same problem happens if you are writing data to disk2. If

a) the data (disk2) is written, and
b) the parity is not updated (due to a power failure),

then up to that point you haven't lost anything; but if

c) disk1 disappears,

you are not in a position to recompute valid data for disk1 using only 
data2 and the (stale) parity.
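In toy XOR terms (illustrative Python, single parity, two data disks),
the a)/b)/c) sequence above plays out like this:

  d1, d2_old = b"1111", b"2222"
  parity = bytes(a ^ b for a, b in zip(d1, d2_old))   # consistent stripe

  d2_new = b"3333"      # a) data on disk2 is written
                        # b) power failure: the parity above is never updated

  # c) disk1 is lost; the only way back is d2 + parity, but the parity is stale
  d1_rebuilt = bytes(a ^ b for a, b in zip(d2_new, parity))
  print(d1_rebuilt == d1)   # False: disk1's data, which was never touched, is gone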


> No need to use parity at all.
> 
> So that's why I think the hole write is not an urgent case to handle right now.
> 
> Thanks,
> Qu


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-20 22:57 Exactly what is wrong with RAID5/6 waxhead
  2017-06-20 23:25 ` Hugo Mills
  2017-06-21  8:45 ` Qu Wenruo
@ 2017-06-23 17:25 ` Michał Sokołowski
  2017-06-23 18:45   ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 23+ messages in thread
From: Michał Sokołowski @ 2017-06-23 17:25 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2618 bytes --]

Hello group.

I am confused: can somebody please confirm or deny which RAID
subsystem is affected? BTRFS's RAID5/6, or mdadm (Linux kernel MD
RAID) RAID5/6?

Are there gotchas (in terms of broken reliability) when using the
kernel MD one?

The web is full of legends; it seems that this confusion is quite
common...

On 06/21/2017 12:57 AM, waxhead wrote:
> I am trying to piece together the actual status of the RAID5/6 bit of
> BTRFS.
> The wiki refer to kernel 3.19 which was released in February 2015 so I
> assume that the information there is a tad outdated (the last update
> on the wiki page was July 2016)
> https://btrfs.wiki.kernel.org/index.php/RAID56
>
> Now there are four problems listed
>
> 1. Parity may be inconsistent after a crash (the "write hole")
> Is this still true, if yes - would not this apply for RAID1 / RAID10
> as well? How was it solved there , and why can't that be done for RAID5/6
>
> 2. Parity data is not checksummed
> Why is this a problem? Does it have to do with the design of BTRFS
> somehow?
> Parity is after all just data, BTRFS does checksum data so what is the
> reason this is a problem?
>
> 3. No support for discard? (possibly -- needs confirmation with cmason)
> Does this matter that much really?, is there an update on this?
>
> 4. The algorithm uses as many devices as are available: No support for
> a fixed-width stripe.
> What is the plan for this one? There was patches on the mailing list
> by the SnapRAID author to support up to 6 parity devices. Will the
> (re?) resign of btrfs raid5/6 support a scheme that allows for
> multiple parity devices?
>
> I do have a few other questions as well...
>
> 5. BTRFS does still (kernel 4.9) not seem to use the device ID to
> communicate with devices.
>
> If you on a multi device filesystem yank out a device, for example
> /dev/sdg and it reappear as /dev/sdx for example btrfs will still
> happily try to write to /dev/sdg even if btrfs fi sh /mnt shows the
> correct device ID. What is the status for getting BTRFS to properly
> understand that a device is missing?
>
> 6. RAID1 needs to be able to make two copies always. E.g. if you have
> three disks you can loose one and it should still work. What about
> RAID10 ? If you have for example 6 disk RAID10 array, loose one disk
> and reboots (due to #5 above). Will RAID10 recognize that the array
> now is a 5 disk array and stripe+mirror over 2 disks (or possibly 2.5
> disks?) instead of 3? In other words, will it work as long as it can
> create a RAID10 profile that requires a minimum of four disks? 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Exactly what is wrong with RAID5/6
  2017-06-23 17:25 ` Michał Sokołowski
@ 2017-06-23 18:45   ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2017-06-23 18:45 UTC (permalink / raw)
  To: Michał Sokołowski, linux-btrfs

On 2017-06-23 13:25, Michał Sokołowski wrote:
> Hello group.
> 
> I am confused: can somebody please confirm or deny which RAID
> subsystem is affected? BTRFS's RAID5/6, or mdadm (Linux kernel MD
> RAID) RAID5/6?
All of the issues mentioned here are specific to BTRFS raid5/raid6 
profiles, with the exception of the write-hole, which inherently affects 
any raid5/raid6 system that does not specifically account for it (which 
means that it does affect MD RAID5 and RAID6 modes if you aren't using 
the journaling).
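(With MD the write hole can be closed by adding a journal device at
array creation time, e.g. "mdadm --create /dev/md0 --level=5
--raid-devices=3 /dev/sdb /dev/sdc /dev/sdd --write-journal
/dev/nvme0n1"; the device names are placeholders and this needs a
reasonably recent mdadm and kernel.)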
> 
> Are there gotchas (in terms of broken reliability) when using the
> kernel MD one?
> 
> The web is full of legends; it seems that this confusion is quite
> common...
Which brings up one of the reasons I really hate the choice to use the 
term 'raid' in the profile names.  At a minimum, we should have gone a 
similar route to ZFS in naming the striped parity implementations 
(RAID-B1 and RAID-B2 for example), but personally I really would have 
preferred if they were just called what they are (namely, (n,n+1) and 
(n,n+2) erasure coding for raid5 and raid6 respectively, with mirroring, 
striping and striped mirroring for raid1, raid0, and raid10), or at 
least used some naming scheme that wasn't obviously going to cause such 
issues.

> 
> On 06/21/2017 12:57 AM, waxhead wrote:
>> I am trying to piece together the actual status of the RAID5/6 bit of
>> BTRFS.
>> The wiki refer to kernel 3.19 which was released in February 2015 so I
>> assume that the information there is a tad outdated (the last update
>> on the wiki page was July 2016)
>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>
>> Now there are four problems listed
>>
>> 1. Parity may be inconsistent after a crash (the "write hole")
>> Is this still true, if yes - would not this apply for RAID1 / RAID10
>> as well? How was it solved there , and why can't that be done for RAID5/6
>>
>> 2. Parity data is not checksummed
>> Why is this a problem? Does it have to do with the design of BTRFS
>> somehow?
>> Parity is after all just data, BTRFS does checksum data so what is the
>> reason this is a problem?
>>
>> 3. No support for discard? (possibly -- needs confirmation with cmason)
>> Does this matter that much really?, is there an update on this?
>>
>> 4. The algorithm uses as many devices as are available: No support for
>> a fixed-width stripe.
>> What is the plan for this one? There was patches on the mailing list
>> by the SnapRAID author to support up to 6 parity devices. Will the
>> (re?) resign of btrfs raid5/6 support a scheme that allows for
>> multiple parity devices?
>>
>> I do have a few other questions as well...
>>
>> 5. BTRFS does still (kernel 4.9) not seem to use the device ID to
>> communicate with devices.
>>
>> If you on a multi device filesystem yank out a device, for example
>> /dev/sdg and it reappear as /dev/sdx for example btrfs will still
>> happily try to write to /dev/sdg even if btrfs fi sh /mnt shows the
>> correct device ID. What is the status for getting BTRFS to properly
>> understand that a device is missing?
>>
>> 6. RAID1 needs to be able to make two copies always. E.g. if you have
>> three disks you can loose one and it should still work. What about
>> RAID10 ? If you have for example 6 disk RAID10 array, loose one disk
>> and reboots (due to #5 above). Will RAID10 recognize that the array
>> now is a 5 disk array and stripe+mirror over 2 disks (or possibly 2.5
>> disks?) instead of 3? In other words, will it work as long as it can
>> create a RAID10 profile that requires a minimum of four disks?
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-06-23 18:45 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
2017-06-20 22:57 Exactly what is wrong with RAID5/6 waxhead
2017-06-20 23:25 ` Hugo Mills
2017-06-21  3:48   ` Chris Murphy
2017-06-21  6:51     ` Marat Khalili
2017-06-21  7:31       ` Peter Grandi
2017-06-21 17:13       ` Andrei Borzenkov
2017-06-21 18:43       ` Chris Murphy
2017-06-21  8:45 ` Qu Wenruo
2017-06-21 12:43   ` Christoph Anton Mitterer
2017-06-21 13:41     ` Austin S. Hemmelgarn
2017-06-21 17:20       ` Andrei Borzenkov
2017-06-21 17:30         ` Austin S. Hemmelgarn
2017-06-21 17:03   ` Goffredo Baroncelli
2017-06-22  2:05     ` Qu Wenruo
2017-06-21 18:24   ` Chris Murphy
2017-06-21 20:12     ` Goffredo Baroncelli
2017-06-21 23:19       ` Chris Murphy
2017-06-22  2:12     ` Qu Wenruo
2017-06-22  2:43       ` Chris Murphy
2017-06-22  3:55         ` Qu Wenruo
2017-06-22  5:15       ` Goffredo Baroncelli
2017-06-23 17:25 ` Michał Sokołowski
2017-06-23 18:45   ` Austin S. Hemmelgarn
