* [RFC] Checksum of the parity
@ 2017-08-13 14:16 Goffredo Baroncelli
  2017-08-13 18:45 ` Chris Murphy
  2017-08-14 13:23 ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 8+ messages in thread
From: Goffredo Baroncelli @ 2017-08-13 14:16 UTC (permalink / raw)
  To: linux-btrfs

Hi all,

in the BTRFS wiki, on the status page, in the RAID5/6 row it is reported that the parity is not checksummed. This has been reported several times on the ML, and also on other sites (e.g. Phoronix), as a BTRFS defect.

However, I was unable to understand why, and I suspect that this is a myth.

So my question is: can the fact that in BTRFS RAID5/6 the parity is not checksummed be considered a defect?

My goal is to verify whether there is a rationale for requiring the parity to be checksummed, and if not, I would like to remove this item from the wiki.

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [RFC] Checksum of the parity
  2017-08-13 14:16 [RFC] Checksum of the parity Goffredo Baroncelli
@ 2017-08-13 18:45 ` Chris Murphy
  2017-08-13 23:40   ` Janos Toth F.
  2017-08-14 14:12   ` Goffredo Baroncelli
  2017-08-14 13:23 ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 8+ messages in thread
From: Chris Murphy @ 2017-08-13 18:45 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: linux-btrfs

On Sun, Aug 13, 2017 at 8:16 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
> Hi all,
>
> in the BTRFS wiki, on the status page, in the RAID5/6 row it is reported that the parity is not checksummed. This has been reported several times on the ML, and also on other sites (e.g. Phoronix), as a BTRFS defect.
>
> However, I was unable to understand why, and I suspect that this is a myth.
>
> So my question is: can the fact that in BTRFS RAID5/6 the parity is not checksummed be considered a defect?
>
> My goal is to verify whether there is a rationale for requiring the parity to be checksummed, and if not, I would like to remove this item from the wiki.

It is not a defect per se. If parity is corrupt, and parity is needed
for reconstruction, the reconstruction will be corrupt, but that is
then detected and we get EIO. [1]

Further, the error detection of corrupt reconstruction is why I say
Btrfs is not subject *in practice* to the write hole problem. [2]


[1]
I haven't tested the raid6 normal read case where a stripe contains
corrupt data strip and corrupt P strip, and Q strip is good. I expect
instead of EIO, we get a reconstruction from Q, and then both data and
P get fixed up, but I can't find it in comments or code.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?h=v4.12.7
line 1851: I'm not sure exactly where we are at this line; it seems
like it must be a scrub, because P & Q are not relevant if the data is
good.

[2]
Is Btrfs subject to the write hole problem manifesting on disk? I'm
not sure, sadly I don't read the code well enough. But if all Btrfs
raid56 writes are full stripe CoW writes, and if the prescribed order
guarantees still happen: data CoW to disk > metadata CoW to disk >
superblock update, then I don't see how the write hole happens. Write
hole requires: RMW of a stripe, which is a partial stripe overwrite,
and a crash during the modification of the stripe making that stripe
inconsistent as well as still pointed to by metadata.
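
A toy illustration of the detection path mentioned above -- why a
rebuild from bad parity comes back as EIO instead of silent corruption
(Python, purely illustrative and not the btrfs code: single XOR
parity, zlib crc32 standing in for crc32c, short strings standing in
for 64KiB strips):

import zlib

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Full stripe written once: two data strips plus parity, and a data
# checksum recorded in metadata.
d1, d2 = b"hello!", b"world!"
parity = xor(d1, d2)
csum_d1 = zlib.crc32(d1)

# The parity strip is silently corrupted (bitrot, torn write, ...).
parity = b"\x00" + parity[1:]

# Degraded read: the disk holding d1 is gone, so rebuild it from d2
# and the (now bad) parity, then verify against the stored checksum.
rebuilt = xor(d2, parity)
if zlib.crc32(rebuilt) == csum_d1:
    print("rebuild verified, data returned to the reader")
else:
    print("csum mismatch on the rebuilt block -> EIO")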




-- 
Chris Murphy


* Re: [RFC] Checksum of the parity
  2017-08-13 18:45 ` Chris Murphy
@ 2017-08-13 23:40   ` Janos Toth F.
  2017-08-14 14:12   ` Goffredo Baroncelli
  1 sibling, 0 replies; 8+ messages in thread
From: Janos Toth F. @ 2017-08-13 23:40 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Goffredo Baroncelli, linux-btrfs

On Sun, Aug 13, 2017 at 8:45 PM, Chris Murphy <lists@colorremedies.com> wrote:
> Further, the error detection of corrupt reconstruction is why I say
> Btrfs is not subject *in practice* to the write hole problem. [2]
>
> [1]
> I haven't tested the raid6 normal read case where a stripe contains
> corrupt data strip and corrupt P strip, and Q strip is good. I expect
> instead of EIO, we get a reconstruction from Q, and then both data and
> P get fixed up, but I can't find it in comments or code.

Yes, that's what I would expect (which theoretically makes the odds of
successful recovery better on RAID6, possibly "good enough"), but I
have no clue how that is actually handled right now (I guess the
current code isn't that thorough).

> [2]
> Is Btrfs subject to the write hole problem manifesting on disk? I'm
> not sure, sadly I don't read the code well enough. But if all Btrfs
> raid56 writes are full stripe CoW writes, and if the prescribed order
> guarantees still happen: data CoW to disk > metadata CoW to disk >
> superblock update, then I don't see how the write hole happens. Write
> hole requires: RMW of a stripe, which is a partial stripe overwrite,
> and a crash during the modification of the stripe making that stripe
> inconsistent as well as still pointed to by metadata.

I guess the problem is that the stripe size or the stripe element size
is (sort of) fixed (not sure which one; I guess it's the latter, in
which case the actual stripe size depends on the number of devices) and
relatively big (much bigger than the usual 4k sector size, or even the
leaf size, which now defaults to 16k if I recall correctly [but I set
this to 4k myself]), so a partial stripe update (RMW) is certainly
possible during generic use.

This is why I threw the idea around a few months ago of resurrecting
that old (but dead-looking / stuck) project of making the stripe
(element) size configurable by the user. That would allow making the
stripe size equal to the filesystem sector size on a limited set of
setups (for example, 5 or 6 HDDs with 512-byte physical sectors in
RAID-5 or RAID-6 respectively), which would (as I understand it)
practically eliminate the problem, at least on the filesystem side. I
am not sure whether the HDD's volatile write cache, or at least its
internal re-ordering feature, would still have to be disabled to really
avoid inconsistencies between stripe elements --- I can't recall ever
seeing partially written sectors (we would know, since these are
checksummed in place and thus appear unreadable if partially written);
I guess there is usually enough charge in some small capacitor to
finish the current sector after the power gets cut(?).


* Re: [RFC] Checksum of the parity
  2017-08-13 14:16 [RFC] Checksum of the parity Goffredo Baroncelli
  2017-08-13 18:45 ` Chris Murphy
@ 2017-08-14 13:23 ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 8+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-14 13:23 UTC (permalink / raw)
  To: kreijack, linux-btrfs

On 2017-08-13 10:16, Goffredo Baroncelli wrote:
> Hi all,
> 
> in the BTRFS wiki, on the status page, in the RAID5/6 row it is reported that the parity is not checksummed. This has been reported several times on the ML, and also on other sites (e.g. Phoronix), as a BTRFS defect.
>
> However, I was unable to understand why, and I suspect that this is a myth.
>
> So my question is: can the fact that in BTRFS RAID5/6 the parity is not checksummed be considered a defect?
>
> My goal is to verify whether there is a rationale for requiring the parity to be checksummed, and if not, I would like to remove this item from the wiki.

While there isn't one for normal operation (as Chris did a good job of
explaining), there is a benefit for scrubbing the filesystem.

Without checksummed parity, you have to verify checksums on all the
data, and then either recompute the parity and compare it, or
reconstruct the data from parity and compare that, to be able to
verify that everything is correct.

With checksummed parity, you just verify checksums on everything.

So, overall, I wouldn't consider it a defect, but having it could 
improve performance for a very specific use case.
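
A toy sketch of the two scrub strategies described above (Python,
purely illustrative; crc32 stands in for crc32c and a dict stands in
for the csum tree):

import zlib
from functools import reduce

def xor_all(strips):
    # XOR the strips column by column, i.e. recompute RAID5 parity.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

def scrub_without_parity_csums(data, parity, data_csums):
    # Verify every data strip, then recompute the parity from the data
    # and compare it with what is on disk.
    data_ok = all(zlib.crc32(s) == data_csums[i] for i, s in enumerate(data))
    return data_ok and xor_all(data) == parity

def scrub_with_parity_csums(data, parity, data_csums, parity_csum):
    # Parity is checksummed like any other block: just verify checksums,
    # no XOR recomputation needed.
    data_ok = all(zlib.crc32(s) == data_csums[i] for i, s in enumerate(data))
    return data_ok and zlib.crc32(parity) == parity_csum

data = [b"AAAAAA", b"BBBBBB"]
parity = xor_all(data)
csums = {i: zlib.crc32(s) for i, s in enumerate(data)}
print(scrub_without_parity_csums(data, parity, csums))                   # True
print(scrub_with_parity_csums(data, parity, csums, zlib.crc32(parity)))  # True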


* Re: [RFC] Checksum of the parity
  2017-08-13 18:45 ` Chris Murphy
  2017-08-13 23:40   ` Janos Toth F.
@ 2017-08-14 14:12   ` Goffredo Baroncelli
  2017-08-14 19:28     ` Chris Murphy
  1 sibling, 1 reply; 8+ messages in thread
From: Goffredo Baroncelli @ 2017-08-14 14:12 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On 08/13/2017 08:45 PM, Chris Murphy wrote:
> [2]
> Is Btrfs subject to the write hole problem manifesting on disk? I'm
> not sure, sadly I don't read the code well enough. But if all Btrfs
> raid56 writes are full stripe CoW writes, and if the prescribed order
> guarantees still happen: data CoW to disk > metadata CoW to disk >
> superblock update, then I don't see how the write hole happens. Write
> hole requires: RMW of a stripe, which is a partial stripe overwrite,
> and a crash during the modification of the stripe making that stripe
> inconsistent as well as still pointed to by metadata.


RAID5 is *single* failure proof, and in order to hit the write hole bug we need two failures:
1) a transaction is aborted (e.g. due to a power failure) and the result is that data and parity are misaligned
2) a disk disappears

These two events may even happen at different moments.

The key point is that when a disk disappears, all the remaining ones are used to rebuild the missing one. So if data and parity are misaligned, the rebuilt disk is wrong.

Let me show an example:

Disk 1            Disk 2         Disk 3  (parity)
AAAAAA            BBBBBB         CCCCCC

where CCCCCC = AAAAAA ^ BBBBBB

Note 1: AAAAAA is valid data.

Suppose you update B but, due to a power failure, you can't update the parity; you then have:


Disk 1            Disk 2         Disk 3  (parity)
AAAAAA            DDDDDD         CCCCCC

Of course CCCCCC != AAAAAA ^ DDDDDD (data and parity are misaligned).


Note that AAAAAA is still valid data.

Now suppose you lose disk 1. If you want to read from it, you have to read disk 2 and disk 3 to compute disk 1.

However disk 2 and disk 3 are misaligned, so computing DDDDDD ^ CCCCCC no longer gives you AAAAAA.


Note that it does not matter whether DDDDDD or BBBBBB is valid or invalid data.


Moreover, I have to point out that a simple scrub run between events 1) and 2) is able to rebuild a correct parity. This reduces the likelihood of the "write hole" bug.
The only case which would still exist is when 1) and 2) happen at the same time (which is not impossible: e.g. if a disk dies, it is not infrequent that the user shuts down the machine without waiting for a clean shutdown).
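
To put concrete numbers on the example above (single bytes, purely illustrative): with A = 0x41 and B = 0x42 the parity is C = A ^ B = 0x03. The interrupted update rewrites B to D = 0x44 but leaves C = 0x03 on disk (the correct parity would be A ^ D = 0x05). When disk 1 is later lost, the rebuild computes D ^ C = 0x44 ^ 0x03 = 0x47, not 0x41: the old, still valid A is gone, and the checksum can only tell us that it is gone.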

BR
G.Baroncelli



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [RFC] Checksum of the parity
  2017-08-14 14:12   ` Goffredo Baroncelli
@ 2017-08-14 19:28     ` Chris Murphy
  2017-08-14 20:18       ` Goffredo Baroncelli
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Murphy @ 2017-08-14 19:28 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: Chris Murphy, linux-btrfs

On Mon, Aug 14, 2017 at 8:12 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
> On 08/13/2017 08:45 PM, Chris Murphy wrote:
>> [2]
>> Is Btrfs subject to the write hole problem manifesting on disk? I'm
>> not sure, sadly I don't read the code well enough. But if all Btrfs
>> raid56 writes are full stripe CoW writes, and if the prescribed order
>> guarantees still happen: data CoW to disk > metadata CoW to disk >
>> superblock update, then I don't see how the write hole happens. Write
>> hole requires: RMW of a stripe, which is a partial stripe overwrite,
>> and a crash during the modification of the stripe making that stripe
>> inconsistent as well as still pointed to by metadata.
>
>
> RAID5 is *single* failure proof, and in order to hit the write hole bug we need two failures:
> 1) a transaction is aborted (e.g. due to a power failure) and the result is that data and parity are misaligned
> 2) a disk disappears
>
> These two events may even happen at different moments.
>
> The key point is that when a disk disappears, all the remaining ones are used to rebuild the missing one. So if data and parity are misaligned, the rebuilt disk is wrong.
>
> Let me show an example:
>
> Disk 1            Disk 2         Disk 3  (parity)
> AAAAAA            BBBBBB         CCCCCC
>
> where CCCCCC = AAAAAA ^ BBBBBB
>
> Note 1: AAAAAA is valid data.
>
> Suppose you update B but, due to a power failure, you can't update the parity; you then have:
>
>
> Disk 1            Disk 2         Disk 3  (parity)
> AAAAAA            DDDDDD         CCCCCC
>
> Of course CCCCCC != AAAAAA ^ DDDDDD (data and parity are misaligned).
>
>
> Note that AAAAAA is still valid data.
>
> Now suppose you lose disk 1. If you want to read from it, you have to read disk 2 and disk 3 to compute disk 1.
>
> However disk 2 and disk 3 are misaligned, so computing DDDDDD ^ CCCCCC no longer gives you AAAAAA.
>
>
> Note that it does not matter whether DDDDDD or BBBBBB is valid or invalid data.


Doesn't matter on Btrfs. Bad reconstruction due to wrong parity
results in csum mismatch. This I've tested.

I vaguely remember, a while ago, doing a dd conv=notrunc modification
of a file that's raid5, and there was no RMW; what happened is that
the whole stripe was CoW'd and contained the modification. So that
would, hardware behaving correctly, mean that the raid5 data CoW
succeeds, then there is a metadata CoW to point to it, then the
superblock is updated to point to the new tree.

At any point, if there's an interruption, we have the old super
pointing to the old tree which points to premodified data.

Anyway, I do wish I read the code better, so I knew exactly where, if
at all, the RMW code was happening on disk rather than just in memory.
There very clearly is in-memory RMW code as a performance optimization:
before a stripe gets written out it's possible to RMW it to add in
more changes or new files, so that raid56 isn't dog slow CoW'ing
literally a handful of 16KiB leaves each time, which then translates
into a minimum of 384K of writes.

But yeah, Qu just said in another thread that Liu is working on a
journal for the raid56 write hole problem. Thing is I don't see when
it happens in the code or in practice (so far, it's really tedious to
poke a file system with a stick).




-- 
Chris Murphy


* Re: [RFC] Checksum of the parity
  2017-08-14 19:28     ` Chris Murphy
@ 2017-08-14 20:18       ` Goffredo Baroncelli
  2017-08-14 21:10         ` Chris Murphy
  0 siblings, 1 reply; 8+ messages in thread
From: Goffredo Baroncelli @ 2017-08-14 20:18 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On 08/14/2017 09:28 PM, Chris Murphy wrote:
> On Mon, Aug 14, 2017 at 8:12 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
>> On 08/13/2017 08:45 PM, Chris Murphy wrote:
>>> [2]
>>> Is Btrfs subject to the write hole problem manifesting on disk? I'm
>>> not sure, sadly I don't read the code well enough. But if all Btrfs
>>> raid56 writes are full stripe CoW writes, and if the prescribed order
>>> guarantees still happen: data CoW to disk > metadata CoW to disk >
>>> superblock update, then I don't see how the write hole happens. Write
>>> hole requires: RMW of a stripe, which is a partial stripe overwrite,
>>> and a crash during the modification of the stripe making that stripe
>>> inconsistent as well as still pointed to by metadata.
>>
>>
>> RAID5 is *single* failure proof, and in order to hit the write hole bug we need two failures:
>> 1) a transaction is aborted (e.g. due to a power failure) and the result is that data and parity are misaligned
>> 2) a disk disappears
>>
>> These two events may even happen at different moments.
>>
>> The key point is that when a disk disappears, all the remaining ones are used to rebuild the missing one. So if data and parity are misaligned, the rebuilt disk is wrong.
>>
>> Let me show an example:
>>
>> Disk 1            Disk 2         Disk 3  (parity)
>> AAAAAA            BBBBBB         CCCCCC
>>
>> where CCCCCC = AAAAAA ^ BBBBBB
>>
>> Note 1: AAAAAA is valid data.
>>
>> Suppose you update B but, due to a power failure, you can't update the parity; you then have:
>>
>>
>> Disk 1            Disk 2         Disk 3  (parity)
>> AAAAAA            DDDDDD         CCCCCC
>>
>> Of course CCCCCC != AAAAAA ^ DDDDDD (data and parity are misaligned).
>>
>>
>> Note that AAAAAA is still valid data.
>>
>> Now suppose you lose disk 1. If you want to read from it, you have to read disk 2 and disk 3 to compute disk 1.
>>
>> However disk 2 and disk 3 are misaligned, so computing DDDDDD ^ CCCCCC no longer gives you AAAAAA.
>>
>>
>> Note that it does not matter whether DDDDDD or BBBBBB is valid or invalid data.
> 
> 
> Doesn't matter on Btrfs. Bad reconstruction due to wrong parity
> results in csum mismatch. This I've tested.

I never argued about that. The write hole is related to the *loss* of "valid data" due to a misalignment between data and parity.
The fact that BTRFS is capable of detecting the problem and returning -EIO doesn't mitigate the loss of valid data. Note that in my example AAAAAA reached the disk before the "failure events".

> 
> I vaguely remember, a while ago, doing a dd conv=notrunc modification
> of a file that's raid5, and there was no RMW; what happened is that
> the whole stripe was CoW'd and contained the modification. So that
> would, hardware behaving correctly, mean that the raid5 data CoW
> succeeds, then there is a metadata CoW to point to it, then the
> superblock is updated to point to the new tree.
> 
> At any point, if there's an interruption, we have the old super
> pointing to the old tree which points to premodified data.
> 
> Anyway, I do wish I read the code better, so I knew exactly where, if
> at all, the RMW code was happening on disk rather than just in memory.
> There very clearly is in-memory RMW code as a performance optimization:
> before a stripe gets written out it's possible to RMW it to add in
> more changes or new files, so that raid56 isn't dog slow CoW'ing
> literally a handful of 16KiB leaves each time, which then translates
> into a minimum of 384K of writes.

In the case of a full stripe write, there is no RMW cycle, so no "write hole". Unfortunately not all writes are full-stripe sized. I never checked the code, but I hope that during a transaction commit the writes are grouped into "full stripe writes" as much as possible.

Just out of curiosity, what is the "minimum of 384k"? In a 3-disk raid5 case, the minimum data is 64k * 2 (+ 64k of parity)...

> But yeah, Qu just said in another thread that Liu is working on a
> journal for the raid56 write hole problem. Thing is I don't see when
> it happens in the code or in practice (so far, it's really tedious to
> poke a file system with a stick).
> 



> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [RFC] Checksum of the parity
  2017-08-14 20:18       ` Goffredo Baroncelli
@ 2017-08-14 21:10         ` Chris Murphy
  0 siblings, 0 replies; 8+ messages in thread
From: Chris Murphy @ 2017-08-14 21:10 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: Chris Murphy, linux-btrfs

On Mon, Aug 14, 2017 at 2:18 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:

>> Anyway, I do wish I read the code better, so I knew exactly where, if
>> at all, the RMW code was happening on disk rather than just in memory.
>> There very clearly is in-memory RMW code as a performance optimization:
>> before a stripe gets written out it's possible to RMW it to add in
>> more changes or new files, so that raid56 isn't dog slow CoW'ing
>> literally a handful of 16KiB leaves each time, which then translates
>> into a minimum of 384K of writes.
>
> In the case of a full stripe write, there is no RMW cycle, so no "write hole".

That is conflating disk writes and in-memory RMW. They have to be
separated. For sure there's in-memory RMW + CoW of the entire stripe
to disk for a tiny (1 byte) change to a file, because I've seen it.
What I don't know, and can't tell from the code, is whether there's
ever such a thing as a partial stripe RMW (an overwrite of just a data
strip and a corresponding overwrite of the parity). Any overwrite is
where the write hole comes into play.

But inferring from the journal work Liu is apparently doing, it must
be true that there is such a thing as overwrites with Btrfs raid56.
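
For reference, a sketch of the difference between the two write paths
as I understand the general RAID5 technique (Python, purely
illustrative, not the btrfs code):

from functools import reduce

def xor_all(strips):
    # Column-wise XOR of equally sized strips (RAID5 parity).
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

def full_stripe_write(new_data):
    # CoW path: all data strips plus parity go to a fresh stripe; no
    # pre-existing data shares the stripe, so an interruption only
    # loses the new, uncommitted write.
    return new_data + [xor_all(new_data)]

def partial_stripe_rmw(stripe, index, new_strip):
    # RMW path: read the old strip and old parity, patch the parity,
    # then overwrite both in place. A crash between the two in-place
    # overwrites leaves the stripe inconsistent -- the write hole window.
    old_strip, old_parity = stripe[index], stripe[-1]
    stripe[index] = new_strip                                  # overwrite #1
    stripe[-1] = xor_all([old_parity, old_strip, new_strip])   # overwrite #2
    return stripe

stripe = full_stripe_write([b"AAAAAA", b"BBBBBB"])
stripe = partial_stripe_rmw(stripe, 1, b"DDDDDD")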



>
> Just out of curiosity, what is the "minimum of 384k"? In a 3-disk raid5 case, the minimum data is 64k * 2 (+ 64k of parity)...

Bad addition on my part.


-- 
Chris Murphy

