* RAID56 - 6 parity raid
@ 2018-05-01 21:57 Gandalf Corvotempesta
  2018-05-02  1:47 ` Duncan
  2018-05-03 12:47 ` Alberto Bursi
  0 siblings, 2 replies; 22+ messages in thread
From: Gandalf Corvotempesta @ 2018-05-01 21:57 UTC (permalink / raw)
  To: linux-btrfs

Hi to all
I've found some patches from Andrea Mazzoleni that add support for up to
6-parity raid.
Why weren't these merged?
With modern disk sizes, having something greater than 2 parities would be
great.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-01 21:57 RAID56 - 6 parity raid Gandalf Corvotempesta
@ 2018-05-02  1:47 ` Duncan
  2018-05-02 16:27   ` Goffredo Baroncelli
  2018-05-03 12:47 ` Alberto Bursi
  1 sibling, 1 reply; 22+ messages in thread
From: Duncan @ 2018-05-02  1:47 UTC (permalink / raw)
  To: linux-btrfs

Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 +0000 as
excerpted:

> Hi to all, I've found some patches from Andrea Mazzoleni that add
> support for up to 6-parity raid.
> Why weren't these merged?
> With modern disk sizes, having something greater than 2 parities would
> be great.

1) Btrfs parity-raid was known to be seriously broken until quite 
recently, and it still has the common parity-raid write hole.  The hole 
is more serious on btrfs because btrfs otherwise goes to some lengths to 
ensure data/metadata integrity via checksumming and verification, yet the 
parity isn't checksummed, so even old data is put at risk by the write 
hole (there are a number of proposals to fix that).  Piling even more 
not-well-tested patches on top was _not_ the way toward a solution.

2) Btrfs features in general have taken longer to merge and stabilize 
than one might expect, and parity-raid has been a prime example, with the 
original roadmap calling for parity-raid merge back in the 3.5 timeframe 
or so... partial/runtime (not full recovery) code was finally merged ~3 
years later in (IIRC) 3.19.  It took several development cycles for the 
initial critical bugs to be worked out, but by 4.2 or so it was starting 
to look good; then more bugs were found and reported, and those took 
several more years to fix, tho IIRC LTS-4.14 has the fixes.

Meanwhile, consider that N-way-mirroring was fast-path roadmapped for 
"right after raid56 mode" (because some of its code depends on that), so 
it was originally expected in 3.6 or so...  As someone who had been 
wanting to use /that/, I personally know the pain of "still waiting".

And that was "fast-pathed".

So even if the multi-way-parity patches were on the "fast" path, it's 
only "now" (for relative values of now; for argument's sake, say by 
4.20/5.0 or whatever it ends up being called) that such a thing could be 
reasonably considered.


3) AFAIK none of the btrfs devs have flat rejected the idea, but btrfs 
remains development opportunity rich and implementing dev poor... there's 
likely 20 years or more of "good" ideas out there.  And the N-way-parity-
raid patches haven't hit any of the current devs' (or their employers') 
"personal itch that needs to be scratched" interest points, so while it 
certainly does remain a "nice idea", given the implementation-timeline 
history for even "fast-pathed" ideas, realistically we're looking at at 
least a decade out.  But with the practical projection horizon no more 
than 5-7 years out (beyond that, other unpredicted developments are 
likely to change things so much that projection is effectively 
impossible), in practice a decade out is "bluesky", aka "it'd be nice to 
have someday, but it's not a priority, and with current developer 
manpower, it's unlikely to happen any time in the practically projectable 
future."

4) Of course all that's subject to no major new btrfs developer (or 
sponsor) making it a high priority, but even should such a developer (and/
or sponsor) appear, they'd probably need to spend at least two years 
coming up to speed with the code first, fixing normal bugs and improving 
the existing code quality, then post the updated and rebased N-way-parity 
patches for discussion, and get them roadmapped for merge probably some 
years later due to other then-current project feature dependencies.

So even if the N-way-parity patches became some new developer's (or 
sponsor's) personal itch to scratch, by the time they came up to speed 
and the code was actually merged, there's no realistic projection that it 
would be in under 5 years, plus another couple to stabilize, so at least 
7 years to properly usable stability.  So even then, we're already at 
the 5-7 year practical-projectability limit.


Meanwhile, have you looked at zfs?  Perhaps they have something like 
that?  And there's also a new(?) one, stratis, AFAIK commercially 
sponsored and device-mapper based, that I saw an article on recently, tho 
I've seen/heard no kernel-community discussion on it (there's a good 
chance followup here will change that if it's worth discussing, as 
there are several folks here for whom knowing about such things is part 
of their job) and no other articles (besides part 1 of the series whose 
part 2 is linked below), so for all I know it's pie-in-the-sky, or still 
new enough that it'd be 5-7 years before it can be used in practice as 
well.  But assuming it's a viable project, presumably it would get such 
support if/when device-mapper does.

The stratis article I saw (apparently part 2 in a series):
https://opensource.com/article/18/4/stratis-lessons-learned

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02  1:47 ` Duncan
@ 2018-05-02 16:27   ` Goffredo Baroncelli
  2018-05-02 16:55     ` waxhead
  0 siblings, 1 reply; 22+ messages in thread
From: Goffredo Baroncelli @ 2018-05-02 16:27 UTC (permalink / raw)
  To: Duncan, linux-btrfs

Hi
On 05/02/2018 03:47 AM, Duncan wrote:
> Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 +0000 as
> excerpted:
> 
>> Hi to all, I've found some patches from Andrea Mazzoleni that add
>> support for up to 6-parity raid.
>> Why weren't these merged?
>> With modern disk sizes, having something greater than 2 parities would
>> be great.
> 1) [...] the parity isn't checksummed, ....

Why is the fact that the parity is not checksummed a problem?
I have read several times that this is a problem. However, each time the thread reached the conclusion that... it is not a problem.

So again, which problem would be solved by having the parity checksummed? To the best of my knowledge, none. In any case the data is checksummed, so it is impossible to return corrupted data (modulo bugs :-) ).

On the other hand, having the parity checksummed would increase both the code complexity and the write amplification, because every time part of a stripe is touched, not only does the parity have to be updated but also its checksum.
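
To make the write-amplification point concrete, here is a minimal sketch in Python (purely an illustration of the argument; the exact set of blocks touched is an assumption for illustration, not a description of real btrfs behaviour):

# Count the device writes for a single-block RMW update of a RAID5 stripe,
# with and without a checksum stored for the parity block.  Illustrative
# assumption only, not actual btrfs code.
def writes_per_block_update(parity_csum):
    writes = 1           # the updated data block itself
    writes += 1          # metadata block holding the data block's checksum
    writes += 1          # the stripe's parity block
    if parity_csum:
        writes += 1      # metadata block holding the parity checksum
    return writes

print("without parity checksum:", writes_per_block_update(False))   # 3
print("with parity checksum:   ", writes_per_block_update(True))    # 4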


BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 16:27   ` Goffredo Baroncelli
@ 2018-05-02 16:55     ` waxhead
  2018-05-02 17:19       ` Austin S. Hemmelgarn
  2018-05-02 17:25       ` Goffredo Baroncelli
  0 siblings, 2 replies; 22+ messages in thread
From: waxhead @ 2018-05-02 16:55 UTC (permalink / raw)
  To: kreijack, Duncan, linux-btrfs

Goffredo Baroncelli wrote:
> Hi
> On 05/02/2018 03:47 AM, Duncan wrote:
>> Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 +0000 as
>> excerpted:
>>
>>> Hi to all, I've found some patches from Andrea Mazzoleni that add
>>> support for up to 6-parity raid.
>>> Why weren't these merged?
>>> With modern disk sizes, having something greater than 2 parities would
>>> be great.
>> 1) [...] the parity isn't checksummed, ....
> 
> Why the fact that the parity is not checksummed is a problem ?
> I read several times that this is a problem. However each time the thread reached the conclusion that... it is not a problem.
> 
> So again, which problem would solve having the parity checksummed ? On the best of my knowledge nothing. In any case the data is checksummed so it is impossible to return corrupted data (modulo bug :-) ).
> 
I am not a BTRFS dev, but this should be quite easy to answer. Unless 
you checksum the parity there is no way to verify that the data 
(parity) you use to reconstruct other data is correct.

> On the other side, having the parity would increase both the code complexity and the write amplification, because every time a part of the stripe is touched not only the parity has to be updated, but also the checksum has too..
Which is a good thing. BTRFS's main selling point is that you can feel 
pretty confident that whatever you put in is exactly what you get out.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 16:55     ` waxhead
@ 2018-05-02 17:19       ` Austin S. Hemmelgarn
  2018-05-02 17:25       ` Goffredo Baroncelli
  1 sibling, 0 replies; 22+ messages in thread
From: Austin S. Hemmelgarn @ 2018-05-02 17:19 UTC (permalink / raw)
  To: waxhead, kreijack, Duncan, linux-btrfs

On 2018-05-02 12:55, waxhead wrote:
> Goffredo Baroncelli wrote:
>> Hi
>> On 05/02/2018 03:47 AM, Duncan wrote:
>>> Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 +0000 as
>>> excerpted:
>>>
>>>> Hi to all, I've found some patches from Andrea Mazzoleni that add
>>>> support for up to 6-parity raid.
>>>> Why weren't these merged?
>>>> With modern disk sizes, having something greater than 2 parities would
>>>> be great.
>>> 1) [...] the parity isn't checksummed, ....
>>
>> Why the fact that the parity is not checksummed is a problem ?
>> I read several times that this is a problem. However each time the 
>> thread reached the conclusion that... it is not a problem.
>>
>> So again, which problem would solve having the parity checksummed ? On 
>> the best of my knowledge nothing. In any case the data is checksummed 
>> so it is impossible to return corrupted data (modulo bug :-) ).
>>
> I am not a BTRFS dev , but this should be quite easy to answer. Unless 
> you checksum the parity there is no way to verify that that the data 
> (parity) you use to reconstruct other data is correct.
While this is the biggest benefit (and it's a _huge_ one, because it 
means you don't have to waste time doing the parity reconstruction if 
you know the result won't be right), there's also a rather nice benefit 
for scrubbing the array, namely that you don't have to recompute parity 
to check if it's right or not (and thus can avoid wasting time 
recomputing it for every stripe in the common case of almost every 
stripe being correct).
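
As a rough sketch of the difference for one stripe's parity (toy model in 
Python; XOR parity and zlib.crc32 are stand-ins here, not the real btrfs 
csum machinery):

# Scrub one RAID5 stripe's parity, with and without a stored parity checksum.
import zlib
from functools import reduce

def xor_blocks(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def scrub_by_recompute(data_blocks, parity):
    # No parity checksum: read every data block and recompute the parity.
    return xor_blocks(data_blocks) == parity

def scrub_by_csum(parity, stored_parity_csum):
    # Parity checksum available: only the parity block itself is read and
    # checksummed (a stale but self-consistent parity is not caught here).
    return zlib.crc32(parity) == stored_parity_csum

data = [bytes([i]) * 16384 for i in range(5)]      # 5 data blocks of 16 kB
parity = xor_blocks(data)
parity_csum = zlib.crc32(parity)
print(scrub_by_recompute(data, parity))            # True, but reads 5 blocks
print(scrub_by_csum(parity, parity_csum))          # True, reads 1 block
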
> 
>> On the other side, having the parity would increase both the code 
>> complexity and the write amplification, because every time a part of 
>> the stripe is touched not only the parity has to be updated, but also 
>> the checksum has too..
> Which is a good thing. BTRFS main selling point is that you can feel 
> pretty confident that whatever you put is exactly what you get out.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 16:55     ` waxhead
  2018-05-02 17:19       ` Austin S. Hemmelgarn
@ 2018-05-02 17:25       ` Goffredo Baroncelli
  2018-05-02 18:17         ` waxhead
  2018-05-02 19:29         ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2018-05-02 17:25 UTC (permalink / raw)
  To: waxhead, Duncan, linux-btrfs

On 05/02/2018 06:55 PM, waxhead wrote:
>>
>> So again, which problem would solve having the parity checksummed ? On the best of my knowledge nothing. In any case the data is checksummed so it is impossible to return corrupted data (modulo bug :-) ).
>>
> I am not a BTRFS dev , but this should be quite easy to answer. Unless you checksum the parity there is no way to verify that that the data (parity) you use to reconstruct other data is correct.

In any case you could catch that the computed data is wrong, because the data is always checksummed. And in any case you must check the data against its checksum.

My point is that storing the checksum is a cost that you pay *every time*. Every time you update a part of a stripe you need to update the parity, and then in turn the parity checksum. It is not a problem of space occupied nor a computational problem. It is a problem of write amplification...

The only gain is to avoid trying to use the parity when
a) you need it (i.e. when the data is missing and/or corrupted)
and b) it is corrupted.
But the likelihood of this case is very low. And you can catch it during the data checksum check (which has to be performed in any case!).

So on one side you have a *cost every time* (the write amplification); on the other side you have a gain (cpu time) *only in the case* that the parity is corrupted and you need it (e.g. scrub or corrupted data).

IMHO the cost is much higher than the gain, and the likelihood of the gain is much lower than the likelihood (=100%, or always) of the cost.
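
Expressed as a toy expected-cost comparison (every figure below is an arbitrary assumption, chosen only to show the always-paid cost versus rarely-used gain shape of the argument):

# Toy comparison: the extra write is paid on every stripe update, while the
# gain only materialises when the parity is both needed and corrupted.
updates_per_day = 100_000
reads_needing_rebuild_per_day = 1_000      # reads that actually hit the parity
p_parity_corrupted = 1e-6                  # chance the parity block is bad

extra_writes = updates_per_day * 1         # one extra checksum write per update
skipped_doomed_rebuilds = reads_needing_rebuild_per_day * p_parity_corrupted

print("extra writes per day:            ", extra_writes)
print("doomed rebuilds skipped per day: ", skipped_doomed_rebuilds)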


BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 17:25       ` Goffredo Baroncelli
@ 2018-05-02 18:17         ` waxhead
  2018-05-02 18:50           ` Andrei Borzenkov
  2018-05-02 19:04           ` Goffredo Baroncelli
  2018-05-02 19:29         ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 22+ messages in thread
From: waxhead @ 2018-05-02 18:17 UTC (permalink / raw)
  To: kreijack, Duncan, linux-btrfs

Goffredo Baroncelli wrote:
> On 05/02/2018 06:55 PM, waxhead wrote:
>>>
>>> So again, which problem would solve having the parity checksummed ? On the best of my knowledge nothing. In any case the data is checksummed so it is impossible to return corrupted data (modulo bug :-) ).
>>>
>> I am not a BTRFS dev , but this should be quite easy to answer. Unless you checksum the parity there is no way to verify that that the data (parity) you use to reconstruct other data is correct.
> 
> In any case you could catch that the compute data is wrong, because the data is always checksummed. And in any case you must check the data against their checksum.
> 
What if you lost an entire disk? Or had corruption of both data AND 
checksum? How do you plan to safely reconstruct that without checksummed 
parity?

> My point is that storing the checksum is a cost that you pay *every time*. Every time you update a part of a stripe you need to update the parity, and then in turn the parity checksum. It is not a problem of space occupied nor a computational problem. It is a a problem of write amplification...
How much of a problem is this? No benchmarks have been run, since the 
feature is not there yet, I suppose.

> 
> The only gain is to avoid to try to use the parity when
> a) you need it (i.e. when the data is missing and/or corrupted)
I'm not sure I can make out your argument here, but with RAID5/6 you 
don't have another copy to restore from. You *have* to use the parity to 
reconstruct data and it is a good thing if this data is trusted.

> and b) it is corrupted.
> But the likelihood of this case is very low. And you can catch it during the data checksum check (which has to be performed in any case !).
> 
> So from one side you have a *cost every time* (the write amplification), to other side you have a gain (cpu-time) *only in case* of the parity is corrupted and you need it (eg. scrub or corrupted data)).
> 
> IMHO the cost are very higher than the gain, and the likelihood the gain is very lower compared to the likelihood (=100% or always) of the cost.
> 
Then run benchmarks and consider making parity checksums optional 
(but pretty please, dipped in syrup with sugar on top - keep it on by 
default).


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 18:17         ` waxhead
@ 2018-05-02 18:50           ` Andrei Borzenkov
  2018-05-02 21:20             ` waxhead
  2018-05-02 19:04           ` Goffredo Baroncelli
  1 sibling, 1 reply; 22+ messages in thread
From: Andrei Borzenkov @ 2018-05-02 18:50 UTC (permalink / raw)
  To: waxhead, kreijack, Duncan, linux-btrfs

02.05.2018 21:17, waxhead wrote:
> Goffredo Baroncelli wrote:
>> On 05/02/2018 06:55 PM, waxhead wrote:
>>>>
>>>> So again, which problem would solve having the parity checksummed ?
>>>> On the best of my knowledge nothing. In any case the data is
>>>> checksummed so it is impossible to return corrupted data (modulo bug
>>>> :-) ).
>>>>
>>> I am not a BTRFS dev , but this should be quite easy to answer.
>>> Unless you checksum the parity there is no way to verify that that
>>> the data (parity) you use to reconstruct other data is correct.
>>
>> In any case you could catch that the compute data is wrong, because
>> the data is always checksummed. And in any case you must check the
>> data against their checksum.
>>
> What if you lost an entire disk? 

How does it matter exactly? RAID is per chunk anyway.

> or had corruption for both data AND checksum?

By the same logic you may have corrupted parity and its checksum.

> How do you plan to safely reconstruct that without checksummed
> parity?
> 

Define "safely". The main problem of current RAID56 implementation is
that stripe is not updated atomically (at least, that is what I
understood from the past discussions) and this is not solved by having
extra parity checksum. So how exactly "safety" is improved here? You
still need overall checksum to verify result of reconstruction, what
exactly extra parity checksum buys you?

>> My point is that storing the checksum is a cost that you pay *every
>> time*. Every time you update a part of a stripe you need to update the
>> parity, and then in turn the parity checksum. It is not a problem of
>> space occupied nor a computational problem. It is a a problem of write
>> amplification...
> How much of a problem is this? no benchmarks have been run since the
> feature is not yet there I suppose.
> 
>>
>> The only gain is to avoid to try to use the parity when
>> a) you need it (i.e. when the data is missing and/or corrupted)
> I'm not sure I can make out your argument here , but with RAID5/6 you
> don't have another copy to restore from. You *have* to use the parity to
> reconstruct data and it is a good thing if this data is trusted.
> 

Again - please describe when having a parity checksum would be beneficial
over the current implementation. You do not reconstruct anything as long
as all data strips are there, so the parity checksum will not be used. If
one data strip fails (including its checksum) it will be reconstructed and
verified. If the parity itself is corrupted, checksum verification fails
(hopefully). How is that different from verifying the parity checksum
before reconstructing? In both cases the data cannot be reconstructed, end
of story.

>> and b) it is corrupted.
>> But the likelihood of this case is very low. And you can catch it
>> during the data checksum check (which has to be performed in any case !).
>>
>> So from one side you have a *cost every time* (the write
>> amplification), to other side you have a gain (cpu-time) *only in
>> case* of the parity is corrupted and you need it (eg. scrub or
>> corrupted data)).
>>
>> IMHO the cost are very higher than the gain, and the likelihood the
>> gain is very lower compared to the likelihood (=100% or always) of the
>> cost.
>>
> Then run benchmarks and considering making parity checksums optional
> (but pretty please dipped in syrup with sugar on top - keep in on by
> default).
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 18:17         ` waxhead
  2018-05-02 18:50           ` Andrei Borzenkov
@ 2018-05-02 19:04           ` Goffredo Baroncelli
  1 sibling, 0 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2018-05-02 19:04 UTC (permalink / raw)
  To: waxhead, Duncan, linux-btrfs

On 05/02/2018 08:17 PM, waxhead wrote:
> Goffredo Baroncelli wrote:
>> On 05/02/2018 06:55 PM, waxhead wrote:
>>>>
>>>> So again, which problem would solve having the parity checksummed ? On the best of my knowledge nothing. In any case the data is checksummed so it is impossible to return corrupted data (modulo bug :-) ).
>>>>
>>> I am not a BTRFS dev , but this should be quite easy to answer. Unless you checksum the parity there is no way to verify that that the data (parity) you use to reconstruct other data is correct.
>>
>> In any case you could catch that the compute data is wrong, because the data is always checksummed. And in any case you must check the data against their checksum.
>>
> What if you lost an entire disk? or had corruption for both data AND checksum? How do you plan to safely reconstruct that without checksummed parity?

As a general rule, the returned data is *always* checked against its checksum. So in any case wrong data is never returned. Let me say it in other words: having the parity checksummed doesn't increase the reliability or the safety of the RAID rebuilding. I want to repeat again: even if the parity is corrupted, the rebuilt (wrong) data fails the check against its checksum and is not returned!

Back to your questions:

1) Losing 1 disk -> 1 fault
2) Losing both data and checksum -> 2 faults

RAID5 is single-fault proof. So if case #2 happens, raid5 can't protect you. However BTRFS is capable of detecting that the data is wrong thanks to the checksum.
In case #1, there is no problem, because for each stripe you have enough data to rebuild the missing one.

Because I have read several times that checksumming the parity would increase the reliability and/or the safety of the raid5/6 profile, let me explain the logic:

read_from_disk() {
	data = read_data()
	if (data != ERROR && check_checksum(data))
		return data;
	/* checksum mismatch or data is missing */
	full_stripe = read_full_stripe()
	if (raid5_profile) {
		/* raid5 has only one way of rebuilding data */
		data = rebuild_raid5_data(full_stripe)
		if (data != ERROR && check_checksum(data)) {
			rebuild_stripe(data, full_stripe)
			return data;
		}
		/* parity and/or other data is corrupted/missing */
		return ERROR
	}

	for_each raid6_rebuilding_strategies(full_stripe) {
		/* 
		 * raid6 might have more than one way of rebuilding data 
		 * depending by the degree of failure
		 */
		data = rebuild_raid6_data(full_stripe)
		if (data != ERROR && check_checksum(data)) {
			rebuild_stripe(data, full_stripe)
			/* data is good, return it */
			return data;
		}
	}
	return ERROR	
}

In the case of raid5, there is only one way of rebuilding the data. In the case of raid6 and 1 fault, there are several ways of rebuilding the data (however, in the case of two failures, there is only one way). So more possibilities have to be tested when rebuilding the data.
If the parity is corrupted, the rebuilt data is corrupted too, and it fails the check against its checksum.


> 
>> My point is that storing the checksum is a cost that you pay *every time*. Every time you update a part of a stripe you need to update the parity, and then in turn the parity checksum. It is not a problem of space occupied nor a computational problem. It is a a problem of write amplification...
> How much of a problem is this? no benchmarks have been run since the feature is not yet there I suppose.

It is simple: for each stripe touched you need to update the parity(1); then you need to update the parity(1) checksum (which in turn would require an update of the parity(2) of the stripe where the parity(1) checksum is stored, which in turn would require updating the parity(2) checksum... and so on)

> 
>>
>> The only gain is to avoid to try to use the parity when
>> a) you need it (i.e. when the data is missing and/or corrupted)
> I'm not sure I can make out your argument here , but with RAID5/6 you don't have another copy to restore from. You *have* to use the parity to reconstruct data and it is a good thing if this data is trusted.
I never said the opposite.

> 
>> and b) it is corrupted.
>> But the likelihood of this case is very low. And you can catch it during the data checksum check (which has to be performed in any case !).
>>
>> So from one side you have a *cost every time* (the write amplification), to other side you have a gain (cpu-time) *only in case* of the parity is corrupted and you need it (eg. scrub or corrupted data)).
>>
>> IMHO the cost are very higher than the gain, and the likelihood the gain is very lower compared to the likelihood (=100% or always) of the cost.
>>
> Then run benchmarks and considering making parity checksums optional (but pretty please dipped in syrup with sugar on top - keep in on by default).
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 17:25       ` Goffredo Baroncelli
  2018-05-02 18:17         ` waxhead
@ 2018-05-02 19:29         ` Austin S. Hemmelgarn
  2018-05-02 20:40           ` Goffredo Baroncelli
  2018-05-03  8:11           ` Andrei Borzenkov
  1 sibling, 2 replies; 22+ messages in thread
From: Austin S. Hemmelgarn @ 2018-05-02 19:29 UTC (permalink / raw)
  To: kreijack, waxhead, Duncan, linux-btrfs

On 2018-05-02 13:25, Goffredo Baroncelli wrote:
> On 05/02/2018 06:55 PM, waxhead wrote:
>>>
>>> So again, which problem would solve having the parity checksummed ? On the best of my knowledge nothing. In any case the data is checksummed so it is impossible to return corrupted data (modulo bug :-) ).
>>>
>> I am not a BTRFS dev , but this should be quite easy to answer. Unless you checksum the parity there is no way to verify that that the data (parity) you use to reconstruct other data is correct.
> 
> In any case you could catch that the compute data is wrong, because the data is always checksummed. And in any case you must check the data against their checksum.
> 
> My point is that storing the checksum is a cost that you pay *every time*. Every time you update a part of a stripe you need to update the parity, and then in turn the parity checksum. It is not a problem of space occupied nor a computational problem. It is a a problem of write amplification...
> 
> The only gain is to avoid to try to use the parity when
> a) you need it (i.e. when the data is missing and/or corrupted)
> and b) it is corrupted.
> But the likelihood of this case is very low. And you can catch it during the data checksum check (which has to be performed in any case !).
> 
> So from one side you have a *cost every time* (the write amplification), to other side you have a gain (cpu-time) *only in case* of the parity is corrupted and you need it (eg. scrub or corrupted data)).
> 
> IMHO the cost are very higher than the gain, and the likelihood the gain is very lower compared to the likelihood (=100% or always) of the cost.
You do realize that a write is already rewriting checksums elsewhere? 
It would be pretty trivial to make sure that the checksums for every 
part of a stripe end up in the same metadata block, at which point the 
only cost is computing the checksum (because when a checksum gets 
updated, the whole block it's in gets rewritten, period, because that's 
how CoW works).

Looking at this another way (all the math below uses SI units):

Assume you have a BTRFS raid5 volume consisting of 6 8TB disks (which 
gives you 40TB of usable space).  You're storing roughly 20TB of data on 
it, using a 16kB block size, and it sees about 1GB of writes a day, with 
no partial stripe writes.  You, for reasons of argument, want to scrub 
it every week, because the data in question matters a lot to you.

With a decent CPU, let's say you can compute 1.5GB/s worth of checksums, 
and can compute the parity at a rate of 1.25GB/s (the ratio here is about 
the average across the almost 50 systems I have quick access to check, 
including a number of server and workstation systems less than a year 
old, though the numbers themselves are artificially low to accentuate 
the point here).

At this rate, scrubbing by computing parity requires processing:

* Checksums for 20TB of data, at a rate of 1.5GB/s, which would take 
13333 seconds, or 222 minutes, or about 3.7 hours.
* Parity for 20TB of data, at a rate of 1.25GB/s, which would take 16000 
seconds, or 267 minutes, or roughly 4.4 hours.

So, over a week, you would be spending 8.1 hours processing data solely 
for data integrity, or roughly 4.8214% of your time.

Now assume instead that you're doing checksummed parity:

* Scrubbing data is the same, 3.7 hours.
* Scrubbing parity turns into computing checksums for 4TB of data, which 
would take 3200 seconds, or 53 minutes, or roughly 0.88 hours.
* Computing parity for the 7GB of data you write each week takes 5.6 
_SECONDS_.

So, over a week, you would spend just over 4.58 hours processing data 
solely for data integrity, or roughly 2.7262% of your time.

So, in terms of just time spent, it's almost twice as fast to use 
checksummed parity (roughly 43% less time, to be more specific).
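
For reference, a few lines of Python that reproduce the arithmetic above 
(the array layout, data sizes and throughput figures are the assumptions 
stated in this message, not measurements):

# Back-of-the-envelope scrub arithmetic, using the assumptions above.
TB, GB = 1e12, 1e9
data_stored   = 20 * TB
parity_stored = 4 * TB          # 20TB spread over 5 data strips -> 4TB parity
weekly_writes = 7 * GB          # 1GB of writes per day
csum_rate     = 1.5 * GB        # bytes/second of checksumming
parity_rate   = 1.25 * GB       # bytes/second of parity computation

recompute = (data_stored / csum_rate        # checksum pass over the data
             + data_stored / parity_rate)   # recompute parity over all data

with_csum = (data_stored / csum_rate        # same data checksum pass
             + parity_stored / parity_rate  # the 3200-second parity pass above
             + weekly_writes / parity_rate) # ~5.6s of parity for new writes

print("recompute every stripe: %.1f h/week" % (recompute / 3600))   # ~8.1
print("checksummed parity:     %.1f h/week" % (with_csum / 3600))   # ~4.6
print("time saved:             %.1f%%" % (100 * (1 - with_csum / recompute)))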

So, let's look at data usage:

1GB of data translates to 62500 16kB blocks of data, which equates to 
an additional 15625 blocks for parity.  Adding parity checksums adds a 
25% overhead to checksums being written, but that actually doesn't 
translate to a huge increase in the number of _blocks_ of checksums 
written.  One 16k block can hold roughly 500 checksums, so it would take 
125 blocks worth of checksums without parity, and 157 (technically 
156.25, but you can't write a quarter block) with parity checksums. 
Thus, without parity checksums, writing 1GB of data involves writing 
78250 blocks, while doing the same with parity checksums involves 
writing 78282 blocks, a net change of only 32 blocks, or **0.0409%**.
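
The same block arithmetic as a quick sketch (again using this message's 
assumed figures: SI-unit 16kB blocks, ~500 checksums per metadata block, 
and the parity-block count above):

# Block-count estimate for writing 1GB, with and without parity checksums.
data_blocks   = 1_000_000_000 // 16_000     # 62500 blocks (SI units, as above)
parity_blocks = 15_625                      # parity block count used above
csums_per_metadata_block = 500

def csum_blocks(n_csums):
    # Round up: a partially filled checksum block is still a full block write.
    return -(-n_csums // csums_per_metadata_block)

without = data_blocks + parity_blocks + csum_blocks(data_blocks)
with_pc = data_blocks + parity_blocks + csum_blocks(data_blocks + parity_blocks)
print(without, with_pc, "%.4f%%" % (100 * (with_pc - without) / without))
# -> 78250 78282 0.0409%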

Note that the difference in the amount of checksums written is a simple 
linear function, directly proportional to the amount of data being 
written, provided that all rewrites only rewrite full stripes (because 
for this purpose that's equivalent to just adding new data).  In other words, 
even if we were to increase the total amount of data that array was 
getting in a day, the net change from having parity checksumming would 
still stay within the range of 0.03-0.05%.

Making some of those partial re-writes skews the value upwards, but it 
can never be worse than 25% on a raid5 array (because you can't write 
less than a single block, and therefore the pathological worst case 
involves writing one data block, which translates to a single checksum 
and parity write, and in turn to only a single block written for parity 
checksums).  The exact level of how bad it can get is of course worse 
with higher levels of parity (it's a 33.333% increase for RAID6, 60% for 
raid with 3 parity blocks, etc).

So, given the above, this is a pretty big net win in terms of overhead 
for single-parity RAID arrays, even in the pathological worst case (25% 
higher write overhead (which happens once for each block), in exchange 
for 43% lower post-write processing overhead for data integrity (which 
usually happens way more than once for each block)).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 19:29         ` Austin S. Hemmelgarn
@ 2018-05-02 20:40           ` Goffredo Baroncelli
  2018-05-02 23:32             ` Duncan
  2018-05-03 11:26             ` Austin S. Hemmelgarn
  2018-05-03  8:11           ` Andrei Borzenkov
  1 sibling, 2 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2018-05-02 20:40 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, waxhead, Duncan, linux-btrfs

On 05/02/2018 09:29 PM, Austin S. Hemmelgarn wrote:
> On 2018-05-02 13:25, Goffredo Baroncelli wrote:
>> On 05/02/2018 06:55 PM, waxhead wrote:
>>>>
>>>> So again, which problem would solve having the parity checksummed ? On the best of my knowledge nothing. In any case the data is checksummed so it is impossible to return corrupted data (modulo bug :-) ).
>>>>
>>> I am not a BTRFS dev , but this should be quite easy to answer. Unless you checksum the parity there is no way to verify that that the data (parity) you use to reconstruct other data is correct.
>>
>> In any case you could catch that the compute data is wrong, because the data is always checksummed. And in any case you must check the data against their checksum.
>>
>> My point is that storing the checksum is a cost that you pay *every time*. Every time you update a part of a stripe you need to update the parity, and then in turn the parity checksum. It is not a problem of space occupied nor a computational problem. It is a a problem of write amplification...
>>
>> The only gain is to avoid to try to use the parity when
>> a) you need it (i.e. when the data is missing and/or corrupted)
>> and b) it is corrupted.
>> But the likelihood of this case is very low. And you can catch it during the data checksum check (which has to be performed in any case !).
>>
>> So from one side you have a *cost every time* (the write amplification), to other side you have a gain (cpu-time) *only in case* of the parity is corrupted and you need it (eg. scrub or corrupted data)).
>>
>> IMHO the cost are very higher than the gain, and the likelihood the gain is very lower compared to the likelihood (=100% or always) of the cost.
> You do realize that a write is already rewriting checksums elsewhere? It would be pretty trivial to make sure that the checksums for every part of a stripe end up in the same metadata block, at which point the only cost is computing the checksum (because when a checksum gets updated, the whole block it's in gets rewritten, period, because that's how CoW works).
> 
> Looking at this another way (all the math below uses SI units):
> 
[...]
Good point: precomputing the checksum of the parity saves a lot of time for the scrub process. You can see this more simply by saying that the parity calculation (which is dominated by the memory bandwidth) is like O(n) (where n is the number of disks); checking the parity (again dominated by the memory bandwidth) against a checksum is like O(1). And when the data written is 2 orders of magnitude less than the data stored, the effort required to precompute the checksum is negligible.

Anyway, my "rant" started when Ducan put near the missing of parity checksum and the write hole. The first might be a performance problem. Instead the write hole could lead to a loosing data. My intention was to highlight that the parity-checksum is not related to the reliability and safety of raid5/6.

> 
> So, lets look at data usage:
> 
> 1GB of data is translates to 62500 16kB blocks of data, which equates to an additional 15625 blocks for parity.  Adding parity checksums adds a 25% overhead to checksums being written, but that actually doesn't translate to a huge increase in the number of _blocks_ of checksums written.  One 16k block can hold roughly 500 checksums, so it would take 125 blocks worth of checksums without parity, and 157 (technically 156.25, but you can't write a quarter block) with parity checksums. Thus, without parity checksums, writing 1GB of data involves writing 78250 blocks, while doing the same with parity checksums involves writing 78282 blocks, a net change of only 32 blocks, or **0.0409%**.

How would you store the checksum?
I ask that because I am not sure that we could use the "standard" btrfs metadata to store the checksum of the parity. Doing so, you could face a pathological effect like:
- update a block(1) in a stripe(1)
- update the parity of stripe(1) containing block(1)
- update the checksum of parity stripe(1), which is contained in another stripe(2) [**]

- update the parity of stripe(2) containing the checksum of parity stripe(1)
- update the checksum of parity stripe(2), which is contained in another stripe(3)

and so on...

[**] note that the checksum and the parity *have* to be in different stripes, otherwise you have the chicken/egg problem: compute the parity, then update the checksum, then update the parity again because the checksum has changed....

In order to avoid that, I fear that you can't store the checksums in a raid5/6 BG whose parity is itself checksummed;

It is a bit late and I am a bit tired, so I may be wrong; however, I fear that the above "write amplification" may be a big problem. A possible solution would be storing the checksums in an N-mirror BG (where N is 1 for raid5, 2 for raid6....)
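
A tiny sketch of that chain (a toy model only; the placement rule "the checksum of stripe N's parity lives in stripe N+1" is an arbitrary assumption to show the behaviour, not how btrfs lays anything out):

# How many stripes end up being touched by one small update, depending on
# where the parity checksums are stored.  Toy model with an assumed layout.
def stripes_touched(first_stripe, total_stripes, csums_in_raid56_bg):
    touched = []
    stripe = first_stripe
    while stripe < total_stripes:
        touched.append(stripe)      # this stripe's parity is rewritten
        if not csums_in_raid56_bg:
            break                   # parity csum lives in a mirrored BG: done
        stripe += 1                 # its checksum dirties the next stripe...
    return touched

print(stripes_touched(0, 10, csums_in_raid56_bg=True))    # [0, 1, ..., 9]
print(stripes_touched(0, 10, csums_in_raid56_bg=False))   # [0]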

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 18:50           ` Andrei Borzenkov
@ 2018-05-02 21:20             ` waxhead
  2018-05-02 21:54               ` Goffredo Baroncelli
  0 siblings, 1 reply; 22+ messages in thread
From: waxhead @ 2018-05-02 21:20 UTC (permalink / raw)
  To: Andrei Borzenkov, kreijack, Duncan, linux-btrfs



Andrei Borzenkov wrote:
> 02.05.2018 21:17, waxhead пишет:
>> Goffredo Baroncelli wrote:
>>> On 05/02/2018 06:55 PM, waxhead wrote:
>>>>>
>>>>> So again, which problem would solve having the parity checksummed ?
>>>>> On the best of my knowledge nothing. In any case the data is
>>>>> checksummed so it is impossible to return corrupted data (modulo bug
>>>>> :-) ).
>>>>>
>>>> I am not a BTRFS dev , but this should be quite easy to answer.
>>>> Unless you checksum the parity there is no way to verify that that
>>>> the data (parity) you use to reconstruct other data is correct.
>>>
>>> In any case you could catch that the compute data is wrong, because
>>> the data is always checksummed. And in any case you must check the
>>> data against their checksum.
>>>
>> What if you lost an entire disk?
> 
> How does it matter exactly? RAID is per chunk anyway.
> 
It does not matter. I was wrong - I got bitten by thinking about BTRFS 
"RAID5" as normal RAID5. Again, a good reason to change the naming for 
it, I think...

>> or had corruption for both data AND checksum?
> 
> By the same logic you may have corrupted parity and its checksum.
> 
Yup. Indeed

>> How do you plan to safely reconstruct that without checksummed
>> parity?
>>
> 
> Define "safely". The main problem of current RAID56 implementation is
> that stripe is not updated atomically (at least, that is what I
> understood from the past discussions) and this is not solved by having
> extra parity checksum. So how exactly "safety" is improved here? You
> still need overall checksum to verify result of reconstruction, what
> exactly extra parity checksum buys you?
> 
 > [...]
>>
> 
> Again - please describe when having parity checksum will be beneficial
> over current implementation. You do not reconstruct anything as long as
> all data strips are there, so parity checksum will not be used. If one
> data strip fails (including checksum) it will be reconstructed and
> verified. If parity itself is corrupted, checksum verification fails
> (hopefully). How is it different from verifying parity checksum before
> reconstructing? In both cases data cannot be reconstructed, end of story.

Ok, before attempting an answer I have to admit that I do not know 
enough about how RAID56 is laid out on disk in BTRFS terms. Is data 
checksummed per stripe or per disk? Is parity calculated on the data 
only, or is it calculated on the data+checksum?!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 21:20             ` waxhead
@ 2018-05-02 21:54               ` Goffredo Baroncelli
  0 siblings, 0 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2018-05-02 21:54 UTC (permalink / raw)
  To: waxhead, Andrei Borzenkov, Duncan, linux-btrfs

On 05/02/2018 11:20 PM, waxhead wrote:
> 
> 
[...]
> 
> Ok, before attempting and answer I have to admit that I do not know enough about how RAID56 is laid out on disk in BTRFS terms. Is data checksummed pr. stripe or pr. disk? Is parity calculated on the data only or is it calculated on the data+checksum ?!
> 

Data is checksummed per block. The parity is invisible to the checksums. The parity blocks are allocated in an "address space" parallel to the data "address space" exposed by the BG.
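
A small sketch of that layout as I understand the description (plain Python; the structure is illustrative only, not the actual on-disk format):

# Data blocks live in the block group's logical address space and each has a
# checksum-tree entry; parity lives in a parallel space with no csum entries.
stripe = {
    "logical":  {"data0": b"D0", "data1": b"D1", "data2": b"D2"},
    "parallel": {"parity": b"P"},      # parity "address space"
}
csum_tree = {name: hash(block) for name, block in stripe["logical"].items()}
assert "parity" not in csum_tree       # the parity is invisible to the csums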

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 20:40           ` Goffredo Baroncelli
@ 2018-05-02 23:32             ` Duncan
  2018-05-03 11:26             ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 22+ messages in thread
From: Duncan @ 2018-05-02 23:32 UTC (permalink / raw)
  To: linux-btrfs

Goffredo Baroncelli posted on Wed, 02 May 2018 22:40:27 +0200 as
excerpted:

> Anyway, my "rant" started when Ducan put near the missing of parity
> checksum and the write hole. The first might be a performance problem.
> Instead the write hole could lead to a loosing data. My intention was to
> highlight that the parity-checksum is not related to the reliability and
> safety of raid5/6.

Thanks for making that point... and to everyone else for the vigorous 
thread debating it, as I'm learning quite a lot! =:^)

From your first reply:

>> Why the fact that the parity is not checksummed is a problem ?
>> I read several times that this is a problem. However each time the
>> thread reached the conclusion that... it is not a problem.

I must have missed those threads, or at least, missed that conclusion 
from them (maybe believing they were about something rather narrower, or 
conflating... for instance), because AFAICT, this is the first time I've 
seen the practical merits of checksummed parity actually debated, at 
least in terms I as a non-dev can reasonably understand.  To my mind it 
was settled (or I'd have worded my original claim rather differently) and 
only now am I learning different.

And... to my credit... given the healthy vigor of the debate, it seems 
I'm not the only one that missed them...

But I'm surely learning of it now, and indeed, I had somewhat conflated 
parity-checksumming with the in-place-stripe-read-modify-write atomicity 
issue.  I'll leave the parity-checksumming debate (now that I know it at 
least remains debatable) to those more knowledgeable than myself, but in 
addition to what I've learned of it, I've definitely learned that I can't 
properly conflate it with the in-place stripe-rmw atomicity issue, so 
thanks!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 19:29         ` Austin S. Hemmelgarn
  2018-05-02 20:40           ` Goffredo Baroncelli
@ 2018-05-03  8:11           ` Andrei Borzenkov
  2018-05-03 11:28             ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 22+ messages in thread
From: Andrei Borzenkov @ 2018-05-03  8:11 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Goffredo Baroncelli, waxhead, Duncan, Btrfs BTRFS

On Wed, May 2, 2018 at 10:29 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
...
>
> Assume you have a BTRFS raid5 volume consisting of 6 8TB disks (which gives
> you 40TB of usable space).  You're storing roughly 20TB of data on it, using
> a 16kB block size, and it sees about 1GB of writes a day, with no partial
> stripe writes.  You, for reasons of argument, want to scrub it every week,
> because the data in question matters a lot to you.
>
> With a decent CPU, lets say you can compute 1.5GB/s worth of checksums, and
> can compute the parity at a rate of 1.25G/s (the ratio here is about the
> average across the almost 50 systems I have quick access to check, including
> a number of server and workstation systems less than a year old, though the
> numbers themselves are artificially low to accentuate the point here).
>
> At this rate, scrubbing by computing parity requires processing:
>
> * Checksums for 20TB of data, at a rate of 1.5GB/s, which would take 13333
> seconds, or 222 minutes, or about 3.7 hours.
> * Parity for 20TB of data, at a rate of 1.25GB/s, which would take 16000
> seconds, or 267 minutes, or roughly 4.4 hours.
>
> So, over a week, you would be spending 8.1 hours processing data solely for
> data integrity, or roughly 4.8214% of your time.
>
> Now assume instead that you're doing checksummed parity:
>
> * Scrubbing data is the same, 3.7 hours.
> * Scrubbing parity turns into computing checksums for 4TB of data, which
> would take 3200 seconds, or 53 minutes, or roughly 0.88 hours.

Scrubbing must compute the parity and compare it with the stored value to
detect the write hole. Otherwise you can end up with parity that has a
good checksum but does not match the rest of the data.
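
A small illustration of that failure mode (toy model in Python, XOR parity
and zlib.crc32 as stand-ins; not btrfs code):

# After an interrupted stripe update the parity can be stale yet still match
# its stored checksum; only recomputing it from the data reveals the problem.
import zlib

def xor2(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

old_data = [b"\x01" * 16, b"\x02" * 16]
parity = xor2(*old_data)
parity_csum = zlib.crc32(parity)        # checksum written for the old parity

new_data = [b"\x07" * 16, old_data[1]]  # block 0 rewritten, parity write lost

print(zlib.crc32(parity) == parity_csum)   # True: the checksum still "passes"
print(xor2(*new_data) == parity)           # False: the parity is stale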

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 20:40           ` Goffredo Baroncelli
  2018-05-02 23:32             ` Duncan
@ 2018-05-03 11:26             ` Austin S. Hemmelgarn
  2018-05-03 19:00               ` Goffredo Baroncelli
  1 sibling, 1 reply; 22+ messages in thread
From: Austin S. Hemmelgarn @ 2018-05-03 11:26 UTC (permalink / raw)
  To: kreijack, waxhead, Duncan, linux-btrfs

On 2018-05-02 16:40, Goffredo Baroncelli wrote:
> On 05/02/2018 09:29 PM, Austin S. Hemmelgarn wrote:
>> On 2018-05-02 13:25, Goffredo Baroncelli wrote:
>>> On 05/02/2018 06:55 PM, waxhead wrote:
>>>>>
>>>>> So again, which problem would solve having the parity checksummed ? On the best of my knowledge nothing. In any case the data is checksummed so it is impossible to return corrupted data (modulo bug :-) ).
>>>>>
>>>> I am not a BTRFS dev , but this should be quite easy to answer. Unless you checksum the parity there is no way to verify that that the data (parity) you use to reconstruct other data is correct.
>>>
>>> In any case you could catch that the compute data is wrong, because the data is always checksummed. And in any case you must check the data against their checksum.
>>>
>>> My point is that storing the checksum is a cost that you pay *every time*. Every time you update a part of a stripe you need to update the parity, and then in turn the parity checksum. It is not a problem of space occupied nor a computational problem. It is a a problem of write amplification...
>>>
>>> The only gain is to avoid to try to use the parity when
>>> a) you need it (i.e. when the data is missing and/or corrupted)
>>> and b) it is corrupted.
>>> But the likelihood of this case is very low. And you can catch it during the data checksum check (which has to be performed in any case !).
>>>
>>> So from one side you have a *cost every time* (the write amplification), to other side you have a gain (cpu-time) *only in case* of the parity is corrupted and you need it (eg. scrub or corrupted data)).
>>>
>>> IMHO the cost are very higher than the gain, and the likelihood the gain is very lower compared to the likelihood (=100% or always) of the cost.
>> You do realize that a write is already rewriting checksums elsewhere? It would be pretty trivial to make sure that the checksums for every part of a stripe end up in the same metadata block, at which point the only cost is computing the checksum (because when a checksum gets updated, the whole block it's in gets rewritten, period, because that's how CoW works).
>>
>> Looking at this another way (all the math below uses SI units):
>>
> [...]
> Good point: precomputing the checksum of the parity save a lot of time for the scrub process. You can see this in a more simply way saying that the parity calculation (which is dominated by the memory bandwidth) is like O(n) (where n is the number of disk); the parity checking (which again is dominated by the memory bandwidth) against a checksum is like O(1). And when the data written is 2 order of magnitude lesser than the data stored, the effort required to precompute the checksum is negligible.
Excellent point about the computational efficiency, I had not thought of 
framing things that way.
> 
> Anyway, my "rant" started when Ducan put near the missing of parity checksum and the write hole. The first might be a performance problem. Instead the write hole could lead to a loosing data. My intention was to highlight that the parity-checksum is not related to the reliability and safety of raid5/6.
It may not be related to the safety, but it is arguably indirectly 
related to the reliability, depending on your definition of reliability. 
Spending less time verifying the parity means you're spending less 
time in an indeterminate state of usability, which arguably does improve 
the reliability of the system.  However, that still has nothing to 
do with the write hole.
> 
>>
>> So, lets look at data usage:
>>
>> 1GB of data is translates to 62500 16kB blocks of data, which equates to an additional 15625 blocks for parity.  Adding parity checksums adds a 25% overhead to checksums being written, but that actually doesn't translate to a huge increase in the number of _blocks_ of checksums written.  One 16k block can hold roughly 500 checksums, so it would take 125 blocks worth of checksums without parity, and 157 (technically 156.25, but you can't write a quarter block) with parity checksums. Thus, without parity checksums, writing 1GB of data involves writing 78250 blocks, while doing the same with parity checksums involves writing 78282 blocks, a net change of only 32 blocks, or **0.0409%**.
> 
> How you would store the checksum ?
> I asked that because I am not sure that we could use the "standard" btrfs metadata to store the checksum of the parity. Doing so you could face some pathological effect like:
> - update a block(1) in a stripe(1)
> - update the parity of stripe(1) containing block(1)
> - update the checksum of parity stripe (1), which is contained in another stripe(2) [**]
> 
> - update the parity of stripe (2) containing the checksum of parity stripe(1)
> - update the checksum of parity stripe (2), which is contained in another stripe(3)
> 
> and so on...
> 
> [**] pay attention that the checksum and the parity *have* to be in different stripe, otherwise you have the egg/chicken problem: compute the parity, then update the checksum, then update the parity again because the checksum is changed....
Unless I'm completely mistaken about how BTRFS handles parity, it's 
accounted as part of whatever type of block it's for.  IOW, for data 
chunks, it would be no different from storing checksums for data, and if 
you're running metadata as parity RAID (which is debatably a bad idea in 
most cases), checksums for that parity would be handled just like 
regular metadata checksums.  Do note that there is a separate tree 
structure on-disk that handles the checksums, and I'm not sure if it 
even makes sense (given how small it is) to checksum the parity for that.
> 
> In order to avoid that, I fear that you can't store the checksum over a raid5/6 BG with parity checksummed;
> 
> It is a bit late and I am a bit tired out, so may be that I am wrong however I fear that the above "write amplification problem" may be a big problem; a possible solution would be storing the checksum in a N-mirror BG (where N is 1 for raid5, 2 for raid6....)
N would have to be one more than the number of parities to maintain the 
same guarantees of reliability.  That said, if we had N-way mirroring, 
that would be far better for metadata on most arrays than parity RAID, 
as the primary benefit of parity RAID (space efficiency) doesn't really 
matter much for metadata compared to access performance.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-03  8:11           ` Andrei Borzenkov
@ 2018-05-03 11:28             ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 22+ messages in thread
From: Austin S. Hemmelgarn @ 2018-05-03 11:28 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Goffredo Baroncelli, waxhead, Duncan, Btrfs BTRFS

On 2018-05-03 04:11, Andrei Borzenkov wrote:
> On Wed, May 2, 2018 at 10:29 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> ...
>>
>> Assume you have a BTRFS raid5 volume consisting of 6 8TB disks (which gives
>> you 40TB of usable space).  You're storing roughly 20TB of data on it, using
>> a 16kB block size, and it sees about 1GB of writes a day, with no partial
>> stripe writes.  You, for reasons of argument, want to scrub it every week,
>> because the data in question matters a lot to you.
>>
>> With a decent CPU, lets say you can compute 1.5GB/s worth of checksums, and
>> can compute the parity at a rate of 1.25G/s (the ratio here is about the
>> average across the almost 50 systems I have quick access to check, including
>> a number of server and workstation systems less than a year old, though the
>> numbers themselves are artificially low to accentuate the point here).
>>
>> At this rate, scrubbing by computing parity requires processing:
>>
>> * Checksums for 20TB of data, at a rate of 1.5GB/s, which would take 13333
>> seconds, or 222 minutes, or about 3.7 hours.
>> * Parity for 20TB of data, at a rate of 1.25GB/s, which would take 16000
>> seconds, or 267 minutes, or roughly 4.4 hours.
>>
>> So, over a week, you would be spending 8.1 hours processing data solely for
>> data integrity, or roughly 4.8214% of your time.
>>
>> Now assume instead that you're doing checksummed parity:
>>
>> * Scrubbing data is the same, 3.7 hours.
>> * Scrubbing parity turns into computing checksums for 4TB of data, which
>> would take 3200 seconds, or 53 minutes, or roughly 0.88 hours.
> 
> Scrubbing must compute parity and compare with stored value to detect
> write hole. Otherwise you end up with parity having good checksum but
> not matching rest of data.
Yes, but that assumes we don't do anything to deal with the write hole, 
and it's been pretty much decided by the devs that they're going to try 
and close the write hole.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-01 21:57 RAID56 - 6 parity raid Gandalf Corvotempesta
  2018-05-02  1:47 ` Duncan
@ 2018-05-03 12:47 ` Alberto Bursi
  2018-05-03 19:03   ` Goffredo Baroncelli
  1 sibling, 1 reply; 22+ messages in thread
From: Alberto Bursi @ 2018-05-03 12:47 UTC (permalink / raw)
  To: Gandalf Corvotempesta, linux-btrfs




On 01/05/2018 23:57, Gandalf Corvotempesta wrote:
> Hi to all
> I've found some patches from Andrea Mazzoleni that add support for up to
> 6-parity raid.
> Why weren't these merged?
> With modern disk sizes, having something greater than 2 parities would be
> great.

His patch was about a generic library to do RAID6; it wasn't directly 
for btrfs.

To actually use it for btrfs, someone would have to port btrfs 
to that library.

-Alberto

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-03 11:26             ` Austin S. Hemmelgarn
@ 2018-05-03 19:00               ` Goffredo Baroncelli
  0 siblings, 0 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2018-05-03 19:00 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, waxhead, Duncan, linux-btrfs

On 05/03/2018 01:26 PM, Austin S. Hemmelgarn wrote:
>> My intention was to highlight that the parity-checksum is not related to the reliability and safety of raid5/6.
> It may not be related to the safety, but it is arguably indirectly related to the reliability, dependent on your definition of reliability.  Spending less time verifying the parity means you're spending less time in an indeterminate state of usability, which arguably does improve the reliability of the system.  However, that does still have nothing to do with the write hole.

If you start a scrub once per week, the fact that the scrub requires 1 hour or 1 day doesn't impact the reliability, because in any case you have up to 1 week of un-scrubbed data.


BR 
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-03 12:47 ` Alberto Bursi
@ 2018-05-03 19:03   ` Goffredo Baroncelli
  0 siblings, 0 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2018-05-03 19:03 UTC (permalink / raw)
  To: Alberto Bursi, Gandalf Corvotempesta, linux-btrfs

On 05/03/2018 02:47 PM, Alberto Bursi wrote:
> 
> 
> On 01/05/2018 23:57, Gandalf Corvotempesta wrote:
>> Hi to all
>> I've found some patches from Andrea Mazzoleni that adds support up to 6
>> parity raid.
>> Why these are wasn't merged ?
>> With modern disk size, having something greater than 2 parity, would be
>> great.
> 
> His patch was about a generic library to do RAID6; it wasn't directly 
> for btrfs.
> 
> To actually use it for btrfs, someone would have to port btrfs to that 
> library.

In the past Andrea ported this library to btrfs too

https://lwn.net/Articles/588106/

> -Alberto

G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
  2018-05-02 19:25 Gandalf Corvotempesta
@ 2018-05-02 23:07 ` Duncan
  0 siblings, 0 replies; 22+ messages in thread
From: Duncan @ 2018-05-02 23:07 UTC (permalink / raw)
  To: linux-btrfs

Gandalf Corvotempesta posted on Wed, 02 May 2018 19:25:41 +0000 as
excerpted:

> On 05/02/2018 03:47 AM, Duncan wrote:
>> Meanwhile, have you looked at zfs? Perhaps they have something like
>> that?
> 
> Yes, I've looked at ZFS and I'm using it on some servers, but I don't
> like it too much for multiple reasons, for example:
> 
> 1) it's not officially in the kernel; we have to build a module every
> time with DKMS

FWIW zfs is excluded from my choice domain as well, due to the well-known 
license issues.  Regardless of the strict legal implications, Oracle holds 
the copyrights and could easily solve that problem, and the fact that they 
haven't strongly suggests they have no interest in doing so.  That in turn 
means they have no interest in people like me running zfs, which means I 
have no interest in it either.

But because it does remain effectively the nearest "working now" 
equivalent to btrfs features and potential features, it's what I and 
others point to when people ask about missing or unstable btrfs features, 
at least for those who simply _must_ have them and/or find it more 
acceptable than cobbling together a multi-layer solution out of a 
standard filesystem on top of device-mapper or whatever.

> I'm new to BTRFS (in fact, I'm not using it) and I've seen in the status
> page that "it's almost ready".
> The only real missing part is a stable, secure and properly working
> RAID56,
> so I'm wondering why most of the effort isn't directed at fixing RAID56?

Well, they are.  But finding and fixing corner-case bugs takes time and 
early-adopter deployments, and btrfs doesn't have the kind of engineering 
resources to throw at the problem that Sun had for zfs.

Despite that, as I stated, current btrfs raid56 is, to the best of my/
list knowledge, now reasonably ready, tho it'll take another year or two 
without serious bug reports to actually demonstrate that.  What it does 
still have is the well-known write hole that applies to all parity-raid 
unless specific measures are taken, such as:

* partial-stripe-write logging (slow),
* writing a full stripe even if it's partially empty (wastes space and 
needs periodic maintenance to reclaim it), or
* variable stripe widths (also needs periodic maintenance, and more 
complex than always writing full stripes even if they're partially empty).

The latter two avoid the problem by avoiding the in-place 
read-modify-write cycle entirely.
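
To make the write hole concrete, here's a minimal sketch of a toy 3-disk 
RAID5 stripe (two data strips plus XOR parity) and a partial-stripe 
read-modify-write interrupted part way through.  The layout and names 
are illustrative only, not btrfs's actual on-disk format:

  def xor(a, b):
      return bytes(x ^ y for x, y in zip(a, b))

  # On-disk stripe: two data strips and their XOR parity.
  d0, d1 = b"AAAA", b"BBBB"
  p = xor(d0, d1)

  # Partial-stripe update of d0 via read-modify-write:
  new_d0 = b"CCCC"
  new_p = xor(xor(p, d0), new_d0)   # old parity ^ old d0 ^ new d0

  # Step 1: write new_d0 in place.  Suppose we crash right here,
  # before step 2 writes new_p.  The stripe now holds new_d0 with
  # the OLD parity:
  assert xor(new_d0, d1) != p       # parity no longer matches the data

  # If d1's disk later dies, reconstructing d1 as new_d0 ^ p returns
  # garbage, even though d1 itself was never touched.  That is the
  # write hole; full-stripe writes or a stripe log close the window
  # where data and parity disagree on disk.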

So to a large degree what's left is simply time for testing to 
demonstrate stability on the one hand, and a well-known problem with 
parity-raid in general on the other.  There's the small detail that said 
write hole has additional implementation-detail implications on btrfs, 
but at its root it's the same problem all parity-raid has, and people 
choosing parity-raid as a solution are already choosing to either live 
with it or ameliorate it in some other way (tho some parity-raid 
solutions have that amelioration built in).

> There are some environments where a RAID1/10 is too expensive and a
> RAID6 is mandatory, but with the current state of RAID56, BTRFS can't
> be used for valuable data.

Not entirely true.  Btrfs, even btrfs raid56 mode, _can_ be used for 
"valuable" data; it simply requires an astute /practical/ definition of 
"valuable", as opposed to simple claims that don't actually stand up in 
practice.

Here's what I mean:  The sysadmin's first rule of backups defines 
"valuable data" by the number of backups it's worth making of that data.  
If there are no backups, then by definition the data is worth less than 
the time/hassle/resources necessary to have that backup, because it's 
not a question of if, but of when, something goes wrong with the working 
copy and it's no longer available.

Additional layers of backup and whether one keeps geographically 
separated off-site backups as well are simply extensions of the first-
level-backup case/rule.  The more valuable the data, the more backups 
it's worth having of it, and the more effort is justified in ensuring 
that single or even multiple disasters aren't going to leave no working 
backup.

With this view, it's perfectly fine to use btrfs raid56 mode for 
"valuable" data, because that data is backed up and that backup can be 
used as a fallback if necessary.  True, the "working copy" might not be 
as reliable as it is in some cases, but statistically, that simply brings 
the 50% chance of failure (or whatever other percentage chance you 
choose) closer, to say once a year or once a month, rather than perhaps 
once or twice a decade.  Working-copy failure is GOING to happen in any 
case; it's just a matter of playing the odds as to when, and using a 
filesystem mode whose reliability hasn't yet been fully demonstrated 
simply raises those odds a bit.

But if the data really *is* defined as "valuable", not simply /claimed/ 
to be valuable, then by that same definition, it *will* have a backup.

In the worst case, when some component of the storage platform (here the 
filesystem) is purely for testing and expected to fail in near real time, 
the otherwise-working copy is often not even considered the working copy 
any longer, but rather the testing copy, of garbage value, because it's 
/expected/ to be destroyed by the test and therefore can't be considered 
the working copy.  In that case, what would be the first-line backup is 
actually the working copy, and if the testing copy (that would otherwise 
be the working copy) happens to survive the test, it's often deliberately 
destroyed anyway, with a mkfs (for the filesystem layer; a device replace 
if it's the hardware layer, etc.) destroying the data and setting up for 
the next test or the actual working deployment.

And by that view, even if btrfs raid56 mode is defined as entirely 
unreliable, it can still be used for "valuable" data, because by 
definition "valuable" data will be backed up.  Should the working copy 
fail for any reason (remembering that even in the best case it *WILL* 
fail, the only question being when, not if), no problem: the data *was* 
defined as valuable enough to have a backup, which can simply be 
restored, or fallen back to if the first-line backup is deployable as a 
fallback and there are further backups to allow that without demoting 
the value to trivial because the working copy would then be the /only/ 
copy.

> Also, I've seen that a dedicated disk is needed to fix the write hole?
> Is this true?
> Can't I create a 6-disk RAID6 with only 6 disks and no write hole, like
> with ZFS?

A dedicated disk is not /necessary/, tho depending on the chosen 
mitigation strategy, it might be /useful/faster/.

For partial-stripe-write-logging, a comparatively fast device, say an ssd 
on an otherwise still legacy spinning-rust raid array, will help 
alleviate the speed issue.  But again, parity-raid isn't normally a go-to 
solution where performance is paramount in any case, so just using 
another ordinary spinning-rust device may work, if the performance level 
is acceptable.

For either always-full-stripe writes (writing zeros if the write would 
otherwise be too small) or variable-stripe widths (smaller stripes if 
necessary, down to a single data strip plus parity), the tradeoff is 
different and a dedicated logging device isn't used.
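
As a rough illustration of why the logging device matters, here's a 
minimal sketch of partial-stripe-write logging.  The file-based log, the 
helper names and the record format are all made up for the example; 
nothing here is btrfs's actual mechanism:

  import json, os

  def write_strips_in_place(stripe_id, new_strips):
      # Stand-in for the real in-place RAID update (hypothetical).
      pass

  def logged_partial_write(log_path, stripe_id, new_strips):
      """new_strips: {device_index: bytes}, the updated data strips
      plus the recomputed parity strip for one partial-stripe write."""
      record = {"stripe": stripe_id,
                "strips": {str(i): s.hex() for i, s in new_strips.items()}}
      with open(log_path, "w") as log:
          json.dump(record, log)        # 1. persist the full update...
          log.flush()
          os.fsync(log.fileno())        # ...before touching the stripe
      write_strips_in_place(stripe_id, new_strips)   # 2. in-place update
      os.remove(log_path)               # 3. retire the log entry

  # After a crash, any surviving log entry is simply replayed (its
  # strips are rewritten in place), so data and parity end up
  # consistent no matter where the crash happened.  The cost: every
  # partial-stripe write hits the log device first, which is why a
  # fast dedicated device helps.
  # e.g. logged_partial_write("/ssd/stripe-7.log", 7,
  #                           {0: b"new data", 5: b"new parity"})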

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RAID56 - 6 parity raid
@ 2018-05-02 19:25 Gandalf Corvotempesta
  2018-05-02 23:07 ` Duncan
  0 siblings, 1 reply; 22+ messages in thread
From: Gandalf Corvotempesta @ 2018-05-02 19:25 UTC (permalink / raw)
  To: linux-btrfs

On 05/02/2018 03:47 AM, Duncan wrote:
> Meanwhile, have you looked at zfs? Perhaps they have something like that?

Yes, I've looked at ZFS and I'm using it on some servers, but I don't like
it too much for multiple reasons, for example:

1) it's not officially in the kernel; we have to build a module every time
with DKMS
2) it does not forgive: if you add the wrong device to a pool, you are
stuck; you can't remove it without migrating all the data and creating a
new pool from scratch. If, by mistake, you add a single device to a
RAID-Z3, you totally lose the whole redundancy, and so on.
3) it doesn't support expanding a RAID-Z one disk at a time; if you want to
expand a RAIDZ, you have to create another pool and then stripe over it.

I'm new to BTRFS (in fact, I'm not using it) and I've seen in the status
page that "it's almost ready".
The only real missing part is a stable, secure and properly working RAID56,
so I'm wondering why most of the effort isn't directed at fixing RAID56?

There are some environments where a RAID1/10 is too expensive and a RAID6
is mandatory, but with the current state of RAID56, BTRFS can't be used for
valuable data.

Also, I've seen that a dedicated disk is needed to fix the write hole? Is
this true?
Can't I create a 6-disk RAID6 with only 6 disks and no write hole, like
with ZFS?

^ permalink raw reply	[flat|nested] 22+ messages in thread


Thread overview: 22+ messages
2018-05-01 21:57 RAID56 - 6 parity raid Gandalf Corvotempesta
2018-05-02  1:47 ` Duncan
2018-05-02 16:27   ` Goffredo Baroncelli
2018-05-02 16:55     ` waxhead
2018-05-02 17:19       ` Austin S. Hemmelgarn
2018-05-02 17:25       ` Goffredo Baroncelli
2018-05-02 18:17         ` waxhead
2018-05-02 18:50           ` Andrei Borzenkov
2018-05-02 21:20             ` waxhead
2018-05-02 21:54               ` Goffredo Baroncelli
2018-05-02 19:04           ` Goffredo Baroncelli
2018-05-02 19:29         ` Austin S. Hemmelgarn
2018-05-02 20:40           ` Goffredo Baroncelli
2018-05-02 23:32             ` Duncan
2018-05-03 11:26             ` Austin S. Hemmelgarn
2018-05-03 19:00               ` Goffredo Baroncelli
2018-05-03  8:11           ` Andrei Borzenkov
2018-05-03 11:28             ` Austin S. Hemmelgarn
2018-05-03 12:47 ` Alberto Bursi
2018-05-03 19:03   ` Goffredo Baroncelli
2018-05-02 19:25 Gandalf Corvotempesta
2018-05-02 23:07 ` Duncan
