* Is this enough for us to have triple-parity RAID?
@ 2012-04-16 10:04 Alex
  0 siblings, 0 replies; 37+ messages in thread
From: Alex @ 2012-04-16 10:04 UTC (permalink / raw)
  To: linux-btrfs

Hey guys,
I have done some work on triple-parity RAID and posted it on
linuxquestions.org.
Please go there and search for "Is this enough for us to have
triple-parity RAID?"

As a die-hard follower of Linux, I truly wish btrfs would have
triple-parity RAID integrated in the future. For those who are
interested, please take a look and give me some feedback.

creamyfish

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-27 15:15             ` Emmanuel Noobadmin
@ 2012-05-01 16:38               ` Alex
  0 siblings, 0 replies; 37+ messages in thread
From: Alex @ 2012-05-01 16:38 UTC (permalink / raw)
  To: Emmanuel Noobadmin; +Cc: linux-raid

Hi Emmanuel,

The real reason for an HDD's performance bottleneck is the mechanical
movement inside it; RPM * density is just a reflection of that. When
optimizing for better performance, it is always more desirable to try
to remove the mechanical parts than to play with RPM * density.

There are three types of mechanical movements and interaction inside a HDD:
1. The spinning of the platter (the one we want to eliminate most)
2. The movement of the actuator arm to position the read/write head on
the right track (relatively tamed)
3. The fluid-dynamic interaction between the platter and the
read/write head through a thin layer of fast-moving air
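
To put some rough, assumed numbers on how much these movements cost
(illustrative only, not measurements):

  rpm = 7200                                   # assumed spindle speed
  avg_rotational_latency_ms = 60.0 / rpm / 2 * 1000   # wait half a revolution on average
  assumed_avg_seek_ms = 8.5                    # assumed typical actuator seek time
  print(round(avg_rotational_latency_ms, 2))   # ~4.17 ms from the spinning platter alone
  print(round(assumed_avg_seek_ms + avg_rotational_latency_ms, 2))
  # ~12.67 ms per random access, versus ~0.0001 ms for purely electronic access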

With Ostler's technology, 3 is no longer necessary for the write head,
and the platter is no longer tightly coupled with the write head as it
was. This might lead to the elimination of the spinning platter, and my
ideas were just some premature attempts in that direction.

If this were to succeed, we would still need to develop a new way to
read the data from the now non-spinning storage medium. I don't even
have a premature idea for it yet, but one observation is this: since we
can always use a cache to improve access response time, the new reading
method may not need to be 1000X faster than the original one to be
comparable to the new writing method.
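
As a rough illustration of that observation (assumed numbers only):

  t_medium_ms = 8.0      # assumed access time of the slow, non-spinning medium
  t_cache_ms = 0.01      # assumed cache access time
  for hit_rate in (0.90, 0.99):
      t_eff = hit_rate * t_cache_ms + (1 - hit_rate) * t_medium_ms
      print(hit_rate, round(t_eff, 3))   # 0.9 -> 0.809 ms, 0.99 -> 0.09 ms
  # With a good hit rate the effective read latency drops by one or two
  # orders of magnitude without the medium itself getting any faster.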

Finally, we would probably need a new way to coordinate the read and
write operations of the new disk, since the new read head and write
head are likely to be separate and the current simple scheme will not
work any more.

This doesn't sound like a plan we can work out via an exchange of
e-mail messages, so I will stop right here. I will definitely follow
Ostler's work and see how things go. Good luck!

Cheers,
Alex

On Fri, Apr 27, 2012 at 11:15 PM, Emmanuel Noobadmin
<centos.admin@gmail.com> wrote:
> On 4/26/12, Alex <creamyfish@gmail.com> wrote:
>
>> Yes if we can't make reading as fast as writing. But in general having a
>> disk with write speed faster than read speed might still be an interesting
>> choice for applications with a lot of disk write.
>
> Pardon me if I might be missing something here. The proposition here
> is that with the laser writing technique, we can increase the data
> rate by having new type write heads which swivel instead of the
> traditional moving head over rotating platter isn't it?
>
> But since we cannot do the same for reading, we are stuck with the
> traditional head/platter mechanism. How would the write speed increase
> independently of the RPM x density general case?  It seems to me that
> we cannot expect a breakthrough in write/read without a fundamental
> improvement in the read function.
>
> It's like we have two person in a car, one collecting payment while
> one gives out goods. Now we increase the speed of the guy giving out
> goods (writing) but the total number units moved will still be the
> same because the car can't go any faster than the guy collecting
> payment (reading). Only difference is that the write head has more
> idle time between bits?
>
> Unless we are suggesting a hybrid system which has an independent
> swiveling write head and a traditional read head? But that sounds
> overly complicated and unlikely to work.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-26  2:30           ` Alex
@ 2012-04-27 15:15             ` Emmanuel Noobadmin
  2012-05-01 16:38               ` Alex
  0 siblings, 1 reply; 37+ messages in thread
From: Emmanuel Noobadmin @ 2012-04-27 15:15 UTC (permalink / raw)
  To: Alex; +Cc: linux-raid

On 4/26/12, Alex <creamyfish@gmail.com> wrote:

> Yes if we can't make reading as fast as writing. But in general having a
> disk with write speed faster than read speed might still be an interesting
> choice for applications with a lot of disk write.

Pardon me if I am missing something here. The proposition is that with
the laser writing technique, we can increase the data rate by having
new-type write heads which swivel, instead of the traditional moving
head over a rotating platter, isn't it?

But since we cannot do the same for reading, we are stuck with the
traditional head/platter mechanism. How would the write speed increase
independently of the general RPM x density relationship?  It seems to
me that we cannot expect a breakthrough in write/read performance
without a fundamental improvement in the read function.

It's like having two people in a car, one collecting payment while the
other gives out goods. Now we increase the speed of the guy giving out
goods (writing), but the total number of units moved will still be the
same, because the car can't go any faster than the guy collecting
payment (reading). The only difference is that the write head has more
idle time between bits?
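
In other words (a tiny sketch with made-up rates):

  read_MBps = 150            # assumed: read path unchanged, still bound by RPM x density
  write_MBps = 150 * 1000    # assumed: write path sped up 1000x
  print(min(read_MBps, write_MBps))   # 150: no faster than the slowest stage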

Unless we are suggesting a hybrid system which has an independent
swiveling write head and a traditional read head? But that sounds
overly complicated and unlikely to work.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-25 16:59         ` Emmanuel Noobadmin
  2012-04-25 19:29           ` David Brown
  2012-04-26  2:30           ` Alex
@ 2012-04-26  4:24           ` Alex
  2 siblings, 0 replies; 37+ messages in thread
From: Alex @ 2012-04-26  4:24 UTC (permalink / raw)
  To: Emmanuel Noobadmin; +Cc: linux-raid

On Thu, Apr 26, 2012 at 12:59 AM, Emmanuel Noobadmin
<centos.admin@gmail.com> wrote:
> On 4/25/12, Alex <creamyfish@gmail.com> wrote:
>> I can see that you are trying very hard to fit a new picture into an old
>> frame. But with new technology there is always new possibilities. For
>> example, what I am thinking is: with the new laser write head, it doesn't
>> necessarily require the head to stay very close  to the platter since laser
>> doesn't fade away with longer distance, which may enable a design that
>
> The strength of the laser will fall as the distance to the media
> increases, wouldn't it?
>
>> the head turns(the platter doesn't have to be round) instead of spinning
>>  the platter; hence the "Data rate is a product of [density * RPM]"
>> statement falls apart. What is more, if the cost is justified, one may
>> even design a disk with multiple write heads to increase the bandwidth
>> since now we are free from the fluid mechanic interaction between the
>> head and the platter. Both of these may lead to independent increase
>> of data rate. I am having these thoughts because it's been an endless
>
> But Ostler's technique only increases the write speed and admits in a
> Wired.com article there is still no faster/other way to read the data.
> So we would still be stuck with the rotating disk and the magnetic
> reading heads until some other breakthrough happens.
>
> Going back into the context of parity RAID rebuild, we would still be
> bottlenecked on the write side wouldn't we since bits on the
> replacement drive can only be calculated after reading the rest of the
> drives.

I also want to add that the sluggishness of read operations can always
be alleviated by prefetching and masked by the speed of a cache.

Cheers,
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-25 16:59         ` Emmanuel Noobadmin
  2012-04-25 19:29           ` David Brown
@ 2012-04-26  2:30           ` Alex
  2012-04-27 15:15             ` Emmanuel Noobadmin
  2012-04-26  4:24           ` Alex
  2 siblings, 1 reply; 37+ messages in thread
From: Alex @ 2012-04-26  2:30 UTC (permalink / raw)
  To: Emmanuel Noobadmin; +Cc: linux-raid

On Thu, Apr 26, 2012 at 12:59 AM, Emmanuel Noobadmin
<centos.admin@gmail.com> wrote:
> On 4/25/12, Alex <creamyfish@gmail.com> wrote:
>> I can see that you are trying very hard to fit a new picture into an old
>> frame. But with new technology there is always new possibilities. For
>> example, what I am thinking is: with the new laser write head, it doesn't
>> necessarily require the head to stay very close  to the platter since laser
>> doesn't fade away with longer distance, which may enable a design that
>

> The strength of the laser will fall as the distance to the media
> increases, wouldn't it?
Thanks to David for answering this for me.
>
>> the head turns(the platter doesn't have to be round) instead of spinning
>>  the platter; hence the "Data rate is a product of [density * RPM]"
>> statement falls apart. What is more, if the cost is justified, one may
>> even design a disk with multiple write heads to increase the bandwidth
>> since now we are free from the fluid mechanic interaction between the
>> head and the platter. Both of these may lead to independent increase
>> of data rate. I am having these thoughts because it's been an endless
>
> But Ostler's technique only increases the write speed and admits in a
> Wired.com article there is still no faster/other way to read the data.
> So we would still be stuck with the rotating disk and the magnetic
> reading heads until some other breakthrough happens.
Ostler admits as much in his response to my e-mails too.
>
> Going back into the context of parity RAID rebuild, we would still be
> bottlenecked on the write side wouldn't we since bits on the
> replacement drive can only be calculated after reading the rest of the
> drives.
Yes, if we can't make reading as fast as writing. But in general, a
disk whose write speed is faster than its read speed might still be an
interesting choice for applications that do a lot of disk writes.

Cheers,
Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-25 16:59         ` Emmanuel Noobadmin
@ 2012-04-25 19:29           ` David Brown
  2012-04-26  2:30           ` Alex
  2012-04-26  4:24           ` Alex
  2 siblings, 0 replies; 37+ messages in thread
From: David Brown @ 2012-04-25 19:29 UTC (permalink / raw)
  To: Emmanuel Noobadmin; +Cc: Alex, stan, linux-raid

On 25/04/12 18:59, Emmanuel Noobadmin wrote:
> On 4/25/12, Alex<creamyfish@gmail.com>  wrote:
>> I can see that you are trying very hard to fit a new picture into an old
>> frame. But with new technology there is always new possibilities. For
>> example, what I am thinking is: with the new laser write head, it doesn't
>> necessarily require the head to stay very close  to the platter since laser
>> doesn't fade away with longer distance, which may enable a design that
>
> The strength of the laser will fall as the distance to the media
> increases, wouldn't it?
>

The laser's strength will only fade very slightly.  It is not passing
through enough air to attenuate significantly, and one of the points of
using a laser is that you can focus it into a very tight, nearly
parallel beam.  So you are right in theory, but wrong in practice.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-25  2:45       ` Alex
@ 2012-04-25 16:59         ` Emmanuel Noobadmin
  2012-04-25 19:29           ` David Brown
                             ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Emmanuel Noobadmin @ 2012-04-25 16:59 UTC (permalink / raw)
  To: Alex; +Cc: stan, linux-raid

On 4/25/12, Alex <creamyfish@gmail.com> wrote:
> I can see that you are trying very hard to fit a new picture into an old
> frame. But with new technology there is always new possibilities. For
> example, what I am thinking is: with the new laser write head, it doesn't
> necessarily require the head to stay very close  to the platter since laser
> doesn't fade away with longer distance, which may enable a design that

The strength of the laser will fall as the distance to the media
increases, won't it?

> the head turns(the platter doesn't have to be round) instead of spinning
>  the platter; hence the "Data rate is a product of [density * RPM]"
> statement falls apart. What is more, if the cost is justified, one may
> even design a disk with multiple write heads to increase the bandwidth
> since now we are free from the fluid mechanic interaction between the
> head and the platter. Both of these may lead to independent increase
> of data rate. I am having these thoughts because it's been an endless

But Ostler's technique only increases the write speed, and he admits in
a Wired.com article that there is still no faster or alternative way to
read the data. So we would still be stuck with the rotating disk and
the magnetic read heads until some other breakthrough happens.

Going back to the context of a parity RAID rebuild, we would still be
bottlenecked on the write side, wouldn't we, since the bits on the
replacement drive can only be calculated after reading the rest of the
drives.
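
To make that dependency concrete, here is a minimal single-parity
sketch with made-up blocks; triple parity just adds more equations of
the same shape:

  from functools import reduce

  # Blocks read back from the surviving drives (made-up contents):
  surviving = [b"\x01\x02", b"\x10\x20", b"\x11\x22"]
  # The block to write to the replacement drive is the XOR of what was read:
  rebuilt = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*surviving))
  print(rebuilt.hex())   # '0000' for these made-up blocks
  # However fast the write head is, no byte can be written until the
  # corresponding bytes have been read from all the other drives.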

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-25  1:20     ` Stan Hoeppner
@ 2012-04-25  2:45       ` Alex
  2012-04-25 16:59         ` Emmanuel Noobadmin
  0 siblings, 1 reply; 37+ messages in thread
From: Alex @ 2012-04-25  2:45 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

I can see that you are trying very hard to fit a new picture into an old
frame. But with new technology there are always new possibilities. For
example, what I am thinking is: with the new laser write head, the head
doesn't necessarily have to stay very close to the platter, since a
laser doesn't fade away over a longer distance, which may enable a
design in which the head turns (the platter doesn't have to be round)
instead of spinning the platter; hence the "Data rate is a product of
[density * RPM]" statement falls apart. What is more, if the cost is
justified, one may even design a disk with multiple write heads to
increase the bandwidth, since we are now free from the fluid-mechanic
interaction between the head and the platter. Both of these may lead to
an independent increase in data rate. I am having these thoughts because
it has been an endless effort to remove the mechanical parts from an
HDD (SSDs were one attempt). With Ostler's new technology, this once
again becomes possible. It is wise to think outside the box at this time.

Cheers,
Alex

On Wed, Apr 25, 2012 at 9:20 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 4/23/2012 10:26 AM, Alex wrote:
>
>> It looks like to me Ostler's team's work does independently lead to an
>> increase in areal density and writing speed, and
>> there is not fixed relation between these two.
>
> That's because you've not taken 60 seconds to actually think about this
> logically.
>
> Data rate is a product of [density * RPM].  To increase data rate we
> must increase one of these two.  Making the R/W head simply punch holes
> faster won't increase data rate.  It will allow simply allow us to
> increase the other two without the R/W head becoming a bottleneck.
>
> Designing the R/W head to punch bits 1000x faster cannot independently
> increase throughput.  The platter must either have more bits per inch or
> they must spin faster.  Period.
>
> Thus, the only way this laser R/W head will increase throughput is if it
> simultaneously writes more bits per inch, which increases density,
> allowing.  So again, unless it increases density, the faster punching
> speed doesn't increase throughput.
>
> --
> Stan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-23 15:26   ` Alex
@ 2012-04-25  1:20     ` Stan Hoeppner
  2012-04-25  2:45       ` Alex
  0 siblings, 1 reply; 37+ messages in thread
From: Stan Hoeppner @ 2012-04-25  1:20 UTC (permalink / raw)
  To: Alex; +Cc: linux-raid

On 4/23/2012 10:26 AM, Alex wrote:

> It looks like to me Ostler's team's work does independently lead to an
> increase in areal density and writing speed, and
> there is not fixed relation between these two.

That's because you've not taken 60 seconds to actually think about this
logically.

Data rate is a product of [density * RPM].  To increase data rate we
must increase one of these two.  Making the R/W head simply punch holes
faster won't increase data rate.  It will simply allow us to increase
those two without the R/W head becoming a bottleneck.

Designing the R/W head to punch bits 1000x faster cannot independently
increase throughput.  The platter must either have more bits per inch
or it must spin faster.  Period.

Thus, the only way this laser R/W head will increase throughput is if
it simultaneously writes more bits per inch, which increases density.
So again, unless it increases density, the faster punching speed
doesn't increase throughput.
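
To put rough numbers on it (a back-of-the-envelope sketch; the drive
parameters are assumed, illustrative values):

  rpm = 7200                     # assumed spindle speed
  bits_per_track = 10_000_000    # assumed linear density: bits passing the head per revolution
  data_rate_MBps = bits_per_track * (rpm / 60.0) / 8 / 1e6
  print(data_rate_MBps)          # 150.0 MB/s sustained
  # A head that can flip a bit 1000x faster changes nothing here: the rate
  # is set by how many bits per second pass under the head, i.e. by density
  # and RPM, not by how fast the head itself can switch.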

-- 
Stan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-20  7:45 ` Stan Hoeppner
@ 2012-04-23 15:26   ` Alex
  2012-04-25  1:20     ` Stan Hoeppner
  0 siblings, 1 reply; 37+ messages in thread
From: Alex @ 2012-04-23 15:26 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

Hi Stan,

Sorry for not replying earlier. My background is not exactly physics,
so I talked to Thomas Ostler for a clarification. Here is what I did:

I first asked him how exactly his model could possibly increase areal
density, and I am quoting his answer:

"thank you for your inquiry. The main reason for our statement that
areal density could be improved is down to a prediction made by the
model we used in this paper to predict this phenomena. The model
results show that the effect occurs with very small system sizes (on
the same time-scale), thus potentially the density could be improved.
However, the model is parameterised on experimental observations of
bulk system so it is not clear that the physics would be the same when
we decrease the size of the bits, this is something that we are
looking at studying in York using different methods over the next few
years. Obviously any increase in writing speed needs to have an
improvement in reading speed to see overall performance, one of the
many engineering problems to be overcome."

Then I asked him to clarify the following two things for me:

1. We also want to make sure that what we understood about your model
improving writing speed is correct: writing speed could be increased
because writing a bit could be achieved on a timescale of picoseconds
rather than the current nanoseconds, and for this reason only.

2. In particular, an increase in areal density doesn't lead to an
increase in writing speed.

And I am quoting his answers again:

Answer for 1:
"yes, indeed we showed that the process is complete (apart from the
longer time cooling) after a couple of pico seconds. We also verified
the time-scale experimentally by measuring what happens as a function
of time to the Fe and Gd sublattices (though this experiment was done
in an applied field), see Radu et al. Nature 472, 205-208 (2011)."

Answer for 2:
"yes that is correct, the simulations show that the time-scale for
switching is not affected by the system size (within reasonable
limits). The features of this material that are physically important
for switching is something that we wish to study further to allow us
to gain more insight into the mechanism behind this switching."

It looks to me like Ostler's team's work does independently lead to an
increase in areal density and writing speed, and there is no fixed
relation between the two. I am still evaluating what this means, but my
first feeling is that the whole thing is still not very mature and there
will definitely be other improvements in the future, so taking a 'wait
and see' standpoint at this point may not necessarily be a bad idea.

On Fri, Apr 20, 2012 at 3:45 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 4/17/2012 1:11 AM, Alex wrote:
>> Thanks to Billy Crook who pointed out this is the right place for my post.
>>
>> Adam Leventhal integrated triple-parity RAID into ZFS in 2009. The
>> necessity of triple-parity RAID is described in detail in Adam
>> Leventhal's article(http://cacm.acm.org/magazines/2010/1/55741-triple-parity-raid-and-beyond/fulltext).
>
> No mention of SSD.
>
>> al.(http://www.nature.com/ncomms/journal/v3/n2/full/ncomms1666.html)
>
> Pay wall.  No matter, as I'd already read of this research.
>
>> established a revolutionary way of writing magnetic substrate using a
>> heat pulse instead of a traditional magnetic field, which may increase
>> data throughput on a hard disk by 1000 times in the future.
>
> Your statement is massively misleading.  The laser heating technology
> doesn't independently increase throughput 1000x.  It will allow for
> increased throughput only via enabling greater aerial density.  Thus the
> ratio of throughput to capacity stays the same.  Thus drive rebuild
> times will still increase dramatically.
>
Sorry

>> facilitate another triple-parity RAID algorithm
>
> CPU performance is increasing at a faster rate than any computer
> technology.  Thus, if you're going to even bother with introducing
> another parity RAID level, and the binary will run on host CPU cores,
> skip triple parity and go straight to quad parity, RAID-P4™.  Most savvy
> folks doing RAID6 are using a 6+2 or 8+2 configuration as wide stripe
> parity arrays tend to be problematic.  They then stripe them to create a
> RAID60, or concatenate them if they're even more savvy and use XFS.
>
> The most common JBOD chassis on the market today seems to be the 24x
> 2.5" drive layout.  This allows three 6+2 RAID6 arrays, losing 6 drives
> to parity leaving 18 drives of capacity.  With RAID-P4™  a wider stripe
> array becomes more attractive for some applications.  Thus our 24 drive
> JBOD could yield a 20+4 RAID-P4™ with two drives more capacity than the
> 6+2 RAID6 configuration.  If one wished to stick with narrower stripes,
> we'd get two 8+4 RAID-P4™ arrays and 16 drives total capacity, 2 less
> than the triple RAID6 setup, and still 4 drives more capacity than RAID10.
>
> The really attractive option here for people who like parity RAID is the
> 20+4 possibility.  With a RAID-P4™ array that can withstand up to 4
> drive failures, people will no longer be afraid of using wide stripes
> for applications that typically benefit, where RAID50/60 would have been
> employed previously.  They also no longer have to worry about secondary
> and/or tertiary drive failures during a rebuild.
>
> Yeah, definitely go straight to RAID-P4™ and skip triple parity RAID
> altogether.  You'll have to do it in 6-10 years anyway so may as well
> prevent the extra work.  And people could definitely benefit from
> RAID-P4™ today.
>
> --
> Stan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-22  3:14                           ` Alex
@ 2012-04-22  8:57                             ` Piergiorgio Sartor
  0 siblings, 0 replies; 37+ messages in thread
From: Piergiorgio Sartor @ 2012-04-22  8:57 UTC (permalink / raw)
  To: Alex; +Cc: Piergiorgio Sartor, linux-raid

Hi Alex,

> Just being curious, any mathematical improvement in Reed-Solomon's
> code like the
> RS(96,64) you mentioned? The calculation of RS involves arbitrary
> multiplication in a
> Galois field and table look-up has to be used. Has that already changed?

AFAIK (details are not public) it uses the standard
RS with LUT operations.

Considering that this is for network disks, I suppose
there is no performance problem.

RS(255,223) or RS(160,128) are very old encoding
schemes; we are talking about the Voyager space probe here.

One improvement I know of is decoding using
the Berlekamp–Massey algorithm.
This is quite hard, since it also finds *where* the
errors are, but it seems there is some work out there on
how to implement it very efficiently.

My personal opinion is that the first thing to do is
to have a working system; later it could be possible
to improve performance with different optimizations.

For example, LUT multiplication might or might not be
the best way, but at first it is the easy way to do it.
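
For reference, the usual log/exp LUT trick looks roughly like this (a
sketch over GF(256) with the RAID-6 polynomial 285; the table layout is
illustrative):

  # Build exp/log tables for GF(2^8) with polynomial 0x11D (285), generator {02}.
  exp = [0] * 512
  log = [0] * 256
  x = 1
  for i in range(255):
      exp[i] = x
      log[x] = i
      x <<= 1
      if x & 0x100:
          x ^= 0x11D
  for i in range(255, 512):
      exp[i] = exp[i - 255]    # duplicated so the sum of two logs never needs a modulo

  def gf_mul(a, b):
      # Arbitrary multiplication becomes two look-ups plus an integer add.
      if a == 0 or b == 0:
          return 0
      return exp[log[a] + log[b]]

  print(gf_mul(0x02, 0x8e))    # 1: {8e} is the multiplicative inverse of {02}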

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-21 11:18                         ` Piergiorgio Sartor
@ 2012-04-22  3:14                           ` Alex
  2012-04-22  8:57                             ` Piergiorgio Sartor
  0 siblings, 1 reply; 37+ messages in thread
From: Alex @ 2012-04-22  3:14 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

Hi Piergiorgio,

Just being curious: is there any mathematical improvement in
Reed-Solomon codes like the RS(96,64) you mentioned? The calculation of
RS involves arbitrary multiplication in a Galois field, so table
look-up has to be used. Has that already changed?

Cheers,
Alex
On Sat, Apr 21, 2012 at 7:18 PM, Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:
> Hi Peter,
>
>> Perhaps it was not clear that "use self repairing coding
>> similarly to Parchive" does not mean the same as "use Parchive".
>
> sorry, I understood you meant storing somewhere
> files with parities.
>
>> Applying self repairing coding to RAID might be easier to
>> imagine considering of a RAID volume as a directory of
>> stripe-sized files.
>
> Exactly, but as far as I know this is the same as
> any N parities RAID.
>
> I mean, the mathematics and algorithm behind should
> be exactly the same...
>
> bye,
>
> --
>
> piergiorgio

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-21  9:51                       ` Peter Grandi
@ 2012-04-21 11:18                         ` Piergiorgio Sartor
  2012-04-22  3:14                           ` Alex
  0 siblings, 1 reply; 37+ messages in thread
From: Piergiorgio Sartor @ 2012-04-21 11:18 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

Hi Peter,

> Perhaps it was not clear that "use self repairing coding
> similarly to Parchive" does not mean the same as "use Parchive".

Sorry, I understood you to mean storing files with
parities somewhere.

> Applying self repairing coding to RAID might be easier to
> imagine considering of a RAID volume as a directory of
> stripe-sized files.

Exactly, but as far as I know this is the same as
any N-parity RAID.

I mean, the mathematics and the algorithm behind it
should be exactly the same...

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-20 22:31                     ` Piergiorgio Sartor
@ 2012-04-21  9:51                       ` Peter Grandi
  2012-04-21 11:18                         ` Piergiorgio Sartor
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Grandi @ 2012-04-21  9:51 UTC (permalink / raw)
  To: Linux RAID

>>>> 255 data disks is the theoretical limit for GF(2⁸). [ ... ]

>>> The reason to use many disks is in case of geo-redundant RAID,
>>> for example with iscsi.  In this situation you want to have a
>>> lot of redundance, in parities, not mirror.

>> [ ... ] extreme requirements implied by that why not use self
>> repairing coding similarly to Parchive style storage formats,
>> [ ... ]

> it depends on other requirements, for example if you want to
> control your file or let the control to the "cloud".  In case
> of RAID, the cloud sees only raw bytes and the local host sees
> the files too. In case of par2, the cloud must see the
> files. [ ... ]

Perhaps it was not clear that "use self repairing coding
similarly to Parchive" does not mean the same as "use Parchive".

Applying self-repairing coding to RAID might be easier to
imagine by considering a RAID volume as a directory of
stripe-sized files.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-20 21:29                   ` Peter Grandi
@ 2012-04-20 22:31                     ` Piergiorgio Sartor
  2012-04-21  9:51                       ` Peter Grandi
  0 siblings, 1 reply; 37+ messages in thread
From: Piergiorgio Sartor @ 2012-04-20 22:31 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

Hi Peter,

On Fri, Apr 20, 2012 at 10:29:11PM +0100, Peter Grandi wrote:
> [ ... ]
> 
> >> 255 data disks is the theoretical limit for GF(2⁸).  But it
> >> is a theoretical limit of the algorithms - I don't know
> >> whether Linux md raid actually supports that many disks.  I
> >> certainly doubt if it is useful.
> 
> > The reason to use many disks is in case of geo-redundant RAID,
> > for example with iscsi.  In this situation you want to have a
> > lot of redundance, in parities, not mirror.
> 
> is that something that makes sense? If one has the extreme
> requirements implied by that why not use self repairing coding
> similarly to Parchive style storage formats, for example Typhoon
> or the Azure filesystem or others inspired by Parchive.

It depends on other requirements, for example whether
you want to control your files or leave the control
to the "cloud".
In the case of RAID, the cloud sees only raw bytes,
while the local host also sees the files.

In the case of par2, the cloud must see the files.

BTW, this is not my original idea; as already mentioned,
there is someone doing a RAID-96
with RS(96,64) over networked virtual drives.

> Out of interest I just did a small web search and it turned up a
> recent survey/lecture by Frédérique Oggier on the maths of these
> coding systems:
> 
>   http://phdopen.mimuw.edu.pl/lato12/LectPoland.pdf
>   http://sands.sce.ntu.edu.sg/CodingForNetworkedStorage/

Interesting, I'll have a look.

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-20 21:01                 ` Piergiorgio Sartor
@ 2012-04-20 21:29                   ` Peter Grandi
  2012-04-20 22:31                     ` Piergiorgio Sartor
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Grandi @ 2012-04-20 21:29 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

>> 255 data disks is the theoretical limit for GF(2⁸).  But it
>> is a theoretical limit of the algorithms - I don't know
>> whether Linux md raid actually supports that many disks.  I
>> certainly doubt if it is useful.

> The reason to use many disks is in case of geo-redundant RAID,
> for example with iscsi.  In this situation you want to have a
> lot of redundance, in parities, not mirror.

Is that something that makes sense? If one has the extreme
requirements implied by that, why not use self repairing coding
similarly to Parchive style storage formats, for example Typhoon
or the Azure filesystem or others inspired by Parchive?

Out of interest I just did a small web search and it turned up a
recent survey/lecture by Frédérique Oggier on the maths of these
coding systems:

  http://phdopen.mimuw.edu.pl/lato12/LectPoland.pdf
  http://sands.sce.ntu.edu.sg/CodingForNetworkedStorage/

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-20 19:39                 ` H. Peter Anvin
@ 2012-04-20 21:04                   ` Piergiorgio Sartor
  0 siblings, 0 replies; 37+ messages in thread
From: Piergiorgio Sartor @ 2012-04-20 21:04 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: David Brown, Alex, linux-raid

Hi Peter,

> > Yes, being a generator for GF(2^8) is a requirement for a parity
> > generator (sorry for the confusing terminology here - if anyone has a
> > better suggestion, please say) to be part of a 255 data disk system.
> > However, being a GF generator is necessary but not sufficient - using
> > parity generators (1, 2, 4, 16) will /not/ give quad parity for 255 data
> > disks, even though individually each of 1, 2, 4 and 16 are generators
> > for GF.
[...]
> It is also worth noting that there is nothing magical about GF(2^8).  It
> is just a reasonable tradeoff when tables are needed.

Then I ask you too:

What is this story that being a generator is not enough?

Is there any reference, documentation, or link which can be
studied in order to understand this limitation?

In all the RS papers I found, the only constraint put was that
the Vandermonde matrix must be constructed with generators.
Not all RAID examples used them, but no paper, at least
as far as I understood, was limiting the generators to
also be "independent".

Any understandable explanation?

Thanks,

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-20 18:58               ` David Brown
  2012-04-20 19:39                 ` H. Peter Anvin
@ 2012-04-20 21:01                 ` Piergiorgio Sartor
  2012-04-20 21:29                   ` Peter Grandi
  1 sibling, 1 reply; 37+ messages in thread
From: Piergiorgio Sartor @ 2012-04-20 21:01 UTC (permalink / raw)
  To: David Brown; +Cc: Alex, H. Peter Anvin, linux-raid

Hi again David,

> Yes, being a generator for GF(2^8) is a requirement for a parity
> generator (sorry for the confusing terminology here - if anyone has
> a better suggestion, please say) to be part of a 255 data disk
> system. However, being a GF generator is necessary but not
> sufficient - using parity generators (1, 2, 4, 16) will /not/ give
> quad parity for 255 data disks, even though individually each of 1,
> 2, 4 and 16 are generators for GF.

I ask again: could you please elaborate on this?
I have found no such further constraint on the parities.

All I could find is that the Vandermonde matrix must
be built with generators.

> 255 data disks is the theoretical limit for GF(2⁸).  But it is a
> theoretical limit of the algorithms - I don't know whether Linux md
> raid actually supports that many disks.  I certainly doubt if it is
> useful.

The reason to use many disks is the case of
geo-redundant RAID, for example with iSCSI.
In this situation you want to have a lot of
redundancy, in parities, not mirrors.

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-20 18:58               ` David Brown
@ 2012-04-20 19:39                 ` H. Peter Anvin
  2012-04-20 21:04                   ` Piergiorgio Sartor
  2012-04-20 21:01                 ` Piergiorgio Sartor
  1 sibling, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2012-04-20 19:39 UTC (permalink / raw)
  To: David Brown; +Cc: Alex, linux-raid

On 04/20/2012 11:58 AM, David Brown wrote:
> Hi,
> 
> Yes, being a generator for GF(2^8) is a requirement for a parity
> generator (sorry for the confusing terminology here - if anyone has a
> better suggestion, please say) to be part of a 255 data disk system.
> However, being a GF generator is necessary but not sufficient - using
> parity generators (1, 2, 4, 16) will /not/ give quad parity for 255 data
> disks, even though individually each of 1, 2, 4 and 16 are generators
> for GF.
> 
> 255 data disks is the theoretical limit for GF(2⁸).  But it is a
> theoretical limit of the algorithms - I don't know whether Linux md raid
> actually supports that many disks.  I certainly doubt if it is useful.
> 
> It might well be that a 21 data disk limit quad parity is useful - or at
> least, as useful as quad parity ever would be.  It would fit well within
> a typical large chassis with 24 disk slots.  And then it doesn't matter
> that 8 is not a generator for GF(2⁸) - it becomes the best choice
> because of the easiest implementation.
> 

It is also worth noting that there is nothing magical about GF(2^8).  It
is just a reasonable tradeoff when tables are needed.

There are hardware tricks one can play to do efficient operations on
wider fields, too.

But it sounds like {04} or {8e} are particularly interesting generators
of the existing GF(2^8) field for an efficient second field, giving
triple-parity RAID again at a reasonable cost.
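
A quick sanity check of that (a sketch; it assumes the usual RAID-6
representation with polynomial 285, where {02} is primitive, {04} =
{02}^2 and {8e} = {02}^254):

  from math import gcd

  # {02}^k generates the whole multiplicative group iff gcd(k, 255) == 1.
  for name, k in (("{04}", 2), ("{8e}", 254)):
      print(name, "has order", 255 // gcd(k, 255))   # 255 for both, so both are generators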

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-20  3:32             ` Alex
@ 2012-04-20 18:58               ` David Brown
  2012-04-20 19:39                 ` H. Peter Anvin
  2012-04-20 21:01                 ` Piergiorgio Sartor
  0 siblings, 2 replies; 37+ messages in thread
From: David Brown @ 2012-04-20 18:58 UTC (permalink / raw)
  To: Alex; +Cc: H. Peter Anvin, linux-raid

Hi,

Yes, being a generator for GF(2^8) is a requirement for a parity
generator (sorry for the confusing terminology here - if anyone has a
better suggestion, please say) to be part of a 255 data disk system.
However, being a GF generator is necessary but not sufficient - using
parity generators (1, 2, 4, 16) will /not/ give quad parity for 255 data
disks, even though individually each of 1, 2, 4 and 16 is a generator
for GF.

255 data disks is the theoretical limit for GF(2⁸).  But it is a
theoretical limit of the algorithms - I don't know whether Linux md raid
actually supports that many disks.  I certainly doubt it is useful.

It might well be that quad parity with a 21 data disk limit is useful -
or at least, as useful as quad parity ever would be.  It would fit well
within a typical large chassis with 24 disk slots.  And then it doesn't
matter that 8 is not a generator for GF(2⁸) - it becomes the best choice
because it has the easiest implementation.

On 20/04/12 05:32, Alex wrote:
> I understand we need a generator to facilitate a 255 data disks
> array, but 255 sounds like a theoretic limit to me. ZFS now only
> supports an array of only 9 disks(6 of them are data disks), so
> having, say, a quad-parity array of 48 disks(theoretically) doesn't
> sound that bad, does it?
>
> Cheers, Alex
>
> On Fri, Apr 20, 2012 at 11:00 AM, H. Peter Anvin<hpa@zytor.com>
> wrote:
>> Being a generator is a requirement for that.
>>
>> Alex<creamyfish@gmail.com>  wrote:
>>
>>> I think when David says 'generator', he doesn't mean the
>>> generator of the order 8 Galois field, he means an arbitrary set
>>> of number in it which can render the system of equations solvable
>>> to up to a certain number of data disks(not necessarily 255). He
>>> uses a brute-force method with the help of a Python program to
>>> actually figure that out. It looks pretty cool to me since I have
>>> known the system of 4 equations generally fails to render a
>>> solution for a while, but now I know exactly how many ways it may
>>> fail...
>>>
>>> Cheers, Alex
>>>
>>>
>>> On Fri, Apr 20, 2012 at 2:16 AM, H. Peter Anvin<hpa@zytor.com>
>>> wrote:
>>>> On 04/17/2012 01:18 PM, David Brown wrote:
>>>>>
>>>>> For quad parity, we can try g3 = 8 as the obvious next choice
>>>>> in the pattern.  Unfortunately, we start hitting conflicts.
>>>>> To recover
>>> missing
>>>>> data, we have to solve multiple simultaneous equations over
>>>>> G(2⁸),
>>> whose
>>>>> coefficients depend on the index numbers of the missing
>>>>> disks.  With parity generators (1, 2, 4, 8), some of these
>>>>> combinations of
>>> missing
>>>>> disk indexes lead to insoluble equations when you have more
>>>>> that 21
>>> disks.
>>>>>
>>>>
>>>> That is because 255 = 3*5*17... this means {02}^3 = {08} is not
>>>> a
>>> generator.
>>>>
>>>> -hpa
>>>>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-17  6:11 Alex
  2012-04-17  7:58 ` David Brown
@ 2012-04-20  7:45 ` Stan Hoeppner
  2012-04-23 15:26   ` Alex
  1 sibling, 1 reply; 37+ messages in thread
From: Stan Hoeppner @ 2012-04-20  7:45 UTC (permalink / raw)
  To: Alex; +Cc: linux-raid

On 4/17/2012 1:11 AM, Alex wrote:
> Thanks to Billy Crook who pointed out this is the right place for my post.
> 
> Adam Leventhal integrated triple-parity RAID into ZFS in 2009. The
> necessity of triple-parity RAID is described in detail in Adam
> Leventhal's article(http://cacm.acm.org/magazines/2010/1/55741-triple-parity-raid-and-beyond/fulltext).

No mention of SSD.

> al.(http://www.nature.com/ncomms/journal/v3/n2/full/ncomms1666.html)

Pay wall.  No matter, as I'd already read of this research.

> established a revolutionary way of writing magnetic substrate using a
> heat pulse instead of a traditional magnetic field, which may increase
> data throughput on a hard disk by 1000 times in the future.

Your statement is massively misleading.  The laser heating technology
doesn't independently increase throughput 1000x.  It will allow for
increased throughput only by enabling greater areal density.  Thus the
ratio of throughput to capacity stays the same.  Thus drive rebuild
times will still increase dramatically.

> facilitate another triple-parity RAID algorithm

CPU performance is increasing at a faster rate than any other computer
technology.  Thus, if you're going to even bother with introducing
another parity RAID level, and the binary will run on host CPU cores,
skip triple parity and go straight to quad parity, RAID-P4™.  Most savvy
folks doing RAID6 are using a 6+2 or 8+2 configuration as wide stripe
parity arrays tend to be problematic.  They then stripe them to create a
RAID60, or concatenate them if they're even more savvy and use XFS.

The most common JBOD chassis on the market today seems to be the 24x
2.5" drive layout.  This allows three 6+2 RAID6 arrays, losing 6 drives
to parity leaving 18 drives of capacity.  With RAID-P4™  a wider stripe
array becomes more attractive for some applications.  Thus our 24 drive
JBOD could yield a 20+4 RAID-P4™ with two drives more capacity than the
6+2 RAID6 configuration.  If one wished to stick with narrower stripes,
we'd get two 8+4 RAID-P4™ arrays and 16 drives total capacity, 2 less
than the triple RAID6 setup, and still 4 drives more capacity than RAID10.
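
Spelling out the capacity arithmetic (a quick sketch; the layouts are
just the configurations above):

  slots = 24
  layouts = {
      "3 x (6+2) RAID6":    3 * 6,       # 18 data drives
      "1 x (20+4) RAID-P4": 20,          # 20 data drives
      "2 x (8+4) RAID-P4":  2 * 8,       # 16 data drives
      "RAID10":             slots // 2,  # 12 data drives
  }
  for name, data_drives in layouts.items():
      print(name, "->", data_drives, "data drives out of", slots, "slots")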

The really attractive option here for people who like parity RAID is the
20+4 possibility.  With a RAID-P4™ array that can withstand up to 4
drive failures, people will no longer be afraid of using wide stripes
for applications that typically benefit, where RAID50/60 would have been
employed previously.  They also no longer have to worry about secondary
and/or tertiary drive failures during a rebuild.

Yeah, definitely go straight to RAID-P4™ and skip triple parity RAID
altogether.  You'll have to do it in 6-10 years anyway, so you may as
well avoid the extra work.  And people could definitely benefit from
RAID-P4™ today.

-- 
Stan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-20  3:00           ` H. Peter Anvin
@ 2012-04-20  3:32             ` Alex
  2012-04-20 18:58               ` David Brown
  0 siblings, 1 reply; 37+ messages in thread
From: Alex @ 2012-04-20  3:32 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-raid

I understand we need a generator to facilitate a 255-data-disk array,
but 255 sounds like a theoretical limit to me. ZFS now supports an
array of only 9 disks (6 of them data disks), so having, say, a
quad-parity array of 48 disks (theoretically) doesn't sound that bad,
does it?

Cheers,
Alex

On Fri, Apr 20, 2012 at 11:00 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> Being a generator is a requirement for that.
>
> Alex <creamyfish@gmail.com> wrote:
>
>>I think when David says 'generator', he doesn't mean the generator of
>>the order
>>8 Galois field, he means an arbitrary set  of number in it which can
>>render the
>>system of equations solvable to up to a certain number of data
>>disks(not necessarily
>>255). He uses a brute-force method with the help of a Python program to
>>actually
>>figure that out. It looks pretty cool to me since I have known the
>>system of 4 equations
>>generally fails to render a solution for a while, but now I know
>>exactly how many ways
>>it may fail...
>>
>>Cheers,
>>Alex
>>
>>
>>On Fri, Apr 20, 2012 at 2:16 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> On 04/17/2012 01:18 PM, David Brown wrote:
>>>>
>>>> For quad parity, we can try g3 = 8 as the obvious next choice in the
>>>> pattern.  Unfortunately, we start hitting conflicts.  To recover
>>missing
>>>> data, we have to solve multiple simultaneous equations over G(2⁸),
>>whose
>>>> coefficients depend on the index numbers of the missing disks.  With
>>>> parity generators (1, 2, 4, 8), some of these combinations of
>>missing
>>>> disk indexes lead to insoluble equations when you have more that 21
>>disks.
>>>>
>>>
>>> That is because 255 = 3*5*17... this means {02}^3 = {08} is not a
>>generator.
>>>
>>>        -hpa
>>>
>
> --
> Sent from my mobile phone. Please excuse brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-20  2:27         ` Alex
@ 2012-04-20  3:00           ` H. Peter Anvin
  2012-04-20  3:32             ` Alex
  0 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2012-04-20  3:00 UTC (permalink / raw)
  To: Alex; +Cc: linux-raid

Being a generator is a requirement for that.

Alex <creamyfish@gmail.com> wrote:

>I think when David says 'generator', he doesn't mean the generator of
>the order
>8 Galois field, he means an arbitrary set  of number in it which can
>render the
>system of equations solvable to up to a certain number of data
>disks(not necessarily
>255). He uses a brute-force method with the help of a Python program to
>actually
>figure that out. It looks pretty cool to me since I have known the
>system of 4 equations
>generally fails to render a solution for a while, but now I know
>exactly how many ways
>it may fail...
>
>Cheers,
>Alex
>
>
>On Fri, Apr 20, 2012 at 2:16 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 04/17/2012 01:18 PM, David Brown wrote:
>>>
>>> For quad parity, we can try g3 = 8 as the obvious next choice in the
>>> pattern.  Unfortunately, we start hitting conflicts.  To recover
>missing
>>> data, we have to solve multiple simultaneous equations over G(2⁸),
>whose
>>> coefficients depend on the index numbers of the missing disks.  With
>>> parity generators (1, 2, 4, 8), some of these combinations of
>missing
>>> disk indexes lead to insoluble equations when you have more that 21
>disks.
>>>
>>
>> That is because 255 = 3*5*17... this means {02}^3 = {08} is not a
>generator.
>>
>>        -hpa
>>

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-19 18:16       ` H. Peter Anvin
@ 2012-04-20  2:27         ` Alex
  2012-04-20  3:00           ` H. Peter Anvin
  0 siblings, 1 reply; 37+ messages in thread
From: Alex @ 2012-04-20  2:27 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-raid

I think when David says 'generator', he doesn't mean a generator of the
Galois field of order 2^8; he means an arbitrary set of numbers in it
which can render the system of equations solvable up to a certain
number of data disks (not necessarily 255). He uses a brute-force
method, with the help of a Python program, to actually figure that out.
It looks pretty cool to me, since I have known for a while that the
system of 4 equations generally fails to render a solution, but now I
know exactly how many ways it may fail...

Cheers,
Alex


On Fri, Apr 20, 2012 at 2:16 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/17/2012 01:18 PM, David Brown wrote:
>>
>> For quad parity, we can try g3 = 8 as the obvious next choice in the
>> pattern.  Unfortunately, we start hitting conflicts.  To recover missing
>> data, we have to solve multiple simultaneous equations over G(2⁸), whose
>> coefficients depend on the index numbers of the missing disks.  With
>> parity generators (1, 2, 4, 8), some of these combinations of missing
>> disk indexes lead to insoluble equations when you have more that 21 disks.
>>
>
> That is because 255 = 3*5*17... this means {02}^3 = {08} is not a generator.
>
>        -hpa
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-17 20:18     ` David Brown
  2012-04-17 20:54       ` Piergiorgio Sartor
  2012-04-18 18:22       ` Piergiorgio Sartor
@ 2012-04-19 18:16       ` H. Peter Anvin
  2012-04-20  2:27         ` Alex
  2 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2012-04-19 18:16 UTC (permalink / raw)
  To: David Brown; +Cc: Piergiorgio Sartor, linux-raid

On 04/17/2012 01:18 PM, David Brown wrote:
> 
> For quad parity, we can try g3 = 8 as the obvious next choice in the
> pattern.  Unfortunately, we start hitting conflicts.  To recover missing
> data, we have to solve multiple simultaneous equations over G(2⁸), whose
> coefficients depend on the index numbers of the missing disks.  With
> parity generators (1, 2, 4, 8), some of these combinations of missing
> disk indexes lead to insoluble equations when you have more that 21 disks.
> 

That is because 255 = 3*5*17... this means {02}^3 = {08} is not a generator.
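
Concretely, {08} = {02}^3 has multiplicative order 255/3 = 85, so its
powers cover only a third of the field.  A quick brute-force check
(assuming the usual 0x11D = 285 polynomial):

  def gf_mul(a, b, poly=0x11D):    # multiply in GF(2^8), RAID-6 polynomial 285
      p = 0
      while b:
          if b & 1:
              p ^= a
          a <<= 1
          if a & 0x100:
              a ^= poly
          b >>= 1
      return p

  x, order = 8, 1
  while x != 1:                    # walk the powers of {08} until they wrap around
      x, order = gf_mul(x, 8), order + 1
  print(order)   # 85, not 255: {08}'s powers cannot index 255 distinct disks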

	-hpa


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-18 20:20         ` David Brown
@ 2012-04-18 20:39           ` Piergiorgio Sartor
  0 siblings, 0 replies; 37+ messages in thread
From: Piergiorgio Sartor @ 2012-04-18 20:39 UTC (permalink / raw)
  To: David Brown; +Cc: Piergiorgio Sartor, linux-raid

Hi David,

> I know that 8 is not a generator, and therefore you cannot expect to
> get a full set of (256 - noOfParities) disks.  But picking another
> generator (such as 16) is not enough to guarantee you the full range
> - it is a requirement, but not sufficient.  The generators need to
> be independent of each other, in the sense that all the simultaneous
> equations for all the combinations of failed disks need to be
> soluble.

Uhm, then how can RS(255,223) or the equivalent RS(160,128) work?

In all the papers I saw, the "independence" of the GF generators was
never mentioned; do you have any reference I can look into?
 
> It turns out that if you pick 16 as the forth parity generator here
> (1, 2, 4, 16), then you can only have 5 data disks.  In fact, there
> are no other values for g3 that give significantly more than 21 data
> disks in combination with (1, 2, 4), whether or not they happen to
> be a generator for all of GF(2⁸).

That's really interesting, but, again, I fail to see how
this could be, given that there are larger-range codes.

Maybe they do not use the polynomial 285; in fact, AES uses
283, but there the first generator is 3 and not 2, and actually
no powers of two appear, if I remember correctly.

> When I started out with this, I thought it was as simple as you are
> suggesting.  But it is not - picking a set of generators for GF(2⁸)
> is not enough.  You have to check that all the solution matrices are
> invertible for all combinations of failed disks.

Check or prove? How do you do that?
And what exactly do you mean
by "all combinations of failed disks"?

As already written, par2 implements a generic RS
parity generator, and it works with very few
limitations. Do they do something different?
 
> In fact, it is a little surprising that (1, 2, 4) works so well for
> triple parity.  I don't know whether it is just luck, or genius in
> Peter Anvin's choice of the multiply operation.

GF(256) with polynomial 285 seems quite the
structure of choice; I wonder if the polynomial
must fulfill specific conditions in order to
allow a wider range of parities...

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-18 18:22       ` Piergiorgio Sartor
@ 2012-04-18 20:20         ` David Brown
  2012-04-18 20:39           ` Piergiorgio Sartor
  0 siblings, 1 reply; 37+ messages in thread
From: David Brown @ 2012-04-18 20:20 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

On 18/04/12 20:22, Piergiorgio Sartor wrote:
> Hi David,
>
> On Tue, Apr 17, 2012 at 10:18:55PM +0200, David Brown wrote:
> [...]
>
>> For quad parity, we can try g3 = 8 as the obvious next choice in the
>> pattern.  Unfortunately, we start hitting conflicts.  To recover
>
> you should not use 8, because this is not a generator
> of GF(256) with polynomial 285, the standard for the
> RAID-5/6 setup.
>
> This means that 8^k does not cover the complete field
> for k in [0, 254], thus having cycles and, consequently,
> creating conflicts.
>
> Some generators could be:
>
> 2, 4, 6, 9, 13, 14, 16...
>
> but not 32 nor 64.
>
> I know that powers of two are nice, but if you want to
> have generic RAID, you must use other values.
>

I know that 8 is not a generator, and therefore you cannot expect to get 
a full set of (256 - noOfParities) disks.  But picking another generator 
(such as 16) is not enough to guarantee you the full range - it is a 
requirement, but not sufficient.  The generators need to be independent 
of each other, in the sense that all the simultaneous equations for all 
the combinations of failed disks need to be soluble.

It turns out that if you pick 16 as the fourth parity generator here (1, 
2, 4, 16), then you can only have 5 data disks.  In fact, there are no 
other values for g3 that give significantly more than 21 data disks in 
combination with (1, 2, 4), whether or not they happen to be a generator 
for all of GF(2⁸).

> The log/exp tables, are, of course, always valid.
>
> BTW, the GF(256) with polynomial 285 has exactly 128
> generators, so it would be possible to have up to 129
> parity disks (1 is not a generator), for, I guess, a
> max of 256 disks (or maybe 255?).
>
> Hope this helps,
>
> bye,
>

When I started out with this, I thought it was as simple as you are 
suggesting.  But it is not - picking a set of generators for GF(2⁸) is 
not enough.  You have to check that all the solution matrices are 
invertible for all combinations of failed disks.

In fact, it is a little surprising that (1, 2, 4) works so well for 
triple parity.  I don't know whether it is just luck, or genius in Peter 
Anvin's choice of the multiply operation.

mvh.,

David
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-17 20:18     ` David Brown
  2012-04-17 20:54       ` Piergiorgio Sartor
@ 2012-04-18 18:22       ` Piergiorgio Sartor
  2012-04-18 20:20         ` David Brown
  2012-04-19 18:16       ` H. Peter Anvin
  2 siblings, 1 reply; 37+ messages in thread
From: Piergiorgio Sartor @ 2012-04-18 18:22 UTC (permalink / raw)
  To: David Brown; +Cc: Piergiorgio Sartor, linux-raid

Hi David,

On Tue, Apr 17, 2012 at 10:18:55PM +0200, David Brown wrote:
[...]

> For quad parity, we can try g3 = 8 as the obvious next choice in the
> pattern.  Unfortunately, we start hitting conflicts.  To recover

you should not use 8, because this is not a generator
of GF(256) with polynomial 285, the standard for the
RAID-5/6 setup.

This means that 8^k does not cover the complete field
for k in [0, 254], thus having cycles and, consequently,
creating conflicts.

Some generators could be:

2, 4, 6, 9, 13, 14, 16...

but not 32 nor 64.

I know that powers of two are nice, but if you want to
have generic RAID, you must use other values.

The log/exp tables, are, of course, always valid.

BTW, the GF(256) with polynomial 285 has exactly 128
generators, so it would be possible to have up to 129
parity disks (1 is not a generator), for, I guess, a
max of 256 disks (or maybe 255?).
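
The count of 128 can be double-checked with a couple of lines of
Python (a sketch, relying only on {02} being a generator, so that the
generators are exactly {02}^k with gcd(k, 255) = 1):

from math import gcd

# Euler's phi(255) = phi(3) * phi(5) * phi(17) = 2 * 4 * 16 = 128
print(sum(1 for k in range(1, 255) if gcd(k, 255) == 1))   # 128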

Hope this helps,

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-17 16:37   ` Stefan /*St0fF*/ Hübner
@ 2012-04-18 14:15     ` Alex
  2012-04-18 14:11       ` David Brown
  0 siblings, 1 reply; 37+ messages in thread
From: Alex @ 2012-04-18 14:15 UTC (permalink / raw)
  To: stefan.huebner; +Cc: linux-raid

Hi Stefan,
This is just a test of using the list. But still, thanks to your encouragement.

Cheers,
Alex

On Wed, Apr 18, 2012 at 12:37 AM, Stefan /*St0fF*/ Hübner
<stefan.huebner@stud.tu-ilmenau.de> wrote:
> Am 17.04.2012 09:58, schrieb David Brown:
>> Hi Alex,
>>
>> I've been playing around with triple-parity raid theory for a while.
>> It's been mainly for my own interest - it's fun to do some deeper maths
>> (I studied maths at university, but this is the first time I've done
>> serious group theory in the last 20 years), fun to resurrect my LaTeX
>> skills, and maybe it will be useful to the md raid developers.
>>
>> My current state is that I've got theory worked out and written up - not
>> just for triple parity, but for more parities as well.  For some of it,
>> I've got Python code to test and verify the maths.  It turns out that
>> triple parity can work well - but for quad parity the limit is 21 data
>> disks (using generators 2, 4, and 8), or up to 33 (using for example
>> 0x07, 0x35 and 0x8b as generators).  Realistically, I think triple
>> parity is the limit for practical implementations.
>>
>> I haven't finished the paper - in particular, I haven't filled out the
>> simplifications of the triple parity work from more general matrix
>> forms.  I also hope to work on implementation details for the
>> calculations.  But maybe there is already enough to be of interest to
>> some people, such as yourself.
>>
>> <http://redhouse.homelinux.net/raid/>
>>
>> mvh.,
>>
>> David
>>
>>
>>
>> On 17/04/2012 08:11, Alex wrote:
>>> Thanks to Billy Crook who pointed out this is the right place for my
>>> post.
>>>
>>> Adam Leventhal integrated triple-parity RAID into ZFS in 2009. The
>>> necessity of triple-parity RAID is described in detail in Adam
>>> Leventhal's
>>>
> article(http://cacm.acm.org/magazines/2010/1/55741-triple-parity-raid-and-beyond/fulltext).
>>>
>>> Basically it is because hard drive capacity has doubled
>>> annually(Kryder's law) while hard drive throughput has improved far
>>> more modestly, the time it takes to recover a faulty hard disk in a
>>> double-parity RAID like RAID 6 might become so long(in 2010, it takes
>>> about 4 hours to recover a 1TB SAS hard disk at its full speed) that
>>> the remaining array(essentially a RAID 5 array, which has proven to be
>>> unsafe) might fail and cause data loss during recovery. Earlier this
>>> year, Ostler et
>>> al.(http://www.nature.com/ncomms/journal/v3/n2/full/ncomms1666.html)
>>> established a revolutionary way of writing magnetic substrate using a
>>> heat pulse instead of a traditional magnetic field, which may increase
>>> data throughput on a hard disk by 1000 times in the future. They
>>> estimated the commercial usage of this new technology would be advent
>>> in about 10 years. That said, within the next 10 years, having
>>> triple-parity RAID in a data integrity demanding environment is still
>>> highly desirable.
>>>
>>> Unfortunately, due to conflicts between CDDL license of Oracle and GNU
>>> license of Linux, ZFS and hence triple-parity RAID cannot be ported to
>>> Linux. As a die-hard follower of the open source community, I am NOT
>>> exactly pleased by this kind of drama. But instead of claiming the
>>> grape to be sour, I decided to grow my own.
>>>
>>> The work I am going to present here builds on top of that of Adam
>>>
> Leventhal(http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c)
>>>
>>> and Peter Anvin(http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf).
>>> I will generalize Adam Leventhal's work using Peter Anvin's method,
>>> then pick another generator from the Galois field GF(2^8) to
>>> facilitate another triple-parity RAID algorithm that hopefully can be
>>> used by the part of open source community that is not CDDL compliant
>>> and beyond. System engineers who are not exactly familiar with Galois
>>> field theory may find themselves a great exposure for that in Peter
>>> Anvin's article.
>>>
>>> I will use the same notation as in Adam Leventhal's work: D_0, ...
>>> D_n-1 represents corresponding bytes in the n data disks in the array,
>>> addition + is bitwise XOR, multiplication * is multiplication in
>>> GF(2^8), multiplication in the power(for example,2(n-1) in  g^2(n-1)
>>> below) on the other hand, is multiplication in modulo ring Z_255.
>>>
>>> (1) P = D_0 + D_1 + ... + D_n-2 + D_n-1
>>> (2) Q = g^n-1 * D_0 + g^n-2 * D_1 + ... + g^1 * D_n-2 + D_n-1
>>>        = ((...((D_0) * g + D_1) * g + ...) * g + D_n-2) * g + D_n-1
>>> (3) R = g^2(n-1) * D_0 + g^2(n-2) * D_1 + ... + g^2 * D_n-2 + D_n-1
>>>        = ((...((D_0) * g^2 + D_1) * g^2 + ...) * g^2 + D_n-2) * g^2 +
>>> D_n-1
>>>
>>> P,Q,R are the definitions of the parity bytes, these are usually
>>> called P,Q,R syndromes in the literature. Adam Leventhal used
>>> generator {02} in the cyclic representation of Galois field GF(2^8), I
>>> instead use an arbitrary generator g in the definition of P,Q and R.
>>> For generator g, g^k is a generator if and only if k is relatively
>>> prime to 255. Since g^255 = 1, g^(-1) = g^254 is a generator for 254
>>> is relatively prime to 255. g^(-1) is the generator I am going to use
>>> to optimize my triple-parity RAID algorithm.
>>>
>>> Now let's prove we can always recover from 3 or less disk failures,
>>> namely, we can always solve (1), (2), (3) for a unique solution if 3
>>> or less of {P, Q, R, D_0, ... D_n-1} are unknowns.
>>>
>>> We start with the most difficult case, i.e., 3 of {D_0, ... D_n-1} are
>>> unknowns, or 3 data disks have failed and let's call them D_x, D_y and
>>> D_z. For (1), we move the constants on the right to the left and
>>> combine them with P and call the sum to be P_xyz, so (1) becomes
>>>
>>> (1)' P_xyz = D_x + D_y + D_z
>>>
>>> Similarly, (2) and (3) become
>>>
>>> (2)' Q_xyz = g^x * D_x + g^y * D_y + g^z * D_z
>>> (3)' R_xyz = g^2x * D_x + g^2y * D_y + g^2z * D_z
>>>
>>>  From (1)', we have D_z = P_xyz + D_x + D_y since A + A = 0 in a Galois
>>> field. Substitute this into (2)' and (3)', and move the constants to
>>> the left and call them Q_xyz' and R_xyz', we got
>>>
>>> (2)'' Q_xyz' = (g^x + g^z) * D_x + (g^y + g^z) * D_y
>>> (3)'' R_xyz' = (g^2x + g^2z) * D_x + (g^2y + g^2z) * D_y
>>>
>>> Here comes the trick, multiply (2)'' by g^y + g^z, the coefficient for
>>> D_y becomes (g^y + g^z) * (g^y + g^z) = g^2y + g^y * g^z + g^z * g^y +
>>> g^2z = g^2y + g^2z(A + A = 0 again), then add it to (3)'' and move the
>>> constants to the left and call it R_xyz'', we have
>>>
>>> (3)''' R_xyz'' = [(g^x + g^z) * (g^y + g^z) + g^2x + g^2z] * D_x
>>>                 = [g^(x+y) + g^(x+z) + g^(z+y) + g^2z + g^2x +g^2z] * D_x
>>>                 = [g^(x+y) + g^(x+z) + g^(z+y) + g^2x] * D_x (A + A =
>>> 0 again)
>>>                 = [g^y * (g^x + g^z) + g^x *(g^x + g^z)] * D_x
>>>                 = (g^y + g^x) * (g^x + g^z) * D_x
>>>
>>> Now because x, y, z are distinct integers, if we assume 0<= x, y, z<
>>> n<= 255, then neither g^y + g^x nor g^x + g^z can be zero since g is
>>> a generator and we can solve for D_x from (3)'''. A similar argument
>>> can be applied to D_y's coefficient in (2)'' to solve for D_y and from
>>> (1)' we can solve for D_z.
>>>
>>> In a failure of 3 disks involving 1 or more parity disks, we may use
>>> the equations not involving a failed data disk to uniquely solve for
>>> unknows representing the failed data disk bytes, then use the rest of
>>> the equations to recalculate the failed parity disk bytes.
>>>
>>> In a failure involving only two data disks, we may use an argument
>>> similar to above and two of the three equations to uniquely solve for
>>> the unknowns(you might need to observe that g^2 is also a generator
>>> since 2 is relatively prime to 255), the only question is: does the
>>> solution satisfy the third equation? The answer is it has to. The
>>> reason is we originally(before the two data disks failed) have a
>>> solution for the two unknowns that satisfies all three equations,
>>> hence also satisfies the two we used to solve for the unknowns; but
>>> now we uniquely solve for the unknowns from those two equations, so
>>> they have to be the original values.
>>>
>>> The arguments for other cases are similar, but only easier.
>>>
>>> There is an important observation here: The Gauss-Jordan elimination
>>> in the proof above can be apply to Adam Leventhal's code although one
>>> has 3 equations, the other has n+3. And this observaton has two
>>> implications:
>>>
>>> 1) If we replace g with generator {02} in GF(2^8), we have proven that
>>> the algorithm used in Adam Leventhal's code is sound.(I am sure Adam
>>> Leventhal has his own proof, but as another guy with a math
>>> background, I am not convinced until I see it.)
>>>
>>> 2) For other generators in GF(2^8), we can use the same procedure in
>>> the algorithm of Adam Leventhal's code once the corresponding
>>> logarithm and powers tables are replaced, this enable us to reuse most
>>> of the code in Adam Leventhal's code.
>>>
>>> So much for the math, now we enter neverland of a system enggineer.
>>> Let's consider GF(2^8)'s another generator {02}^(-1) and try to
>>> optimize the calculation for the Q ad R syndromes. For this purpose,
>>> we use the second equality in (2) and (3). The addition is just
>>> bitewise XOR, what we need to optimize is the multiplication by
>>> {02}^(-1). Following the way in Peter Anvin's article, we found that
>>> multiplication is implemented by the following bitwise operations:
>>>
>>> (x * {02}^(-1))_7 = x_0
>>> (x * {02}^(-1))_6 = x_7
>>> (x * {02}^(-1))_5 = x_6
>>> (x * {02}^(-1))_4 = x_5
>>> (x * {02}^(-1))_3 = x_4 + x_0
>>> (x * {02}^(-1))_2 = x_3 + x_0
>>> (x * {02}^(-1))_1 = x_2 + x_0
>>> (x * {02}^(-1))_0 = x_1
>>>
>>> For 32 bit architecture, the C code optimizing it is as follows:
>>>
>>> uint32_t v, vv;
>>> vv = (v>>  1)&  0x7f7f7f7f;
>>> vv ^= MASK(v)&  0x8e8e8e8e;
>>>
>>> uint32_t MASK(uint32_t v)
>>> {
>>>    v&= 0x01010101;
>>>    return (v<<  8) - v;
>>> }
>>>
>>> The code for 64 bit architecture is just a simple extension of this,
>>> or you may consult  Adam Leventhal's code.
>>>
>>> For arbitrary multiplication in the Gauss-Jordan elimination of the
>>> recovery process, we use the rule:
>>>
>>> A * B = C<==>  C = g^(log_g A + log_g B)
>>> A / B = C<==>  C = g^(log_g A - log_g B)
>>>
>>> where log_g is the discrete logarithm and g = {02}^(-1).
>>>
>>> And the tables for discrete logarithm and powers of {02}^(-1) is as
>>> follows:
>>>
>>> Powers table:
>>>
>>> 0x01, 0x8e, 0x47, 0xad, 0xd8, 0x6c, 0x36, 0x1b,
>>> 0x83, 0xcf, 0xe9, 0xfa, 0x7d, 0xb0, 0x58, 0x2c,
>>> 0x16, 0x0b, 0x8b, 0xcb, 0xeb, 0xfb, 0xf3, 0xf7,
>>> 0xf5, 0xf4, 0x7a, 0x3d, 0x90, 0x48, 0x24, 0x12,
>>> 0x09, 0x8a, 0x45, 0xac, 0x56, 0x2b, 0x9b, 0xc3,
>>> 0xef, 0xf9, 0xf2, 0x79, 0xb2, 0x59, 0xa2, 0x51,
>>> 0xa6, 0x53, 0xa7, 0xdd, 0xe0, 0x70, 0x38, 0x1c,
>>> 0x0e, 0x07, 0x8d, 0xc8, 0x64, 0x32, 0x19, 0x82,
>>> 0x41, 0xae, 0x57, 0xa5, 0xdc, 0x6e, 0x37, 0x95,
>>> 0xc4, 0x62, 0x31, 0x96, 0x4b, 0xab, 0xdb, 0xe3,
>>> 0xff, 0xf1, 0xf6, 0x7b, 0xb3, 0xd7, 0xe5, 0xfc,
>>> 0x7e, 0x3f, 0x91, 0xc6, 0x63, 0xbf, 0xd1, 0xe6,
>>> 0x73, 0xb7, 0xd5, 0xe4, 0x72, 0x39, 0x92, 0x49,
>>> 0xaa, 0x55, 0xa4, 0x52, 0x29, 0x9a, 0x4d, 0xa8,
>>> 0x54, 0x2a, 0x15, 0x84, 0x42, 0x21, 0x9e, 0x4f,
>>> 0xa9, 0xda, 0x6d, 0xb8, 0x5c, 0x2e, 0x17, 0x85,
>>> 0xcc, 0x66, 0x33, 0x97, 0xc5, 0xec, 0x76, 0x3b,
>>> 0x93, 0xc7, 0xed, 0xf8, 0x7c, 0x3e, 0x1f, 0x81,
>>> 0xce, 0x67, 0xbd, 0xd0, 0x68, 0x34, 0x1a, 0x0d,
>>> 0x88, 0x44, 0x22, 0x11, 0x86, 0x43, 0xaf, 0xd9,
>>> 0xe2, 0x71, 0xb6, 0x5b, 0xa3, 0xdf, 0xe1, 0xfe,
>>> 0x7f, 0xb1, 0xd6, 0x6b, 0xbb, 0xd3, 0xe7, 0xfd,
>>> 0xf0, 0x78, 0x3c, 0x1e, 0x0f, 0x89, 0xca, 0x65,
>>> 0xbc, 0x5e, 0x2f, 0x99, 0xc2, 0x61, 0xbe, 0x5f,
>>> 0xa1, 0xde, 0x6f, 0xb9, 0xd2, 0x69, 0xba, 0x5d,
>>> 0xa0, 0x50, 0x28, 0x14, 0x0a, 0x05, 0x8c, 0x46,
>>> 0x23, 0x9f, 0xc1, 0xee, 0x77, 0xb5, 0xd4, 0x6a,
>>> 0x35, 0x94, 0x4a, 0x25, 0x9c, 0x4e, 0x27, 0x9d,
>>> 0xc0, 0x60, 0x30, 0x18, 0x0c, 0x06, 0x03, 0x8f,
>>> 0xc9, 0xea, 0x75, 0xb4, 0x5a, 0x2d, 0x98, 0x4c,
>>> 0x26, 0x13, 0x87, 0xcd, 0xe8, 0x74, 0x3a, 0x1d,
>>> 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01
>>>
>>> Log table:
>>>
>>> 0x00, 0x00, 0xfe, 0xe6, 0xfd, 0xcd, 0xe5, 0x39,
>>> 0xfc, 0x20, 0xcc, 0x11, 0xe4, 0x97, 0x38, 0xb4,
>>> 0xfb, 0x9b, 0x1f, 0xf1, 0xcb, 0x72, 0x10, 0x7e,
>>> 0xe3, 0x3e, 0x96, 0x07, 0x37, 0xf7, 0xb3, 0x8e,
>>> 0xfa, 0x75, 0x9a, 0xd0, 0x1e, 0xdb, 0xf0, 0xde,
>>> 0xca, 0x6c, 0x71, 0x25, 0x0f, 0xed, 0x7d, 0xba,
>>> 0xe2, 0x4a, 0x3d, 0x82, 0x95, 0xd8, 0x06, 0x46,
>>> 0x36, 0x65, 0xf6, 0x87, 0xb2, 0x1b, 0x8d, 0x59,
>>> 0xf9, 0x40, 0x74, 0x9d, 0x99, 0x22, 0xcf, 0x02,
>>> 0x1d, 0x67, 0xda, 0x4c, 0xef, 0x6e, 0xdd, 0x77,
>>> 0xc9, 0x2f, 0x6b, 0x31, 0x70, 0x69, 0x24, 0x42,
>>> 0x0e, 0x2d, 0xec, 0xa3, 0x7c, 0xc7, 0xb9, 0xbf,
>>> 0xe1, 0xbd, 0x49, 0x5c, 0x3c, 0xb7, 0x81, 0x91,
>>> 0x94, 0xc5, 0xd7, 0xab, 0x05, 0x7a, 0x45, 0xc2,
>>> 0x35, 0xa1, 0x64, 0x60, 0xf5, 0xea, 0x86, 0xd4,
>>> 0xb1, 0x2b, 0x1a, 0x53, 0x8c, 0x0c, 0x58, 0xa8,
>>> 0xf8, 0x8f, 0x3f, 0x08, 0x73, 0x7f, 0x9c, 0xf2,
>>> 0x98, 0xb5, 0x21, 0x12, 0xce, 0x3a, 0x01, 0xe7,
>>> 0x1c, 0x5a, 0x66, 0x88, 0xd9, 0x47, 0x4b, 0x83,
>>> 0xee, 0xbb, 0x6d, 0x26, 0xdc, 0xdf, 0x76, 0xd1,
>>> 0xc8, 0xc0, 0x2e, 0xa4, 0x6a, 0x43, 0x30, 0x32,
>>> 0x6f, 0x78, 0x68, 0x4d, 0x23, 0x03, 0x41, 0x9e,
>>> 0x0d, 0xa9, 0x2c, 0x54, 0xeb, 0xd5, 0xa2, 0x61,
>>> 0x7b, 0xc3, 0xc6, 0xac, 0xb8, 0x92, 0xbe, 0x5d,
>>> 0xe0, 0xd2, 0xbc, 0x27, 0x48, 0x84, 0x5b, 0x89,
>>> 0x3b, 0xe8, 0xb6, 0x13, 0x80, 0xf3, 0x90, 0x09,
>>> 0x93, 0x5e, 0xc4, 0xad, 0xd6, 0x62, 0xaa, 0x55,
>>> 0x04, 0x9f, 0x79, 0x4e, 0x44, 0x33, 0xc1, 0xa5,
>>> 0x34, 0xa6, 0xa0, 0x4f, 0x63, 0x56, 0x5f, 0xae,
>>> 0xf4, 0x0a, 0xe9, 0x14, 0x85, 0x8a, 0xd3, 0x28,
>>> 0xb0, 0x51, 0x2a, 0x16, 0x19, 0x18, 0x52, 0x17,
>>> 0x8b, 0x29, 0x0b, 0x15, 0x57, 0xaf, 0xa7, 0x50
>>>
>>> The originally purpose of this work is to enable Linux to have
>>> triple-parity RAID, but I guess it is all right for the rest of the
>>> open source community to use it too. If you are not sure about your
>>> situation or you'd rather talk to me in private about other issues,
>>> please contact me at: creamyfish@gmail.com
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> Hi Alex and David,
>
> please keep up the work on this topic.  This is great stuff.  Extreme
> mathematics put to use.  At the moment I suggest to all my customers
> one redundancy disk for every four data disks in a RAID.  For larger
> RAIDs this gets expensive at times and one has to use RAID60
> combinations.
>
> If we could use RAID7 (will it be called that?) or RAID70 we could use
> more data disks, as the dropout probability changes.  I guess when
> using RAID7 it'd be safe enough to have a RAID of 16 data disks...
>
> So please allow me to thank you very much for your efforts!
>
> Greetings,
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-18 14:15     ` Alex
@ 2012-04-18 14:11       ` David Brown
  0 siblings, 0 replies; 37+ messages in thread
From: David Brown @ 2012-04-18 14:11 UTC (permalink / raw)
  To: Alex; +Cc: stefan.huebner, linux-raid

On 18/04/2012 16:15, Alex wrote:
> Hi Stefan,
> This is just a test of using the list. But still, thanks to your encouragement.
>

Well, your test worked - it woke me up and made me publish my work so 
far.  It is also, AFAIK, the best list for discussing these issues.  The 
md raid developers all follow the list, as do many users.

mvh.,

David



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-17 20:18     ` David Brown
@ 2012-04-17 20:54       ` Piergiorgio Sartor
  2012-04-18 18:22       ` Piergiorgio Sartor
  2012-04-19 18:16       ` H. Peter Anvin
  2 siblings, 0 replies; 37+ messages in thread
From: Piergiorgio Sartor @ 2012-04-17 20:54 UTC (permalink / raw)
  To: David Brown; +Cc: Piergiorgio Sartor, linux-raid

Hi David,

On Tue, Apr 17, 2012 at 10:18:55PM +0200, David Brown wrote:
[some stuff removed]

I think we need to understand systematic RS
a bit better.

The ideal case would be to have a RAID system
where, while adding a disk, the user can decide
if it (the disk) would be for data extension or
for redundancy (parity) extension.

This should always be possible and, in the first
place, performance should not be the target.

As mentioned before, par2 does it without too
many issues, so I fail to see why three parities
should be the limit.

Actually, I'm trying to catch some RS expert in
order to get an explanation (Q&A) on how the
whole thing works. Unfortunately experts seem to
be quite sluggish... :-)

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-17 17:16   ` Piergiorgio Sartor
@ 2012-04-17 20:18     ` David Brown
  2012-04-17 20:54       ` Piergiorgio Sartor
                         ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: David Brown @ 2012-04-17 20:18 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

On 17/04/12 19:16, Piergiorgio Sartor wrote:
> Hi David,
>
>> My current state is that I've got theory worked out and written up -
>> not just for triple parity, but for more parities as well.  For some
>> of it, I've got Python code to test and verify the maths.  It turns
>> out that triple parity can work well - but for quad parity the limit
>> is 21 data disks (using generators 2, 4, and 8), or up to 33 (using
>> for example 0x07, 0x35 and 0x8b as generators).  Realistically, I
>> think triple parity is the limit for practical implementations.
>
> Why is that? An RS code (255,251) should be possible, like
> it is a (255,223). What's the limitation?
> I'm sure there is even a "RAID-96", which is (96,64).
>
> My wild guess would be that the generators must be chosen
> in some way.
>
> Have you had a look at the "par2" code? That seems to be
> capable of doing a parametric RS, even if in 16bit words.
>
> bye,
>

It is certainly possible to make raid systems using as many disks as you 
want, and as many parities as you want.  The trick is to do so in an 
efficient manner.

Raid6 is implemented using simple polynomials over the Galois field
G(2⁸):

Py = g_y^0 . D0 + g_y^1 . D1 + g_y^2 . D2 + ... + g_y^n . Dn

For raid5, you have one parity generator, and g0 is 1 (i.e., a simple 
xor).  For raid6, you have two parity generators - 1 and 2.  For triple parity, we 
would use g0 = 1, g1 = 2 and g2 = 4.  The whole thing is therefore 
simple (or relatively simple!) to implement, and should run fast - 
calculating the third parity will only be slightly slower than 
calculating the second parity in raid6 is today, and recovery will be as 
fast as raid6 unless you need to recover from 3 data disk failures.
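
To make the structure concrete, here is a byte-at-a-time Python sketch
(field polynomial 0x11d assumed, helper names made up for the
illustration - nothing like an optimized implementation) that computes
the three parities with generators 1, 2 and 4 in the same Horner-style
pass used for the raid6 Q syndrome:

def gf_mul(a, b, poly=0x11d):
    # GF(2^8) multiply, reduced mod the raid6 field polynomial.
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return r

def syndromes(data, gens=(1, 2, 4)):
    # One parity byte per generator g: S_g = sum_j g^(n-1-j) * D_j,
    # computed Horner-style: multiply the accumulator by g, then xor in
    # the next data byte.  g = 1 gives plain XOR (P), g = 2 gives Q.
    out = []
    for g in gens:
        acc = 0
        for d in data:
            acc = gf_mul(acc, g) ^ d
        out.append(acc)
    return out

# One stripe of five data bytes (one byte per data disk):
p, q, r = syndromes([0x11, 0x22, 0x33, 0x44, 0x55])
print(hex(p), hex(q), hex(r))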

For quad parity, we can try g3 = 8 as the obvious next choice in the 
pattern.  Unfortunately, we start hitting conflicts.  To recover missing 
data, we have to solve multiple simultaneous equations over G(2⁸), whose 
coefficients depend on the index numbers of the missing disks.  With 
parity generators (1, 2, 4, 8), some of these combinations of missing 
disk indexes lead to insoluble equations when you have more than 21 disks.

I didn't find any neat way to prove this, or prove that all combinations 
of missing disks for parity generators (1, 2, 4) work out okay.  I wrote 
some brute-force checking code, and let my computer do the hard work.
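
A simplified Python sketch of that kind of brute-force check (not the
actual checking code; the helper names, the 0x11d field polynomial and
the structure of the search are assumptions made for the illustration)
might look like this:

from itertools import combinations

POLY = 0x11d  # raid6 field polynomial

def gf_mul(a, b):
    # GF(2^8) multiply, reduced mod POLY.
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= POLY
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def invertible(m):
    # Gaussian elimination over GF(2^8); m is a small square matrix.
    m = [row[:] for row in m]
    k = len(m)
    for col in range(k):
        piv = next((r for r in range(col, k) if m[r][col]), None)
        if piv is None:
            return False
        m[col], m[piv] = m[piv], m[col]
        inv = gf_pow(m[col][col], 254)       # a^254 = a^(-1) in GF(2^8)
        for r in range(k):
            if r != col and m[r][col]:
                f = gf_mul(m[r][col], inv)
                m[r] = [m[r][c] ^ gf_mul(f, m[col][c]) for c in range(k)]
    return True

def first_failure(gens, limit=40):
    # Add data disks one at a time; only failure combinations that
    # involve the newly added disk index need to be re-checked.  Each
    # combination of failed data disks, solved with each possible
    # subset of surviving parities, gives a matrix M[i][j] = g_i ^ x_j
    # that must be invertible.  (Slow but simple.)
    for n in range(1, limit + 1):
        new = n - 1
        for k in range(1, len(gens) + 1):
            for gs in combinations(gens, k):
                for rest in combinations(range(new), k - 1):
                    xs = rest + (new,)
                    m = [[gf_pow(g, x) for x in xs] for g in gs]
                    if not invertible(m):
                        return n
    return None

# Per the discussion above, (1, 2, 4, 8) should stop working somewhere
# past 21 data disks:
print(first_failure((1, 2, 4, 8)))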

There are other ways to make a raid system with quad parity or more. 
The parities don't have to be calculated in such a neat form as above - 
perhaps even something as simple as re-ordering the powers of the parity 
generators would help.  We could use a different operation rather than 
the "G(2⁸) multiply".  We could use different sized Galois fields.  Or 
we could try something completely different, such as RS codes. 
Somewhere there we would get quad parity and more - for as many disks as 
we like.

However, I would be surprised if there were another method that had the 
practicality of implementation that we get with this system.  There are 
/huge/ benefits in having the triple-parity raid be the same as raid6 
but with an extra parity, and in being able to generate it in the same way. 
  If quad parity is something that appeals to anyone, then it could be 
implemented at the same time - and 21 data disks (25 disks total) would 
be the limit.  Even though the same equations with different generators 
can give up to 33 data disks, I expect that compatibility with raid6 and 
triple-parity would outweigh the benefits of supporting more disks. 
After all, a 25 disk array is already pretty big for a single array.

mvh.,

David

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-17  7:58 ` David Brown
  2012-04-17 16:37   ` Stefan /*St0fF*/ Hübner
@ 2012-04-17 17:16   ` Piergiorgio Sartor
  2012-04-17 20:18     ` David Brown
  1 sibling, 1 reply; 37+ messages in thread
From: Piergiorgio Sartor @ 2012-04-17 17:16 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

Hi David,

> My current state is that I've got theory worked out and written up -
> not just for triple parity, but for more parities as well.  For some
> of it, I've got Python code to test and verify the maths.  It turns
> out that triple parity can work well - but for quad parity the limit
> is 21 data disks (using generators 2, 4, and 8), or up to 33 (using
> for example 0x07, 0x35 and 0x8b as generators).  Realistically, I
> think triple parity is the limit for practical implementations.

Why is that? An RS code (255,251) should be possible, like
it is a (255,223). What's the limitation?
I'm sure there is even a "RAID-96", which is (96,64).

My wild guess would be that the generators must be chosen
in some way.

Have you had a look at the "par2" code? That seems to be
capable of doing a parametric RS, even if in 16bit words.

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-17  7:58 ` David Brown
@ 2012-04-17 16:37   ` Stefan /*St0fF*/ Hübner
  2012-04-18 14:15     ` Alex
  2012-04-17 17:16   ` Piergiorgio Sartor
  1 sibling, 1 reply; 37+ messages in thread
From: Stefan /*St0fF*/ Hübner @ 2012-04-17 16:37 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

Am 17.04.2012 09:58, schrieb David Brown:
> Hi Alex,
>
> I've been playing around with triple-parity raid theory for a while.
> It's been mainly for my own interest - it's fun to do some deeper maths
> (I studied maths at university, but this is the first time I've done
> serious group theory in the last 20 years), fun to resurrect my LaTeX
> skills, and maybe it will be useful to the md raid developers.
>
> My current state is that I've got theory worked out and written up - not
> just for triple parity, but for more parities as well.  For some of it,
> I've got Python code to test and verify the maths.  It turns out that
> triple parity can work well - but for quad parity the limit is 21 data
> disks (using generators 2, 4, and 8), or up to 33 (using for example
> 0x07, 0x35 and 0x8b as generators).  Realistically, I think triple
> parity is the limit for practical implementations.
>
> I haven't finished the paper - in particular, I haven't filled out the
> simplifications of the triple parity work from more general matrix
> forms.  I also hope to work on implementation details for the
> calculations.  But maybe there is already enough to be of interest to
> some people, such as yourself.
>
> <http://redhouse.homelinux.net/raid/>
>
> mvh.,
>
> David
>
>
>
> On 17/04/2012 08:11, Alex wrote:
>> Thanks to Billy Crook who pointed out this is the right place for my
>> post.
>>
>> Adam Leventhal integrated triple-parity RAID into ZFS in 2009. The
>> necessity of triple-parity RAID is described in detail in Adam
>> Leventhal's
>>
article(http://cacm.acm.org/magazines/2010/1/55741-triple-parity-raid-and-beyond/fulltext).
>>
>> Basically it is because hard drive capacity has doubled
>> annually(Kryder's law) while hard drive throughput has improved far
>> more modestly, the time it takes to recover a faulty hard disk in a
>> double-parity RAID like RAID 6 might become so long(in 2010, it takes
>> about 4 hours to recover a 1TB SAS hard disk at its full speed) that
>> the remaining array(essentially a RAID 5 array, which has proven to be
>> unsafe) might fail and cause data loss during recovery. Earlier this
>> year, Ostler et
>> al.(http://www.nature.com/ncomms/journal/v3/n2/full/ncomms1666.html)
>> established a revolutionary way of writing magnetic substrate using a
>> heat pulse instead of a traditional magnetic field, which may increase
>> data throughput on a hard disk by 1000 times in the future. They
>> estimated the commercial usage of this new technology would be advent
>> in about 10 years. That said, within the next 10 years, having
>> triple-parity RAID in a data integrity demanding environment is still
>> highly desirable.
>>
>> Unfortunately, due to conflicts between CDDL license of Oracle and GNU
>> license of Linux, ZFS and hence triple-parity RAID cannot be ported to
>> Linux. As a die-hard follower of the open source community, I am NOT
>> exactly pleased by this kind of drama. But instead of claiming the
>> grape to be sour, I decided to grow my own.
>>
>> The work I am going to present here builds on top of that of Adam
>>
Leventhal(http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c)
>>
>> and Peter Anvin(http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf).
>> I will generalize Adam Leventhal's work using Peter Anvin's method,
>> then pick another generator from the Galois field GF(2^8) to
>> facilitate another triple-parity RAID algorithm that hopefully can be
>> used by the part of open source community that is not CDDL compliant
>> and beyond. System engineers who are not exactly familiar with Galois
>> field theory may find themselves a great exposure for that in Peter
>> Anvin's article.
>>
>> I will use the same notation as in Adam Leventhal's work: D_0, ...
>> D_n-1 represents corresponding bytes in the n data disks in the array,
>> addition + is bitwise XOR, multiplication * is multiplication in
>> GF(2^8), multiplication in the power(for example,2(n-1) in  g^2(n-1)
>> below) on the other hand, is multiplication in modulo ring Z_255.
>>
>> (1) P = D_0 + D_1 + ... + D_n-2 + D_n-1
>> (2) Q = g^n-1 * D_0 + g^n-2 * D_1 + ... + g^1 * D_n-2 + D_n-1
>>        = ((...((D_0) * g + D_1) * g + ...) * g + D_n-2) * g + D_n-1
>> (3) R = g^2(n-1) * D_0 + g^2(n-2) * D_1 + ... + g^2 * D_n-2 + D_n-1
>>        = ((...((D_0) * g^2 + D_1) * g^2 + ...) * g^2 + D_n-2) * g^2 +
>> D_n-1
>>
>> P,Q,R are the definitions of the parity bytes, these are usually
>> called P,Q,R syndromes in the literature. Adam Leventhal used
>> generator {02} in the cyclic representation of Galois field GF(2^8), I
>> instead use an arbitrary generator g in the definition of P,Q and R.
>> For generator g, g^k is a generator if and only if k is relatively
>> prime to 255. Since g^255 = 1, g^(-1) = g^254 is a generator for 254
>> is relatively prime to 255. g^(-1) is the generator I am going to use
>> to optimize my triple-parity RAID algorithm.
>>
>> Now let's prove we can always recover from 3 or less disk failures,
>> namely, we can always solve (1), (2), (3) for a unique solution if 3
>> or less of {P, Q, R, D_0, ... D_n-1} are unknowns.
>>
>> We start with the most difficult case, i.e., 3 of {D_0, ... D_n-1} are
>> unknowns, or 3 data disks have failed and let's call them D_x, D_y and
>> D_z. For (1), we move the constants on the right to the left and
>> combine them with P and call the sum to be P_xyz, so (1) becomes
>>
>> (1)' P_xyz = D_x + D_y + D_z
>>
>> Similarly, (2) and (3) become
>>
>> (2)' Q_xyz = g^x * D_x + g^y * D_y + g^z * D_z
>> (3)' R_xyz = g^2x * D_x + g^2y * D_y + g^2z * D_z
>>
>>  From (1)', we have D_z = P_xyz + D_x + D_y since A + A = 0 in a Galois
>> field. Substitute this into (2)' and (3)', and move the constants to
>> the left and call them Q_xyz' and R_xyz', we got
>>
>> (2)'' Q_xyz' = (g^x + g^z) * D_x + (g^y + g^z) * D_y
>> (3)'' R_xyz' = (g^2x + g^2z) * D_x + (g^2y + g^2z) * D_y
>>
>> Here comes the trick, multiply (2)'' by g^y + g^z, the coefficient for
>> D_y becomes (g^y + g^z) * (g^y + g^z) = g^2y + g^y * g^z + g^z * g^y +
>> g^2z = g^2y + g^2z(A + A = 0 again), then add it to (3)'' and move the
>> constants to the left and call it R_xyz'', we have
>>
>> (3)''' R_xyz'' = [(g^x + g^z) * (g^y + g^z) + g^2x + g^2z] * D_x
>>                 = [g^(x+y) + g^(x+z) + g^(z+y) + g^2z + g^2x +g^2z] * D_x
>>                 = [g^(x+y) + g^(x+z) + g^(z+y) + g^2x] * D_x (A + A =
>> 0 again)
>>                 = [g^y * (g^x + g^z) + g^x *(g^x + g^z)] * D_x
>>                 = (g^y + g^x) * (g^x + g^z) * D_x
>>
>> Now because x, y, z are distinct integers, if we assume 0<= x, y, z<
>> n<= 255, then neither g^y + g^x nor g^x + g^z can be zero since g is
>> a generator and we can solve for D_x from (3)'''. A similar argument
>> can be applied to D_y's coefficient in (2)'' to solve for D_y and from
>> (1)' we can solve for D_z.
>>
>> In a failure of 3 disks involving 1 or more parity disks, we may use
>> the equations not involving a failed data disk to uniquely solve for
>> unknows representing the failed data disk bytes, then use the rest of
>> the equations to recalculate the failed parity disk bytes.
>>
>> In a failure involving only two data disks, we may use an argument
>> similar to above and two of the three equations to uniquely solve for
>> the unknowns(you might need to observe that g^2 is also a generator
>> since 2 is relatively prime to 255), the only question is: does the
>> solution satisfy the third equation? The answer is it has to. The
>> reason is we originally(before the two data disks failed) have a
>> solution for the two unknowns that satisfies all three equations,
>> hence also satisfies the two we used to solve for the unknowns; but
>> now we uniquely solve for the unknowns from those two equations, so
>> they have to be the original values.
>>
>> The arguments for other cases are similar, but only easier.
>>
>> There is an important observation here: The Gauss-Jordan elimination
>> in the proof above can be apply to Adam Leventhal's code although one
>> has 3 equations, the other has n+3. And this observaton has two
>> implications:
>>
>> 1) If we replace g with generator {02} in GF(2^8), we have proven that
>> the algorithm used in Adam Leventhal's code is sound.(I am sure Adam
>> Leventhal has his own proof, but as another guy with a math
>> background, I am not convinced until I see it.)
>>
>> 2) For other generators in GF(2^8), we can use the same procedure in
>> the algorithm of Adam Leventhal's code once the corresponding
>> logarithm and powers tables are replaced, this enable us to reuse most
>> of the code in Adam Leventhal's code.
>>
>> So much for the math, now we enter neverland of a system enggineer.
>> Let's consider GF(2^8)'s another generator {02}^(-1) and try to
>> optimize the calculation for the Q ad R syndromes. For this purpose,
>> we use the second equality in (2) and (3). The addition is just
>> bitewise XOR, what we need to optimize is the multiplication by
>> {02}^(-1). Following the way in Peter Anvin's article, we found that
>> multiplication is implemented by the following bitwise operations:
>>
>> (x * {02}^(-1))_7 = x_0
>> (x * {02}^(-1))_6 = x_7
>> (x * {02}^(-1))_5 = x_6
>> (x * {02}^(-1))_4 = x_5
>> (x * {02}^(-1))_3 = x_4 + x_0
>> (x * {02}^(-1))_2 = x_3 + x_0
>> (x * {02}^(-1))_1 = x_2 + x_0
>> (x * {02}^(-1))_0 = x_1
>>
>> For 32 bit architecture, the C code optimizing it is as follows:
>>
>> uint32_t v, vv;
>> vv = (v>>  1)&  0x7f7f7f7f;
>> vv ^= MASK(v)&  0x8e8e8e8e;
>>
>> uint32_t MASK(uint32_t v)
>> {
>>    v&= 0x01010101;
>>    return (v<<  8) - v;
>> }
>>
>> The code for 64 bit architecture is just a simple extension of this,
>> or you may consult  Adam Leventhal's code.
>>
>> For arbitrary multiplication in the Gauss-Jordan elimination of the
>> recovery process, we use the rule:
>>
>> A * B = C<==>  C = g^(log_g A + log_g B)
>> A / B = C<==>  C = g^(log_g A - log_g B)
>>
>> where log_g is the discrete logarithm and g = {02}^(-1).
>>
>> And the tables for discrete logarithm and powers of {02}^(-1) is as
>> follows:
>>
>> Powers table:
>>
>> 0x01, 0x8e, 0x47, 0xad, 0xd8, 0x6c, 0x36, 0x1b,
>> 0x83, 0xcf, 0xe9, 0xfa, 0x7d, 0xb0, 0x58, 0x2c,
>> 0x16, 0x0b, 0x8b, 0xcb, 0xeb, 0xfb, 0xf3, 0xf7,
>> 0xf5, 0xf4, 0x7a, 0x3d, 0x90, 0x48, 0x24, 0x12,
>> 0x09, 0x8a, 0x45, 0xac, 0x56, 0x2b, 0x9b, 0xc3,
>> 0xef, 0xf9, 0xf2, 0x79, 0xb2, 0x59, 0xa2, 0x51,
>> 0xa6, 0x53, 0xa7, 0xdd, 0xe0, 0x70, 0x38, 0x1c,
>> 0x0e, 0x07, 0x8d, 0xc8, 0x64, 0x32, 0x19, 0x82,
>> 0x41, 0xae, 0x57, 0xa5, 0xdc, 0x6e, 0x37, 0x95,
>> 0xc4, 0x62, 0x31, 0x96, 0x4b, 0xab, 0xdb, 0xe3,
>> 0xff, 0xf1, 0xf6, 0x7b, 0xb3, 0xd7, 0xe5, 0xfc,
>> 0x7e, 0x3f, 0x91, 0xc6, 0x63, 0xbf, 0xd1, 0xe6,
>> 0x73, 0xb7, 0xd5, 0xe4, 0x72, 0x39, 0x92, 0x49,
>> 0xaa, 0x55, 0xa4, 0x52, 0x29, 0x9a, 0x4d, 0xa8,
>> 0x54, 0x2a, 0x15, 0x84, 0x42, 0x21, 0x9e, 0x4f,
>> 0xa9, 0xda, 0x6d, 0xb8, 0x5c, 0x2e, 0x17, 0x85,
>> 0xcc, 0x66, 0x33, 0x97, 0xc5, 0xec, 0x76, 0x3b,
>> 0x93, 0xc7, 0xed, 0xf8, 0x7c, 0x3e, 0x1f, 0x81,
>> 0xce, 0x67, 0xbd, 0xd0, 0x68, 0x34, 0x1a, 0x0d,
>> 0x88, 0x44, 0x22, 0x11, 0x86, 0x43, 0xaf, 0xd9,
>> 0xe2, 0x71, 0xb6, 0x5b, 0xa3, 0xdf, 0xe1, 0xfe,
>> 0x7f, 0xb1, 0xd6, 0x6b, 0xbb, 0xd3, 0xe7, 0xfd,
>> 0xf0, 0x78, 0x3c, 0x1e, 0x0f, 0x89, 0xca, 0x65,
>> 0xbc, 0x5e, 0x2f, 0x99, 0xc2, 0x61, 0xbe, 0x5f,
>> 0xa1, 0xde, 0x6f, 0xb9, 0xd2, 0x69, 0xba, 0x5d,
>> 0xa0, 0x50, 0x28, 0x14, 0x0a, 0x05, 0x8c, 0x46,
>> 0x23, 0x9f, 0xc1, 0xee, 0x77, 0xb5, 0xd4, 0x6a,
>> 0x35, 0x94, 0x4a, 0x25, 0x9c, 0x4e, 0x27, 0x9d,
>> 0xc0, 0x60, 0x30, 0x18, 0x0c, 0x06, 0x03, 0x8f,
>> 0xc9, 0xea, 0x75, 0xb4, 0x5a, 0x2d, 0x98, 0x4c,
>> 0x26, 0x13, 0x87, 0xcd, 0xe8, 0x74, 0x3a, 0x1d,
>> 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01
>>
>> Log table:
>>
>> 0x00, 0x00, 0xfe, 0xe6, 0xfd, 0xcd, 0xe5, 0x39,
>> 0xfc, 0x20, 0xcc, 0x11, 0xe4, 0x97, 0x38, 0xb4,
>> 0xfb, 0x9b, 0x1f, 0xf1, 0xcb, 0x72, 0x10, 0x7e,
>> 0xe3, 0x3e, 0x96, 0x07, 0x37, 0xf7, 0xb3, 0x8e,
>> 0xfa, 0x75, 0x9a, 0xd0, 0x1e, 0xdb, 0xf0, 0xde,
>> 0xca, 0x6c, 0x71, 0x25, 0x0f, 0xed, 0x7d, 0xba,
>> 0xe2, 0x4a, 0x3d, 0x82, 0x95, 0xd8, 0x06, 0x46,
>> 0x36, 0x65, 0xf6, 0x87, 0xb2, 0x1b, 0x8d, 0x59,
>> 0xf9, 0x40, 0x74, 0x9d, 0x99, 0x22, 0xcf, 0x02,
>> 0x1d, 0x67, 0xda, 0x4c, 0xef, 0x6e, 0xdd, 0x77,
>> 0xc9, 0x2f, 0x6b, 0x31, 0x70, 0x69, 0x24, 0x42,
>> 0x0e, 0x2d, 0xec, 0xa3, 0x7c, 0xc7, 0xb9, 0xbf,
>> 0xe1, 0xbd, 0x49, 0x5c, 0x3c, 0xb7, 0x81, 0x91,
>> 0x94, 0xc5, 0xd7, 0xab, 0x05, 0x7a, 0x45, 0xc2,
>> 0x35, 0xa1, 0x64, 0x60, 0xf5, 0xea, 0x86, 0xd4,
>> 0xb1, 0x2b, 0x1a, 0x53, 0x8c, 0x0c, 0x58, 0xa8,
>> 0xf8, 0x8f, 0x3f, 0x08, 0x73, 0x7f, 0x9c, 0xf2,
>> 0x98, 0xb5, 0x21, 0x12, 0xce, 0x3a, 0x01, 0xe7,
>> 0x1c, 0x5a, 0x66, 0x88, 0xd9, 0x47, 0x4b, 0x83,
>> 0xee, 0xbb, 0x6d, 0x26, 0xdc, 0xdf, 0x76, 0xd1,
>> 0xc8, 0xc0, 0x2e, 0xa4, 0x6a, 0x43, 0x30, 0x32,
>> 0x6f, 0x78, 0x68, 0x4d, 0x23, 0x03, 0x41, 0x9e,
>> 0x0d, 0xa9, 0x2c, 0x54, 0xeb, 0xd5, 0xa2, 0x61,
>> 0x7b, 0xc3, 0xc6, 0xac, 0xb8, 0x92, 0xbe, 0x5d,
>> 0xe0, 0xd2, 0xbc, 0x27, 0x48, 0x84, 0x5b, 0x89,
>> 0x3b, 0xe8, 0xb6, 0x13, 0x80, 0xf3, 0x90, 0x09,
>> 0x93, 0x5e, 0xc4, 0xad, 0xd6, 0x62, 0xaa, 0x55,
>> 0x04, 0x9f, 0x79, 0x4e, 0x44, 0x33, 0xc1, 0xa5,
>> 0x34, 0xa6, 0xa0, 0x4f, 0x63, 0x56, 0x5f, 0xae,
>> 0xf4, 0x0a, 0xe9, 0x14, 0x85, 0x8a, 0xd3, 0x28,
>> 0xb0, 0x51, 0x2a, 0x16, 0x19, 0x18, 0x52, 0x17,
>> 0x8b, 0x29, 0x0b, 0x15, 0x57, 0xaf, 0xa7, 0x50
>>
>> The originally purpose of this work is to enable Linux to have
>> triple-parity RAID, but I guess it is all right for the rest of the
>> open source community to use it too. If you are not sure about your
>> situation or you'd rather talk to me in private about other issues,
>> please contact me at: creamyfish@gmail.com
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hi Alex and David,

please keep up the work on this topic.  This is great stuff.  Extreme
mathematics put to use.  At the moment I suggest to all my customers
one redundancy disk for every four data disks in a RAID.  For larger
RAIDs this gets expensive at times and one has to use RAID60
combinations.

If we could use RAID7 (will it be called that?) or RAID70 we could use
more data disks, as the dropout probability changes.  I guess when
using RAID7 it'd be safe enough to have a RAID of 16 data disks...

So please allow me to thank you very much for your efforts!

Greetings,
Stefan

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Is this enough for us to have triple-parity RAID?
  2012-04-17  6:11 Alex
@ 2012-04-17  7:58 ` David Brown
  2012-04-17 16:37   ` Stefan /*St0fF*/ Hübner
  2012-04-17 17:16   ` Piergiorgio Sartor
  2012-04-20  7:45 ` Stan Hoeppner
  1 sibling, 2 replies; 37+ messages in thread
From: David Brown @ 2012-04-17  7:58 UTC (permalink / raw)
  To: linux-raid

Hi Alex,

I've been playing around with triple-parity raid theory for a while. 
It's been mainly for my own interest - it's fun to do some deeper maths 
(I studied maths at university, but this is the first time I've done 
serious group theory in the last 20 years), fun to resurrect my LaTeX 
skills, and maybe it will be useful to the md raid developers.

My current state is that I've got theory worked out and written up - not 
just for triple parity, but for more parities as well.  For some of it, 
I've got Python code to test and verify the maths.  It turns out that 
triple parity can work well - but for quad parity the limit is 21 data 
disks (using generators 2, 4, and 8), or up to 33 (using for example 
0x07, 0x35 and 0x8b as generators).  Realistically, I think triple 
parity is the limit for practical implementations.

I haven't finished the paper - in particular, I haven't filled out the 
simplifications of the triple parity work from more general matrix 
forms.  I also hope to work on implementation details for the 
calculations.  But maybe there is already enough to be of interest to 
some people, such as yourself.

<http://redhouse.homelinux.net/raid/>

mvh.,

David



On 17/04/2012 08:11, Alex wrote:
> Thanks to Billy Crook who pointed out this is the right place for my post.
>
> Adam Leventhal integrated triple-parity RAID into ZFS in 2009. The
> necessity of triple-parity RAID is described in detail in Adam
> Leventhal's article(http://cacm.acm.org/magazines/2010/1/55741-triple-parity-raid-and-beyond/fulltext).
> Basically it is because hard drive capacity has doubled
> annually(Kryder's law) while hard drive throughput has improved far
> more modestly, the time it takes to recover a faulty hard disk in a
> double-parity RAID like RAID 6 might become so long(in 2010, it takes
> about 4 hours to recover a 1TB SAS hard disk at its full speed) that
> the remaining array(essentially a RAID 5 array, which has proven to be
> unsafe) might fail and cause data loss during recovery. Earlier this
> year, Ostler et
> al.(http://www.nature.com/ncomms/journal/v3/n2/full/ncomms1666.html)
> established a revolutionary way of writing magnetic substrate using a
> heat pulse instead of a traditional magnetic field, which may increase
> data throughput on a hard disk by 1000 times in the future. They
> estimated the commercial usage of this new technology would be advent
> in about 10 years. That said, within the next 10 years, having
> triple-parity RAID in a data integrity demanding environment is still
> highly desirable.
>
> Unfortunately, due to conflicts between CDDL license of Oracle and GNU
> license of Linux, ZFS and hence triple-parity RAID cannot be ported to
> Linux. As a die-hard follower of the open source community, I am NOT
> exactly pleased by this kind of drama. But instead of claiming the
> grape to be sour, I decided to grow my own.
>
> The work I am going to present here builds on top of that of Adam
> Leventhal(http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c)
> and Peter Anvin(http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf).
> I will generalize Adam Leventhal's work using Peter Anvin's method,
> then pick another generator from the Galois field GF(2^8) to
> facilitate another triple-parity RAID algorithm that hopefully can be
> used by the part of open source community that is not CDDL compliant
> and beyond. System engineers who are not exactly familiar with Galois
> field theory may find themselves a great exposure for that in Peter
> Anvin's article.
>
> I will use the same notation as in Adam Leventhal's work: D_0, ...
> D_n-1 represents corresponding bytes in the n data disks in the array,
> addition + is bitwise XOR, multiplication * is multiplication in
> GF(2^8), multiplication in the power(for example,2(n-1) in  g^2(n-1)
> below) on the other hand, is multiplication in modulo ring Z_255.
>
> (1) P = D_0 + D_1 + ... + D_n-2 + D_n-1
> (2) Q = g^n-1 * D_0 + g^n-2 * D_1 + ... + g^1 * D_n-2 + D_n-1
>        = ((...((D_0) * g + D_1) * g + ...) * g + D_n-2) * g + D_n-1
> (3) R = g^2(n-1) * D_0 + g^2(n-2) * D_1 + ... + g^2 * D_n-2 + D_n-1
>        = ((...((D_0) * g^2 + D_1) * g^2 + ...) * g^2 + D_n-2) * g^2 + D_n-1
>
> P,Q,R are the definitions of the parity bytes, these are usually
> called P,Q,R syndromes in the literature. Adam Leventhal used
> generator {02} in the cyclic representation of Galois field GF(2^8), I
> instead use an arbitrary generator g in the definition of P,Q and R.
> For generator g, g^k is a generator if and only if k is relatively
> prime to 255. Since g^255 = 1, g^(-1) = g^254 is a generator for 254
> is relatively prime to 255. g^(-1) is the generator I am going to use
> to optimize my triple-parity RAID algorithm.
>
> Now let's prove we can always recover from 3 or less disk failures,
> namely, we can always solve (1), (2), (3) for a unique solution if 3
> or less of {P, Q, R, D_0, ... D_n-1} are unknowns.
>
> We start with the most difficult case, i.e., 3 of {D_0, ... D_n-1} are
> unknowns, or 3 data disks have failed and let's call them D_x, D_y and
> D_z. For (1), we move the constants on the right to the left and
> combine them with P and call the sum to be P_xyz, so (1) becomes
>
> (1)' P_xyz = D_x + D_y + D_z
>
> Similarly, (2) and (3) become
>
> (2)' Q_xyz = g^x * D_x + g^y * D_y + g^z * D_z
> (3)' R_xyz = g^2x * D_x + g^2y * D_y + g^2z * D_z
>
>  From (1)', we have D_z = P_xyz + D_x + D_y since A + A = 0 in a Galois
> field. Substitute this into (2)' and (3)', and move the constants to
> the left and call them Q_xyz' and R_xyz', we got
>
> (2)'' Q_xyz' = (g^x + g^z) * D_x + (g^y + g^z) * D_y
> (3)'' R_xyz' = (g^2x + g^2z) * D_x + (g^2y + g^2z) * D_y
>
> Here comes the trick, multiply (2)'' by g^y + g^z, the coefficient for
> D_y becomes (g^y + g^z) * (g^y + g^z) = g^2y + g^y * g^z + g^z * g^y +
> g^2z = g^2y + g^2z(A + A = 0 again), then add it to (3)'' and move the
> constants to the left and call it R_xyz'', we have
>
> (3)''' R_xyz'' = [(g^x + g^z) * (g^y + g^z) + g^2x + g^2z] * D_x
>                 = [g^(x+y) + g^(x+z) + g^(z+y) + g^2z + g^2x +g^2z] * D_x
>                 = [g^(x+y) + g^(x+z) + g^(z+y) + g^2x] * D_x (A + A = 0 again)
>                 = [g^y * (g^x + g^z) + g^x *(g^x + g^z)] * D_x
>                 = (g^y + g^x) * (g^x + g^z) * D_x
>
> Now because x, y, z are distinct integers, if we assume 0<= x, y, z<
> n<= 255, then neither g^y + g^x nor g^x + g^z can be zero since g is
> a generator and we can solve for D_x from (3)'''. A similar argument
> can be applied to D_y's coefficient in (2)'' to solve for D_y and from
> (1)' we can solve for D_z.
>
> In a failure of 3 disks involving 1 or more parity disks, we may use
> the equations not involving a failed data disk to uniquely solve for
> unknows representing the failed data disk bytes, then use the rest of
> the equations to recalculate the failed parity disk bytes.
>
> In a failure involving only two data disks, we may use an argument
> similar to above and two of the three equations to uniquely solve for
> the unknowns(you might need to observe that g^2 is also a generator
> since 2 is relatively prime to 255), the only question is: does the
> solution satisfy the third equation? The answer is it has to. The
> reason is we originally(before the two data disks failed) have a
> solution for the two unknowns that satisfies all three equations,
> hence also satisfies the two we used to solve for the unknowns; but
> now we uniquely solve for the unknowns from those two equations, so
> they have to be the original values.
>
> The arguments for other cases are similar, but only easier.
>
> There is an important observation here: The Gauss-Jordan elimination
> in the proof above can be apply to Adam Leventhal's code although one
> has 3 equations, the other has n+3. And this observaton has two
> implications:
>
> 1) If we replace g with generator {02} in GF(2^8), we have proven that
> the algorithm used in Adam Leventhal's code is sound.(I am sure Adam
> Leventhal has his own proof, but as another guy with a math
> background, I am not convinced until I see it.)
>
> 2) For other generators in GF(2^8), we can use the same procedure in
> the algorithm of Adam Leventhal's code once the corresponding
> logarithm and powers tables are replaced, this enable us to reuse most
> of the code in Adam Leventhal's code.
>
> So much for the math, now we enter neverland of a system enggineer.
> Let's consider GF(2^8)'s another generator {02}^(-1) and try to
> optimize the calculation for the Q ad R syndromes. For this purpose,
> we use the second equality in (2) and (3). The addition is just
> bitewise XOR, what we need to optimize is the multiplication by
> {02}^(-1). Following the way in Peter Anvin's article, we found that
> multiplication is implemented by the following bitwise operations:
>
> (x * {02}^(-1))_7 = x_0
> (x * {02}^(-1))_6 = x_7
> (x * {02}^(-1))_5 = x_6
> (x * {02}^(-1))_4 = x_5
> (x * {02}^(-1))_3 = x_4 + x_0
> (x * {02}^(-1))_2 = x_3 + x_0
> (x * {02}^(-1))_1 = x_2 + x_0
> (x * {02}^(-1))_0 = x_1
>
> For 32 bit architecture, the C code optimizing it is as follows:
>
> uint32_t v, vv;
> vv = (v>>  1)&  0x7f7f7f7f;
> vv ^= MASK(v)&  0x8e8e8e8e;
>
> uint32_t MASK(uint32_t v)
> {
>    v&= 0x01010101;
>    return (v<<  8) - v;
> }
>
> The code for 64 bit architecture is just a simple extension of this,
> or you may consult  Adam Leventhal's code.
>
> For arbitrary multiplication in the Gauss-Jordan elimination of the
> recovery process, we use the rule:
>
> A * B = C<==>  C = g^(log_g A + log_g B)
> A / B = C<==>  C = g^(log_g A - log_g B)
>
> where log_g is the discrete logarithm and g = {02}^(-1).
>
> And the tables for discrete logarithm and powers of {02}^(-1) is as follows:
>
> Powers table:
>
> 0x01, 0x8e, 0x47, 0xad, 0xd8, 0x6c, 0x36, 0x1b,
> 0x83, 0xcf, 0xe9, 0xfa, 0x7d, 0xb0, 0x58, 0x2c,
> 0x16, 0x0b, 0x8b, 0xcb, 0xeb, 0xfb, 0xf3, 0xf7,
> 0xf5, 0xf4, 0x7a, 0x3d, 0x90, 0x48, 0x24, 0x12,
> 0x09, 0x8a, 0x45, 0xac, 0x56, 0x2b, 0x9b, 0xc3,
> 0xef, 0xf9, 0xf2, 0x79, 0xb2, 0x59, 0xa2, 0x51,
> 0xa6, 0x53, 0xa7, 0xdd, 0xe0, 0x70, 0x38, 0x1c,
> 0x0e, 0x07, 0x8d, 0xc8, 0x64, 0x32, 0x19, 0x82,
> 0x41, 0xae, 0x57, 0xa5, 0xdc, 0x6e, 0x37, 0x95,
> 0xc4, 0x62, 0x31, 0x96, 0x4b, 0xab, 0xdb, 0xe3,
> 0xff, 0xf1, 0xf6, 0x7b, 0xb3, 0xd7, 0xe5, 0xfc,
> 0x7e, 0x3f, 0x91, 0xc6, 0x63, 0xbf, 0xd1, 0xe6,
> 0x73, 0xb7, 0xd5, 0xe4, 0x72, 0x39, 0x92, 0x49,
> 0xaa, 0x55, 0xa4, 0x52, 0x29, 0x9a, 0x4d, 0xa8,
> 0x54, 0x2a, 0x15, 0x84, 0x42, 0x21, 0x9e, 0x4f,
> 0xa9, 0xda, 0x6d, 0xb8, 0x5c, 0x2e, 0x17, 0x85,
> 0xcc, 0x66, 0x33, 0x97, 0xc5, 0xec, 0x76, 0x3b,
> 0x93, 0xc7, 0xed, 0xf8, 0x7c, 0x3e, 0x1f, 0x81,
> 0xce, 0x67, 0xbd, 0xd0, 0x68, 0x34, 0x1a, 0x0d,
> 0x88, 0x44, 0x22, 0x11, 0x86, 0x43, 0xaf, 0xd9,
> 0xe2, 0x71, 0xb6, 0x5b, 0xa3, 0xdf, 0xe1, 0xfe,
> 0x7f, 0xb1, 0xd6, 0x6b, 0xbb, 0xd3, 0xe7, 0xfd,
> 0xf0, 0x78, 0x3c, 0x1e, 0x0f, 0x89, 0xca, 0x65,
> 0xbc, 0x5e, 0x2f, 0x99, 0xc2, 0x61, 0xbe, 0x5f,
> 0xa1, 0xde, 0x6f, 0xb9, 0xd2, 0x69, 0xba, 0x5d,
> 0xa0, 0x50, 0x28, 0x14, 0x0a, 0x05, 0x8c, 0x46,
> 0x23, 0x9f, 0xc1, 0xee, 0x77, 0xb5, 0xd4, 0x6a,
> 0x35, 0x94, 0x4a, 0x25, 0x9c, 0x4e, 0x27, 0x9d,
> 0xc0, 0x60, 0x30, 0x18, 0x0c, 0x06, 0x03, 0x8f,
> 0xc9, 0xea, 0x75, 0xb4, 0x5a, 0x2d, 0x98, 0x4c,
> 0x26, 0x13, 0x87, 0xcd, 0xe8, 0x74, 0x3a, 0x1d,
> 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01
>
> Log table:
>
> 0x00, 0x00, 0xfe, 0xe6, 0xfd, 0xcd, 0xe5, 0x39,
> 0xfc, 0x20, 0xcc, 0x11, 0xe4, 0x97, 0x38, 0xb4,
> 0xfb, 0x9b, 0x1f, 0xf1, 0xcb, 0x72, 0x10, 0x7e,
> 0xe3, 0x3e, 0x96, 0x07, 0x37, 0xf7, 0xb3, 0x8e,
> 0xfa, 0x75, 0x9a, 0xd0, 0x1e, 0xdb, 0xf0, 0xde,
> 0xca, 0x6c, 0x71, 0x25, 0x0f, 0xed, 0x7d, 0xba,
> 0xe2, 0x4a, 0x3d, 0x82, 0x95, 0xd8, 0x06, 0x46,
> 0x36, 0x65, 0xf6, 0x87, 0xb2, 0x1b, 0x8d, 0x59,
> 0xf9, 0x40, 0x74, 0x9d, 0x99, 0x22, 0xcf, 0x02,
> 0x1d, 0x67, 0xda, 0x4c, 0xef, 0x6e, 0xdd, 0x77,
> 0xc9, 0x2f, 0x6b, 0x31, 0x70, 0x69, 0x24, 0x42,
> 0x0e, 0x2d, 0xec, 0xa3, 0x7c, 0xc7, 0xb9, 0xbf,
> 0xe1, 0xbd, 0x49, 0x5c, 0x3c, 0xb7, 0x81, 0x91,
> 0x94, 0xc5, 0xd7, 0xab, 0x05, 0x7a, 0x45, 0xc2,
> 0x35, 0xa1, 0x64, 0x60, 0xf5, 0xea, 0x86, 0xd4,
> 0xb1, 0x2b, 0x1a, 0x53, 0x8c, 0x0c, 0x58, 0xa8,
> 0xf8, 0x8f, 0x3f, 0x08, 0x73, 0x7f, 0x9c, 0xf2,
> 0x98, 0xb5, 0x21, 0x12, 0xce, 0x3a, 0x01, 0xe7,
> 0x1c, 0x5a, 0x66, 0x88, 0xd9, 0x47, 0x4b, 0x83,
> 0xee, 0xbb, 0x6d, 0x26, 0xdc, 0xdf, 0x76, 0xd1,
> 0xc8, 0xc0, 0x2e, 0xa4, 0x6a, 0x43, 0x30, 0x32,
> 0x6f, 0x78, 0x68, 0x4d, 0x23, 0x03, 0x41, 0x9e,
> 0x0d, 0xa9, 0x2c, 0x54, 0xeb, 0xd5, 0xa2, 0x61,
> 0x7b, 0xc3, 0xc6, 0xac, 0xb8, 0x92, 0xbe, 0x5d,
> 0xe0, 0xd2, 0xbc, 0x27, 0x48, 0x84, 0x5b, 0x89,
> 0x3b, 0xe8, 0xb6, 0x13, 0x80, 0xf3, 0x90, 0x09,
> 0x93, 0x5e, 0xc4, 0xad, 0xd6, 0x62, 0xaa, 0x55,
> 0x04, 0x9f, 0x79, 0x4e, 0x44, 0x33, 0xc1, 0xa5,
> 0x34, 0xa6, 0xa0, 0x4f, 0x63, 0x56, 0x5f, 0xae,
> 0xf4, 0x0a, 0xe9, 0x14, 0x85, 0x8a, 0xd3, 0x28,
> 0xb0, 0x51, 0x2a, 0x16, 0x19, 0x18, 0x52, 0x17,
> 0x8b, 0x29, 0x0b, 0x15, 0x57, 0xaf, 0xa7, 0x50
>
> The original purpose of this work is to enable Linux to have
> triple-parity RAID, but I guess it is all right for the rest of the
> open source community to use it too. If you are not sure about your
> situation, or you'd rather talk to me in private about other issues,
> please contact me at: creamyfish@gmail.com



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Is this enough for us to have triple-parity RAID?
@ 2012-04-17  6:11 Alex
  2012-04-17  7:58 ` David Brown
  2012-04-20  7:45 ` Stan Hoeppner
  0 siblings, 2 replies; 37+ messages in thread
From: Alex @ 2012-04-17  6:11 UTC (permalink / raw)
  To: linux-raid

Thanks to Billy Crook who pointed out this is the right place for my post.

Adam Leventhal integrated triple-parity RAID into ZFS in 2009. The
necessity of triple-parity RAID is described in detail in his
article (http://cacm.acm.org/magazines/2010/1/55741-triple-parity-raid-and-beyond/fulltext).
In short, hard drive capacity has roughly doubled every year
(Kryder's law) while hard drive throughput has improved far more
modestly, so the time it takes to rebuild a failed disk in a
double-parity array like RAID 6 can become very long (in 2010 it took
about 4 hours to rebuild a 1TB SAS disk at full speed). During that
window the remaining array is essentially RAID 5, which has proven to
be unsafe, so a further failure causes data loss. Earlier this year,
Ostler et al.
(http://www.nature.com/ncomms/journal/v3/n2/full/ncomms1666.html)
demonstrated a revolutionary way of writing a magnetic substrate with
a heat pulse instead of a traditional magnetic field, which may
eventually increase hard disk write throughput by a factor of 1000.
They estimated that commercial use of this technology is about 10
years away. So for at least the next 10 years, triple-parity RAID
remains highly desirable in environments that demand data integrity.

Unfortunately, due to the conflict between the CDDL license covering
ZFS and the GPL license of the Linux kernel, ZFS, and with it
triple-parity RAID, cannot be ported into Linux. As a die-hard
follower of the open source community, I am NOT exactly pleased by
this kind of drama. But instead of calling the grapes sour, I decided
to grow my own.

The work I am going to present here builds on top of that of Adam
Leventhal (http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c)
and H. Peter Anvin (http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf).
I will generalize Adam Leventhal's construction using Peter Anvin's
method, then pick a different generator of the Galois field GF(2^8)
to obtain a triple-parity RAID algorithm that can hopefully be used
by the part of the open source community that is not CDDL compatible,
and beyond. System engineers who are not familiar with Galois field
theory will find a good introduction in Peter Anvin's article.

I will use the same notation as Adam Leventhal: D_0, ..., D_n-1
denote the corresponding bytes of the n data disks in the array,
addition + is bitwise XOR, and multiplication * is multiplication in
GF(2^8). Multiplication inside an exponent (for example, 2(n-1) in
g^2(n-1) below), on the other hand, is multiplication in the modulo
ring Z_255.

(1) P = D_0 + D_1 + ... + D_n-2 + D_n-1
(2) Q = g^(n-1) * D_0 + g^(n-2) * D_1 + ... + g^1 * D_n-2 + D_n-1
      = ((...((D_0) * g + D_1) * g + ...) * g + D_n-2) * g + D_n-1
(3) R = g^2(n-1) * D_0 + g^2(n-2) * D_1 + ... + g^2 * D_n-2 + D_n-1
      = ((...((D_0) * g^2 + D_1) * g^2 + ...) * g^2 + D_n-2) * g^2 + D_n-1

(1), (2) and (3) define the parity bytes P, Q and R, usually called
the P, Q and R syndromes in the literature. Adam Leventhal used the
generator {02} of the (cyclic) multiplicative group of GF(2^8); I
instead use an arbitrary generator g in the definition of P, Q and R.
For a generator g, g^k is again a generator if and only if k is
relatively prime to 255. Since g^255 = 1, g^(-1) = g^254 is a
generator, because 254 is relatively prime to 255. g^(-1) is the
generator I am going to use to optimize my triple-parity RAID
algorithm.
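
To make the Horner form in (2) and (3) concrete, here is a minimal
per-byte sketch (real code would of course work on whole blocks at a
time). The function names are mine, and gf_mul_g() is written for
g = {02}^(-1), the generator chosen further below:

#include <stddef.h>
#include <stdint.h>

/* Multiply one byte by g = {02}^(-1) = 0x8e in GF(2^8). */
static uint8_t gf_mul_g(uint8_t x)
{
	return (uint8_t)((x >> 1) ^ ((x & 1) ? 0x8e : 0));
}

/* Horner evaluation of (1), (2), (3) for one byte column, d[i] = D_i. */
static void syndromes(const uint8_t *d, size_t n,
		      uint8_t *p, uint8_t *q, uint8_t *r)
{
	uint8_t P = 0, Q = 0, R = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		P ^= d[i];                         /* (1) */
		Q  = gf_mul_g(Q) ^ d[i];           /* (2): times g, plus D_i */
		R  = gf_mul_g(gf_mul_g(R)) ^ d[i]; /* (3): times g^2, plus D_i */
	}
	*p = P;
	*q = Q;
	*r = R;
}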

Now let's prove we can always recover from 3 or fewer disk failures;
namely, we can always solve (1), (2), (3) for a unique solution when
3 or fewer of {P, Q, R, D_0, ..., D_n-1} are unknown.

We start with the most difficult case: 3 of {D_0, ..., D_n-1} are
unknown, i.e. 3 data disks have failed; call them D_x, D_y and D_z.
In (1), we move the known terms from the right-hand side to the left,
combine them with P and call the sum P_xyz, so (1) becomes

(1)' P_xyz = D_x + D_y + D_z

Similarly, (2) and (3) become

(2)' Q_xyz = g^x * D_x + g^y * D_y + g^z * D_z
(3)' R_xyz = g^2x * D_x + g^2y * D_y + g^2z * D_z

From (1)' we have D_z = P_xyz + D_x + D_y, since A + A = 0 in a
Galois field of characteristic 2. Substituting this into (2)' and
(3)', moving the constants to the left and calling them Q_xyz' and
R_xyz', we get

(2)'' Q_xyz' = (g^x + g^z) * D_x + (g^y + g^z) * D_y
(3)'' R_xyz' = (g^2x + g^2z) * D_x + (g^2y + g^2z) * D_y

Here comes the trick: multiply (2)'' by g^y + g^z, so the coefficient
of D_y becomes (g^y + g^z) * (g^y + g^z) = g^2y + g^y * g^z +
g^z * g^y + g^2z = g^2y + g^2z (A + A = 0 again); then add the result
to (3)'', move the constants to the left and call them R_xyz''. We
have

(3)''' R_xyz'' = [(g^x + g^z) * (g^y + g^z) + g^2x + g^2z] * D_x
               = [g^(x+y) + g^(x+z) + g^(z+y) + g^2z + g^2x + g^2z] * D_x
               = [g^(x+y) + g^(x+z) + g^(z+y) + g^2x] * D_x (A + A = 0 again)
               = [g^y * (g^x + g^z) + g^x *(g^x + g^z)] * D_x
               = (g^y + g^x) * (g^x + g^z) * D_x

Now, because x, y and z are distinct integers, if we assume
0 <= x, y, z < n <= 255, then neither g^y + g^x nor g^x + g^z can be
zero (g is a generator, so g^x = g^y would force x = y), and we can
solve for D_x from (3)'''. A similar argument applied to the
coefficient of D_y in (2)'' lets us solve for D_y, and from (1)' we
can then solve for D_z.
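
As a sketch of this reconstruction for one byte column (all names
here are mine; gf_mul() and gf_div() are assumed GF(2^8) helpers
built from the log and powers tables given further below):

#include <stdint.h>

/* Assumed helpers: GF(2^8) multiply and divide via the tables below. */
uint8_t gf_mul(uint8_t a, uint8_t b);
uint8_t gf_div(uint8_t a, uint8_t b);

/*
 * p, q, r are P_xyz, Q_xyz, R_xyz from (1)', (2)', (3)'; gx, gy, gz
 * are the Q-syndrome coefficients of the failed data disks D_x, D_y,
 * D_z.
 */
static void recover_three(uint8_t p, uint8_t q, uint8_t r,
			  uint8_t gx, uint8_t gy, uint8_t gz,
			  uint8_t *dx, uint8_t *dy, uint8_t *dz)
{
	uint8_t q1 = q ^ gf_mul(gz, p);             /* Q_xyz' of (2)''   */
	uint8_t r1 = r ^ gf_mul(gf_mul(gz, gz), p); /* R_xyz' of (3)''   */
	uint8_t r2 = r1 ^ gf_mul(gy ^ gz, q1);      /* R_xyz'' of (3)''' */

	*dx = gf_div(r2, gf_mul(gy ^ gx, gx ^ gz));
	*dy = gf_div(q1 ^ gf_mul(gx ^ gz, *dx), gy ^ gz);
	*dz = p ^ *dx ^ *dy;
}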

In a failure of 3 disks involving 1 or more parity disks, we use the
equations whose parity disk survives to uniquely solve for the
unknowns representing the failed data disk bytes, then use the
remaining equations to recompute the failed parity disk bytes.

In a failure involving only two data disks, we may use an argument
similar to the above and any two of the three equations to uniquely
solve for the unknowns (you might need to observe that g^2 is also a
generator, since 2 is relatively prime to 255). The only question is:
does the solution satisfy the third equation? It has to. The reason
is that originally (before the two data disks failed) there was a
solution for the two unknowns that satisfied all three equations,
hence also the two we used; but since those two equations determine
the unknowns uniquely, the values we compute must be the original
ones.

The arguments for the other cases are similar, only easier.

There is an important observation here: the Gauss-Jordan elimination
in the proof above can also be applied to Adam Leventhal's code,
although one setting has 3 equations and the other has n+3. This
observation has two implications:

1) If we set g to the generator {02} of GF(2^8), we have proven that
the algorithm used in Adam Leventhal's code is sound. (I am sure Adam
Leventhal has his own proof, but as another guy with a math
background, I am not convinced until I see one.)

2) For any other generator of GF(2^8), we can run the same procedure
as in Adam Leventhal's code once the corresponding logarithm and
powers tables are replaced, which enables us to reuse most of his
code.

So much for the math; now we enter the neverland of the system
engineer. Let's consider another generator of GF(2^8), {02}^(-1), and
try to optimize the calculation of the Q and R syndromes. For this
purpose we use the second equality in (2) and (3). The addition is
just bitwise XOR; what we need to optimize is the multiplication by
{02}^(-1). Following the approach in Peter Anvin's article, we find
that this multiplication is implemented by the following bitwise
operations:

(x * {02}^(-1))_7 = x_0
(x * {02}^(-1))_6 = x_7
(x * {02}^(-1))_5 = x_6
(x * {02}^(-1))_4 = x_5
(x * {02}^(-1))_3 = x_4 + x_0
(x * {02}^(-1))_2 = x_3 + x_0
(x * {02}^(-1))_1 = x_2 + x_0
(x * {02}^(-1))_0 = x_1

For a 32-bit architecture, C code that performs this multiplication
on four packed bytes at a time (wrapped here in a helper function) is
as follows:

#include <stdint.h>

/* Replicate the low bit of each byte into a full 0xff/0x00 byte mask. */
static uint32_t MASK(uint32_t v)
{
	v &= 0x01010101;
	return (v << 8) - v;
}

/* Multiply four packed GF(2^8) bytes by {02}^(-1) at once. */
static uint32_t mul_ginv(uint32_t v)
{
	uint32_t vv;

	vv  = (v >> 1) & 0x7f7f7f7f;  /* shift each byte right by one */
	vv ^= MASK(v) & 0x8e8e8e8e;   /* bytes whose low bit was set get ^ 0x8e */
	return vv;
}

The code for a 64-bit architecture is just a simple extension of this
(a sketch follows below), or you may consult Adam Leventhal's code.
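
For reference, here is a minimal sketch of that 64-bit extension,
processing eight packed GF(2^8) bytes per operation (the function
names are mine):

#include <stdint.h>

/* Replicate the low bit of each byte into a full 0xff/0x00 byte mask. */
static uint64_t MASK64(uint64_t v)
{
	v &= 0x0101010101010101ULL;
	return (v << 8) - v;
}

/* Multiply eight packed GF(2^8) bytes by {02}^(-1) at once. */
static uint64_t mul_ginv64(uint64_t v)
{
	uint64_t vv;

	vv  = (v >> 1) & 0x7f7f7f7f7f7f7f7fULL;
	vv ^= MASK64(v) & 0x8e8e8e8e8e8e8e8eULL;
	return vv;
}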

For arbitrary multiplications and divisions in the Gauss-Jordan
elimination of the recovery process, we use the rules:

A * B = C <==> C = g^(log_g A + log_g B)
A / B = C <==> C = g^(log_g A - log_g B)

where log_g is the discrete logarithm to base g = {02}^(-1) and the
exponent arithmetic is done modulo 255 (a table-lookup sketch follows
after the log table below).

The powers and discrete logarithm tables for {02}^(-1) are as follows:

Powers table:

0x01, 0x8e, 0x47, 0xad, 0xd8, 0x6c, 0x36, 0x1b,
0x83, 0xcf, 0xe9, 0xfa, 0x7d, 0xb0, 0x58, 0x2c,
0x16, 0x0b, 0x8b, 0xcb, 0xeb, 0xfb, 0xf3, 0xf7,
0xf5, 0xf4, 0x7a, 0x3d, 0x90, 0x48, 0x24, 0x12,
0x09, 0x8a, 0x45, 0xac, 0x56, 0x2b, 0x9b, 0xc3,
0xef, 0xf9, 0xf2, 0x79, 0xb2, 0x59, 0xa2, 0x51,
0xa6, 0x53, 0xa7, 0xdd, 0xe0, 0x70, 0x38, 0x1c,
0x0e, 0x07, 0x8d, 0xc8, 0x64, 0x32, 0x19, 0x82,
0x41, 0xae, 0x57, 0xa5, 0xdc, 0x6e, 0x37, 0x95,
0xc4, 0x62, 0x31, 0x96, 0x4b, 0xab, 0xdb, 0xe3,
0xff, 0xf1, 0xf6, 0x7b, 0xb3, 0xd7, 0xe5, 0xfc,
0x7e, 0x3f, 0x91, 0xc6, 0x63, 0xbf, 0xd1, 0xe6,
0x73, 0xb7, 0xd5, 0xe4, 0x72, 0x39, 0x92, 0x49,
0xaa, 0x55, 0xa4, 0x52, 0x29, 0x9a, 0x4d, 0xa8,
0x54, 0x2a, 0x15, 0x84, 0x42, 0x21, 0x9e, 0x4f,
0xa9, 0xda, 0x6d, 0xb8, 0x5c, 0x2e, 0x17, 0x85,
0xcc, 0x66, 0x33, 0x97, 0xc5, 0xec, 0x76, 0x3b,
0x93, 0xc7, 0xed, 0xf8, 0x7c, 0x3e, 0x1f, 0x81,
0xce, 0x67, 0xbd, 0xd0, 0x68, 0x34, 0x1a, 0x0d,
0x88, 0x44, 0x22, 0x11, 0x86, 0x43, 0xaf, 0xd9,
0xe2, 0x71, 0xb6, 0x5b, 0xa3, 0xdf, 0xe1, 0xfe,
0x7f, 0xb1, 0xd6, 0x6b, 0xbb, 0xd3, 0xe7, 0xfd,
0xf0, 0x78, 0x3c, 0x1e, 0x0f, 0x89, 0xca, 0x65,
0xbc, 0x5e, 0x2f, 0x99, 0xc2, 0x61, 0xbe, 0x5f,
0xa1, 0xde, 0x6f, 0xb9, 0xd2, 0x69, 0xba, 0x5d,
0xa0, 0x50, 0x28, 0x14, 0x0a, 0x05, 0x8c, 0x46,
0x23, 0x9f, 0xc1, 0xee, 0x77, 0xb5, 0xd4, 0x6a,
0x35, 0x94, 0x4a, 0x25, 0x9c, 0x4e, 0x27, 0x9d,
0xc0, 0x60, 0x30, 0x18, 0x0c, 0x06, 0x03, 0x8f,
0xc9, 0xea, 0x75, 0xb4, 0x5a, 0x2d, 0x98, 0x4c,
0x26, 0x13, 0x87, 0xcd, 0xe8, 0x74, 0x3a, 0x1d,
0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01

Log table:

0x00, 0x00, 0xfe, 0xe6, 0xfd, 0xcd, 0xe5, 0x39,
0xfc, 0x20, 0xcc, 0x11, 0xe4, 0x97, 0x38, 0xb4,
0xfb, 0x9b, 0x1f, 0xf1, 0xcb, 0x72, 0x10, 0x7e,
0xe3, 0x3e, 0x96, 0x07, 0x37, 0xf7, 0xb3, 0x8e,
0xfa, 0x75, 0x9a, 0xd0, 0x1e, 0xdb, 0xf0, 0xde,
0xca, 0x6c, 0x71, 0x25, 0x0f, 0xed, 0x7d, 0xba,
0xe2, 0x4a, 0x3d, 0x82, 0x95, 0xd8, 0x06, 0x46,
0x36, 0x65, 0xf6, 0x87, 0xb2, 0x1b, 0x8d, 0x59,
0xf9, 0x40, 0x74, 0x9d, 0x99, 0x22, 0xcf, 0x02,
0x1d, 0x67, 0xda, 0x4c, 0xef, 0x6e, 0xdd, 0x77,
0xc9, 0x2f, 0x6b, 0x31, 0x70, 0x69, 0x24, 0x42,
0x0e, 0x2d, 0xec, 0xa3, 0x7c, 0xc7, 0xb9, 0xbf,
0xe1, 0xbd, 0x49, 0x5c, 0x3c, 0xb7, 0x81, 0x91,
0x94, 0xc5, 0xd7, 0xab, 0x05, 0x7a, 0x45, 0xc2,
0x35, 0xa1, 0x64, 0x60, 0xf5, 0xea, 0x86, 0xd4,
0xb1, 0x2b, 0x1a, 0x53, 0x8c, 0x0c, 0x58, 0xa8,
0xf8, 0x8f, 0x3f, 0x08, 0x73, 0x7f, 0x9c, 0xf2,
0x98, 0xb5, 0x21, 0x12, 0xce, 0x3a, 0x01, 0xe7,
0x1c, 0x5a, 0x66, 0x88, 0xd9, 0x47, 0x4b, 0x83,
0xee, 0xbb, 0x6d, 0x26, 0xdc, 0xdf, 0x76, 0xd1,
0xc8, 0xc0, 0x2e, 0xa4, 0x6a, 0x43, 0x30, 0x32,
0x6f, 0x78, 0x68, 0x4d, 0x23, 0x03, 0x41, 0x9e,
0x0d, 0xa9, 0x2c, 0x54, 0xeb, 0xd5, 0xa2, 0x61,
0x7b, 0xc3, 0xc6, 0xac, 0xb8, 0x92, 0xbe, 0x5d,
0xe0, 0xd2, 0xbc, 0x27, 0x48, 0x84, 0x5b, 0x89,
0x3b, 0xe8, 0xb6, 0x13, 0x80, 0xf3, 0x90, 0x09,
0x93, 0x5e, 0xc4, 0xad, 0xd6, 0x62, 0xaa, 0x55,
0x04, 0x9f, 0x79, 0x4e, 0x44, 0x33, 0xc1, 0xa5,
0x34, 0xa6, 0xa0, 0x4f, 0x63, 0x56, 0x5f, 0xae,
0xf4, 0x0a, 0xe9, 0x14, 0x85, 0x8a, 0xd3, 0x28,
0xb0, 0x51, 0x2a, 0x16, 0x19, 0x18, 0x52, 0x17,
0x8b, 0x29, 0x0b, 0x15, 0x57, 0xaf, 0xa7, 0x50
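
As a minimal sketch of how recovery code can use these tables (the
array and function names are mine; gfexp[] is assumed to hold the
powers table and gflog[] the log table above):

#include <stdint.h>

/* gfexp[k] = g^k from the powers table; gflog[] is the log table. */
extern const uint8_t gfexp[256], gflog[256];

static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	if (a == 0 || b == 0)
		return 0;
	return gfexp[(gflog[a] + gflog[b]) % 255];  /* exponents add mod 255 */
}

static uint8_t gf_div(uint8_t a, uint8_t b)         /* b must be non-zero */
{
	if (a == 0)
		return 0;
	return gfexp[(gflog[a] + 255 - gflog[b]) % 255];
}

With these two helpers, every multiplication and division in the
Gauss-Jordan elimination above becomes a couple of table lookups.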

The original purpose of this work is to enable Linux to have
triple-parity RAID, but I guess it is all right for the rest of the
open source community to use it too. If you are not sure about your
situation, or you'd rather talk to me in private about other issues,
please contact me at: creamyfish@gmail.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Is this enough for us to have triple-parity RAID?
@ 2012-04-16 12:55 Alex
  0 siblings, 0 replies; 37+ messages in thread
From: Alex @ 2012-04-16 12:55 UTC (permalink / raw)
  To: linux-fsdevel

Hey guys,
I have done some work on triple-parity RAID and posted it here:
http://www.linuxquestions.org/questions/linux-server-73/is-this-enough-for-us-to-have-triple-parity-raid-939991/

As a die-hard follower of Linux, I truly wish btrfs would have
triple-parity RAID integrated in the future. For those who are
interested, please take a look and give me some feedback.

creamyfish

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2012-05-01 16:38 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-16 10:04 Is this enough for us to have triple-parity RAID? Alex
2012-04-16 12:55 Alex
2012-04-17  6:11 Alex
2012-04-17  7:58 ` David Brown
2012-04-17 16:37   ` Stefan /*St0fF*/ Hübner
2012-04-18 14:15     ` Alex
2012-04-18 14:11       ` David Brown
2012-04-17 17:16   ` Piergiorgio Sartor
2012-04-17 20:18     ` David Brown
2012-04-17 20:54       ` Piergiorgio Sartor
2012-04-18 18:22       ` Piergiorgio Sartor
2012-04-18 20:20         ` David Brown
2012-04-18 20:39           ` Piergiorgio Sartor
2012-04-19 18:16       ` H. Peter Anvin
2012-04-20  2:27         ` Alex
2012-04-20  3:00           ` H. Peter Anvin
2012-04-20  3:32             ` Alex
2012-04-20 18:58               ` David Brown
2012-04-20 19:39                 ` H. Peter Anvin
2012-04-20 21:04                   ` Piergiorgio Sartor
2012-04-20 21:01                 ` Piergiorgio Sartor
2012-04-20 21:29                   ` Peter Grandi
2012-04-20 22:31                     ` Piergiorgio Sartor
2012-04-21  9:51                       ` Peter Grandi
2012-04-21 11:18                         ` Piergiorgio Sartor
2012-04-22  3:14                           ` Alex
2012-04-22  8:57                             ` Piergiorgio Sartor
2012-04-20  7:45 ` Stan Hoeppner
2012-04-23 15:26   ` Alex
2012-04-25  1:20     ` Stan Hoeppner
2012-04-25  2:45       ` Alex
2012-04-25 16:59         ` Emmanuel Noobadmin
2012-04-25 19:29           ` David Brown
2012-04-26  2:30           ` Alex
2012-04-27 15:15             ` Emmanuel Noobadmin
2012-05-01 16:38               ` Alex
2012-04-26  4:24           ` Alex
