* Is this enough for us to have triple-parity RAID? @ 2012-04-17 6:11 Alex 2012-04-17 7:58 ` David Brown 2012-04-20 7:45 ` Stan Hoeppner 0 siblings, 2 replies; 37+ messages in thread From: Alex @ 2012-04-17 6:11 UTC (permalink / raw) To: linux-raid

Thanks to Billy Crook who pointed out this is the right place for my post.

Adam Leventhal integrated triple-parity RAID into ZFS in 2009. The necessity of triple-parity RAID is described in detail in Adam Leventhal's article (http://cacm.acm.org/magazines/2010/1/55741-triple-parity-raid-and-beyond/fulltext). In short: hard drive capacity has doubled roughly annually (Kryder's law) while hard drive throughput has improved far more modestly, so the time it takes to rebuild a failed disk in a double-parity RAID like RAID 6 can become so long (in 2010, about 4 hours for a 1TB SAS disk at full speed) that the remaining array (essentially a RAID 5 array, which has proven unsafe) may suffer another failure and lose data during recovery. Earlier this year, Ostler et al. (http://www.nature.com/ncomms/journal/v3/n2/full/ncomms1666.html) demonstrated a revolutionary way of writing a magnetic substrate with a heat pulse instead of a traditional magnetic field, which may eventually increase hard disk throughput by a factor of 1000; they estimated the technology is about 10 years from commercial use. In other words, for at least the next 10 years, triple-parity RAID remains highly desirable in environments that demand data integrity.

Unfortunately, due to the conflict between Oracle's CDDL license and Linux's GPL license, ZFS, and hence its triple-parity RAID, cannot be ported to Linux. As a die-hard follower of the open source community, I am NOT exactly pleased by this kind of drama. But instead of calling the grapes sour, I decided to grow my own.
The work I am going to present here builds on top of that of Adam Leventhal (http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c) and Peter Anvin (http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf). I will generalize Adam Leventhal's work using Peter Anvin's method, then pick another generator from the Galois field GF(2^8) to facilitate another triple-parity RAID algorithm that hopefully can be used by the part of the open source community that is not CDDL compliant, and beyond. System engineers who are not familiar with Galois field theory will find a good introduction in Peter Anvin's article.

I will use the same notation as in Adam Leventhal's work: D_0, ..., D_(n-1) represent the corresponding bytes in the n data disks of the array; addition + is bitwise XOR; multiplication * is multiplication in GF(2^8); multiplication in an exponent (for example, 2(n-1) in g^2(n-1) below), on the other hand, is multiplication in the modulo ring Z_255.

(1) P = D_0 + D_1 + ... + D_(n-2) + D_(n-1)
(2) Q = g^(n-1) * D_0 + g^(n-2) * D_1 + ... + g^1 * D_(n-2) + D_(n-1)
      = ((...((D_0) * g + D_1) * g + ...) * g + D_(n-2)) * g + D_(n-1)
(3) R = g^2(n-1) * D_0 + g^2(n-2) * D_1 + ... + g^2 * D_(n-2) + D_(n-1)
      = ((...((D_0) * g^2 + D_1) * g^2 + ...) * g^2 + D_(n-2)) * g^2 + D_(n-1)

These are the definitions of the parity bytes P, Q and R, usually called the P, Q, R syndromes in the literature. Adam Leventhal used the generator {02} in the cyclic representation of the Galois field GF(2^8); I instead use an arbitrary generator g in the definitions of P, Q and R. For a generator g, g^k is a generator if and only if k is relatively prime to 255. Since g^255 = 1, g^(-1) = g^254 is a generator, for 254 is relatively prime to 255. g^(-1) is the generator I am going to use to optimize my triple-parity RAID algorithm.

Now let's prove we can always recover from 3 or fewer disk failures; namely, that we can always solve (1), (2), (3) for a unique solution if 3 or fewer of {P, Q, R, D_0, ...
D_(n-1)} are unknowns.

We start with the most difficult case, i.e., 3 of {D_0, ..., D_(n-1)} are unknowns; that is, 3 data disks have failed, and let's call them D_x, D_y and D_z. For (1), we move the known terms on the right to the left, combine them with P, and call the sum P_xyz, so (1) becomes

(1)' P_xyz = D_x + D_y + D_z

Similarly, (2) and (3) become

(2)' Q_xyz = g^x * D_x + g^y * D_y + g^z * D_z
(3)' R_xyz = g^2x * D_x + g^2y * D_y + g^2z * D_z

From (1)', we have D_z = P_xyz + D_x + D_y, since A + A = 0 in a field of characteristic 2. Substituting this into (2)' and (3)', moving the known terms to the left and calling the results Q_xyz' and R_xyz', we get

(2)'' Q_xyz' = (g^x + g^z) * D_x + (g^y + g^z) * D_y
(3)'' R_xyz' = (g^2x + g^2z) * D_x + (g^2y + g^2z) * D_y

Here comes the trick: multiply (2)'' by g^y + g^z. The coefficient of D_y becomes (g^y + g^z) * (g^y + g^z) = g^2y + g^y * g^z + g^z * g^y + g^2z = g^2y + g^2z (A + A = 0 again). Then add the result to (3)''; the D_y terms cancel, and calling the left side R_xyz'', we have

(3)''' R_xyz'' = [(g^x + g^z) * (g^y + g^z) + g^2x + g^2z] * D_x
             = [g^(x+y) + g^(x+z) + g^(z+y) + g^2z + g^2x + g^2z] * D_x
             = [g^(x+y) + g^(x+z) + g^(z+y) + g^2x] * D_x   (A + A = 0 again)
             = [g^y * (g^x + g^z) + g^x * (g^x + g^z)] * D_x
             = (g^y + g^x) * (g^x + g^z) * D_x

Now, because x, y and z are distinct integers with 0 <= x, y, z < n <= 255, neither g^y + g^x nor g^x + g^z can be zero, since the powers g^0, ..., g^254 of a generator g are all distinct; so we can solve for D_x from (3)'''. A similar argument applied to D_y's coefficient in (2)'' solves for D_y, and from (1)' we then get D_z.

In a failure of 3 disks involving 1 or more parity disks, we may use the equations whose parity disks survive to uniquely solve for the unknowns representing the failed data disk bytes, then use the remaining equations to recompute the failed parity disk bytes.
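To make the derivation concrete, here is a small self-contained C sketch of the three-data-disk recovery. The function and table names (gf_init, gmul, gdiv, recover3) are mine, not from Adam Leventhal's code, and it assumes the generator g = {02} and the field polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d) used in Peter Anvin's paper. The caller passes the reduced syndromes of (1)', (2)', (3)' and the exponents x, y, z attached to the three failed disks:

```c
#include <stdint.h>

static uint8_t gexp[512], glog[256];

/* Build power/log tables for the generator {02} over GF(2^8) with
 * reducing polynomial 0x11d. gexp is doubled in length so that sums
 * of two logarithms need no modular reduction. */
static void gf_init(void)
{
    uint8_t x = 1;
    for (int i = 0; i < 255; i++) {
        gexp[i] = x;
        glog[x] = (uint8_t)i;
        x = (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0));
    }
    for (int i = 255; i < 512; i++)
        gexp[i] = gexp[i - 255];
}

static uint8_t gmul(uint8_t a, uint8_t b)
{
    if (!a || !b)
        return 0;
    return gexp[glog[a] + glog[b]];
}

static uint8_t gdiv(uint8_t a, uint8_t b)   /* b must be nonzero */
{
    if (!a)
        return 0;
    return gexp[glog[a] + 255 - glog[b]];
}

/* Solve (1)', (2)', (3)' for D_x, D_y, D_z following the
 * Gauss-Jordan steps in the text. */
static void recover3(uint8_t P, uint8_t Q, uint8_t R,
                     int x, int y, int z,
                     uint8_t *Dx, uint8_t *Dy, uint8_t *Dz)
{
    uint8_t gx = gexp[x], gy = gexp[y], gz = gexp[z];

    /* eliminate D_z, giving (2)'' and (3)'' */
    uint8_t Q2 = Q ^ gmul(gz, P);            /* (g^x+g^z)Dx + (g^y+g^z)Dy */
    uint8_t R2 = R ^ gmul(gmul(gz, gz), P);  /* (g^2x+g^2z)Dx + (g^2y+g^2z)Dy */

    /* (3)''': the D_y terms cancel, leaving (g^y+g^x)(g^x+g^z) * D_x */
    uint8_t R3 = gmul((uint8_t)(gy ^ gz), Q2) ^ R2;

    *Dx = gdiv(R3, gmul((uint8_t)(gy ^ gx), (uint8_t)(gx ^ gz)));
    *Dy = gdiv((uint8_t)(Q2 ^ gmul((uint8_t)(gx ^ gz), *Dx)),
               (uint8_t)(gy ^ gz));
    *Dz = (uint8_t)(P ^ *Dx ^ *Dy);          /* from (1)' */
}
```

The divisors are products of terms of the form g^a + g^b with a != b, which the argument above shows are nonzero, so gdiv is always safe here.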
In a failure involving only two data disks, we may use an argument similar to the above with two of the three equations to uniquely solve for the unknowns (you might need to observe that g^2 is also a generator, since 2 is relatively prime to 255). The only question is: does the solution satisfy the third equation? It has to. Before the two data disks failed, we had a solution for the two unknowns satisfying all three equations, hence also the two we used to solve for the unknowns; since the solution of those two equations is unique, it must equal the original values.

The arguments for the other cases are similar, only easier.

There is an important observation here: the Gauss-Jordan elimination in the proof above can be applied to Adam Leventhal's code, although one has 3 equations and the other has n+3. This observation has two implications:

1) If we replace g with the generator {02} in GF(2^8), we have proven that the algorithm used in Adam Leventhal's code is sound. (I am sure Adam Leventhal has his own proof, but as another guy with a math background, I am not convinced until I see it.)

2) For other generators in GF(2^8), we can use the same procedure as in Adam Leventhal's code once the corresponding logarithm and powers tables are replaced; this enables us to reuse most of his code.

So much for the math; now we enter the neverland of a system engineer. Let's consider another generator of GF(2^8), {02}^(-1), and try to optimize the calculation of the Q and R syndromes. For this purpose, we use the second equality in (2) and (3). The addition is just bitwise XOR; what we need to optimize is the multiplication by {02}^(-1).
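To illustrate implication 2), the tables for any generator can be produced mechanically; everything else in the procedure stays the same. Here is a minimal sketch (the names gf_mul and make_tables are mine, and the field polynomial 0x11d is assumed, matching Peter Anvin's paper):

```c
#include <stdint.h>

/* Generic GF(2^8) multiply (Russian-peasant style) over the
 * polynomial x^8 + x^4 + x^3 + x^2 + 1, i.e. 0x11d. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return p;
}

/* Fill the powers and log tables for an arbitrary generator g.
 * Swapping these tables is the only change needed to run the same
 * syndrome/recovery procedure with a different generator. */
static void make_tables(uint8_t g, uint8_t pow[255], uint8_t logt[256])
{
    uint8_t x = 1;
    logt[0] = 0;                    /* log(0) is undefined; placeholder */
    for (int i = 0; i < 255; i++) {
        pow[i] = x;
        logt[x] = (uint8_t)i;
        x = gf_mul(x, g);
    }
}
```

For example, make_tables(0x8e, ...) should reproduce the {02}^(-1) tables given below, since {02} * {8e} = {01} under this polynomial.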
Following the approach of Peter Anvin's article, we find that multiplication by {02}^(-1) is implemented by the following bitwise operations:

(x * {02}^(-1))_7 = x_0
(x * {02}^(-1))_6 = x_7
(x * {02}^(-1))_5 = x_6
(x * {02}^(-1))_4 = x_5
(x * {02}^(-1))_3 = x_4 + x_0
(x * {02}^(-1))_2 = x_3 + x_0
(x * {02}^(-1))_1 = x_2 + x_0
(x * {02}^(-1))_0 = x_1

For a 32 bit architecture, the C code implementing this (4 bytes at a time) is as follows:

uint32_t MASK(uint32_t v)
{
        v &= 0x01010101;        /* keep bit 0 of each byte */
        return (v << 8) - v;    /* expand each 0x01 to 0xff */
}

uint32_t v, vv;
vv  = (v >> 1) & 0x7f7f7f7f;    /* shift each byte right by one */
vv ^= MASK(v) & 0x8e8e8e8e;     /* fold bit 0 of each byte back in as 0x8e */

The code for a 64 bit architecture is a simple extension of this, or you may consult Adam Leventhal's code.

For the arbitrary multiplications in the Gauss-Jordan elimination of the recovery process, we use the rules:

A * B = C  <==>  C = g^(log_g A + log_g B)
A / B = C  <==>  C = g^(log_g A - log_g B)

where log_g is the discrete logarithm and g = {02}^(-1).

The tables of powers and discrete logarithms of {02}^(-1) are as follows:

Powers table:

0x01, 0x8e, 0x47, 0xad, 0xd8, 0x6c, 0x36, 0x1b,
0x83, 0xcf, 0xe9, 0xfa, 0x7d, 0xb0, 0x58, 0x2c,
0x16, 0x0b, 0x8b, 0xcb, 0xeb, 0xfb, 0xf3, 0xf7,
0xf5, 0xf4, 0x7a, 0x3d, 0x90, 0x48, 0x24, 0x12,
0x09, 0x8a, 0x45, 0xac, 0x56, 0x2b, 0x9b, 0xc3,
0xef, 0xf9, 0xf2, 0x79, 0xb2, 0x59, 0xa2, 0x51,
0xa6, 0x53, 0xa7, 0xdd, 0xe0, 0x70, 0x38, 0x1c,
0x0e, 0x07, 0x8d, 0xc8, 0x64, 0x32, 0x19, 0x82,
0x41, 0xae, 0x57, 0xa5, 0xdc, 0x6e, 0x37, 0x95,
0xc4, 0x62, 0x31, 0x96, 0x4b, 0xab, 0xdb, 0xe3,
0xff, 0xf1, 0xf6, 0x7b, 0xb3, 0xd7, 0xe5, 0xfc,
0x7e, 0x3f, 0x91, 0xc6, 0x63, 0xbf, 0xd1, 0xe6,
0x73, 0xb7, 0xd5, 0xe4, 0x72, 0x39, 0x92, 0x49,
0xaa, 0x55, 0xa4, 0x52, 0x29, 0x9a, 0x4d, 0xa8,
0x54, 0x2a, 0x15, 0x84, 0x42, 0x21, 0x9e, 0x4f,
0xa9, 0xda, 0x6d, 0xb8, 0x5c, 0x2e, 0x17, 0x85,
0xcc, 0x66, 0x33, 0x97, 0xc5, 0xec, 0x76, 0x3b,
0x93, 0xc7, 0xed, 0xf8, 0x7c, 0x3e, 0x1f, 0x81,
0xce, 0x67, 0xbd, 0xd0, 0x68, 0x34, 0x1a, 0x0d,
0x88, 0x44, 0x22, 0x11, 0x86, 0x43, 0xaf, 0xd9,
0xe2, 0x71, 0xb6, 0x5b, 0xa3, 0xdf, 0xe1, 0xfe,
0x7f, 0xb1, 0xd6, 0x6b, 0xbb, 0xd3, 0xe7, 0xfd,
0xf0, 0x78, 0x3c, 0x1e, 0x0f, 0x89, 0xca, 0x65,
0xbc, 0x5e, 0x2f, 0x99, 0xc2, 0x61, 0xbe, 0x5f,
0xa1, 0xde, 0x6f, 0xb9, 0xd2, 0x69, 0xba, 0x5d,
0xa0, 0x50, 0x28, 0x14, 0x0a, 0x05, 0x8c, 0x46,
0x23, 0x9f, 0xc1, 0xee, 0x77, 0xb5, 0xd4, 0x6a,
0x35, 0x94, 0x4a, 0x25, 0x9c, 0x4e, 0x27, 0x9d,
0xc0, 0x60, 0x30, 0x18, 0x0c, 0x06, 0x03, 0x8f,
0xc9, 0xea, 0x75, 0xb4, 0x5a, 0x2d, 0x98, 0x4c,
0x26, 0x13, 0x87, 0xcd, 0xe8, 0x74, 0x3a, 0x1d,
0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01

Log table:

0x00, 0x00, 0xfe, 0xe6, 0xfd, 0xcd, 0xe5, 0x39,
0xfc, 0x20, 0xcc, 0x11, 0xe4, 0x97, 0x38, 0xb4,
0xfb, 0x9b, 0x1f, 0xf1, 0xcb, 0x72, 0x10, 0x7e,
0xe3, 0x3e, 0x96, 0x07, 0x37, 0xf7, 0xb3, 0x8e,
0xfa, 0x75, 0x9a, 0xd0, 0x1e, 0xdb, 0xf0, 0xde,
0xca, 0x6c, 0x71, 0x25, 0x0f, 0xed, 0x7d, 0xba,
0xe2, 0x4a, 0x3d, 0x82, 0x95, 0xd8, 0x06, 0x46,
0x36, 0x65, 0xf6, 0x87, 0xb2, 0x1b, 0x8d, 0x59,
0xf9, 0x40, 0x74, 0x9d, 0x99, 0x22, 0xcf, 0x02,
0x1d, 0x67, 0xda, 0x4c, 0xef, 0x6e, 0xdd, 0x77,
0xc9, 0x2f, 0x6b, 0x31, 0x70, 0x69, 0x24, 0x42,
0x0e, 0x2d, 0xec, 0xa3, 0x7c, 0xc7, 0xb9, 0xbf,
0xe1, 0xbd, 0x49, 0x5c, 0x3c, 0xb7, 0x81, 0x91,
0x94, 0xc5, 0xd7, 0xab, 0x05, 0x7a, 0x45, 0xc2,
0x35, 0xa1, 0x64, 0x60, 0xf5, 0xea, 0x86, 0xd4,
0xb1, 0x2b, 0x1a, 0x53, 0x8c, 0x0c, 0x58, 0xa8,
0xf8, 0x8f, 0x3f, 0x08, 0x73, 0x7f, 0x9c, 0xf2,
0x98, 0xb5, 0x21, 0x12, 0xce, 0x3a, 0x01, 0xe7,
0x1c, 0x5a, 0x66, 0x88, 0xd9, 0x47, 0x4b, 0x83,
0xee, 0xbb, 0x6d, 0x26, 0xdc, 0xdf, 0x76, 0xd1,
0xc8, 0xc0, 0x2e, 0xa4, 0x6a, 0x43, 0x30, 0x32,
0x6f, 0x78, 0x68, 0x4d, 0x23, 0x03, 0x41, 0x9e,
0x0d, 0xa9, 0x2c, 0x54, 0xeb, 0xd5, 0xa2, 0x61,
0x7b, 0xc3, 0xc6, 0xac, 0xb8, 0x92, 0xbe, 0x5d,
0xe0, 0xd2, 0xbc, 0x27, 0x48, 0x84, 0x5b, 0x89,
0x3b, 0xe8, 0xb6, 0x13, 0x80, 0xf3, 0x90, 0x09,
0x93, 0x5e, 0xc4, 0xad, 0xd6, 0x62, 0xaa, 0x55,
0x04, 0x9f, 0x79, 0x4e, 0x44, 0x33, 0xc1, 0xa5,
0x34, 0xa6, 0xa0, 0x4f, 0x63, 0x56, 0x5f, 0xae,
0xf4, 0x0a, 0xe9, 0x14, 0x85, 0x8a, 0xd3, 0x28,
0xb0, 0x51, 0x2a, 0x16, 0x19, 0x18, 0x52, 0x17,
0x8b, 0x29, 0x0b, 0x15, 0x57, 0xaf, 0xa7, 0x50

The original purpose of this work is to enable Linux to have triple-parity RAID, but I guess it is all right for the rest of the open source community to use it too. If you are not sure about your situation, or you'd rather talk to me in private about other issues, please contact me at: creamyfish@gmail.com

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-17 7:58 ` David Brown @ 2012-04-17 16:37 ` Stefan /*St0fF*/ Hübner 2012-04-17 17:16 ` Piergiorgio Sartor 2012-04-20 7:45 ` Stan Hoeppner 1 sibling, 2 replies; 37+ messages in thread From: David Brown @ 2012-04-17 7:58 UTC (permalink / raw) To: linux-raid

Hi Alex,

I've been playing around with triple-parity raid theory for a while. It's been mainly for my own interest - it's fun to do some deeper maths (I studied maths at university, but this is the first time I've done serious group theory in the last 20 years), fun to resurrect my LaTeX skills, and maybe it will be useful to the md raid developers.

My current state is that I've got the theory worked out and written up - not just for triple parity, but for more parities as well. For some of it, I've got Python code to test and verify the maths. It turns out that triple parity can work well - but for quad parity the limit is 21 data disks (using generators 2, 4, and 8), or up to 33 (using for example 0x07, 0x35 and 0x8b as generators). Realistically, I think triple parity is the limit for practical implementations.

I haven't finished the paper - in particular, I haven't filled out the simplifications of the triple parity work from more general matrix forms. I also hope to work on implementation details for the calculations. But maybe there is already enough to be of interest to some people, such as yourself.

<http://redhouse.homelinux.net/raid/>

mvh.,

David

On 17/04/2012 08:11, Alex wrote:
> [Alex's original message, quoted in full above, trimmed]
* Re: Is this enough for us to have triple-parity RAID? 2012-04-17 7:58 ` David Brown @ 2012-04-17 16:37 ` Stefan /*St0fF*/ Hübner 2012-04-18 14:15 ` Alex 2012-04-17 17:16 ` Piergiorgio Sartor 1 sibling, 1 reply; 37+ messages in thread From: Stefan /*St0fF*/ Hübner @ 2012-04-17 16:37 UTC (permalink / raw) To: David Brown; +Cc: linux-raid Am 17.04.2012 09:58, schrieb David Brown: > Hi Alex, > > I've been playing around with triple-parity raid theory for a while. > It's been mainly for my own interest - it's fun to do some deeper maths > (I studied maths at university, but this is the first time I've done > serious group theory in the last 20 years), fun to resurrect my LaTeX > skills, and maybe it will be useful to the md raid developers. > > My current state is that I've got theory worked out and written up - not > just for triple parity, but for more parities as well. For some of it, > I've got Python code to test and verify the maths. It turns out that > triple parity can work well - but for quad parity the limit is 21 data > disks (using generators 2, 4, and 8), or up to 33 (using for example > 0x07, 0x35 and 0x8b as generators). Realistically, I think triple > parity is the limit for practical implementations. > > I haven't finished the paper - in particular, I haven't filled out the > simplifications of the triple parity work from more general matrix > forms. I also hope to work on implementation details for the > calculations. But maybe there is already enough to be of interest to > some people, such as yourself. > > <http://redhouse.homelinux.net/raid/> > > mvh., > > David > > > > On 17/04/2012 08:11, Alex wrote: >> Thanks to Billy Crook who pointed out this is the right place for my >> post. >> >> Adam Leventhal integrated triple-parity RAID into ZFS in 2009. The >> necessity of triple-parity RAID is described in detail in Adam >> Leventhal's >> article(http://cacm.acm.org/magazines/2010/1/55741-triple-parity-raid-and-beyond/fulltext). 
>> >> Basically it is because hard drive capacity has doubled >> annually(Kryder's law) while hard drive throughput has improved far >> more modestly, the time it takes to recover a faulty hard disk in a >> double-parity RAID like RAID 6 might become so long(in 2010, it takes >> about 4 hours to recover a 1TB SAS hard disk at its full speed) that >> the remaining array(essentially a RAID 5 array, which has proven to be >> unsafe) might fail and cause data loss during recovery. Earlier this >> year, Ostler et >> al.(http://www.nature.com/ncomms/journal/v3/n2/full/ncomms1666.html) >> established a revolutionary way of writing magnetic substrate using a >> heat pulse instead of a traditional magnetic field, which may increase >> data throughput on a hard disk by 1000 times in the future. They >> estimated the commercial usage of this new technology would be advent >> in about 10 years. That said, within the next 10 years, having >> triple-parity RAID in a data integrity demanding environment is still >> highly desirable. >> >> Unfortunately, due to conflicts between CDDL license of Oracle and GNU >> license of Linux, ZFS and hence triple-parity RAID cannot be ported to >> Linux. As a die-hard follower of the open source community, I am NOT >> exactly pleased by this kind of drama. But instead of claiming the >> grape to be sour, I decided to grow my own. >> >> The work I am going to present here builds on top of that of Adam >> Leventhal(http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c) >> >> and Peter Anvin(http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf). >> I will generalize Adam Leventhal's work using Peter Anvin's method, >> then pick another generator from the Galois field GF(2^8) to >> facilitate another triple-parity RAID algorithm that hopefully can be >> used by the part of open source community that is not CDDL compliant >> and beyond. 
System engineers who are not exactly familiar with Galois >> field theory may find themselves a great exposure for that in Peter >> Anvin's article. >> >> I will use the same notation as in Adam Leventhal's work: D_0, ... >> D_n-1 represents corresponding bytes in the n data disks in the array, >> addition + is bitwise XOR, multiplication * is multiplication in >> GF(2^8), multiplication in the power(for example,2(n-1) in g^2(n-1) >> below) on the other hand, is multiplication in modulo ring Z_255. >> >> (1) P = D_0 + D_1 + ... + D_n-2 + D_n-1 >> (2) Q = g^n-1 * D_0 + g^n-2 * D_1 + ... + g^1 * D_n-2 + D_n-1 >> = ((...((D_0) * g + D_1) * g + ...) * g + D_n-2) * g + D_n-1 >> (3) R = g^2(n-1) * D_0 + g^2(n-2) * D_1 + ... + g^2 * D_n-2 + D_n-1 >> = ((...((D_0) * g^2 + D_1) * g^2 + ...) * g^2 + D_n-2) * g^2 + >> D_n-1 >> >> P,Q,R are the definitions of the parity bytes, these are usually >> called P,Q,R syndromes in the literature. Adam Leventhal used >> generator {02} in the cyclic representation of Galois field GF(2^8), I >> instead use an arbitrary generator g in the definition of P,Q and R. >> For generator g, g^k is a generator if and only if k is relatively >> prime to 255. Since g^255 = 1, g^(-1) = g^254 is a generator for 254 >> is relatively prime to 255. g^(-1) is the generator I am going to use >> to optimize my triple-parity RAID algorithm. >> >> Now let's prove we can always recover from 3 or less disk failures, >> namely, we can always solve (1), (2), (3) for a unique solution if 3 >> or less of {P, Q, R, D_0, ... D_n-1} are unknowns. >> >> We start with the most difficult case, i.e., 3 of {D_0, ... D_n-1} are >> unknowns, or 3 data disks have failed and let's call them D_x, D_y and >> D_z. 
For (1), we move the constants on the right to the left and >> combine them with P and call the sum to be P_xyz, so (1) becomes >> >> (1)' P_xyz = D_x + D_y + D_z >> >> Similarly, (2) and (3) become >> >> (2)' Q_xyz = g^x * D_x + g^y * D_y + g^z * D_z >> (3)' R_xyz = g^2x * D_x + g^2y * D_y + g^2z * D_z >> >> From (1)', we have D_z = P_xyz + D_x + D_y since A + A = 0 in a Galois >> field. Substitute this into (2)' and (3)', and move the constants to >> the left and call them Q_xyz' and R_xyz', we got >> >> (2)'' Q_xyz' = (g^x + g^z) * D_x + (g^y + g^z) * D_y >> (3)'' R_xyz' = (g^2x + g^2z) * D_x + (g^2y + g^2z) * D_y >> >> Here comes the trick, multiply (2)'' by g^y + g^z, the coefficient for >> D_y becomes (g^y + g^z) * (g^y + g^z) = g^2y + g^y * g^z + g^z * g^y + >> g^2z = g^2y + g^2z(A + A = 0 again), then add it to (3)'' and move the >> constants to the left and call it R_xyz'', we have >> >> (3)''' R_xyz'' = [(g^x + g^z) * (g^y + g^z) + g^2x + g^2z] * D_x >> = [g^(x+y) + g^(x+z) + g^(z+y) + g^2z + g^2x +g^2z] * D_x >> = [g^(x+y) + g^(x+z) + g^(z+y) + g^2x] * D_x (A + A = >> 0 again) >> = [g^y * (g^x + g^z) + g^x *(g^x + g^z)] * D_x >> = (g^y + g^x) * (g^x + g^z) * D_x >> >> Now because x, y, z are distinct integers, if we assume 0<= x, y, z< >> n<= 255, then neither g^y + g^x nor g^x + g^z can be zero since g is >> a generator and we can solve for D_x from (3)'''. A similar argument >> can be applied to D_y's coefficient in (2)'' to solve for D_y and from >> (1)' we can solve for D_z. >> >> In a failure of 3 disks involving 1 or more parity disks, we may use >> the equations not involving a failed data disk to uniquely solve for >> unknows representing the failed data disk bytes, then use the rest of >> the equations to recalculate the failed parity disk bytes. 
>>
>> In a failure involving only two data disks, we may use an argument
>> similar to the above and two of the three equations to uniquely
>> solve for the unknowns (you might need to observe that g^2 is also
>> a generator, since 2 is relatively prime to 255). The only question
>> is: does the solution satisfy the third equation? The answer is
>> that it has to. The reason is that we originally (before the two
>> data disks failed) had a solution for the two unknowns that
>> satisfied all three equations, hence also the two we used to solve
>> for the unknowns; but now we have uniquely solved for the unknowns
>> from those two equations, so they have to be the original values.
>>
>> The arguments for the other cases are similar, only easier.
>>
>> There is an important observation here: the Gauss-Jordan elimination
>> in the proof above can be applied to Adam Leventhal's code, even
>> though one has 3 equations and the other has n+3. This observation
>> has two implications:
>>
>> 1) If we replace g with the generator {02} in GF(2^8), we have
>> proven that the algorithm used in Adam Leventhal's code is sound.
>> (I am sure Adam Leventhal has his own proof, but as another guy
>> with a math background, I am not convinced until I see it.)
>>
>> 2) For other generators in GF(2^8), we can use the same procedure
>> as in Adam Leventhal's code once the corresponding logarithm and
>> power tables are replaced; this enables us to reuse most of his
>> code.
>>
>> So much for the math; now we enter the neverland of the systems
>> engineer. Let's consider another generator of GF(2^8), {02}^(-1),
>> and try to optimize the calculation of the Q and R syndromes. For
>> this purpose, we use the second equality in (2) and (3). The
>> addition is just a bitwise XOR; what we need to optimize is the
>> multiplication by {02}^(-1).
Following Peter Anvin's article, we find that this
>> multiplication is implemented by the following bitwise operations:
>>
>> (x * {02}^(-1))_7 = x_0
>> (x * {02}^(-1))_6 = x_7
>> (x * {02}^(-1))_5 = x_6
>> (x * {02}^(-1))_4 = x_5
>> (x * {02}^(-1))_3 = x_4 + x_0
>> (x * {02}^(-1))_2 = x_3 + x_0
>> (x * {02}^(-1))_1 = x_2 + x_0
>> (x * {02}^(-1))_0 = x_1
>>
>> For a 32-bit architecture, the C code implementing this is as
>> follows:
>>
>> uint32_t MASK(uint32_t v)
>> {
>>         v &= 0x01010101;
>>         return (v << 8) - v;
>> }
>>
>> uint32_t v, vv;
>> vv  = (v >> 1) & 0x7f7f7f7f;
>> vv ^= MASK(v) & 0x8e8e8e8e;
>>
>> The code for a 64-bit architecture is a simple extension of this,
>> or you may consult Adam Leventhal's code.
>>
>> For arbitrary multiplications in the Gauss-Jordan elimination of
>> the recovery process, we use the rules:
>>
>> A * B = C  <==>  C = g^(log_g A + log_g B)
>> A / B = C  <==>  C = g^(log_g A - log_g B)
>>
>> where log_g is the discrete logarithm and g = {02}^(-1).
>>
>> The tables for the powers and discrete logarithms of {02}^(-1) are
>> as follows:
>>
>> Powers table:
>>
>> 0x01, 0x8e, 0x47, 0xad, 0xd8, 0x6c, 0x36, 0x1b,
>> 0x83, 0xcf, 0xe9, 0xfa, 0x7d, 0xb0, 0x58, 0x2c,
>> 0x16, 0x0b, 0x8b, 0xcb, 0xeb, 0xfb, 0xf3, 0xf7,
>> 0xf5, 0xf4, 0x7a, 0x3d, 0x90, 0x48, 0x24, 0x12,
>> 0x09, 0x8a, 0x45, 0xac, 0x56, 0x2b, 0x9b, 0xc3,
>> 0xef, 0xf9, 0xf2, 0x79, 0xb2, 0x59, 0xa2, 0x51,
>> 0xa6, 0x53, 0xa7, 0xdd, 0xe0, 0x70, 0x38, 0x1c,
>> 0x0e, 0x07, 0x8d, 0xc8, 0x64, 0x32, 0x19, 0x82,
>> 0x41, 0xae, 0x57, 0xa5, 0xdc, 0x6e, 0x37, 0x95,
>> 0xc4, 0x62, 0x31, 0x96, 0x4b, 0xab, 0xdb, 0xe3,
>> 0xff, 0xf1, 0xf6, 0x7b, 0xb3, 0xd7, 0xe5, 0xfc,
>> 0x7e, 0x3f, 0x91, 0xc6, 0x63, 0xbf, 0xd1, 0xe6,
>> 0x73, 0xb7, 0xd5, 0xe4, 0x72, 0x39, 0x92, 0x49,
>> 0xaa, 0x55, 0xa4, 0x52, 0x29, 0x9a, 0x4d, 0xa8,
>> 0x54, 0x2a, 0x15, 0x84, 0x42, 0x21, 0x9e, 0x4f,
>> 0xa9, 0xda, 0x6d, 0xb8, 0x5c, 0x2e, 0x17, 0x85,
>> 0xcc, 0x66, 0x33, 0x97, 0xc5, 0xec, 0x76, 0x3b,
>> 0x93, 0xc7, 0xed, 0xf8, 0x7c, 0x3e,
0x1f, 0x81, >> 0xce, 0x67, 0xbd, 0xd0, 0x68, 0x34, 0x1a, 0x0d, >> 0x88, 0x44, 0x22, 0x11, 0x86, 0x43, 0xaf, 0xd9, >> 0xe2, 0x71, 0xb6, 0x5b, 0xa3, 0xdf, 0xe1, 0xfe, >> 0x7f, 0xb1, 0xd6, 0x6b, 0xbb, 0xd3, 0xe7, 0xfd, >> 0xf0, 0x78, 0x3c, 0x1e, 0x0f, 0x89, 0xca, 0x65, >> 0xbc, 0x5e, 0x2f, 0x99, 0xc2, 0x61, 0xbe, 0x5f, >> 0xa1, 0xde, 0x6f, 0xb9, 0xd2, 0x69, 0xba, 0x5d, >> 0xa0, 0x50, 0x28, 0x14, 0x0a, 0x05, 0x8c, 0x46, >> 0x23, 0x9f, 0xc1, 0xee, 0x77, 0xb5, 0xd4, 0x6a, >> 0x35, 0x94, 0x4a, 0x25, 0x9c, 0x4e, 0x27, 0x9d, >> 0xc0, 0x60, 0x30, 0x18, 0x0c, 0x06, 0x03, 0x8f, >> 0xc9, 0xea, 0x75, 0xb4, 0x5a, 0x2d, 0x98, 0x4c, >> 0x26, 0x13, 0x87, 0xcd, 0xe8, 0x74, 0x3a, 0x1d, >> 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 >> >> Log table: >> >> 0x00, 0x00, 0xfe, 0xe6, 0xfd, 0xcd, 0xe5, 0x39, >> 0xfc, 0x20, 0xcc, 0x11, 0xe4, 0x97, 0x38, 0xb4, >> 0xfb, 0x9b, 0x1f, 0xf1, 0xcb, 0x72, 0x10, 0x7e, >> 0xe3, 0x3e, 0x96, 0x07, 0x37, 0xf7, 0xb3, 0x8e, >> 0xfa, 0x75, 0x9a, 0xd0, 0x1e, 0xdb, 0xf0, 0xde, >> 0xca, 0x6c, 0x71, 0x25, 0x0f, 0xed, 0x7d, 0xba, >> 0xe2, 0x4a, 0x3d, 0x82, 0x95, 0xd8, 0x06, 0x46, >> 0x36, 0x65, 0xf6, 0x87, 0xb2, 0x1b, 0x8d, 0x59, >> 0xf9, 0x40, 0x74, 0x9d, 0x99, 0x22, 0xcf, 0x02, >> 0x1d, 0x67, 0xda, 0x4c, 0xef, 0x6e, 0xdd, 0x77, >> 0xc9, 0x2f, 0x6b, 0x31, 0x70, 0x69, 0x24, 0x42, >> 0x0e, 0x2d, 0xec, 0xa3, 0x7c, 0xc7, 0xb9, 0xbf, >> 0xe1, 0xbd, 0x49, 0x5c, 0x3c, 0xb7, 0x81, 0x91, >> 0x94, 0xc5, 0xd7, 0xab, 0x05, 0x7a, 0x45, 0xc2, >> 0x35, 0xa1, 0x64, 0x60, 0xf5, 0xea, 0x86, 0xd4, >> 0xb1, 0x2b, 0x1a, 0x53, 0x8c, 0x0c, 0x58, 0xa8, >> 0xf8, 0x8f, 0x3f, 0x08, 0x73, 0x7f, 0x9c, 0xf2, >> 0x98, 0xb5, 0x21, 0x12, 0xce, 0x3a, 0x01, 0xe7, >> 0x1c, 0x5a, 0x66, 0x88, 0xd9, 0x47, 0x4b, 0x83, >> 0xee, 0xbb, 0x6d, 0x26, 0xdc, 0xdf, 0x76, 0xd1, >> 0xc8, 0xc0, 0x2e, 0xa4, 0x6a, 0x43, 0x30, 0x32, >> 0x6f, 0x78, 0x68, 0x4d, 0x23, 0x03, 0x41, 0x9e, >> 0x0d, 0xa9, 0x2c, 0x54, 0xeb, 0xd5, 0xa2, 0x61, >> 0x7b, 0xc3, 0xc6, 0xac, 0xb8, 0x92, 0xbe, 0x5d, >> 0xe0, 0xd2, 0xbc, 0x27, 
0x48, 0x84, 0x5b, 0x89,
>> 0x3b, 0xe8, 0xb6, 0x13, 0x80, 0xf3, 0x90, 0x09,
>> 0x93, 0x5e, 0xc4, 0xad, 0xd6, 0x62, 0xaa, 0x55,
>> 0x04, 0x9f, 0x79, 0x4e, 0x44, 0x33, 0xc1, 0xa5,
>> 0x34, 0xa6, 0xa0, 0x4f, 0x63, 0x56, 0x5f, 0xae,
>> 0xf4, 0x0a, 0xe9, 0x14, 0x85, 0x8a, 0xd3, 0x28,
>> 0xb0, 0x51, 0x2a, 0x16, 0x19, 0x18, 0x52, 0x17,
>> 0x8b, 0x29, 0x0b, 0x15, 0x57, 0xaf, 0xa7, 0x50
>>
>> The original purpose of this work was to enable Linux to have
>> triple-parity RAID, but I guess it is all right for the rest of the
>> open source community to use it too. If you are not sure about your
>> situation, or you'd rather talk to me in private about other
>> issues, please contact me at: creamyfish@gmail.com
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html

Hi Alex and David,

please keep up the work on this topic. This is great stuff: extreme
mathematics put to use. At the moment I suggest that all my customers
dedicate one redundancy disk to every four data disks in a RAID. For
larger RAIDs this gets expensive at times, and one has to use RAID60
combinations.

If we could use RAID7 (will it be called that?) or RAID70, we could
use more data disks, as the dropout probability shifts. I guess when
using RAID7 it would be safe enough to have a RAID of 16 data disks...

So please allow me to thank you very much for your efforts!

Greetings,
Stefan

^ permalink raw reply	[flat|nested] 37+ messages in thread
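The shift-and-mask multiply by {02}^(-1) quoted above is easy to sanity-check against a straightforward GF(2^8) multiply. A sketch (`gf_mul`, `mask` and `mul_inv2` are my names, polynomial 0x11d; {02}^(-1) is the byte 0x8e, the first non-trivial entry of the powers table):

```c
#include <assert.h>
#include <stdint.h>

/* Reference GF(2^8) multiply, polynomial 0x11d. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
    }
    return p;
}

/* Low bit of every byte smeared to 0x00 or 0xff, as in MASK() above. */
static uint32_t mask(uint32_t v)
{
    v &= 0x01010101;
    return (v << 8) - v;
}

/* Four parallel multiplications by {02}^(-1) = {8e}: shift each byte
 * right, then fold the dropped low bit back in as 0x8e. */
static uint32_t mul_inv2(uint32_t v)
{
    return ((v >> 1) & 0x7f7f7f7f) ^ (mask(v) & 0x8e8e8e8e);
}
```

Note that `v >> 1` leaks each byte's low bit into its right-hand neighbour's top bit; the `& 0x7f7f7f7f` is exactly what keeps the four byte lanes independent.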
* Re: Is this enough for us to have triple-parity RAID? 2012-04-17 16:37 ` Stefan /*St0fF*/ Hübner @ 2012-04-18 14:15 ` Alex 2012-04-18 14:11 ` David Brown 0 siblings, 1 reply; 37+ messages in thread From: Alex @ 2012-04-18 14:15 UTC (permalink / raw) To: stefan.huebner; +Cc: linux-raid Hi Stefan, This is just a test of using the list. But still, thanks to your encouragement. Cheers, Alex On Wed, Apr 18, 2012 at 12:37 AM, Stefan /*St0fF*/ Hübner <stefan.huebner@stud.tu-ilmenau.de> wrote: > Am 17.04.2012 09:58, schrieb David Brown: >> Hi Alex, >> >> I've been playing around with triple-parity raid theory for a while. >> It's been mainly for my own interest - it's fun to do some deeper maths >> (I studied maths at university, but this is the first time I've done >> serious group theory in the last 20 years), fun to resurrect my LaTeX >> skills, and maybe it will be useful to the md raid developers. >> >> My current state is that I've got theory worked out and written up - not >> just for triple parity, but for more parities as well. For some of it, >> I've got Python code to test and verify the maths. It turns out that >> triple parity can work well - but for quad parity the limit is 21 data >> disks (using generators 2, 4, and 8), or up to 33 (using for example >> 0x07, 0x35 and 0x8b as generators). Realistically, I think triple >> parity is the limit for practical implementations. >> >> I haven't finished the paper - in particular, I haven't filled out the >> simplifications of the triple parity work from more general matrix >> forms. I also hope to work on implementation details for the >> calculations. But maybe there is already enough to be of interest to >> some people, such as yourself. >> >> <http://redhouse.homelinux.net/raid/> >> >> mvh., >> >> David >> >> >> >> On 17/04/2012 08:11, Alex wrote: >>> Thanks to Billy Crook who pointed out this is the right place for my >>> post. >>> >>> Adam Leventhal integrated triple-parity RAID into ZFS in 2009. 
>>> [...]

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-18 14:15 ` Alex @ 2012-04-18 14:11 ` David Brown 0 siblings, 0 replies; 37+ messages in thread From: David Brown @ 2012-04-18 14:11 UTC (permalink / raw) To: Alex; +Cc: stefan.huebner, linux-raid On 18/04/2012 16:15, Alex wrote: > Hi Stefan, > This is just a test of using the list. But still, thanks to your encouragement. > Well, your test worked - it woke me up and made me publish my work so far. It is also, AFAIK, the best list for discussing these issues. The md raid developers all follow the list, as do many users. mvh., David ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-17 7:58 ` David Brown 2012-04-17 16:37 ` Stefan /*St0fF*/ Hübner @ 2012-04-17 17:16 ` Piergiorgio Sartor 2012-04-17 20:18 ` David Brown 1 sibling, 1 reply; 37+ messages in thread From: Piergiorgio Sartor @ 2012-04-17 17:16 UTC (permalink / raw) To: David Brown; +Cc: linux-raid Hi David, > My current state is that I've got theory worked out and written up - > not just for triple parity, but for more parities as well. For some > of it, I've got Python code to test and verify the maths. It turns > out that triple parity can work well - but for quad parity the limit > is 21 data disks (using generators 2, 4, and 8), or up to 33 (using > for example 0x07, 0x35 and 0x8b as generators). Realistically, I > think triple parity is the limit for practical implementations. Why is that? An RS code (255,251) should be possible, like it is a (255,223). What's the limitation? I'm sure there is even a "RAID-96", which is (96,64). My wild guess would be that the generators must be chosen in some way. Have you had a look at the "par2" code? That seems to be capable of doing a parametric RS, even if in 16bit words. bye, -- piergiorgio ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-17 17:16 ` Piergiorgio Sartor @ 2012-04-17 20:18 ` David Brown 2012-04-17 20:54 ` Piergiorgio Sartor ` (2 more replies) 0 siblings, 3 replies; 37+ messages in thread From: David Brown @ 2012-04-17 20:18 UTC (permalink / raw) To: Piergiorgio Sartor; +Cc: linux-raid

On 17/04/12 19:16, Piergiorgio Sartor wrote:
> Hi David,
>
>> My current state is that I've got theory worked out and written up -
>> not just for triple parity, but for more parities as well. For some
>> of it, I've got Python code to test and verify the maths. It turns
>> out that triple parity can work well - but for quad parity the limit
>> is 21 data disks (using generators 2, 4, and 8), or up to 33 (using
>> for example 0x07, 0x35 and 0x8b as generators). Realistically, I
>> think triple parity is the limit for practical implementations.
>
> Why is that? An RS code (255,251) should be possible, like
> it is a (255,223). What's the limitation?
> I'm sure there is even a "RAID-96", which is (96,64).
>
> My wild guess would be that the generators must be chosen
> in some way.
>
> Have you had a look at the "par2" code? That seems to be
> capable of doing a parametric RS, even if in 16bit words.
>
> bye,
>

It is certainly possible to make raid systems using as many disks as
you want, and as many parities as you want. The trick is to do so in
an efficient manner.

The way raid6 is implemented is using simple polynomials over the
Galois field GF(2⁸):

Py = g_y^0 . D0 + g_y^1 . D1 + g_y^2 . D2 + ... + g_y^(n-1) . D(n-1)

For raid5, you have one parity generator, and g0 is 1 (i.e., a simple
xor). For raid6, you have two parities - generators 1 and 2. For
triple parity, we would use g0 = 1, g1 = 2 and g2 = 4. The whole
thing is therefore simple (or relatively simple!) to implement, and
should run fast - calculating the third parity will only be slightly
slower than calculating the second parity in raid6 is today, and
recovery will be as fast as raid6 unless you need to recover from 3
data disk failures.

For quad parity, we can try g3 = 8 as the obvious next choice in the
pattern. Unfortunately, we start hitting conflicts. To recover
missing data, we have to solve multiple simultaneous equations over
GF(2⁸), whose coefficients depend on the index numbers of the missing
disks. With parity generators (1, 2, 4, 8), some of these
combinations of missing disk indexes lead to insoluble equations when
you have more than 21 data disks. I didn't find any neat way to prove
this, or to prove that all combinations of missing disks for parity
generators (1, 2, 4) work out okay. I wrote some brute-force checking
code, and let my computer do the hard work.

There are other ways to make a raid system with quad parity or more.
The parities don't have to be calculated in such a neat form as above
- perhaps even something as simple as re-ordering the powers of the
parity generators would help. We could use a different operation
rather than the GF(2⁸) multiply. We could use different sized Galois
fields. Or we could try something completely different, such as RS
codes. Somewhere there we would get quad parity and more - for as
many disks as we like. However, I would be surprised if there were
another method that had the practicality of implementation that we
get with this system. There are /huge/ benefits in having the
triple-parity raid being the same as raid6 but with an extra parity,
and being able to generate it in the same way.

If quad parity is something that appeals to anyone, then it could be
implemented at the same time - and 21 data disks (25 disks total)
would be the limit.
Even though the same equations with different generators can give up
to 33 data disks, I expect that compatibility with raid6 and
triple-parity would outweigh the benefits of supporting more disks.
After all, a 25-disk array is already pretty big for a single array.

mvh.,

David

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-17 20:18 ` David Brown @ 2012-04-17 20:54 ` Piergiorgio Sartor 2012-04-18 18:22 ` Piergiorgio Sartor 2012-04-19 18:16 ` H. Peter Anvin 2 siblings, 0 replies; 37+ messages in thread From: Piergiorgio Sartor @ 2012-04-17 20:54 UTC (permalink / raw) To: David Brown; +Cc: Piergiorgio Sartor, linux-raid

Hi David,

On Tue, Apr 17, 2012 at 10:18:55PM +0200, David Brown wrote:
[some stuff removed]

I think we need to understand systematic RS codes a bit better. The
ideal case would be a RAID system where, while adding a disk, the
user can decide whether it (the disk) will be used for data extension
or for redundancy (parity) extension. This should always be possible,
and, in the first place, performance should not be the target.

As mentioned before, par2 does it without too many issues, so I fail
to see why three parities should be the limit.

Actually, I'm trying to catch some RS expert in order to get an
explanation (Q&A) of how the whole thing works. Unfortunately,
experts seem to be quite sluggish... :-)

bye,

-- 
piergiorgio

^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-17 20:18 ` David Brown 2012-04-17 20:54 ` Piergiorgio Sartor @ 2012-04-18 18:22 ` Piergiorgio Sartor 2012-04-18 20:20 ` David Brown 2012-04-19 18:16 ` H. Peter Anvin 2 siblings, 1 reply; 37+ messages in thread From: Piergiorgio Sartor @ 2012-04-18 18:22 UTC (permalink / raw) To: David Brown; +Cc: Piergiorgio Sartor, linux-raid

Hi David,

On Tue, Apr 17, 2012 at 10:18:55PM +0200, David Brown wrote:
[...]
> For quad parity, we can try g3 = 8 as the obvious next choice in the
> pattern. Unfortunately, we start hitting conflicts. To recover

you should not use 8, because this is not a generator of GF(256) with
the polynomial 285, the standard for the RAID-5/6 setup.

This means that 8^k does not cover the complete field for k in
[0, 254], thus having cycles and, consequently, creating conflicts.

Some generators would be:

2, 4, 6, 9, 13, 14, 16...

but not 32 nor 64.

I know that powers of two are nice, but if you want to have generic
RAID, you must use other values.

The log/exp tables are, of course, always valid.

BTW, GF(256) with the polynomial 285 has exactly 128 generators, so
it would be possible to have up to 129 parity disks (1 is not a
generator), for, I guess, a max of 256 disks (or maybe 255?).

Hope this helps,

bye,

-- 
piergiorgio

^ permalink raw reply	[flat|nested] 37+ messages in thread
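These figures are easy to check by machine: with the polynomial 285 (0x11d), 8 = 2^3 has multiplicative order 255/gcd(255, 3) = 85 rather than 255, while exactly phi(255) = 128 of the 255 non-zero elements are generators. A sketch (the helper names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* GF(2^8) multiply, polynomial 0x11d (decimal 285). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
    }
    return p;
}

/* Multiplicative order of a: smallest k > 0 with a^k = 1. */
static int gf_order(uint8_t a)
{
    uint8_t p = a;
    int k = 1;
    while (p != 1) {
        p = gf_mul(p, a);
        k++;
    }
    return k;
}

/* Generators are exactly the elements of maximal order 255. */
static int count_generators(void)
{
    int a, n = 0;
    for (a = 1; a < 256; a++)
        if (gf_order((uint8_t)a) == 255)
            n++;
    return n;
}
```

Running this confirms that 2, 4 and 16 are generators while 8 and 32 are not (8 has order 85, 32 has order 51), and that the generator count is 128.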
* Re: Is this enough for us to have triple-parity RAID? 2012-04-18 18:22 ` Piergiorgio Sartor @ 2012-04-18 20:20 ` David Brown 2012-04-18 20:39 ` Piergiorgio Sartor 0 siblings, 1 reply; 37+ messages in thread From: David Brown @ 2012-04-18 20:20 UTC (permalink / raw) To: Piergiorgio Sartor; +Cc: linux-raid On 18/04/12 20:22, Piergiorgio Sartor wrote: > Hi David, > > On Tue, Apr 17, 2012 at 10:18:55PM +0200, David Brown wrote: > [...] > >> For quad parity, we can try g3 = 8 as the obvious next choice in the >> pattern. Unfortunately, we start hitting conflicts. To recover > > you should not use 8, because this is not a generator > of GF(256) with polynomial 285, the standard for the > RAID-5/6 setup. > > This means that 8^k does not cover the complete field > for k in [0, 254], thus having cycles and, consequently, > creating conflicts. > > Some generators could be: > > 2, 4, 6, 9, 13, 14, 16... > > but not 32 nor 64. > > I know that powers of two are nice, but if you want to > have generic RAID, you must use other values. > I know that 8 is not a generator, and therefore you cannot expect to get a full set of (256 - noOfParities) disks. But picking another generator (such as 16) is not enough to guarantee you the full range - it is a requirement, but not sufficient. The generators need to be independent of each other, in the sense that all the simultaneous equations for all the combinations of failed disks need to be soluble. It turns out that if you pick 16 as the fourth parity generator here (1, 2, 4, 16), then you can only have 5 data disks. In fact, there are no other values for g3 that give significantly more than 21 data disks in combination with (1, 2, 4), whether or not they happen to be a generator for all of GF(2⁸). > The log/exp tables are, of course, always valid. 
> > BTW, the GF(256) with polynomial 285 has exactly 128 > generators, so it would be possible to have up to 129 > parity disks (1 is not a generator), for, I guess, a > max of 256 disks (or maybe 255?). > > Hope this helps, > > bye, > When I started out with this, I thought it was as simple as you are suggesting. But it is not - picking a set of generators for GF(2⁸) is not enough. You have to check that all the solution matrices are invertible for all combinations of failed disks. In fact, it is a little surprising that (1, 2, 4) works so well for triple parity. I don't know whether it is just luck, or genius in Peter Anvin's choice of the multiply operation. mvh., David -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
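The "check that all the solution matrices are invertible" step can be sketched in Python (the thread says David used a Python program; this is my own reconstruction, not his code, and his exact solvability criterion may differ). Here a parity set supports n data disks if, for every f failed data disks, every choice of f surviving parity rows gives an invertible f-by-f matrix over GF(2^8):

```python
from itertools import combinations

POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (285), the RAID-6 field polynomial

def gf_mul(a, b):
    # Bitwise carry-less multiply with reduction by POLY.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def nonsingular(m):
    """Gaussian elimination over GF(2^8); addition in the field is XOR."""
    m = [row[:] for row in m]
    n = len(m)
    for c in range(n):
        piv = next((r for r in range(c, n) if m[r][c]), None)
        if piv is None:
            return False
        m[c], m[piv] = m[piv], m[c]
        inv = gf_pow(m[c][c], 254)          # a^254 = a^-1 for nonzero a
        for r in range(c + 1, n):
            if m[r][c]:
                f = gf_mul(m[r][c], inv)
                m[r] = [m[r][k] ^ gf_mul(f, m[c][k]) for k in range(n)]
    return True

def max_data_disks(gens, limit):
    """Largest n <= limit such that any f <= len(gens) failed data disks
    can be rebuilt from any f of the surviving parities.
    (f = 1 is always fine: the coefficients g^d are nonzero.)"""
    for n in range(len(gens), limit + 1):
        for f in range(2, len(gens) + 1):
            for pars in combinations(gens, f):
                for failed in combinations(range(n), f):
                    if not nonsingular([[gf_pow(g, d) for d in failed]
                                        for g in pars]):
                        return n - 1
    return limit

print(max_data_disks([1, 2, 4], 6))      # 6: (1, 2, 4) passes up to the limit
print(max_data_disks([1, 2, 4, 8], 6))   # 6: (1, 2, 4, 8) also passes here
```

Run with larger limits, a search of this kind is what produces bounds like the figures quoted above; the exact numbers depend on the precise criterion used.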
* Re: Is this enough for us to have triple-parity RAID? 2012-04-18 20:20 ` David Brown @ 2012-04-18 20:39 ` Piergiorgio Sartor 0 siblings, 0 replies; 37+ messages in thread From: Piergiorgio Sartor @ 2012-04-18 20:39 UTC (permalink / raw) To: David Brown; +Cc: Piergiorgio Sartor, linux-raid Hi David, > I know that 8 is not a generator, and therefore you cannot expect to > get a full set of (256 - noOfParities) disks. But picking another > generator (such as 16) is not enough to guarantee you the full range > - it is a requirement, but not sufficient. The generators need to > be independent of each other, in the sense that all the simultaneous > equations for all the combinations of failed disks need to be > soluble. uhm, then how can RS(255,223) or the equivalent RS(160,128) work? In all the papers I saw, the "independence" of the GF generators was never mentioned; do you have any reference I can look into? > It turns out that if you pick 16 as the fourth parity generator here > (1, 2, 4, 16), then you can only have 5 data disks. In fact, there > are no other values for g3 that give significantly more than 21 data > disks in combination with (1, 2, 4), whether or not they happen to > be a generator for all of GF(2⁸). That's really interesting, but, again, I fail to see how this could be, given that there are larger range codes. Maybe they do not use the polynomial 285, in fact AES uses 283, but the first generator is 3 and not 2, actually no powers of two appear, if I remember correctly. > When I started out with this, I thought it was as simple as you are > suggesting. But it is not - picking a set of generators for GF(2⁸) > is not enough. You have to check that all the solution matrices are > invertible for all combinations of failed disks. Check or prove? How do you do that? And what do you mean exactly? I mean with "all combinations of failed disks". As already written, par2 implements a generic RS parity generator, and it works with very few limitations. 
Do they do something different? > In fact, it is a little surprising that (1, 2, 4) works so well for > triple parity. I don't know whether it is just luck, or genius in > Peter Anvin's choice of the multiply operation. The GF(256) with polynomial 285 seems quite the structure of choice, I wonder if the polynomial must fulfill specific conditions in order to allow a wider range of parities... bye, -- piergiorgio -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-17 20:18 ` David Brown 2012-04-17 20:54 ` Piergiorgio Sartor 2012-04-18 18:22 ` Piergiorgio Sartor @ 2012-04-19 18:16 ` H. Peter Anvin 2012-04-20 2:27 ` Alex 2 siblings, 1 reply; 37+ messages in thread From: H. Peter Anvin @ 2012-04-19 18:16 UTC (permalink / raw) To: David Brown; +Cc: Piergiorgio Sartor, linux-raid On 04/17/2012 01:18 PM, David Brown wrote: > > For quad parity, we can try g3 = 8 as the obvious next choice in the > pattern. Unfortunately, we start hitting conflicts. To recover missing > data, we have to solve multiple simultaneous equations over GF(2⁸), whose > coefficients depend on the index numbers of the missing disks. With > parity generators (1, 2, 4, 8), some of these combinations of missing > disk indexes lead to insoluble equations when you have more than 21 disks. > That is because 255 = 3*5*17... this means {02}^3 = {08} is not a generator. -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
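The point can be checked directly: since {02} generates the full 255-element multiplicative group and {08} = {02}^3, the order of {08} is 255/gcd(3, 255) = 85. A small sketch (not from the thread):

```python
from math import gcd

# {08} = {02}^3, and 2 has multiplicative order 255 in the RAID-6 field,
# so the order of 8 is 255 / gcd(3, 255).
print(255 // gcd(3, 255))   # 85

# Confirm by enumeration: powers of 8 cycle through only 85 distinct values.
def gf_mul(a, b, poly=0x11D):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

seen, x = set(), 1
while x not in seen:
    seen.add(x)
    x = gf_mul(x, 8)
print(len(seen))            # 85, not 255 -> {08} is not a generator
```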
* Re: Is this enough for us to have triple-parity RAID? 2012-04-19 18:16 ` H. Peter Anvin @ 2012-04-20 2:27 ` Alex 2012-04-20 3:00 ` H. Peter Anvin 0 siblings, 1 reply; 37+ messages in thread From: Alex @ 2012-04-20 2:27 UTC (permalink / raw) To: H. Peter Anvin; +Cc: linux-raid I think when David says 'generator', he doesn't mean a generator of the order-256 Galois field, he means an arbitrary set of numbers in it which can render the system of equations solvable up to a certain number of data disks (not necessarily 255). He uses a brute-force method with the help of a Python program to actually figure that out. It looks pretty cool to me since I have known for a while that the system of 4 equations generally fails to render a solution, but now I know exactly how many ways it may fail... Cheers, Alex On Fri, Apr 20, 2012 at 2:16 AM, H. Peter Anvin <hpa@zytor.com> wrote: > On 04/17/2012 01:18 PM, David Brown wrote: >> >> For quad parity, we can try g3 = 8 as the obvious next choice in the >> pattern. Unfortunately, we start hitting conflicts. To recover missing >> data, we have to solve multiple simultaneous equations over GF(2⁸), whose >> coefficients depend on the index numbers of the missing disks. With >> parity generators (1, 2, 4, 8), some of these combinations of missing >> disk indexes lead to insoluble equations when you have more than 21 disks. >> > > That is because 255 = 3*5*17... this means {02}^3 = {08} is not a generator. > > -hpa > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-20 2:27 ` Alex @ 2012-04-20 3:00 ` H. Peter Anvin 2012-04-20 3:32 ` Alex 0 siblings, 1 reply; 37+ messages in thread From: H. Peter Anvin @ 2012-04-20 3:00 UTC (permalink / raw) To: Alex; +Cc: linux-raid Being a generator is a requirement for that. Alex <creamyfish@gmail.com> wrote: >I think when David says 'generator', he doesn't mean the generator of >the order >8 Galois field, he means an arbitrary set of number in it which can >render the >system of equations solvable to up to a certain number of data >disks(not necessarily >255). He uses a brute-force method with the help of a Python program to >actually >figure that out. It looks pretty cool to me since I have known the >system of 4 equations >generally fails to render a solution for a while, but now I know >exactly how many ways >it may fail... > >Cheers, >Alex > > >On Fri, Apr 20, 2012 at 2:16 AM, H. Peter Anvin <hpa@zytor.com> wrote: >> On 04/17/2012 01:18 PM, David Brown wrote: >>> >>> For quad parity, we can try g3 = 8 as the obvious next choice in the >>> pattern. Unfortunately, we start hitting conflicts. To recover >missing >>> data, we have to solve multiple simultaneous equations over G(2⁸), >whose >>> coefficients depend on the index numbers of the missing disks. With >>> parity generators (1, 2, 4, 8), some of these combinations of >missing >>> disk indexes lead to insoluble equations when you have more that 21 >disks. >>> >> >> That is because 255 = 3*5*17... this means {02}^3 = {08} is not a >generator. >> >> -hpa >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" >in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- Sent from my mobile phone. Please excuse brevity and lack of formatting. 
-- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-20 3:00 ` H. Peter Anvin @ 2012-04-20 3:32 ` Alex 2012-04-20 18:58 ` David Brown 0 siblings, 1 reply; 37+ messages in thread From: Alex @ 2012-04-20 3:32 UTC (permalink / raw) To: H. Peter Anvin; +Cc: linux-raid I understand we need a generator to facilitate a 255-data-disk array, but 255 sounds like a theoretical limit to me. ZFS currently supports an array of only 9 disks (6 of them data disks), so having, say, a quad-parity array of 48 disks (theoretically) doesn't sound that bad, does it? Cheers, Alex On Fri, Apr 20, 2012 at 11:00 AM, H. Peter Anvin <hpa@zytor.com> wrote: > Being a generator is a requirement for that. > > Alex <creamyfish@gmail.com> wrote: > >>I think when David says 'generator', he doesn't mean the generator of >>the order >>8 Galois field, he means an arbitrary set of numbers in it which can >>render the >>system of equations solvable up to a certain number of data >>disks(not necessarily >>255). He uses a brute-force method with the help of a Python program to >>actually >>figure that out. It looks pretty cool to me since I have known the >>system of 4 equations >>generally fails to render a solution for a while, but now I know >>exactly how many ways >>it may fail... >> >>Cheers, >>Alex >> >> >>On Fri, Apr 20, 2012 at 2:16 AM, H. Peter Anvin <hpa@zytor.com> wrote: >>> On 04/17/2012 01:18 PM, David Brown wrote: >>>> >>>> For quad parity, we can try g3 = 8 as the obvious next choice in the >>>> pattern. Unfortunately, we start hitting conflicts. To recover >>missing >>>> data, we have to solve multiple simultaneous equations over GF(2⁸), >>whose >>>> coefficients depend on the index numbers of the missing disks. With >>>> parity generators (1, 2, 4, 8), some of these combinations of >>missing >>>> disk indexes lead to insoluble equations when you have more than 21 >>disks. >>>> >>> >>> That is because 255 = 3*5*17... this means {02}^3 = {08} is not a >>generator. 
>>> >>> -hpa >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-raid" >>in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > Sent from my mobile phone. Please excuse brevity and lack of formatting. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-20 3:32 ` Alex @ 2012-04-20 18:58 ` David Brown 2012-04-20 19:39 ` H. Peter Anvin 2012-04-20 21:01 ` Piergiorgio Sartor 0 siblings, 2 replies; 37+ messages in thread From: David Brown @ 2012-04-20 18:58 UTC (permalink / raw) To: Alex; +Cc: H. Peter Anvin, linux-raid Hi, Yes, being a generator for GF(2^8) is a requirement for a parity generator (sorry for the confusing terminology here - if anyone has a better suggestion, please say) to be part of a 255 data disk system. However, being a GF generator is necessary but not sufficient - using parity generators (1, 2, 4, 16) will /not/ give quad parity for 255 data disks, even though individually each of 1, 2, 4 and 16 is a generator for GF. 255 data disks is the theoretical limit for GF(2⁸). But it is a theoretical limit of the algorithms - I don't know whether Linux md raid actually supports that many disks. I certainly doubt if it is useful. It might well be that quad parity limited to 21 data disks is useful - or at least, as useful as quad parity ever would be. It would fit well within a typical large chassis with 24 disk slots. And then it doesn't matter that 8 is not a generator for GF(2⁸) - it becomes the best choice because it has the easiest implementation. On 20/04/12 05:32, Alex wrote: > I understand we need a generator to facilitate a 255 data disks > array, but 255 sounds like a theoretic limit to me. ZFS now only > supports an array of only 9 disks(6 of them are data disks), so > having, say, a quad-parity array of 48 disks(theoretically) doesn't > sound that bad, does it? > > Cheers, Alex > > On Fri, Apr 20, 2012 at 11:00 AM, H. Peter Anvin<hpa@zytor.com> > wrote: >> Being a generator is a requirement for that. 
>> >> Alex<creamyfish@gmail.com> wrote: >> >>> I think when David says 'generator', he doesn't mean the >>> generator of the order 8 Galois field, he means an arbitrary set >>> of number in it which can render the system of equations solvable >>> to up to a certain number of data disks(not necessarily 255). He >>> uses a brute-force method with the help of a Python program to >>> actually figure that out. It looks pretty cool to me since I have >>> known the system of 4 equations generally fails to render a >>> solution for a while, but now I know exactly how many ways it may >>> fail... >>> >>> Cheers, Alex >>> >>> >>> On Fri, Apr 20, 2012 at 2:16 AM, H. Peter Anvin<hpa@zytor.com> >>> wrote: >>>> On 04/17/2012 01:18 PM, David Brown wrote: >>>>> >>>>> For quad parity, we can try g3 = 8 as the obvious next choice >>>>> in the pattern. Unfortunately, we start hitting conflicts. >>>>> To recover >>> missing >>>>> data, we have to solve multiple simultaneous equations over >>>>> G(2⁸), >>> whose >>>>> coefficients depend on the index numbers of the missing >>>>> disks. With parity generators (1, 2, 4, 8), some of these >>>>> combinations of >>> missing >>>>> disk indexes lead to insoluble equations when you have more >>>>> that 21 >>> disks. >>>>> >>>> >>>> That is because 255 = 3*5*17... this means {02}^3 = {08} is not >>>> a >>> generator. >>>> >>>> -hpa >>>> -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-20 18:58 ` David Brown @ 2012-04-20 19:39 ` H. Peter Anvin 2012-04-20 21:04 ` Piergiorgio Sartor 2012-04-20 21:01 ` Piergiorgio Sartor 1 sibling, 1 reply; 37+ messages in thread From: H. Peter Anvin @ 2012-04-20 19:39 UTC (permalink / raw) To: David Brown; +Cc: Alex, linux-raid On 04/20/2012 11:58 AM, David Brown wrote: > Hi, > > Yes, being a generator for GF(2^8) is a requirement for a parity > generator (sorry for the confusing terminology here - if anyone has a > better suggestion, please say) to be part of a 255 data disk system. > However, being a GF generator is necessary but not sufficient - using > parity generators (1, 2, 4, 16) will /not/ give quad parity for 255 data > disks, even though individually each of 1, 2, 4 and 16 are generators > for GF. > > 255 data disks is the theoretical limit for GF(2⁸). But it is a > theoretical limit of the algorithms - I don't know whether Linux md raid > actually supports that many disks. I certainly doubt if it is useful. > > It might well be that a 21 data disk limit quad parity is useful - or at > least, as useful as quad parity ever would be. It would fit well within > a typical large chassis with 24 disk slots. And then it doesn't matter > that 8 is not a generator for GF(2⁸) - it becomes the best choice > because of the easiest implementation. > It is also worth noting that there is nothing magical about GF(2^8). It is just a reasonable tradeoff when tables are needed. There are hardware tricks one can play to do efficient operations on wider fields, too. But it sounds like {04} or {8e} are particularly interesting generators of the existing GF(2^8) field for an efficient second field, giving triple-parity RAID again at a reasonable cost. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. 
-- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
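Both candidates are easy to verify: {04} = {02}^2 and {8e} = {02}^-1, and both exponents (2 and 254) are coprime to 255, so both are generators. A quick check (my sketch, not from the thread):

```python
def gf_mul(a, b, poly=0x11D):
    # GF(2^8) multiply with the RAID-6 field polynomial 285.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def order(a):
    x, k = a, 1
    while x != 1:
        x = gf_mul(x, a)
        k += 1
    return k

print(order(0x04), order(0x8E))   # 255 255 -> both are generators
print(gf_mul(0x02, 0x8E))         # 1 -> {8e} is the inverse of {02}
```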
* Re: Is this enough for us to have triple-parity RAID? 2012-04-20 19:39 ` H. Peter Anvin @ 2012-04-20 21:04 ` Piergiorgio Sartor 0 siblings, 0 replies; 37+ messages in thread From: Piergiorgio Sartor @ 2012-04-20 21:04 UTC (permalink / raw) To: H. Peter Anvin; +Cc: David Brown, Alex, linux-raid Hi Peter, > > Yes, being a generator for GF(2^8) is a requirement for a parity > > generator (sorry for the confusing terminology here - if anyone has a > > better suggestion, please say) to be part of a 255 data disk system. > > However, being a GF generator is necessary but not sufficient - using > > parity generators (1, 2, 4, 16) will /not/ give quad parity for 255 data > > disks, even though individually each of 1, 2, 4 and 16 are generators > > for GF. [...] > It is also worth noting that there is nothing magical about GF(2^8). It > is just a reasonable tradeoff when tables are needed. I, then, ask you too. What is this story that being a generator is not enough? Is there any reference, documentation, link which can be studied in order to understand this limitation? In all RS papers I found, the only constraint given was that the Vandermonde matrix must be constructed with generators. Not all RAID examples used them, but no paper, at least as far as I understood, limited the generators to be also "independent". Any understandable explanation? Thanks, bye, -- piergiorgio ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-20 18:58 ` David Brown 2012-04-20 19:39 ` H. Peter Anvin @ 2012-04-20 21:01 ` Piergiorgio Sartor 2012-04-20 21:29 ` Peter Grandi 1 sibling, 1 reply; 37+ messages in thread From: Piergiorgio Sartor @ 2012-04-20 21:01 UTC (permalink / raw) To: David Brown; +Cc: Alex, H. Peter Anvin, linux-raid Hi again David, > Yes, being a generator for GF(2^8) is a requirement for a parity > generator (sorry for the confusing terminology here - if anyone has > a better suggestion, please say) to be part of a 255 data disk > system. However, being a GF generator is necessary but not > sufficient - using parity generators (1, 2, 4, 16) will /not/ give > quad parity for 255 data disks, even though individually each of 1, > 2, 4 and 16 are generators for GF. I ask again, could you please elaborate on this? I have nowhere found such a further constraint for the parities. All I could find is that the Vandermonde matrix must be built with generators. > 255 data disks is the theoretical limit for GF(2⁸). But it is a > theoretical limit of the algorithms - I don't know whether Linux md > raid actually supports that many disks. I certainly doubt if it is > useful. The reason to use many disks is in case of geo-redundant RAID, for example with iscsi. In this situation you want to have a lot of redundancy, in parities, not mirrors. bye, -- piergiorgio -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-20 21:01 ` Piergiorgio Sartor @ 2012-04-20 21:29 ` Peter Grandi 2012-04-20 22:31 ` Piergiorgio Sartor 0 siblings, 1 reply; 37+ messages in thread From: Peter Grandi @ 2012-04-20 21:29 UTC (permalink / raw) To: Linux RAID [ ... ] >> 255 data disks is the theoretical limit for GF(2⁸). But it >> is a theoretical limit of the algorithms - I don't know >> whether Linux md raid actually supports that many disks. I >> certainly doubt if it is useful. > The reason to use many disks is in case of geo-redundant RAID, > for example with iscsi. In this situation you want to have a > lot of redundance, in parities, not mirror. is that something that makes sense? If one has the extreme requirements implied by that, why not use self-repairing coding similar to Parchive-style storage formats, for example Typhoon or the Azure filesystem or others inspired by Parchive. Out of interest I just did a small web search and it turned up a recent survey/lecture by Frédérique Oggier on the maths of these coding systems: http://phdopen.mimuw.edu.pl/lato12/LectPoland.pdf http://sands.sce.ntu.edu.sg/CodingForNetworkedStorage/ -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-20 21:29 ` Peter Grandi @ 2012-04-20 22:31 ` Piergiorgio Sartor 2012-04-21 9:51 ` Peter Grandi 0 siblings, 1 reply; 37+ messages in thread From: Piergiorgio Sartor @ 2012-04-20 22:31 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux RAID Hi Peter, On Fri, Apr 20, 2012 at 10:29:11PM +0100, Peter Grandi wrote: > [ ... ] > > >> 255 data disks is the theoretical limit for GF(2⁸). But it > >> is a theoretical limit of the algorithms - I don't know > >> whether Linux md raid actually supports that many disks. I > >> certainly doubt if it is useful. > > > The reason to use many disks is in case of geo-redundant RAID, > > for example with iscsi. In this situation you want to have a > > lot of redundance, in parities, not mirror. > > is that something that makes sense? If one has the extreme > requirements implied by that why not use self repairing coding > similarly to Parchive style storage formats, for example Typhoon > or the Azure filesystem or others inspired by Parchive. it depends on other requirements, for example whether you want to control your files yourself or leave the control to the "cloud". In case of RAID, the cloud sees only raw bytes and the local host sees the files too. In case of par2, the cloud must see the files. BTW, this is not my original idea; as already mentioned, someone is doing a RAID-96 with RS(96,64) over networked virtual drives. > Out of interest I just did a small web search and it turned up a > recent survey/lecture by Frédérique Oggier on the maths of these > coding systems: > > http://phdopen.mimuw.edu.pl/lato12/LectPoland.pdf > http://sands.sce.ntu.edu.sg/CodingForNetworkedStorage/ Interesting, I'll have a look. bye, -- piergiorgio -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-20 22:31 ` Piergiorgio Sartor @ 2012-04-21 9:51 ` Peter Grandi 2012-04-21 11:18 ` Piergiorgio Sartor 0 siblings, 1 reply; 37+ messages in thread From: Peter Grandi @ 2012-04-21 9:51 UTC (permalink / raw) To: Linux RAID >>>> 255 data disks is the theoretical limit for GF(2⁸). [ ... ] >>> The reason to use many disks is in case of geo-redundant RAID, >>> for example with iscsi. In this situation you want to have a >>> lot of redundance, in parities, not mirror. >> [ ... ] extreme requirements implied by that why not use self >> repairing coding similarly to Parchive style storage formats, >> [ ... ] > it depends on other requirements, for example if you want to > control your file or let the control to the "cloud". In case > of RAID, the cloud sees only raw bytes and the local host sees > the files too. In case of par2, the cloud must see the > files. [ ... ] Perhaps it was not clear that "use self repairing coding similarly to Parchive" does not mean the same as "use Parchive". Applying self repairing coding to RAID might be easier to imagine considering a RAID volume as a directory of stripe-sized files. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-21 9:51 ` Peter Grandi @ 2012-04-21 11:18 ` Piergiorgio Sartor 2012-04-22 3:14 ` Alex 0 siblings, 1 reply; 37+ messages in thread From: Piergiorgio Sartor @ 2012-04-21 11:18 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux RAID Hi Peter, > Perhaps it was not clear that "use self repairing coding > similarly to Parchive" does not mean the same as "use Parchive". sorry, I understood you to mean storing files with parities somewhere. > Applying self repairing coding to RAID might be easier to > imagine considering of a RAID volume as a directory of > stripe-sized files. Exactly, but as far as I know this is the same as any N-parity RAID. I mean, the mathematics and algorithms behind should be exactly the same... bye, -- piergiorgio ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-21 11:18 ` Piergiorgio Sartor @ 2012-04-22 3:14 ` Alex 2012-04-22 8:57 ` Piergiorgio Sartor 0 siblings, 1 reply; 37+ messages in thread From: Alex @ 2012-04-22 3:14 UTC (permalink / raw) To: Piergiorgio Sartor; +Cc: linux-raid Hi Piergiorgio, Just being curious, any mathematical improvement in Reed-Solomon's code like the RS(96,64) you mentioned? The calculation of RS involves arbitrary multiplication in a Galois field and table look-up has to be used. Has that already changed? Cheers, Alex On Sat, Apr 21, 2012 at 7:18 PM, Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote: > Hi Peter, > >> Perhaps it was not clear that "use self repairing coding >> similarly to Parchive" does not mean the same as "use Parchive". > > sorry, I understood you meant storing somewhere > files with parities. > >> Applying self repairing coding to RAID might be easier to >> imagine considering of a RAID volume as a directory of >> stripe-sized files. > > Exactly, but as far as I know this is the same as > any N parities RAID. > > I mean, the mathematics and algorithm behind should > be exactly the same... > > bye, > > -- > > piergiorgio > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: Is this enough for us to have triple-parity RAID? 2012-04-22 3:14 ` Alex @ 2012-04-22 8:57 ` Piergiorgio Sartor 0 siblings, 0 replies; 37+ messages in thread From: Piergiorgio Sartor @ 2012-04-22 8:57 UTC (permalink / raw) To: Alex; +Cc: Piergiorgio Sartor, linux-raid Hi Alex, > Just being curious, any mathematical improvement in Reed-Solomon's > code like the > RS(96,64) you mentioned? The calculation of RS involves arbitrary > multiplication in a > Galois field and table look-up has to be used. Has that already changed? AFAIK (details are not public) it uses the standard RS with LUT operations. Considering that this is for network disks, I suppose there is no performance problem. The RS(255,223) or RS(160,128) are very old encoding schemes, we talk about the Voyager space probe here. What I know about improvements is the decoding using the Berlekamp–Massey algorithm. This is quite hard, since it also finds *where* the errors are, but it seems there is some work out there on how to implement it very efficiently. My personal opinion is that the first thing to do is to have a working system, later it could be possible to improve performance by different optimizations. For example LUT multiplication might or might not be the best way, but at first it is the easy way to do it. bye, -- piergiorgio -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 37+ messages in thread
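The LUT multiplication mentioned here is the classic log/exp-table technique. A minimal sketch (the standard method, not the actual RS(96,64) implementation):

```python
POLY = 0x11D  # RAID-6 field polynomial, generator {02}

# Build exp/log tables. exp is doubled in length so gf_mul_lut can skip
# the "mod 255" reduction on the summed logarithms.
exp_t = [0] * 510
log_t = [0] * 256
x = 1
for i in range(255):
    exp_t[i] = x
    log_t[x] = i
    x <<= 1
    if x & 0x100:
        x ^= POLY
for i in range(255, 510):
    exp_t[i] = exp_t[i - 255]

def gf_mul_lut(a, b):
    """Multiply in GF(2^8) with two table look-ups and one integer add."""
    if a == 0 or b == 0:
        return 0
    return exp_t[log_t[a] + log_t[b]]

print(gf_mul_lut(0x02, 0x02))        # 4
print(hex(gf_mul_lut(0x80, 0x02)))   # 0x1d: 2^8 reduced by the polynomial
```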
* Re: Is this enough for us to have triple-parity RAID?
  2012-04-17  6:11 Is this enough for us to have triple-parity RAID? Alex
  2012-04-17  7:58 ` David Brown
@ 2012-04-20  7:45 ` Stan Hoeppner
  2012-04-23 15:26 ` Alex
  1 sibling, 1 reply; 37+ messages in thread
From: Stan Hoeppner @ 2012-04-20 7:45 UTC (permalink / raw)
To: Alex; +Cc: linux-raid

On 4/17/2012 1:11 AM, Alex wrote:
> Thanks to Billy Crook who pointed out this is the right place for my post.
>
> Adam Leventhal integrated triple-parity RAID into ZFS in 2009. The
> necessity of triple-parity RAID is described in detail in Adam
> Leventhal's article(http://cacm.acm.org/magazines/2010/1/55741-triple-parity-raid-and-beyond/fulltext).

No mention of SSD.

> al.(http://www.nature.com/ncomms/journal/v3/n2/full/ncomms1666.html)

Pay wall. No matter, as I'd already read of this research.

> established a revolutionary way of writing magnetic substrate using a
> heat pulse instead of a traditional magnetic field, which may increase
> data throughput on a hard disk by 1000 times in the future.

Your statement is massively misleading. The laser heating technology
doesn't independently increase throughput 1000x. It will allow for
increased throughput only by enabling greater areal density. Thus the
ratio of throughput to capacity stays the same, and drive rebuild
times will still increase dramatically.

> facilitate another triple-parity RAID algorithm

CPU performance is increasing at a faster rate than any other computer
technology. Thus, if you're going to even bother with introducing
another parity RAID level, and the binary will run on host CPU cores,
skip triple parity and go straight to quad parity, RAID-P4™. Most
savvy folks doing RAID6 are using a 6+2 or 8+2 configuration, as
wide-stripe parity arrays tend to be problematic. They then stripe
them to create a RAID60, or concatenate them if they're even more
savvy and use XFS.

The most common JBOD chassis on the market today seems to be the 24x
2.5" drive layout. This allows three 6+2 RAID6 arrays, losing 6 drives
to parity and leaving 18 drives of capacity. With RAID-P4™ a
wider-stripe array becomes more attractive for some applications. Thus
our 24-drive JBOD could yield a 20+4 RAID-P4™ with two drives more
capacity than the 6+2 RAID6 configuration. If one wished to stick with
narrower stripes, we'd get two 8+4 RAID-P4™ arrays and 16 drives total
capacity, 2 less than the triple RAID6 setup, and still 4 drives more
capacity than RAID10.

The really attractive option here for people who like parity RAID is
the 20+4 possibility. With a RAID-P4™ array that can withstand up to 4
drive failures, people will no longer be afraid of using wide stripes
for applications that typically benefit, where RAID50/60 would have
been employed previously. They also no longer have to worry about
secondary and/or tertiary drive failures during a rebuild.

Yeah, definitely go straight to RAID-P4™ and skip triple-parity RAID
altogether. You'll have to do it in 6-10 years anyway, so you may as
well avoid the extra work. And people could definitely benefit from
RAID-P4™ today.

-- 
Stan
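[Editor's note: Stan's drive counts can be checked with a few lines of arithmetic. A sketch — the layout names and the RAID10 accounting below are my own framing of his examples, not anything from the thread:]

```python
# Usable-capacity tally for the 24-slot JBOD layouts discussed above.
# Each layout is (data_drives, parity_drives, number_of_arrays).
layouts = {
    "3 x (6+2) RAID6":        (6, 2, 3),
    "1 x (20+4) quad-parity": (20, 4, 1),
    "2 x (8+4) quad-parity":  (8, 4, 2),
    "RAID10 (12 mirrors)":    (12, 12, 1),  # 12 data drives, 12 mirror copies
}

for name, (data, parity, arrays) in layouts.items():
    total = (data + parity) * arrays
    usable = data * arrays
    print(f"{name}: {total} drives, {usable} usable, "
          f"tolerates {parity} failure(s) per array")
```

This reproduces the numbers in the post: 18 usable for the triple RAID6 setup, 20 for the wide 20+4, 16 for two 8+4 arrays, and 12 for RAID10.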
* Re: Is this enough for us to have triple-parity RAID?
  2012-04-20  7:45 ` Stan Hoeppner
@ 2012-04-23 15:26 ` Alex
  2012-04-25  1:20 ` Stan Hoeppner
  0 siblings, 1 reply; 37+ messages in thread
From: Alex @ 2012-04-23 15:26 UTC (permalink / raw)
To: stan; +Cc: linux-raid

Hi Stan,

Sorry for not replying earlier. My background is not exactly physics,
so I talked to Thomas Ostler for a clarification. Here is what I did:
I first asked him how exactly his model could increase areal density,
and I am quoting his answer:

"thank you for your inquiry. The main reason for our statement that
areal density could be improved is down to a prediction made by the
model we used in this paper to predict this phenomena. The model
results show that the effect occurs with very small system sizes (on
the same time-scale), thus potentially the density could be improved.
However, the model is parameterised on experimental observations of
bulk system so it is not clear that the physics would be the same when
we decrease the size of the bits, this is something that we are
looking at studying in York using different methods over the next few
years. Obviously any increase in writing speed needs to have an
improvement in reading speed to see overall performance, one of the
many engineering problems to be overcome."

Then I asked him to clarify the following two things for me:

1. We also want to make sure what we understood about your model
improving writing speed is correct: writing speed could be increased
because writing a bit could be achieved on a timescale of picoseconds
rather than the current nanoseconds, and for this reason only.

2. In particular, an increase of areal density doesn't lead to an
increase of writing speed.

And I am quoting his answers again:

Answer for 1: "yes, indeed we showed that the process is complete
(apart from the longer time cooling) after a couple of pico seconds.
We also verified the time-scale experimentally by measuring what
happens as a function of time to the Fe and Gd sublattices (though
this experiment was done in an applied field), see Radu et al. Nature
472, 205-208 (2011)."

Answer for 2: "yes that is correct, the simulations show that the
time-scale for switching is not affected by the system size (within
reasonable limits). The features of this material that are physically
important for switching is something that we wish to study further to
allow us to gain more insight into the mechanism behind this
switching."

It looks to me like Ostler's team's work does independently lead to an
increase in areal density and writing speed, and there is no fixed
relation between the two. I am still evaluating what this means, but
my first feeling is that the whole thing is still not very mature and
there definitely will be other improvements in the future, so taking a
'wait and see' standpoint at this point may not necessarily be a bad
idea.

On Fri, Apr 20, 2012 at 3:45 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 4/17/2012 1:11 AM, Alex wrote:
>> Thanks to Billy Crook who pointed out this is the right place for my post.
>>
>> Adam Leventhal integrated triple-parity RAID into ZFS in 2009. The
>> necessity of triple-parity RAID is described in detail in Adam
>> Leventhal's article(http://cacm.acm.org/magazines/2010/1/55741-triple-parity-raid-and-beyond/fulltext).
>
> No mention of SSD.
>
>> al.(http://www.nature.com/ncomms/journal/v3/n2/full/ncomms1666.html)
>
> Pay wall. No matter, as I'd already read of this research.
>
>> established a revolutionary way of writing magnetic substrate using a
>> heat pulse instead of a traditional magnetic field, which may increase
>> data throughput on a hard disk by 1000 times in the future.
>
> Your statement is massively misleading. The laser heating technology
> doesn't independently increase throughput 1000x. It will allow for
> increased throughput only via enabling greater areal density. Thus the
> ratio of throughput to capacity stays the same. Thus drive rebuild
> times will still increase dramatically.

Sorry

>> facilitate another triple-parity RAID algorithm
>
> CPU performance is increasing at a faster rate than any computer
> technology. Thus, if you're going to even bother with introducing
> another parity RAID level, and the binary will run on host CPU cores,
> skip triple parity and go straight to quad parity, RAID-P4™. Most savvy
> folks doing RAID6 are using a 6+2 or 8+2 configuration as wide stripe
> parity arrays tend to be problematic. They then stripe them to create a
> RAID60, or concatenate them if they're even more savvy and use XFS.
>
> The most common JBOD chassis on the market today seems to be the 24x
> 2.5" drive layout. This allows three 6+2 RAID6 arrays, losing 6 drives
> to parity leaving 18 drives of capacity. With RAID-P4™ a wider stripe
> array becomes more attractive for some applications. Thus our 24 drive
> JBOD could yield a 20+4 RAID-P4™ with two drives more capacity than the
> 6+2 RAID6 configuration. If one wished to stick with narrower stripes,
> we'd get two 8+4 RAID-P4™ arrays and 16 drives total capacity, 2 less
> than the triple RAID6 setup, and still 4 drives more capacity than RAID10.
>
> The really attractive option here for people who like parity RAID is the
> 20+4 possibility. With a RAID-P4™ array that can withstand up to 4
> drive failures, people will no longer be afraid of using wide stripes
> for applications that typically benefit, where RAID50/60 would have been
> employed previously. They also no longer have to worry about secondary
> and/or tertiary drive failures during a rebuild.
>
> Yeah, definitely go straight to RAID-P4™ and skip triple parity RAID
> altogether. You'll have to do it in 6-10 years anyway so may as well
> prevent the extra work. And people could definitely benefit from
> RAID-P4™ today.
>
> --
> Stan
* Re: Is this enough for us to have triple-parity RAID?
  2012-04-23 15:26 ` Alex
@ 2012-04-25  1:20 ` Stan Hoeppner
  2012-04-25  2:45 ` Alex
  0 siblings, 1 reply; 37+ messages in thread
From: Stan Hoeppner @ 2012-04-25 1:20 UTC (permalink / raw)
To: Alex; +Cc: linux-raid

On 4/23/2012 10:26 AM, Alex wrote:
> It looks like to me Ostler's team's work does independently lead to an
> increase in areal density and writing speed, and
> there is not fixed relation between these two.

That's because you've not taken 60 seconds to actually think about
this logically.

Data rate is a product of [density * RPM]. To increase data rate we
must increase one of these two. Making the R/W head simply punch holes
faster won't increase data rate. It will simply allow us to increase
those two factors without the R/W head becoming a bottleneck.

Designing the R/W head to punch bits 1000x faster cannot independently
increase throughput. The platter must either have more bits per inch
or it must spin faster. Period.

Thus, the only way this laser R/W head will increase throughput is if
it simultaneously writes more bits per inch, which increases density.
So again, unless it increases density, the faster punching speed
doesn't increase throughput.

-- 
Stan
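[Editor's note: Stan's [density * RPM] identity can be put into a back-of-envelope formula. A sketch; every number below is an illustrative assumption, not a spec for any real drive:]

```python
# Back-of-envelope version of "data rate ~ linear density * rotation
# speed" for a single track: bits/inch * inches/revolution * revs/second.

import math

def sustained_rate_MBps(kbpi, track_diameter_in, rpm):
    """Sequential transfer rate at one track, in MB/s."""
    bits_per_rev = kbpi * 1000 * math.pi * track_diameter_in
    revs_per_sec = rpm / 60.0
    return bits_per_rev * revs_per_sec / 8 / 1e6  # bits -> MB

# Doubling linear density doubles the rate; a faster write head alone
# changes neither factor, which is exactly the point argued above.
base  = sustained_rate_MBps(kbpi=1500, track_diameter_in=3.0, rpm=7200)
dense = sustained_rate_MBps(kbpi=3000, track_diameter_in=3.0, rpm=7200)
```

Under these assumed numbers the model gives roughly 200 MB/s at the base density, and exactly twice that when the linear density doubles.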
* Re: Is this enough for us to have triple-parity RAID?
  2012-04-25  1:20 ` Stan Hoeppner
@ 2012-04-25  2:45 ` Alex
  2012-04-25 16:59 ` Emmanuel Noobadmin
  0 siblings, 1 reply; 37+ messages in thread
From: Alex @ 2012-04-25 2:45 UTC (permalink / raw)
To: stan; +Cc: linux-raid

I can see that you are trying very hard to fit a new picture into an
old frame. But with new technology there are always new possibilities.
For example, what I am thinking is: with the new laser write head, the
head doesn't necessarily have to stay very close to the platter, since
a laser doesn't fade away over a longer distance. That may enable a
design in which the head turns (the platter doesn't have to be round)
instead of spinning the platter; hence the "Data rate is a product of
[density * RPM]" statement falls apart. What is more, if the cost is
justified, one may even design a disk with multiple write heads to
increase the bandwidth, since now we are free from the fluid-mechanical
interaction between the head and the platter. Both of these may lead
to an independent increase of data rate. I am having these thoughts
because it has been an endless effort trying to remove the mechanical
parts from a HDD (SSD was an attempt). With Ostler's new technology,
this once again becomes possible. It is wise to think outside the box
at this time.

Cheers,
Alex

On Wed, Apr 25, 2012 at 9:20 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 4/23/2012 10:26 AM, Alex wrote:
>
>> It looks like to me Ostler's team's work does independently lead to an
>> increase in areal density and writing speed, and
>> there is not fixed relation between these two.
>
> That's because you've not taken 60 seconds to actually think about this
> logically.
>
> Data rate is a product of [density * RPM]. To increase data rate we
> must increase one of these two. Making the R/W head simply punch holes
> faster won't increase data rate. It will simply allow us to increase
> those two factors without the R/W head becoming a bottleneck.
>
> Designing the R/W head to punch bits 1000x faster cannot independently
> increase throughput. The platter must either have more bits per inch
> or it must spin faster. Period.
>
> Thus, the only way this laser R/W head will increase throughput is if
> it simultaneously writes more bits per inch, which increases density.
> So again, unless it increases density, the faster punching speed
> doesn't increase throughput.
>
> --
> Stan
* Re: Is this enough for us to have triple-parity RAID?
  2012-04-25  2:45 ` Alex
@ 2012-04-25 16:59 ` Emmanuel Noobadmin
  2012-04-25 19:29 ` David Brown
  ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Emmanuel Noobadmin @ 2012-04-25 16:59 UTC (permalink / raw)
To: Alex; +Cc: stan, linux-raid

On 4/25/12, Alex <creamyfish@gmail.com> wrote:
> I can see that you are trying very hard to fit a new picture into an old
> frame. But with new technology there is always new possibilities. For
> example, what I am thinking is: with the new laser write head, it doesn't
> necessarily require the head to stay very close to the platter since laser
> doesn't fade away with longer distance, which may enable a design that

The strength of the laser will fall as the distance to the media
increases, wouldn't it?

> the head turns(the platter doesn't have to be round) instead of spinning
> the platter; hence the "Data rate is a product of [density * RPM]"
> statement falls apart. What is more, if the cost is justified, one may
> even design a disk with multiple write heads to increase the bandwidth
> since now we are free from the fluid mechanic interaction between the
> head and the platter. Both of these may lead to independent increase
> of data rate. I am having these thoughts because it's been an endless

But Ostler's technique only increases the write speed, and he admits
in a Wired.com article that there is still no faster/other way to read
the data. So we would still be stuck with the rotating disk and the
magnetic reading heads until some other breakthrough happens.

Going back into the context of parity RAID rebuild, we would still be
bottlenecked on the write side, wouldn't we, since bits on the
replacement drive can only be calculated after reading the rest of the
drives.
* Re: Is this enough for us to have triple-parity RAID?
  2012-04-25 16:59 ` Emmanuel Noobadmin
@ 2012-04-25 19:29 ` David Brown
  2012-04-26  2:30 ` Alex
  2012-04-26  4:24 ` Alex
  2 siblings, 0 replies; 37+ messages in thread
From: David Brown @ 2012-04-25 19:29 UTC (permalink / raw)
To: Emmanuel Noobadmin; +Cc: Alex, stan, linux-raid

On 25/04/12 18:59, Emmanuel Noobadmin wrote:
> On 4/25/12, Alex<creamyfish@gmail.com> wrote:
>> I can see that you are trying very hard to fit a new picture into an old
>> frame. But with new technology there is always new possibilities. For
>> example, what I am thinking is: with the new laser write head, it doesn't
>> necessarily require the head to stay very close to the platter since laser
>> doesn't fade away with longer distance, which may enable a design that
>
> The strength of the laser will fall as the distance to the media
> increases, wouldn't it?

The laser's strength will only fade very slightly. It is not passing
through enough air to significantly attenuate, and one of the points
of using a laser is that you can focus it into a very tightly linear
beam. So you are right in theory, but wrong in practice.
* Re: Is this enough for us to have triple-parity RAID?
  2012-04-25 16:59 ` Emmanuel Noobadmin
  2012-04-25 19:29 ` David Brown
@ 2012-04-26  2:30 ` Alex
  2012-04-27 15:15 ` Emmanuel Noobadmin
  2012-04-26  4:24 ` Alex
  2 siblings, 1 reply; 37+ messages in thread
From: Alex @ 2012-04-26 2:30 UTC (permalink / raw)
To: Emmanuel Noobadmin; +Cc: linux-raid

On Thu, Apr 26, 2012 at 12:59 AM, Emmanuel Noobadmin
<centos.admin@gmail.com> wrote:
> On 4/25/12, Alex <creamyfish@gmail.com> wrote:
>> I can see that you are trying very hard to fit a new picture into an old
>> frame. But with new technology there is always new possibilities. For
>> example, what I am thinking is: with the new laser write head, it doesn't
>> necessarily require the head to stay very close to the platter since laser
>> doesn't fade away with longer distance, which may enable a design that
>
> The strength of the laser will fall as the distance to the media
> increases, wouldn't it?

Thanks to David for answering this for me.

>> the head turns(the platter doesn't have to be round) instead of spinning
>> the platter; hence the "Data rate is a product of [density * RPM]"
>> statement falls apart. What is more, if the cost is justified, one may
>> even design a disk with multiple write heads to increase the bandwidth
>> since now we are free from the fluid mechanic interaction between the
>> head and the platter. Both of these may lead to independent increase
>> of data rate. I am having these thoughts because it's been an endless
>
> But Ostler's technique only increases the write speed and admits in a
> Wired.com article there is still no faster/other way to read the data.
> So we would still be stuck with the rotating disk and the magnetic
> reading heads until some other breakthrough happens.

Ostler admits that in his response to my e-mails too.

> Going back into the context of parity RAID rebuild, we would still be
> bottlenecked on the write side wouldn't we since bits on the
> replacement drive can only be calculated after reading the rest of the
> drives.

Yes, if we can't make reading as fast as writing. But in general
having a disk with write speed faster than read speed might still be
an interesting choice for applications with a lot of disk writes.

Cheers,
Alex
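[Editor's note: the rebuild bottleneck Emmanuel describes can be seen in a toy single-parity (RAID5-style) reconstruction; triple parity adds more syndromes, but the read-every-survivor pattern is the same. A sketch, not code from the thread:]

```python
# Why a parity rebuild is read-bound: each chunk written to the
# replacement drive is the XOR of the corresponding chunks read from
# every surviving drive (single-parity case for brevity).

from functools import reduce

def rebuild_stripe(surviving_chunks):
    """Recompute one chunk of the failed drive from all the others."""
    return bytes(reduce(lambda a, b: a ^ b, col)
                 for col in zip(*surviving_chunks))

# Toy stripe: three data chunks plus their XOR parity.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\x0e"
parity = rebuild_stripe([d0, d1, d2])

# If d1 fails, reading d0, d2, and parity reconstructs it: three reads
# for every one write, no matter how fast the write head is.
recovered = rebuild_stripe([d0, d2, parity])
```

For an n-drive array the ratio is (n-1) reads per write, so even an infinitely fast write head leaves the rebuild limited by read throughput.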
* Re: Is this enough for us to have triple-parity RAID?
  2012-04-26  2:30 ` Alex
@ 2012-04-27 15:15 ` Emmanuel Noobadmin
  2012-05-01 16:38 ` Alex
  0 siblings, 1 reply; 37+ messages in thread
From: Emmanuel Noobadmin @ 2012-04-27 15:15 UTC (permalink / raw)
To: Alex; +Cc: linux-raid

On 4/26/12, Alex <creamyfish@gmail.com> wrote:
> Yes if we can't make reading as fast as writing. But in general having a
> disk with write speed faster than read speed might still be an interesting
> choice for applications with a lot of disk write.

Pardon me if I might be missing something here. The proposition here
is that with the laser writing technique, we can increase the data
rate by having new-type write heads which swivel, instead of the
traditional moving head over a rotating platter, isn't it?

But since we cannot do the same for reading, we are stuck with the
traditional head/platter mechanism. How would the write speed increase
independently of the RPM x density general case? It seems to me that
we cannot expect a breakthrough in write/read without a fundamental
improvement in the read function.

It's like we have two people in a car, one collecting payment while
one gives out goods. Now we increase the speed of the guy giving out
goods (writing), but the total number of units moved will still be the
same because the car can't go any faster than the guy collecting
payment (reading). The only difference is that the write head has more
idle time between bits?

Unless we are suggesting a hybrid system which has an independent
swiveling write head and a traditional read head? But that sounds
overly complicated and unlikely to work.
* Re: Is this enough for us to have triple-parity RAID?
  2012-04-27 15:15 ` Emmanuel Noobadmin
@ 2012-05-01 16:38 ` Alex
  0 siblings, 0 replies; 37+ messages in thread
From: Alex @ 2012-05-01 16:38 UTC (permalink / raw)
To: Emmanuel Noobadmin; +Cc: linux-raid

Hi Emmanuel,

The real reason for a HDD's performance bottleneck is the mechanical
movement inside it; RPM * density is just a reflection of that. It is
always more desirable to try to remove the mechanical parts than to
play with RPM * density when we try to optimize for better
performance. There are three types of mechanical movement and
interaction inside a HDD:

1. The spinning of the platter (the one we want to eliminate most)
2. The movement of the actuator arm to position the read/write head on
   the right track (relatively tame)
3. The fluid-dynamic interaction between the platter and the
   read/write head through a thin layer of fast-moving air

With Ostler's technology, 3 is no longer necessary for the write head,
and the platter is no longer tightly coupled with the write head as it
was. This might lead to the elimination of the spinning of the
platter; my ideas were just some premature attempts. If this were to
succeed, we would still need to develop a new way to read the data
from the now non-spinning storage medium. I haven't even got a
premature idea for it yet, but one observation is this: since we can
always use a cache to improve access response time, the new reading
method may not need to be 1000x faster than the original one to be
comparable to the new writing method. Finally, we would probably need
a new way to coordinate the read and write operations of the new disk,
since the new read head and write head are likely to be separate and
the current simple scheme will not work any more.

This doesn't sound like a plan we can work out via an interchange of
e-mail messages, so I really want to stop right here. I myself will
definitely follow Ostler's work and see how things go. So good luck!

Cheers,
Alex

On Fri, Apr 27, 2012 at 11:15 PM, Emmanuel Noobadmin
<centos.admin@gmail.com> wrote:
> On 4/26/12, Alex <creamyfish@gmail.com> wrote:
>
>> Yes if we can't make reading as fast as writing. But in general having a
>> disk with write speed faster than read speed might still be an interesting
>> choice for applications with a lot of disk write.
>
> Pardon me if I might be missing something here. The proposition here
> is that with the laser writing technique, we can increase the data
> rate by having new type write heads which swivel instead of the
> traditional moving head over rotating platter isn't it?
>
> But since we cannot do the same for reading, we are stuck with the
> traditional head/platter mechanism. How would the write speed increase
> independently of the RPM x density general case? It seems to me that
> we cannot expect a breakthrough in write/read without a fundamental
> improvement in the read function.
>
> It's like we have two person in a car, one collecting payment while
> one gives out goods. Now we increase the speed of the guy giving out
> goods (writing) but the total number units moved will still be the
> same because the car can't go any faster than the guy collecting
> payment (reading). Only difference is that the write head has more
> idle time between bits?
>
> Unless we are suggesting a hybrid system which has an independent
> swiveling write head and a traditional read head? But that sounds
> overly complicated and unlikely to work.
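[Editor's note: Alex's cache argument can be made concrete with a toy average-access-time formula; the hit rates and latencies below are illustrative assumptions only, not measurements from any device:]

```python
# Average access latency with a cache in front of a slow medium:
# effective = hit_rate * cache_latency + (1 - hit_rate) * medium_latency.

def effective_latency_us(hit_rate, cache_us, medium_us):
    """Expected read latency in microseconds given a cache hit rate."""
    return hit_rate * cache_us + (1.0 - hit_rate) * medium_us

# With a 99% hit rate, speeding the raw medium up 1000x improves the
# effective latency far less than 1000x, which is the point made above:
slow_medium = effective_latency_us(0.99, 1.0, 10000.0)  # 100.99 us
fast_medium = effective_latency_us(0.99, 1.0, 10.0)     # 1.09 us
```

The miss path dominates, so once a cache is in place the reading method "may not need to be 1000x faster" for the system to feel balanced against the faster write path.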
* Re: Is this enough for us to have triple-parity RAID?
  2012-04-25 16:59 ` Emmanuel Noobadmin
  2012-04-25 19:29 ` David Brown
  2012-04-26  2:30 ` Alex
@ 2012-04-26  4:24 ` Alex
  2 siblings, 0 replies; 37+ messages in thread
From: Alex @ 2012-04-26 4:24 UTC (permalink / raw)
To: Emmanuel Noobadmin; +Cc: linux-raid

On Thu, Apr 26, 2012 at 12:59 AM, Emmanuel Noobadmin
<centos.admin@gmail.com> wrote:
> On 4/25/12, Alex <creamyfish@gmail.com> wrote:
>> I can see that you are trying very hard to fit a new picture into an old
>> frame. But with new technology there is always new possibilities. For
>> example, what I am thinking is: with the new laser write head, it doesn't
>> necessarily require the head to stay very close to the platter since laser
>> doesn't fade away with longer distance, which may enable a design that
>
> The strength of the laser will fall as the distance to the media
> increases, wouldn't it?
>
>> the head turns(the platter doesn't have to be round) instead of spinning
>> the platter; hence the "Data rate is a product of [density * RPM]"
>> statement falls apart. What is more, if the cost is justified, one may
>> even design a disk with multiple write heads to increase the bandwidth
>> since now we are free from the fluid mechanic interaction between the
>> head and the platter. Both of these may lead to independent increase
>> of data rate. I am having these thoughts because it's been an endless
>
> But Ostler's technique only increases the write speed and admits in a
> Wired.com article there is still no faster/other way to read the data.
> So we would still be stuck with the rotating disk and the magnetic
> reading heads until some other breakthrough happens.
>
> Going back into the context of parity RAID rebuild, we would still be
> bottlenecked on the write side wouldn't we since bits on the
> replacement drive can only be calculated after reading the rest of the
> drives.

I also want to add that the sluggishness of read operations can always
be alleviated by prefetching and masked by the speed of a cache.

Cheers,
Alex
* Is this enough for us to have triple-parity RAID?
@ 2012-04-16 12:55 Alex
  0 siblings, 0 replies; 37+ messages in thread
From: Alex @ 2012-04-16 12:55 UTC (permalink / raw)
To: linux-fsdevel

Hey guys,

I have done some work on triple-parity RAID, and it is posted here:
http://www.linuxquestions.org/questions/linux-server-73/is-this-enough-for-us-to-have-triple-parity-raid-939991/

As a die-hard follower of Linux, I truly wish btrfs would have
triple-parity RAID integrated in the future. For those who are
interested, why don't you go take a look and give me some feedback.

creamyfish
* Is this enough for us to have triple-parity RAID?
@ 2012-04-16 10:04 Alex
  0 siblings, 0 replies; 37+ messages in thread
From: Alex @ 2012-04-16 10:04 UTC (permalink / raw)
To: linux-btrfs

Hey guys,

I have done some work on triple-parity RAID, and it is posted on
linuxquestions.org. Please go there and search for "Is this enough for
us to have triple-parity RAID?"

As a die-hard follower of Linux, I truly wish btrfs would have
triple-parity RAID integrated in the future. For those who are
interested, why don't you go take a look and give me some feedback.

creamyfish
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
end of thread, other threads:[~2012-05-01 16:38 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-17  6:11 Is this enough for us to have triple-parity RAID? Alex
2012-04-17  7:58 ` David Brown
2012-04-17 16:37 ` Stefan /*St0fF*/ Hübner
2012-04-18 14:15 ` Alex
2012-04-18 14:11 ` David Brown
2012-04-17 17:16 ` Piergiorgio Sartor
2012-04-17 20:18 ` David Brown
2012-04-17 20:54 ` Piergiorgio Sartor
2012-04-18 18:22 ` Piergiorgio Sartor
2012-04-18 20:20 ` David Brown
2012-04-18 20:39 ` Piergiorgio Sartor
2012-04-19 18:16 ` H. Peter Anvin
2012-04-20  2:27 ` Alex
2012-04-20  3:00 ` H. Peter Anvin
2012-04-20  3:32 ` Alex
2012-04-20 18:58 ` David Brown
2012-04-20 19:39 ` H. Peter Anvin
2012-04-20 21:04 ` Piergiorgio Sartor
2012-04-20 21:01 ` Piergiorgio Sartor
2012-04-20 21:29 ` Peter Grandi
2012-04-20 22:31 ` Piergiorgio Sartor
2012-04-21  9:51 ` Peter Grandi
2012-04-21 11:18 ` Piergiorgio Sartor
2012-04-22  3:14 ` Alex
2012-04-22  8:57 ` Piergiorgio Sartor
2012-04-20  7:45 ` Stan Hoeppner
2012-04-23 15:26 ` Alex
2012-04-25  1:20 ` Stan Hoeppner
2012-04-25  2:45 ` Alex
2012-04-25 16:59 ` Emmanuel Noobadmin
2012-04-25 19:29 ` David Brown
2012-04-26  2:30 ` Alex
2012-04-27 15:15 ` Emmanuel Noobadmin
2012-05-01 16:38 ` Alex
2012-04-26  4:24 ` Alex
-- strict thread matches above, loose matches on Subject: below --
2012-04-16 12:55 Alex
2012-04-16 10:04 Alex