* Implementing Global Parity Codes @ 2018-01-27 5:47 mostafa kishani 2018-01-27 8:37 ` Wols Lists 2018-02-02 5:24 ` NeilBrown 0 siblings, 2 replies; 15+ messages in thread From: mostafa kishani @ 2018-01-27 5:47 UTC (permalink / raw) To: linux-raid Dear All, I am going to make some modifications to the RAID protocol to make it more reliable for my use case (for a scientific, and perhaps later an industrial, purpose). For example, I'm going to hold a Global Parity (a parity taken across the whole data stripe rather than a single row) alongside the normal row-wise parities, to cope with an extra sector/page failure per stripe. Do you have any suggestions on how I can implement this with moderate effort (i.e. which functions should be modified)? Has anyone here attempted anything similar? I would also appreciate guidance on how to enable DEBUG mode in mdadm. Bests, Mostafa ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Implementing Global Parity Codes 2018-01-27 5:47 Implementing Global Parity Codes mostafa kishani @ 2018-01-27 8:37 ` Wols Lists 2018-01-27 14:29 ` mostafa kishani 2018-02-02 5:24 ` NeilBrown 1 sibling, 1 reply; 15+ messages in thread From: Wols Lists @ 2018-01-27 8:37 UTC (permalink / raw) To: mostafa kishani, linux-raid On 27/01/18 05:47, mostafa kishani wrote: > Dear All, > > I am going to make some modifications to RAID protocol to make it more > reliable for my case (for a scientific, and maybe later, industrial > purpose). For example, I'm going to hold a Global Parity (a parity > taken across the whole data stripe rather than a row) Except that what do you mean by "row"? Aren't you using it as just another word for "stripe"? alongside normal > row-wise parities, to cope with an extra sector/page failure per > stripe. Do you have any suggestion how can I implement this with a > moderate effort (I mean what functions should be modified)? have any > of you had any similar effort? If I understand you correctly, that's easy. Raid-5 has one parity block per stripe, enabling it to recover from one lost disk. Raid-6 has two parity blocks per stripe, enabling it to recover from two lost disks, or one random corrupted block. Nobody's tried to do it, but it's a simple extension of the current setup ... why don't you implement what I call "raid-6+", where you can have as many parity disks as you like - in your case three. You'd need to take the current raid-6 code and extend it - ignore raid-5 because while the principle is the same, the detail is much simpler and cannot be extended. > I also appreciate if you guide me how can I enable DEBUG mode in mdadm. > Can't help there, I'm afraid. Cheers, Wol ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Implementing Global Parity Codes 2018-01-27 8:37 ` Wols Lists @ 2018-01-27 14:29 ` mostafa kishani 2018-01-27 15:13 ` Wols Lists 0 siblings, 1 reply; 15+ messages in thread From: mostafa kishani @ 2018-01-27 14:29 UTC (permalink / raw) To: Wols Lists; +Cc: linux-raid Thanks for your response, Wol. Maybe I failed to illustrate what I'm going to implement; let me clarify using your terminology. In the normal RAID5 and RAID6 codes we have one/two parities per stripe. Now consider sharing a redundant sector between, say, 4 stripes, and assume that the redundant sector is saved in stripe4. Assume the redundant sector is the parity of all sectors in stripe1, stripe2, stripe3, and stripe4. Using this redundant sector you can tolerate one sector failure across stripe1 to stripe4. We already have the parity sectors of RAID5 and RAID6, and this redundant sector is added to tolerate an extra sector failure. I call this redundant sector the "Global Parity". I'll demonstrate this as follows, assuming each RAID5 stripe has 3 data sectors and one parity sector.
stripe1: DATA1 | DATA2 | DATA3 | PARITY1
stripe2: PARITY2 | DATA4 | DATA5 | DATA6
stripe3: DATA7 | PARITY3 | DATA8 | DATA9
stripe4: DATA10 | DATA11 | PARITY4 | GLOBAL PARITY
and the Global Parity is taken across all data and parity as follows:
GLOBAL PARITY = DATA1 X DATA2 X DATA3 X DATA4 X DATA5 X DATA6 X DATA7 X DATA8 X DATA9 X DATA10 X DATA11 X PARITY1 X PARITY2 X PARITY3
where "X" stands for the XOR operation. I hope that is clear. Bests, Mostafa On Sat, Jan 27, 2018 at 12:07 PM, Wols Lists <antlists@youngman.org.uk> wrote: > On 27/01/18 05:47, mostafa kishani wrote: >> Dear All, >> >> I am going to make some modifications to RAID protocol to make it more >> reliable for my case (for a scientific, and maybe later, industrial >> purpose). For example, I'm going to hold a Global Parity (a parity >> taken across the whole data stripe rather than a row) > > Except that what do you mean by "row"? 
Aren't you using it as just > another word for "stripe"? > > alongside normal >> row-wise parities, to cope with an extra sector/page failure per >> stripe. Do you have any suggestion how can I implement this with a >> moderate effort (I mean what functions should be modified)? have any >> of you had any similar effort? > > If I understand you correctly, that's easy. Raid-5 has one parity block > per stripe, enabling it to recover from one lost disk. Raid-6 has two > parity stripes, enabling it to recover from two lost disks, or one > random corrupted block. > > Nobody's tried to do it, but it's a simple extension of the current > setup ... why don't you implement what I call "raid-6+", where you can > have as many parity disks as you like - in your case three. You'd need > to take the current raid-6 code and extend it - ignore raid-5 because > while the principle is the same, the detail is much simpler and cannot > be extended. > >> I also appreciate if you guide me how can I enable DEBUG mode in mdadm. >> > Can't help there, I'm afraid. > > Cheers, > Wol > ^ permalink raw reply [flat|nested] 15+ messages in thread
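Mostafa's construction and its single-sector recovery can be sketched in a few lines. This is a toy model only: one byte stands in for each sector, the names follow the diagram in his message, and nothing here reflects md's actual on-disk code.

```python
# Toy model of the layout just described: one byte stands in for each
# sector, and XOR generalises bytewise to real sectors.  All values
# are arbitrary; this is illustrative, not md code.
from functools import reduce

def xor(*vals):
    return reduce(lambda a, b: a ^ b, vals)

data = list(range(1, 12))            # DATA1..DATA11
p1 = xor(data[0], data[1], data[2])  # PARITY1 covers stripe1's data
p2 = xor(data[3], data[4], data[5])  # PARITY2 covers stripe2's data
p3 = xor(data[6], data[7], data[8])  # PARITY3 covers stripe3's data
p4 = xor(data[9], data[10])          # PARITY4 covers stripe4's data

# GLOBAL PARITY = XOR of all data sectors and PARITY1..PARITY3:
gp = xor(*data, p1, p2, p3)

# Rebuild a lost sector (say DATA5) from everything else via the GP
# equation (in plain RAID5, PARITY2 alone could also rebuild DATA5;
# this just exercises the global relation):
recovered = xor(*data[:4], *data[5:], p1, p2, p3, gp)
assert recovered == data[4]
```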
* Re: Implementing Global Parity Codes 2018-01-27 14:29 ` mostafa kishani @ 2018-01-27 15:13 ` Wols Lists 2018-01-28 13:00 ` mostafa kishani 2018-01-29 10:22 ` David Brown 0 siblings, 2 replies; 15+ messages in thread From: Wols Lists @ 2018-01-27 15:13 UTC (permalink / raw) To: mostafa kishani; +Cc: linux-raid On 27/01/18 14:29, mostafa kishani wrote: > Thanks for your response Wol > Well, maybe I failed to illustrate what I'm going to implement. I try > to better clarify using your terminology: > In the normal RAID5 and RAID6 codes we have one/two parities per > stripe. Now consider sharing a redundant sector between say, 4 > stripes, and assume that the redundant sector is saved in stripe4. > Assume the redundant sector is the parity of all sectors in stripe1, > stripe2, stripe3, and stripe4. Using this redundant sector you can > tolerate one sector failure across stripe1 to stripe4. We already have > the parity sectors of RAID5 and RAID6 and this redundant sector is > added to tolerate an extra sector failure. I call this redundant > sector "Global Parity". > I try to demonstrate this as follows, assuming each RAID5 stripe has 3 > data sectors and one parity sector. > stripe1: DATA1 | DATA2 | DATA3 | PARITY1 > stripe2: PARITY2 | DATA4 | DATA5 | DATA6 > stripe3: DATA7 | PARITY3 | DATA8 | DATA9 > stripe4: DATA10 | DATA11 | PARITY4 | GLOBAL PARITY > > and the Global Parity is taken across all data and parity as follows: > GLOBAL PARITY = DATA1 X DATA2 X DATA3 X DATA4 X DATA5 X DATA6 X DATA7 > X DATA8 X DATA9 X DATA10 X DATA11 X PARITY1 X PARITY2 X PARITY3 > > Where "X" stands for XOR operation. > I hope it was clear. OWWW!!!! Have you done and understood the maths!!!??? You may have noticed I said that while raid-6 was similar in principle to raid-5, it was very different in implementation. Because of the maths! 
Going back to high-school algebra, if we have E *unique* equations, and U unknowns, then we can only solve the equations if E > U (I think I've got that right, it might be >=). With raid-5, parity1 = data1 xor data2 xor data3. Now let's assume somebody thinks "let's add parity2" and defines parity2 = data1 xor data2 xor data3 xor parity1. THAT WON'T WORK. Raid-5 relies on the equation "if N is even, then parity2 is all ones, else parity2 is all zeroes", where N is the number of disks, so if we calculate parity2 we add absolutely nothing to our pre-existing E. If you are planning to use XOR, I think you are falling into *exactly* that trap! Plus, it looks to me as if calculating your global parity is going to be a disk-hammering nightmare ... That's why raid-6 uses a *completely* *different* algorithm to calculate its parity1 and parity2. I've updated a page on the wiki, because it's come up in other discussions as well, but it seems to me if you need extra parity, you really ought to be going for raid-60. Take a look ... https://raid.wiki.kernel.org/index.php/What_is_RAID_and_why_should_you_want_it%3F#Which_raid_is_for_me.3F and if anyone else wants to comment, too? ... Cheers, Wol ^ permalink raw reply [flat|nested] 15+ messages in thread
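Wol's suspicion can be tested directly on the 4-stripe XOR construction quoted above. Assuming each row parity is the plain XOR of its own row's data (as in standard RAID5), the proposed global parity turns out to be linearly dependent on the row parities:

```python
# Checking Wol's warning against the 4-stripe example upthread,
# assuming each row parity is the plain XOR of its own row's data.
# Names mirror the earlier diagram; toy byte values only.
import random
from functools import reduce

def xor(*vals):
    return reduce(lambda a, b: a ^ b, vals)

d = [random.randrange(256) for _ in range(11)]   # DATA1..DATA11
p1, p2, p3 = xor(*d[0:3]), xor(*d[3:6]), xor(*d[6:9])
p4 = xor(*d[9:11])
gp = xor(*d, p1, p2, p3)   # the proposed GLOBAL PARITY

# PARITY1^PARITY2^PARITY3 already equals DATA1^...^DATA9, so gp
# collapses to DATA10^DATA11, which is exactly PARITY4.  The "extra"
# equation is linearly dependent and adds nothing:
assert gp == p4
```

So, at least in this exact XOR form, the global parity tolerates no extra failure; a construction over a larger field is needed, as with RAID6's second parity.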
* Re: Implementing Global Parity Codes 2018-01-27 15:13 ` Wols Lists @ 2018-01-28 13:00 ` mostafa kishani 2018-01-29 10:22 ` David Brown 1 sibling, 0 replies; 15+ messages in thread From: mostafa kishani @ 2018-01-28 13:00 UTC (permalink / raw) To: Wols Lists; +Cc: linux-raid Wol, I agree that this code may well hammer the disk, but implementing the encoding/decoding part is not a big deal. The codes used in RAID5 and RAID6 (and also in the Global Parity) are in the category of Maximum Distance Separable (MDS) codes (the classic example is Reed-Solomon). If you are interested, take a look at this paper: https://arxiv.org/pdf/1205.0997 But my main challenge now is which parts of mdadm should be modified, and how I can enable debug mode in mdadm. I wonder if anyone can give me a clue about the debug stuff. Bests, Mostafa On Sat, Jan 27, 2018 at 6:43 PM, Wols Lists <antlists@youngman.org.uk> wrote: > On 27/01/18 14:29, mostafa kishani wrote: >> Thanks for your response Wol >> Well, maybe I failed to illustrate what I'm going to implement. I try >> to better clarify using your terminology: >> In the normal RAID5 and RAID6 codes we have one/two parities per >> stripe. Now consider sharing a redundant sector between say, 4 >> stripes, and assume that the redundant sector is saved in stripe4. >> Assume the redundant sector is the parity of all sectors in stripe1, >> stripe2, stripe3, and stripe4. Using this redundant sector you can >> tolerate one sector failure across stripe1 to stripe4. We already have >> the parity sectors of RAID5 and RAID6 and this redundant sector is >> added to tolerate an extra sector failure. I call this redundant >> sector "Global Parity". >> I try to demonstrate this as follows, assuming each RAID5 stripe has 3 >> data sectors and one parity sector. 
>> stripe1: DATA1 | DATA2 | DATA3 | PARITY1 >> stripe2: PARITY2 | DATA4 | DATA5 | DATA6 >> stripe3: DATA7 | PARITY3 | DATA8 | DATA9 >> stripe4: DATA10 | DATA11 | PARITY4 | GLOBAL PARITY >> >> and the Global Parity is taken across all data and parity as follows: >> GLOBAL PARITY = DATA1 X DATA2 X DATA3 X DATA4 X DATA5 X DATA6 X DATA7 >> X DATA8 X DATA9 X DATA10 X DATA11 X PARITY1 X PARITY2 X PARITY3 >> >> Where "X" stands for XOR operation. >> I hope it was clear. > > OWWW!!!! > > Have you done and understood the maths!!!??? > > You may have noticed I said that while raid-6 was similar in principle > to raid-5, it was very different in implementation. Because of the maths! > > Going back to high-school algebra, if we have E *unique* equations, and > U unknowns, then we can only solve the equations if E > U (I think I've > got that right, it might be >=). > > With raid-5, parity1 = data1 xor data2 xor data3. Now let's assume > somebody thinks "let's add parity2" and defines parity2 = data1 xor > data2 xor data3 xor parity1. THAT WON'T WORK. Raid-5 relies on the > equation "if N is even, then parity2 is all ones, else parity2 is all > zeroes", where N is the number of disks, so if we calculate parity2 we > add absolutely nothing to our pre-existing E. > > If you are planning to use XOR, I think you are falling into *exactly* > that trap! Plus, it looks to me as if calculating your global parity is > going to be a disk-hammering nightmare ... > > That's why raid-6 uses a *completely* *different* algorithm to calculate > its parity1 and parity2. > > I've updated a page on the wiki, because it's come up in other > discussions as well, but it seems to me if you need extra parity, you > really ought to be going for raid-60. Take a look ... > > https://raid.wiki.kernel.org/index.php/What_is_RAID_and_why_should_you_want_it%3F#Which_raid_is_for_me.3F > > and if anyone else wants to comment, too? ... > > Cheers, > Wol ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Implementing Global Parity Codes 2018-01-27 15:13 ` Wols Lists 2018-01-28 13:00 ` mostafa kishani @ 2018-01-29 10:22 ` David Brown 2018-01-29 17:44 ` Wols Lists 2018-01-30 11:30 ` mostafa kishani 1 sibling, 2 replies; 15+ messages in thread From: David Brown @ 2018-01-29 10:22 UTC (permalink / raw) To: Wols Lists, mostafa kishani; +Cc: linux-raid On 27/01/2018 16:13, Wols Lists wrote: > On 27/01/18 14:29, mostafa kishani wrote: >> Thanks for your response Wol >> Well, maybe I failed to illustrate what I'm going to implement. I try >> to better clarify using your terminology: >> In the normal RAID5 and RAID6 codes we have one/two parities per >> stripe. Now consider sharing a redundant sector between say, 4 >> stripes, and assume that the redundant sector is saved in stripe4. >> Assume the redundant sector is the parity of all sectors in stripe1, >> stripe2, stripe3, and stripe4. Using this redundant sector you can >> tolerate one sector failure across stripe1 to stripe4. We already have >> the parity sectors of RAID5 and RAID6 and this redundant sector is >> added to tolerate an extra sector failure. I call this redundant >> sector "Global Parity". >> I try to demonstrate this as follows, assuming each RAID5 stripe has 3 >> data sectors and one parity sector. >> stripe1: DATA1 | DATA2 | DATA3 | PARITY1 >> stripe2: PARITY2 | DATA4 | DATA5 | DATA6 >> stripe3: DATA7 | PARITY3 | DATA8 | DATA9 >> stripe4: DATA10 | DATA11 | PARITY4 | GLOBAL PARITY >> >> and the Global Parity is taken across all data and parity as follows: >> GLOBAL PARITY = DATA1 X DATA2 X DATA3 X DATA4 X DATA5 X DATA6 X DATA7 >> X DATA8 X DATA9 X DATA10 X DATA11 X PARITY1 X PARITY2 X PARITY3 >> >> Where "X" stands for XOR operation. >> I hope it was clear. > > OWWW!!!! > > Have you done and understood the maths!!!??? I have started looking at the paper (from the link in Mostafa's next post). 
I have only read a few pages as yet, but it looks to me to have some fundamental misunderstandings about SSDs, how they work, and their typical failures, and to be massively mixing up the low-level structures visible inside the SSD firmware and the high-level view available to the kernel and the md layer. At best, this "PMDS" idea with blocks might be an alternative or addition to ECC layers within the SSD - but not at the md layer. I have not read the whole paper yet, so I could be missing something - but I am sceptical. > > You may have noticed I said that while raid-6 was similar in principle > to raid-5, it was very different in implementation. Because of the maths! Yes, indeed. The maths of raid 6 is a lot of fun, and very smart. > > Going back to high-school algebra, if we have E *unique* equations, and > U unknowns, then we can only solve the equations if E > U (I think I've > got that right, it might be >=). E >= U. You can solve "2 * x - 4 = 0", which is one equation in one unknown. But critically, the E equations need to be linearly independent (that is probably what you mean by "unique"). > > With raid-5, parity1 = data1 xor data2 xor data3. Now let's assume > somebody thinks "let's add parity2" and defines parity2 = data1 xor > data2 xor data3 xor parity1. THAT WON'T WORK. Correct - this is because the two equations are not linearly independent. The second parity would always be 0 in this case. > Raid-5 relies on the > equation "if N is even, then parity2 is all ones, else parity2 is all > zeroes", where N is the number of disks, so if we calculate parity2 we > add absolutely nothing to our pre-existing E. > > If you are planning to use XOR, I think you are falling into *exactly* > that trap! Plus, it looks to me as if calculating your global parity is > going to be a disk-hammering nightmare ... Yes. > > That's why raid-6 uses a *completely* *different* algorithm to calculate > its parity1 and parity2. 
It is not actually a completely different algorithm, if you view it in the correct way. You can say the raid5 parity P is just the xor of the bits, while the raid6 parity Q is a polynomial over the GF(2^8) field - certainly they look completely different then. But once you move to the GF(2^8) field, the equations become: P = d_0 + d_1 + d_2 + d_3 + d_4 + ... Q = d_0 + 2 . d_1 + 2^2 . d_2 + 2^3 . d_3 + 2^4 . d_4 + ... (Note that none of this is "ordinary" maths - addition and multiplication is special in the GF field.) It is even possible to extend it to a third parity in a similar way: R = d_0 + 4 . d_1 + 4^2 . d_2 + 4^3 . d_3 + 4^4 . d_4 + ... There are other schemes that scale better beyond the third parity (this scheme can generate a fourth parity bit, but it is then only valid for up to 21 data disks). > > I've updated a page on the wiki, because it's come up in other > discussions as well, but it seems to me if you need extra parity, you > really ought to be going for raid-60. Take a look ... > > https://raid.wiki.kernel.org/index.php/What_is_RAID_and_why_should_you_want_it%3F#Which_raid_is_for_me.3F > > and if anyone else wants to comment, too? ... > Here are a few random comments: Raid-10-far2 can be /faster/ than Raid0 on the same number of HDs, for read-only performance. This is because the data for both stripes will be read from the first half of the disks - the outside half. On many disks this gives higher read speeds, since the same angular rotation speed has higher linear velocity at the disk heads. It also gives shorter seek times as the head does not have to move as far in or out to cover the whole range. For SSDs, the layout for Raid-10 makes almost no difference (but it is still faster than plain Raid-1 for streamed reads). For two drives, Raid-10 is a fine choice on read-heavy or streaming applications. I think you could emphasise that there is little point in having Raid-5 plus a spare - Raid-6 is better in every way. 
You should make a clearer distinction that by "Raid-6+0" you mean a Raid-0 stripe of Raid-6 sets, rather than a Raid-6 set of Raid-0 stripes. There are also many, many other ways to organise multi-layer raids. Striping at the high level (like Raid-6+0) makes sense only if you have massive streaming operations for single files, and massive bandwidth - it is poorer for operations involving a large number of parallel accesses. A common arrangement for big arrays is a linear concatenation of Raid-1 pairs (or Raid-5 or Raid-6 sets) - combined with an appropriate file system (XFS comes out well here) you get massive scalability and very high parallel access speeds. Other things to consider on big arrays are redundancy of controllers, or even servers (for SAN arrays). Consider the pros and cons of spreading your redundancy across blocks. For example, if your server has two controllers then you might want your low-level block to be Raid-1 pairs with one disk on each controller. That could give you a better spread of bandwidths and give you resistance to a broken controller. You could also talk about asymmetric raid setups, such as having a write-only redundant copy on a second server over a network, or as a cheap hard disk copy of your fast SSDs. And you could also discuss strategies for disk replacement - after failures, or for growing the array. It is also worth emphasising that RAID is /not/ a backup solution - that cannot be said often enough! Discuss failure recovery - how to find and remove bad disks, how to deal with recovering disks from a different machine after the first one has died, etc. Emphasise the importance of labelling disks in your machines and being sure you pull the right disk! > Cheers, > Wol ^ permalink raw reply [flat|nested] 15+ messages in thread
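David's P, Q and R equations can be made concrete. The sketch below works over GF(2^8) with the 0x11d reduction polynomial that the Linux raid6 code uses; the function names are illustrative, a real implementation would use lookup tables, and this is not md's actual code:

```python
# Sketch of the P/Q/R syndromes over GF(2^8), using the 0x11d
# reduction polynomial.  Bitwise loops for clarity; real code uses
# precomputed log/exp tables.

def gf_mul(a, b):
    # Multiply two GF(2^8) elements, reducing by x^8+x^4+x^3+x^2+1
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def syndromes(data):
    # P is the plain RAID5 xor parity; Q and R weight disk i by the
    # generators 2^i and 4^i, as in the equations above.
    P = Q = R = 0
    for i, d in enumerate(data):
        P ^= d
        Q ^= gf_mul(gf_pow(2, i), d)
        R ^= gf_mul(gf_pow(4, i), d)
    return P, Q, R

P, Q, R = syndromes([0x12, 0x34, 0x56, 0x78])
assert P == 0x12 ^ 0x34 ^ 0x56 ^ 0x78   # P is just xor of the data
```

Because addition in GF(2^8) is XOR, losing any single data disk can still be repaired from P alone; Q and R come into play for double and triple failures, where the differing generator weights make the equations linearly independent.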
* Re: Implementing Global Parity Codes 2018-01-29 10:22 ` David Brown @ 2018-01-29 17:44 ` Wols Lists 2018-01-30 11:47 ` David Brown 2018-01-30 14:18 ` Brad Campbell 2018-01-30 11:30 ` mostafa kishani 1 sibling, 2 replies; 15+ messages in thread From: Wols Lists @ 2018-01-29 17:44 UTC (permalink / raw) To: David Brown; +Cc: linux-raid On 29/01/18 10:22, David Brown wrote: >> I've updated a page on the wiki, because it's come up in other >> discussions as well, but it seems to me if you need extra parity, you >> really ought to be going for raid-60. Take a look ... >> >> https://raid.wiki.kernel.org/index.php/What_is_RAID_and_why_should_you_want_it%3F#Which_raid_is_for_me.3F >> >> >> and if anyone else wants to comment, too? ... >> > > Here are a few random comments: > > Raid-10-far2 can be /faster/ than Raid0 on the same number of HDs, for > read-only performance. This is because the data for both stripes will > be read from the first half of the disks - the outside half. On many > disks this gives higher read speeds, since the same angular rotation > speed has higher linear velocity at the disk heads. It also gives > shorter seek times as the head does not have to move as far in or out to > cover the whole range. For SSDs, the layout for Raid-10 makes almost no > difference (but it is still faster than plain Raid-1 for streamed reads). Except that most drives don't do that nowadays, they do "constant linear velocity" so the drive speeds up or slows down depending on where the heads are, I believe. > > For two drives, Raid-10 is a fine choice on read-heavy or streaming > applications. Which is just raid-1, no? > > I think you could emphasise that there is little point in having Raid-5 > plus a spare - Raid-6 is better in every way. Agreed. 
I don't agree raid-6 is better in *every* way - it wastes space - but yes once you have enough drives you should go raid-6 :-) > > You should make a clearer distinction that by "Raid-6+0" you mean a > Raid-0 stripe of Raid-6 sets, rather than a Raid-6 set of Raid-0 stripes. > Done. > There are also many, many other ways to organise multi-layer raids. > Striping at the high level (like Raid-6+0) makes sense only if you have > massive streaming operations for single files, and massive bandwidth - > it is poorer for operations involving a large number of parallel > accesses. A common arrangement for big arrays is a linear concatenation > of Raid-1 pairs (or Raid-5 or Raid-6 sets) - combined with an > appropriate file system (XFS comes out well here) you get massive > scalability and very high parallel access speeds. > > Other things to consider on big arrays are redundancy of controllers, or > even servers (for SAN arrays). Consider the pros and cons of spreading > your redundancy across blocks. For example, if your server has two > controllers then you might want your low-level block to be Raid-1 pairs > with one disk on each controller. That could give you a better spread > of bandwidths and give you resistance to a broken controller. > > You could also talk about asymmetric raid setups, such as having a > write-only redundant copy on a second server over a network, or as a > cheap hard disk copy of your fast SSDs. Snag is, I don't manage large arrays - it's a lot to think about. I might add that later. > > And you could also discuss strategies for disk replacement - after > failures, or for growing the array. > > It is also worth emphasising that RAID is /not/ a backup solution - that > cannot be said often enough! > > Discuss failure recovery - how to find and remove bad disks, how to deal > with recovering disks from a different machine after the first one has > died, etc. 
Emphasise the importance of labelling disks in your machines > and being sure you pull the right disk! > I think that's covered elsewhere :-) Cheers, Wol ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Implementing Global Parity Codes 2018-01-29 17:44 ` Wols Lists @ 2018-01-30 11:47 ` David Brown 2018-01-30 14:18 ` Brad Campbell 1 sibling, 0 replies; 15+ messages in thread From: David Brown @ 2018-01-30 11:47 UTC (permalink / raw) To: Wols Lists; +Cc: linux-raid On 29/01/2018 18:44, Wols Lists wrote: > On 29/01/18 10:22, David Brown wrote: >>> I've updated a page on the wiki, because it's come up in other >>> discussions as well, but it seems to me if you need extra parity, you >>> really ought to be going for raid-60. Take a look ... >>> >>> https://raid.wiki.kernel.org/index.php/What_is_RAID_and_why_should_you_want_it%3F#Which_raid_is_for_me.3F >>> >>> >>> and if anyone else wants to comment, too? ... >>> >> >> Here are a few random comments: >> >> Raid-10-far2 can be /faster/ than Raid0 on the same number of HDs, for >> read-only performance. This is because the data for both stripes will >> be read from the first half of the disks - the outside half. On many >> disks this gives higher read speeds, since the same angular rotation >> speed has higher linear velocity at the disk heads. It also gives >> shorter seek times as the head does not have to move as far in or out to >> cover the whole range. For SSDs, the layout for Raid-10 makes almost no >> difference (but it is still faster than plain Raid-1 for streamed reads). > > Except that most drives don't do that nowadays, they do "constant linear > velocity" so the drive speeds up or slows down depending on where the > heads are, I believe. Perhaps - I haven't tried to keep up with the specs of all drives, and it is surprisingly hard to get a good answer from a quick google. However, I would be surprised if CLV were the norm for hard disks. Certainly it was not the case before (though it was used for CD-ROMS and other optical media). 
The inner tracks of a hard disk are perhaps half the circumference of the outer tracks - to keep constant linear velocity, you need twice the rotational speed on the inside compared to the outside. That is a massive difference, taking several seconds (perhaps 10 seconds) to bring to a stable speed. I suspect you are mixing up velocity and density here. Earlier hard drives had constant angular density - the same number of sectors per track throughout the disk. Modern drives have constant linear density - so you get more sectors on an outer track than an inner track. >> >> For two drives, Raid-10 is a fine choice on read-heavy or streaming >> applications. > > Which is just raid-1, no? No. The "raid-10 near" is pretty much identical to raid-1, but the "far" and "offset" raid-10 layouts are different: <https://en.wikipedia.org/wiki/Non-standard_RAID_levels#LINUX-MD-RAID-10> Near layout minimises the head movement on writes. Far layout maximises streaming read performance, but has more latency during writes due to larger head movements. Offset layout gives raid0 read performance for small and mid-size reads, with only slightly more latency in writes. >> >> I think you could emphasise that there is little point in having Raid-5 >> plus a spare - Raid-6 is better in every way. > > Agreed. I don't agree raid-6 is better in *every* way - it wastes space > - but yes once you have enough drives you should go raid-6 :-) Raid-6 is better than raid-5 plus a spare - it uses exactly the same number of disks, and does not waste anything while providing a huge improvement in redundancy and therefore data safety. Okay, it is not better in /every/ way. It takes a bit of computing power, though that is rarely relevant as cpus have got more threads that are mostly idle. And it gives a bit more write amplification than raid-5. But if you think "my raid-5 array is so important that I want a hot spare to make rebuilds happen as soon as possible", then you /definitely/ want raid-6 instead. 
>> >> You should make a clearer distinction that by "Raid-6+0" you mean a >> Raid-0 stripe of Raid-6 sets, rather than a Raid-6 set of Raid-0 stripes. >> > Done. > >> There are also many, many other ways to organise multi-layer raids. >> Striping at the high level (like Raid-6+0) makes sense only if you have >> massive streaming operations for single files, and massive bandwidth - >> it is poorer for operations involving a large number of parallel >> accesses. A common arrangement for big arrays is a linear concatenation >> of Raid-1 pairs (or Raid-5 or Raid-6 sets) - combined with an >> appropriate file system (XFS comes out well here) you get massive >> scalability and very high parallel access speeds. >> >> Other things to consider on big arrays are redundancy of controllers, or >> even servers (for SAN arrays). Consider the pros and cons of spreading >> your redundancy across blocks. For example, if your server has two >> controllers then you might want your low-level block to be Raid-1 pairs >> with one disk on each controller. That could give you a better spread >> of bandwidths and give you resistance to a broken controller. >> >> You could also talk about asymmetric raid setups, such as having a >> write-only redundant copy on a second server over a network, or as a >> cheap hard disk copy of your fast SSDs. > > Snag is, I don't manage large arrays - it's a lot to think about. I > might add that later. Fair enough. You can't cover /everything/ on a wiki page - then it is a full time job and a book, not a wiki page! I am just giving suggestions and ideas. >> >> And you could also discuss strategies for disk replacement - after >> failures, or for growing the array. >> >> It is also worth emphasising that RAID is /not/ a backup solution - that >> cannot be said often enough! >> >> Discuss failure recovery - how to find and remove bad disks, how to deal >> with recovering disks from a different machine after the first one has >> died, etc. 
Emphasise the importance of labelling disks in your machines >> and being sure you pull the right disk! >> > I think that's covered elsewhere :-) Maybe you could add a few links? There is no need to repeat information. mvh., David > > Cheers, > Wol > ^ permalink raw reply [flat|nested] 15+ messages in thread
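The near/far distinction David describes can be sketched as a chunk-to-(disk, offset) mapping. This is a deliberately simplified model of the md raid10 layouts for two copies (the authoritative logic lives in drivers/md/raid10.c); the function names and parameters are illustrative:

```python
# Simplified placement for md raid10 with 2 copies on n disks.
# Each function returns ((disk, offset), (disk, offset)) for one chunk.

def near2(chunk, n):
    # "near": the two copies sit side by side in the same row, on
    # adjacent disks (for n == 2 this degenerates to raid1)
    pos = 2 * chunk
    return (pos % n, pos // n), ((pos + 1) % n, (pos + 1) // n)

def far2(chunk, n, half):
    # "far": first copies are striped raid0-style across the first
    # half of every disk; second copies live in the second half,
    # rotated by one disk ("half" = chunks per half-disk)
    row, disk = divmod(chunk, n)
    return (disk, row), ((disk + 1) % n, half + row)

# With far2, consecutive chunks' primary copies hit every disk in
# turn, which is why streamed reads approach raid0 speed:
assert [far2(c, 4, 8)[0][0] for c in range(4)] == [0, 1, 2, 3]
```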
* Re: Implementing Global Parity Codes 2018-01-29 17:44 ` Wols Lists 2018-01-30 11:47 ` David Brown @ 2018-01-30 14:18 ` Brad Campbell 1 sibling, 0 replies; 15+ messages in thread From: Brad Campbell @ 2018-01-30 14:18 UTC (permalink / raw) To: Wols Lists; +Cc: linux-raid On 30/01/18 01:44, Wols Lists wrote: > Except that most drives don't do that nowadays, they do "constant linear > velocity" so the drive speeds up or slows down depending on where the > heads are, I believe. No. Hard disks have one speed. That can be easily proven by popping the top off and putting a tacho on the spindle, or just looking at any linear read benchmark as they all demonstrate that data transfer slows down as the head works its way towards the spindle. Same reason the first couple of tracks on each LP sounded better. More vinyl to get a better response. Optical disks used CLV where they needed to get bits out at a specific clock rate (ie Audio). Brad. ^ permalink raw reply [flat|nested] 15+ messages in thread
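Brad's point can be put into rough numbers: the spindle speed is fixed, and with zoned recording the outer tracks carry more sectors per revolution, so the sustained transfer rate falls toward the spindle. All figures below are illustrative, not from any datasheet:

```python
# Back-of-envelope: throughput = (data per track) x (revolutions per
# second).  RPM is constant; track capacity roughly halves from the
# outer edge to the spindle.  Illustrative numbers only.
rpm = 7200
rev_per_s = rpm / 60          # 120 revolutions per second
outer_mb_per_track = 2.0      # MB on an outer track (assumed)
inner_mb_per_track = 1.0      # inner circumference roughly half (assumed)

print(f"outer: {outer_mb_per_track * rev_per_s:.0f} MB/s")  # outer: 240 MB/s
print(f"inner: {inner_mb_per_track * rev_per_s:.0f} MB/s")  # inner: 120 MB/s
```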
* Re: Implementing Global Parity Codes 2018-01-29 10:22 ` David Brown 2018-01-29 17:44 ` Wols Lists @ 2018-01-30 11:30 ` mostafa kishani 2018-01-30 15:14 ` David Brown 1 sibling, 1 reply; 15+ messages in thread From: mostafa kishani @ 2018-01-30 11:30 UTC (permalink / raw) To: David Brown; +Cc: Wols Lists, linux-raid David, what you pointed out about the employment of PMDS codes is correct. We have no access to what happens in the SSD firmware (such as the FTL). But why can't this code be implemented in the software layer (similar to RAID5/6...)? I also thank you for pointing out some very interesting subjects. > > Other things to consider on big arrays are redundancy of controllers, or > even servers (for SAN arrays). Consider the pros and cons of spreading your > redundancy across blocks. For example, if your server has two controllers > then you might want your low-level block to be Raid-1 pairs with one disk on > each controller. That could give you a better spread of bandwidths and give > you resistance to a broken controller. > > You could also talk about asymmetric raid setups, such as having a > write-only redundant copy on a second server over a network, or as a cheap > hard disk copy of your fast SSDs. > > And you could also discuss strategies for disk replacement - after failures, > or for growing the array. The disk replacement strategy has a significant effect on both reliability and performance. The occurrence of human error in disk replacement can result in data unavailability and data loss. In the following paper I have briefly discussed this subject and how a good disk replacement policy can improve reliability by orders of magnitude (a more detailed version of this paper is on the way!): https://dl.acm.org/citation.cfm?id=3130452 You can download it using sci-hub if you don't have ACM access. > > It is also worth emphasising that RAID is /not/ a backup solution - that > cannot be said often enough! 
> > Discuss failure recovery - how to find and remove bad disks, how to deal > with recovering disks from a different machine after the first one has died, > etc. Emphasise the importance of labelling disks in your machines and being > sure you pull the right disk! I really appreciate if you can share your experience about pulling wrong disk and any statistics. This is an interesting subject to discuss. On Mon, Jan 29, 2018 at 1:52 PM, David Brown <david.brown@hesbynett.no> wrote: > > > On 27/01/2018 16:13, Wols Lists wrote: >> >> On 27/01/18 14:29, mostafa kishani wrote: >>> >>> Thanks for your response Wol >>> Well, maybe I failed to illustrate what I'm going to implement. I try >>> to better clarify using your terminology: >>> In the normal RAID5 and RAID6 codes we have one/two parities per >>> stripe. Now consider sharing a redundant sector between say, 4 >>> stripes, and assume that the redundant sector is saved in stripe4. >>> Assume the redundant sector is the parity of all sectors in stripe1, >>> stripe2, stripe3, and stripe4. Using this redundant sector you can >>> tolerate one sector failure across stripe1 to stripe4. We already have >>> the parity sectors of RAID5 and RAID6 and this redundant sector is >>> added to tolerate an extra sector failure. I call this redundant >>> sector "Global Parity". >>> I try to demonstrate this as follows, assuming each RAID5 stripe has 3 >>> data sectors and one parity sector. >>> stripe1: DATA1 | DATA2 | DATA3 | PARITY1 >>> stripe2: PARITY2 | DATA4 | DATA5 | DATA6 >>> stripe3: DATA7 | PARITY3 | DATA8 | DATA9 >>> stripe4: DATA10 | DATA11 | PARITY4 | GLOBAL PARITY >>> >>> and the Global Parity is taken across all data and parity as follows: >>> GLOBAL PARITY = DATA1 X DATA2 X DATA3 X DATA4 X DATA5 X DATA6 X DATA7 >>> X DATA8 X DATA9 X DATA10 X DATA11 X PARITY1 X PARITY2 X PARITY3 >>> >>> Where "X" stands for XOR operation. >>> I hope it was clear. >> >> >> OWWW!!!! >> >> Have you done and understood the maths!!!??? 
> > > I have started looking at the paper (from the link in Mostafa's next post). > I have only read a few pages as yet, but it looks to me to have some > fundamental misunderstandings about SSDs, how they work, and their typical > failures, and to be massively mixing up the low-level structures visible > inside the SSD firmware and the high-level view available to the kernel and > the md layer. At best, this "PMDS" idea with blocks might be an alternative > or addition to ECC layers within the SSD - but not at the md layer. I have > not read the whole paper yet, so I could be missing something - but I am > sceptical. > > >> >> You may have noticed I said that while raid-6 was similar in principle >> to raid-5, it was very different in implementation. Because of the maths! > > > Yes, indeed. The maths of raid 6 is a lot of fun, and very smart. > >> >> Going back to high-school algebra, if we have E *unique* equations, and >> U unknowns, then we can only solve the equations if E > U (I think I've >> got that right, it might be >=). > > > E >= U. You can solve "2 * x - 4 = 0", which is one equation in one > unknown. But critically, the E equations need to be linearly independent > (that is probably what you mean by "unique"). > >> >> With raid-5, parity1 = data1 xor data2 xor data3. Now let's assume >> somebody thinks "let's add parity2" and defines parity2 = data1 xor >> data2 xor data3 xor parity1. THAT WON'T WORK. > > > Correct - this is because the two equations are not linearly independent. > The second parity would always be 0 in this case. > >> Raid-5 relies on the >> equation "if N is even, then parity2 is all ones, else parity2 is all >> zeroes", where N is the number of disks, so if we calculate parity2 we >> add absolutely nothing to our pre-existing E. >> >> If you are planning to use XOR, I think you are falling into *exactly* >> that trap! Plus, it looks to me as if calculating your global parity is >> going to be a disk-hammering nightmare ... 
> > > Yes. > >> >> That's why raid-6 uses a *completely* *different* algorithm to calculate >> its parity1 and parity2. > > > It is not actually a completely different algorithm, if you view it in the > correct way. You can say the raid5 parity P is just the xor of the bits, > while the raid6 parity Q is a polynomial over the GF(2^8) field - certainly > they look completely different then. But once you move to the GF(2^8) > field, the equations become: > > P = d_0 + d_1 + d_2 + d_3 + d_4 + ... > Q = d_0 + 2 . d_1 + 2^2 . d_2 + 2^3 . d_3 + 2^4 . d_4 + ... > > (Note that none of this is "ordinary" maths - addition and multiplication is > special in the GF field.) > > It is even possible to extend it to a third parity in a similar way: > > R = d_0 + 4 . d_1 + 4^2 . d_2 + 4^3 . d_3 + 4^4 . d_4 + ... > > There are other schemes that scale better beyond the third parity (this > scheme can generate a fourth parity bit, but it is then only valid for up to > 21 data disks). > >> >> I've updated a page on the wiki, because it's come up in other >> discussions as well, but it seems to me if you need extra parity, you >> really ought to be going for raid-60. Take a look ... >> >> >> https://raid.wiki.kernel.org/index.php/What_is_RAID_and_why_should_you_want_it%3F#Which_raid_is_for_me.3F >> >> and if anyone else wants to comment, too? ... >> > > Here are a few random comments: > > Raid-10-far2 can be /faster/ than Raid0 on the same number of HDs, for > read-only performance. This is because the data for both stripes will be > read from the first half of the disks - the outside half. On many disks > this gives higher read speeds, since the same angular rotation speed has > higher linear velocity at the disk heads. It also gives shorter seek times > as the head does not have to move as far in or out to cover the whole range. > For SSDs, the layout for Raid-10 makes almost no difference (but it is still > faster than plain Raid-1 for streamed reads). 
> > For two drives, Raid-10 is a fine choice on read-heavy or streaming > applications. > > I think you could emphasise that there is little point in having Raid-5 plus > a spare - Raid-6 is better in every way. > > You should make a clearer distinction that by "Raid-6+0" you mean a Raid-0 > stripe of Raid-6 sets, rather than a Raid-6 set of Raid-0 stripes. > > There are also many, many other ways to organise multi-layer raids. Striping > at the high level (like Raid-6+0) makes sense only if you have massive > streaming operations for single files, and massive bandwidth - it is poorer > for operations involving a large number of parallel accesses. A common > arrangement for big arrays is a linear concatenation of Raid-1 pairs (or > Raid-5 or Raid-6 sets) - combined with an appropriate file system (XFS comes > out well here) you get massive scalability and very high parallel access > speeds. > > Other things to consider on big arrays are redundancy of controllers, or > even servers (for SAN arrays). Consider the pros and cons of spreading your > redundancy across blocks. For example, if your server has two controllers > then you might want your low-level block to be Raid-1 pairs with one disk on > each controller. That could give you a better spread of bandwidths and give > you resistance to a broken controller. > > You could also talk about asymmetric raid setups, such as having a > write-only redundant copy on a second server over a network, or as a cheap > hard disk copy of your fast SSDs. > > And you could also discuss strategies for disk replacement - after failures, > or for growing the array. > > It is also worth emphasising that RAID is /not/ a backup solution - that > cannot be said often enough! > > Discuss failure recovery - how to find and remove bad disks, how to deal > with recovering disks from a different machine after the first one has died, > etc. Emphasise the importance of labelling disks in your machines and being > sure you pull the right disk! 
> > > >> Cheers, >> Wol ^ permalink raw reply [flat|nested] 15+ messages in thread
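[The P and Q equations above can be checked numerically. Below is a toy Python sketch of the GF(2^8) arithmetic, assuming the 0x11d reduction polynomial used by the kernel's raid6 code; the real implementation works from precomputed tables and SIMD, not bit loops, so this is purely illustrative.]

```python
# Toy GF(2^8) arithmetic, reduced mod the polynomial 0x11d.
# Illustrative only - not how md's raid6 code is actually written.

def gf_mul(a, b):
    """Carry-less multiply of two GF(2^8) elements, reduced mod 0x11d."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return r

def pq_syndromes(data):
    """P = d0 + d1 + ..., Q = d0 + 2.d1 + 2^2.d2 + ... (per byte, in GF(2^8))."""
    p = q = 0
    w = 1                      # w = 2^i, the weight for data block i
    for d in data:
        p ^= d
        q ^= gf_mul(w, d)
        w = gf_mul(w, 2)
    return p, q

data = [0x11, 0xa5, 0x3c, 0x7e]
p, q = pq_syndromes(data)

# Recovering a single lost data byte needs only P (raid-5 style):
lost = 2
rebuilt = p
for i, d in enumerate(data):
    if i != lost:
        rebuilt ^= d
assert rebuilt == data[lost]
```

[Recovering /two/ lost blocks additionally needs GF division (inverses of terms like 2^i + 2^j), which is where the "hairy maths" mentioned above comes in.]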
* Re: Implementing Global Parity Codes 2018-01-30 11:30 ` mostafa kishani @ 2018-01-30 15:14 ` David Brown 2018-01-31 16:03 ` mostafa kishani 0 siblings, 1 reply; 15+ messages in thread From: David Brown @ 2018-01-30 15:14 UTC (permalink / raw) To: mostafa kishani; +Cc: Wols Lists, linux-raid On 30/01/18 12:30, mostafa kishani wrote: > David what you pointed about employment of PMDS codes is correct. We > have no access to what happens in the SSD firmware (such as FTL). But > why this code cannot be implemented in the software layer (similar to > RAID5/6...) ? I also thank you for pointing out very interesting > subjects. > I must admit that I haven't dug through the mathematical details of the paper. It looks to be at a level that I /could/ understand, but would need to put in quite a bit of time and effort. And the paper does not strike me as being particularly outstanding or special - there are many, many such papers published about new ideas in error detection and correction. While it is not clear to me exactly how these additional "global" parity blocks are intended to help correct errors in the paper, I can see a way to handle it.

d d d d d P
d d d d d P
d d d d d P
d d d S S P

Where the "d" blocks are normal data blocks, "P" are raid-5 parity blocks (another column for raid-6 Q blocks could be added), and "S" are these "global" parity blocks. If a row has more errors than the normal parity block(s) can correct, then it is possible to use wider parity blocks to help. If you have one S that is defined in the same way as raid-6 Q parity, then it can be used to correct an extra error in a stripe. That relies on all the other stripes having at most P-correctable errors. The maths gets quite hairy. Two parity blocks are well-defined at the moment - raid-5 (xor) and raid-6 (using powers of 2 weights on the data blocks, over GF(2^8)). To provide recovery here, the S parities would have to fit within the same scheme. 
A third parity block is relatively easy to calculate using powers of 4 weights - but that is not scalable (a fourth parity using powers of 8 does not work beyond 21 data blocks). An alternative multi-parity scheme is possible using significantly more complex maths. However it is done, it would be hard. I am also not convinced that it would work for extra errors distributed throughout the block, rather than just in one row. A much simpler system could be done using vertical parities:

d d d d d P
d d d d d P
d d d d d P
V V V V V P

Here, the V is just a raid-5 parity of the column of blocks. You now effectively have a raid-5-5 layered setup, but distributed within the one set of disks. Recovery would be straight-forward - if a block could not be re-created from a horizontal parity, then the vertical parity would be used. You would have some write amplification, but it would perhaps not be too bad (you could have many rows per vertical parity block), and it would be fine for read-mostly applications. It bears a certain resemblance to raid-10 layouts. Of course, raid-5-6, raid-6-5 and raid-6-6 would also be possible. >> >> Other things to consider on big arrays are redundancy of controllers, or >> even servers (for SAN arrays). Consider the pros and cons of spreading your >> redundancy across blocks. For example, if your server has two controllers >> then you might want your low-level block to be Raid-1 pairs with one disk on >> each controller. That could give you a better spread of bandwidths and give >> you resistance to a broken controller. >> >> You could also talk about asymmetric raid setups, such as having a >> write-only redundant copy on a second server over a network, or as a cheap >> hard disk copy of your fast SSDs. >> >> And you could also discuss strategies for disk replacement - after failures, >> or for growing the array. > > The disk replacement strategy has a significant effect on both > reliability and performance. 
The occurrence of human errors in desk > replacement can result in data unavailability and data loss. In the > following paper I've briefly discussed this subject and how a good > disk replacement policy can improve reliability by orders of magnitude > (a more detailed version of this paper is on the way!): > https://dl.acm.org/citation.cfm?id=3130452 In my experience, human error leads to more data loss than mechanical errors - and you really need to take it into account. > > you can download it using sci-hub if you don't have ACM access. > >> >> It is also worth emphasising that RAID is /not/ a backup solution - that >> cannot be said often enough! >> >> Discuss failure recovery - how to find and remove bad disks, how to deal >> with recovering disks from a different machine after the first one has died, >> etc. Emphasise the importance of labelling disks in your machines and being >> sure you pull the right disk! > > I really appreciate if you can share your experience about pulling > wrong disk and any statistics. This is an interesting subject to > discuss. > My server systems are too small in size, and too few in numbers, for statistics. I haven't actually pulled the wrong disk, but I did come /very/ close before deciding to have one last double-check. I have also tripped over the USB wire to an external disk and thrown it across the room - I am now a lot more careful about draping wires around! mvh., David ^ permalink raw reply [flat|nested] 15+ messages in thread
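[The vertical-parity layout David describes is easy to model. A toy Python sketch (one byte per block, purely hypothetical layout, not md code) showing that a row with more losses than its horizontal P can handle is still recoverable column by column via the V row:]

```python
# Toy model of the "vertical parity" layout: a grid of one-byte blocks,
# a raid-5 P appended to each row, and a V row that is the xor of each
# column. Hypothetical illustration only.
import functools
import random

random.seed(1)
rows, cols = 3, 5
grid = [[random.randrange(256) for _ in range(cols)] for _ in range(rows)]
for row in grid:
    row.append(functools.reduce(lambda a, b: a ^ b, row))   # horizontal raid-5 P
vrow = [functools.reduce(lambda a, b: a ^ b, col) for col in zip(*grid)]

# Suppose two blocks in row 0 are lost - too many for that row's P alone.
# Each one can still be rebuilt from its column via the V row:
for c in (1, 3):
    lost = grid[0][c]
    rebuilt = vrow[c]
    for r in range(1, rows):
        rebuilt ^= grid[r][c]
    assert rebuilt == lost
```

[This is the "raid-5-5" recovery path in miniature: horizontal parity first, vertical parity as the fallback.]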
* Re: Implementing Global Parity Codes 2018-01-30 15:14 ` David Brown @ 2018-01-31 16:03 ` mostafa kishani 2018-01-31 17:53 ` Piergiorgio Sartor 0 siblings, 1 reply; 15+ messages in thread From: mostafa kishani @ 2018-01-31 16:03 UTC (permalink / raw) To: David Brown; +Cc: Wols Lists, linux-raid Yes, that's exactly what the code does. Here the math of encoding/decoding is not as important as the IO overhead. Upon a stripe update, it needs to update the Global Parity as well (which is probably in another stripe). This should result in terrible performance in random-write workloads. But in sequential-write workloads this code may have performance near RAID5 and slightly better than RAID6. The 2D codes (as you suggested) also suffer a huge IO penalty, and this is why they're barely employed even in fast memory structures such as SRAM/DRAM. Bests, Mostafa On Tue, Jan 30, 2018 at 6:44 PM, David Brown <david.brown@hesbynett.no> wrote: > On 30/01/18 12:30, mostafa kishani wrote: >> David what you pointed about employment of PMDS codes is correct. We >> have no access to what happens in the SSD firmware (such as FTL). But >> why this code cannot be implemented in the software layer (similar to >> RAID5/6...) ? I also thank you for pointing out very interesting >> subjects. >> > > I must admit that I haven't dug through the mathematical details of the > paper. It looks to be at a level that I /could/ understand, but would > need to put in quite a bit of time and effort. And the paper does not > strike me as being particularly outstanding or special - there are many, > many such papers published about new ideas in error detection and > correction. > > While it is not clear to me exactly how these additional "global" parity > blocks are intended to help correct errors in the paper, I can see a way > to handle it. 
> > d d d d d P > d d d d d P > d d d d d P > d d d S S P > > Where the "d" blocks are normal data blocks, "P" are raid-5 parity > blocks (another column for raid-6 Q blocks could be added), and "S" are > these "global" parity blocks. > > If a row has more errors than the normal parity block(s) can correct, > then it is possible to use wider parity blocks to help. If you have one > S that is defined in the same way as raid-6 Q parity, then it can be > used to correct an extra error in a stripe. That relies on all the > other stripes having at most P-correctable errors. > > The maths gets quite hairy. Two parity blocks are well-defined at the > moment - raid-5 (xor) and raid-6 (using powers of 2 weights on the data > blocks, over GF(8)). To provide recovery here, the S parities would > have to fit within the same scheme. A third parity block is relatively > easy to calculate using powers of 4 weights - but that is not scalable > (a fourth parity using powers of 8 does not work beyond 21 data blocks). > An alternative multi-parity scheme is possible using significantly more > complex maths. > > However it is done, it would be hard. I am also not convinced that it > would work for extra errors distributed throughout the block, rather > than just in one row. > > A much simpler system could be done using vertical parities: > > d d d d d P > d d d d d P > d d d d d P > V V V V V P > > Here, the V is just a raid-5 parity of the column of blocks. You now > effectively have a raid-5-5 layered setup, but distributed within the > one set of disks. Recovery would be straight-forward - if a block could > not be re-created from a horizontal parity, then the vertical parity > would be used. You would have some write amplification, but it would > perhaps not be too bad (you could have many rows per vertical parity > block), and it would be fine for read-mostly applications. It bears a > certain resemblance to raid-10 layouts. 
Of course, raid-5-6, raid-6-5 > and raid-6-6 would also be possible. > > >>> >>> Other things to consider on big arrays are redundancy of controllers, or >>> even servers (for SAN arrays). Consider the pros and cons of spreading your >>> redundancy across blocks. For example, if your server has two controllers >>> then you might want your low-level block to be Raid-1 pairs with one disk on >>> each controller. That could give you a better spread of bandwidths and give >>> you resistance to a broken controller. >>> >>> You could also talk about asymmetric raid setups, such as having a >>> write-only redundant copy on a second server over a network, or as a cheap >>> hard disk copy of your fast SSDs. >>> >>> And you could also discuss strategies for disk replacement - after failures, >>> or for growing the array. >> >> The disk replacement strategy has a significant effect on both >> reliability and performance. The occurrence of human errors in desk >> replacement can result in data unavailability and data loss. In the >> following paper I've briefly discussed this subject and how a good >> disk replacement policy can improve reliability by orders of magnitude >> (a more detailed version of this paper is on the way!): >> https://dl.acm.org/citation.cfm?id=3130452 > > In my experience, human error leads to more data loss than mechanical > errors - and you really need to take it into account. > >> >> you can download it using sci-hub if you don't have ACM access. >> >>> >>> It is also worth emphasising that RAID is /not/ a backup solution - that >>> cannot be said often enough! >>> >>> Discuss failure recovery - how to find and remove bad disks, how to deal >>> with recovering disks from a different machine after the first one has died, >>> etc. Emphasise the importance of labelling disks in your machines and being >>> sure you pull the right disk! >> >> I really appreciate if you can share your experience about pulling >> wrong disk and any statistics. 
This is an interesting subject to >> discuss. >> > > My server systems are too small in size, and too few in numbers, for > statistics. I haven't actually pulled the wrong disk, but I did come > /very/ close before deciding to have one last double-check. > > I have also tripped over the USB wire to an external disk and thrown it > across the room - I am now a lot more careful about draping wires around! > > > mvh., > > David > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Implementing Global Parity Codes 2018-01-31 16:03 ` mostafa kishani @ 2018-01-31 17:53 ` Piergiorgio Sartor 0 siblings, 0 replies; 15+ messages in thread From: Piergiorgio Sartor @ 2018-01-31 17:53 UTC (permalink / raw) To: mostafa kishani; +Cc: David Brown, Wols Lists, linux-raid Hi all, sorry for the top posting. In a previous message, you explained that the "Global Parity" would be the xor of all the data across the stripes, including the stripes' parities. Is this still the case? Did I miss something? Because, by definition, the xor between the data and the parity, in a stripe, is always 0. Hence, the xor of all stripes' data and parities is 0 too, always, and so it is *not* necessary to store it. It is only necessary to check it, if wanted. Now, again, maybe I skipped some parts, so I apologize in advance if this is the case and what is written above is just rubbish; otherwise, something is not really correct in the interpretation of the cited paper. bye, pg On Wed, Jan 31, 2018 at 07:33:54PM +0330, mostafa kishani wrote: > Yes that's exactly what the code does. Here the math of > encoding/decoding is not as important of IO overhead. Upon a stripe > update, it needs to update the Global parity as well (that is probably > in another stripe). This should result in a terrible performance in > random-write workloads. But in sequential-write workloads this code > may have a performance near to RAID5 and slightly better than RAID6. > The 2D codes (as you suggested) also suffer a huge IO penalty and this > is why the're barely employed even is fast memory structure such as > SRAM/DRAM. > > Bests, > Mostafa > > On Tue, Jan 30, 2018 at 6:44 PM, David Brown <david.brown@hesbynett.no> wrote: > > On 30/01/18 12:30, mostafa kishani wrote: > >> David what you pointed about employment of PMDS codes is correct. We > >> have no access to what happens in the SSD firmware (such as FTL). But > >> why this code cannot be implemented in the software layer (similar to > >> RAID5/6...) ? 
I also thank you for pointing out very interesting > >> subjects. > >> > > > > I must admit that I haven't dug through the mathematical details of the > > paper. It looks to be at a level that I /could/ understand, but would > > need to put in quite a bit of time and effort. And the paper does not > > strike me as being particularly outstanding or special - there are many, > > many such papers published about new ideas in error detection and > > correction. > > > > While it is not clear to me exactly how these additional "global" parity > > blocks are intended to help correct errors in the paper, I can see a way > > to handle it. > > > > d d d d d P > > d d d d d P > > d d d d d P > > d d d S S P > > > > Where the "d" blocks are normal data blocks, "P" are raid-5 parity > > blocks (another column for raid-6 Q blocks could be added), and "S" are > > these "global" parity blocks. > > > > If a row has more errors than the normal parity block(s) can correct, > > then it is possible to use wider parity blocks to help. If you have one > > S that is defined in the same way as raid-6 Q parity, then it can be > > used to correct an extra error in a stripe. That relies on all the > > other stripes having at most P-correctable errors. > > > > The maths gets quite hairy. Two parity blocks are well-defined at the > > moment - raid-5 (xor) and raid-6 (using powers of 2 weights on the data > > blocks, over GF(8)). To provide recovery here, the S parities would > > have to fit within the same scheme. A third parity block is relatively > > easy to calculate using powers of 4 weights - but that is not scalable > > (a fourth parity using powers of 8 does not work beyond 21 data blocks). > > An alternative multi-parity scheme is possible using significantly more > > complex maths. > > > > However it is done, it would be hard. I am also not convinced that it > > would work for extra errors distributed throughout the block, rather > > than just in one row. 
> > > > A much simpler system could be done using vertical parities: > > > > d d d d d P > > d d d d d P > > d d d d d P > > V V V V V P > > > > Here, the V is just a raid-5 parity of the column of blocks. You now > > effectively have a raid-5-5 layered setup, but distributed within the > > one set of disks. Recovery would be straight-forward - if a block could > > not be re-created from a horizontal parity, then the vertical parity > > would be used. You would have some write amplification, but it would > > perhaps not be too bad (you could have many rows per vertical parity > > block), and it would be fine for read-mostly applications. It bears a > > certain resemblance to raid-10 layouts. Of course, raid-5-6, raid-6-5 > > and raid-6-6 would also be possible. > > > > > >>> > >>> Other things to consider on big arrays are redundancy of controllers, or > >>> even servers (for SAN arrays). Consider the pros and cons of spreading your > >>> redundancy across blocks. For example, if your server has two controllers > >>> then you might want your low-level block to be Raid-1 pairs with one disk on > >>> each controller. That could give you a better spread of bandwidths and give > >>> you resistance to a broken controller. > >>> > >>> You could also talk about asymmetric raid setups, such as having a > >>> write-only redundant copy on a second server over a network, or as a cheap > >>> hard disk copy of your fast SSDs. > >>> > >>> And you could also discuss strategies for disk replacement - after failures, > >>> or for growing the array. > >> > >> The disk replacement strategy has a significant effect on both > >> reliability and performance. The occurrence of human errors in desk > >> replacement can result in data unavailability and data loss. 
In the > >> following paper I've briefly discussed this subject and how a good > >> disk replacement policy can improve reliability by orders of magnitude > >> (a more detailed version of this paper is on the way!): > >> https://dl.acm.org/citation.cfm?id=3130452 > > > > In my experience, human error leads to more data loss than mechanical > > errors - and you really need to take it into account. > > > >> > >> you can download it using sci-hub if you don't have ACM access. > >> > >>> > >>> It is also worth emphasising that RAID is /not/ a backup solution - that > >>> cannot be said often enough! > >>> > >>> Discuss failure recovery - how to find and remove bad disks, how to deal > >>> with recovering disks from a different machine after the first one has died, > >>> etc. Emphasise the importance of labelling disks in your machines and being > >>> sure you pull the right disk! > >> > >> I really appreciate if you can share your experience about pulling > >> wrong disk and any statistics. This is an interesting subject to > >> discuss. > >> > > > > My server systems are too small in size, and too few in numbers, for > > statistics. I haven't actually pulled the wrong disk, but I did come > > /very/ close before deciding to have one last double-check. > > > > I have also tripped over the USB wire to an external disk and thrown it > > across the room - I am now a lot more careful about draping wires around! > > > > > > mvh., > > > > David > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- piergiorgio ^ permalink raw reply [flat|nested] 15+ messages in thread
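[Piergiorgio's observation is easy to verify numerically: since each raid-5 parity is, by definition, the xor of its stripe's data, xoring every data and parity sector across any number of stripes always yields zero, so a pure-XOR "global parity" stores no new information. A toy Python sketch, one byte per sector, illustrative only:]

```python
# Each stripe carries 3 data sectors plus a raid-5 parity (xor of the data).
# The xor of *everything* - all data and all parities - is 0 by construction,
# whatever the data, so it is pointless to store it as a "global parity".
import functools
import random

random.seed(0)

def xor(values):
    return functools.reduce(lambda a, b: a ^ b, values, 0)

stripes = [[random.randrange(256) for _ in range(3)] for _ in range(4)]
full = [data + [xor(data)] for data in stripes]   # data sectors + raid-5 parity

global_parity = xor(sector for stripe in full for sector in stripe)
assert global_parity == 0   # always zero - it can only serve as a check value
```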
* Re: Implementing Global Parity Codes 2018-01-27 5:47 Implementing Global Parity Codes mostafa kishani 2018-01-27 8:37 ` Wols Lists @ 2018-02-02 5:24 ` NeilBrown 2018-02-03 6:01 ` mostafa kishani 1 sibling, 1 reply; 15+ messages in thread From: NeilBrown @ 2018-02-02 5:24 UTC (permalink / raw) To: mostafa kishani, linux-raid [-- Attachment #1: Type: text/plain, Size: 1797 bytes --] On Sat, Jan 27 2018, mostafa kishani wrote: > Dear All, > > I am going to make some modifications to RAID protocol to make it more > reliable for my case (for a scientific, and maybe later, industrial > purpose). For example, I'm going to hold a Global Parity (a parity > taken across the whole data stripe rather than a row) alongside normal > row-wise parities, to cope with an extra sector/page failure per > stripe. Do you have any suggestion how can I implement this with a > moderate effort (I mean what functions should be modified)? have any > of you had any similar effort? In raid5.c there is a "struct stripe_head" which represents a stripe that is one page (normally 4K) wide across all devices. All the data for any parity calculation can be found in a 'stripe_head'. You would probably need to modify the stripe_head to represent several more blocks so that all the Data and Parity for any computation are always attached to the one stripe_head. > I also appreciate if you guide me how can I enable DEBUG mode in mdadm. I assume you mean debug mode on "md". mdadm is the management tool. md is the kernel driver. mdadm doesn't have a debug mode. 
md has a number of pr_debug() calls which can each be turned on or off independently using dynamic debugging https://www.kernel.org/doc/html/v4.15/admin-guide/dynamic-debug-howto.html To turn on all pr_debug commands in raid5.c use echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control to turn them off again: echo file raid5.c -p > /sys/kernel/debug/dynamic_debug/control NeilBrown > > Bests, > Mostafa > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Implementing Global Parity Codes 2018-02-02 5:24 ` NeilBrown @ 2018-02-03 6:01 ` mostafa kishani 0 siblings, 0 replies; 15+ messages in thread From: mostafa kishani @ 2018-02-03 6:01 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid Dear Neil, Sincere thanks for the useful information. I really appreciate your help. Bests, Mostafa On Fri, Feb 2, 2018 at 8:54 AM, NeilBrown <neilb@suse.com> wrote: > On Sat, Jan 27 2018, mostafa kishani wrote: > >> Dear All, >> >> I am going to make some modifications to RAID protocol to make it more >> reliable for my case (for a scientific, and maybe later, industrial >> purpose). For example, I'm going to hold a Global Parity (a parity >> taken across the whole data stripe rather than a row) alongside normal >> row-wise parities, to cope with an extra sector/page failure per >> stripe. Do you have any suggestion how can I implement this with a >> moderate effort (I mean what functions should be modified)? have any >> of you had any similar effort? > > In raid5.c the is a "struct stripe_head" which represents a stripe that > is one-page (normally 4K) wide across all devices. All the data for any > parity calculation can all be found in a 'stripe_head'. > You would probably need to modify the stripe_head to represent several > more blocks so that all the Data and Parity for any computation are > always attached to the one stripe_head. > >> I also appreciate if you guide me how can I enable DEBUG mode in mdadm. > > I assume you mean debug mode on "md". > mdadm is the management tool. > md is the kernel driver. > > mdadm doesn't have a debug mode. 
> > md has a number of pr_debug() calls which can each be turned on or off > independently using dynamic debugging > > https://www.kernel.org/doc/html/v4.15/admin-guide/dynamic-debug-howto.html > > To turn on all pr_debug commands in raid5.c use > > echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control > > to turn them off again: > > echo file raid5.c -p > /sys/kernel/debug/dynamic_debug/control > > NeilBrown > >> >> Bests, >> Mostafa >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2018-02-03 6:01 UTC | newest]

Thread overview: 15+ messages
2018-01-27 5:47 Implementing Global Parity Codes mostafa kishani
2018-01-27 8:37 ` Wols Lists
2018-01-27 14:29 ` mostafa kishani
2018-01-27 15:13 ` Wols Lists
2018-01-28 13:00 ` mostafa kishani
2018-01-29 10:22 ` David Brown
2018-01-29 17:44 ` Wols Lists
2018-01-30 11:47 ` David Brown
2018-01-30 14:18 ` Brad Campbell
2018-01-30 11:30 ` mostafa kishani
2018-01-30 15:14 ` David Brown
2018-01-31 16:03 ` mostafa kishani
2018-01-31 17:53 ` Piergiorgio Sartor
2018-02-02 5:24 ` NeilBrown
2018-02-03 6:01 ` mostafa kishani