* Implementing Global Parity Codes
@ 2018-01-27  5:47 mostafa kishani
  2018-01-27  8:37 ` Wols Lists
  2018-02-02  5:24 ` NeilBrown
  0 siblings, 2 replies; 15+ messages in thread
From: mostafa kishani @ 2018-01-27  5:47 UTC (permalink / raw)
  To: linux-raid

Dear All,

I am going to make some modifications to the RAID protocol to make it more
reliable for my use case (for a scientific, and maybe later an industrial,
purpose). For example, I'm going to hold a Global Parity (a parity
taken across the whole data stripe rather than a row) alongside the normal
row-wise parities, to cope with an extra sector/page failure per
stripe. Do you have any suggestions on how I can implement this with
moderate effort (I mean, which functions should be modified)? Have any
of you attempted anything similar?
I would also appreciate guidance on how to enable DEBUG mode in mdadm.

Bests,
Mostafa


* Re: Implementing Global Parity Codes
  2018-01-27  5:47 Implementing Global Parity Codes mostafa kishani
@ 2018-01-27  8:37 ` Wols Lists
  2018-01-27 14:29   ` mostafa kishani
  2018-02-02  5:24 ` NeilBrown
  1 sibling, 1 reply; 15+ messages in thread
From: Wols Lists @ 2018-01-27  8:37 UTC (permalink / raw)
  To: mostafa kishani, linux-raid

On 27/01/18 05:47, mostafa kishani wrote:
> Dear All,
> 
> I am going to make some modifications to RAID protocol to make it more
> reliable for my case (for a scientific, and maybe later, industrial
> purpose). For example, I'm going to hold a Global Parity (a parity
> taken across the whole data stripe rather than a row)

Except that what do you mean by "row"? Aren't you using it as just
another word for "stripe"?

 alongside normal
> row-wise parities, to cope with an extra sector/page failure per
> stripe. Do you have any suggestion how can I implement this with a
> moderate effort (I mean what functions should be modified)? have any
> of you had any similar effort?

If I understand you correctly, that's easy. Raid-5 has one parity block
per stripe, enabling it to recover from one lost disk. Raid-6 has two
parity blocks per stripe, enabling it to recover from two lost disks, or one
randomly corrupted block.

Nobody's tried to do it, but it's a simple extension of the current
setup ... why don't you implement what I call "raid-6+", where you can
have as many parity disks as you like - in your case three. You'd need
to take the current raid-6 code and extend it - ignore raid-5 because
while the principle is the same, the detail is much simpler and cannot
be extended.

> I also appreciate if you guide me how can I enable DEBUG mode in mdadm.
> 
Can't help there, I'm afraid.

Cheers,
Wol



* Re: Implementing Global Parity Codes
  2018-01-27  8:37 ` Wols Lists
@ 2018-01-27 14:29   ` mostafa kishani
  2018-01-27 15:13     ` Wols Lists
  0 siblings, 1 reply; 15+ messages in thread
From: mostafa kishani @ 2018-01-27 14:29 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid

Thanks for your response, Wol.
Well, maybe I failed to illustrate what I'm going to implement. Let me
try to clarify it using your terminology:
In the normal RAID5 and RAID6 codes we have one/two parities per
stripe. Now consider sharing a redundant sector between say, 4
stripes, and assume that the redundant sector is saved in stripe4.
Assume the redundant sector is the parity of all sectors in stripe1,
stripe2, stripe3, and stripe4. Using this redundant sector you can
tolerate one sector failure across stripe1 to stripe4. We already have
the parity sectors of RAID5 and RAID6 and this redundant sector is
added to tolerate an extra sector failure. I call this redundant
sector "Global Parity".
Let me demonstrate this as follows, assuming each RAID5 stripe has 3
data sectors and one parity sector.
stripe1: DATA1 | DATA2 | DATA3 | PARITY1
stripe2: PARITY2 | DATA4 | DATA5 | DATA6
stripe3: DATA7 | PARITY3 | DATA8 | DATA9
stripe4: DATA10 | DATA11 | PARITY4 | GLOBAL PARITY

and the Global Parity is taken across all data and parity as follows:
GLOBAL PARITY = DATA1 X DATA2 X DATA3 X DATA4 X DATA5 X DATA6 X DATA7
X DATA8 X DATA9 X DATA10 X DATA11 X PARITY1 X PARITY2 X PARITY3

where "X" stands for the XOR operation.
I hope this is clear.
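
For concreteness, here is a tiny Python sketch of that definition
(one-byte "sectors" with made-up values; this has nothing to do with how
md actually lays data out, it just restates the formula):

# hypothetical one-byte sector values, purely to illustrate the formula above
data = [0x03, 0x07, 0x01, 0x09, 0x0c, 0x05, 0x08, 0x02, 0x06, 0x0a, 0x04]  # DATA1..DATA11
parity1 = data[0] ^ data[1] ^ data[2]   # stripe1 row parity
parity2 = data[3] ^ data[4] ^ data[5]   # stripe2 row parity
parity3 = data[6] ^ data[7] ^ data[8]   # stripe3 row parity
# GLOBAL PARITY = XOR of DATA1..DATA11 and PARITY1..PARITY3
# (PARITY4 is left out, exactly as in the formula above)
global_parity = 0
for x in data + [parity1, parity2, parity3]:
    global_parity ^= x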

Bests,
Mostafa

On Sat, Jan 27, 2018 at 12:07 PM, Wols Lists <antlists@youngman.org.uk> wrote:
> On 27/01/18 05:47, mostafa kishani wrote:
>> Dear All,
>>
>> I am going to make some modifications to RAID protocol to make it more
>> reliable for my case (for a scientific, and maybe later, industrial
>> purpose). For example, I'm going to hold a Global Parity (a parity
>> taken across the whole data stripe rather than a row)
>
> Except that what do you mean by "row"? Aren't you using it as just
> another word for "stripe"?
>
>  alongside normal
>> row-wise parities, to cope with an extra sector/page failure per
>> stripe. Do you have any suggestion how can I implement this with a
>> moderate effort (I mean what functions should be modified)? have any
>> of you had any similar effort?
>
> If I understand you correctly, that's easy. Raid-5 has one parity block
> per stripe, enabling it to recover from one lost disk. Raid-6 has two
> parity stripes, enabling it to recover from two lost disks, or one
> random corrupted block.
>
> Nobody's tried to do it, but it's a simple extension of the current
> setup ... why don't you implement what I call "raid-6+", where you can
> have as many parity disks as you like - in your case three. You'd need
> to take the current raid-6 code and extend it - ignore raid-5 because
> while the principle is the same, the detail is much simpler and cannot
> be extended.
>
>> I also appreciate if you guide me how can I enable DEBUG mode in mdadm.
>>
> Can't help there, I'm afraid.
>
> Cheers,
> Wol
>


* Re: Implementing Global Parity Codes
  2018-01-27 14:29   ` mostafa kishani
@ 2018-01-27 15:13     ` Wols Lists
  2018-01-28 13:00       ` mostafa kishani
  2018-01-29 10:22       ` David Brown
  0 siblings, 2 replies; 15+ messages in thread
From: Wols Lists @ 2018-01-27 15:13 UTC (permalink / raw)
  To: mostafa kishani; +Cc: linux-raid

On 27/01/18 14:29, mostafa kishani wrote:
> Thanks for your response Wol
> Well, maybe I failed to illustrate what I'm going to implement. I try
> to better clarify using your terminology:
> In the normal RAID5 and RAID6 codes we have one/two parities per
> stripe. Now consider sharing a redundant sector between say, 4
> stripes, and assume that the redundant sector is saved in stripe4.
> Assume the redundant sector is the parity of all sectors in stripe1,
> stripe2, stripe3, and stripe4. Using this redundant sector you can
> tolerate one sector failure across stripe1 to stripe4. We already have
> the parity sectors of RAID5 and RAID6 and this redundant sector is
> added to tolerate an extra sector failure. I call this redundant
> sector "Global Parity".
> I try to demonstrate this as follows, assuming each RAID5 stripe has 3
> data sectors and one parity sector.
> stripe1: DATA1 | DATA2 | DATA3 | PARITY1
> stripe2: PARITY2 | DATA4 | DATA5 | DATA6
> stripe3: DATA7 | PARITY3 | DATA8 | DATA9
> stripe4: DATA10 | DATA11 | PARITY4 | GLOBAL PARITY
> 
> and the Global Parity is taken across all data and parity as follows:
> GLOBAL PARITY = DATA1 X DATA2 X DATA3 X DATA4 X DATA5 X DATA6 X DATA7
> X DATA8 X DATA9 X DATA10 X DATA11 X PARITY1 X PARITY2 X PARITY3
> 
> Where "X" stands for XOR operation.
> I hope it was clear.

OWWW!!!!

Have you done and understood the maths!!!???

You may have noticed I said that while raid-6 was similar in principle
to raid-5, it was very different in implementation. Because of the maths!

Going back to high-school algebra, if we have E *unique* equations, and
U unknowns, then we can only solve the equations if E > U (I think I've
got that right, it might be >=).

With raid-5, parity1 = data1 xor data2 xor data3. Now let's assume
somebody thinks "let's add parity2" and defines parity2 = data1 xor
data2 xor data3 xor parity1. THAT WON'T WORK. Raid-5 relies on the
equation "if N is even, then parity2 is all ones, else parity2 is all
zeroes", where N is the number of disks, so if we calculate parity2 we
add absolutely nothing to our pre-existing E.

If you are planning to use XOR, I think you are falling into *exactly*
that trap! Plus, it looks to me as if calculating your global parity is
going to be a disk-hammering nightmare ...

That's why raid-6 uses a *completely* *different* algorithm to calculate
its parity1 and parity2.

I've updated a page on the wiki, because it's come up in other
discussions as well, but it seems to me if you need extra parity, you
really ought to be going for raid-60. Take a look ...

https://raid.wiki.kernel.org/index.php/What_is_RAID_and_why_should_you_want_it%3F#Which_raid_is_for_me.3F

and if anyone else wants to comment, too? ...

Cheers,
Wol


* Re: Implementing Global Parity Codes
  2018-01-27 15:13     ` Wols Lists
@ 2018-01-28 13:00       ` mostafa kishani
  2018-01-29 10:22       ` David Brown
  1 sibling, 0 replies; 15+ messages in thread
From: mostafa kishani @ 2018-01-28 13:00 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid

Wol, I agree that this code may well hammer the disk. But implementing
the encoding/decoding part is not a big deal. The codes used in RAID5
and RAID6 (and also in the Global Parity) are in the category of
Maximum Distance Separable (MDS) codes (the best-known example being
Reed-Solomon). If you are interested, you can take a look at this paper:
https://arxiv.org/pdf/1205.0997

But my main challenge now is which parts of mdadm should be modified,
and how I can enable debug mode in mdadm.
I would appreciate it if anyone could give me a clue about the debug stuff.

Bests,
Mostafa

On Sat, Jan 27, 2018 at 6:43 PM, Wols Lists <antlists@youngman.org.uk> wrote:
> On 27/01/18 14:29, mostafa kishani wrote:
>> Thanks for your response Wol
>> Well, maybe I failed to illustrate what I'm going to implement. I try
>> to better clarify using your terminology:
>> In the normal RAID5 and RAID6 codes we have one/two parities per
>> stripe. Now consider sharing a redundant sector between say, 4
>> stripes, and assume that the redundant sector is saved in stripe4.
>> Assume the redundant sector is the parity of all sectors in stripe1,
>> stripe2, stripe3, and stripe4. Using this redundant sector you can
>> tolerate one sector failure across stripe1 to stripe4. We already have
>> the parity sectors of RAID5 and RAID6 and this redundant sector is
>> added to tolerate an extra sector failure. I call this redundant
>> sector "Global Parity".
>> I try to demonstrate this as follows, assuming each RAID5 stripe has 3
>> data sectors and one parity sector.
>> stripe1: DATA1 | DATA2 | DATA3 | PARITY1
>> stripe2: PARITY2 | DATA4 | DATA5 | DATA6
>> stripe3: DATA7 | PARITY3 | DATA8 | DATA9
>> stripe4: DATA10 | DATA11 | PARITY4 | GLOBAL PARITY
>>
>> and the Global Parity is taken across all data and parity as follows:
>> GLOBAL PARITY = DATA1 X DATA2 X DATA3 X DATA4 X DATA5 X DATA6 X DATA7
>> X DATA8 X DATA9 X DATA10 X DATA11 X PARITY1 X PARITY2 X PARITY3
>>
>> Where "X" stands for XOR operation.
>> I hope it was clear.
>
> OWWW!!!!
>
> Have you done and understood the maths!!!???
>
> You may have noticed I said that while raid-6 was similar in principle
> to raid-5, it was very different in implementation. Because of the maths!
>
> Going back to high-school algebra, if we have E *unique* equations, and
> U unknowns, then we can only solve the equations if E > U (I think I've
> got that right, it might be >=).
>
> With raid-5, parity1 = data1 xor data2 xor data3. Now let's assume
> somebody thinks "let's add parity2" and defines parity2 = data1 xor
> data2 xor data3 xor parity1. THAT WON'T WORK. Raid-5 relies on the
> equation "if N is even, then parity2 is all ones, else parity2 is all
> zeroes", where N is the number of disks, so if we calculate parity2 we
> add absolutely nothing to our pre-existing E.
>
> If you are planning to use XOR, I think you are falling into *exactly*
> that trap! Plus, it looks to me as if calculating your global parity is
> going to be a disk-hammering nightmare ...
>
> That's why raid-6 uses a *completely* *different* algorithm to calculate
> its parity1 and parity2.
>
> I've updated a page on the wiki, because it's come up in other
> discussions as well, but it seems to me if you need extra parity, you
> really ought to be going for raid-60. Take a look ...
>
> https://raid.wiki.kernel.org/index.php/What_is_RAID_and_why_should_you_want_it%3F#Which_raid_is_for_me.3F
>
> and if anyone else wants to comment, too? ...
>
> Cheers,
> Wol


* Re: Implementing Global Parity Codes
  2018-01-27 15:13     ` Wols Lists
  2018-01-28 13:00       ` mostafa kishani
@ 2018-01-29 10:22       ` David Brown
  2018-01-29 17:44         ` Wols Lists
  2018-01-30 11:30         ` mostafa kishani
  1 sibling, 2 replies; 15+ messages in thread
From: David Brown @ 2018-01-29 10:22 UTC (permalink / raw)
  To: Wols Lists, mostafa kishani; +Cc: linux-raid



On 27/01/2018 16:13, Wols Lists wrote:
> On 27/01/18 14:29, mostafa kishani wrote:
>> Thanks for your response Wol
>> Well, maybe I failed to illustrate what I'm going to implement. I try
>> to better clarify using your terminology:
>> In the normal RAID5 and RAID6 codes we have one/two parities per
>> stripe. Now consider sharing a redundant sector between say, 4
>> stripes, and assume that the redundant sector is saved in stripe4.
>> Assume the redundant sector is the parity of all sectors in stripe1,
>> stripe2, stripe3, and stripe4. Using this redundant sector you can
>> tolerate one sector failure across stripe1 to stripe4. We already have
>> the parity sectors of RAID5 and RAID6 and this redundant sector is
>> added to tolerate an extra sector failure. I call this redundant
>> sector "Global Parity".
>> I try to demonstrate this as follows, assuming each RAID5 stripe has 3
>> data sectors and one parity sector.
>> stripe1: DATA1 | DATA2 | DATA3 | PARITY1
>> stripe2: PARITY2 | DATA4 | DATA5 | DATA6
>> stripe3: DATA7 | PARITY3 | DATA8 | DATA9
>> stripe4: DATA10 | DATA11 | PARITY4 | GLOBAL PARITY
>>
>> and the Global Parity is taken across all data and parity as follows:
>> GLOBAL PARITY = DATA1 X DATA2 X DATA3 X DATA4 X DATA5 X DATA6 X DATA7
>> X DATA8 X DATA9 X DATA10 X DATA11 X PARITY1 X PARITY2 X PARITY3
>>
>> Where "X" stands for XOR operation.
>> I hope it was clear.
> 
> OWWW!!!!
> 
> Have you done and understood the maths!!!???

I have started looking at the paper (from the link in Mostafa's next 
post).  I have only read a few pages as yet, but it looks to me to have 
some fundamental misunderstandings about SSDs, how they work, and their 
typical failures, and to be massively mixing up the low-level structures 
visible inside the SSD firmware and the high-level view available to the 
kernel and the md layer.  At best, this "PMDS" idea with blocks might be 
an alternative or addition to ECC layers within the SSD - but not at the 
md layer.  I have not read the whole paper yet, so I could be missing 
something - but I am sceptical.


> 
> You may have noticed I said that while raid-6 was similar in principle
> to raid-5, it was very different in implementation. Because of the maths!

Yes, indeed.  The maths of raid 6 is a lot of fun, and very smart.

> 
> Going back to high-school algebra, if we have E *unique* equations, and
> U unknowns, then we can only solve the equations if E > U (I think I've
> got that right, it might be >=).

E >= U.  You can solve "2 * x - 4 = 0", which is one equation in one 
unknown.  But critically, the E equations need to be linearly 
independent (that is probably what you mean by "unique").

> 
> With raid-5, parity1 = data1 xor data2 xor data3. Now let's assume
> somebody thinks "let's add parity2" and defines parity2 = data1 xor
> data2 xor data3 xor parity1. THAT WON'T WORK. 

Correct - this is because the two equations are not linearly 
independent.  The second parity would always be 0 in this case.
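
A quick Python check of that, with arbitrary byte values:

import random
d1, d2, d3 = (random.randrange(256) for _ in range(3))
parity1 = d1 ^ d2 ^ d3
parity2 = d1 ^ d2 ^ d3 ^ parity1
assert parity2 == 0   # holds for any data: the "new" equation adds no information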

> Raid-5 relies on the
> equation "if N is even, then parity2 is all ones, else parity2 is all
> zeroes", where N is the number of disks, so if we calculate parity2 we
> add absolutely nothing to our pre-existing E.
> 
> If you are planning to use XOR, I think you are falling into *exactly*
> that trap! Plus, it looks to me as if calculating your global parity is
> going to be a disk-hammering nightmare ...

Yes.

> 
> That's why raid-6 uses a *completely* *different* algorithm to calculate
> its parity1 and parity2.

It is not actually a completely different algorithm, if you view it in 
the correct way.  You can say the raid5 parity P is just the xor of the 
bits, while the raid6 parity Q is a polynomial over the GF(2^8) field - 
certainly they look completely different then.  But once you move to the 
GF(2^8) field, the equations become:

	P = d_0 + d_1 + d_2 + d_3 + d_4 + ...
	Q = d_0 + 2 . d_1 + 2^2 . d_2 + 2^3 . d_3 + 2^4 . d_4 + ...

(Note that none of this is "ordinary" maths - addition and 
multiplication is special in the GF field.)

It is even possible to extend it to a third parity in a similar way:

	R = d_0 + 4 . d_1 + 4^2 . d_2 + 4^3 . d_3 + 4^4 . d_4 + ...

There are other schemes that scale better beyond the third parity (this 
scheme can generate a fourth parity bit, but it is then only valid for 
up to 21 data disks).
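
As a sketch only (the kernel's raid6 code uses optimised table and SIMD
implementations, not anything like this), the field arithmetic and the
three syndromes above can be written in a few lines of Python, using one
standard GF(2^8) representation (reduction polynomial 0x11d):

def gf_mul(a, b, poly=0x11d):
    # multiply two bytes in GF(2^8); addition in this field is plain XOR
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def syndromes(data):
    # data: list of bytes d_0, d_1, ... (one byte per data disk, same offset)
    P = Q = R = 0
    for i, d in enumerate(data):
        P ^= d                          # P = d_0 + d_1 + d_2 + ...
        Q ^= gf_mul(gf_pow(2, i), d)    # Q = d_0 + 2.d_1 + 2^2.d_2 + ...
        R ^= gf_mul(gf_pow(4, i), d)    # R = d_0 + 4.d_1 + 4^2.d_2 + ...
    return P, Q, R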

> 
> I've updated a page on the wiki, because it's come up in other
> discussions as well, but it seems to me if you need extra parity, you
> really ought to be going for raid-60. Take a look ...
> 
> https://raid.wiki.kernel.org/index.php/What_is_RAID_and_why_should_you_want_it%3F#Which_raid_is_for_me.3F
> 
> and if anyone else wants to comment, too? ...
> 

Here are a few random comments:

Raid-10-far2 can be /faster/ than Raid0 on the same number of HDs, for 
read-only performance.  This is because the data for both stripes will 
be read from the first half of the disks - the outside half.  On many 
disks this gives higher read speeds, since the same angular rotation 
speed has higher linear velocity at the disk heads.  It also gives 
shorter seek times as the head does not have to move as far in or out to 
cover the whole range.  For SSDs, the layout for Raid-10 makes almost no 
difference (but it is still faster than plain Raid-1 for streamed reads).

For two drives, Raid-10 is a fine choice on read-heavy or streaming 
applications.

I think you could emphasise that there is little point in having Raid-5 
plus a spare - Raid-6 is better in every way.

You should make a clearer distinction that by "Raid-6+0" you mean a 
Raid-0 stripe of Raid-6 sets, rather than a Raid-6 set of Raid-0 stripes.

There are also many, many other ways to organise multi-layer raids. 
Striping at the high level (like Raid-6+0) makes sense only if you have 
massive streaming operations for single files, and massive bandwidth - 
it is poorer for operations involving a large number of parallel 
accesses.  A common arrangement for big arrays is a linear concatenation 
of Raid-1 pairs (or Raid-5 or Raid-6 sets) - combined with an 
appropriate file system (XFS comes out well here) you get massive 
scalability and very high parallel access speeds.

Other things to consider on big arrays are redundancy of controllers, or 
even servers (for SAN arrays).  Consider the pros and cons of spreading 
your redundancy across blocks.  For example, if your server has two 
controllers then you might want your low-level block to be Raid-1 pairs 
with one disk on each controller.  That could give you a better spread 
of bandwidths and give you resistance to a broken controller.

You could also talk about asymmetric raid setups, such as having a 
write-only redundant copy on a second server over a network, or as a 
cheap hard disk copy of your fast SSDs.

And you could also discuss strategies for disk replacement - after 
failures, or for growing the array.

It is also worth emphasising that RAID is /not/ a backup solution - that 
cannot be said often enough!

Discuss failure recovery - how to find and remove bad disks, how to deal 
with recovering disks from a different machine after the first one has 
died, etc.  Emphasise the importance of labelling disks in your machines 
and being sure you pull the right disk!



> Cheers,
> Wol


* Re: Implementing Global Parity Codes
  2018-01-29 10:22       ` David Brown
@ 2018-01-29 17:44         ` Wols Lists
  2018-01-30 11:47           ` David Brown
  2018-01-30 14:18           ` Brad Campbell
  2018-01-30 11:30         ` mostafa kishani
  1 sibling, 2 replies; 15+ messages in thread
From: Wols Lists @ 2018-01-29 17:44 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On 29/01/18 10:22, David Brown wrote:
>> I've updated a page on the wiki, because it's come up in other
>> discussions as well, but it seems to me if you need extra parity, you
>> really ought to be going for raid-60. Take a look ...
>>
>> https://raid.wiki.kernel.org/index.php/What_is_RAID_and_why_should_you_want_it%3F#Which_raid_is_for_me.3F
>>
>>
>> and if anyone else wants to comment, too? ...
>>
> 
> Here are a few random comments:
> 
> Raid-10-far2 can be /faster/ than Raid0 on the same number of HDs, for
> read-only performance.  This is because the data for both stripes will
> be read from the first half of the disks - the outside half.  On many
> disks this gives higher read speeds, since the same angular rotation
> speed has higher linear velocity at the disk heads.  It also gives
> shorter seek times as the head does not have to move as far in or out to
> cover the whole range.  For SSDs, the layout for Raid-10 makes almost no
> difference (but it is still faster than plain Raid-1 for streamed reads).

Except that most drives don't do that nowadays, they do "constant linear
velocity" so the drive speeds up or slows down depending on where the
heads are, I believe.
> 
> For two drives, Raid-10 is a fine choice on read-heavy or streaming
> applications.

Which is just raid-1, no?
> 
> I think you could emphasise that there is little point in having Raid-5
> plus a spare - Raid-6 is better in every way.

Agreed. I don't agree raid-6 is better in *every* way - it wastes space
- but yes once you have enough drives you should go raid-6 :-)
> 
> You should make a clearer distinction that by "Raid-6+0" you mean a
> Raid-0 stripe of Raid-6 sets, rather than a Raid-6 set of Raid-0 stripes.
> 
Done.

> There are also many, many other ways to organise multi-layer raids.
> Striping at the high level (like Raid-6+0) makes sense only if you have
> massive streaming operations for single files, and massive bandwidth -
> it is poorer for operations involving a large number of parallel
> accesses.  A common arrangement for big arrays is a linear concatenation
> of Raid-1 pairs (or Raid-5 or Raid-6 sets) - combined with an
> appropriate file system (XFS comes out well here) you get massive
> scalability and very high parallel access speeds.
> 
> Other things to consider on big arrays are redundancy of controllers, or
> even servers (for SAN arrays).  Consider the pros and cons of spreading
> your redundancy across blocks.  For example, if your server has two
> controllers then you might want your low-level block to be Raid-1 pairs
> with one disk on each controller.  That could give you a better spread
> of bandwidths and give you resistance to a broken controller.
> 
> You could also talk about asymmetric raid setups, such as having a
> write-only redundant copy on a second server over a network, or as a
> cheap hard disk copy of your fast SSDs.

Snag is, I don't manage large arrays - it's a lot to think about. I
might add that later.
> 
> And you could also discuss strategies for disk replacement - after
> failures, or for growing the array.
> 
> It is also worth emphasising that RAID is /not/ a backup solution - that
> cannot be said often enough!
> 
> Discuss failure recovery - how to find and remove bad disks, how to deal
> with recovering disks from a different machine after the first one has
> died, etc.  Emphasise the importance of labelling disks in your machines
> and being sure you pull the right disk!
> 
I think that's covered elsewhere :-)

Cheers,
Wol



* Re: Implementing Global Parity Codes
  2018-01-29 10:22       ` David Brown
  2018-01-29 17:44         ` Wols Lists
@ 2018-01-30 11:30         ` mostafa kishani
  2018-01-30 15:14           ` David Brown
  1 sibling, 1 reply; 15+ messages in thread
From: mostafa kishani @ 2018-01-30 11:30 UTC (permalink / raw)
  To: David Brown; +Cc: Wols Lists, linux-raid

David, what you pointed out about the employment of PMDS codes is
correct: we have no access to what happens in the SSD firmware (such as
the FTL). But why can't this code be implemented in the software layer
(similar to RAID5/6)? I also thank you for pointing out very interesting
subjects.

>
> Other things to consider on big arrays are redundancy of controllers, or
> even servers (for SAN arrays).  Consider the pros and cons of spreading your
> redundancy across blocks.  For example, if your server has two controllers
> then you might want your low-level block to be Raid-1 pairs with one disk on
> each controller.  That could give you a better spread of bandwidths and give
> you resistance to a broken controller.
>
> You could also talk about asymmetric raid setups, such as having a
> write-only redundant copy on a second server over a network, or as a cheap
> hard disk copy of your fast SSDs.
>
> And you could also discuss strategies for disk replacement - after failures,
> or for growing the array.

The disk replacement strategy has a significant effect on both
reliability and performance. The occurrence of human errors during disk
replacement can result in data unavailability and data loss. In the
following paper I've briefly discussed this subject and how a good
disk replacement policy can improve reliability by orders of magnitude
(a more detailed version of this paper is on the way!):
https://dl.acm.org/citation.cfm?id=3130452

You can download it using Sci-Hub if you don't have ACM access.

>
> It is also worth emphasising that RAID is /not/ a backup solution - that
> cannot be said often enough!
>
> Discuss failure recovery - how to find and remove bad disks, how to deal
> with recovering disks from a different machine after the first one has died,
> etc.  Emphasise the importance of labelling disks in your machines and being
> sure you pull the right disk!

I would really appreciate it if you could share your experience of
pulling the wrong disk, and any statistics. This is an interesting
subject to discuss.


On Mon, Jan 29, 2018 at 1:52 PM, David Brown <david.brown@hesbynett.no> wrote:
>
>
> On 27/01/2018 16:13, Wols Lists wrote:
>>
>> On 27/01/18 14:29, mostafa kishani wrote:
>>>
>>> Thanks for your response Wol
>>> Well, maybe I failed to illustrate what I'm going to implement. I try
>>> to better clarify using your terminology:
>>> In the normal RAID5 and RAID6 codes we have one/two parities per
>>> stripe. Now consider sharing a redundant sector between say, 4
>>> stripes, and assume that the redundant sector is saved in stripe4.
>>> Assume the redundant sector is the parity of all sectors in stripe1,
>>> stripe2, stripe3, and stripe4. Using this redundant sector you can
>>> tolerate one sector failure across stripe1 to stripe4. We already have
>>> the parity sectors of RAID5 and RAID6 and this redundant sector is
>>> added to tolerate an extra sector failure. I call this redundant
>>> sector "Global Parity".
>>> I try to demonstrate this as follows, assuming each RAID5 stripe has 3
>>> data sectors and one parity sector.
>>> stripe1: DATA1 | DATA2 | DATA3 | PARITY1
>>> stripe2: PARITY2 | DATA4 | DATA5 | DATA6
>>> stripe3: DATA7 | PARITY3 | DATA8 | DATA9
>>> stripe4: DATA10 | DATA11 | PARITY4 | GLOBAL PARITY
>>>
>>> and the Global Parity is taken across all data and parity as follows:
>>> GLOBAL PARITY = DATA1 X DATA2 X DATA3 X DATA4 X DATA5 X DATA6 X DATA7
>>> X DATA8 X DATA9 X DATA10 X DATA11 X PARITY1 X PARITY2 X PARITY3
>>>
>>> Where "X" stands for XOR operation.
>>> I hope it was clear.
>>
>>
>> OWWW!!!!
>>
>> Have you done and understood the maths!!!???
>
>
> I have started looking at the paper (from the link in Mostafa's next post).
> I have only read a few pages as yet, but it looks to me to have some
> fundamental misunderstandings about SSDs, how they work, and their typical
> failures, and to be massively mixing up the low-level structures visible
> inside the SSD firmware and the high-level view available to the kernel and
> the md layer.  At best, this "PMDS" idea with blocks might be an alternative
> or addition to ECC layers within the SSD - but not at the md layer.  I have
> not read the whole paper yet, so I could be missing something - but I am
> sceptical.
>
>
>>
>> You may have noticed I said that while raid-6 was similar in principle
>> to raid-5, it was very different in implementation. Because of the maths!
>
>
> Yes, indeed.  The maths of raid 6 is a lot of fun, and very smart.
>
>>
>> Going back to high-school algebra, if we have E *unique* equations, and
>> U unknowns, then we can only solve the equations if E > U (I think I've
>> got that right, it might be >=).
>
>
> E >= U.  You can solve "2 * x - 4 = 0", which is one equation in one
> unknown.  But critically, the E equations need to be linearly independent
> (that is probably what you mean by "unique").
>
>>
>> With raid-5, parity1 = data1 xor data2 xor data3. Now let's assume
>> somebody thinks "let's add parity2" and defines parity2 = data1 xor
>> data2 xor data3 xor parity1. THAT WON'T WORK.
>
>
> Correct - this is because the two equations are not linearly independent.
> The second parity would always be 0 in this case.
>
>> Raid-5 relies on the
>> equation "if N is even, then parity2 is all ones, else parity2 is all
>> zeroes", where N is the number of disks, so if we calculate parity2 we
>> add absolutely nothing to our pre-existing E.
>>
>> If you are planning to use XOR, I think you are falling into *exactly*
>> that trap! Plus, it looks to me as if calculating your global parity is
>> going to be a disk-hammering nightmare ...
>
>
> Yes.
>
>>
>> That's why raid-6 uses a *completely* *different* algorithm to calculate
>> its parity1 and parity2.
>
>
> It is not actually a completely different algorithm, if you view it in the
> correct way.  You can say the raid5 parity P is just the xor of the bits,
> while the raid6 parity Q is a polynomial over the GF(2^8) field - certainly
> they look completely different then.  But once you move to the GF(2^8)
> field, the equations become:
>
>         P = d_0 + d_1 + d_2 + d_3 + d_4 + ...
>         Q = d_0 + 2 . d_1 + 2^2 . d_2 + 2^3 . d_3 + 2^4 . d_4 + ...
>
> (Note that none of this is "ordinary" maths - addition and multiplication is
> special in the GF field.)
>
> It is even possible to extend it to a third parity in a similar way:
>
>         R = d_0 + 4 . d_1 + 4^2 . d_2 + 4^3 . d_3 + 4^4 . d_4 + ...
>
> There are other schemes that scale better beyond the third parity (this
> scheme can generate a fourth parity bit, but it is then only valid for up to
> 21 data disks).
>
>>
>> I've updated a page on the wiki, because it's come up in other
>> discussions as well, but it seems to me if you need extra parity, you
>> really ought to be going for raid-60. Take a look ...
>>
>>
>> https://raid.wiki.kernel.org/index.php/What_is_RAID_and_why_should_you_want_it%3F#Which_raid_is_for_me.3F
>>
>> and if anyone else wants to comment, too? ...
>>
>
> Here are a few random comments:
>
> Raid-10-far2 can be /faster/ than Raid0 on the same number of HDs, for
> read-only performance.  This is because the data for both stripes will be
> read from the first half of the disks - the outside half.  On many disks
> this gives higher read speeds, since the same angular rotation speed has
> higher linear velocity at the disk heads.  It also gives shorter seek times
> as the head does not have to move as far in or out to cover the whole range.
> For SSDs, the layout for Raid-10 makes almost no difference (but it is still
> faster than plain Raid-1 for streamed reads).
>
> For two drives, Raid-10 is a fine choice on read-heavy or streaming
> applications.
>
> I think you could emphasise that there is little point in having Raid-5 plus
> a spare - Raid-6 is better in every way.
>
> You should make a clearer distinction that by "Raid-6+0" you mean a Raid-0
> stripe of Raid-6 sets, rather than a Raid-6 set of Raid-0 stripes.
>
> There are also many, many other ways to organise multi-layer raids. Striping
> at the high level (like Raid-6+0) makes sense only if you have massive
> streaming operations for single files, and massive bandwidth - it is poorer
> for operations involving a large number of parallel accesses.  A common
> arrangement for big arrays is a linear concatenation of Raid-1 pairs (or
> Raid-5 or Raid-6 sets) - combined with an appropriate file system (XFS comes
> out well here) you get massive scalability and very high parallel access
> speeds.
>
> Other things to consider on big arrays are redundancy of controllers, or
> even servers (for SAN arrays).  Consider the pros and cons of spreading your
> redundancy across blocks.  For example, if your server has two controllers
> then you might want your low-level block to be Raid-1 pairs with one disk on
> each controller.  That could give you a better spread of bandwidths and give
> you resistance to a broken controller.
>
> You could also talk about asymmetric raid setups, such as having a
> write-only redundant copy on a second server over a network, or as a cheap
> hard disk copy of your fast SSDs.
>
> And you could also discuss strategies for disk replacement - after failures,
> or for growing the array.
>
> It is also worth emphasising that RAID is /not/ a backup solution - that
> cannot be said often enough!
>
> Discuss failure recovery - how to find and remove bad disks, how to deal
> with recovering disks from a different machine after the first one has died,
> etc.  Emphasise the importance of labelling disks in your machines and being
> sure you pull the right disk!
>
>
>
>> Cheers,
>> Wol

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Implementing Global Parity Codes
  2018-01-29 17:44         ` Wols Lists
@ 2018-01-30 11:47           ` David Brown
  2018-01-30 14:18           ` Brad Campbell
  1 sibling, 0 replies; 15+ messages in thread
From: David Brown @ 2018-01-30 11:47 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid



On 29/01/2018 18:44, Wols Lists wrote:
> On 29/01/18 10:22, David Brown wrote:
>>> I've updated a page on the wiki, because it's come up in other
>>> discussions as well, but it seems to me if you need extra parity, you
>>> really ought to be going for raid-60. Take a look ...
>>>
>>> https://raid.wiki.kernel.org/index.php/What_is_RAID_and_why_should_you_want_it%3F#Which_raid_is_for_me.3F
>>>
>>>
>>> and if anyone else wants to comment, too? ...
>>>
>>
>> Here are a few random comments:
>>
>> Raid-10-far2 can be /faster/ than Raid0 on the same number of HDs, for
>> read-only performance.  This is because the data for both stripes will
>> be read from the first half of the disks - the outside half.  On many
>> disks this gives higher read speeds, since the same angular rotation
>> speed has higher linear velocity at the disk heads.  It also gives
>> shorter seek times as the head does not have to move as far in or out to
>> cover the whole range.  For SSDs, the layout for Raid-10 makes almost no
>> difference (but it is still faster than plain Raid-1 for streamed reads).
> 
> Except that most drives don't do that nowadays, they do "constant linear
> velocity" so the drive speeds up or slows down depending on where the
> heads are, I believe.

Perhaps - I haven't tried to keep up with the specs of all drives, and 
it is surprisingly hard to get a good answer from a quick google. 
However, I would be surprised if CLV were the norm for hard disks. 
Certainly it was not the case before (though it was used for CD-ROMS and 
other optical media).  The inner tracks of a hard disk are perhaps half 
the circumference of the outer tracks - to keep constant linear 
velocity, you need twice the rotational speed on the inside compared to 
the outside.  That is a massive difference, taking several seconds 
(perhaps 10 seconds) to bring to a stable speed.

I suspect you are mixing up velocity and density here.  Earlier hard 
drives had constant angular density - the same number of sectors per 
track throughout the disk.  Modern drives have constant linear density - 
so you get more sectors on an outer track than an inner track.

>>
>> For two drives, Raid-10 is a fine choice on read-heavy or streaming
>> applications.
> 
> Which is just raid-1, no?

No.  The "raid-10 near" is pretty much identical to raid-1, but the 
"far" and "offset" raid-10 layouts are different:

<https://en.wikipedia.org/wiki/Non-standard_RAID_levels#LINUX-MD-RAID-10>

Near layout minimises the head movement on writes.  Far layout maximises 
streaming read performance, but has more latency during writes due to 
larger head movements.  Offset layout gives raid0 read performance for 
small and mid-size reads, with only slightly more latency in writes.

>>
>> I think you could emphasise that there is little point in having Raid-5
>> plus a spare - Raid-6 is better in every way.
> 
> Agreed. I don't agree raid-6 is better in *every* way - it wastes space
> - but yes once you have enough drives you should go raid-6 :-)

Raid-6 is better than raid-5 plus a spare - it uses exactly the same 
number of disks, and does not waste anything while providing a huge 
improvement in redundancy and therefore data safety.

Okay, it is not better in /every/ way.  It takes a bit of computing 
power, though that is rarely relevant as cpus have got more threads that 
are mostly idle.  And it gives a bit more write amplification than 
raid-5.  But if you think "my raid-5 array is so important that I want a 
hot spare to make rebuilds happen as soon as possible", then you 
/definitely/ want raid-6 instead.

>>
>> You should make a clearer distinction that by "Raid-6+0" you mean a
>> Raid-0 stripe of Raid-6 sets, rather than a Raid-6 set of Raid-0 stripes.
>>
> Done.
> 
>> There are also many, many other ways to organise multi-layer raids.
>> Striping at the high level (like Raid-6+0) makes sense only if you have
>> massive streaming operations for single files, and massive bandwidth -
>> it is poorer for operations involving a large number of parallel
>> accesses.  A common arrangement for big arrays is a linear concatenation
>> of Raid-1 pairs (or Raid-5 or Raid-6 sets) - combined with an
>> appropriate file system (XFS comes out well here) you get massive
>> scalability and very high parallel access speeds.
>>
>> Other things to consider on big arrays are redundancy of controllers, or
>> even servers (for SAN arrays).  Consider the pros and cons of spreading
>> your redundancy across blocks.  For example, if your server has two
>> controllers then you might want your low-level block to be Raid-1 pairs
>> with one disk on each controller.  That could give you a better spread
>> of bandwidths and give you resistance to a broken controller.
>>
>> You could also talk about asymmetric raid setups, such as having a
>> write-only redundant copy on a second server over a network, or as a
>> cheap hard disk copy of your fast SSDs.
> 
> Snag is, I don't manage large arrays - it's a lot to think about. I
> might add that later.

Fair enough.  You can't cover /everything/ on a wiki page - then it is a 
full time job and a book, not a wiki page!  I am just giving suggestions 
and ideas.

>>
>> And you could also discuss strategies for disk replacement - after
>> failures, or for growing the array.
>>
>> It is also worth emphasising that RAID is /not/ a backup solution - that
>> cannot be said often enough!
>>
>> Discuss failure recovery - how to find and remove bad disks, how to deal
>> with recovering disks from a different machine after the first one has
>> died, etc.  Emphasise the importance of labelling disks in your machines
>> and being sure you pull the right disk!
>>
> I think that's covered elsewhere :-)

Maybe you could add a few links?  There is no need to repeat information.

mvh.,

David


> 
> Cheers,
> Wol
> 


* Re: Implementing Global Parity Codes
  2018-01-29 17:44         ` Wols Lists
  2018-01-30 11:47           ` David Brown
@ 2018-01-30 14:18           ` Brad Campbell
  1 sibling, 0 replies; 15+ messages in thread
From: Brad Campbell @ 2018-01-30 14:18 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid

On 30/01/18 01:44, Wols Lists wrote:

> Except that most drives don't do that nowadays, they do "constant linear
> velocity" so the drive speeds up or slows down depending on where the
> heads are, I believe.

No. Hard disks have one speed. That can be easily proven by popping the 
top off and putting a tacho on the spindle, or just looking at any 
linear read benchmark as they all demonstrate that data transfer slows 
down as the head works its way towards the spindle.

Same reason the first couple of tracks on each LP sounded better. More 
vinyl to get a better response.

Optical disks used CLV where they needed to get bits out at a specific 
clock rate (ie Audio).

Brad.


* Re: Implementing Global Parity Codes
  2018-01-30 11:30         ` mostafa kishani
@ 2018-01-30 15:14           ` David Brown
  2018-01-31 16:03             ` mostafa kishani
  0 siblings, 1 reply; 15+ messages in thread
From: David Brown @ 2018-01-30 15:14 UTC (permalink / raw)
  To: mostafa kishani; +Cc: Wols Lists, linux-raid

On 30/01/18 12:30, mostafa kishani wrote:
> David what you pointed about employment of PMDS codes is correct. We
> have no access to what happens in the SSD firmware (such as FTL). But
> why this code cannot be implemented in the software layer (similar to
> RAID5/6...) ? I also thank you for pointing out very interesting
> subjects.
> 

I must admit that I haven't dug through the mathematical details of the
paper.  It looks to be at a level that I /could/ understand, but would
need to put in quite a bit of time and effort.  And the paper does not
strike me as being particularly outstanding or special - there are many,
many such papers published about new ideas in error detection and
correction.

While it is not clear to me exactly how these additional "global" parity
blocks are intended to help correct errors in the paper, I can see a way
to handle it.

d d d d d P
d d d d d P
d d d d d P
d d d S S P

Where the "d" blocks are normal data blocks, "P" are raid-5 parity
blocks (another column for raid-6 Q blocks could be added), and "S" are
these "global" parity blocks.

If a row has more errors than the normal parity block(s) can correct,
then it is possible to use wider parity blocks to help.  If you have one
S that is defined in the same way as raid-6 Q parity, then it can be
used to correct an extra error in a stripe.  That relies on all the
other stripes having at most P-correctable errors.

The maths gets quite hairy.  Two parity blocks are well-defined at the
moment - raid-5 (xor) and raid-6 (using powers of 2 weights on the data
blocks, over GF(2^8)).  To provide recovery here, the S parities would
have to fit within the same scheme.  A third parity block is relatively
easy to calculate using powers of 4 weights - but that is not scalable
(a fourth parity using powers of 8 does not work beyond 21 data blocks).
 An alternative multi-parity scheme is possible using significantly more
complex maths.

However it is done, it would be hard.  I am also not convinced that it
would work for extra errors distributed throughout the block, rather
than just in one row.
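
To make the single-row case concrete, here is a toy Python decode of the
scheme just described (byte-sized blocks, 4 rows of 5 data blocks, one
row parity P per row plus one Q-style global parity S over all blocks,
reduction polynomial 0x11d). It only shows that the equations are
solvable - it is nothing like what the md code would actually have to do:

import random

def gf_mul(a, b, poly=0x11d):            # GF(2^8) multiply; addition is XOR
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return gf_pow(a, 254)                # a^254 = 1/a in GF(2^8)

rows = [[random.randrange(256) for _ in range(5)] for _ in range(4)]
P = [rows[r][0] ^ rows[r][1] ^ rows[r][2] ^ rows[r][3] ^ rows[r][4]
     for r in range(4)]                  # per-row raid-5 parity
S = 0                                    # Q-style global parity over all 20 blocks
for idx in range(20):
    S ^= gf_mul(gf_pow(2, idx), rows[idx // 5][idx % 5])

# Row 1 loses blocks 1 and 3 - one more than its own P can handle,
# but every other row is intact.
i, j = 1 * 5 + 1, 1 * 5 + 3
A = P[1] ^ rows[1][0] ^ rows[1][2] ^ rows[1][4]        # = d_i + d_j
B = S
for idx in range(20):
    if idx not in (i, j):
        B ^= gf_mul(gf_pow(2, idx), rows[idx // 5][idx % 5])   # = 2^i.d_i + 2^j.d_j
gi, gj = gf_pow(2, i), gf_pow(2, j)
d_j = gf_mul(B ^ gf_mul(gi, A), gf_inv(gi ^ gj))       # solve the 2x2 system
d_i = A ^ d_j
assert (d_i, d_j) == (rows[1][1], rows[1][3])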

A much simpler system could be done using vertical parities:

d d d d d P
d d d d d P
d d d d d P
V V V V V P

Here, the V is just a raid-5 parity of the column of blocks.  You now
effectively have a raid-5-5 layered setup, but distributed within the
one set of disks.  Recovery would be straight-forward - if a block could
not be re-created from a horizontal parity, then the vertical parity
would be used.  You would have some write amplification, but it would
perhaps not be too bad (you could have many rows per vertical parity
block), and it would be fine for read-mostly applications.  It bears a
certain resemblance to raid-10 layouts.  Of course, raid-5-6, raid-6-5
and raid-6-6 would also be possible.
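
A toy version of that in Python (pure XOR, one byte per block, fixed
parity positions - real md would rotate parity and work on whole chunks):

rows = [[0x11, 0x22, 0x33, 0x44, 0x55],
        [0x01, 0x02, 0x03, 0x04, 0x05],
        [0xaa, 0xbb, 0xcc, 0xdd, 0xee]]          # 3 rows x 5 data blocks

def xor_all(xs):
    r = 0
    for x in xs:
        r ^= x
    return r

P = [xor_all(row) for row in rows]                # horizontal (raid-5) parities
V = [xor_all(col) for col in zip(*rows)]          # vertical parities, one per column

# Suppose row 1 loses blocks 2 and 3 - too many for its own P. Each lost
# block can still be rebuilt from its column, via V and the surviving rows:
for c in (2, 3):
    rebuilt = V[c] ^ xor_all(rows[r][c] for r in (0, 2))
    assert rebuilt == rows[1][c]

The write amplification is visible here too: updating one data block
means updating both its P and its V.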


>>
>> Other things to consider on big arrays are redundancy of controllers, or
>> even servers (for SAN arrays).  Consider the pros and cons of spreading your
>> redundancy across blocks.  For example, if your server has two controllers
>> then you might want your low-level block to be Raid-1 pairs with one disk on
>> each controller.  That could give you a better spread of bandwidths and give
>> you resistance to a broken controller.
>>
>> You could also talk about asymmetric raid setups, such as having a
>> write-only redundant copy on a second server over a network, or as a cheap
>> hard disk copy of your fast SSDs.
>>
>> And you could also discuss strategies for disk replacement - after failures,
>> or for growing the array.
> 
> The disk replacement strategy has a significant effect on both
> reliability and performance. The occurrence of human errors in desk
> replacement can result in data unavailability and data loss. In the
> following paper I've briefly discussed this subject and how a good
> disk replacement policy can improve reliability by orders of magnitude
> (a more detailed version of this paper is on the way!):
> https://dl.acm.org/citation.cfm?id=3130452

In my experience, human error leads to more data loss than mechanical
errors - and you really need to take it into account.

> 
> you can download it using sci-hub if you don't have ACM access.
> 
>>
>> It is also worth emphasising that RAID is /not/ a backup solution - that
>> cannot be said often enough!
>>
>> Discuss failure recovery - how to find and remove bad disks, how to deal
>> with recovering disks from a different machine after the first one has died,
>> etc.  Emphasise the importance of labelling disks in your machines and being
>> sure you pull the right disk!
> 
> I really appreciate if you can share your experience about pulling
> wrong disk and any statistics. This is an interesting subject to
> discuss.
> 

My server systems are too small in size, and too few in numbers, for
statistics.  I haven't actually pulled the wrong disk, but I did come
/very/ close before deciding to have one last double-check.

I have also tripped over the USB wire to an external disk and thrown it
across the room - I am now a lot more careful about draping wires around!


mvh.,

David



* Re: Implementing Global Parity Codes
  2018-01-30 15:14           ` David Brown
@ 2018-01-31 16:03             ` mostafa kishani
  2018-01-31 17:53               ` Piergiorgio Sartor
  0 siblings, 1 reply; 15+ messages in thread
From: mostafa kishani @ 2018-01-31 16:03 UTC (permalink / raw)
  To: David Brown; +Cc: Wols Lists, linux-raid

Yes, that's exactly what the code does. Here the math of
encoding/decoding is not as important as the I/O overhead. Upon a stripe
update, it needs to update the Global Parity as well (which is probably
in another stripe). This should result in terrible performance on
random-write workloads. But on sequential-write workloads this code
may have performance close to RAID5 and slightly better than RAID6.
The 2D codes (as you suggested) also suffer a huge I/O penalty, and this
is why they're barely employed even in fast memory structures such as
SRAM/DRAM.
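
As a back-of-envelope illustration (assuming the global parity sits in
another stripe and is itself covered by that stripe's row parity - my
assumption here, not necessarily what the paper requires):

# device I/Os for ONE small random write, read-modify-write style
raid5_ios  = 2 + 2              # read old data + old row parity, write both back
global_ios = raid5_ios + 2 + 2  # plus RMW of the global parity sector and of
                                # the row parity that protects it
print(raid5_ios, global_ios)    # 4 vs 8 per random write; a full-stripe
                                # sequential write avoids the reads entirely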

Bests,
Mostafa

On Tue, Jan 30, 2018 at 6:44 PM, David Brown <david.brown@hesbynett.no> wrote:
> On 30/01/18 12:30, mostafa kishani wrote:
>> David what you pointed about employment of PMDS codes is correct. We
>> have no access to what happens in the SSD firmware (such as FTL). But
>> why this code cannot be implemented in the software layer (similar to
>> RAID5/6...) ? I also thank you for pointing out very interesting
>> subjects.
>>
>
> I must admit that I haven't dug through the mathematical details of the
> paper.  It looks to be at a level that I /could/ understand, but would
> need to put in quite a bit of time and effort.  And the paper does not
> strike me as being particularly outstanding or special - there are many,
> many such papers published about new ideas in error detection and
> correction.
>
> While it is not clear to me exactly how these additional "global" parity
> blocks are intended to help correct errors in the paper, I can see a way
> to handle it.
>
> d d d d d P
> d d d d d P
> d d d d d P
> d d d S S P
>
> Where the "d" blocks are normal data blocks, "P" are raid-5 parity
> blocks (another column for raid-6 Q blocks could be added), and "S" are
> these "global" parity blocks.
>
> If a row has more errors than the normal parity block(s) can correct,
> then it is possible to use wider parity blocks to help.  If you have one
> S that is defined in the same way as raid-6 Q parity, then it can be
> used to correct an extra error in a stripe.  That relies on all the
> other stripes having at most P-correctable errors.
>
> The maths gets quite hairy.  Two parity blocks are well-defined at the
> moment - raid-5 (xor) and raid-6 (using powers of 2 weights on the data
> blocks, over GF(8)).  To provide recovery here, the S parities would
> have to fit within the same scheme.  A third parity block is relatively
> easy to calculate using powers of 4 weights - but that is not scalable
> (a fourth parity using powers of 8 does not work beyond 21 data blocks).
>  An alternative multi-parity scheme is possible using significantly more
> complex maths.
>
> However it is done, it would be hard.  I am also not convinced that it
> would work for extra errors distributed throughout the block, rather
> than just in one row.
>
> A much simpler system could be done using vertical parities:
>
> d d d d d P
> d d d d d P
> d d d d d P
> V V V V V P
>
> Here, the V is just a raid-5 parity of the column of blocks.  You now
> effectively have a raid-5-5 layered setup, but distributed within the
> one set of disks.  Recovery would be straight-forward - if a block could
> not be re-created from a horizontal parity, then the vertical parity
> would be used.  You would have some write amplification, but it would
> perhaps not be too bad (you could have many rows per vertical parity
> block), and it would be fine for read-mostly applications.  It bears a
> certain resemblance to raid-10 layouts.  Of course, raid-5-6, raid-6-5
> and raid-6-6 would also be possible.
>
>
>>>
>>> Other things to consider on big arrays are redundancy of controllers, or
>>> even servers (for SAN arrays).  Consider the pros and cons of spreading your
>>> redundancy across blocks.  For example, if your server has two controllers
>>> then you might want your low-level block to be Raid-1 pairs with one disk on
>>> each controller.  That could give you a better spread of bandwidths and give
>>> you resistance to a broken controller.
>>>
>>> You could also talk about asymmetric raid setups, such as having a
>>> write-only redundant copy on a second server over a network, or as a cheap
>>> hard disk copy of your fast SSDs.
>>>
>>> And you could also discuss strategies for disk replacement - after failures,
>>> or for growing the array.
>>
>> The disk replacement strategy has a significant effect on both
>> reliability and performance. The occurrence of human errors in desk
>> replacement can result in data unavailability and data loss. In the
>> following paper I've briefly discussed this subject and how a good
>> disk replacement policy can improve reliability by orders of magnitude
>> (a more detailed version of this paper is on the way!):
>> https://dl.acm.org/citation.cfm?id=3130452
>
> In my experience, human error leads to more data loss than mechanical
> errors - and you really need to take it into account.
>
>>
>> you can download it using sci-hub if you don't have ACM access.
>>
>>>
>>> It is also worth emphasising that RAID is /not/ a backup solution - that
>>> cannot be said often enough!
>>>
>>> Discuss failure recovery - how to find and remove bad disks, how to deal
>>> with recovering disks from a different machine after the first one has died,
>>> etc.  Emphasise the importance of labelling disks in your machines and being
>>> sure you pull the right disk!
>>
>> I really appreciate if you can share your experience about pulling
>> wrong disk and any statistics. This is an interesting subject to
>> discuss.
>>
>
> My server systems are too small in size, and too few in numbers, for
> statistics.  I haven't actually pulled the wrong disk, but I did come
> /very/ close before deciding to have one last double-check.
>
> I have also tripped over the USB wire to an external disk and thrown it
> across the room - I am now a lot more careful about draping wires around!
>
>
> mvh.,
>
> David
>


* Re: Implementing Global Parity Codes
  2018-01-31 16:03             ` mostafa kishani
@ 2018-01-31 17:53               ` Piergiorgio Sartor
  0 siblings, 0 replies; 15+ messages in thread
From: Piergiorgio Sartor @ 2018-01-31 17:53 UTC (permalink / raw)
  To: mostafa kishani; +Cc: David Brown, Wols Lists, linux-raid

Hi all,

sorry for the top posting.

In a previous message, you explained that the "Global Parity" would be
the xor of all the data across the stripes, including the stripe
parities.

Is this still the case?
Did I miss something?

Because, by definition, the xor of the data and the parity in a stripe
is always 0. Hence, the xor of all the stripes' data and parities is
always 0 too, and so it is *not* necessary to store it. It is only
necessary to check it, if wanted.
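
For example, in Python, with arbitrary bytes:

import random
stripes = [[random.randrange(256) for _ in range(3)] for _ in range(4)]
parities = [s[0] ^ s[1] ^ s[2] for s in stripes]    # per-stripe RAID-5 parity
total = 0
for s, p in zip(stripes, parities):
    for x in s + [p]:
        total ^= x
assert total == 0   # data xor parity is 0 per stripe, so the grand xor is 0 too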

Now, again, maybe I skipped some parts, so I apologize
in advance if this is the case and what is written
above is just rubbish; otherwise, something is not really
correct in the interpretation of the cited paper.

bye,

pg

On Wed, Jan 31, 2018 at 07:33:54PM +0330, mostafa kishani wrote:
> Yes that's exactly what the code does. Here the math of
> encoding/decoding is not as important of IO overhead. Upon a stripe
> update, it needs to update the Global parity as well (that is probably
> in another stripe). This should result in a terrible performance in
> random-write workloads. But in sequential-write workloads this code
> may have a performance near to RAID5 and slightly better than RAID6.
> The 2D codes (as you suggested) also suffer a huge IO penalty and this
> is why the're barely employed even is fast memory structure such as
> SRAM/DRAM.
> 
> Bests,
> Mostafa
> 
> On Tue, Jan 30, 2018 at 6:44 PM, David Brown <david.brown@hesbynett.no> wrote:
> > On 30/01/18 12:30, mostafa kishani wrote:
> >> David what you pointed about employment of PMDS codes is correct. We
> >> have no access to what happens in the SSD firmware (such as FTL). But
> >> why this code cannot be implemented in the software layer (similar to
> >> RAID5/6...) ? I also thank you for pointing out very interesting
> >> subjects.
> >>
> >
> > I must admit that I haven't dug through the mathematical details of the
> > paper.  It looks to be at a level that I /could/ understand, but would
> > need to put in quite a bit of time and effort.  And the paper does not
> > strike me as being particularly outstanding or special - there are many,
> > many such papers published about new ideas in error detection and
> > correction.
> >
> > While it is not clear to me exactly how these additional "global" parity
> > blocks are intended to help correct errors in the paper, I can see a way
> > to handle it.
> >
> > d d d d d P
> > d d d d d P
> > d d d d d P
> > d d d S S P
> >
> > Where the "d" blocks are normal data blocks, "P" are raid-5 parity
> > blocks (another column for raid-6 Q blocks could be added), and "S" are
> > these "global" parity blocks.
> >
> > If a row has more errors than the normal parity block(s) can correct,
> > then it is possible to use wider parity blocks to help.  If you have one
> > S that is defined in the same way as raid-6 Q parity, then it can be
> > used to correct an extra error in a stripe.  That relies on all the
> > other stripes having at most P-correctable errors.
> >
> > The maths gets quite hairy.  Two parity blocks are well-defined at the
> > moment - raid-5 (xor) and raid-6 (using powers of 2 weights on the data
> > blocks, over GF(2^8)).  To provide recovery here, the S parities would
> > have to fit within the same scheme.  A third parity block is relatively
> > easy to calculate using powers of 4 weights - but that is not scalable
> > (a fourth parity using powers of 8 does not work beyond 21 data blocks).
> >  An alternative multi-parity scheme is possible using significantly more
> > complex maths.
> >
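As an aside, a minimal standalone illustration of those "powers of 2"
weights over GF(2^8) (polynomial 0x11d) - only a sketch, the in-kernel
code (lib/raid6/) is of course much more elaborate:

   #include <stdio.h>
   #include <stdint.h>

   /* multiply by x (i.e. by 2) in GF(2^8) with the raid-6 polynomial */
   static uint8_t gf_mul2(uint8_t v)
   {
       return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0x00));
   }

   int main(void)
   {
       uint8_t d[5] = { 0x01, 0x23, 0x45, 0x67, 0x89 };  /* one byte per data disk */
       uint8_t p = 0, q = 0;

       /* P = xor of D_i, Q = xor of 2^i * D_i (via Horner's rule) */
       for (int i = 4; i >= 0; i--) {
           p ^= d[i];
           q = gf_mul2(q);
           q ^= d[i];
       }
       printf("P=%02x Q=%02x\n", p, q);
       return 0;
   }
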
> > However it is done, it would be hard.  I am also not convinced that it
> > would work for extra errors distributed throughout the block, rather
> > than just in one row.
> >
> > A much simpler system could be done using vertical parities:
> >
> > d d d d d P
> > d d d d d P
> > d d d d d P
> > V V V V V P
> >
> > Here, the V is just a raid-5 parity of the column of blocks.  You now
> > effectively have a raid-5-5 layered setup, but distributed within the
> > one set of disks.  Recovery would be straight-forward - if a block could
> > not be re-created from a horizontal parity, then the vertical parity
> > would be used.  You would have some write amplification, but it would
> > perhaps not be too bad (you could have many rows per vertical parity
> > block), and it would be fine for read-mostly applications.  It bears a
> > certain resemblance to raid-10 layouts.  Of course, raid-5-6, raid-6-5
> > and raid-6-6 would also be possible.
> >
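A minimal sketch of that vertical layout (one byte per block, example
values, not md code), just to show the two independent recovery paths:

   #include <stdio.h>
   #include <stdint.h>

   int main(void)
   {
       /* 3 data rows of 5 blocks; P[r] is the raid-5 parity of row r,
        * V[c] is the vertical parity of column c */
       uint8_t d[3][5] = {
           { 0x10, 0x11, 0x12, 0x13, 0x14 },
           { 0x20, 0x21, 0x22, 0x23, 0x24 },
           { 0x30, 0x31, 0x32, 0x33, 0x34 },
       };
       uint8_t P[3] = { 0 }, V[5] = { 0 };

       for (int r = 0; r < 3; r++)
           for (int c = 0; c < 5; c++) {
               P[r] ^= d[r][c];
               V[c] ^= d[r][c];
           }

       /* a lost block can be rebuilt from its row via P[r], or, if the
        * row has too many failures, from its column via V[c] */
       uint8_t rebuilt = V[2] ^ d[0][2] ^ d[2][2];
       printf("d[1][2]=%02x rebuilt=%02x\n", d[1][2], rebuilt);
       return 0;
   }
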
> >
> >>>
> >>> Other things to consider on big arrays are redundancy of controllers, or
> >>> even servers (for SAN arrays).  Consider the pros and cons of spreading your
> >>> redundancy across blocks.  For example, if your server has two controllers
> >>> then you might want your low-level block to be Raid-1 pairs with one disk on
> >>> each controller.  That could give you a better spread of bandwidths and give
> >>> you resistance to a broken controller.
> >>>
> >>> You could also talk about asymmetric raid setups, such as having a
> >>> write-only redundant copy on a second server over a network, or as a cheap
> >>> hard disk copy of your fast SSDs.
> >>>
> >>> And you could also discuss strategies for disk replacement - after failures,
> >>> or for growing the array.
> >>
> >> The disk replacement strategy has a significant effect on both
> >> reliability and performance. The occurrence of human errors in disk
> >> replacement can result in data unavailability and data loss. In the
> >> following paper I've briefly discussed this subject and how a good
> >> disk replacement policy can improve reliability by orders of magnitude
> >> (a more detailed version of this paper is on the way!):
> >> https://dl.acm.org/citation.cfm?id=3130452
> >
> > In my experience, human error leads to more data loss than mechanical
> > errors - and you really need to take it into account.
> >
> >>
> >> you can download it using sci-hub if you don't have ACM access.
> >>
> >>>
> >>> It is also worth emphasising that RAID is /not/ a backup solution - that
> >>> cannot be said often enough!
> >>>
> >>> Discuss failure recovery - how to find and remove bad disks, how to deal
> >>> with recovering disks from a different machine after the first one has died,
> >>> etc.  Emphasise the importance of labelling disks in your machines and being
> >>> sure you pull the right disk!
> >>
> >> I would really appreciate it if you could share your experience
> >> about pulling the wrong disk, and any statistics. This is an
> >> interesting subject to discuss.
> >>
> >
> > My server systems are too small in size, and too few in numbers, for
> > statistics.  I haven't actually pulled the wrong disk, but I did come
> > /very/ close before deciding to have one last double-check.
> >
> > I have also tripped over the USB wire to an external disk and thrown it
> > across the room - I am now a lot more careful about draping wires around!
> >
> >
> > mvh.,
> >
> > David
> >

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Implementing Global Parity Codes
  2018-01-27  5:47 Implementing Global Parity Codes mostafa kishani
  2018-01-27  8:37 ` Wols Lists
@ 2018-02-02  5:24 ` NeilBrown
  2018-02-03  6:01   ` mostafa kishani
  1 sibling, 1 reply; 15+ messages in thread
From: NeilBrown @ 2018-02-02  5:24 UTC (permalink / raw)
  To: mostafa kishani, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1797 bytes --]

On Sat, Jan 27 2018, mostafa kishani wrote:

> Dear All,
>
> I am going to make some modifications to RAID protocol to make it more
> reliable for my case (for a scientific, and maybe later, industrial
> purpose). For example, I'm going to hold a Global Parity (a parity
> taken across the whole data stripe rather than a row) alongside normal
> row-wise parities, to cope with an extra sector/page failure per
> stripe. Do you have any suggestion how can I implement this with a
> moderate effort (I mean what functions should be modified)? have any
> of you had any similar effort?

In raid5.c there is a "struct stripe_head" which represents a stripe that
is one page (normally 4K) wide across all devices.  All the data for any
parity calculation can be found in a 'stripe_head'.
You would probably need to modify the stripe_head to represent several
more blocks so that all the Data and Parity for any computation are
always attached to the one stripe_head.

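Very roughly, the extra bookkeeping looks something like this
user-space model (hypothetical names and grouping factor, not actual
raid5.c code):

   #include <string.h>
   #include <stdint.h>

   #define PAGE_SIZE 4096
   #define STRIPES_PER_GROUP 4      /* hypothetical grouping factor */

   /* a widened "stripe" that also owns the shared global parity page */
   struct group_head {
       uint8_t data[STRIPES_PER_GROUP][PAGE_SIZE];  /* per-stripe pages  */
       uint8_t global_parity[PAGE_SIZE];            /* shared extra page */
   };

   /* the global parity can only be (re)computed once every page in the
    * group is resident - which is exactly the complication for the
    * stripe cache and for small random writes */
   static void compute_global_parity(struct group_head *gh)
   {
       memset(gh->global_parity, 0, PAGE_SIZE);
       for (int s = 0; s < STRIPES_PER_GROUP; s++)
           for (int i = 0; i < PAGE_SIZE; i++)
               gh->global_parity[i] ^= gh->data[s][i];
   }

   int main(void)
   {
       static struct group_head gh;  /* zero-filled demo group */
       compute_global_parity(&gh);
       return 0;
   }
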
> I also appreciate if you guide me how can I enable DEBUG mode in mdadm.

I assume you mean debug mode on "md".
mdadm is the management tool.
md is the kernel driver.

mdadm doesn't have a debug mode.

md has a number of pr_debug() calls which can each be turned on or off
independently using dynamic debugging

https://www.kernel.org/doc/html/v4.15/admin-guide/dynamic-debug-howto.html

To turn on all pr_debug commands in raid5.c use

   echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control

to turn them off again:

   echo file raid5.c -p > /sys/kernel/debug/dynamic_debug/control

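To see which call sites are currently enabled (the control file lists
each pr_debug site and its flags), something like

   grep raid5 /sys/kernel/debug/dynamic_debug/control

works, assuming debugfs is mounted at /sys/kernel/debug.
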
NeilBrown

>
> Bests,
> Mostafa

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Implementing Global Parity Codes
  2018-02-02  5:24 ` NeilBrown
@ 2018-02-03  6:01   ` mostafa kishani
  0 siblings, 0 replies; 15+ messages in thread
From: mostafa kishani @ 2018-02-03  6:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Dear Neil,

Sincere thanks for the useful information. I really appreciate your help.

Bests,
Mostafa

On Fri, Feb 2, 2018 at 8:54 AM, NeilBrown <neilb@suse.com> wrote:
> On Sat, Jan 27 2018, mostafa kishani wrote:
>
>> Dear All,
>>
>> I am going to make some modifications to RAID protocol to make it more
>> reliable for my case (for a scientific, and maybe later, industrial
>> purpose). For example, I'm going to hold a Global Parity (a parity
>> taken across the whole data stripe rather than a row) alongside normal
>> row-wise parities, to cope with an extra sector/page failure per
>> stripe. Do you have any suggestion how can I implement this with a
>> moderate effort (I mean what functions should be modified)? have any
>> of you had any similar effort?
>
> In raid5.c there is a "struct stripe_head" which represents a stripe that
> is one page (normally 4K) wide across all devices.  All the data for any
> parity calculation can be found in a 'stripe_head'.
> You would probably need to modify the stripe_head to represent several
> more blocks so that all the Data and Parity for any computation are
> always attached to the one stripe_head.
>
>> I also appreciate if you guide me how can I enable DEBUG mode in mdadm.
>
> I assume you mean debug mode on "md".
> mdadm is the management tool.
> md is the kernel driver.
>
> mdadm doesn't have a debug mode.
>
> md has a number of pr_debug() calls which can each be turned on or off
> independently using dynamic debugging
>
> https://www.kernel.org/doc/html/v4.15/admin-guide/dynamic-debug-howto.html
>
> To turn on all pr_debug commands in raid5.c use
>
>    echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control
>
> to turn them off again:
>
>    echo file raid5.c -p > /sys/kernel/debug/dynamic_debug/control
>
> NeilBrown
>
>>
>> Bests,
>> Mostafa

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-02-03  6:01 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-27  5:47 Implementing Global Parity Codes mostafa kishani
2018-01-27  8:37 ` Wols Lists
2018-01-27 14:29   ` mostafa kishani
2018-01-27 15:13     ` Wols Lists
2018-01-28 13:00       ` mostafa kishani
2018-01-29 10:22       ` David Brown
2018-01-29 17:44         ` Wols Lists
2018-01-30 11:47           ` David Brown
2018-01-30 14:18           ` Brad Campbell
2018-01-30 11:30         ` mostafa kishani
2018-01-30 15:14           ` David Brown
2018-01-31 16:03             ` mostafa kishani
2018-01-31 17:53               ` Piergiorgio Sartor
2018-02-02  5:24 ` NeilBrown
2018-02-03  6:01   ` mostafa kishani
