* Erasure code library summary
@ 2013-06-18 12:22 Loic Dachary
  2013-06-19  1:10 ` Alex Elsayed
  2013-06-19 11:33 ` Mark Nelson
  0 siblings, 2 replies; 13+ messages in thread
From: Loic Dachary @ 2013-06-18 12:22 UTC (permalink / raw)
  To: Ceph Development

Hi Ceph,

TL;DR: use jerasure 1.2 with Reed-Solomon to code/decode/repair an object, and upgrade to 2.0 when available.

Disclaimer: I'm no expert ;-) The terms are explained on Wikipedia[1].

Using Reed-Solomon, an object O is encoded by dividing it into consecutive chunks O1, O2, ... ON and computing parity blocks P1, P2, ... PK. Reading the original content of object O is a simple concatenation of O1, O2, ... ON. If O2 or P2 is lost, it can be repaired/reconstructed from the remaining chunks of O1 ... ON and P1 ... PK. If the use case is mostly reading objects and repairs are at least 1000 times less likely than normal operations, being able to read the object from the unencoded chunks is attractive.
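
To make the terminology concrete, here is a minimal sketch in plain Python of the encode/read/repair cycle, using a single XOR parity chunk (i.e. the K+1 special case; real Reed-Solomon generalizes this to K+M with Galois-field arithmetic, which is what jerasure implements — the names below are illustrative, not jerasure's API):

def encode(obj: bytes, k: int):
    """Split obj into k equal data chunks and compute one XOR parity chunk."""
    assert len(obj) % k == 0                  # a real system would pad
    size = len(obj) // k
    data = [obj[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(size)
    for chunk in data:
        parity = bytes(a ^ b for a, b in zip(parity, chunk))
    return data, parity

def read(data):
    """Reading needs no decoding: concatenate the data chunks in order."""
    return b"".join(data)

def repair(surviving):
    """Rebuild any single lost chunk (data or parity) by XOR-ing the others."""
    rebuilt = bytes(len(surviving[0]))
    for chunk in surviving:
        rebuilt = bytes(a ^ b for a, b in zip(rebuilt, chunk))
    return rebuilt

data, parity = encode(b"abcdefgh", k=4)
assert read(data) == b"abcdefgh"
assert repair(data[:1] + data[2:] + [parity]) == data[1]   # O2 lost, rebuilt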

Reed-Solomon is significantly more expensive to encode (on the order of 100MB/s on a single 2.5GHz core) than fountain codes with the current jerasure implementation[2]. However, gf-complete[3], which will be used in the upcoming version of jerasure, improves performance significantly (2 to 10 times faster) and the difference becomes negligible.

The Reed-Solomon coding family is the only one that can keep the chunks unencoded and therefore concatenable.

The jerasure library is packaged and is currently being worked on by its author. All other Free Software implementations are either not packaged or not maintained.

The license[4] of jerasure is compatible with the license of Ceph.

Performance depends on the parameters given to the Reed-Solomon functions, but it is also influenced by the buffer sizes used when calling the encoding functions: smaller buffers mean more calls and more overhead.
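
A toy harness (illustrative only, not jerasure; the encode call is a stand-in with a fixed per-call cost) shows how the call count grows as buffers shrink:

import time

def encode_call(buf: bytes) -> int:    # stand-in for one encode invocation
    return len(buf)                    # the fixed per-call cost is what we probe

payload = bytes(64 * 1024 * 1024)      # one 64MB object

for bufsize in (4 * 1024, 64 * 1024, 1024 * 1024):
    calls = 0
    t0 = time.perf_counter()
    for off in range(0, len(payload), bufsize):
        encode_call(payload[off:off + bufsize])   # small buffers => more calls
        calls += 1
    print(f"{bufsize:>8} byte buffers: {calls:>6} calls, "
          f"{time.perf_counter() - t0:.4f}s of pure call/copy overhead")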

Open questions:

* Does the Mojette Transform [5] have compelling qualities compared to other code families?
* Do hierarchical codes [6] have compelling qualities? Implementing them would require a different API. To be effective they need to take into account the context in which an object is stored, whereas the other codes only require the object itself.
* I have not experimented with the jerasure API yet.

Feedback and criticisms are welcome :-)

[1] http://en.wikipedia.org/wiki/Erasure_code
[2] jerasure 1.2 http://web.eecs.utk.edu/~plank/plank/papers/CS-08-627.html
[3] gf-complete http://web.eecs.utk.edu/~plank/plank/papers/CS-13-703.html
[4] jerasure license https://github.com/tsuraan/Jerasure/blob/master/License.txt
[5] Mojette Transform http://en.wikipedia.org/wiki/Mojette_Transform
[6] hierarchical codes http://www.e-biersack.eu/BPublished/nc_springer.pdf


-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



* Re: Erasure code library summary
  2013-06-18 12:22 Erasure code library summary Loic Dachary
@ 2013-06-19  1:10 ` Alex Elsayed
  2013-06-19  1:14   ` Alex Elsayed
  2013-06-19  6:56   ` Loic Dachary
  2013-06-19 11:33 ` Mark Nelson
  1 sibling, 2 replies; 13+ messages in thread
From: Alex Elsayed @ 2013-06-19  1:10 UTC (permalink / raw)
  To: ceph-devel

Loic Dachary wrote:

> Hi Ceph,
> 
<snip>
> The Reed-Solomon coding family is the only one that can keep the chunks
> unencoded and therefore concatenable.
<snip>

In my understanding, this is not strictly true - any 'systematic' code will 
have the unencoded chunks remain available in this manner, and any non-
systematic linear code can be transformed into a systematic code with the 
same minimum distance. Fountain codes are often explicitly constructed to 
maintain this property, as in the case of RaptorQ [RFC 6330].

https://en.wikipedia.org/wiki/Systematic_code
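
A small sketch of that transformation over GF(2) (illustrative only: row operations preserve the code and hence the minimum distance, and this version assumes the leading k x k block of the generator matrix is invertible — in general a column permutation may also be needed):

def systematize(G):
    """Row-reduce a k x n binary generator matrix to systematic form [I | P]."""
    G = [row[:] for row in G]
    k = len(G)
    for col in range(k):
        pivot = next(r for r in range(col, k) if G[r][col])  # assumes invertible
        G[col], G[pivot] = G[pivot], G[col]
        for r in range(k):
            if r != col and G[r][col]:
                G[r] = [a ^ b for a, b in zip(G[r], G[col])]
    return G

G = [[1, 1, 1, 0],     # a non-systematic 2x4 generator matrix over GF(2)
     [0, 1, 0, 1]]
print(systematize(G))  # [[1, 0, 1, 1], [0, 1, 0, 1]] -- identity up front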



* Re: Erasure code library summary
  2013-06-19  1:10 ` Alex Elsayed
@ 2013-06-19  1:14   ` Alex Elsayed
  2013-06-19  7:00     ` Loic Dachary
  2013-06-19  6:56   ` Loic Dachary
  1 sibling, 1 reply; 13+ messages in thread
From: Alex Elsayed @ 2013-06-19  1:14 UTC (permalink / raw)
  To: ceph-devel

Alex Elsayed wrote:

> Loic Dachary wrote:
> 
>> Hi Ceph,
>> 
> <snip>
>> The Reed-Solomon coding family is the only one that can keep the chunks
>> unencoded and therefore concatenable.
> <snip>
> 
> In my understanding, this is not strictly true - any 'systematic' code
> will have the unencoded chunks remain available in this manner, and any
> non-systematic linear code can be transformed into a systematic code with
> the same minimum distance. Fountain codes are often explicitly constructed
> to maintain this property, as in the case of RaptorQ [RFC 6330].
> 
> https://en.wikipedia.org/wiki/Systematic_code

...that said, Reed-Solomon is to the best of my knowledge the only space-
optimal such code. An interesting option, however, might be to use a 
fountain code over the network when distributing either replicas *or* parity 
chunks, so that losses can be recovered with <1 full chunk retransmission.



* Re: Erasure code library summary
  2013-06-19  1:10 ` Alex Elsayed
  2013-06-19  1:14   ` Alex Elsayed
@ 2013-06-19  6:56   ` Loic Dachary
  1 sibling, 0 replies; 13+ messages in thread
From: Loic Dachary @ 2013-06-19  6:56 UTC (permalink / raw)
  To: Alex Elsayed; +Cc: ceph-devel

On 06/19/2013 03:10 AM, Alex Elsayed wrote:
> Loic Dachary wrote:
> 
>> Hi Ceph,
>>
> <snip>
>> The Reed-Solomon coding family is the only one that can keep the chunks
>> unencoded and therefore concatenable.
> <snip>
> 
> In my understanding, this is not strictly true - any 'systematic' code will 
> have the unencoded chunks remain available in this manner, and any non-
> systematic linear code can be transformed into a systematic code with the 
> same minimum distance. Fountain codes are often explicitly constructed to 
> maintain this property, as in the case of RaptorQ [RFC 6330].
> 
> https://en.wikipedia.org/wiki/Systematic_code

Thanks for correcting my mistake :-)


-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



* Re: Erasure code library summary
  2013-06-19  1:14   ` Alex Elsayed
@ 2013-06-19  7:00     ` Loic Dachary
  2013-06-19  7:47       ` Alex Elsayed
  0 siblings, 1 reply; 13+ messages in thread
From: Loic Dachary @ 2013-06-19  7:00 UTC (permalink / raw)
  To: Alex Elsayed; +Cc: ceph-devel

On 06/19/2013 03:14 AM, Alex Elsayed wrote:
> Alex Elsayed wrote:
> 
>> Loic Dachary wrote:
>>
>>> Hi Ceph,
>>>
>> <snip>
>>> The Reed-Solomon coding family is the only one that can keep the chunks
>>> unencoded and therefore concatenable.
>> <snip>
>>
>> In my understanding, this is not strictly true - any 'systematic' code
>> will have the unencoded chunks remain available in this manner, and any
>> non-systematic linear code can be transformed into a systematic code with
>> the same minimum distance. Fountain codes are often explicitly constructed
>> to maintain this property, as in the case of RaptorQ [RFC 6330].
>>
>> https://en.wikipedia.org/wiki/Systematic_code
> 
> ...that said, Reed-Solomon is to the best of my knowledge the only space-
> optimal such code. 

What does "space-optimal" mean ? Does it mean that Reed-Solomon will use less disk space than fountain codes to code the same number of parity chunks ? 

> An interesting option, however, might be to use a 
> fountain code over the network when distributing either replicas *or* parity 
> chunks, so that losses can be recovered with <1 full chunk retransmission.

I would be grateful if you could expand on this idea. I don't get it :-)

Cheers


-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



* Re: Erasure code library summary
  2013-06-19  7:00     ` Loic Dachary
@ 2013-06-19  7:47       ` Alex Elsayed
  2013-06-19  8:33         ` Loic Dachary
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Elsayed @ 2013-06-19  7:47 UTC (permalink / raw)
  To: ceph-devel

Loic Dachary wrote:

> 
> 
> On 06/19/2013 03:14 AM, Alex Elsayed wrote:
>> Alex Elsayed wrote:
>> 
>>> Loic Dachary wrote:
>>>
>>>> Hi Ceph,
>>>>
>>> <snip>
>>>> The Reed-Solomon coding family is the only one that can keep the chunks
>>>> unencoded and therefore concatenable.
>>> <snip>
>>>
>>> In my understanding, this is not strictly true - any 'systematic' code
>>> will have the unencoded chunks remain available in this manner, and any
>>> non-systematic linear code can be transformed into a systematic code
>>> with the same minimum distance. Fountain codes are often explicitly
>>> constructed to maintain this property, as in the case of RaptorQ [RFC
>>> 6330].
>>>
>>> https://en.wikipedia.org/wiki/Systematic_code
>> 
>> ...that said, Reed-Solomon is to the best of my knowledge the only space-
>> optimal such code.
> 
> What does "space-optimal" mean? Does it mean that Reed-Solomon will use
> less disk space than fountain codes to code the same number of parity
> chunks?

Optimal (for an erasure code) means that if you have K symbols of real data, 
then *any* K symbols of the output of the erasure code will let you recover 
it.

Current fountain codes (RaptorQ is best-of-breed right now as far as I know) 
require K + epsilon, and while epsilon is zero for the vast majority of 
cases, some K-sized subsets of the total list of encoded symbols have a non-
zero epsilon, thus requiring more parity data to get exactly the same level 
of assurance.

Optimal erasure codes are also known as "Maximum Distance Separable" codes.
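
To put a rough number on epsilon for the most naive construction — each encoded symbol a uniformly random XOR of the K inputs, which is far cruder than RaptorQ — exactly K received symbols decode only when the random K x K binary matrix is invertible:

def p_decodable(k: int) -> float:
    """Probability that k random GF(2) combinations of k inputs are invertible."""
    p = 1.0
    for i in range(1, k + 1):
        p *= 1.0 - 2.0 ** -i
    return p

for k in (4, 16, 64, 256):
    print(k, round(p_decodable(k), 4))   # converges to ~0.2888 as k grows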

>> An interesting option, however, might be to use a
>> fountain code over the network when distributing either replicas *or*
>> parity chunks, so that losses can be recovered with <1 full chunk
>> retransmission.
> 
> I would be grateful if you could expand on this idea. I don't get it :-)

First, a couple caveats - one, doing this over TCP would yield no real 
benefit. In fact, any reliable transport makes this mostly pointless - the 
idea is to avoid retransmitting not only chunks, but packets as well.

Let's assume 4MB chunks. Encode the chunk as a single source block (Raptor 
terminology, see the RFC), with a symbol size chosen to fit 1 (one) symbol 
comfortably into a single packet of whatever unreliable, unordered transport 
you're using. DCCP is basically perfect for this.

Send the symbols taking advantage of RaptorQ being a systematic code, and 
thus sending the unmodified chunk first. If it gets through okay, the 
receiver closes the connection and you're done.

If one or more packets failed to get through, those are erasures - so the 
receiver leaves the connection open. The sender can be really simplistic - 
'keep encoding and sending symbols as long as the connection is open.' Once 
the receiver has enough symbols to recover, it closes the connection.

In cases of no loss, overhead is zero. In cases of some loss, the number of 
additional packets is equal to the number of lost packets plus a (very 
small) potential overhead. The real benefit here is this:

There is no longer any need to wait a syn/ack cycle to realize a packet was 
lost.

This is the use case fountain codes are optimized for - coding for 
transmission. Creating a new symbol is an O(1) operation for RaptorQ, while 
for Reed-Solomon it's O(N) with the size of the source block.

Another neat property with Raptor codes is that you can have multiple, 
unsynchronized senders - so for replicas, once one replica has succeeded it 
could join in to accelerate it *linearly* without needing to track who had 
which symbols in the chunk.

Multicast, too.
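
A toy simulation of that send loop (illustrative only: no real RaptorQ or DCCP here, and it idealizes the code so that any K distinct symbols decode, i.e. epsilon = 0):

import random

K = 8            # source symbols in the block
LOSS = 0.2       # simulated per-packet loss rate

def send_block() -> int:
    """Sender emits symbol after symbol until the receiver can decode."""
    received = set()
    sent = 0
    while len(received) < K:          # receiver keeps the connection open
        if random.random() >= LOSS:   # the packet survived the channel
            received.add(sent)        # systematic symbols first, then repair
        sent += 1
    return sent

trials = 1000
print(sum(send_block() for _ in range(trials)) / trials)  # ~ K / (1 - LOSS) = 10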

> Cheers
> 




* Re: Erasure code library summary
  2013-06-19  7:47       ` Alex Elsayed
@ 2013-06-19  8:33         ` Loic Dachary
  2013-06-19  9:09           ` Alex Elsayed
  0 siblings, 1 reply; 13+ messages in thread
From: Loic Dachary @ 2013-06-19  8:33 UTC (permalink / raw)
  To: Alex Elsayed; +Cc: ceph-devel

Hi Alex,

If I understand correctly, part of what you propose is to make use of fountain codes to optimize the transmission of replicas. It could even be used to speed up replication by allowing existing copies to contribute to the completion of the others. Although this is very appealing, it may be outside the scope of my current work. Unless you tell me that both topics are intimately linked and should be handled simultaneously for some reason :-)

The erasure coded placement group could work as follows:

* The placement group uses K+M OSDs

* An object is sent by a client to the primary OSD
* The primary OSD encodes the object into K+M chunks
* The primary OSD writes the chunks to the OSDs in order
* The primary OSD acknowledges the write

* An object is read by a client from the primary OSD
* The primary OSD reads the K chunks from the K OSDs
* The primary OSD concatenates the K chunks and returns the result
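
As a toy sketch of this flow (hypothetical names: FakeOSD stands in for the real OSD read/write path, and the parity chunks are zero-filled placeholders rather than jerasure output):

class FakeOSD:
    def store(self, chunk: bytes) -> None:
        self.chunk = chunk
    def fetch(self) -> bytes:
        return self.chunk

def split(obj: bytes, k: int):
    size = -(-len(obj) // k)                    # ceil-divide, pad the tail
    return [obj[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]

def primary_write(obj: bytes, osds, k: int, m: int) -> None:
    data = split(obj, k)
    parity = [bytes(len(data[0]))] * m          # placeholder parity chunks
    for osd, chunk in zip(osds, data + parity): # K+M chunks to K+M OSDs, in order
        osd.store(chunk)
    # the primary acknowledges the write to the client here

def primary_read(osds, k: int, length: int) -> bytes:
    chunks = [osd.fetch() for osd in osds[:k]]  # read only the K data OSDs
    return b"".join(chunks)[:length]            # concatenate, drop the padding

osds = [FakeOSD() for _ in range(5)]            # K = 3, M = 2
primary_write(b"hello erasure coding", osds, k=3, m=2)
assert primary_read(osds, k=3, length=20) == b"hello erasure coding"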

I'd be interested to know if you see a problem with this approach. I could elaborate on how it would behave when scrubbing, or when the OSDMap changes (either because new OSDs become available or because an OSD fails).

Cheers

On 06/19/2013 09:47 AM, Alex Elsayed wrote:
> <snip>

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



* Re: Erasure code library summary
  2013-06-19  8:33         ` Loic Dachary
@ 2013-06-19  9:09           ` Alex Elsayed
  2013-06-19 10:41             ` Loic Dachary
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Elsayed @ 2013-06-19  9:09 UTC (permalink / raw)
  To: ceph-devel

Loic Dachary wrote:

> Hi Alex,
> 
> If I understand correctly, part of what you propose is to make use of
> fountain codes to optimize the transmission of replicas. It could even be
> used to speed up replication by allowing existing copies to contribute to
> the completion of the others. Although this is very appealing, it may be
> outside the scope of my current work. Unless you tell me that both topics
> are intimately linked and should be handled simultaneously for some
> reason :-)

No, just an idea that hit me while I was thinking about Raptor codes with 
regard to Ceph. I'd read up on them a bit ago and had seen a few papers 
describing video distribution systems that made use of them.

> The erasure coded placement group could work as follows:
> 
> * The placement group uses K+M OSDs
> 
> * An object is sent by a client to the primary OSD
> * The primary OSD encodes the object into K+M chunks
> * The primary OSD writes the chunks to the OSDs in order

Well, one thing to consider is whether you want the additional complexity of 
explicitly scheduling where they go, when it's been said before that a.) 
parity recovery is going to be a (somewhat) common operation and b.) we're 
unlikely to be CPU-bound on encode/decode, considering network and disk IO.

It might be a win to have some simple header denoting which chunk is which, 
toss them onto the appropriate OSDs in whatever order, and just let the 
read-side decode whichever K it gets. With Reed-Solomon being an optimal 
erasure code, K are guaranteed to be enough.

> * The primary OSD acknowledges the write
> 
> * An object is read by a client from the primary OSD
> * The primary OSD reads the K chunks from the K OSDs
> * The primary OSD concatenates the K chunks and returns the result

The other reason I suggested what I did above is that then you have exactly 
one code path here - fetch && decode. If the K you get are the input data, 
it's a nice degenerate case, but having a single codepath might help with 
reliability and maintainability.

Unless it's known to be a bottleneck, this just tweaks my 'premature 
optimization' meter a little.

If you want to add it later, it can be done without breaking compatibility - 
add a branch for if the chunks you got are the contiguous real data on the 
read side, preferentially fetch from certain OSDs, and do that location 
scheduling on the write side for new chunks.

Over time, chunks will be allocated in such a way that you get the behavior 
you propose above, and for older chunks it just decodes them like always.

> I'd be interested to know if you see a problem with this approach. I could
> elaborate on how it would behave when scrubbing, or when the OSDMap
> changes (either because new OSDs become available or because an OSD
> fails).

I'd like that, because those are exactly the cases I'd expect the explicit 
scheduling to be a burden rather than a help.

> Cheers
> 

<snip quotes>



* Re: Erasure code library summary
  2013-06-19  9:09           ` Alex Elsayed
@ 2013-06-19 10:41             ` Loic Dachary
  0 siblings, 0 replies; 13+ messages in thread
From: Loic Dachary @ 2013-06-19 10:41 UTC (permalink / raw)
  To: Alex Elsayed; +Cc: ceph-devel

On 06/19/2013 11:09 AM, Alex Elsayed wrote:
> <snip>
> 
>> I'd be interested to know if you see a problem with this approach. I could
>> elaborate on how it would behave when scrubbing, or when the OSDMap
>> changes (either because new OSDs become available or because an OSD
>> fails).
> 
> I'd like that, because those are exactly the cases I'd expect the explicit 
> scheduling to be a burden rather than a help.

You made me realize that I had always been thinking about the best-case scenario :-) It is indeed easier to store the rank of the chunk as an attribute of the object containing the payload, so that the order of the chunks does not rely on the order of the OSDs.

A placement group uses an ordered list of OSDs, the first of which is the primary and the others are the replicas. The order of the list does not change as long as the OSD map stays the same (i.e. no change in the crush map and no OSD coming in or going out). Let's say we have K = 3, M = 2:

epoch 1 => OSD 1 / chunk 1, OSD 2 / chunk 2, OSD 3 / chunk 3, OSD 4 / chunk 4, OSD 5 / chunk 5

The first three chunks (K = 3) are on OSDs 1, 2 and 3 and could be used when reading by concatenating them in order. The parity chunks are on OSDs 4 and 5.

OSD 2 dies and the placement group needs to recover

epoch 1 => OSD 1 / chunk 1, OSD 2 / chunk 2, OSD 3 / chunk 3, OSD 4 / chunk 4, OSD 5 / chunk 5
epoch 2 => OSD 1 / chunk 1, OSD 3 / chunk 3, OSD 4 / chunk 4, OSD 5 / chunk 5, OSD 6 / chunk 2 

The placement group can still allow reads and does so with OSDs 1, 3 and 4, although their chunks cannot simply be concatenated. The missing chunk (2) is reconstructed during placement group recovery and written to OSD 6 together with its rank.

Reducing the CPU load by making sure chunks are simply concatenated when possible does indeed look like premature optimization. It can be implemented later, if necessary, without changing the architecture.
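
A sketch of that read path, with the rank stored alongside each payload (again the single XOR parity toy standing in for Reed-Solomon: ranks 0..K-1 are data, rank K is parity, and any K of the K+1 chunks suffice):

def read_object(chunks, k):
    """chunks: (rank, payload) pairs fetched from any k of the k+1 OSDs."""
    by_rank = dict(chunks)
    if all(r in by_rank for r in range(k)):             # all data chunks here:
        return b"".join(by_rank[r] for r in range(k))   # just concatenate
    missing = next(r for r in range(k) if r not in by_rank)
    rebuilt = bytes(len(next(iter(by_rank.values()))))
    for payload in by_rank.values():                    # XOR of the k present
        rebuilt = bytes(a ^ b for a, b in zip(rebuilt, payload))
    by_rank[missing] = rebuilt
    return b"".join(by_rank[r] for r in range(k))

# the epoch 2 scenario above: chunk 2 was lost with OSD 2, so the read uses
# the parity chunk instead -- ranks, not OSD order, drive the reassembly
data = [b"AA", b"BB", b"CC"]
parity = bytes(a ^ b ^ c for a, b, c in zip(*data))
print(read_object([(0, data[0]), (2, data[2]), (3, parity)], k=3))  # b'AABBCC'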

Cheers

<snip quotes>

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



* Re: Erasure code library summary
  2013-06-18 12:22 Erasure code library summary Loic Dachary
  2013-06-19  1:10 ` Alex Elsayed
@ 2013-06-19 11:33 ` Mark Nelson
  2013-06-19 12:10   ` Loic Dachary
  1 sibling, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2013-06-19 11:33 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On 06/18/2013 07:22 AM, Loic Dachary wrote:
> Hi Ceph,
>
> TL;DR: use jerasure 1.2 with Reed-Solomon to code/decode/repair an object, and upgrade to 2.0 when available.
>
> Disclaimer: I'm no expert ;-) The terms are explained on Wikipedia[1].
>
> Using Reed-Solomon, an object O is encoded by dividing it into consecutive chunks O1, O2, ... ON and computing parity blocks P1, P2, ... PK. Reading the original content of object O is a simple concatenation of O1, O2, ... ON. If O2 or P2 is lost, it can be repaired/reconstructed from the remaining chunks of O1 ... ON and P1 ... PK. If the use case is mostly reading objects and repairs are at least 1000 times less likely than normal operations, being able to read the object from the unencoded chunks is attractive.
>
> Reed-Solomon is significantly more expensive to encode (on the order of 100MB/s on a single 2.5GHz core) than fountain codes with the current jerasure implementation[2]. However, gf-complete[3], which will be used in the upcoming version of jerasure, improves performance significantly (2 to 10 times faster) and the difference becomes negligible.

One thing that we might consider is that ARM is very quickly becoming an 
option for Ceph.  It may be very important to have our erasure coding 
scheme be viable on that platform, and CPU is going to be the primary 
bottleneck.  It may be worth a quick look at NEON to see if there are 
any things we should be thinking about now.

<snip quotes>



* Re: Erasure code library summary
  2013-06-19 11:33 ` Mark Nelson
@ 2013-06-19 12:10   ` Loic Dachary
  2013-06-19 12:33     ` Mark Nelson
  0 siblings, 1 reply; 13+ messages in thread
From: Loic Dachary @ 2013-06-19 12:10 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Ceph Development

On 06/19/2013 01:33 PM, Mark Nelson wrote:
> On 06/18/2013 07:22 AM, Loic Dachary wrote:
>> <snip>
>> Reed-Solomon is significantly more expensive to encode (on the order of 100MB/s on a single 2.5GHz core) than fountain codes with the current jerasure implementation[2]. However, gf-complete[3], which will be used in the upcoming version of jerasure, improves performance significantly (2 to 10 times faster) and the difference becomes negligible.
> 
> One thing that we might consider is that ARM is very quickly becoming an option for Ceph.  It may be very important to have our erasure coding scheme be viable on that platform, and CPU is going to be the primary bottleneck.  It may be worth a quick look at NEON to see if there are any things we should be thinking about now.

Hi Mark,

In another thread James Plank wrote that CPU usage is not going to be a problem as long as we're not trying to slice an object into more than 2^16 chunks (the actual sentence is "I agree that the CPU burden of the GF arithmetic will not be a bottleneck in your system, regardless of which implementation you use, as long as you stay at or below GF(2^16)." http://article.gmane.org/gmane.comp.file-systems.ceph.devel/15650 ). It looks like we're aiming for something on the order of 10 data chunks + 5 parity chunks, i.e. far below 2^16. My hunch is that using more than 100 OSDs to code a single object would be problematic for reasons that are unrelated to the maths involved in coding it anyway.

That being said, I can look for/write benchmark code based on jerasure to run on ARM and get a rough idea of the CPU footprint, if you think it's worth it.
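
As a starting point, a portable micro-benchmark of the GF arithmetic itself could look like the sketch below (bit-by-bit GF(2^8) multiplication with the 0x11d polynomial; jerasure/gf-complete use table-driven and SIMD versions, so treat any numbers from this as a floor, not a prediction):

import time

def gf256_mul(a: int, b: int) -> int:
    p = 0
    for _ in range(8):                 # Russian-peasant multiply over GF(2)
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d                 # reduce modulo x^8+x^4+x^3+x^2+1
    return p

t0 = time.perf_counter()
acc = 0
for i in range(1_000_000):
    acc ^= gf256_mul(i & 0xff, (i >> 8) & 0xff)
print("1M GF(2^8) multiplies:", time.perf_counter() - t0, "s")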

Cheers

<snip quotes>

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.




* Re: Erasure code library summary
  2013-06-19 12:10   ` Loic Dachary
@ 2013-06-19 12:33     ` Mark Nelson
  2013-06-23  7:01       ` Loic Dachary
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Nelson @ 2013-06-19 12:33 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On 06/19/2013 07:10 AM, Loic Dachary wrote:
>
>
> On 06/19/2013 01:33 PM, Mark Nelson wrote:
>>> <snip>
>>
>> One thing that we might consider is that ARM is very quickly becoming an option for Ceph.  It may be very important to have our erasure coding scheme be viable on that platform, and CPU is going to be the primary bottleneck.  It may be worth a quick look at NEON to see if there are any things we should be thinking about now.
>
> Hi Mark,
>
> In another thread James Plank wrote that CPU usage is not going to be a problem as long as we're not trying to slice an object into more than 2^16 chunks (the actual sentence is "I agree that the CPU burden of the GF arithmetic will not be a bottleneck in your system, regardless of which implementation you use, as long as you stay at or below GF(2^16)." http://article.gmane.org/gmane.comp.file-systems.ceph.devel/15650 ). It looks like we're aiming for something on the order of 10 data chunks + 5 parity chunks, i.e. far below 2^16. My hunch is that using more than 100 OSDs to code a single object would be problematic for reasons that are unrelated to the maths involved in coding it anyway.
>
> That being said, I can look for/write benchmark code based on jerasure to run on ARM and get a rough idea of the CPU footprint, if you think it's worth it.

I don't want to add even more to your plate because you already have 
quite a bit here!  I just want to mention it because on ARM, CRC32c and 
general Ceph processing are already using a significant amount of the CPU 
resources.  I suspect that even highly optimized erasure coding 
implementations will be fighting for CPU on ARM (that may change, though, 
with some of the next generation ARM cores coming out next year).

<snip quotes>



* Re: Erasure code library summary
  2013-06-19 12:33     ` Mark Nelson
@ 2013-06-23  7:01       ` Loic Dachary
  0 siblings, 0 replies; 13+ messages in thread
From: Loic Dachary @ 2013-06-23  7:01 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Ceph Development

On 06/19/2013 02:33 PM, Mark Nelson wrote:
> On 06/19/2013 07:10 AM, Loic Dachary wrote:
>> <snip>
>> That being said, I can look for/write benchmark code based on jerasure to run on ARM and get a rough idea of the CPU footprint, if you think it's worth it.
> 
> I don't want to add even more to your plate because you already have quite a bit here!  I just want to mention it because on ARM, CRC32c and general Ceph processing are already using a significant amount of the CPU resources.  I suspect that even highly optimized erasure coding implementations will be fighting for CPU on ARM (that may change, though, with some of the next generation ARM cores coming out next year).

Hi Mark,

I'll keep that in mind and ask around for ARM vs Intel benchmarks related to erasure coding.

Cheers

<snip quotes>

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



