From: Loic Dachary <loic@dachary.org>
To: Alex Elsayed <eternaleye@gmail.com>
Cc: ceph-devel@vger.kernel.org
Subject: Re: Erasure code library summary
Date: Wed, 19 Jun 2013 10:33:39 +0200
Message-ID: <51C16CE3.9020004@dachary.org>
In-Reply-To: <kprnn8$l72$1@ger.gmane.org>

Hi Alex,

If I understand correctly, part of what you propose is to use fountain codes to optimize the transmission of replicas. It could even speed up replication by allowing existing copies to contribute to the completion of the others. Although this is very appealing, it may be outside the scope of my current work, unless you tell me that both topics are intimately linked and should be handled simultaneously for some reason :-)

The erasure coded placement group could work as follows:

* The placement group uses K+M OSDs

* An object is sent by a client to the primary OSD
* The primary OSD encodes the object into K+M chunks
* The primary OSD writes the chunks to the K+M OSDs, in order
* The primary OSD acknowledges the write

* An object is read by a client from the primary OSD
* The primary OSD reads the K data chunks from the K OSDs that hold them
* The primary OSD concatenates the K chunks and returns the result
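
The write and read paths above can be sketched as follows. This is only a minimal illustration under stated assumptions, not a proposed implementation: it uses K=4 data chunks and a single XOR parity chunk (M=1) as a stand-in for a real systematic code such as Reed-Solomon, and the names PrimaryOSD, encode, osds and sizes are hypothetical. Each OSD is modelled as a plain dict.

```python
from functools import reduce

K, M = 4, 1  # assumption: 4 data chunks plus 1 XOR parity chunk


def encode(obj: bytes, k: int = K) -> list:
    """Split obj into k equal data chunks (zero padded) and append one XOR parity chunk."""
    size = -(-len(obj) // k)  # ceiling division
    padded = obj.ljust(size * k, b"\0")
    data = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data))
    return data + [parity]


class PrimaryOSD:
    """Hypothetical primary OSD of a placement group using K+M OSDs."""

    def __init__(self):
        self.osds = [dict() for _ in range(K + M)]  # one dict per OSD, in order
        self.sizes = {}  # original object lengths, needed to strip the padding

    def write(self, name: str, obj: bytes) -> bool:
        chunks = encode(obj)
        for osd, chunk in zip(self.osds, chunks):  # chunk i goes to OSD i
            osd[name] = chunk
        self.sizes[name] = len(obj)
        return True  # acknowledge the write

    def read(self, name: str) -> bytes:
        # The code is systematic, so the first K chunks are the unencoded
        # data and plain concatenation is enough.
        data = b"".join(osd[name] for osd in self.osds[:K])
        return data[:self.sizes[name]]


pg = PrimaryOSD()
pg.write("obj", b"hello erasure coded world")
print(pg.read("obj"))
```

The read path never touches the parity chunk as long as the K OSDs holding the data chunks are available, which is why keeping the chunks unencoded matters.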

I'd be interested to know if you see a problem with this approach. I could elaborate on how it would behave when scrubbing, or when the OSDMap changes (either because new OSDs become available or because an OSD fails).

Cheers

On 06/19/2013 09:47 AM, Alex Elsayed wrote:
> Loic Dachary wrote:
> 
>>
>>
>> On 06/19/2013 03:14 AM, Alex Elsayed wrote:
>>> Alex Elsayed wrote:
>>>
>>>> Loic Dachary wrote:
>>>>
>>>>> Hi Ceph,
>>>>>
>>>> <snip>
>>>>> Reed-Solomon coding family is the only one that can keep the chunks
>>>>> unencoded and therefore concatenable.
>>>> <snip>
>>>>
>>>> In my understanding, this is not strictly true - any 'systematic' code
>>>> will have the unencoded chunks remain available in this manner, and any
>>>> non-systematic linear code can be transformed into a systematic code
>>>> with the same minimum distance. Fountain codes are often explicitly
>>>> constructed to maintain this property, as in the case of RaptorQ [RFC
>>>> 6330].
>>>>
>>>> https://en.wikipedia.org/wiki/Systematic_code
>>>
>>> ...that said, Reed-Solomon is to the best of my knowledge the only space-
>>> optimal such code.
>>
>> What does "space-optimal" mean ? Does it mean that Reed-Solomon will use
>> less disk space than fountain codes to code the same number of parity
>> chunks ?
> 
> Optimal (for an erasure code) means that if you have K symbols of real data, 
> then *any* K symbols of the output of the erasure code will let you recover 
> it.
> 
> Current fountain codes (RaptorQ is best-of-breed right now as far as I know) 
> require K + epsilon, and while epsilon is zero for the vast majority of 
> cases, some K-sized subsets of the total list of encoded symbols have a non-
> zero epsilon, thus requiring more parity data to get exactly the same level 
> of assurance.
> 
> Optimal erasure codes are also known as "Maximum Distance Separable" codes.
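
The "any K symbols" property can be checked exhaustively on a toy code. The sketch below assumes the simplest MDS code, K data chunks plus one XOR parity chunk, rather than Reed-Solomon or RaptorQ; the helper names xor_bytes, encode and decode are hypothetical.

```python
from functools import reduce
from itertools import combinations


def xor_bytes(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def encode(data_chunks):
    """Systematic encoding: the K data chunks followed by one XOR parity chunk."""
    return list(data_chunks) + [xor_bytes(data_chunks)]


def decode(available, k):
    """Recover the K data chunks from a dict {index: chunk} holding any K of the K+1 chunks."""
    missing = [i for i in range(k) if i not in available]
    if not missing:
        return [available[i] for i in range(k)]
    if len(missing) > 1 or k not in available:
        raise ValueError("fewer than K chunks available")
    lost = missing[0]
    # XOR of the K surviving chunks (parity included) rebuilds the lost one.
    rebuilt = xor_bytes([available[i] for i in range(k + 1) if i != lost])
    return [rebuilt if i == lost else available[i] for i in range(k)]


data = [b"ab", b"cd", b"ef"]
coded = encode(data)
for subset in combinations(range(len(coded)), len(data)):
    assert decode({i: coded[i] for i in subset}, len(data)) == data
print("every K-subset of the K+1 chunks recovers the data")
```

A code needing K + epsilon symbols would fail this check for at least some K-sized subsets.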
> 
>>> An interesting option, however, might be to use a
>>> fountain code over the network when distributing either replicas *or*
>>> parity chunks, so that losses can be recovered with <1 full chunk
>>> retransmission.
>>
>> I would be grateful if you could expand on this idea. I don't get it :-)
> 
> First, a couple caveats - one, doing this over TCP would yield no real 
> benefit. In fact, any reliable transport makes this mostly pointless - the 
> idea is to avoid retransmitting not only chunks, but packets as well.
> 
> Let's assume 4MB chunks. Encode the chunk as a single source block (Raptor 
> terminology, see the RFC), with a symbol size chosen to fit 1 (one) symbol 
> comfortably into a single packet of whatever unreliable, unordered transport 
> you're using. DCCP is basically perfect for this.
> 
> Send the symbols taking advantage of RaptorQ being a systematic code, and 
> thus sending the unmodified chunk first. If it gets through okay, the 
> receiver closes the connection and you're done.
> 
> If one or more packets failed to get through, those are erasures - so the 
> receiver leaves the connection open. The sender can be really simplistic - 
> 'keep encoding and sending symbols as long as the connection is open.' Once 
> the receiver has enough symbols to recover, it closes the connection.
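
The exchange described above can be simulated in a few lines. This sketch does not implement RaptorQ: it substitutes a plain random linear fountain code over GF(2), which shares the "systematic symbols first, then endless repair symbols" behaviour but is far less efficient to decode. The names symbol_stream, Receiver and transfer are hypothetical, and each symbol is assumed to travel with a bitmask saying which source symbols were XORed into it.

```python
import random


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def symbol_stream(source, rng):
    """Yield (mask, payload): the k source symbols first, then random XOR combinations forever."""
    k = len(source)
    for i, sym in enumerate(source):
        yield 1 << i, sym  # systematic part: the unmodified chunk
    while True:
        mask = rng.randrange(1, 1 << k)
        payload = bytes(len(source[0]))
        for i in range(k):
            if mask >> i & 1:
                payload = xor(payload, source[i])
        yield mask, payload


class Receiver:
    def __init__(self, k):
        self.k = k
        self.rows = {}  # highest set bit of mask -> (mask, payload)

    def receive(self, mask, payload):
        # Incremental Gaussian elimination over GF(2).
        while mask:
            pivot = mask.bit_length() - 1
            if pivot not in self.rows:
                self.rows[pivot] = (mask, payload)
                return
            m, p = self.rows[pivot]
            mask, payload = mask ^ m, xor(payload, p)
        # mask reached 0: the symbol carried no new information

    def done(self):
        return len(self.rows) == self.k

    def decode(self):
        out = []
        for pivot in range(self.k):  # back-substitution, lowest pivot first
            mask, payload = self.rows[pivot]
            for j in range(pivot):
                if mask >> j & 1:
                    payload = xor(payload, out[j])
            out.append(payload)
        return out


def transfer(source, loss_rate, seed=0):
    """Send until the receiver can decode; return (decoded symbols, packets sent)."""
    rng = random.Random(seed)
    rx = Receiver(len(source))
    sent = 0
    for mask, payload in symbol_stream(source, rng):
        sent += 1
        if rng.random() < loss_rate:
            continue  # packet lost: an erasure, nothing is retransmitted
        rx.receive(mask, payload)
        if rx.done():
            break  # the receiver "closes the connection"
    return rx.decode(), sent


source = [bytes([i]) * 8 for i in range(16)]
decoded, sent = transfer(source, loss_rate=0.0)
print(decoded == source, sent)
```

With no loss the sender emits exactly the 16 source symbols; with a non-zero loss_rate it simply keeps emitting repair symbols until the receiver has 16 linearly independent ones.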
> 
> In cases of no loss, overhead is zero. In cases of some loss, the number of 
> additional packets is equal to the number of lost packets plus a (very 
> small) potential overhead. The real benefit here is this:
> 
> There is no longer any need to wait a syn/ack cycle to realize a packet was 
> lost.
> 
> This is the use case fountain codes are optimized for - coding for 
> transmission. Creating a new symbol is an O(1) operation for RaptorQ, while 
> for Reed-Solomon it's O(N) with the size of the source block.
> 
> Another neat property with Raptor codes is that you can have multiple, 
> unsynchronized senders - so for replicas, once one replica has succeeded it 
> could join in to accelerate it *linearly* without needing to track who had 
> which symbols in the chunk.
> 
> Multicast, too.
> 
>> Cheers
>>
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

