On 06/19/2013 11:09 AM, Alex Elsayed wrote:
> Loic Dachary wrote:
>
>> Hi Alex,
>>
>> If I understand correctly, part of what you propose is to make use of
>> fountain codes to optimize replica transmissions. It could even be used
>> to speed up replication by allowing existing copies to contribute to the
>> completion of the others. Although this is very appealing, it may be
>> outside of the scope of my current work. Unless you tell me that both
>> topics are intimately linked and should be handled simultaneously for
>> some reason :-)
>
> No, just an idea that hit while I was thinking about Raptor codes with
> regard to Ceph. I'd read up on them a bit ago and had seen a few papers
> describing video distribution systems that made use of that.
>
>> The erasure coded placement group could work as follows:
>>
>> * The placement group uses K+M OSDs
>>
>> * An object is sent by a client to the primary OSD
>> * The primary OSD codes the object into K+M chunks
>> * The primary OSD writes the chunks to the OSDs in order
>
> Well, one thing to consider is whether you want the additional complexity
> of explicitly scheduling where they go, when it's been said before that
> a.) parity recovery is going to be a (somewhat) common operation and
> b.) we're unlikely to be CPU-bound on encode/decode, considering network
> and disk IO.
>
> It might be a win to have some simple header denoting which chunk is
> which, toss them onto the appropriate OSDs in whatever order, and just
> let the read side decode whichever K it gets. With Reed-Solomon being an
> optimal erasure code, K are guaranteed to be enough.
>
>> * The primary OSD acknowledges the write
>>
>> * An object is read by a client from the primary OSD
>> * The primary OSD reads the K chunks from the K OSDs
>> * The primary OSD concatenates the K chunks and returns the result
>
> The other reason I suggested what I did above is that then you have
> exactly one code path here - fetch && decode. If the K you get are the
> input data, it's a nice degenerate case, but having a single code path
> might help with reliability and maintainability.
>
> Unless it's known to be a bottleneck, this just tweaks my 'premature
> optimization' meter a little.
>
> If you want to add it later, it can be done without breaking
> compatibility - add a branch for when the chunks you got are the
> contiguous real data on the read side, preferentially fetch from certain
> OSDs, and do that location scheduling on the write side for new chunks.
>
> Over time, chunks will be allocated in such a way that you get the
> behavior you propose above, and for older chunks it just decodes them
> like always.
>
>> I'd be interested to know if you see a problem with this approach. I
>> could elaborate on how it would behave when scrubbing, or when the
>> OSDMap changes ( either because new OSDs become available or because an
>> OSD fails ).
>
> I'd like that, because those are exactly the cases I'd expect the
> explicit scheduling to be a burden rather than a help.

You made me realize that I always thought about the best case scenario :-)
It is indeed easier to store the rank of the chunk as an attribute of the
object containing the payload, so that the order of the chunks does not
rely on the order of the OSDs.

A placement group uses an ordered list of OSDs, the first of which is the
primary and the others are the replicas. The order of the list does not
change as long as the OSD map stays the same ( i.e. no change in the crush
map and no OSD coming in or going out ).
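
To make the "store the rank with the payload and decode whichever K you
get" idea concrete, here is a minimal sketch. It is not Ceph code: a single
XOR parity chunk ( M = 1 ) stands in for the Reed-Solomon / jerasure coding
a real implementation would use ( the example below uses M = 2 ), and
encode(), decode() and the "rank" attribute are made-up names. The only
point is that because each chunk carries its rank, the read side can
rebuild the object from any K chunks, in whatever order they arrive.

# Toy illustration, not Ceph code: a single XOR parity chunk stands in for
# Reed-Solomon / jerasure, and encode()/decode()/"rank" are made-up names.
K = 3   # data chunks
M = 1   # parity chunks ( toy: one XOR parity, so any K of K+M chunks suffice )

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(obj: bytes) -> list:
    """Split obj into K data chunks plus M parity chunks, each tagged with its rank."""
    size = -(-len(obj) // K)                 # chunk size, rounded up
    data = [obj[i * size:(i + 1) * size].ljust(size, b'\0') for i in range(K)]
    parity = data[0]
    for d in data[1:]:
        parity = xor(parity, d)
    # The rank travels with the payload ( an object attribute in Ceph terms ),
    # so the placement order on the OSDs does not matter.
    return [{"rank": r, "payload": p} for r, p in enumerate(data + [parity])]

def decode(chunks: list, obj_len: int) -> bytes:
    """Rebuild the object from any K chunks, whatever order they arrive in."""
    assert len(chunks) >= K
    by_rank = {c["rank"]: c["payload"] for c in chunks}
    missing = [r for r in range(K) if r not in by_rank]
    if missing:                              # at most one data chunk can be missing here
        rebuilt = by_rank[K]                 # start from the parity chunk
        for r in range(K):
            if r != missing[0]:
                rebuilt = xor(rebuilt, by_rank[r])
        by_rank[missing[0]] = rebuilt
    return b''.join(by_rank[r] for r in range(K))[:obj_len]

obj = b"The quick brown fox jumps over the lazy dog"
chunks = encode(obj)
# The read side happens to get ranks 0, 2 and the parity chunk: still enough.
assert decode([chunks[0], chunks[2], chunks[3]], len(obj)) == obj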
Let's say we have K = 3, M = 2:

epoch 1 => OSD 1 / chunk 1, OSD 2 / chunk 2, OSD 3 / chunk 3, OSD 4 / chunk 4, OSD 5 / chunk 5

The first three chunks ( K = 3 ) are on OSDs 1, 2 and 3 and could be used
when reading by concatenating them in order. The parity chunks are on OSDs
4 and 5.

OSD 2 dies and the placement group needs to recover:

epoch 1 => OSD 1 / chunk 1, OSD 2 / chunk 2, OSD 3 / chunk 3, OSD 4 / chunk 4, OSD 5 / chunk 5
epoch 2 => OSD 1 / chunk 1, OSD 3 / chunk 3, OSD 4 / chunk 4, OSD 5 / chunk 5, OSD 6 / chunk 2

The placement group can still serve reads and does so with OSD 1, OSD 3
and OSD 4, although their chunks cannot simply be concatenated. The missing
chunk ( 2 ) is reconstructed during the placement group recovery and
written to OSD 6 together with its rank ( sketched in the PS below ).

Reducing the CPU load by making sure chunks are simply concatenated when
possible does indeed look like premature optimization. It can be
implemented later, if necessary, without changing the architecture.

Cheers

>
>> Cheers
>>

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
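
PS: a hypothetical continuation of the sketch above, playing out the
recovery scenario with the same toy XOR code ( so M = 1 and 0-based ranks,
unlike the K = 3, M = 2 layout in the example ); it reuses encode() and
decode() from the previous snippet and all names are made up for
illustration.

obj = b"The quick brown fox jumps over the lazy dog"
chunks = encode(obj)

# epoch 1: the chunk of rank r lives on OSD r + 1 ( OSD 4 holds the parity ).
acting = {osd: chunks[osd - 1] for osd in (1, 2, 3, 4)}

# OSD 2 dies: its chunk ( rank 1 ) is lost, but reads still work because the
# read side decodes whichever K chunks the surviving OSDs return.
del acting[2]
assert decode(list(acting.values()), len(obj)) == obj

# Recovery: rebuild the object from the survivors, re-encode it, and write
# the missing rank to the OSD joining the acting set in epoch 2 ( OSD 5
# here ). Since the rank is stored with the payload, the position of OSD 5
# in the new acting set is irrelevant.
missing_rank = 1
acting[5] = encode(decode(list(acting.values()), len(obj)))[missing_rank]
assert decode(list(acting.values()), len(obj)) == obj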