* EC API to expose locality
@ 2014-01-14 13:43 Andreas Joachim Peters
  2014-01-14 15:39 ` Sage Weil
  0 siblings, 1 reply; 3+ messages in thread
From: Andreas Joachim Peters @ 2014-01-14 13:43 UTC (permalink / raw)
  To: ceph-devel

After some exchanges with Loic and the recent list discussion, I think
the ::encode method of the EC plugin API needs some clarification or extension.

Currently ::encode returns a map of bufferlists where the key is the stripe index in [ 0 .. (m+k) ]
and the value is the encoded buffer belonging to that stripe index:

map<int, bufferlist> *encoded

If a pyramid code is used, the index range would be [ 0 .. (m+k+(l*l_k)) ], where l is the number of local
parity subgroups and l_k is the number of parity stripes per subgroup. The pyramid code would just
split the input into the requested number of subgroups and compute local parity for each of them according
to the configuration.
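
To make the index arithmetic concrete, here is a small C++ sketch of the chunk
count for a pyramid code. The name pyramid_chunk_count is hypothetical and not
part of the plugin API, and the sketch assumes the index range is half-open,
so the valid indices run from 0 to k+m+l*l_k-1.

```cpp
#include <cassert>

// Hypothetical helper, not part of the EC plugin API.
// k   : number of data chunks
// m   : number of global parity chunks
// l   : number of local parity subgroups
// l_k : number of local parity chunks per subgroup
// Returns the total number of chunks ::encode would have to produce.
inline int pyramid_chunk_count(int k, int m, int l, int l_k)
{
  assert(k > 0 && m >= 0 && l >= 0 && l_k >= 0);
  return k + m + l * l_k;
}
```

With k=10, m=2, l=3 and l_k=1 this gives 15 chunks, and with l=0 it reduces
to the plain k+m case.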

With this API the caller has no way of knowing how to group stripes for intelligent
placement, i.e. keeping each subgroup together with its local parities to minimize traffic
during remapping and reconstruction.

Either there is an additional function returning the locality subgroup [ 0 .. l ] for each created
chunk, or the ::encode function returns the chunks already grouped, like:

vector<map<int, bufferlist> > *encoded

Probably it would be good to have both.
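
As a rough illustration of how the two options relate, the sketch below
regroups a flat ::encode result using a per-chunk subgroup table. Everything
here is an assumption for illustration: buffer_t stands in for bufferlist,
and group_chunks and chunk_to_group are hypothetical names, not existing
plugin functions.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stand-in for Ceph's bufferlist so the sketch compiles on its own.
typedef std::string buffer_t;

// Hypothetical helper: regroup the flat ::encode output by locality subgroup.
// chunk_to_group[i] is the subgroup of chunk i, as a plugin-side query
// function might report it. Chunks sharing a subgroup land in the same map.
inline std::vector<std::map<int, buffer_t> >
group_chunks(const std::map<int, buffer_t> &encoded,
             const std::vector<int> &chunk_to_group,
             int group_count)
{
  std::vector<std::map<int, buffer_t> > grouped(group_count);
  for (std::map<int, buffer_t>::const_iterator it = encoded.begin();
       it != encoded.end(); ++it) {
    // Every chunk index must have an entry in the subgroup table.
    assert(it->first >= 0 &&
           it->first < static_cast<int>(chunk_to_group.size()));
    int group = chunk_to_group[it->first];
    assert(group >= 0 && group < group_count);
    grouped[group][it->first] = it->second;
  }
  return grouped;
}
```

The chunk index is kept as the inner map key, so a caller that only has the
grouped form can still tell which stripe index each buffer belongs to.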

However, it is not clear whether you can actually remap/recover an OSD without destroying the locality
of the pyramid encoding, or whether you can define CRUSH rules at all that honor chunk locality
and keep it when pools shrink or grow.

A last question is whether a remapping/recovery action is only possible with the traffic going through the primary OSD.

If locality cannot be supported sufficiently now or in the future, should the API stay as it is?

The ::decode function is fine, since the plugin knows about the locality of the available chunks and will
select the cheapest decoding possible.

Cheers, Andreas.


* Re: EC API to expose locality
  2014-01-14 13:43 EC API to expose locality Andreas Joachim Peters
@ 2014-01-14 15:39 ` Sage Weil
  2014-01-14 23:02   ` Loic Dachary
  0 siblings, 1 reply; 3+ messages in thread
From: Sage Weil @ 2014-01-14 15:39 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

Hi Andreas,

On Tue, 14 Jan 2014, Andreas Joachim Peters wrote:
> After some exchange with Loic and the recent list discussion, 
> the API of the EC plugin might need some clarification/extension in the ::encode method:
> 
> Currently ::encode returns a map of bufferlists where the key is the index of [ 0 .. (m+k) ] 
> and the value is the encoded buffer belonging to that stripe index:
> 
> map<int, bufferlist> *encoded
> 
> If a pyramid code is used the index would be [ 0 .. (m+k+(l*l_k)) ] where l is the number of local 
> parity subgroups and l_k are the number of parity stripes per subgroup. The pyramid code would just
> chunk the input into the requested number of subgroups and compute local parity for them according 
> to the configuration.
> 
> With this API the caller has actually no clue how to group stripes together for intelligent
> placement allowing to keep subgroups with local parities together to minimize traffic 
> during remapping and reconstruction.

This is a bit awkward, it's true.  I'm not sure there is a 'magic' way to 
accomplish this.  In the end, the CRUSH rule needs to have the required 
width *and* should group nodes accordingly, but this mapping happens at a 
very different layer in Ceph than the low-level plugin, so even if callers 
had this information they wouldn't be able to do anything about it.

Currently, what we need to do is make sure the EC plugin maps onto a 
linear array of devices the same way that CRUSH does.  For a pyramid code, 
the CRUSH rule will be something like 

 step take root
 step choose indep 3 type rack
 step choose indep 5 type osd
 step emit

to get 3 groups of 5 devices as an array of size 15.  That means the EC
plugin needs to map onto ranks that go something like

 0-3 data
 4 local parity
 5-8 data
 9 local parity
 10-11 data
 12-13 global parity
 14 local parity

(or whatever).
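
The rank layout above can be written out as a small C++ sketch. The names
Role, example_layout, group_of and count_role are made up for illustration
and do not exist in Ceph. The table is just the 3 groups of 5 example, and
the group computation assumes the rule fills the array one group after the
other.

```cpp
#include <cassert>
#include <vector>

enum class Role { Data, LocalParity, GlobalParity };

// The example layout from the list above, written out rank by rank.
// This is only the illustrative 3 x 5 arrangement, not a layout
// defined by any existing plugin.
inline std::vector<Role> example_layout()
{
  const Role D = Role::Data, L = Role::LocalParity, G = Role::GlobalParity;
  return {D, D, D, D, L,   // ranks 0-4
          D, D, D, D, L,   // ranks 5-9
          D, D, G, G, L};  // ranks 10-14
}

// Assuming the rule picks group_size devices in each chosen bucket and
// emits them contiguously, rank r falls into group r / group_size.
inline int group_of(int rank, int group_size)
{
  assert(rank >= 0 && group_size > 0);
  return rank / group_size;
}

// Count how many ranks of a layout have the given role.
inline int count_role(const std::vector<Role> &layout, Role role)
{
  int n = 0;
  for (Role r : layout)
    if (r == role)
      ++n;
  return n;
}
```

In this example layout there are 10 data, 2 global parity and 3 local parity
ranks, and each group of 5 ends with its own local parity.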

Getting this to line up is a bit fragile, unfortunately.  We could make
a plugin method that describes the subgrouping, but even then I'm not
sure how easy it is to programmatically validate that an arbitrary CRUSH
rule will behave well.  Maybe it is enough to

- have some way to query the layout of the EC plugin (e.g., 3 groups of 5).
- add a new 'osd crush rule create-pyramid ...' command to supplement 
  'create-simple'.

and document it well... 
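
A minimal sketch of what the layout query from the first bullet could look
like, together with the simplest possible consistency check against a rule.
Layout and layout_matches_rule are hypothetical names, and the check only
covers a rule with exactly two choose steps. It says nothing about whether
an arbitrary CRUSH rule behaves well.

```cpp
#include <cassert>
#include <vector>

// Hypothetical description a plugin could return from a layout query.
struct Layout {
  int groups;      // number of locality subgroups, e.g. 3
  int group_size;  // chunks per subgroup, e.g. 5
};

// choose_counts holds the counts of the rule's successive choose steps,
// e.g. {3, 5} for a rule that chooses 3 racks and then 5 OSDs in each.
// Only this simple two-step shape is checked; anything else is rejected.
inline bool layout_matches_rule(const Layout &layout,
                                const std::vector<int> &choose_counts)
{
  assert(layout.groups > 0 && layout.group_size > 0);
  return choose_counts.size() == 2 &&
         choose_counts[0] == layout.groups &&
         choose_counts[1] == layout.group_size;
}
```

A check like this would catch a rule of the right total width but the wrong
shape, such as 5 groups of 3 for a plugin that expects 3 groups of 5.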

sage



* Re: EC API to expose locality
  2014-01-14 15:39 ` Sage Weil
@ 2014-01-14 23:02   ` Loic Dachary
  0 siblings, 0 replies; 3+ messages in thread
From: Loic Dachary @ 2014-01-14 23:02 UTC (permalink / raw)
  To: Sage Weil, Andreas Joachim Peters; +Cc: ceph-devel



On 14/01/2014 16:39, Sage Weil wrote:
> Getting this to line up is a bit fragile, unfortunately.  We could make
> a plugin method that describes the subgrouping, but even then I'm not
> sure how easy it is to programmatically validate that an arbitrary CRUSH
> rule will behave well.  Maybe it is enough to
> 
> - have some way to query the layout of the EC plugin (e.g, 3 groups of 5).
> - add a new 'osd crush rule create-pyramid ...' command to supplement 
>   'create-simple'.
> 
> and document it well... 
> 
> sage

I created http://tracker.ceph.com/issues/7146 to keep track of this feature.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre

