* Erasure coding library API
From: Loic Dachary @ 2013-07-01 23:00 UTC
  To: Ceph Development

Hi,

Today Sam pointed out that the API for LRC (see, for instance, the Xorbas Hadoop project page on Locally Repairable Codes (LRC), http://smahesh.com/HadoopUSC/) would need to be different from the one initially proposed:

    context(k, m, reed-solomon|...) => context* c
    encode(context* c, void* data) => void* chunks[k+m]
    decode(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* data // erased chunks are not used
    repair(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* chunks[k+m] // erased chunks are rebuilt

The decode function must allow for partial read:

    decode(context* c, int offset, int length, void* chunk[k+m], int* indices_of_erased_chunks, int* missing_chunks) => void* data 

If there are not enough chunks to recover the desired data range [offset, offset+length), the function returns NULL and sets missing_chunks to the list of chunks that must be retrieved in order to read the desired data.

If decode is called to read just one chunk and it is missing, reed-solomon would return an error and ask for all the other chunks in order to repair. If the underlying library implements LRC, it would ask for only a subset of the chunks.

An implementation allowing only full reads and using jerasure (which does not do LRC) would require that offset always be zero and length be the size of the object, and would return a copy of indices_of_erased_chunks if there are not enough chunks to rebuild the missing ones.
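
For concreteness, a rough sketch of these signatures in C++ (the names, size parameters and error convention are illustrative only, nothing here is final):

    #include <cstddef>

    struct context;  // opaque: holds k, m and the chosen technique

    context *create_context(int k, int m, const char *technique);

    // Split `data` (size bytes) into k data chunks, compute m coding
    // chunks, and return all k+m chunk pointers through `chunks`.
    int encode(context *c, const void *data, std::size_t size,
               void *chunks[/* k+m */]);

    // Rebuild the byte range [offset, offset+length) into `out`.
    // Returns 0 on success. If the available chunks are insufficient,
    // returns -1 and fills `missing_chunks` with the indices to fetch
    // first (plain reed-solomon: all surviving chunks; LRC: possibly
    // only a local subset).
    int decode(context *c, std::size_t offset, std::size_t length,
               void *chunks[/* k+m */],
               const int *indices_of_erased_chunks, int num_erased,
               void *out, int *missing_chunks);

    // Rebuild the erased chunks in place.
    int repair(context *c, void *chunks[/* k+m */],
               const int *indices_of_erased_chunks, int num_erased);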

Comments are welcome :-)

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


* Re: Erasure coding library API
From: Atchley, Scott @ 2013-07-02 14:07 UTC
  To: Loic Dachary; +Cc: Ceph Development

On Jul 1, 2013, at 7:00 PM, Loic Dachary <loic@dachary.org> wrote:

> Hi,
> 
> Today Sam pointed out that the API for LRC (see, for instance, the Xorbas Hadoop project page on Locally Repairable Codes (LRC), http://smahesh.com/HadoopUSC/) would need to be different from the one initially proposed:

An interesting video. Not as entertaining as Jim Plank's video. ;-)

While Plank's video focused on the processor requirements for encoding/decoding, this one focuses on the network and disk I/O requirements.

>    context(k, m, reed-solomon|...) => context* c
>    encode(context* c, void* data) => void* chunks[k+m]
>    decode(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* data // erased chunks are not used
>    repair(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* chunks[k+m] // erased chunks are rebuilt
> 
> The decode function must allow for partial read:
> 
>    decode(context* c, int offset, int length, void* chunk[k+m], int* indices_of_erased_chunks, int* missing_chunks) => void* data 
> 
> If there are not enough chunks to recover the desired data range [offset, offset+length), the function returns NULL and sets missing_chunks to the list of chunks that must be retrieved in order to read the desired data.
> 
> If decode is called to read just one chunk and it is missing, reed-solomon would return an error and ask for all the other chunks in order to repair. If the underlying library implements LRC, it would ask for only a subset of the chunks.
> 
> An implementation allowing only full reads and using jerasure (which does not do LRC) would require that offset always be zero and length be the size of the object, and would return a copy of indices_of_erased_chunks if there are not enough chunks to rebuild the missing ones.
> 
> Comments are welcome :-)

I have loosely followed this discussion and I have not looked closely at the proposed API nor at the jerasure interface. My apologies if this has already been addressed.

It is not clear to me from the proposed API above (ignoring the partial read) what it would do. Was the original intent to encode the entire file using k+m blocks regardless of the file size and of the rados object size?

If so, how will you map rados objects to the logical k+m objects and vice versa?

If not, then the initial API needs an offset and length (either logical or rados object).

I would assume that you would want to operate on rados-sized objects. Given a fixed k+m, you may then have more than one set of k+m objects per file (this is ignoring the LRC "local" parity blocks). For example, if the rados object size is 1 MB and k = 10 and m = 4 (as in the Xorbas video), then a 20 MB file would need two sets of encoding blocks: the first for objects 1-10 and the second for objects 11-20.

Perhaps this is what the context above is for. If so, it should include the logical offset and rados object size, no?
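
To make the numbers concrete, a throwaway sketch of that mapping (the layout is an assumption on my part, not an existing Ceph interface):

    #include <cstdio>

    int main() {
        const int k = 10, m = 4;                  // values from the Xorbas video
        const long object_size = 1L << 20;        // 1 MB rados objects
        const long file_size = 20L * (1L << 20);  // 20 MB file
        long data_objects = (file_size + object_size - 1) / object_size;
        long sets = (data_objects + k - 1) / k;   // one k+m set per k objects
        std::printf("%ld data objects -> %ld sets -> %ld parity objects\n",
                    data_objects, sets, sets * m); // 20 -> 2 -> 8
        return 0;
    }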

I see value in the Xorbas concept and I wonder if the jerasure library can be modified to generate the local parity blocks such that they can be used to generate the global parity blocks. That would be a question for Jim Plank.

Scott

* Re: Erasure coding library API
From: Atchley, Scott @ 2013-07-02 17:14 UTC
  To: Loic Dachary; +Cc: Ceph Development

On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchleyes@ornl.gov> wrote:

> [...]
> 
> I see value in the Xorbas concept and I wonder if the jerasure library can be modified to generate the local parity blocks such that they can be used to generate the global parity blocks. That would be a question for Jim Plank.

The benefit of the Xorbas concept is reduced network and disk I/O during failure handling, while maintaining traditional RS's higher fault tolerance than 3x replication and using less space.

You can do almost the same thing with jerasure without modifying it at all. Using the values from the Xorbas video, they have k data blocks, m global parity blocks, and 2 local parity blocks (each generated from k/2 data blocks), for a total of k+m+2 blocks on disk that can tolerate any m failures. In their example, k = 10 and m = 4, so they store 16 blocks for each 10 data blocks.

If you used traditional RS encoding via jerasure with the same amount of storage (16 blocks for each 10 data blocks), you could encode 3 parity blocks for each 5 data blocks. The fault tolerance would then vary from 3 to 6 failures, depending on how the failures fall between the two 5+3 groups, which is on average higher than the static 4 failures of the Xorbas code. The I/O to recover from a single failure is 5 blocks for both schemes, so it is as efficient as Xorbas. On average it provides more fault tolerance, though it can be less (four failures landing in one group of 5 data + 3 parity blocks), and that worst case is the same as 3x replication.
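
That 3-to-6 range is easy to verify exhaustively; a quick throwaway check (plain C++ using the GCC/Clang popcount builtin, not Ceph code; blocks 0-7 are the first 5+3 group, blocks 8-15 the second):

    #include <cstdio>

    int main() {
        long long survived[17] = {0}, total[17] = {0};
        for (unsigned p = 0; p < (1u << 16); ++p) {  // each bit = one failed block
            int f1 = __builtin_popcount(p & 0xffu);  // failures in group 1
            int f2 = __builtin_popcount(p >> 8);     // failures in group 2
            total[f1 + f2]++;
            if (f1 <= 3 && f2 <= 3)  // RS(5,3) survives up to 3 losses per group
                survived[f1 + f2]++;
        }
        // Every 3-failure pattern survives, no 7-failure pattern does, and
        // 4 to 6 failures survive only if no group loses more than 3 blocks.
        for (int f = 3; f <= 7; ++f)
            std::printf("%d failures: %lld of %lld patterns survive\n",
                        f, survived[f], total[f]);
        return 0;
    }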

Scott

* Re: Erasure coding library API
From: Samuel Just @ 2013-07-02 21:33 UTC
  To: Atchley, Scott; +Cc: Loic Dachary, Ceph Development

I think we should be able to cover most cases by adding an interface like:

    set<int> minimum_to_read(const set<int> &want_to_read,
                             const set<int> &available_chunks);

which returns the smallest set required to read/rebuild the chunks in want_to_read given the chunks in available_chunks.  Alternatively, we might include a "cost" for reading each chunk, like

    set<int> minimum_to_read_with_cost(const set<int> &want_to_read,
                                       const map<int, int> &available)

which returns the minimum-cost set required to read the specified chunks given a mapping of available chunks to costs.  The costs might allow us to consider the difference between reading local chunks vs remote chunks.  This should be sufficient to cover the read case (especially the degraded read case) and the repair case.
-Sam
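
For a flat reed-solomon code the first call is almost trivial, since any k available chunks will do; a minimal sketch (illustrative only, not a real plugin):

    #include <set>
    using std::set;

    struct rs_code {
        int k;  // data chunks required to decode

        set<int> minimum_to_read(const set<int> &want_to_read,
                                 const set<int> &available) const {
            set<int> result;
            bool must_decode = false;
            for (set<int>::const_iterator i = want_to_read.begin();
                 i != want_to_read.end(); ++i) {
                if (available.count(*i))
                    result.insert(*i);   // readable directly
                else
                    must_decode = true;  // a wanted chunk is lost
            }
            if (must_decode)             // pad to any k available chunks
                for (set<int>::const_iterator i = available.begin();
                     i != available.end() && (int)result.size() < k; ++i)
                    result.insert(*i);
            // If must_decode and result.size() < k, the data is lost.
            return result;
        }
    };

An LRC backend would replace the padding step with locality-aware logic, which is exactly where the two code families differ.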

On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchleyes@ornl.gov> wrote:
> [...]

* RE: Erasure coding library API
From: Paul Von-Stamwitz @ 2013-07-03  2:12 UTC
  To: Samuel Just, Atchley, Scott; +Cc: Loic Dachary, Ceph Development

Scott,

You make a good point comparing (5/3) RS with Xorbas, but a small nit:

"The I/O to recover from a single failure for both schemes is 5 blocks so it is as efficient as Xorbas."

Maybe not. You would probably issue I/O to all 7 remaining blocks to cover the possibility of double errors. The time to reconstruct would be about the same, but there could be more disk and network I/O. (LRC will need to issue I/O to the rest of the global stripe if it detects multiple local errors.)

What I like about Xorbas is that it is an extension of an (x,y) RS. You can start with traditional RS; if degraded reads and block repairs are causing a problem, you can add the LRC, and if capacity is an issue, you can take it out.

Best,
Paul

On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote:
> [...]

* Re: Erasure coding library API
From: Atchley, Scott @ 2013-07-03 11:53 UTC
  To: Paul Von-Stamwitz; +Cc: Samuel Just, Loic Dachary, Ceph Development

On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz <PVonStamwitz@us.fujitsu.com> wrote:

> Scott,
> 
> You make a good point comparing (5/3) RS with Xorbas, but a small nit:
> 
> "The I/O to recover from a single failure for both schemes is 5 blocks so it is as efficient as Xorbas."
> 
> Maybe not. You would probably issue I/O to all 7 remaining blocks to cover the possibility of double errors. The time to reconstruct would be about the same, but there could be more disk and network I/O. (LRC will need to issue I/O to the rest of the global stripe if it detects multiple local errors.)

Why would you request more than five? If one cannot be read, request another.

Also, I am not sure that you want to request all five at once, since it will lead to spikes in network traffic and require memory for all five blocks. You will need at least two buffers: request the first two and start decoding. You may want a third buffer to overlap the decoding of the current block with the communication for the next block. It may be that the decode time is less than the communication time; in that case, you will want to request all of the blocks at once.
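
Roughly the overlap I have in mind, as a sketch (fetch_chunk and decode_chunk are placeholders, not real Ceph or jerasure calls):

    #include <cstddef>
    #include <future>
    #include <vector>

    std::vector<char> fetch_chunk(int index);       // placeholder: network read
    void decode_chunk(const std::vector<char> &b);  // placeholder: CPU work

    // Double-buffered read: fetch block i+1 while decoding block i.
    void pipelined_read(const std::vector<int> &needed) {
        if (needed.empty())
            return;
        std::future<std::vector<char> > next =
            std::async(std::launch::async, fetch_chunk, needed[0]);
        for (std::size_t i = 0; i < needed.size(); ++i) {
            std::vector<char> current = next.get();  // wait for block i
            if (i + 1 < needed.size())               // start block i+1 early
                next = std::async(std::launch::async, fetch_chunk,
                                  needed[i + 1]);
            decode_chunk(current);                   // overlaps the next fetch
        }
    }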

> What I like about Xorbas is that it is an extension of an (x,y) RS. You can start with traditional RS; if degraded reads and block repairs are causing a problem, you can add the LRC, and if capacity is an issue, you can take it out.

I like it too, and Microsoft has something similar with Pyramid codes. That said, my example using traditional RS can provide more fault tolerance on average given the same amount of storage overhead.

> 
> Best,
> Paul
> 
> [...]


* RE: Erasure coding library API
From: Paul Von-Stamwitz @ 2013-07-03 18:19 UTC
  To: Atchley, Scott; +Cc: Samuel Just, Loic Dachary, Ceph Development

Hi Scott,

Point taken.

I was thinking about Loic's decode description, where k+m chunks are requested and the data is decoded once k blocks have been received. But he was referring to full-stripe reads, where all the memory is allocated.

Degraded reads and block repair are a different matter.

pvs

> -----Original Message-----
> From: Atchley, Scott [mailto:atchleyes@ornl.gov]
> Sent: Wednesday, July 03, 2013 4:53 AM
> To: Paul Von-Stamwitz
> Cc: Samuel Just; Loic Dachary; Ceph Development
> Subject: Re: Erasure coding library API
> 
> [...]


* RE: Erasure coding library API
From: Paul Von-Stamwitz @ 2013-07-04  3:06 UTC
  To: Paul Von-Stamwitz, Atchley, Scott
  Cc: Samuel Just, Loic Dachary, Ceph Development

Scott, et al.

Here is an interesting paper from the USENIX HotStorage conference which provides local codes without additional capacity overhead.

Check it out (abstract with links to the paper and slides):
https://www.usenix.org/conference/hotstorage13/solution-network-challenges-data-recovery-erasure-coded-distributed-storage

Cheers,
pvs

> On Jul 3, 2013, at 11:19 AM, Paul Von-Stamwitz wrote:
> [...]


* Re: Erasure coding library API
From: Loic Dachary @ 2013-07-04 13:24 UTC
  To: Paul Von-Stamwitz; +Cc: Ceph Development

Hi,

I was thinking about scrubbing of erasure-coded chunks and realized I don't know the answer to this very simple question: what happens when a chunk is corrupted? I.e. if AB is coded with 2+1 into A + B (data) + Z (parity) and Z is replaced with Q, would reed-solomon ignore/discard the corrupted chunk? If that's the case, I think it slightly changes what the API should be.
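
My understanding is that erasure decoding assumes the positions of the erased chunks are known in advance, so a silently corrupted Z would not be noticed by the decoder itself and would produce garbage. A per-chunk checksum could turn corruption into a known erasure before decoding; a rough sketch (the crc32c helper and the chunk layout are assumptions, not an existing API):

    #include <cstddef>
    #include <cstdint>
    #include <set>
    #include <vector>

    std::uint32_t crc32c(const void *data, std::size_t len);  // assumed helper

    struct stored_chunk {
        std::vector<char> bytes;
        std::uint32_t crc;  // stored alongside the chunk at encode time
    };

    // Chunks whose content no longer matches their checksum are handed
    // to decode() as erased, exactly as if they had been lost.
    std::set<int> find_corrupted(const std::vector<stored_chunk> &chunks) {
        std::set<int> erased;
        for (std::size_t i = 0; i < chunks.size(); ++i)
            if (crc32c(chunks[i].bytes.data(), chunks[i].bytes.size())
                    != chunks[i].crc)
                erased.insert(static_cast<int>(i));
        return erased;
    }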

Cheers

On 04/07/2013 05:06, Paul Von-Stamwitz wrote:
> Scott, et al.
> 
> Here is an interesting paper from Usenix HotStorage Conference which provides local codes without additional capacity overhead.
> 
> Check it out. (abstract with links to paper and slides)
> https://www.usenix.org/conference/hotstorage13/solution-network-challenges-data-recovery-erasure-coded-distributed-storage
> 
> Cheers,
> pvs
> 
>> On Jul 3, 2013, at 11:19 AM, Paul Von-Stamwitz wrote:
>>
>> Hi Scott,
>>
>> Point taken.
>>
>> I was thinking about Loic's decode description where k+m was requested and
>> data was decoded when k blocks were received. But he was referring to full
>> stripe reads where all the memory is allocated.
>>
>> Degraded reads and block repair are a different matter.
>>
>> pvs
>>
>>> On Jul 3, 2013, at 4:53 AM Scott Atchley wrote:
>>>
>>> On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz
>>> <PVonStamwitz@us.fujitsu.com> wrote:
>>>
>>>> Scott,
>>>>
>>>> You make a good point comparing (5/3) RS with Xorbas, but a small nit:
>>>>
>>>> "The I/O to recover from a single failure for both schemes is 5 blocks
>>> so it is as efficient as Xorbas."
>>>>
>>>> Maybe not. You would probably issue I/O to all the remaining 7 blocks
>> to
>>> cover for the possibility of double errors. The time to reconstruct
>> would
>>> be about the same, but there could be more disk and network I/O. (LRC
>> will
>>> need to issue I/O to the rest of the global stripe if it detected
>> multiple
>>> local errors.)
>>>
>>> Why would you request more than five? If one cannot be read, request
>>> another.
>>>
>>> Also, I am not sure that you want to request five at once since it will
>>> lead to spikes in network traffic and require memory for all five blocks.
>>> You will need at least two buffers. Request the first two and start the
>>> decoding. You may want a third buffer to overlap the decoding of the
>>> current block with the communication for the next block. It may be that
>>> the decode time is less than the communication and, in that case, you
>> will
>>> want to request all of the blocks at once.
>>>
>>>> What I like about Xorbas is that it is an extension of a (x,y) RS. You
>>> can start with traditional RS. If degraded reads and repaired blocks are
>>> causing a problem, you can add the LRC. If capacity is an issue, you can
>>> take it out.
>>>
>>> I like it too and Microsoft has something similar with Pyramid codes.
>> That
>>> said, my example using traditional RS can provide more fault-tolerance
>> on
>>> average given the same amount of storage overhead.
>>>
>>>>
>>>> Best,
>>>> Paul
>>>>
>>>> On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote:
>>>>> I think we should be able to cover most cases by adding an interface
>>> like:
>>>>>
>>>>> set<int> minimum_to_read(const set<int> &want_to_read, const set<int>
>>>>> &available_chunks);
>>>>>
>>>>> which returns the smallest set required to read/rebuild the chunks in
>>>>> want_to_read given the chunks in available_chunks.  Alternately, we
>>> might
>>>>> include a "cost" for reading each chunk like
>>>>>
>>>>> set<int> minimum_to_read_with_cost(const set<int> &want_to_read,
>> const
>>>>> map<int, int> &available)
>>>>>
>>>>> which returns the minimum cost set required to read the specified
>>> chunks
>>>>> given a mapping of available chunks to costs.  The costs might allow
>> us
>>> to
>>>>> consider the difference between reading local chunks vs remote chunks.
>>>>> This should be sufficient to cover the read case (esp the degraded
>> read
>>>>> case) and the repair case.
>>>>> -Sam
>>>>>
>>>>> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchleyes@ornl.gov>
>>>>> wrote:
>>>>>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchleyes@ornl.gov>
>>>>> wrote:
>>>>>>
>>>>>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary <loic@dachary.org> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Today Sam pointed out that the API for LRC ( Xorbas Hadoop Project
>>>>> Page, Locally Repairable Codes (LRC) http://smahesh.com/HadoopUSC/
>> for
>>>>> instance ) would need to be different from the one initialy proposed:
>>>>>>>
>>>>>>> An interesting video. Not as entertaining as Jim Plank's video. ;-)
>>>>>>>
>>>>>>> While Plank's focused on the processor requirements for
>>>>> encoding/decoding, this video focuses on the network and disk I/O
>>>>> requirements.
>>>>>>>
>>>>>>>>  context(k, m, reed-solomon|...) => context* c
>>>>>>>>  encode(context* c, void* data) => void* chunks[k+m]
>>>>>>>>  decode(context* c, void* chunk[k+m], int* indices_of_erased_chunks)
>>>>>>>>    => void* data // erased chunks are not used
>>>>>>>>  repair(context* c, void* chunk[k+m], int* indices_of_erased_chunks)
>>>>>>>>    => void* chunks[k+m] // erased chunks are rebuilt
>>>>>>>>
>>>>>>>> The decode function must allow for partial read:
>>>>>>>>
>>>>>>>>  decode(context* c, int offset, int length, void* chunk[k+m], int*
>>>>>>>>    indices_of_erased_chunks, int* missing_chunks) => void* data
>>>>>>>>
>>>>>>>> If there are not enough chunks to recover the desired data range
>>>>>>>> [offset, offset+length) the function returns NULL and sets
>>>>>>>> missing_chunks to the list of chunks that must be retrieved in order
>>>>>>>> to be able to read the desired data.
>>>>>>>>
>>>>>>>> If decode is called to read just 1 chunk and it is missing,
>>>>>>>> reed-solomon would return on error and ask for all other chunks to
>>>>>>>> repair. If the underlying library implements LRC, it would ask for a
>>>>>>>> subset of the chunks.
>>>>>>>>
>>>>>>>> An implementation allowing only full reads and using jerasure ( which
>>>>>>>> does not do LRC ) requires that offset is always zero, length is the
>>>>>>>> size of the object and returns a copy of indices_of_erased_chunks if
>>>>>>>> there are not enough chunks to rebuild the missing ones.
>>>>>>>>
>>>>>>>> Comments are welcome :-)
>>>>>>>
>>>>>>> I have loosely followed this discussion and I have not looked
>>>>>>> closely at the proposed API nor at the jerasure interface. My
>>>>>>> apologies if this has already been addressed.
>>>>>>>
>>>>>>> It is not clear to me from the above proposed API (ignoring the
>>>>>>> partial read) what it would do. Was the original intent to encode
>>>>>>> the entire file using k+m blocks regardless of the file size and of
>>>>>>> the rados object size?
>>>>>>>
>>>>>>> If so, how will you map rados objects to the logical k+m objects and
>>>>>>> vice versa?
>>>>>>>
>>>>>>> If not, then the initial API needed an offset and length (either
>>>>>>> logical or rados object).
>>>>>>>
>>>>>>> I would assume that you would want to operate on rados sized
>>>>>>> objects. Given a fixed k+m, then you may have more than one set of
>>>>>>> k+m objects per file. This is ignoring the LRC "local" parity
>>>>>>> blocks. For example, if the rados object size is 1 MB and k = 10 and
>>>>>>> m = 4 (as in the Xorbas video), then for a 20 MB file one would need
>>>>>>> two sets of encoding blocks. The first for objects 1-10 and the
>>>>>>> second for objects 11-20.
>>>>>>>
>>>>>>> Perhaps, this is what the context is above. If so, it should have
>>>>>>> the logical offset and rados object size, no?
>>>>>>>
>>>>>>> I see value in the Xorbas concept and I wonder if the jerasure
>>>>>>> library can be modified to generate the local parity blocks such
>>>>>>> that they can be used to generate the global parity blocks. That
>>>>>>> would be a question for Jim Plank.
>>>>>>
>>>>>> The benefits of the Xorbas concept are reduced network and disk I/O
>>>>>> for failures while maintaining traditional RS's higher
>>>>>> fault-tolerance than 3x replication while using less space.
>>>>>>
>>>>>> You can do almost the same thing with jerasure without modifying it
>>>>>> at all. Using the values from the Xorbas video, they have k data
>>>>>> blocks, m global parity blocks, and 2 local parity blocks (generated
>>>>>> from k/2 data blocks) for a total of k+m+2 blocks on disk that can
>>>>>> tolerate any m failures. In their example, k = 10 and m = 4. They
>>>>>> store 16 blocks for each 10 data blocks.
>>>>>>
>>>>>> If you use traditional RS encoding via jerasure and used the same
>>>>>> amount of storage (16 blocks for each 10 data blocks), you could
>>>>>> encode 3 parity blocks for each 5 data blocks. This would consume 16
>>>>>> blocks for each 10 data blocks and the fault-tolerance would be
>>>>>> variable from 3-6 failures depending on how the failures fell between
>>>>>> the two groups of 5 blocks, which is higher than the static 4
>>>>>> failures for the Xorbas code. The I/O to recover from a single
>>>>>> failure for both schemes is 5 blocks so it is as efficient as Xorbas.
>>>>>> On average, it provides more fault-tolerance, but it can be less
>>>>>> (four failures from one group of 5 data + 3 parity blocks), but that
>>>>>> worst case is the same as 3x replication.
>>>>>>
>>>>>> Scott

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Erasure coding library API
  2013-07-04 13:24             ` Loic Dachary
@ 2013-07-05 12:13               ` Atchley, Scott
  2013-07-05 14:06                 ` Loic Dachary
  0 siblings, 1 reply; 16+ messages in thread
From: Atchley, Scott @ 2013-07-05 12:13 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Paul Von-Stamwitz, Ceph Development

Loic,

Erasure codes take whatever you give them. You need to verify each chunk before using it. Perhaps store the checksum in the metadata/context that describes the parity object?
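
For instance (a sketch only; none of these names are an existing Ceph or
jerasure interface, and SHA-1 is just one possible choice of digest):

    #include <openssl/sha.h>
    #include <cstring>

    // Recompute the SHA-1 of a chunk and compare it with the digest that
    // was stored with the chunk at write time. A chunk that fails the
    // check would be treated as an erasure instead of being handed to
    // decode() as valid data.
    static bool chunk_is_valid(const void *chunk, size_t size,
                               const unsigned char stored[SHA_DIGEST_LENGTH])
    {
        unsigned char digest[SHA_DIGEST_LENGTH];
        SHA1(static_cast<const unsigned char *>(chunk), size, digest);
        return memcmp(digest, stored, SHA_DIGEST_LENGTH) == 0;
    }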

Scott

On Jul 4, 2013, at 9:24 AM, Loic Dachary <loic@dachary.org> wrote:

> Hi,
> 
> I was thinking about scrubbing of erasure coded chunks and realized I don't know the answer to this very simple question: what happens when a chunk is corrupted? I.e. if AB is coded with 2+1 into A + B (data) + Z (parity) and Z is replaced with Q, would reed-solomon ignore/discard the corrupted chunk? If that's the case I think it slightly changes what the API should be.
> 
> Cheers
> 
> On 04/07/2013 05:06, Paul Von-Stamwitz wrote:
>> Scott, et al.
>> 
>> Here is an interesting paper from Usenix HotStorage Conference which provides local codes without additional capacity overhead.
>> 
>> Check it out. (abstract with links to paper and slides)
>> https://www.usenix.org/conference/hotstorage13/solution-network-challenges-data-recovery-erasure-coded-distributed-storage
>> 
>> Cheers,
>> pvs


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Erasure coding library API
  2013-07-04  3:06           ` Paul Von-Stamwitz
  2013-07-04 13:24             ` Loic Dachary
@ 2013-07-05 12:39             ` Atchley, Scott
  1 sibling, 0 replies; 16+ messages in thread
From: Atchley, Scott @ 2013-07-05 12:39 UTC (permalink / raw)
  To: Paul Von-Stamwitz; +Cc: Samuel Just, Loic Dachary, Ceph Development

Paul,

Very nice. No extra storage versus traditional RS, with lower recovery traffic, although not as low as LRC's. I imagine it complicates the recovery logic as well as the metadata required to track the dependencies.

Scott

On Jul 3, 2013, at 11:06 PM, Paul Von-Stamwitz <PVonStamwitz@us.fujitsu.com> wrote:

> Scott, et al.
> 
> Here is an interesting paper from Usenix HotStorage Conference which provides local codes without additional capacity overhead.
> 
> Check it out. (abstract with links to paper and slides)
> https://www.usenix.org/conference/hotstorage13/solution-network-challenges-data-recovery-erasure-coded-distributed-storage
> 
> Cheers,
> pvs


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Erasure coding library API
  2013-07-05 12:13               ` Atchley, Scott
@ 2013-07-05 14:06                 ` Loic Dachary
  2013-07-05 15:02                   ` Atchley, Scott
  0 siblings, 1 reply; 16+ messages in thread
From: Loic Dachary @ 2013-07-05 14:06 UTC (permalink / raw)
  To: Atchley, Scott; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 11781 bytes --]



On 05/07/2013 14:13, Atchley, Scott wrote:
> Loic,
> 
> Erasure codes take whatever you give them. You need to verify each chunk before using it. Perhaps store the checksum in the metadata/context that describes the parity object?

Hi Scott,

Does that mean that if I give the chunks A + B + C to decode() and B is corrupted but A and C are OK, it will return incorrectly decoded content? I'm curious to know the answer, but I don't think it is an actual problem. A corrupted chunk would mean that the underlying file system corrupted the content of one of its files. I can't remember the last time I saw that happen ;-)
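
To make the failure mode I am worried about concrete, here is the
smallest possible example, using plain XOR parity rather than
reed-solomon (purely an illustration, not any real library):

    #include <cstdio>

    // Toy 2+1 code: parity Z = A ^ B, so a lost B is rebuilt as A ^ Z.
    // If Z was silently corrupted into Q on disk, the rebuilt B is wrong
    // and nothing in the arithmetic itself detects it.
    int main()
    {
        unsigned char A = 0x41, B = 0x42;
        unsigned char Z = A ^ B;        // parity as originally written
        unsigned char Q = Z ^ 0x04;     // Z after a single bit flip
        unsigned char rebuilt = A ^ Q;  // decode using the bad parity
        printf("expected B=0x%02x, rebuilt B=0x%02x\n", B, rebuilt);
        return 0;
    }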

Cheers

> 
> Scott
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Erasure coding library API
  2013-07-05 14:06                 ` Loic Dachary
@ 2013-07-05 15:02                   ` Atchley, Scott
  2013-07-05 16:34                     ` Loic Dachary
  0 siblings, 1 reply; 16+ messages in thread
From: Atchley, Scott @ 2013-07-05 15:02 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On Jul 5, 2013, at 10:06 AM, Loic Dachary <loic@dachary.org> wrote:

> On 05/07/2013 14:13, Atchley, Scott wrote:
>> Loic,
>> 
>> Erasure codes take whatever you give them. You need to verify each chunk before using it. Perhaps store the checksum in the metadata/context that describes the parity object?
> 
> Hi Scott,
> 
> Does that mean that if I give the chunks A + B + C to decode() and B is corrupted but A and C are OK, it will return incorrectly decoded content?

Unfortunately, yes.

> I'm curious to know the answer, but I don't think it is an actual problem. A corrupted chunk would mean that the underlying file system corrupted the content of one of its files. I can't remember the last time I saw that happen ;-)

Bit flips happen. Not often, but it is possible. Here is an article from 2011:

http://www.linux-mag.com/id/8794/

also search for "bit rot" and "bit error rate".

Scott

> 
> Cheers
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Erasure coding library API
  2013-07-05 15:02                   ` Atchley, Scott
@ 2013-07-05 16:34                     ` Loic Dachary
  2013-07-05 16:55                       ` Atchley, Scott
  0 siblings, 1 reply; 16+ messages in thread
From: Loic Dachary @ 2013-07-05 16:34 UTC (permalink / raw)
  To: Atchley, Scott; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 13983 bytes --]

Hi Scott,

That's unfortunate indeed and I now better understand why it is necessary to add a signature to the chunks. I added the following to:

https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst

Bit flips happen. Not often, but they do happen. Here is an article from 2011 (http://www.linux-mag.com/id/8794/); also search for "bit rot" and "bit error rate". To detect corrupted chunks, a signature (SHA1 for instance) should be added as an attribute of the file containing the chunk so that deep scrubbing can check that the chunk is valid by rehashing its content and comparing the result with the signature.
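
A minimal sketch of the check deep scrubbing could run on each chunk file
(the xattr name "user.chunk_sha1" is made up, and storing the signature
as a POSIX extended attribute is an assumption):

    #include <openssl/sha.h>
    #include <sys/xattr.h>
    #include <cstring>

    // Rehash the content of a chunk file and compare the result with the
    // signature stored as an extended attribute when the chunk was written.
    static bool chunk_scrub_ok(const char *path,
                               const unsigned char *content, size_t size)
    {
        unsigned char stored[SHA_DIGEST_LENGTH];
        unsigned char fresh[SHA_DIGEST_LENGTH];
        ssize_t len = getxattr(path, "user.chunk_sha1",
                               stored, sizeof(stored));
        if (len != (ssize_t)sizeof(stored))
            return false; // missing or truncated signature
        SHA1(content, size, fresh);
        return memcmp(stored, fresh, SHA_DIGEST_LENGTH) == 0;
    }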

While writing a more detailed version of the API taking your comments and Sam's comments into account, I realized that I don't know if it is possible to recover a parity chunk. If a data chunk is missing, it is enough to recover it with a decode operation and write it back (assuming a systematic code is used). Would it be possible to do something similar with parity chunks? Or would we need to re-encode all the parity chunks to write back just one of them? I assume the latter but ... I'm not sure ;-)
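
In case re-encoding is indeed required, here is how I picture rebuilding
a single parity chunk on top of the API sketched earlier in the thread
(the signatures are the ones proposed there, so treat this as
pseudo-code, not a real library):

    // Rebuild one missing parity chunk i (k <= i < k+m). With a
    // systematic code the k data chunks are still available, so decode()
    // can return the original data, encode() recomputes all k+m chunks,
    // and only chunk i is kept and written back. Wasteful on CPU, but it
    // needs no more I/O than reading the k data chunks.
    void *rebuild_parity_chunk(context *c, void *chunks[], int i)
    {
        int erased[] = { i, -1 };  // -1 terminates the list of erasures
        void *data = decode(c, chunks, erased);
        void **all = encode(c, data);
        return all[i];
    }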

Cheers

On 05/07/2013 17:02, Atchley, Scott wrote:
> On Jul 5, 2013, at 10:06 AM, Loic Dachary <loic@dachary.org> wrote:
> 
>> On 05/07/2013 14:13, Atchley, Scott wrote:
>>> Loic,
>>>
>>> Erasure codes take whatever you give them. You need to verify each chunk before using it. Perhaps store the checksum in the metadata/context that describes the parity object?
>>
>> Hi Scott,
>>
>> Does that mean that if I give the chunks A + B + C to decode() and B is corrupted but A and C are OK, it will return incorrectly decoded content?
> 
> Unfortunately, yes.
> 
>> I'm curious to know the answer, but I don't think it is an actual problem. A corrupted chunk would mean that the underlying file system corrupted the content of one of its files. I can't remember the last time I saw that happen ;-)
> 
> Bit flips happen. Not often, but it is possible. Here is an article from 2011:
> 
> http://www.linux-mag.com/id/8794/
> 
> also search for "bit rot" and "bit error rate".
> 
> Scott
> 
>>
>> Cheers
>>
>>>
>>> Scott
>>>
>>> On Jul 4, 2013, at 9:24 AM, Loic Dachary <loic@dachary.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> I was thinking about scrubbing of erasure coded chunks and realized I don't know the answer to this very simple question : what happens when a chunk is corrupted ? I.e. if AB is coded with 2+1 into A + B ( data ) + Z (parity ) and Z is replaced with Q. Would reed-solomon ignore/discard the corrupted chunk ? If that's the case I think it slightly changes what the API should be.
>>>>
>>>> Cheers
>>>>
>>>> On 04/07/2013 05:06, Paul Von-Stamwitz wrote:
>>>>> Scott, et al.
>>>>>
>>>>> Here is an interesting paper from Usenix HotStorage Conference which provides local codes without additional capacity overhead.
>>>>>
>>>>> Check it out. (abstract with links to paper and slides)
>>>>> https://www.usenix.org/conference/hotstorage13/solution-network-challenges-data-recovery-erasure-coded-distributed-storage
>>>>>
>>>>> Cheers,
>>>>> pvs
>>>>>
>>>>>> On Jul 3, 2013, at 11:19 AM, Paul Von-Stamwitz wrote:
>>>>>>
>>>>>> Hi Scott,
>>>>>>
>>>>>> Point taken.
>>>>>>
>>>>>> I was thinking about Loic's decode description where k+m was requested and
>>>>>> data was decoded when k blocks were received. But he was referring to full
>>>>>> stripe reads where all the memory is allocated.
>>>>>>
>>>>>> Degraded reads and block repair are a different matter.
>>>>>>
>>>>>> pvs
>>>>>>
>>>>>>> On Jul 3, 2013, at 4:53 AM Scott Atchley wrote:
>>>>>>>
>>>>>>> On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz
>>>>>>> <PVonStamwitz@us.fujitsu.com> wrote:
>>>>>>>
>>>>>>>> Scott,
>>>>>>>>
>>>>>>>> You make a good point comparing (5/3) RS with Xorbas, but a small nit:
>>>>>>>>
>>>>>>>> "The I/O to recover from a single failure for both schemes is 5 blocks
>>>>>>> so it is as efficient as Xorbas."
>>>>>>>>
>>>>>>>> Maybe not. You would probably issue I/O to all the remaining 7 blocks
>>>>>> to
>>>>>>> cover for the possibility of double errors. The time to reconstruct
>>>>>> would
>>>>>>> be about the same, but there could be more disk and network I/O. (LRC
>>>>>> will
>>>>>>> need to issue I/O to the rest of the global stripe if it detected
>>>>>> multiple
>>>>>>> local errors.)
>>>>>>>
>>>>>>> Why would you request more than five? If one cannot be read, request
>>>>>>> another.
>>>>>>>
>>>>>>> Also, I am not sure that you want to request five at once since it will
>>>>>>> lead to spikes in network traffic and require memory for all five blocks.
>>>>>>> You will need at least two buffers. Request the first two and start the
>>>>>>> decoding. You may want a third buffer to overlap the decoding of the
>>>>>>> current block with the communication for the next block. It may be that
>>>>>>> the decode time is less than the communication and, in that case, you
>>>>>> will
>>>>>>> want to request all of the blocks at once.
>>>>>>>
>>>>>>>> What I like about Xorbas is that it is an extension of a (x,y) RS. You
>>>>>>> can start with traditional RS. If degraded reads and repaired blocks are
>>>>>>> causing a problem, you can add the LRC. If capacity is an issue, you can
>>>>>>> take it out.
>>>>>>>
>>>>>>> I like it too and Microsoft has something similar with Pyramid codes.
>>>>>> That
>>>>>>> said, my example using traditional RS can provide more fault-tolerance
>>>>>> on
>>>>>>> average given the same amount of storage overhead.
>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Paul
>>>>>>>>
>>>>>>>> On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote:
>>>>>>>>> I think we should be able to cover most cases by adding an interface
>>>>>>> like:
>>>>>>>>>
>>>>>>>>> set<int> minimum_to_read(const set<int> &want_to_read, const set<int>
>>>>>>>>> &available_chunks);
>>>>>>>>>
>>>>>>>>> which returns the smallest set required to read/rebuild the chunks in
>>>>>>>>> want_to_read given the chunks in available_chunks.  Alternately, we
>>>>>>> might
>>>>>>>>> include a "cost" for reading each chunk like
>>>>>>>>>
>>>>>>>>> set<int> minimum_to_read_with_cost(const set<int> &want_to_read,
>>>>>> const
>>>>>>>>> map<int, int> &available)
>>>>>>>>>
>>>>>>>>> which returns the minimum cost set required to read the specified
>>>>>>> chunks
>>>>>>>>> given a mapping of available chunks to costs.  The costs might allow
>>>>>> us
>>>>>>> to
>>>>>>>>> consider the difference between reading local chunks vs remote chunks.
>>>>>>>>> This should be sufficient to cover the read case (esp the degraded
>>>>>> read
>>>>>>>>> case) and the repair case.
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchleyes@ornl.gov>
>>>>>>>>> wrote:
>>>>>>>>>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchleyes@ornl.gov>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary <loic@dachary.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Today Sam pointed out that the API for LRC ( Xorbas Hadoop Project
>>>>>>>>> Page, Locally Repairable Codes (LRC) http://smahesh.com/HadoopUSC/
>>>>>> for
>>>>>>>>> instance ) would need to be different from the one initially proposed:
>>>>>>>>>>>
>>>>>>>>>>> An interesting video. Not as entertaining as Jim Plank's video. ;-)
>>>>>>>>>>>
>>>>>>>>>>> While Plank's focused on the processor requirements for
>>>>>>>>> encoding/decoding, this video focuses on the network and disk I/O
>>>>>>>>> requirements.
>>>>>>>>>>>
>>>>>>>>>>>> context(k, m, reed-solomon|...) => context* c
>>>>>>>>>>>> encode(context* c, void* data) => void* chunks[k+m]
>>>>>>>>>>>> decode(context* c, void* chunk[k+m], int*
>>>>>>>>>>>> indices_of_erased_chunks) => void* data // erased chunks are not
>>>>>>> used
>>>>>>>>>>>> repair(context* c, void* chunk[k+m], int*
>>>>>>>>>>>> indices_of_erased_chunks) => void* chunks[k+m] // erased chunks
>>>>>> are
>>>>>>>>>>>> rebuilt
>>>>>>>>>>>>
>>>>>>>>>>>> The decode function must allow for partial read:
>>>>>>>>>>>>
>>>>>>>>>>>> decode(context* c, int offset, int length, void* chunk[k+m], int*
>>>>>>>>>>>> indices_of_erased_chunks, int* missing_chunks) => void* data
>>>>>>>>>>>>
>>>>>>>>>>>> If there are not enough chunks to recover the desired data range
>>>>>>>>> [offset, offset+length) the function returns NULL and sets
>>>>>>> missing_chunks
>>>>>>>>> to the list of chunks that must be retrieved in order to be able to
>>>>>>> read
>>>>>>>>> the desired data.
>>>>>>>>>>>>
>>>>>>>>>>>> If decode is called to read just 1 chunk and it is missing, reed-
>>>>>>>>> solomon would return an error and ask for all other chunks to repair.
>>>>>>> If
>>>>>>>>> the underlying library implements LRC, it would ask for a subset of
>>>>>> the
>>>>>>>>> chunks.
>>>>>>>>>>>>
>>>>>>>>>>>> An implementation allowing only full reads and using jerasure
>>>>>>> ( which
>>>>>>>>> does not do LRC ) requires that offset is always zero, length is the
>>>>>>> size
>>>>>>>>> of the object and returns a copy of indices_of_erased_chunks if there
>>>>>>> are
>>>>>>>>> not enough chunks to rebuild the missing ones.
>>>>>>>>>>>>
>>>>>>>>>>>> Comments are welcome :-)
>>>>>>>>>>>
>>>>>>>>>>> I have loosely followed this discussion and I have not looked
>>>>>> closely
>>>>>>>>> at the proposed API nor at the jerasure interface. My apologies if
>>>>>> this
>>>>>>>>> has already been addressed.
>>>>>>>>>>>
>>>>>>>>>>> It is not clear to me from the above proposed API (ignoring the
>>>>>>> partial
>>>>>>>>> read) what it would do. Was the original intent to encode the entire
>>>>>>> file
>>>>>>>>> using k+m blocks regardless of the file size and of the rados
>>>>>> object
>>>>>>>>> size?
>>>>>>>>>>>
>>>>>>>>>>> If so, how will you map rados objects to the logical k+m objects
>>>>>> and
>>>>>>>>> vice versa?
>>>>>>>>>>>
>>>>>>>>>>> If not, then the initial API needed an offset and length (either
>>>>>>>>> logical or rados object).
>>>>>>>>>>>
>>>>>>>>>>> I would assume that you would want to operate on rados sized
>>>>>> objects.
>>>>>>>>> Given a fixed k+m, then you may have more than one set of k+m objects
>>>>>>> per
>>>>>>>>> file. This is ignoring the LRC "local" parity blocks. For example, if
>>>>>>> the
>>>>>>>>> rados object size is 1 MB and k = 10 and m = 4 (as in the Xorbas
>>>>>> video),
>>>>>>>>> then for a 20 MB file one would need two sets of encoding blocks. The
>>>>>>>>> first for objects 1-10 and the second for objects 11-20.
>>>>>>>>>>>
>>>>>>>>>>> Perhaps, this is what the context is above. If so, it should have
>>>>>> the
>>>>>>>>> logical offset and rados object size, no?
>>>>>>>>>>>
>>>>>>>>>>> I see value in the Xorbas concept and I wonder if the jerasure
>>>>>>> library
>>>>>>>>> can be modified to generate the local parity blocks such that they
>>>>>> can
>>>>>>> be
>>>>>>>>> used to generate the global parity blocks. That would be a question
>>>>>> for
>>>>>>>>> Jim Plank.
>>>>>>>>>>
>>>>>>>>>> The benefit of the Xorbas concept is reduced network and disk I/O
>>>>>> for
>>>>>>>>> failures while maintaining traditional RS's higher fault-tolerance
>>>>>> than
>>>>>>> 3x
>>>>>>>>> replication while using less space.
>>>>>>>>>>
>>>>>>>>>> You can do almost the same thing with jerasure without modifying it
>>>>>> at
>>>>>>>>> all. Using the values from the Xorbas video, they have k data blocks,
>>>>>> m
>>>>>>>>> global parity blocks, and 2 local parity blocks (generated from k/2
>>>>>>> data
>>>>>>>>> blocks) for a total of k+m+2 blocks on disk that can tolerate any m
>>>>>>>>> failures. In their example, k = 10 and m = 4. They store 16 blocks
>>>>>> for
>>>>>>>>> each 10 data blocks.
>>>>>>>>>>
>>>>>>>>>> If you use traditional RS encoding via jerasure and use the same
>>>>>>> amount
>>>>>>>>> of storage (16 blocks for each 10 data blocks), you could encode 3
>>>>>>> parity
>>>>>>>>> blocks for each 5 data blocks. This would consume 16 blocks for
>>>>>>> each
>>>>>>>>> 10 data blocks and the fault-tolerance would be variable from 3-6
>>>>>>> failures
>>>>>>>>> depending on how the failures fell between the two groups of 5 blocks
>>>>>>>>> which is higher than the static 4 failures for the Xorbas code. The
>>>>>> I/O
>>>>>>> to
>>>>>>>>> recover from a single failure for both schemes is 5 blocks so it is
>>>>>> as
>>>>>>>>> efficient as Xorbas. On average, it provides more fault-tolerance,
>>>>>> but
>>>>>>> it
>>>>>>>>> can be less (four failures from one group of 5 data + 3 parity
>>>>>> blocks),
>>>>>>>>> but that worst case is the same as 3x replication.
>>>>>>>>>>
>>>>>>>>>> Scott
>>>>>
>>>>>
>>>>
>>>> -- 
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>> All that is necessary for the triumph of evil is that good people do nothing.
>>>>
>>>
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do nothing.
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Erasure coding library API
  2013-07-02 21:33     ` Samuel Just
  2013-07-03  2:12       ` Paul Von-Stamwitz
@ 2013-07-05 16:50       ` Loic Dachary
  1 sibling, 0 replies; 16+ messages in thread
From: Loic Dachary @ 2013-07-05 16:50 UTC (permalink / raw)
  To: Samuel Just; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 6188 bytes --]

Hi Sam,

I updated the API description at

https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#erasure-code-library-abstract-api

Please let me know if I missed something.
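
In a nutshell, combining the original context/encode/decode/repair calls with the minimum_to_read suggestion quoted below, the abstract interface looks roughly like this (a condensed, illustrative C++ sketch, not the exact text of the document; buffer types and the context parameters are simplified):

    #include <map>
    #include <set>
    #include <vector>

    // Condensed sketch of the abstract erasure code API discussed in
    // this thread. Error handling and the (k, m, algorithm) context
    // setup are omitted for brevity.
    class ErasureCodeInterface {
    public:
      virtual ~ErasureCodeInterface() {}

      // Encode an object into k data chunks and m coding chunks.
      virtual int encode(const std::vector<char> &object,
                         std::map<int, std::vector<char> > &chunks) = 0;

      // Rebuild the chunks listed in want_to_read from the chunks
      // provided; fails if minimum_to_read cannot be satisfied.
      virtual int decode(const std::set<int> &want_to_read,
                         const std::map<int, std::vector<char> > &chunks,
                         std::map<int, std::vector<char> > &decoded) = 0;

      // Smallest set of chunks required to read/rebuild want_to_read.
      virtual std::set<int> minimum_to_read(
        const std::set<int> &want_to_read,
        const std::set<int> &available_chunks) = 0;

      // Same, picking the cheapest set given per-chunk read costs.
      virtual std::set<int> minimum_to_read_with_cost(
        const std::set<int> &want_to_read,
        const std::map<int, int> &available) = 0;
    };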

Cheers

On 02/07/2013 23:33, Samuel Just wrote:
> I think we should be able to cover most cases by adding an interface like:
> 
> set<int> minimum_to_read(const set<int> &want_to_read, const set<int>
> &available_chunks);
> 
> which returns the smallest set required to read/rebuild the chunks in
> want_to_read given the chunks in available_chunks.  Alternately, we
> might include a "cost" for reading each chunk like
> 
> set<int> minimum_to_read_with_cost(const set<int> &want_to_read, const
> map<int, int> &available)
> 
> which returns the minimum cost set required to read the specified
> chunks given a mapping of available chunks to costs.  The costs might
> allow us to consider the difference between reading local chunks vs
> remote chunks.  This should be sufficient to cover the read case (esp
> the degraded read case) and the repair case.
> -Sam
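
For a plain Reed-Solomon code this is straightforward: if every wanted chunk is still available, read exactly those; otherwise any k available chunks allow rebuilding everything. A sketch (illustrative only; here k is passed explicitly, while in the real interface it would come from the erasure code context):

    #include <set>

    // Sketch of minimum_to_read for a plain (k+m) Reed-Solomon code.
    std::set<int> minimum_to_read(const std::set<int> &want_to_read,
                                  const std::set<int> &available_chunks,
                                  int k)
    {
      // If every wanted chunk survived, read exactly those chunks.
      bool all_available = true;
      for (std::set<int>::const_iterator i = want_to_read.begin();
           i != want_to_read.end(); ++i)
        if (available_chunks.count(*i) == 0) {
          all_available = false;
          break;
        }
      if (all_available)
        return want_to_read;

      // Otherwise any k available chunks are enough to rebuild all of
      // them. The caller must check that k chunks were actually found:
      // fewer than k means the data is unrecoverable.
      std::set<int> minimum;
      for (std::set<int>::const_iterator i = available_chunks.begin();
           i != available_chunks.end() && (int)minimum.size() < k; ++i)
        minimum.insert(*i);
      return minimum;
    }

An LRC implementation would instead prefer chunks from the local group of an erased chunk, which is where the network and disk I/O savings come from.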
> 
> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchleyes@ornl.gov> wrote:
>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchleyes@ornl.gov> wrote:
>>
>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary <loic@dachary.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> Today Sam pointed out that the API for LRC ( Xorbas Hadoop Project Page, Locally Repairable Codes (LRC) http://smahesh.com/HadoopUSC/ for instance ) would need to be different from the one initially proposed:
>>>
>>> An interesting video. Not as entertaining as Jim Plank's video. ;-)
>>>
>>> While Plank's focused on the processor requirements for encoding/decoding, this video focuses on the network and disk I/O requirements.
>>>
>>>>   context(k, m, reed-solomon|...) => context* c
>>>>   encode(context* c, void* data) => void* chunks[k+m]
>>>>   decode(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* data // erased chunks are
>>>> not used
>>>>   repair(context* c, void* chunk[k+m], int* indices_of_erased_chunks) => void* chunks[k+m] // erased
>>>> chunks are rebuilt
>>>>
>>>> The decode function must allow for partial read:
>>>>
>>>>   decode(context* c, int offset, int length, void* chunk[k+m], int* indices_of_erased_chunks, int* missing_chunks) => void* data
>>>>
>>>> If there are not enough chunks to recover the desired data range [offset, offset+length) the function returns NULL and sets missing_chunks to the list of chunks that must be retrieved in order to be able to read the desired data.
>>>>
>>>> If decode is called to read just 1 chunk and it is missing, reed-solomon would return an error and ask for all other chunks to repair. If the underlying library implements LRC, it would ask for a subset of the chunks.
>>>>
>>>> An implementation allowing only full reads and using jerasure ( which does not do LRC ) requires that offset is always zero, length is the size of the object and returns a copy of indices_of_erased_chunks if there are not enough chunks to rebuild the missing ones.
>>>>
>>>> Comments are welcome :-)
>>>
>>> I have loosely followed this discussion and I have not looked closely at the proposed API nor at the jerasure interface. My apologies if this has already been addressed.
>>>
>>> It is not clear to me from the above proposed API (ignoring the partial read) what it would do. Was the original intent to encode the entire file using k+m blocks regardless of the file size and of the rados object size?
>>>
>>> If so, how will you map rados objects to the logical k+m objects and vice versa?
>>>
>>> If not, then the initial API needed an offset and length (either logical or rados object).
>>>
>>> I would assume that you would want to operate on rados-sized objects. Given a fixed k+m, then you may have more than one set of k+m objects per file. This is ignoring the LRC "local" parity blocks. For example, if the rados object size is 1 MB and k = 10 and m = 4 (as in the Xorbas video), then for a 20 MB file one would need two sets of encoding blocks. The first for objects 1-10 and the second for objects 11-20.
>>>
>>> Perhaps, this is what the context is above. If so, it should have the logical offset and rados object size, no?
>>>
>>> I see value in the Xorbas concept and I wonder if the jerasure library can be modified to generate the local parity blocks such that they can be used to generate the global parity blocks. That would be a question for Jim Plank.
>>
>> The benefit of the Xorbas concept is reduced network and disk I/O on failures, while maintaining traditional RS's higher fault-tolerance than 3x replication and using less space.
>>
>> You can do almost the same thing with jerasure without modifying it at all. Using the values from the Xorbas video, they have k data blocks, m global parity blocks, and 2 local parity blocks (generated from k/2 data blocks) for a total of k+m+2 blocks on disk that can tolerate any m failures. In their example, k = 10 and m = 4. They store 16 blocks for each 10 data blocks.
>>
>> If you use traditional RS encoding via jerasure and use the same amount of storage (16 blocks for each 10 data blocks), you could encode 3 parity blocks for each 5 data blocks. This would consume 16 blocks for each 10 data blocks and the fault-tolerance would be variable from 3-6 failures depending on how the failures fell between the two groups of 5 blocks, which is higher than the static 4 failures for the Xorbas code. The I/O to recover from a single failure for both schemes is 5 blocks so it is as efficient as Xorbas. On average, it provides more fault-tolerance, but it can be less (four failures from one group of 5 data + 3 parity blocks), but that worst case is the same as 3x replication.
>>
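
Restating the arithmetic above side by side (same numbers, nothing new):

    Xorbas LRC:   10 data + 4 global parity + 2 local parity = 16 blocks,
                  tolerates any 4 failures
    2 x RS(5,3):  2 x (5 data + 3 parity) = 16 blocks,
                  tolerates any 3 failures, up to 6 if they are spread
                  evenly across the two groups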
>> Scott

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Erasure coding library API
  2013-07-05 16:34                     ` Loic Dachary
@ 2013-07-05 16:55                       ` Atchley, Scott
  0 siblings, 0 replies; 16+ messages in thread
From: Atchley, Scott @ 2013-07-05 16:55 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On Jul 5, 2013, at 12:34 PM, Loic Dachary <loic@dachary.org> wrote:

> Hi Scott,
>
> That's unfortunate indeed and I now better understand why it is necessary to add a signature to the chunks. I added the following to:
>
> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
>
> Bit flips happen. Not often, but it is possible. Here is an article from 2011; also search for "bit rot" and "bit error rate". To detect corrupted chunks, a signature (SHA1 for instance) should be added as an attribute of the file containing the chunk so that deep scrubbing can check that the chunk is valid by rehashing the content of the chunk and comparing it with the signature.

Note, this can be local to the OSD to avoid sending an invalid chunk over the network.
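
For instance, a local deep-scrub check could look roughly like the following sketch. The xattr name and the use of OpenSSL's SHA1 are assumptions for illustration, not an existing Ceph interface:

    #include <openssl/sha.h>   // SHA1()
    #include <string.h>        // memcmp()
    #include <sys/xattr.h>     // getxattr()

    // Sketch: rehash a chunk and compare it against a signature stored
    // as an extended attribute of the file holding the chunk. The xattr
    // name "user.chunk.sha1" is made up for illustration.
    int chunk_is_valid(const char *path,
                       const unsigned char *data, size_t length)
    {
      unsigned char computed[SHA_DIGEST_LENGTH];
      unsigned char stored[SHA_DIGEST_LENGTH];

      SHA1(data, length, computed);  // rehash the chunk content
      if (getxattr(path, "user.chunk.sha1",
                   stored, sizeof(stored)) != (ssize_t)sizeof(stored))
        return 0;  // missing or short signature: treat as corrupted
      return memcmp(computed, stored, sizeof(stored)) == 0;
    }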

> While writing a more detailed version of the API taking your comments and Sam's comment into account, I realized that I don't know if it is possible to recover a parity chunk. If a data chunk is missing, it is enough to recover it with a decode operation and write it back (assuming a systematic code is used). Would it be possible to do something similar with parity chunks? Or would we need to re-encode all the parity chunks to write just one of them? I assume the latter but ... I'm not sure ;-)

Yes, you can recover a parity chunk just as you would a data chunk. I have not used the jerasure library, so I do not know what it requires.
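
With the repair() call proposed earlier in this thread, rebuilding a lost parity chunk would then look the same as rebuilding a data chunk, roughly:

    // Sketch: rebuild a lost parity chunk by marking its index erased
    // and letting the library recompute it from the surviving chunks.
    // Assumes data chunks occupy indices 0..k-1 and parity chunks
    // k..k+m-1, and a -1 terminator for the erased list; both are
    // assumptions, the proposal does not spell them out.
    void rebuild_first_parity(context *c, void *chunks[], int k)
    {
      int erased[] = { k, -1 };   // say the first parity chunk was lost
      void **rebuilt = repair(c, chunks, erased);
      // rebuilt[k] now holds the recomputed parity chunk, ready to be
      // written back in place of the lost one.
      (void)rebuilt;
    }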

>
> Cheers
>
> On 05/07/2013 17:02, Atchley, Scott wrote:
>> On Jul 5, 2013, at 10:06 AM, Loic Dachary <loic@dachary.org> wrote:
>>
>>> On 05/07/2013 14:13, Atchley, Scott wrote:
>>>> Loic,
>>>>
>>>> Erasure codes take whatever you give them. You need to verify the chunk before using it. Perhaps storing the checksum in the metadata/context that describes the parity object?
>>>
>>> Hi Scott,
>>>
>>> Does that mean that if I give the chunks A + B + C to decode() and B is corrupted but A and C are ok, it will return incorrectly decoded content?
>>
>> Unfortunately, yes.
>>
>>> I'm curious to know the answer but I don't think it is an actual problem. A corrupted chunk would mean that the underlying file system corrupted the content of one of its files. I can't remember the last time I saw that happen ;-)
>>
>> Bit flips happen. Not often, but it is possible. Here is an article from 2011:
>>
>> http://www.linux-mag.com/id/8794/
>>
>> also search for "bit rot" and "bit error rate".
>>
>> Scott
>>
>>>
>>> Cheers
>>>
>>>>
>>>> Scott
>>>>
>>>> On Jul 4, 2013, at 9:24 AM, Loic Dachary <loic@dachary.org> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I was thinking about scrubbing of erasure coded chunks and realized I don't know the answer to this very simple question: what happens when a chunk is corrupted? I.e., if AB is coded with 2+1 into A + B (data) + Z (parity) and Z is replaced with Q, would Reed-Solomon ignore/discard the corrupted chunk? If that's the case I think it slightly changes what the API should be.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 04/07/2013 05:06, Paul Von-Stamwitz wrote:
>>>>>> Scott, et al.
>>>>>>
>>>>>> Here is an interesting paper from Usenix HotStorage Conference which provides local codes without additional capacity overhead.
>>>>>>
>>>>>> Check it out. (abstract with links to paper and slides)
>>>>>> https://www.usenix.org/conference/hotstorage13/solution-network-challenges-data-recovery-erasure-coded-distributed-storage
>>>>>>
>>>>>> Cheers,
>>>>>> pvs
>>>>>>
>>>>>>> On Jul 3, 2013, at 11:19 AM, Paul Von-Stamwitz wrote:
>>>>>>>
>>>>>>> Hi Scott,
>>>>>>>
>>>>>>> Point taken.
>>>>>>>
>>>>>>> I was thinking about Loic's decode description where k+m was requested and
>>>>>>> data was decoded when k blocks were received. But he was referring to full
>>>>>>> stripe reads where all the memory is allocated.
>>>>>>>
>>>>>>> Degraded reads and block repair are a different matter.
>>>>>>>
>>>>>>> pvs
>>>>>>>
>>>>>>>> On Jul 3, 2013, at 4:53 AM Scott Atchley wrote:
>>>>>>>>
>>>>>>>> On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz
>>>>>>>> <PVonStamwitz@us.fujitsu.com> wrote:
>>>>>>>>
>>>>>>>>> Scott,
>>>>>>>>>
>>>>>>>>> You make a good point comparing (5/3) RS with Xorbas, but a small nit:
>>>>>>>>>
>>>>>>>>> "The I/O to recover from a single failure for both schemes is 5 blocks
>>>>>>>> so it is as efficient as Xorbas."
>>>>>>>>>
>>>>>>>>> Maybe not. You would probably issue I/O to all the remaining 7 blocks
>>>>>>> to
>>>>>>>> cover for the possibility of double errors. The time to reconstruct
>>>>>>> would
>>>>>>>> be about the same, but there could be more disk and network I/O. (LRC
>>>>>>> will
>>>>>>>> need to issue I/O to the rest of the global stripe if it detected
>>>>>>> multiple
>>>>>>>> local errors.)
>>>>>>>>
>>>>>>>> Why would you request more than five? If one cannot be read, request
>>>>>>>> another.
>>>>>>>>
>>>>>>>> Also, I am not sure that you want to request five at once since it will
>>>>>>>> lead to spikes in network traffic and require memory for all five blocks.
>>>>>>>> You will need at least two buffers. Request the first two and start the
>>>>>>>> decoding. You may want a third buffer to overlap the decoding of the
>>>>>>>> current block with the communication for the next block. It may be that
>>>>>>>> the decode time is less than the communication and, in that case, you
>>>>>>> will
>>>>>>>> want to request all of the blocks at once.
>>>>>>>>
>>>>>>>>> What I like about Xorbas is that it is an extension of an (x,y) RS. You
>>>>>>>> can start with traditional RS. If degraded reads and repaired blocks are
>>>>>>>> causing a problem, you can add the LRC. If capacity is an issue, you can
>>>>>>>> take it out.
>>>>>>>>
>>>>>>>> I like it too and Microsoft has something similar with Pyramid codes.
>>>>>>> That
>>>>>>>> said, my example using traditional RS can provide more fault-tolerance
>>>>>>> on
>>>>>>>> average given the same amount of storage overhead.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>> On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote:
>>>>>>>>>> I think we should be able to cover most cases by adding an interface
>>>>>>>> like:
>>>>>>>>>>
>>>>>>>>>> set<int> minimum_to_read(const set<int> &want_to_read, const set<int>
>>>>>>>>>> &available_chunks);
>>>>>>>>>>
>>>>>>>>>> which returns the smallest set required to read/rebuild the chunks in
>>>>>>>>>> want_to_read given the chunks in available_chunks.  Alternately, we
>>>>>>>> might
>>>>>>>>>> include a "cost" for reading each chunk like
>>>>>>>>>>
>>>>>>>>>> set<int> minimum_to_read_with_cost(const set<int> &want_to_read,
>>>>>>> const
>>>>>>>>>> map<int, int> &available)
>>>>>>>>>>
>>>>>>>>>> which returns the minimum cost set required to read the specified
>>>>>>>> chunks
>>>>>>>>>> given a mapping of available chunks to costs.  The costs might allow
>>>>>>> us
>>>>>>>> to
>>>>>>>>>> consider the difference between reading local chunks vs remote chunks.
>>>>>>>>>> This should be sufficient to cover the read case (esp the degraded
>>>>>>> read
>>>>>>>>>> case) and the repair case.
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott <atchleyes@ornl.gov>
>>>>>>>>>> wrote:
>>>>>>>>>>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" <atchleyes@ornl.gov>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary <loic@dachary.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Today Sam pointed out that the API for LRC ( Xorbas Hadoop Project
>>>>>>>>>> Page, Locally Repairable Codes (LRC) http://smahesh.com/HadoopUSC/
>>>>>>> for
>>>>>>>>>> instance ) would need to be different from the one initially proposed:
>>>>>>>>>>>>
>>>>>>>>>>>> An interesting video. Not as entertaining as Jim Plank's video. ;-)
>>>>>>>>>>>>
>>>>>>>>>>>> While Plank's focused on the processor requirements for
>>>>>>>>>> encoding/decoding, this video focuses on the network and disk I/O
>>>>>>>>>> requirements.
>>>>>>>>>>>>
>>>>>>>>>>>>> context(k, m, reed-solomon|...) => context* c
>>>>>>>>>>>>> encode(context* c, void* data) => void* chunks[k+m]
>>>>>>>>>>>>> decode(context* c, void* chunk[k+m], int*
>>>>>>>>>>>>> indices_of_erased_chunks) => void* data // erased chunks are not
>>>>>>>> used
>>>>>>>>>>>>> repair(context* c, void* chunk[k+m], int*
>>>>>>>>>>>>> indices_of_erased_chunks) => void* chunks[k+m] // erased chunks
>>>>>>> are
>>>>>>>>>>>>> rebuilt
>>>>>>>>>>>>>
>>>>>>>>>>>>> The decode function must allow for partial read:
>>>>>>>>>>>>>
>>>>>>>>>>>>> decode(context* c, int offset, int length, void* chunk[k+m], int*
>>>>>>>>>>>>> indices_of_erased_chunks, int* missing_chunks) => void* data
>>>>>>>>>>>>>
>>>>>>>>>>>>> If there are not enough chunks to recover the desired data range
>>>>>>>>>> [offset, offset+length) the function returns NULL and sets
>>>>>>>> missing_chunks
>>>>>>>>>> to the list of chunks that must be retrieved in order to be able to
>>>>>>>> read
>>>>>>>>>> the desired data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If decode is called to read just 1 chunk and it is missing, reed-
>>>>>>>>>> solomon would return an error and ask for all other chunks to repair.
>>>>>>>> If
>>>>>>>>>> the underlying library implements LRC, it would ask for a subset of
>>>>>>> the
>>>>>>>>>> chunks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> An implementation allowing only full reads and using jerasure
>>>>>>>> ( which
>>>>>>>>>> does not do LRC ) requires that offset is always zero, length is the
>>>>>>>> size
>>>>>>>>>> of the object and returns a copy of indices_of_erased_chunks if there
>>>>>>>> are
>>>>>>>>>> not enough chunks to rebuild the missing ones.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Comments are welcome :-)
>>>>>>>>>>>>
>>>>>>>>>>>> I have loosely followed this discussion and I have not looked
>>>>>>> closely
>>>>>>>>>> at the proposed API nor at the jerasure interface. My apologies if
>>>>>>> this
>>>>>>>>>> has already been addressed.
>>>>>>>>>>>>
>>>>>>>>>>>> It is not clear to me from the above proposed API (ignoring the
>>>>>>>> partial
>>>>>>>>>> read) what it would do. Was the original intent to encode the entire
>>>>>>>> file
>>>>>>>>>> using k+m blocks regardless of the file size and of the rados
>>>>>>> object
>>>>>>>>>> size?
>>>>>>>>>>>>
>>>>>>>>>>>> If so, how will you map rados objects to the logical k+m objects
>>>>>>> and
>>>>>>>>>> vice versa?
>>>>>>>>>>>>
>>>>>>>>>>>> If not, then the initial API needed an offset and length (either
>>>>>>>>>> logical or rados object).
>>>>>>>>>>>>
>>>>>>>>>>>> I would assume that you would want to operate on rados sized
>>>>>>> objects.
>>>>>>>>>> Given a fixed k+m, then you may have more than one set of k+m objects
>>>>>>>> per
>>>>>>>>>> file. This is ignoring the LRC "local" parity blocks. For example, if
>>>>>>>> the
>>>>>>>>>> rados object size is 1 MB and k = 10 and m = 4 (as in the Xorbas
>>>>>>> video),
>>>>>>>>>> then for a 20 MB file one would need two sets of encoding blocks. The
>>>>>>>>>> first for objects 1-10 and the second for objects 11-20.
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps, this is what the context is above. If so, it should have
>>>>>>> the
>>>>>>>>>> logical offset and rados object size, no?
>>>>>>>>>>>>
>>>>>>>>>>>> I see value in the Xorbas concept and I wonder if the jerasure
>>>>>>>> library
>>>>>>>>>> can be modified to generate the local parity blocks such that they
>>>>>>> can
>>>>>>>> be
>>>>>>>>>> used to generate the global parity blocks. That would be a question
>>>>>>> for
>>>>>>>>>> Jim Plank.
>>>>>>>>>>>
>>>>>>>>>>> The benefit of the Xorbas concept is reduced network and disk I/O
>>>>>>> for
>>>>>>>>>> failures while maintaining traditional RS's higher fault-tolerance
>>>>>>> than
>>>>>>>> 3x
>>>>>>>>>> replication while using less space.
>>>>>>>>>>>
>>>>>>>>>>> You can do almost the same thing with jerasure without modifying it
>>>>>>> at
>>>>>>>>>> all. Using the values from the Xorbas video, they have k data blocks,
>>>>>>> m
>>>>>>>>>> global parity blocks, and 2 local parity blocks (generated from k/2
>>>>>>>> data
>>>>>>>>>> blocks) for a total of k+m+2 blocks on disk that can tolerate any m
>>>>>>>>>> failures. In their example, k = 10 and m = 4. They store 16 blocks
>>>>>>> for
>>>>>>>>>> each 10 data blocks.
>>>>>>>>>>>
>>>>>>>>>>> If you use traditional RS encoding via jerasure and use the same
>>>>>>>> amount
>>>>>>>>>> of storage (16 blocks for each 10 data blocks), you could encode 3
>>>>>>>> parity
>>>>>>>>>> blocks for each 5 data blocks. This would consume 16 blocks for
>>>>>>>> each
>>>>>>>>>> 10 data blocks and the fault-tolerance would be variable from 3-6
>>>>>>>> failures
>>>>>>>>>> depending on how the failures fell between the two groups of 5 blocks
>>>>>>>>>> which is higher than the static 4 failures for the Xorbas code. The
>>>>>>> I/O
>>>>>>>> to
>>>>>>>>>> recover from a single failure for both schemes is 5 blocks so it is
>>>>>>> as
>>>>>>>>>> efficient as Xorbas. On average, it provides more fault-tolerance,
>>>>>>> but
>>>>>>>> it
>>>>>>>>>> can be less (four failures from one group of 5 data + 3 parity
>>>>>>> blocks),
>>>>>>>>>> but that worst case is the same as 3x replication.
>>>>>>>>>>>
>>>>>>>>>>> Scott
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>> All that is necessary for the triumph of evil is that good people do nothing.
>>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do nothing.
>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2013-07-05 16:55 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-01 23:00 Erasure coding library API Loic Dachary
2013-07-02 14:07 ` Atchley, Scott
2013-07-02 17:14   ` Atchley, Scott
2013-07-02 21:33     ` Samuel Just
2013-07-03  2:12       ` Paul Von-Stamwitz
2013-07-03 11:53         ` Atchley, Scott
2013-07-03 18:19           ` Paul Von-Stamwitz
2013-07-04  3:06           ` Paul Von-Stamwitz
2013-07-04 13:24             ` Loic Dachary
2013-07-05 12:13               ` Atchley, Scott
2013-07-05 14:06                 ` Loic Dachary
2013-07-05 15:02                   ` Atchley, Scott
2013-07-05 16:34                     ` Loic Dachary
2013-07-05 16:55                       ` Atchley, Scott
2013-07-05 12:39             ` Atchley, Scott
2013-07-05 16:50       ` Loic Dachary

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.