* Re: CEPH Erasure Encoding + OSD Scalability
       [not found] <3472A07E6605974CBC9BC573F1BC02E494B06990@PLOXCHG04.cern.ch>
@ 2013-07-05 21:23 ` Loic Dachary
  2013-07-06 13:45   ` Andreas Joachim Peters
       [not found]   ` <CAGhffvx5-xmprT-vL1VNrz12+pJSikg1WsUqy_JRdW0JNm5auQ@mail.gmail.com>
  0 siblings, 2 replies; 52+ messages in thread
From: Loic Dachary @ 2013-07-05 21:23 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 2025 bytes --]

Hi Andreas,

On 04/07/2013 23:01, Andreas Joachim Peters wrote:
> Hi Loic,
> thanks for the responses!
> 
> Maybe this is useful for your erasure code discussion:
> 
> as an example, in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create 2 parity chunks.
> 
> Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
> 
> You can now easily detect data corruption using the local checksums and avoid reading any parity information and doing (C)RS decoding if no corruption is detected. Moreover, CRC32C computation is distributed over several (in this case 4) machines, while (C)RS decoding would run on the single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...

What does (C)RS mean? (C)Reed-Solomon?

> In our case we write this checksum information separately from the original data ... while in a block-based storage like CEPH it would probably be inlined in the data chunk.
> If an OSD detects that it runs on BTRFS or ZFS, one could automatically disable the CRC32C code.

Nice. I did not know that was built-in :-) 
https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
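For illustration, the per-4k-block checksumming Andreas describes could look roughly like this (a sketch of mine, not the EOS or Ceph implementation; `checksum_blocks` and `find_corrupted_block` are invented names, and a slow bit-by-bit software CRC32C stands in for the SSE4.2 instruction):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Bit-by-bit software CRC32C (Castagnoli polynomial, reflected form
 * 0x82F63B78). Slow, but produces the same result as the SSE4.2
 * hardware instruction. */
static uint32_t crc32c(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* One 4-byte checksum per 4 KiB block: the ~0.1% overhead from the
 * mail. Returns the number of blocks written to sums[]. */
static size_t checksum_blocks(const unsigned char *chunk, size_t len,
                              uint32_t *sums)
{
    size_t n = 0;
    for (size_t off = 0; off < len; off += BLOCK_SIZE, n++) {
        size_t blen = len - off < BLOCK_SIZE ? len - off : BLOCK_SIZE;
        sums[n] = crc32c(chunk + off, blen);
    }
    return n;
}

/* Index of the first corrupted block, or -1 if the chunk is clean. */
static long find_corrupted_block(const unsigned char *chunk, size_t len,
                                 const uint32_t *sums)
{
    size_t n = 0;
    for (size_t off = 0; off < len; off += BLOCK_SIZE, n++) {
        size_t blen = len - off < BLOCK_SIZE ? len - off : BLOCK_SIZE;
        if (crc32c(chunk + off, blen) != sums[n])
            return (long)n;
    }
    return -1;
}
```

Only when `find_corrupted_block()` reports a hit does the reader need to fetch parity chunks and run the far more expensive (C)RS decode.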

> (wouldn't CRC32C be also useful for normal CEPH block replication? )

I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing

https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731

Cheers

> As far as I know, with the RS CODEC we use you can miss stripes (data = 0) in the decoding process, but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
> 
> Cheers Andreas.

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
  2013-07-05 21:23 ` CEPH Erasure Encoding + OSD Scalability Loic Dachary
@ 2013-07-06 13:45   ` Andreas Joachim Peters
  2013-07-06 15:28     ` Mark Nelson
       [not found]   ` <CAGhffvx5-xmprT-vL1VNrz12+pJSikg1WsUqy_JRdW0JNm5auQ@mail.gmail.com>
  1 sibling, 1 reply; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-07-06 13:45 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Hi Loic,
(C)RS stands for the Cauchy Reed-Solomon codes, which are based on pure parity (XOR) operations, while the standard Reed-Solomon codes need more multiplications and are slower.

Considering the checksumming ... for comparison, the CRC32 code from libz runs on an 8-core Xeon at ~730 MB/s for small block sizes, while the SSE4.2 CRC32C checksum runs at ~2 GB/s.

Cheers Andreas.
________________________________________
From: Loic Dachary [loic@dachary.org]
Sent: 05 July 2013 23:23
To: Andreas Joachim Peters
Cc: ceph-devel@vger.kernel.org
Subject: Re: CEPH Erasure Encoding + OSD Scalability

Hi Andreas,

On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
> thanks for the responses!
>
> Maybe this is useful for your erasure code discussion:
>
> as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create a 2 parity chunks.
>
> Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>
> You can now easily detect data corruption using the local checksums and avoid to read any parity information and (C)RS decoding if there is no corruption detected. Moreover CRC32C computation is distributed over several (in this case 4) machines while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...

What does (C)RS mean ? (C)Reed-Solomon ?

> In our case we write this checksum information separate from the original data ... while in a block-based storage like CEPH it would be probably inlined in the data chunk.
> If an OSD detects to run on BRTFS or ZFS one could disable automatically the CRC32C code.

Nice. I did not know that was built-in :-)
https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing

> (wouldn't CRC32C be also useful for normal CEPH block replication? )

I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing

https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731

Cheers

> As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
>
> Cheers Andreas.

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-07-06 13:45   ` Andreas Joachim Peters
@ 2013-07-06 15:28     ` Mark Nelson
  2013-07-06 20:43       ` Loic Dachary
  0 siblings, 1 reply; 52+ messages in thread
From: Mark Nelson @ 2013-07-06 15:28 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: Loic Dachary, ceph-devel

Hi Guys,

For what it's worth, we just added SSE 4.2 CRC32c for architectures that 
support it:

https://github.com/ceph/ceph/commit/7c59288d9168ddef3b3dc570464ae9a1f180d18c#src/common/crc32c-intel.c

Mark

On 07/06/2013 08:45 AM, Andreas Joachim Peters wrote:
> HI Loic,
> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
>
> Considering the checksumming ... for comparison the CRC32 code from libz run's on a 8-core Xeon at ~730 MB/s for small block sizes while SSE4.2 CRC32C checksum run's at ~2GByte/s.
>
> Cheers Andreas.
> ________________________________________
> From: Loic Dachary [loic@dachary.org]
> Sent: 05 July 2013 23:23
> To: Andreas Joachim Peters
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>
> Hi Andreas,
>
> On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
>> thanks for the responses!
>>
>> Maybe this is useful for your erasure code discussion:
>>
>> as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create a 2 parity chunks.
>>
>> Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>>
>> You can now easily detect data corruption using the local checksums and avoid to read any parity information and (C)RS decoding if there is no corruption detected. Moreover CRC32C computation is distributed over several (in this case 4) machines while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>
> What does (C)RS mean ? (C)Reed-Solomon ?
>
>> In our case we write this checksum information separate from the original data ... while in a block-based storage like CEPH it would be probably inlined in the data chunk.
>> If an OSD detects to run on BRTFS or ZFS one could disable automatically the CRC32C code.
>
> Nice. I did not know that was built-in :-)
> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>
>> (wouldn't CRC32C be also useful for normal CEPH block replication? )
>
> I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
>
> https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>
> Cheers
>
>> As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
>>
>> Cheers Andreas.
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-07-06 15:28     ` Mark Nelson
@ 2013-07-06 20:43       ` Loic Dachary
  2013-07-08 15:38         ` Mark Nelson
  0 siblings, 1 reply; 52+ messages in thread
From: Loic Dachary @ 2013-07-06 20:43 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 3609 bytes --]

Hi Mark,

Nice :-) I'm curious about how it's used. Is it computed every time an object is written to disk? Or is it part of the WRITE messages that are sent to the replicas?

Cheers

On 06/07/2013 17:28, Mark Nelson wrote:
> Hi Guys,
> 
> For what it's worth, we just added SSE 4.2 CRC32c for architectures that support it:
> 
> https://github.com/ceph/ceph/commit/7c59288d9168ddef3b3dc570464ae9a1f180d18c#src/common/crc32c-intel.c
> 
> Mark
> 
> On 07/06/2013 08:45 AM, Andreas Joachim Peters wrote:
>> HI Loic,
>> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
>>
>> Considering the checksumming ... for comparison the CRC32 code from libz run's on a 8-core Xeon at ~730 MB/s for small block sizes while SSE4.2 CRC32C checksum run's at ~2GByte/s.
>>
>> Cheers Andreas.
>> ________________________________________
>> From: Loic Dachary [loic@dachary.org]
>> Sent: 05 July 2013 23:23
>> To: Andreas Joachim Peters
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>>
>> Hi Andreas,
>>
>> On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
>>> thanks for the responses!
>>>
>>> Maybe this is useful for your erasure code discussion:
>>>
>>> as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create a 2 parity chunks.
>>>
>>> Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>>>
>>> You can now easily detect data corruption using the local checksums and avoid to read any parity information and (C)RS decoding if there is no corruption detected. Moreover CRC32C computation is distributed over several (in this case 4) machines while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>>
>> What does (C)RS mean ? (C)Reed-Solomon ?
>>
>>> In our case we write this checksum information separate from the original data ... while in a block-based storage like CEPH it would be probably inlined in the data chunk.
>>> If an OSD detects to run on BRTFS or ZFS one could disable automatically the CRC32C code.
>>
>> Nice. I did not know that was built-in :-)
>> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>>
>>> (wouldn't CRC32C be also useful for normal CEPH block replication? )
>>
>> I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
>>
>> https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>>
>> Cheers
>>
>>> As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
>>>
>>> Cheers Andreas.
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do nothing.
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
       [not found]   ` <CAGhffvx5-xmprT-vL1VNrz12+pJSikg1WsUqy_JRdW0JNm5auQ@mail.gmail.com>
@ 2013-07-06 20:47     ` Loic Dachary
  2013-07-07 21:04       ` Andreas Joachim Peters
  0 siblings, 1 reply; 52+ messages in thread
From: Loic Dachary @ 2013-07-06 20:47 UTC (permalink / raw)
  To: Andreas-Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 3230 bytes --]

Hi Andreas,

Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using

https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h

Do you know of a better / faster implementation? Is there a tradeoff between (C)RS and RS?

Cheers

On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
> HI Loic, 
> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
> 
> Considering the checksumming ... for comparison the CRC32 code from libz run's on a 8-core Xeon at ~730 MB/s for small block sizes while SSE4.2 CRC32C checksum run's at ~2GByte/s.
> 
> Cheers Andreas.
> 
> 
> 
> 
> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
> 
>     Hi Andreas,
> 
>     On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
>     > thanks for the responses!
>     >
>     > Maybe this is useful for your erasure code discussion:
>     >
>     > as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create a 2 parity chunks.
>     >
>     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>     >
>     > You can now easily detect data corruption using the local checksums and avoid to read any parity information and (C)RS decoding if there is no corruption detected. Moreover CRC32C computation is distributed over several (in this case 4) machines while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
> 
>     What does (C)RS mean ? (C)Reed-Solomon ?
> 
>     > In our case we write this checksum information separate from the original data ... while in a block-based storage like CEPH it would be probably inlined in the data chunk.
>     > If an OSD detects to run on BRTFS or ZFS one could disable automatically the CRC32C code.
> 
>     Nice. I did not know that was built-in :-)
>     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
> 
>     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
> 
>     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
> 
>     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
> 
>     Cheers
> 
>     > As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
>     >
>     > Cheers Andreas.
> 
>     --
>     Loïc Dachary, Artisan Logiciel Libre
>     All that is necessary for the triumph of evil is that good people do nothing.
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
  2013-07-06 20:47     ` Loic Dachary
@ 2013-07-07 21:04       ` Andreas Joachim Peters
  2013-07-08  3:37         ` Sage Weil
  2013-08-19 10:35         ` Loic Dachary
  0 siblings, 2 replies; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-07-07 21:04 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel


Hi Loic,
I don't think there is a better generic implementation. I just made a benchmark: the Jerasure library with the 'cauchy_good' algorithm gives 1.1 GB/s (Xeon 2.27 GHz) on a single core for a 4+2 encoding with w=32. To give a feeling: if you do 10+4 it is 300 MB/s. There is a specialized implementation in QFS (a Hadoop-like file system in C++) for (M+3) ... out of curiosity I will benchmark it against Jerasure.

In any case I would do an optimized implementation for 3+2, which would probably be the most performant variant, with the same reliability as standard 3-fold replication in CEPH while using only 53% of the space.

3+2 is trivial since you encode (A,B,C) with only two parity operations
P1 = A^B
P2 = B^C
and reconstruct with one or two parity operations:
A = P1^B
B = P1^A
B = P2^C
C = P2^B
and so on.

You can write this as a simple loop using advanced vector extensions on Intel (AVX). I can paste a benchmark tomorrow.
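As a sketch (mine, with invented names), the 3+2 scheme above written as the simple loop Andreas mentions; plain C like this usually auto-vectorizes, so explicit AVX intrinsics are often unnecessary. One caveat: with this particular pairing, losing a data chunk together with the parity that covers it (e.g. A and P1) is not recoverable, unlike a full 3+2 Reed-Solomon code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode: P1 = A^B, P2 = B^C, exactly as in the mail. */
static void encode_3p2(const uint8_t *a, const uint8_t *b, const uint8_t *c,
                       uint8_t *p1, uint8_t *p2, size_t n)
{
    for (size_t i = 0; i < n; i++) { /* simple loop; vectorizes well */
        p1[i] = a[i] ^ b[i];
        p2[i] = b[i] ^ c[i];
    }
}

/* One reconstruction case, e.g. A = P1^B; the others (B = P1^A,
 * B = P2^C, C = P2^B) are the same XOR loop with different inputs. */
static void xor_recover(const uint8_t *x, const uint8_t *y, uint8_t *out,
                        size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = x[i] ^ y[i];
}
```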

Considering the crc32c-intel code you added ... I would provide a function which computes a crc32c checksum and detects whether it can use SSE4.2, otherwise falling back to the standard algorithm; e.g. if you run in a virtual machine you need this fallback ...

Cheers Andreas.
________________________________________
From: Loic Dachary [loic@dachary.org]
Sent: 06 July 2013 22:47
To: Andreas Joachim Peters
Cc: ceph-devel@vger.kernel.org
Subject: Re: CEPH Erasure Encoding + OSD Scalability

Hi Andreas,

Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using

https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h

Do you know of a better / faster implementation ? Is there a tradeoff between (C)RS and RS ?

Cheers

On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
> HI Loic,
> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
>
> Considering the checksumming ... for comparison the CRC32 code from libz run's on a 8-core Xeon at ~730 MB/s for small block sizes while SSE4.2 CRC32C checksum run's at ~2GByte/s.
>
> Cheers Andreas.
>
>
>
>
> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>
>     Hi Andreas,
>
>     On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
>     > thanks for the responses!
>     >
>     > Maybe this is useful for your erasure code discussion:
>     >
>     > as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create a 2 parity chunks.
>     >
>     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>     >
>     > You can now easily detect data corruption using the local checksums and avoid to read any parity information and (C)RS decoding if there is no corruption detected. Moreover CRC32C computation is distributed over several (in this case 4) machines while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>
>     What does (C)RS mean ? (C)Reed-Solomon ?
>
>     > In our case we write this checksum information separate from the original data ... while in a block-based storage like CEPH it would be probably inlined in the data chunk.
>     > If an OSD detects to run on BRTFS or ZFS one could disable automatically the CRC32C code.
>
>     Nice. I did not know that was built-in :-)
>     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>
>     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
>
>     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
>
>     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>
>     Cheers
>
>     > As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
>     >
>     > Cheers Andreas.
>
>     --
>     Loïc Dachary, Artisan Logiciel Libre
>     All that is necessary for the triumph of evil is that good people do nothing.
>
>

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
  2013-07-07 21:04       ` Andreas Joachim Peters
@ 2013-07-08  3:37         ` Sage Weil
  2013-07-08 10:00           ` Andreas Joachim Peters
  2013-08-19 10:35         ` Loic Dachary
  1 sibling, 1 reply; 52+ messages in thread
From: Sage Weil @ 2013-07-08  3:37 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: Loic Dachary, ceph-devel

On Sun, 7 Jul 2013, Andreas Joachim Peters wrote:
> Considering the crc32c-intel code you added ... I would provide a 
> function which provides a crc32c checksum and detects if it can do it 
> using SSE4.2 or implements just the standard algorithm e.g if you run in 
> a virtual machine you need this emulation ...

The current code in master will do this detection by checking the cpu 
features; see

	https://github.com/ceph/ceph/blob/master/src/common/crc32c-intel.c#L74

If there is a better way to do this, I'd love to hear about it.  gcc 4.8 
just added a bunch of built-in functions to do this stuff cleanly, but 
it'll be quite a while before all of our build targets are on 4.8 or 
later.

sage


> 
> Cheers Andreas.
> ________________________________________
> From: Loic Dachary [loic@dachary.org]
> Sent: 06 July 2013 22:47
> To: Andreas Joachim Peters
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
> 
> Hi Andreas,
> 
> Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using
> 
> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
> 
> Do you know of a better / faster implementation ? Is there a tradeoff between (C)RS and RS ?
> 
> Cheers
> 
> On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
> > HI Loic,
> > (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
> >
> > Considering the checksumming ... for comparison the CRC32 code from libz run's on a 8-core Xeon at ~730 MB/s for small block sizes while SSE4.2 CRC32C checksum run's at ~2GByte/s.
> >
> > Cheers Andreas.
> >
> >
> >
> >
> > On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
> >
> >     Hi Andreas,
> >
> >     On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
> >     > thanks for the responses!
> >     >
> >     > Maybe this is useful for your erasure code discussion:
> >     >
> >     > as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create a 2 parity chunks.
> >     >
> >     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
> >     >
> >     > You can now easily detect data corruption using the local checksums and avoid to read any parity information and (C)RS decoding if there is no corruption detected. Moreover CRC32C computation is distributed over several (in this case 4) machines while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
> >
> >     What does (C)RS mean ? (C)Reed-Solomon ?
> >
> >     > In our case we write this checksum information separate from the original data ... while in a block-based storage like CEPH it would be probably inlined in the data chunk.
> >     > If an OSD detects to run on BRTFS or ZFS one could disable automatically the CRC32C code.
> >
> >     Nice. I did not know that was built-in :-)
> >     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
> >
> >     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
> >
> >     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
> >
> >     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
> >
> >     Cheers
> >
> >     > As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
> >     >
> >     > Cheers Andreas.
> >
> >     --
>>     Loïc Dachary, Artisan Logiciel Libre
> >     All that is necessary for the triumph of evil is that good people do nothing.
> >
> >
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
  2013-07-08  3:37         ` Sage Weil
@ 2013-07-08 10:00           ` Andreas Joachim Peters
  2013-07-08 10:31             ` Loic Dachary
  2013-07-08 15:47             ` Sage Weil
  0 siblings, 2 replies; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-07-08 10:00 UTC (permalink / raw)
  To: Sage Weil; +Cc: Loic Dachary, ceph-devel

Hi Loic,

I did the two mentioned benchmarks:

QFS (m+3) code runs at 300 MB/s ... not worthwhile (jerasure: 390 MB/s).

I made a quick (3+2) encoding benchmark and it encodes at ~3 GB/s.

...

For the checksumming ... I saw that there is a check for whether CRC32C is supported, but I was looking for a generic routine like:

crc32c_t crc32c(void* buffer, off_t length)

which internally selects either the HW-accelerated or the SW implementation. Maybe you have this in some other source file.
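Such a wrapper could be sketched as follows (an assumption of mine, not Ceph's actual code): compile the SSE4.2 path only when the compiler enables it, and select it at run time. `__builtin_cpu_supports` is one of the gcc 4.8 built-ins mentioned below; older compilers would need a cpuid check like the one already in crc32c-intel.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable bit-by-bit fallback (Castagnoli polynomial, reflected). */
static uint32_t crc32c_sw(const void *buffer, size_t length)
{
    const unsigned char *p = buffer;
    uint32_t crc = 0xFFFFFFFFu;
    while (length--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#ifdef __SSE4_2__
#include <nmmintrin.h>
/* Hardware path: one CRC32 instruction per byte for brevity; real code
 * would step 8 bytes at a time with _mm_crc32_u64. */
static uint32_t crc32c_hw(const void *buffer, size_t length)
{
    const unsigned char *p = buffer;
    uint32_t crc = 0xFFFFFFFFu;
    while (length--)
        crc = _mm_crc32_u8(crc, *p++);
    return ~crc;
}
#endif

/* The generic routine: selects HW or SW at run time, so the same
 * binary also works in a VM or on a CPU without SSE4.2. */
uint32_t crc32c(const void *buffer, size_t length)
{
#ifdef __SSE4_2__
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw(buffer, length);
#endif
    return crc32c_sw(buffer, length);
}
```

Both paths return the same value (e.g. CRC32C of "123456789" is the well-known check value 0xE3069283), so the caller never needs to know which one ran.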

Cheers Andreas.

________________________________________
From: Sage Weil [sage@inktank.com]
Sent: 08 July 2013 05:37
To: Andreas Joachim Peters
Cc: Loic Dachary; ceph-devel@vger.kernel.org
Subject: RE: CEPH Erasure Encoding + OSD Scalability

On Sun, 7 Jul 2013, Andreas Joachim Peters wrote:
> Considering the crc32c-intel code you added ... I would provide a
> function which provides a crc32c checksum and detects if it can do it
> using SSE4.2 or implements just the standard algorithm e.g if you run in
> a virtual machine you need this emulation ...

The current code in master will do this detection by checking the cpu
features; see

        https://github.com/ceph/ceph/blob/master/src/common/crc32c-intel.c#L74

If there is a better way to do this, I'd love to hear about it.  gcc 4.8
just added a bunch of built-in functions to do this stuff cleanly, but
it'll be quite a while before all of our build targets are on 4.8 or
later.

sage


>
> Cheers Andreas.
> ________________________________________
> From: Loic Dachary [loic@dachary.org]
> Sent: 06 July 2013 22:47
> To: Andreas Joachim Peters
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>
> Hi Andreas,
>
> Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using
>
> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
>
> Do you know of a better / faster implementation ? Is there a tradeoff between (C)RS and RS ?
>
> Cheers
>
> On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
> > HI Loic,
> > (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
> >
> > Considering the checksumming ... for comparison the CRC32 code from libz run's on a 8-core Xeon at ~730 MB/s for small block sizes while SSE4.2 CRC32C checksum run's at ~2GByte/s.
> >
> > Cheers Andreas.
> >
> >
> >
> >
> > On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
> >
> >     Hi Andreas,
> >
> >     On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
> >     > thanks for the responses!
> >     >
> >     > Maybe this is useful for your erasure code discussion:
> >     >
> >     > as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create a 2 parity chunks.
> >     >
> >     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
> >     >
> >     > You can now easily detect data corruption using the local checksums and avoid to read any parity information and (C)RS decoding if there is no corruption detected. Moreover CRC32C computation is distributed over several (in this case 4) machines while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
> >
> >     What does (C)RS mean ? (C)Reed-Solomon ?
> >
> >     > In our case we write this checksum information separate from the original data ... while in a block-based storage like CEPH it would be probably inlined in the data chunk.
> >     > If an OSD detects to run on BRTFS or ZFS one could disable automatically the CRC32C code.
> >
> >     Nice. I did not know that was built-in :-)
> >     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
> >
> >     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
> >
> >     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
> >
> >     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
> >
> >     Cheers
> >
> >     > As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
> >     >
> >     > Cheers Andreas.
> >
> >     --
> >     Loïc Dachary, Artisan Logiciel Libre
> >     All that is necessary for the triumph of evil is that good people do nothing.
> >
> >
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-07-08 10:00           ` Andreas Joachim Peters
@ 2013-07-08 10:31             ` Loic Dachary
  2013-07-08 15:47             ` Sage Weil
  1 sibling, 0 replies; 52+ messages in thread
From: Loic Dachary @ 2013-07-08 10:31 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 5633 bytes --]



On 08/07/2013 12:00, Andreas Joachim Peters wrote:
> Hi Loic,
> 
> I did the two mentioned benchmarks:
> 
> QFS (m+3) code runs at 300 MB/s ... not worth it (jerasure: 390 MB/s).
> 
> I made a quick (3+2) encoding benchmark and this encodes at ~3 GB/s.
> 

Hi Andreas,

It looks like the simplest and fastest implementation there is :-) I understand it only addresses 3+2 but it would make for a fine default implementation / example for the erasure coding plugin implementing the proposed API https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#erasure-code-library-abstract-api

Cheers

> Cheers Andreas.
> 
> ________________________________________
> From: Sage Weil [sage@inktank.com]
> Sent: 08 July 2013 05:37
> To: Andreas Joachim Peters
> Cc: Loic Dachary; ceph-devel@vger.kernel.org
> Subject: RE: CEPH Erasure Encoding + OSD Scalability
> 
> On Sun, 7 Jul 2013, Andreas Joachim Peters wrote:
>> Considering the crc32c-intel code you added ... I would provide a
>> function which provides a crc32c checksum and detects if it can do it
>> using SSE4.2 or implements just the standard algorithm, e.g. if you run in
>> a virtual machine you need this emulation ...
> 
> The current code in master will do this detection by checking the cpu
> features; see
> 
>         https://github.com/ceph/ceph/blob/master/src/common/crc32c-intel.c#L74
> 
> If there is a better way to do this, I'd love to hear about it.  gcc 4.8
> just added a bunch of built-in functions to do this stuff cleanly, but
> it'll be quite a while before all of our build targets are on 4.8 or
> later.
> 
> sage
> 
> 
>>
>> Cheers Andreas.
>> ________________________________________
>> From: Loic Dachary [loic@dachary.org]
>> Sent: 06 July 2013 22:47
>> To: Andreas Joachim Peters
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>>
>> Hi Andreas,
>>
>> Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using
>>
>> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
>> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
>>
>> Do you know of a better / faster implementation ? Is there a tradeoff between (C)RS and RS ?
>>
>> Cheers
>>
>> On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
>>> Hi Loic,
>>> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
>>>
>>> Considering the checksumming ... for comparison, the CRC32 code from libz runs on an 8-core Xeon at ~730 MB/s for small block sizes, while the SSE4.2 CRC32C checksum runs at ~2 GB/s.
>>>
>>> Cheers Andreas.
>>>
>>>
>>>
>>>
>>> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>>
>>>     Hi Andreas,
>>>
>>>     On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
>>>     > thanks for the responses!
>>>     >
>>>     > Maybe this is useful for your erasure code discussion:
>>>     >
>>>     > as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create 2 parity chunks.
>>>     >
>>>     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>>>     >
>>>     > You can now easily detect data corruption using the local checksums and avoid reading any parity information and doing (C)RS decoding if no corruption is detected. Moreover, CRC32C computation is distributed over several (in this case 4) machines, while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>>>
>>>     What does (C)RS mean ? (C)Reed-Solomon ?
>>>
>>>     > In our case we write this checksum information separately from the original data ... while in a block-based storage like CEPH it would probably be inlined in the data chunk.
>>>     > If an OSD detects that it runs on BTRFS or ZFS, one could automatically disable the CRC32C code.
>>>
>>>     Nice. I did not know that was built-in :-)
>>>     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>>>
>>>     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
>>>
>>>     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
>>>
>>>     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>>>
>>>     Cheers
>>>
>>>     > As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
>>>     >
>>>     > Cheers Andreas.
>>>
>>>     --
>>>     Loïc Dachary, Artisan Logiciel Libre
>>>     All that is necessary for the triumph of evil is that good people do nothing.
>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do nothing.
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]


* Re: CEPH Erasure Encoding + OSD Scalability
  2013-07-06 20:43       ` Loic Dachary
@ 2013-07-08 15:38         ` Mark Nelson
  0 siblings, 0 replies; 52+ messages in thread
From: Mark Nelson @ 2013-07-08 15:38 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Hi Loic,

Sam will be able to answer more authoritatively, but my understanding is 
we use it for messages and for journal writes.  On the message side, 
this is used in userspace while afaik the kernel implementation is used 
in kernel space.

Mark

On 07/06/2013 03:43 PM, Loic Dachary wrote:
> Hi Mark,
>
> Nice :-) I'm curious about how it's used. Is it computed every time an object is written to disk ? Or is it part of the WRITE messages that are sent to the replicas ?
>
> Cheers
>
> On 06/07/2013 17:28, Mark Nelson wrote:
>> Hi Guys,
>>
>> For what it's worth, we just added SSE 4.2 CRC32c for architectures that support it:
>>
>> https://github.com/ceph/ceph/commit/7c59288d9168ddef3b3dc570464ae9a1f180d18c#src/common/crc32c-intel.c
>>
>> Mark
>>
>> On 07/06/2013 08:45 AM, Andreas Joachim Peters wrote:
>>> Hi Loic,
>>> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
>>>
>>> Considering the checksumming ... for comparison, the CRC32 code from libz runs on an 8-core Xeon at ~730 MB/s for small block sizes, while the SSE4.2 CRC32C checksum runs at ~2 GB/s.
>>>
>>> Cheers Andreas.
>>> ________________________________________
>>> From: Loic Dachary [loic@dachary.org]
>>> Sent: 05 July 2013 23:23
>>> To: Andreas Joachim Peters
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>>>
>>> Hi Andreas,
>>>
>>> On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
>>>> thanks for the responses!
>>>>
>>>> Maybe this is useful for your erasure code discussion:
>>>>
>>>> as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create 2 parity chunks.
>>>>
>>>> Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>>>>
>>>> You can now easily detect data corruption using the local checksums and avoid reading any parity information and doing (C)RS decoding if no corruption is detected. Moreover, CRC32C computation is distributed over several (in this case 4) machines, while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>>>
>>> What does (C)RS mean ? (C)Reed-Solomon ?
>>>
>>>> In our case we write this checksum information separately from the original data ... while in a block-based storage like CEPH it would probably be inlined in the data chunk.
>>>> If an OSD detects that it runs on BTRFS or ZFS, one could automatically disable the CRC32C code.
>>>
>>> Nice. I did not know that was built-in :-)
>>> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>>>
>>>> (wouldn't CRC32C be also useful for normal CEPH block replication? )
>>>
>>> I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
>>>
>>> https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>>>
>>> Cheers
>>>
>>>> As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
>>>>
>>>> Cheers Andreas.
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do nothing.
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>



* RE: CEPH Erasure Encoding + OSD Scalability
  2013-07-08 10:00           ` Andreas Joachim Peters
  2013-07-08 10:31             ` Loic Dachary
@ 2013-07-08 15:47             ` Sage Weil
  1 sibling, 0 replies; 52+ messages in thread
From: Sage Weil @ 2013-07-08 15:47 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: Loic Dachary, ceph-devel

On Mon, 8 Jul 2013, Andreas Joachim Peters wrote:
> Hi Loic,
> 
> I did the two mentioned benchmarks:
> 
> QFS (m+3) code runs at 300 MB/s ... not worth it (jerasure: 390 MB/s).
> 
> I made a quick (3+2) encoding benchmark and this encodes at ~3 GB/s.
> 
> ...
> 
> For the checksumming ... I saw that there is a check for whether CRC32C is supported, but I was looking for a generic routine like:
> 
> crc32c_t crc32c(void* buffer, off_t length)
> 
> which internally selects either the HW-accelerated or SW implementation. Maybe you have this in some other source file.

https://github.com/ceph/ceph/blob/master/src/include/crc32c.h

Let me know if anything looks awry; this is the first time I've done any 
runtime cpu checks.

Thanks!
sage

> 
> Cheers Andreas.
> 
> ________________________________________
> From: Sage Weil [sage@inktank.com]
> Sent: 08 July 2013 05:37
> To: Andreas Joachim Peters
> Cc: Loic Dachary; ceph-devel@vger.kernel.org
> Subject: RE: CEPH Erasure Encoding + OSD Scalability
> 
> On Sun, 7 Jul 2013, Andreas Joachim Peters wrote:
> > Considering the crc32c-intel code you added ... I would provide a
> > function which provides a crc32c checksum and detects if it can do it
> > using SSE4.2 or implements just the standard algorithm, e.g. if you run in
> > a virtual machine you need this emulation ...
> 
> The current code in master will do this detection by checking the cpu
> features; see
> 
>         https://github.com/ceph/ceph/blob/master/src/common/crc32c-intel.c#L74
> 
> If there is a better way to do this, I'd love to hear about it.  gcc 4.8
> just added a bunch of built-in functions to do this stuff cleanly, but
> it'll be quite a while before all of our build targets are on 4.8 or
> later.
> 
> sage
> 
> 
> >
> > Cheers Andreas.
> > ________________________________________
> > From: Loic Dachary [loic@dachary.org]
> > Sent: 06 July 2013 22:47
> > To: Andreas Joachim Peters
> > Cc: ceph-devel@vger.kernel.org
> > Subject: Re: CEPH Erasure Encoding + OSD Scalability
> >
> > Hi Andreas,
> >
> > Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using
> >
> > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
> > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
> >
> > Do you know of a better / faster implementation ? Is there a tradeoff between (C)RS and RS ?
> >
> > Cheers
> >
> > On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
> > > Hi Loic,
> > > (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
> > >
> > > Considering the checksumming ... for comparison, the CRC32 code from libz runs on an 8-core Xeon at ~730 MB/s for small block sizes, while the SSE4.2 CRC32C checksum runs at ~2 GB/s.
> > >
> > > Cheers Andreas.
> > >
> > >
> > >
> > >
> > > On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
> > >
> > >     Hi Andreas,
> > >
> > >     On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
> > >     > thanks for the responses!
> > >     >
> > >     > Maybe this is useful for your erasure code discussion:
> > >     >
> > >     > as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create 2 parity chunks.
> > >     >
> > >     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
> > >     >
> > >     > You can now easily detect data corruption using the local checksums and avoid reading any parity information and doing (C)RS decoding if no corruption is detected. Moreover, CRC32C computation is distributed over several (in this case 4) machines, while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
> > >
> > >     What does (C)RS mean ? (C)Reed-Solomon ?
> > >
> > >     > In our case we write this checksum information separately from the original data ... while in a block-based storage like CEPH it would probably be inlined in the data chunk.
> > >     > If an OSD detects that it runs on BTRFS or ZFS, one could automatically disable the CRC32C code.
> > >
> > >     Nice. I did not know that was built-in :-)
> > >     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
> > >
> > >     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
> > >
> > >     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
> > >
> > >     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
> > >
> > >     Cheers
> > >
> > >     > As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
> > >     >
> > >     > Cheers Andreas.
> > >
> > >     --
> > >     Loïc Dachary, Artisan Logiciel Libre
> > >     All that is necessary for the triumph of evil is that good people do nothing.
> > >
> > >
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> > All that is necessary for the triumph of evil is that good people do nothing.
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> 
> 


* Re: CEPH Erasure Encoding + OSD Scalability
  2013-07-07 21:04       ` Andreas Joachim Peters
  2013-07-08  3:37         ` Sage Weil
@ 2013-08-19 10:35         ` Loic Dachary
  2013-08-22 21:50           ` Andreas Joachim Peters
       [not found]           ` <CAGhffvwB87a+1294BjmPrfu0a9hYdu17N-eHOvYCHWMXDLcJmA@mail.gmail.com>
  1 sibling, 2 replies; 52+ messages in thread
From: Loic Dachary @ 2013-08-19 10:35 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 5448 bytes --]

Hi Andreas,

Trying to write minimal code as you suggested, for an example plugin. My first attempt at writing an erasure coding function. I don't get how you can rebuild P1 + A from P2 + B + C. I must be missing something obvious :-)

Cheers

On 07/07/2013 23:04, Andreas Joachim Peters wrote:
> 
> Hi Loic,
> I don't think there is a better generic implementation. I just made a benchmark ... the Jerasure library with algorithm 'cauchy_good' gives 1.1 GB/s (Xeon 2.27 GHz) on a single core for a 4+2 encoding with w=32. Just to give a feeling: if you do 10+4 it is 300 MB/s ... there is a specialized implementation in QFS (Hadoop in C++) for (M+3) ... out of curiosity I will make a benchmark with this to compare with Jerasure ...
> 
> In any case I would do an optimized implementation for 3+2, which would probably be the most performant implementation, with the same reliability as standard 3-fold replication in CEPH while using only 53% of the space.
> 
> 3+2 is trivial since you encode (A,B,C) with only two parity operations
> P1 = A^B
> P2 = B^C
> and reconstruct with one or two parity operations:
> A = P1^B
> B = P1^A
> B = P2^C
> C = P2^B
> and so on.
> 
> You can write this as a simple loop using advanced vector extensions on Intel (AVX). I can paste a benchmark tomorrow.
> 
> Considering the crc32c-intel code you added ... I would provide a function which provides a crc32c checksum and detects if it can do it using SSE4.2 or implements just the standard algorithm, e.g. if you run in a virtual machine you need this emulation ...
> 
> Cheers Andreas.
> ________________________________________
> From: Loic Dachary [loic@dachary.org]
> Sent: 06 July 2013 22:47
> To: Andreas Joachim Peters
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
> 
> Hi Andreas,
> 
> Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using
> 
> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
> 
> Do you know of a better / faster implementation ? Is there a tradeoff between (C)RS and RS ?
> 
> Cheers
> 
> On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
>> Hi Loic,
>> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
>>
>> Considering the checksumming ... for comparison, the CRC32 code from libz runs on an 8-core Xeon at ~730 MB/s for small block sizes, while the SSE4.2 CRC32C checksum runs at ~2 GB/s.
>>
>> Cheers Andreas.
>>
>>
>>
>>
>> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>
>>     Hi Andreas,
>>
>>     On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
>>     > thanks for the responses!
>>     >
>>     > Maybe this is useful for your erasure code discussion:
>>     >
>>     > as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create 2 parity chunks.
>>     >
>>     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>>     >
>>     > You can now easily detect data corruption using the local checksums and avoid reading any parity information and doing (C)RS decoding if no corruption is detected. Moreover, CRC32C computation is distributed over several (in this case 4) machines, while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>>
>>     What does (C)RS mean ? (C)Reed-Solomon ?
>>
>>     > In our case we write this checksum information separately from the original data ... while in a block-based storage like CEPH it would probably be inlined in the data chunk.
>>     > If an OSD detects that it runs on BTRFS or ZFS, one could automatically disable the CRC32C code.
>>
>>     Nice. I did not know that was built-in :-)
>>     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>>
>>     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
>>
>>     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
>>
>>     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>>
>>     Cheers
>>
>>     > As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
>>     >
>>     > Cheers Andreas.
>>
>>     --
>>     Loïc Dachary, Artisan Logiciel Libre
>>     All that is necessary for the triumph of evil is that good people do nothing.
>>
>>
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.




* RE: CEPH Erasure Encoding + OSD Scalability
  2013-08-19 10:35         ` Loic Dachary
@ 2013-08-22 21:50           ` Andreas Joachim Peters
       [not found]           ` <CAGhffvwB87a+1294BjmPrfu0a9hYdu17N-eHOvYCHWMXDLcJmA@mail.gmail.com>
  1 sibling, 0 replies; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-08-22 21:50 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Hi Loic, 
sorry for the late reply, I was on vacation ... you are right, I made a simple logical mistake: I assumed you lose only the data stripes but never the parity stripes, which is a very wrong assumption. Ignore the proposal!

So for testing you could probably just implement (2+1) and then use jerasure, or use the dual-parity (4+2) algorithm where you build horizontal and diagonal parities (however, there is a patent on RAID-DP).

Cheers Andreas.

________________________________________
From: Loic Dachary [loic@dachary.org]
Sent: 19 August 2013 12:35
To: Andreas Joachim Peters
Cc: ceph-devel@vger.kernel.org
Subject: Re: CEPH Erasure Encoding + OSD Scalability

Hi Andreas,

Trying to write minimal code as you suggested, for an example plugin. My first attempt at writing an erasure coding function. I don't get how you can rebuild P1 + A from P2 + B + C. I must be missing something obvious :-)

Cheers

On 07/07/2013 23:04, Andreas Joachim Peters wrote:
>
> Hi Loic,
> I don't think there is a better generic implementation. I just made a benchmark ... the Jerasure library with algorithm 'cauchy_good' gives 1.1 GB/s (Xeon 2.27 GHz) on a single core for a 4+2 encoding with w=32. Just to give a feeling: if you do 10+4 it is 300 MB/s ... there is a specialized implementation in QFS (Hadoop in C++) for (M+3) ... out of curiosity I will make a benchmark with this to compare with Jerasure ...
>
> In any case I would do an optimized implementation for 3+2, which would probably be the most performant implementation, with the same reliability as standard 3-fold replication in CEPH while using only 53% of the space.
>
> 3+2 is trivial since you encode (A,B,C) with only two parity operations
> P1 = A^B
> P2 = B^C
> and reconstruct with one or two parity operations:
> A = P1^B
> B = P1^A
> B = P2^C
> C = P2^B
> and so on.
>
> You can write this as a simple loop using advanced vector extensions on Intel (AVX). I can paste a benchmark tomorrow.
>
> Considering the crc32c-intel code you added ... I would provide a function which provides a crc32c checksum and detects if it can do it using SSE4.2 or implements just the standard algorithm, e.g. if you run in a virtual machine you need this emulation ...
>
> Cheers Andreas.
> ________________________________________
> From: Loic Dachary [loic@dachary.org]
> Sent: 06 July 2013 22:47
> To: Andreas Joachim Peters
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>
> Hi Andreas,
>
> Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using
>
> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
> https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
>
> Do you know of a better / faster implementation ? Is there a tradeoff between (C)RS and RS ?
>
> Cheers
>
> On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
>> Hi Loic,
>> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
>>
>> Considering the checksumming ... for comparison, the CRC32 code from libz runs on an 8-core Xeon at ~730 MB/s for small block sizes, while the SSE4.2 CRC32C checksum runs at ~2 GB/s.
>>
>> Cheers Andreas.
>>
>>
>>
>>
>> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>
>>     Hi Andreas,
>>
>>     On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
>>     > thanks for the responses!
>>     >
>>     > Maybe this is useful for your erasure code discussion:
>>     >
>>     > as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create 2 parity chunks.
>>     >
>>     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>>     >
>>     > You can now easily detect data corruption using the local checksums and avoid reading any parity information and doing (C)RS decoding if no corruption is detected. Moreover, CRC32C computation is distributed over several (in this case 4) machines, while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>>
>>     What does (C)RS mean ? (C)Reed-Solomon ?
>>
>>     > In our case we write this checksum information separately from the original data ... while in a block-based storage like CEPH it would probably be inlined in the data chunk.
>>     > If an OSD detects that it runs on BTRFS or ZFS, one could automatically disable the CRC32C code.
>>
>>     Nice. I did not know that was built-in :-)
>>     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>>
>>     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
>>
>>     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
>>
>>     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>>
>>     Cheers
>>
>>     > As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
>>     >
>>     > Cheers Andreas.
>>
>>     --
>>     Loïc Dachary, Artisan Logiciel Libre
>>     All that is necessary for the triumph of evil is that good people do nothing.
>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.



* Re: CEPH Erasure Encoding + OSD Scalability
       [not found]           ` <CAGhffvwB87a+1294BjmPrfu0a9hYdu17N-eHOvYCHWMXDLcJmA@mail.gmail.com>
@ 2013-08-22 23:03             ` Loic Dachary
       [not found]               ` <CAGhffvxW9sG5LtcF-tU1YGkCMAQUfh2WW_3N=f=-vWs48vyxkQ@mail.gmail.com>
  0 siblings, 1 reply; 52+ messages in thread
From: Loic Dachary @ 2013-08-22 23:03 UTC (permalink / raw)
  To: Andreas-Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 7290 bytes --]



On 22/08/2013 23:42, Andreas-Joachim Peters wrote:> Hi Loic, 
> sorry for the late reply, I was on vacation ... you are right, I made a simple logical mistake: I assumed you lose only the data stripes but never the parity stripes, which is a very wrong assumption.
> 
> So for testing you could probably just implement (2+1) and then move to jerasure or dual-parity (4+2) where you build horizontal and diagonal parities.
> 

Hi Andreas,

That's what I did :-) It would be great if you could review the proposed implementation at https://github.com/ceph/ceph/pull/518/files . I'll keep working on https://github.com/dachary/ceph/commit/83845a66ae1cba63c122c0ef7658b97b474c2bd2 tomorrow to create the jerasure plugin but it's not ready for review yet. 

Cheers

> Cheers Andreas.
> 
> 
> 
> 
> 
> On Mon, Aug 19, 2013 at 12:35 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
> 
>     Hi Andreas,
> 
>     Trying to write minimal code as you suggested, for an example plugin. My first attempt at writing an erasure coding function. I don't get how you can rebuild P1 + A from P2 + B + C. I must be missing something obvious :-)
> 
>     Cheers
> 
>     On 07/07/2013 23:04, Andreas Joachim Peters wrote:
>     >
>     > Hi Loic,
>     > I don't think there is a better generic implementation. Just made a benchmark .. the Jerasure library with algorithm 'cauchy_good' gives 1.1 GB/s (Xeon 2.27 GHz) on a single core for a 4+2 encoding w=32. Just to give a feeling if you do 10+4 it is 300 MB/s .... there is a specialized implementation in QFS (Hadoop in C++) for (M+3) ... for curiosity I will make a benchmark with this to compare with Jerasure ...
>     >
>     >     > In any case I would do an optimized implementation for 3+2, which would probably be the most performant implementation, with the same reliability as standard 3-fold replication in CEPH while using only 53% of the space.
>     >
>     > 3+2 is trivial since you encode (A,B,C) with only two parity operations
>     > P1 = A^B
>     > P2 = B^C
>     > and reconstruct with one or two parity operations:
>     > A = P1^B
>     > B = P1^A
>     > B = P2^C
>     > C = P2^B
>     > aso.
>     >
>     > You can write this as a simple loop using advanced vector extensions on Intel (AVX). I can paste a benchmark tomorrow.
>     >
>     > Considering the crc32c-intel code you added ... I would provide a function which provides a crc32c checksum and detects if it can do it using SSE4.2 or implements just the standard algorithm e.g if you run in a virtual machine you need this emulation ...
>     >
>     > Cheers Andreas.
>     > ________________________________________
>     > From: Loic Dachary [loic@dachary.org <mailto:loic@dachary.org>]
>     > Sent: 06 July 2013 22:47
>     > To: Andreas Joachim Peters
>     > Cc: ceph-devel@vger.kernel.org <mailto:ceph-devel@vger.kernel.org>
>     > Subject: Re: CEPH Erasure Encoding + OSD Scalability
>     >
>     > Hi Andreas,
>     >
>     > Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using
>     >
>     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
>     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
>     >
>     > Do you know of a better / faster implementation ? Is there a tradeoff between (C)RS and RS ?
>     >
>     > Cheers
>     >
>     > On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
>     >> HI Loic,
>     >> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
>     >>
>     >> Considering the checksumming ... for comparison, the CRC32 code from libz runs on an 8-core Xeon at ~730 MB/s for small block sizes, while the SSE4.2 CRC32C checksum runs at ~2 GByte/s.
>     >>
>     >> Cheers Andreas.
>     >>
>     >>
>     >>
>     >>
>     >> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>> wrote:
>     >>
>     >>     Hi Andreas,
>     >>
>     >>     On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
>     >>     > thanks for the responses!
>     >>     >
>     >>     > Maybe this is useful for your erasure code discussion:
>     >>     >
>     >>     > as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create a 2 parity chunks.
>     >>     >
>     >>     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>     >>     >
>     >>     > You can now easily detect data corruption using the local checksums and avoid to read any parity information and (C)RS decoding if there is no corruption detected. Moreover CRC32C computation is distributed over several (in this case 4) machines while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>     >>
>     >>     What does (C)RS mean ? (C)Reed-Solomon ?
>     >>
>     >>     > In our case we write this checksum information separate from the original data ... while in a block-based storage like CEPH it would be probably inlined in the data chunk.
>     >>     > If an OSD detects to run on BRTFS or ZFS one could disable automatically the CRC32C code.
>     >>
>     >>     Nice. I did not know that was built-in :-)
>     >>     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>     >>
>     >>     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
>     >>
>     >>     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
>     >>
>     >>     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>     >>
>     >>     Cheers
>     >>
>     >>     > As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
>     >>     >
>     >>     > Cheers Andreas.
>     >>
>     >>     --
>     >>     Loïc Dachary, Artisan Logiciel Libre
>     >>     All that is necessary for the triumph of evil is that good people do nothing.
>     >>
>     >>
>     >
>     > --
>     > Loïc Dachary, Artisan Logiciel Libre
>     > All that is necessary for the triumph of evil is that good people do nothing.
>     >
>     > --
>     > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>     > the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>
>     > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>     >
> 
>     --
>     Loïc Dachary, Artisan Logiciel Libre
>     All that is necessary for the triumph of evil is that good people do nothing.
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
       [not found]               ` <CAGhffvxW9sG5LtcF-tU1YGkCMAQUfh2WW_3N=f=-vWs48vyxkQ@mail.gmail.com>
@ 2013-08-24 19:41                 ` Loic Dachary
  2013-08-25 11:49                   ` Loic Dachary
  0 siblings, 1 reply; 52+ messages in thread
From: Loic Dachary @ 2013-08-24 19:41 UTC (permalink / raw)
  To: Andreas-Joachim Peters; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 11243 bytes --]



On 24/08/2013 15:30, Andreas-Joachim Peters wrote:
> Hi Loic, 
> I will start to review  

Cool :-)

...maybe you can briefly explain a few things beforehand:
> 
> 1) the buffer management .... who allocates the output buffers for the encoding? Are they always malloced, or does it use some generic CEPH buffer recycling functionality? 

The output bufferlist is allocated by the plugin, and it is the responsibility of the caller to deallocate it. I will write doxygen documentation:
https://github.com/ceph/ceph/pull/518/files#r5966727

> 2) do you support to retrieve partial blocks or only the full 4M block? are decoded blocks cached for some time?

This is outside the scope of https://github.com/ceph/ceph/pull/518/files : the plugin handles encode/decode of 128 bytes or of 4M in the same way.

> 3) do you want to tune the 2+1 basic code for performance or is it just proof of concept? If yes, then you should move over the encoding buffer with *ptr++ and use the largest available vector size for the used platform to perform XOR operations. I will send you an improved version of the loop if you want ...

The 2+1 plugin is just a proof of concept. I completed a first implementation of the jerasure plugin ( https://github.com/ceph/ceph/pull/538/files ), which is meant to become the default. 

> 4) if you are interested I can also write code for a (3+3) plugin which tolerates 2-3 lost stripes (one has to add P3=A^B^C to my [3,2] proposal). At least it reduces the overhead from 3-fold replication from 300% => 200% ...

It would be great to have such a plugin :-)

> 5) will you add CRC32C checksums to the blocks (4M block or 4k pages?) or will this be a CEPH generic functionality for any kind of block?

The idea is to have a CRC32C checksum per object / shard ( as described in http://ceph.com/docs/master/dev/osd_internals/erasure_coding/#glossary ): it is the only way for scrubbing to figure out whether a given shard is corrupted, and it is not too expensive, since erasure coded pools only support full writes + append, not the partial writes that would require re-calculating the CRC32C for the whole shard each time one byte changes.

> 6) do you put a kind of header or magic into the encoded blocks to verify that your input blocks are actually corresponding?

This has not been decided yet, but I think it would be sensible to use the object attributes ( either xattr or leveldb ) to store meta information instead of creating a file format specifically designed for erasure coding.

Cheers

> Cheers Andreas.
> 
> 
> 
> 
> On Fri, Aug 23, 2013 at 1:03 AM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
> 
> 
> 
>     On 22/08/2013 23:42, Andreas-Joachim Peters wrote:> Hi Loic,
>     > sorry for the late reply, I was on vacation ...  you are right, I made a simple logical mistake since I assumed you lose only the data stripes but never the parity stripes, which is a very wrong assumption.
>     >
>     > So for testing you probably could just implement (2+1) and then move to jerasure or dual parity (4+2) where you build horizontal and diagonal parities.
>     >
> 
>     Hi Andreas,
> 
>     That's what I did :-) It would be great if you could review the proposed implementation at https://github.com/ceph/ceph/pull/518/files . I'll keep working on https://github.com/dachary/ceph/commit/83845a66ae1cba63c122c0ef7658b97b474c2bd2 tomorrow to create the jerasure plugin but it's not ready for review yet.
> 
>     Cheers
> 
>     > Cheers Andreas.
>     >
>     >
>     >
>     >
>     >
>     > On Mon, Aug 19, 2013 at 12:35 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>> wrote:
>     >
>     >     Hi Andreas,
>     >
>     >     Trying to write minimal code as you suggested, for an example plugin. My first attempt at writing an erasure coding function. I don't get how you can rebuild P1 + A from P2 + B + C. I must be missing something obvious :-)
>     >
>     >     Cheers
>     >
>     >     On 07/07/2013 23:04, Andreas Joachim Peters wrote:
>     >     >
>     >     > Hi Loic,
>     >     > I don't think there is a better generic implementation. Just made a benchmark .. the Jerasure library with algorithm 'cauchy_good' gives 1.1 GB/s (Xeon 2.27 GHz) on a single core for a 4+2 encoding w=32. Just to give a feeling if you do 10+4 it is 300 MB/s .... there is a specialized implementation in QFS (Hadoop in C++) for (M+3) ... for curiosity I will make a benchmark with this to compare with Jerasure ...
>     >     >
>     >     > In any case I would do an optimized implementation for 3+2 which would be probably the most performant implementation having the same reliability like standard 3-fold replication in CEPH using only 53% of the space.
>     >     >
>     >     > 3+2 is trivial since you encode (A,B,C) with only two parity operations
>     >     > P1 = A^B
>     >     > P2 = B^C
>     >     > and reconstruct with one or two parity operations:
>     >     > A = P1^B
>     >     > B = P1^A
>     >     > B = P2^C
>     >     > C = P2^B
>     >     > aso.
>     >     >
>     >     > You can write this as a simple loop using advanced vector extensions on Intel (AVX). I can paste a benchmark tomorrow.
>     >     >
>     >     > Considering the crc32c-intel code you added ... I would provide a function which provides a crc32c checksum and detects if it can do it using SSE4.2 or implements just the standard algorithm e.g if you run in a virtual machine you need this emulation ...
>     >     >
>     >     > Cheers Andreas.
>     >     > ________________________________________
>     >     > From: Loic Dachary [loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>]
>     >     > Sent: 06 July 2013 22:47
>     >     > To: Andreas Joachim Peters
>     >     > Cc: ceph-devel@vger.kernel.org <mailto:ceph-devel@vger.kernel.org> <mailto:ceph-devel@vger.kernel.org <mailto:ceph-devel@vger.kernel.org>>
>     >     > Subject: Re: CEPH Erasure Encoding + OSD Scalability
>     >     >
>     >     > Hi Andreas,
>     >     >
>     >     > Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using
>     >     >
>     >     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
>     >     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
>     >     >
>     >     > Do you know of a better / faster implementation ? Is there a tradeoff between (C)RS and RS ?
>     >     >
>     >     > Cheers
>     >     >
>     >     > On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
>     >     >> HI Loic,
>     >     >> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
>     >     >>
>     >     >> Considering the checksumming ... for comparison the CRC32 code from libz run's on a 8-core Xeon at ~730 MB/s for small block sizes while SSE4.2 CRC32C checksum run's at ~2GByte/s.
>     >     >>
>     >     >> Cheers Andreas.
>     >     >>
>     >     >>
>     >     >>
>     >     >>
>     >     >> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>> <mailto:loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>>> wrote:
>     >     >>
>     >     >>     Hi Andreas,
>     >     >>
>     >     >>     On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
>     >     >>     > thanks for the responses!
>     >     >>     >
>     >     >>     > Maybe this is useful for your erasure code discussion:
>     >     >>     >
>     >     >>     > as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create a 2 parity chunks.
>     >     >>     >
>     >     >>     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>     >     >>     >
>     >     >>     > You can now easily detect data corruption using the local checksums and avoid to read any parity information and (C)RS decoding if there is no corruption detected. Moreover CRC32C computation is distributed over several (in this case 4) machines while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>     >     >>
>     >     >>     What does (C)RS mean ? (C)Reed-Solomon ?
>     >     >>
>     >     >>     > In our case we write this checksum information separate from the original data ... while in a block-based storage like CEPH it would be probably inlined in the data chunk.
>     >     >>     > If an OSD detects to run on BRTFS or ZFS one could disable automatically the CRC32C code.
>     >     >>
>     >     >>     Nice. I did not know that was built-in :-)
>     >     >>     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>     >     >>
>     >     >>     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
>     >     >>
>     >     >>     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
>     >     >>
>     >     >>     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>     >     >>
>     >     >>     Cheers
>     >     >>
>     >     >>     > As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
>     >     >>     >
>     >     >>     > Cheers Andreas.
>     >     >>
>     >     >>     --
>     >     >>     Loïc Dachary, Artisan Logiciel Libre
>     >     >>     All that is necessary for the triumph of evil is that good people do nothing.
>     >     >>
>     >     >>
>     >     >
>     >     > --
>     >     > Loïc Dachary, Artisan Logiciel Libre
>     >     > All that is necessary for the triumph of evil is that good people do nothing.
>     >     >
>     >     > --
>     >     > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>     >     > the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org> <mailto:majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>>
>     >     > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>     >     >
>     >
>     >     --
>     >     Loïc Dachary, Artisan Logiciel Libre
>     >     All that is necessary for the triumph of evil is that good people do nothing.
>     >
>     >
> 
>     --
>     Loïc Dachary, Artisan Logiciel Libre
>     All that is necessary for the triumph of evil is that good people do nothing.
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-08-24 19:41                 ` Loic Dachary
@ 2013-08-25 11:49                   ` Loic Dachary
  2013-09-14 14:59                     ` Andreas Joachim Peters
  0 siblings, 1 reply; 52+ messages in thread
From: Loic Dachary @ 2013-08-25 11:49 UTC (permalink / raw)
  To: Andreas-Joachim Peters; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 11791 bytes --]



On 24/08/2013 21:41, Loic Dachary wrote:
> 
> 
> On 24/08/2013 15:30, Andreas-Joachim Peters wrote:
>> Hi Loic, 
>> I will start to review  
> 
> Cool :-)
> 
> ...maybe you can briefly explain few things beforehand:
>>
>> 1) the buffer management .... who allocates the output buffers for the encoding? Are they always malloced, or does it use some generic CEPH buffer recycling functionality? 
> 
> The output bufferlist is allocated by the plugin, and it is the responsibility of the caller to deallocate it. I will write doxygen documentation:
> https://github.com/ceph/ceph/pull/518/files#r5966727

Hi Andreas,

The documentation added today in
https://github.com/dachary/ceph/blob/wip-5878/src/osd/ErasureCodeInterface.h
will hopefully clarify things. It requires an understanding of https://github.com/ceph/ceph/blob/master/src/include/buffer.h

Let me know if you have more questions.

> 
>> 2) do you support to retrieve partial blocks or only the full 4M block? are decoded blocks cached for some time?
> 
> This is outside of the scope of https://github.com/ceph/ceph/pull/518/files : the plugin can handle encode/decode of 128 bytes or 4M in the same way.
> 
>> 3) do you want to tune the 2+1 basic code for performance or is it just proof of concept? If yes, then you should move over the encoding buffer with *ptr++ and use the largest available vector size for the used platform to perform XOR operations. I will send you an improved version of the loop if you want ...
> 
> The 2+1 is just a proof of concept. I completed a first implementation of the jerasure plugin https://github.com/ceph/ceph/pull/538/files which is meant to be used as a default. 
> 
>> 4) if you are interested I can also write code for a (3+3) plugin which tolerates 2-3 lost stripes (one has to add P3=A^B^C to my [3,2] proposal). At least it reduces the overhead from 3-fold replication from 300% => 200% ...
> 
> It would be great to have such a plugin :-)
> 
>> 5) will you add CRC32C checksums to the blocks (4M block or 4k pages?) or will this be a CEPH generic functionality for any kind of block?
> 
> The idea is to have a CRC32C checksum per object / shard ( as described in http://ceph.com/docs/master/dev/osd_internals/erasure_coding/#glossary ): it is the only way for scrubbing to figure out whether a given shard is corrupted, and it is not too expensive, since erasure coded pools only support full writes + append, not the partial writes that would require re-calculating the CRC32C for the whole shard each time one byte changes.
> 
>> 6) do you put a kind of header or magic into the encoded blocks to verify that your input blocks are actually corresponding?
> 
> This has not been decided yet but I think it would be sensible to use the object attributes ( either xattr or leveldb ) to store meta information instead of creating a file format specifically designed for erasure code.
> 
> Cheers
> 
>> Cheers Andreas.
>>
>>
>>
>>
>> On Fri, Aug 23, 2013 at 1:03 AM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>
>>
>>
>>     On 22/08/2013 23:42, Andreas-Joachim Peters wrote:> Hi Loic,
>>     > sorry for the late reply, I was on vacation ...  you are right, I made a simple logical mistake since I assumed you lose only the data stripes but never the parity stripes, which is a very wrong assumption.
>>     >
>>     > So for testing you probably could just implement (2+1) and then move to jerasure or dual parity (4+2) where you build horizontal and diagonal parities.
>>     >
>>
>>     Hi Andreas,
>>
>>     That's what I did :-) It would be great if you could review the proposed implementation at https://github.com/ceph/ceph/pull/518/files . I'll keep working on https://github.com/dachary/ceph/commit/83845a66ae1cba63c122c0ef7658b97b474c2bd2 tomorrow to create the jerasure plugin but it's not ready for review yet.
>>
>>     Cheers
>>
>>     > Cheers Andreas.
>>     >
>>     >
>>     >
>>     >
>>     >
>>     > On Mon, Aug 19, 2013 at 12:35 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>> wrote:
>>     >
>>     >     Hi Andreas,
>>     >
>>     >     Trying to write minimal code as you suggested, for an example plugin. My first attempt at writing an erasure coding function. I don't get how you can rebuild P1 + A from P2 + B + C. I must be missing something obvious :-)
>>     >
>>     >     Cheers
>>     >
>>     >     On 07/07/2013 23:04, Andreas Joachim Peters wrote:
>>     >     >
>>     >     > Hi Loic,
>>     >     > I don't think there is a better generic implementation. Just made a benchmark .. the Jerasure library with algorithm 'cauchy_good' gives 1.1 GB/s (Xeon 2.27 GHz) on a single core for a 4+2 encoding w=32. Just to give a feeling if you do 10+4 it is 300 MB/s .... there is a specialized implementation in QFS (Hadoop in C++) for (M+3) ... for curiosity I will make a benchmark with this to compare with Jerasure ...
>>     >     >
>>     >     > In any case I would do an optimized implementation for 3+2 which would be probably the most performant implementation having the same reliability like standard 3-fold replication in CEPH using only 53% of the space.
>>     >     >
>>     >     > 3+2 is trivial since you encode (A,B,C) with only two parity operations
>>     >     > P1 = A^B
>>     >     > P2 = B^C
>>     >     > and reconstruct with one or two parity operations:
>>     >     > A = P1^B
>>     >     > B = P1^A
>>     >     > B = P2^C
>>     >     > C = P2^B
>>     >     > aso.
>>     >     >
>>     >     > You can write this as a simple loop using advanced vector extensions on Intel (AVX). I can paste a benchmark tomorrow.
>>     >     >
>>     >     > Considering the crc32c-intel code you added ... I would provide a function which provides a crc32c checksum and detects if it can do it using SSE4.2 or implements just the standard algorithm e.g if you run in a virtual machine you need this emulation ...
>>     >     >
>>     >     > Cheers Andreas.
>>     >     > ________________________________________
>>     >     > From: Loic Dachary [loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>]
>>     >     > Sent: 06 July 2013 22:47
>>     >     > To: Andreas Joachim Peters
>>     >     > Cc: ceph-devel@vger.kernel.org <mailto:ceph-devel@vger.kernel.org> <mailto:ceph-devel@vger.kernel.org <mailto:ceph-devel@vger.kernel.org>>
>>     >     > Subject: Re: CEPH Erasure Encoding + OSD Scalability
>>     >     >
>>     >     > Hi Andreas,
>>     >     >
>>     >     > Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using
>>     >     >
>>     >     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
>>     >     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
>>     >     >
>>     >     > Do you know of a better / faster implementation ? Is there a tradeoff between (C)RS and RS ?
>>     >     >
>>     >     > Cheers
>>     >     >
>>     >     > On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
>>     >     >> HI Loic,
>>     >     >> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
>>     >     >>
>>     >     >> Considering the checksumming ... for comparison the CRC32 code from libz run's on a 8-core Xeon at ~730 MB/s for small block sizes while SSE4.2 CRC32C checksum run's at ~2GByte/s.
>>     >     >>
>>     >     >> Cheers Andreas.
>>     >     >>
>>     >     >>
>>     >     >>
>>     >     >>
>>     >     >> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>> <mailto:loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>>> wrote:
>>     >     >>
>>     >     >>     Hi Andreas,
>>     >     >>
>>     >     >>     On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
>>     >     >>     > thanks for the responses!
>>     >     >>     >
>>     >     >>     > Maybe this is useful for your erasure code discussion:
>>     >     >>     >
>>     >     >>     > as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create a 2 parity chunks.
>>     >     >>     >
>>     >     >>     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>>     >     >>     >
>>     >     >>     > You can now easily detect data corruption using the local checksums and avoid to read any parity information and (C)RS decoding if there is no corruption detected. Moreover CRC32C computation is distributed over several (in this case 4) machines while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>>     >     >>
>>     >     >>     What does (C)RS mean ? (C)Reed-Solomon ?
>>     >     >>
>>     >     >>     > In our case we write this checksum information separate from the original data ... while in a block-based storage like CEPH it would be probably inlined in the data chunk.
>>     >     >>     > If an OSD detects to run on BRTFS or ZFS one could disable automatically the CRC32C code.
>>     >     >>
>>     >     >>     Nice. I did not know that was built-in :-)
>>     >     >>     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>>     >     >>
>>     >     >>     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
>>     >     >>
>>     >     >>     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
>>     >     >>
>>     >     >>     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>>     >     >>
>>     >     >>     Cheers
>>     >     >>
>>     >     >>     > As far as I know with the RS CODEC we use you can either miss stripes (data =0) in the decoding process but you cannot inject corrupted stripes into the decoding process, so the block checksumming is important.
>>     >     >>     >
>>     >     >>     > Cheers Andreas.
>>     >     >>
>>     >     >>     --
>>     >     >>     Loïc Dachary, Artisan Logiciel Libre
>>     >     >>     All that is necessary for the triumph of evil is that good people do nothing.
>>     >     >>
>>     >     >>
>>     >     >
>>     >     > --
>>     >     > Loïc Dachary, Artisan Logiciel Libre
>>     >     > All that is necessary for the triumph of evil is that good people do nothing.
>>     >     >
>>     >     > --
>>     >     > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>     >     > the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org> <mailto:majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>>
>>     >     > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>     >     >
>>     >
>>     >     --
>>     >     Loïc Dachary, Artisan Logiciel Libre
>>     >     All that is necessary for the triumph of evil is that good people do nothing.
>>     >
>>     >
>>
>>     --
>>     Loïc Dachary, Artisan Logiciel Libre
>>     All that is necessary for the triumph of evil is that good people do nothing.
>>
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
  2013-08-25 11:49                   ` Loic Dachary
@ 2013-09-14 14:59                     ` Andreas Joachim Peters
  2013-09-14 18:04                       ` Loic Dachary
  0 siblings, 1 reply; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-09-14 14:59 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

Hi Loic, 

I finally ran and read the erasure encoding code. 

What I noticed is that your implementation always copies the data to encode once: you add a padding block to the bufferlist and then call "out.c_str()", which calls bufferlist::rebuild, allocates a new buffer of the full size of all chunks, and copies the input data. Please correct me if I am wrong ... couldn't you just allocate the additional redundancy chunks and return bufferptrs pointing into the 'in' bufferlist?

Another question: why is 'in' in the encode function a list of buffers? Maybe this is the natural interface object in CEPH IO, I don't know ... the implementation would concatenate them and produce chunks for the merged block ...

I will try to run a benchmark to see if the additional copy has a visible impact on the performance; however, it looks unnecessary.

I am also more or less finished with the 3 + (3XOR) implementation ... I will benchmark this as well and let you know the result.

Last question, a little bit out of context: I did some benchmarking of librados latency. I see a latency of 1 ms to read/stat objects of very small size (5 bytes in this case). If we (re-)write such an object with a 3-fold replica configuration on a 10 GBit setup with 1000 disks, I see a latency of 80 ms per object; for an append it is 75 ms. If we do a massive test with the benchmark tool, the total object creation rate saturates at 20 kHz, which is ok, but the individual latency is higher than I would expect.

Is there something in the OSD delaying communication? I don't believe it takes 80 ms to sync 5 bytes on an idle pool to a hard disk with a network round-trip time of far less than a ms.

Cheers, Andreas.

________________________________________
From: Loic Dachary [loic@dachary.org]
Sent: 25 August 2013 13:49
To: Andreas Joachim Peters
Cc: Ceph Development
Subject: Re: CEPH Erasure Encoding + OSD Scalability

On 24/08/2013 21:41, Loic Dachary wrote:
>
>
> On 24/08/2013 15:30, Andreas-Joachim Peters wrote:
>> Hi Loic,
>> I will start to review
>
> Cool :-)
>
> ...maybe you can briefly explain a few things beforehand:
>>
>> 1) the buffer management ... who allocates the output buffers for the encoding? Are they always malloced, or does it use some generic CEPH buffer recycling functionality?
>
> The output bufferlist is allocated by the plugin and it is the responsibility of the caller to deallocate them. I will write doxygen documentation
> https://github.com/ceph/ceph/pull/518/files#r5966727

Hi Andreas,

The documentation added today in
https://github.com/dachary/ceph/blob/wip-5878/src/osd/ErasureCodeInterface.h
will hopefully clarify things. It requires an understanding of https://github.com/ceph/ceph/blob/master/src/include/buffer.h

Let me know if you have more questions.

>
>> 2) do you support to retrieve partial blocks or only the full 4M block? are decoded blocks cached for some time?
>
> This is outside of the scope of https://github.com/ceph/ceph/pull/518/files : the plugin can handle encode/decode of 128 bytes or 4M in the same way.
>
>> 3) do you want to tune the 2+1 basic code for performance or is it just proof of concept? If yes, then you should move over the encoding buffer with *ptr++ and use the largest available vector size for the used platform to perform XOR operations. I will send you an improved version of the loop if you want ...
>
> The 2+1 is just a proof of concept. I completed a first implementation of the jerasure plugin https://github.com/ceph/ceph/pull/538/files which is meant to be used as a default.
>
>> 4) if you are interested I can also write code for a (3+3) plugin which tolerates 2-3 lost stripes (one has to add P3=A^B^C to my [3,2] proposal). At least it reduces the overhead of 3-fold replication from 300% => 200% ...
>
> It would be great to have such a plugin :-)
>
>> 5) will you add CRC32C checksums to the blocks (4M block or 4k pages?) or will this be a CEPH generic functionality for any kind of block?
>
> The idea is to have a CRC32C checksum per object / shard ( as described in http://ceph.com/docs/master/dev/osd_internals/erasure_coding/#glossary ) : it is the only way for scrubbing to figure out if a given shard is not corrupted and not too expensive since erasure coded pool only support full writes + append and not partial writes that would require to re-calculate the CRC32C for the whole shard each time one byte is changed.
>
>> 6) do you put a kind of header or magic into the encoded blocks to verify that your input blocks are actually corresponding?
>
> This has not been decided yet but I think it would be sensible to use the object attributes ( either xattr or leveldb ) to store meta information instead of creating a file format specifically designed for erasure code.
>
> Cheers
>
>> Cheers Andreas.
>>
>>
>>
>>
>> On Fri, Aug 23, 2013 at 1:03 AM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>
>>
>>
>>     On 22/08/2013 23:42, Andreas-Joachim Peters wrote:> Hi Loic,
>>     > sorry for the late reply, I was on vacation ... you are right, I made a simple logical mistake since I assumed you lose only the data stripes but never the parity stripes, which is a very wrong assumption.
>>     >
>>     > So for testing you probably could just implement (2+1) and then move to jerasure or dual parity (4+2) where you build horizontal and diagonal parities.
>>     >
>>
>>     Hi Andreas,
>>
>>     That's what I did :-) It would be great if you could review the proposed implementation at https://github.com/ceph/ceph/pull/518/files . I'll keep working on https://github.com/dachary/ceph/commit/83845a66ae1cba63c122c0ef7658b97b474c2bd2 tomorrow to create the jerasure plugin but it's not ready for review yet.
>>
>>     Cheers
>>
>>     > Cheers Andreas.
>>     >
>>     >
>>     >
>>     >
>>     >
>>     > On Mon, Aug 19, 2013 at 12:35 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>> wrote:
>>     >
>>     >     Hi Andreas,
>>     >
>>     >     Trying to write minimal code as you suggested, for an example plugin. My first attempt at writing an erasure coding function. I don't get how you can rebuild P1 + A from P2 + B + C. I must be missing something obvious :-)
>>     >
>>     >     Cheers
>>     >
>>     >     On 07/07/2013 23:04, Andreas Joachim Peters wrote:
>>     >     >
>>     >     > Hi Loic,
>>     >     > I don't think there is a better generic implementation. Just made a benchmark .. the Jerasure library with algorithm 'cauchy_good' gives 1.1 GB/s (Xeon 2.27 GHz) on a single core for a 4+2 encoding w=32. Just to give a feeling if you do 10+4 it is 300 MB/s .... there is a specialized implementation in QFS (Hadoop in C++) for (M+3) ... for curiosity I will make a benchmark with this to compare with Jerasure ...
>>     >     >
>>     >     > In any case I would do an optimized implementation for 3+2, which would probably be the most performant implementation, having the same reliability as standard 3-fold replication in CEPH while using only 53% of the space.
>>     >     >
>>     >     > 3+2 is trivial since you encode (A,B,C) with only two parity operations
>>     >     > P1 = A^B
>>     >     > P2 = B^C
>>     >     > and reconstruct with one or two parity operations:
>>     >     > A = P1^B
>>     >     > B = P1^A
>>     >     > B = P2^C
>>     >     > C = P2^B
>>     >     > aso.
>>     >     >
>>     >     > You can write this as a simple loop using advanced vector extensions on Intel (AVX). I can paste a benchmark tomorrow.
>>     >     >
>>     >     > Considering the crc32c-intel code you added ... I would provide a function which computes a crc32c checksum and detects if it can do it using SSE4.2, or otherwise implements just the standard algorithm, e.g. if you run in a virtual machine you need this emulation ...
>>     >     >
>>     >     > Cheers Andreas.
>>     >     > ________________________________________
>>     >     > From: Loic Dachary [loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>]
>>     >     > Sent: 06 July 2013 22:47
>>     >     > To: Andreas Joachim Peters
>>     >     > Cc: ceph-devel@vger.kernel.org <mailto:ceph-devel@vger.kernel.org> <mailto:ceph-devel@vger.kernel.org <mailto:ceph-devel@vger.kernel.org>>
>>     >     > Subject: Re: CEPH Erasure Encoding + OSD Scalability
>>     >     >
>>     >     > Hi Andreas,
>>     >     >
>>     >     > Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using
>>     >     >
>>     >     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
>>     >     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
>>     >     >
>>     >     > Do you know of a better / faster implementation ? Is there a tradeoff between (C)RS and RS ?
>>     >     >
>>     >     > Cheers
>>     >     >
>>     >     > On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
>>     >     >> HI Loic,
>>     >     >> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
>>     >     >>
>>     >     >> Considering the checksumming ... for comparison, the CRC32 code from libz runs on an 8-core Xeon at ~730 MB/s for small block sizes, while the SSE4.2 CRC32C checksum runs at ~2 GByte/s.
>>     >     >>
>>     >     >> Cheers Andreas.
>>     >     >>
>>     >     >>
>>     >     >>
>>     >     >>
>>     >     >> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>> <mailto:loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>>> wrote:
>>     >     >>
>>     >     >>     Hi Andreas,
>>     >     >>
>>     >     >>     On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
>>     >     >>     > thanks for the responses!
>>     >     >>     >
>>     >     >>     > Maybe this is useful for your erasure code discussion:
>>     >     >>     >
>>     >     >>     > as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create a 2 parity chunks.
>>     >     >>     >
>>     >     >>     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>>     >     >>     >
>>     >     >>     > You can now easily detect data corruption using the local checksums and avoid to read any parity information and (C)RS decoding if there is no corruption detected. Moreover CRC32C computation is distributed over several (in this case 4) machines while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>>     >     >>
>>     >     >>     What does (C)RS mean ? (C)Reed-Solomon ?
>>     >     >>
>>     >     >>     > In our case we write this checksum information separate from the original data ... while in a block-based storage like CEPH it would be probably inlined in the data chunk.
>>     >     >>     > If an OSD detects to run on BRTFS or ZFS one could disable automatically the CRC32C code.
>>     >     >>
>>     >     >>     Nice. I did not know that was built-in :-)
>>     >     >>     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>>     >     >>
>>     >     >>     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
>>     >     >>
>>     >     >>     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
>>     >     >>
>>     >     >>     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>>     >     >>
>>     >     >>     Cheers
>>     >     >>
>>     >     >>     > As far as I know, with the RS CODEC we use you can miss stripes (data = 0) in the decoding process, but you cannot inject corrupted stripes into it, so the block checksumming is important.
>>     >     >>     >
>>     >     >>     > Cheers Andreas.
>>     >     >>
>>     >     >>     --
>>     >     >>     Loïc Dachary, Artisan Logiciel Libre
>>     >     >>     All that is necessary for the triumph of evil is that good people do nothing.
>>     >     >>
>>     >     >>
>>     >     >
>>     >     > --
>>     >     > Loïc Dachary, Artisan Logiciel Libre
>>     >     > All that is necessary for the triumph of evil is that good people do nothing.
>>     >     >
>>     >     > --
>>     >     > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>     >     > the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org> <mailto:majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>>
>>     >     > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>     >     >
>>     >
>>     >     --
>>     >     Loïc Dachary, Artisan Logiciel Libre
>>     >     All that is necessary for the triumph of evil is that good people do nothing.
>>     >
>>     >
>>
>>     --
>>     Loïc Dachary, Artisan Logiciel Libre
>>     All that is necessary for the triumph of evil is that good people do nothing.
>>
>>
>

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-09-14 14:59                     ` Andreas Joachim Peters
@ 2013-09-14 18:04                       ` Loic Dachary
  0 siblings, 0 replies; 52+ messages in thread
From: Loic Dachary @ 2013-09-14 18:04 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 15523 bytes --]


Hi Andreas,

On 14/09/2013 16:59, Andreas Joachim Peters wrote:> Hi Loic,
>
> I finally run/read the code of the erasure encoding.

Great !

> What I noticed is that in your implementation you always copy the data to encode once: you add a padding block to the bufferlist and then call "out.c_str()", which calls bufferlist::rebuild, allocates a new buffer with the full size of all chunks, and then copies the input data into it. Please correct me if I am wrong ... couldn't you just allocate the additional redundancy chunks and return bufferptrs pointing into the 'in' bufferlist ?

I assume you're referring to

https://github.com/ceph/ceph/blob/e9e53912503259326a7877bda31c4360302c2c34/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L78

and indeed it implies an extra copy because of the padding. The optimization you're suggesting, if I get it right, would only require an extra copy of the last data chunk. The code would extract the char * from in with c_str() before padding ( hence no rebuild ), feed data[] with pointers into this buffer for all chunks but the last one if it requires padding, then allocate, copy and pad the last chunk if necessary. The allocated area can be made big enough to accommodate the coding chunks. That would reduce the copying to the minimum. It also means that the returned bufferlist has to properly reference the input buffer, and that the caller must not modify the content of *in* after calling encode, otherwise it may have a side effect on the *encoded* result because they really are the same pointer.

>
> Another question is, why 'in' in the encode function is a list of buffers? Maybe this is the natural interface object in CEPH IO, don't know ... the implementation would concatenate them and produce chunks for the merged block ...

You guess right. Initially I had the encode function accept a bufferptr instead of a bufferlist, but it's not the preferred API data structure to convey a buffer.

> I will try to run a benchmark to see, if the additional copy has a visible impact on the performance, however it looks unnecessary.

Indeed there should be a way to avoid this extra copy.

> I am also more or less finished with the 3 + (3XOR) implementation ... will do also a benchmark with this and let you know the result.

Cool !

> Last question  a little bit out of context, I did some benchmark about librados and latency. I see a latency of 1ms to read/stat objects of very small size (5 bytes in this case). If we (re-)write such an object with a 3-fold replica configuration on a 10 GBit setup with 1000 disks I see a latency of 80 ms per object. If I append it is 75 ms. If we do a massive test with the benchmark tool the total object creation rate saturates at 20kHz which is ok however the individual latency is higher than I would expect ?
>
> Is there something in the OSD delaying communication since I don't believe it takes 80 ms to sync 5 bytes on an idle pool to a harddisk with a network roundtrip time of far less than a ms ?

I suggest you start a separate thread for this, chances are your question will not be noticed otherwise.

Cheers

> Cheers, Andreas.
>
> ________________________________________
> From: Loic Dachary [loic@dachary.org]
> Sent: 25 August 2013 13:49
> To: Andreas Joachim Peters
> Cc: Ceph Development
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>
> On 24/08/2013 21:41, Loic Dachary wrote:
>>
>>
>> On 24/08/2013 15:30, Andreas-Joachim Peters wrote:
>>> Hi Loic,
>>> I will start to review
>>
>> Cool :-)
>>
>> ...maybe you can briefly explain a few things beforehand:
>>>
>>> 1) the buffer management ... who allocates the output buffers for the encoding? Are they always malloced, or does it use some generic CEPH buffer recycling functionality?
>>
>> The output bufferlist is allocated by the plugin and it is the responsibility of the caller to deallocate them. I will write doxygen documentation
>> https://github.com/ceph/ceph/pull/518/files#r5966727
>
> Hi Andreas,
>
> The documentation added today in
> https://github.com/dachary/ceph/blob/wip-5878/src/osd/ErasureCodeInterface.h
> will hopefully clarify things. It requires an understanding of https://github.com/ceph/ceph/blob/master/src/include/buffer.h
>
> Let me know if you have more questions.
>
>>
>>> 2) do you support to retrieve partial blocks or only the full 4M block? are decoded blocks cached for some time?
>>
>> This is outside of the scope of https://github.com/ceph/ceph/pull/518/files : the plugin can handle encode/decode of 128 bytes or 4M in the same way.
>>
>>> 3) do you want to tune the 2+1 basic code for performance or is it just proof of concept? If yes, then you should move over the encoding buffer with *ptr++ and use the largest available vector size for the used platform to perform XOR operations. I will send you an improved version of the loop if you want ...
>>
>> The 2+1 is just a proof of concept. I completed a first implementation of the jerasure plugin https://github.com/ceph/ceph/pull/538/files which is meant to be used as a default.
>>
>>> 4) if you are interested I can also write code for a (3+3) plugin which tolerates 2-3 lost stripes (one has to add P3=A^B^C to my [3,2] proposal). At least it reduces the overhead of 3-fold replication from 300% => 200% ...
>>
>> It would be great to have such a plugin :-)
>>
>>> 5) will you add CRC32C checksums to the blocks (4M block or 4k pages?) or will this be a CEPH generic functionality for any kind of block?
>>
>> The idea is to have a CRC32C checksum per object / shard ( as described in http://ceph.com/docs/master/dev/osd_internals/erasure_coding/#glossary ) : it is the only way for scrubbing to figure out if a given shard is not corrupted and not too expensive since erasure coded pool only support full writes + append and not partial writes that would require to re-calculate the CRC32C for the whole shard each time one byte is changed.
>>
>>> 6) do you put a kind of header or magic into the encoded blocks to verify that your input blocks are actually corresponding?
>>
>> This has not been decided yet but I think it would be sensible to use the object attributes ( either xattr or leveldb ) to store meta information instead of creating a file format specifically designed for erasure code.
>>
>> Cheers
>>
>>> Cheers Andreas.
>>>
>>>
>>>
>>>
>>> On Fri, Aug 23, 2013 at 1:03 AM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>>
>>>
>>>
>>>     On 22/08/2013 23:42, Andreas-Joachim Peters wrote:> Hi Loic,
>>>     > sorry for the late reply, I was on vacation ... you are right, I made a simple logical mistake since I assumed you lose only the data stripes but never the parity stripes, which is a very wrong assumption.
>>>     >
>>>     > So for testing you probably could just implement (2+1) and then move to jerasure or dual parity (4+2) where you build horizontal and diagonal parities.
>>>     >
>>>
>>>     Hi Andreas,
>>>
>>>     That's what I did :-) It would be great if you could review the proposed implementation at https://github.com/ceph/ceph/pull/518/files . I'll keep working on https://github.com/dachary/ceph/commit/83845a66ae1cba63c122c0ef7658b97b474c2bd2 tomorrow to create the jerasure plugin but it's not ready for review yet.
>>>
>>>     Cheers
>>>
>>>     > Cheers Andreas.
>>>     >
>>>     >
>>>     >
>>>     >
>>>     >
>>>     > On Mon, Aug 19, 2013 at 12:35 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>> wrote:
>>>     >
>>>     >     Hi Andreas,
>>>     >
>>>     >     Trying to write minimal code as you suggested, for an example plugin. My first attempt at writing an erasure coding function. I don't get how you can rebuild P1 + A from P2 + B + C. I must be missing something obvious :-)
>>>     >
>>>     >     Cheers
>>>     >
>>>     >     On 07/07/2013 23:04, Andreas Joachim Peters wrote:
>>>     >     >
>>>     >     > Hi Loic,
>>>     >     > I don't think there is a better generic implementation. Just made a benchmark .. the Jerasure library with algorithm 'cauchy_good' gives 1.1 GB/s (Xeon 2.27 GHz) on a single core for a 4+2 encoding w=32. Just to give a feeling if you do 10+4 it is 300 MB/s .... there is a specialized implementation in QFS (Hadoop in C++) for (M+3) ... for curiosity I will make a benchmark with this to compare with Jerasure ...
>>>     >     >
>>>     >     > In any case I would do an optimized implementation for 3+2, which would probably be the most performant implementation, having the same reliability as standard 3-fold replication in CEPH while using only 53% of the space.
>>>     >     >
>>>     >     > 3+2 is trivial since you encode (A,B,C) with only two parity operations
>>>     >     > P1 = A^B
>>>     >     > P2 = B^C
>>>     >     > and reconstruct with one or two parity operations:
>>>     >     > A = P1^B
>>>     >     > B = P1^A
>>>     >     > B = P2^C
>>>     >     > C = P2^B
>>>     >     > aso.
>>>     >     >
>>>     >     > You can write this as a simple loop using advanced vector extensions on Intel (AVX). I can paste a benchmark tomorrow.
>>>     >     >
>>>     >     > Considering the crc32c-intel code you added ... I would provide a function which computes a crc32c checksum and detects if it can do it using SSE4.2, or otherwise implements just the standard algorithm, e.g. if you run in a virtual machine you need this emulation ...
>>>     >     >
>>>     >     > Cheers Andreas.
>>>     >     > ________________________________________
>>>     >     > From: Loic Dachary [loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>]
>>>     >     > Sent: 06 July 2013 22:47
>>>     >     > To: Andreas Joachim Peters
>>>     >     > Cc: ceph-devel@vger.kernel.org <mailto:ceph-devel@vger.kernel.org> <mailto:ceph-devel@vger.kernel.org <mailto:ceph-devel@vger.kernel.org>>
>>>     >     > Subject: Re: CEPH Erasure Encoding + OSD Scalability
>>>     >     >
>>>     >     > Hi Andreas,
>>>     >     >
>>>     >     > Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using
>>>     >     >
>>>     >     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
>>>     >     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
>>>     >     >
>>>     >     > Do you know of a better / faster implementation ? Is there a tradeoff between (C)RS and RS ?
>>>     >     >
>>>     >     > Cheers
>>>     >     >
>>>     >     > On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
>>>     >     >> HI Loic,
>>>     >     >> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
>>>     >     >>
>>>     >     >> Considering the checksumming ... for comparison, the CRC32 code from libz runs on an 8-core Xeon at ~730 MB/s for small block sizes, while the SSE4.2 CRC32C checksum runs at ~2 GByte/s.
>>>     >     >>
>>>     >     >> Cheers Andreas.
>>>     >     >>
>>>     >     >>
>>>     >     >>
>>>     >     >>
>>>     >     >> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>> <mailto:loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>>> wrote:
>>>     >     >>
>>>     >     >>     Hi Andreas,
>>>     >     >>
>>>     >     >>     On 04/07/2013 23:01, Andreas Joachim Peters wrote:> Hi Loic,
>>>     >     >>     > thanks for the responses!
>>>     >     >>     >
>>>     >     >>     > Maybe this is useful for your erasure code discussion:
>>>     >     >>     >
>>>     >     >>     > as an example in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create a 2 parity chunks.
>>>     >     >>     >
>>>     >     >>     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>>>     >     >>     >
>>>     >     >>     > You can now easily detect data corruption using the local checksums and avoid to read any parity information and (C)RS decoding if there is no corruption detected. Moreover CRC32C computation is distributed over several (in this case 4) machines while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>>>     >     >>
>>>     >     >>     What does (C)RS mean ? (C)Reed-Solomon ?
>>>     >     >>
>>>     >     >>     > In our case we write this checksum information separate from the original data ... while in a block-based storage like CEPH it would be probably inlined in the data chunk.
>>>     >     >>     > If an OSD detects to run on BRTFS or ZFS one could disable automatically the CRC32C code.
>>>     >     >>
>>>     >     >>     Nice. I did not know that was built-in :-)
>>>     >     >>     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>>>     >     >>
>>>     >     >>     > (wouldn't CRC32C be also useful for normal CEPH block replication? )
>>>     >     >>
>>>     >     >>     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
>>>     >     >>
>>>     >     >>     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>>>     >     >>
>>>     >     >>     Cheers
>>>     >     >>
>>>     >     >>     > As far as I know, with the RS CODEC we use you can miss stripes (data = 0) in the decoding process, but you cannot inject corrupted stripes into it, so the block checksumming is important.
>>>     >     >>     >
>>>     >     >>     > Cheers Andreas.
>>>     >     >>
>>>     >     >>     --
>>>     >     >>     Loïc Dachary, Artisan Logiciel Libre
>>>     >     >>     All that is necessary for the triumph of evil is that good people do nothing.
>>>     >     >>
>>>     >     >>
>>>     >     >
>>>     >     > --
>>>     >     > Loïc Dachary, Artisan Logiciel Libre
>>>     >     > All that is necessary for the triumph of evil is that good people do nothing.
>>>     >     >
>>>     >     > --
>>>     >     > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>     >     > the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org> <mailto:majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>>
>>>     >     > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>     >     >
>>>     >
>>>     >     --
>>>     >     Loïc Dachary, Artisan Logiciel Libre
>>>     >     All that is necessary for the triumph of evil is that good people do nothing.
>>>     >
>>>     >
>>>
>>>     --
>>>     Loïc Dachary, Artisan Logiciel Libre
>>>     All that is necessary for the triumph of evil is that good people do nothing.
>>>
>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-12-13 15:47                           ` Andreas Joachim Peters
@ 2013-12-13 16:42                             ` Loic Dachary
  0 siblings, 0 replies; 52+ messages in thread
From: Loic Dachary @ 2013-12-13 16:42 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 3358 bytes --]



On 13/12/2013 16:47, Andreas Joachim Peters wrote:> Hi Loic, 
> 
> I (re-)pushed/fixed wip-bpc-01 in my GIT repository.
> 

There still seems to be an issue: https://github.com/ceph/ceph/pull/740 shows as unmergeable (if such a word exists ;-).

> There is one commit of general interest to 'galois.c', which gives me a factor 1.5 speed improvement in the Jerasure code base (I exchanged the region XOR loop for vector operations where available).
> 

Great.

> I have also replaced the parity implementation to use SSE registers (via assembler), as seen in snapraid, which gives a factor 2.5 for the BPC part ...

Great.

> I needed to add a test for this in arch/intel.c, like it was done for crc32c.

Ok. 

It also occurs to me that the benchmark should show the erasure code overhead, i.e. the default 10 + 4 means +40%, and with BPC using 5 chunks at a time it would be +60%.

Cheers

> Cheers Andreas.
> 
> ________________________________________
> From: Mark Nelson [mark.nelson@inktank.com]
> Sent: 11 December 2013 14:00
> To: Loic Dachary
> Cc: Andreas Joachim Peters; ceph-devel@vger.kernel.org
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
> 
> On 12/11/2013 06:28 AM, Loic Dachary wrote:
>>
>>
>> On 11/12/2013 10:49, Andreas Joachim Peters wrote:> Hi Loic,
>>> I am a little bit confused about which kind of tool you actually want. Do you want a simple benchmark to check for degradation, or a full profiling tool?
>>>
>>
>> I was not sure, hence the confusion.
>>
>>> Most of the external tools have the problem that you measure the whole thing including buffer allocation and initialization. We probably don't want to measure how long it takes to allocate memory and write random numbers into it.
>>>
>>> I would just do:
>>>
>>> < prepare memory>
>>> <take CPU/realtime>
>>> < run algorithm >
>>> <take CPU/realtime>
>>> < print result>
>>>
>>
>> Ok, I'll do that.
>>
>> I'm glad I learnt about the other tools in the process, even if only to conclude that they are not needed.
> 
> Certainly things like perf are useful for profiling but may be overkill
> in the general case depending on what you are trying to do.  Collectl is
> pretty low overhead though if you are just looking for per-process CPU
> utilization stats.
> 
>>
>> Cheers
>>
>>> Now one could also run the perf-stat tool after <prepare memory>, starting it from within the test program and pointing it at the PID running <run algorithm>, so the benchmark would be:
>>>
>>> < prepare memory>
>>> < take CPU/realtime>
>>> < fork=>"perf stat -p <mypid>";
>>> < run algorithm n times>
>>> < take CPU/realtime>
>>> < SIGINT to fork>
>>> < print results>
>>>
>>> As an extension one could also add to have <run algorithm> with <n> threads in a thread pool.
>>>
>>> Cheers Andreas.
>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
  2013-12-11 13:00                         ` Mark Nelson
@ 2013-12-13 15:47                           ` Andreas Joachim Peters
  2013-12-13 16:42                             ` Loic Dachary
  0 siblings, 1 reply; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-12-13 15:47 UTC (permalink / raw)
  To: Mark Nelson, Loic Dachary; +Cc: ceph-devel

Hi Loic, 

I (re-)pushed/fixed wip-bpc-01 in my GIT repository.

There is one commit of general interest to 'galois.c' which gives me a factor of 1.5 speed improvement in the Jerasure code base (I replaced the region XOR loop with vector operations where available).

I have also changed the parity implementation to use SSE registers (via assembler), as seen in snapraid, which gives a factor of 2.5 for the BPC part ...

I needed to add a feature test for this in arch/intel.c, like the one that exists for the crc32c instruction.
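As a rough illustration of the region-XOR change: the sketch below XORs a region a machine word at a time rather than byte-by-byte (a portable stand-in, not the actual commit, which uses SSE registers directly via assembler):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR src into dst eight bytes at a time instead of byte-by-byte.
   The fixed-size memcpy calls compile down to plain 64-bit loads and
   stores, and compilers will usually auto-vectorize this loop with
   SSE/AVX when available. */
static void region_xor(uint8_t *dst, const uint8_t *src, size_t bytes)
{
    size_t i = 0;
    for (; i + sizeof(uint64_t) <= bytes; i += sizeof(uint64_t)) {
        uint64_t x, y;
        memcpy(&x, dst + i, sizeof x);
        memcpy(&y, src + i, sizeof y);
        x ^= y;
        memcpy(dst + i, &x, sizeof x);
    }
    for (; i < bytes; i++)   /* remaining tail bytes */
        dst[i] ^= src[i];
}
```

The memcpy form sidesteps alignment and strict-aliasing problems that a direct cast to uint64_t* would raise.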

Cheers Andreas.

________________________________________
From: Mark Nelson [mark.nelson@inktank.com]
Sent: 11 December 2013 14:00
To: Loic Dachary
Cc: Andreas Joachim Peters; ceph-devel@vger.kernel.org
Subject: Re: CEPH Erasure Encoding + OSD Scalability

On 12/11/2013 06:28 AM, Loic Dachary wrote:
>
>
> On 11/12/2013 10:49, Andreas Joachim Peters wrote:
>> Hi Loic,
>> I am a little bit confused about which kind of tool you actually want. Do you want a simple benchmark to check for degradation, or a full profiler tool?
>>
>
> I was not sure, hence the confusion.
>
>> Most of the external tools have the problem that they measure the whole thing, including buffer allocation and initialization. We probably don't want to measure how long it takes to allocate memory and write random numbers into it.
>>
>> I would just do:
>>
>> < prepare memory>
>> <take CPU/realtime>
>> < run algorithm >
>> <take CPU/realtime>
>> < print result>
>>
>
> Ok, I'll do that.
>
> I'm glad I learnt about the other tools in the process, even if only to conclude that they are not needed.

Certainly things like perf are useful for profiling but may be overkill
in the general case depending on what you are trying to do.  Collectl is
pretty low overhead though if you are just looking for per-process CPU
utilization stats.

>
> Cheers
>
>> Now one could also run the perf-stat tool after <prepare memory>, starting it from within the test program and pointing it at the PID running <run algorithm>, so the benchmark would be:
>>
>> < prepare memory>
>> < take CPU/realtime>
>> < fork=>"perf stat -p <mypid>";
>> < run algorithm n times>
>> < take CPU/realtime>
>> < SIGINT to fork>
>> < print results>
>>
>> As an extension, one could also have <run algorithm> run with <n> threads in a thread pool.
>>
>> Cheers Andreas.
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-12-11 12:28                       ` Loic Dachary
@ 2013-12-11 13:00                         ` Mark Nelson
  2013-12-13 15:47                           ` Andreas Joachim Peters
  0 siblings, 1 reply; 52+ messages in thread
From: Mark Nelson @ 2013-12-11 13:00 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Andreas Joachim Peters, ceph-devel

On 12/11/2013 06:28 AM, Loic Dachary wrote:
>
>
> On 11/12/2013 10:49, Andreas Joachim Peters wrote:
>> Hi Loic,
>> I am a little bit confused about which kind of tool you actually want. Do you want a simple benchmark to check for degradation, or a full profiler tool?
>>
>
> I was not sure, hence the confusion.
>
>> Most of the external tools have the problem that they measure the whole thing, including buffer allocation and initialization. We probably don't want to measure how long it takes to allocate memory and write random numbers into it.
>>
>> I would just do:
>>
>> < prepare memory>
>> <take CPU/realtime>
>> < run algorithm >
>> <take CPU/realtime>
>> < print result>
>>
>
> Ok, I'll do that.
>
> I'm glad I learnt about the other tools in the process, even if only to conclude that they are not needed.

Certainly things like perf are useful for profiling but may be overkill 
in the general case depending on what you are trying to do.  Collectl is 
pretty low overhead though if you are just looking for per-process CPU 
utilization stats.

>
> Cheers
>
>> Now one could also run the perf-stat tool after <prepare memory>, starting it from within the test program and pointing it at the PID running <run algorithm>, so the benchmark would be:
>>
>> < prepare memory>
>> < take CPU/realtime>
>> < fork=>"perf stat -p <mypid>";
>> < run algorithm n times>
>> < take CPU/realtime>
>> < SIGINT to fork>
>> < print results>
>>
>> As an extension, one could also have <run algorithm> run with <n> threads in a thread pool.
>>
>> Cheers Andreas.
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-12-11  9:49                     ` Andreas Joachim Peters
@ 2013-12-11 12:28                       ` Loic Dachary
  2013-12-11 13:00                         ` Mark Nelson
  0 siblings, 1 reply; 52+ messages in thread
From: Loic Dachary @ 2013-12-11 12:28 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 1543 bytes --]



On 11/12/2013 10:49, Andreas Joachim Peters wrote:
> Hi Loic,
> I am a little bit confused about which kind of tool you actually want. Do you want a simple benchmark to check for degradation, or a full profiler tool?
> 

I was not sure, hence the confusion.

> Most of the external tools have the problem that they measure the whole thing, including buffer allocation and initialization. We probably don't want to measure how long it takes to allocate memory and write random numbers into it.
> 
> I would just do:
> 
> < prepare memory>
> <take CPU/realtime>
> < run algorithm >
> <take CPU/realtime>
> < print result>
> 

Ok, I'll do that. 

I'm glad I learnt about the other tools in the process, even if only to conclude that they are not needed. 

Cheers

> Now one could also run the perf-stat tool after <prepare memory>, starting it from within the test program and pointing it at the PID running <run algorithm>, so the benchmark would be:
> 
> < prepare memory>
> < take CPU/realtime>
> < fork=>"perf stat -p <mypid>";
> < run algorithm n times>
> < take CPU/realtime>
> < SIGINT to fork>
> < print results>
> 
> As an extension, one could also have <run algorithm> run with <n> threads in a thread pool.
> 
> Cheers Andreas.
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
  2013-12-10  8:43                   ` Loic Dachary
@ 2013-12-11  9:49                     ` Andreas Joachim Peters
  2013-12-11 12:28                       ` Loic Dachary
  0 siblings, 1 reply; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-12-11  9:49 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Hi Loic, 
I am a little bit confused about which kind of tool you actually want. Do you want a simple benchmark to check for degradation, or a full profiler tool?

Most of the external tools have the problem that they measure the whole thing, including buffer allocation and initialization. We probably don't want to measure how long it takes to allocate memory and write random numbers into it.

I would just do:

< prepare memory>
<take CPU/realtime>
< run algorithm >
<take CPU/realtime>
< print result>

Now one could also run the perf-stat tool after <prepare memory>, starting it from within the test program and pointing it at the PID running <run algorithm>, so the benchmark would be:

< prepare memory>
< take CPU/realtime>
< fork=>"perf stat -p <mypid>";
< run algorithm n times>
< take CPU/realtime>
< SIGINT to fork>
< print results>

As an extension, one could also have <run algorithm> run with <n> threads in a thread pool.

Cheers Andreas.




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-12-09 16:45                 ` Loic Dachary
  2013-12-09 17:03                   ` Mark Nelson
@ 2013-12-10  8:43                   ` Loic Dachary
  2013-12-11  9:49                     ` Andreas Joachim Peters
  1 sibling, 1 reply; 52+ messages in thread
From: Loic Dachary @ 2013-12-10  8:43 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 8580 bytes --]

Maybe using

http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html

is enough. fsbench looks overkill indeed.

/me exploring options ;-)

On 09/12/2013 17:45, Loic Dachary wrote:
> Hi,
> 
> Mark Nelson suggested we use perf (linux-tools) for benchmarking. It looks like something that would indeed help: the benchmark program would only concern itself with doing some work according to the options, and let performance be collected from the outside using tools that are familiar to people doing benchmarking.
> 
> What do you think ?
> 
> Cheers
> 
> $ perf stat -e
>   Error: switch `e' requires a value
> 
>  usage: perf stat [<options>] [<command>]
> 
>     -e, --event <event>   event selector. use 'perf list' to list available events
>         --filter <filter>
>                           event filter
>     -i, --no-inherit      child tasks do not inherit counters
>     -p, --pid <pid>       stat events on existing process id
>     -t, --tid <tid>       stat events on existing thread id
>     -a, --all-cpus        system-wide collection from all CPUs
>     -g, --group           put the counters into a counter group
>     -c, --scale           scale/normalize counters
>     -v, --verbose         be more verbose (show counter open errors, etc)
>     -r, --repeat <n>      repeat command and print average + stddev (max: 100, forever: 0)
>     -n, --null            null run - dont start any counters
>     -d, --detailed        detailed run - start a lot of events
>     -S, --sync            call sync() before starting a run
>     -B, --big-num         print large numbers with thousands' separators
>     -C, --cpu <cpu>       list of cpus to monitor in system-wide
>     -A, --no-aggr         disable CPU count aggregation
>     -x, --field-separator <separator>
>                           print counts with custom separator
>     -G, --cgroup <name>   monitor event in cgroup name only
>     -o, --output <file>   output file name
>         --append          append to the output file
>         --log-fd <n>      log output to fd, instead of stderr
>         --pre <command>   command to run prior to the measured command
>         --post <command>  command to run after to the measured command
>     -I, --interval-print <n>
>                           print counts at regular interval in ms (>= 100)
>         --per-socket      aggregate counts per processor socket
>         --per-core        aggregate counts per physical processor core
> 
> 
> On 12/11/2013 19:06, Loic Dachary wrote:
>> Hi Andreas,
>>
>> On 12/11/2013 02:11, Andreas Joachim Peters wrote:
>>> Hi Loic,
>>>
>>> I am finally doing the benchmark tool and I found a bunch of wrong parameter checks which can make the whole thing SEGV.
>>>
>>> All the RAID-6 codes have restrictions on the parameters but they are not correctly enforced for Liberation & Blaum-Roth codes in the CEPH wrapper class ... see text from PDF
>>>
>>> "Minimal Density RAID-6 codes are MDS codes based on binary matrices which satisfy a lower-bound on the number  of non-zero entries. Unlike Cauchy coding, the bit-matrix elements do not correspond to elements in GF (2 w ). Instead, the bit-matrix itself has the proper MDS property. Minimal Density RAID-6 codes perform faster than Reed-Solomon and Cauchy Reed-Solomon codes for the same parameters. Liberation coding, Liber8tion coding, and Blaum-Roth coding are three examples of this kind of coding that are supported in jerasure.
>>>
>>> With each of these codes, m must be equal to two and k must be less than or equal to w. The value of w has restrictions based on the code:
>>>
>>> • With Liberation coding, w must be a prime number [Pla08b].
>>> • With Blaum-Roth coding, w + 1 must be a prime number [BR99]. • With Liber8tion coding, w must equal 8 [Pla08a].
>>>
>>> ...
>>>
>>> Will you add these fixes?
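The restrictions quoted above could be enforced with a small guard before calling into jerasure. A hypothetical sketch (raid6_params_ok and the technique-name spellings are assumptions for illustration, not the actual Ceph wrapper API):

```c
#include <stdbool.h>
#include <string.h>

static bool is_prime(unsigned n)
{
    if (n < 2)
        return false;
    for (unsigned i = 2; i * i <= n; i++)
        if (n % i == 0)
            return false;
    return true;
}

/* Enforce the minimal-density RAID-6 restrictions: m must equal 2,
   k must not exceed w, and w is constrained per technique. */
static bool raid6_params_ok(const char *technique,
                            unsigned k, unsigned m, unsigned w)
{
    if (m != 2 || k > w)
        return false;
    if (strcmp(technique, "liberation") == 0)
        return is_prime(w);        /* w must be prime */
    if (strcmp(technique, "blaum_roth") == 0)
        return is_prime(w + 1);    /* w + 1 must be prime */
    if (strcmp(technique, "liber8tion") == 0)
        return w == 8;             /* w must equal 8 */
    return false;                  /* unknown technique */
}
```

Rejecting bad parameters up front, instead of letting jerasure dereference an invalid matrix, is what prevents the SEGVs mentioned above.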
>>
>> Nice catch. I created and assigned to myself : http://tracker.ceph.com/issues/6754
>>>
>>> For the benchmark suite: it currently runs 308 different configurations of the 2 algorithms which make sense from a performance point of view, and produces this output:
>>>
>>>
>>> # -----------------------------------------------------------------
>>> # Erasure Coding Benchmark - (C) CERN 2013 - Andreas.Joachim.Peters@cern.ch
>>> # Ram-Size=12614856704 Allocation-Size=100000000
>>> # -----------------------------------------------------------------
>>> # [ -BENCH- ] [       ] technique=memcpy                                                            speed=5.408 [GB/s] latency=9.245 ms
>>> # [ -BENCH- ] [       ] technique=d=a^b^c-xor                                                       speed=4.377 [GB/s] latency=17.136 ms
>>> # [ -BENCH- ] [001/304] technique=cauchy_good:k=05:m=2:w=8:lp=0:packet=00064:size=50000000          speed=1.308 [GB/s] latency=038	[ms] size-overhead=40	[%]
>>> ..
>>> ..
>>> # [ -BENCH- ] [304/304] technique=liberation:k=24:m=2:w=29:lp=2:packet=65536:size=50000000          speed=0.083 [GB/s] latency=604	[ms] size-overhead=16	[%]
>>> # -----------------------------------------------------------------
>>> # Erasure Code Performance Summary::
>>> # -----------------------------------------------------------------
>>> # RAM:                   12.61 GB
>>> # Allocation-Size        0.10 GB
>>> # -----------------------------------------------------------------
>>> # Byte Initialization:   29.35 MB/s
>>> # Memcpy:                5.41 GB/s
>>> # Triple-XOR:            4.38 GB/s
>>> # -----------------------------------------------------------------
>>> # Fastest RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
>>> # Fastest Triple Failure 0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
>>> # Fastest Quadr. Failure 0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
>>> # -----------------------------------------------------------------
>>> # .................................................................
>>> # Top 1  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
>>> # Top 2  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=16384:size=50000000
>>> # Top 3  RAID6          2.64 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=65536:size=50000000
>>> # Top 4  RAID6          2.60 GB/s liberation:k=07:m=2:w=7:lp=0:packet=16384:size=50000000
>>> # Top 5  RAID6          2.59 GB/s liberation:k=05:m=2:w=7:lp=0:packet=04096:size=50000000
>>> # .................................................................
>>> # Top 1  Triple         0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
>>> # Top 2  Triple         0.94 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=16384:size=50000000
>>> # Top 3  Triple         0.93 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=65536:size=50000000
>>> # Top 4  Triple         0.89 GB/s cauchy_good:k=07:m=3:w=8:lp=0:packet=04096:size=50000000
>>> # Top 5  Triple         0.87 GB/s cauchy_good:k=05:m=3:w=8:lp=0:packet=04096:size=50000000
>>> # .................................................................
>>> # Top 1  Quadr.         0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
>>> # Top 2  Quadr.         0.65 GB/s cauchy_good:k=07:m=4:w=8:lp=0:packet=04096:size=50000000
>>> # Top 3  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=16384:size=50000000
>>> # Top 4  Quadr.         0.64 GB/s cauchy_good:k=05:m=4:w=8:lp=0:packet=04096:size=50000000
>>> # Top 5  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=65536:size=50000000
>>> # .................................................................
>>>
>>> It takes around 30 seconds on my box.
>>
>>
>> That looks great :-) If I understand correctly, it means https://github.com/ceph/ceph/pull/740 will no longer have benchmarks as they are moved to a separate program. Correct ?
>>
>>> I will add a measurement how the XOR and the 3 top algorithms scale with the number of cores and make the object-size configurable from the command line. Anything else ? 
>>
>> It would be convenient to run this from a "workunit" (i.e. a script in ceph/qa/workunits/) so that it can later be run by teuthology integration tests. That could be used to show regressions.
>>
>>> Shall I add the possibility to test a single user-specified configuration via command line arguments?
>>>
>> I would need to play with it to comment usefully.
>>
>> Cheers
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-12-09 16:45                 ` Loic Dachary
@ 2013-12-09 17:03                   ` Mark Nelson
  2013-12-10  8:43                   ` Loic Dachary
  1 sibling, 0 replies; 52+ messages in thread
From: Mark Nelson @ 2013-12-09 17:03 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Andreas Joachim Peters, ceph-devel

I will mention that this is a good tool if you want really detailed 
profiling or CPU counter data about what's going on.  Other tools that 
are more generic (i.e. ones that just read data from proc, e.g. collectl, 
sar, etc.) may also be options.

Mark

On 12/09/2013 10:45 AM, Loic Dachary wrote:
> Hi,
>
> Mark Nelson suggested we use perf (linux-tools) for benchmarking. It looks like something that would indeed help: the benchmark program would only concern itself with doing some work according to the options, and let performance be collected from the outside using tools that are familiar to people doing benchmarking.
>
> What do you think ?
>
> Cheers
>
> $ perf stat -e
>    Error: switch `e' requires a value
>
>   usage: perf stat [<options>] [<command>]
>
>      -e, --event <event>   event selector. use 'perf list' to list available events
>          --filter <filter>
>                            event filter
>      -i, --no-inherit      child tasks do not inherit counters
>      -p, --pid <pid>       stat events on existing process id
>      -t, --tid <tid>       stat events on existing thread id
>      -a, --all-cpus        system-wide collection from all CPUs
>      -g, --group           put the counters into a counter group
>      -c, --scale           scale/normalize counters
>      -v, --verbose         be more verbose (show counter open errors, etc)
>      -r, --repeat <n>      repeat command and print average + stddev (max: 100, forever: 0)
>      -n, --null            null run - dont start any counters
>      -d, --detailed        detailed run - start a lot of events
>      -S, --sync            call sync() before starting a run
>      -B, --big-num         print large numbers with thousands' separators
>      -C, --cpu <cpu>       list of cpus to monitor in system-wide
>      -A, --no-aggr         disable CPU count aggregation
>      -x, --field-separator <separator>
>                            print counts with custom separator
>      -G, --cgroup <name>   monitor event in cgroup name only
>      -o, --output <file>   output file name
>          --append          append to the output file
>          --log-fd <n>      log output to fd, instead of stderr
>          --pre <command>   command to run prior to the measured command
>          --post <command>  command to run after to the measured command
>      -I, --interval-print <n>
>                            print counts at regular interval in ms (>= 100)
>          --per-socket      aggregate counts per processor socket
>          --per-core        aggregate counts per physical processor core
>
>
> On 12/11/2013 19:06, Loic Dachary wrote:
>> Hi Andreas,
>>
>> On 12/11/2013 02:11, Andreas Joachim Peters wrote:
>>> Hi Loic,
>>>
>>> I am finally doing the benchmark tool and I found a bunch of wrong parameter checks which can make the whole thing SEGV.
>>>
>>> All the RAID-6 codes have restrictions on the parameters but they are not correctly enforced for Liberation & Blaum-Roth codes in the CEPH wrapper class ... see text from PDF
>>>
>>> "Minimal Density RAID-6 codes are MDS codes based on binary matrices which satisfy a lower-bound on the number  of non-zero entries. Unlike Cauchy coding, the bit-matrix elements do not correspond to elements in GF (2 w ). Instead, the bit-matrix itself has the proper MDS property. Minimal Density RAID-6 codes perform faster than Reed-Solomon and Cauchy Reed-Solomon codes for the same parameters. Liberation coding, Liber8tion coding, and Blaum-Roth coding are three examples of this kind of coding that are supported in jerasure.
>>>
>>> With each of these codes, m must be equal to two and k must be less than or equal to w. The value of w has restrictions based on the code:
>>>
>>> • With Liberation coding, w must be a prime number [Pla08b].
>>> • With Blaum-Roth coding, w + 1 must be a prime number [BR99]. • With Liber8tion coding, w must equal 8 [Pla08a].
>>>
>>> ...
>>>
>>> Will you add these fixes?
>>
>> Nice catch. I created and assigned to myself : http://tracker.ceph.com/issues/6754
>>>
>>> For the benchmark suite: it currently runs 308 different configurations of the 2 algorithms which make sense from a performance point of view, and produces this output:
>>>
>>>
>>> # -----------------------------------------------------------------
>>> # Erasure Coding Benchmark - (C) CERN 2013 - Andreas.Joachim.Peters@cern.ch
>>> # Ram-Size=12614856704 Allocation-Size=100000000
>>> # -----------------------------------------------------------------
>>> # [ -BENCH- ] [       ] technique=memcpy                                                            speed=5.408 [GB/s] latency=9.245 ms
>>> # [ -BENCH- ] [       ] technique=d=a^b^c-xor                                                       speed=4.377 [GB/s] latency=17.136 ms
>>> # [ -BENCH- ] [001/304] technique=cauchy_good:k=05:m=2:w=8:lp=0:packet=00064:size=50000000          speed=1.308 [GB/s] latency=038	[ms] size-overhead=40	[%]
>>> ..
>>> ..
>>> # [ -BENCH- ] [304/304] technique=liberation:k=24:m=2:w=29:lp=2:packet=65536:size=50000000          speed=0.083 [GB/s] latency=604	[ms] size-overhead=16	[%]
>>> # -----------------------------------------------------------------
>>> # Erasure Code Performance Summary::
>>> # -----------------------------------------------------------------
>>> # RAM:                   12.61 GB
>>> # Allocation-Size        0.10 GB
>>> # -----------------------------------------------------------------
>>> # Byte Initialization:   29.35 MB/s
>>> # Memcpy:                5.41 GB/s
>>> # Triple-XOR:            4.38 GB/s
>>> # -----------------------------------------------------------------
>>> # Fastest RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
>>> # Fastest Triple Failure 0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
>>> # Fastest Quadr. Failure 0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
>>> # -----------------------------------------------------------------
>>> # .................................................................
>>> # Top 1  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
>>> # Top 2  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=16384:size=50000000
>>> # Top 3  RAID6          2.64 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=65536:size=50000000
>>> # Top 4  RAID6          2.60 GB/s liberation:k=07:m=2:w=7:lp=0:packet=16384:size=50000000
>>> # Top 5  RAID6          2.59 GB/s liberation:k=05:m=2:w=7:lp=0:packet=04096:size=50000000
>>> # .................................................................
>>> # Top 1  Triple         0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
>>> # Top 2  Triple         0.94 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=16384:size=50000000
>>> # Top 3  Triple         0.93 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=65536:size=50000000
>>> # Top 4  Triple         0.89 GB/s cauchy_good:k=07:m=3:w=8:lp=0:packet=04096:size=50000000
>>> # Top 5  Triple         0.87 GB/s cauchy_good:k=05:m=3:w=8:lp=0:packet=04096:size=50000000
>>> # .................................................................
>>> # Top 1  Quadr.         0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
>>> # Top 2  Quadr.         0.65 GB/s cauchy_good:k=07:m=4:w=8:lp=0:packet=04096:size=50000000
>>> # Top 3  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=16384:size=50000000
>>> # Top 4  Quadr.         0.64 GB/s cauchy_good:k=05:m=4:w=8:lp=0:packet=04096:size=50000000
>>> # Top 5  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=65536:size=50000000
>>> # .................................................................
>>>
>>> It takes around 30 seconds on my box.
>>
>>
>> That looks great :-) If I understand correctly, it means https://github.com/ceph/ceph/pull/740 will no longer have benchmarks as they are moved to a separate program. Correct ?
>>
>>> I will add a measurement how the XOR and the 3 top algorithms scale with the number of cores and make the object-size configurable from the command line. Anything else ?
>>
>> It would be convenient to run this from a "workunit" (i.e. a script in ceph/qa/workunits/) so that it can later be run by teuthology integration tests. That could be used to show regressions.
>>
>> Shall I add the possibility to test a single user-specified configuration via command line arguments?
>>>
>> I would need to play with it to comment usefully.
>>
>> Cheers
>>
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-11-12 18:06               ` Loic Dachary
  2013-11-19 11:35                 ` Andreas Joachim Peters
@ 2013-12-09 16:45                 ` Loic Dachary
  2013-12-09 17:03                   ` Mark Nelson
  2013-12-10  8:43                   ` Loic Dachary
  1 sibling, 2 replies; 52+ messages in thread
From: Loic Dachary @ 2013-12-09 16:45 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 8208 bytes --]

Hi,

Mark Nelson suggested we use perf (linux-tools) for benchmarking. It looks like something that would indeed help: the benchmark program would only concern itself with doing some work according to the options, and let performance be collected from the outside using tools that are familiar to people doing benchmarking.

What do you think ?

Cheers

$ perf stat -e
  Error: switch `e' requires a value

 usage: perf stat [<options>] [<command>]

    -e, --event <event>   event selector. use 'perf list' to list available events
        --filter <filter>
                          event filter
    -i, --no-inherit      child tasks do not inherit counters
    -p, --pid <pid>       stat events on existing process id
    -t, --tid <tid>       stat events on existing thread id
    -a, --all-cpus        system-wide collection from all CPUs
    -g, --group           put the counters into a counter group
    -c, --scale           scale/normalize counters
    -v, --verbose         be more verbose (show counter open errors, etc)
    -r, --repeat <n>      repeat command and print average + stddev (max: 100, forever: 0)
    -n, --null            null run - dont start any counters
    -d, --detailed        detailed run - start a lot of events
    -S, --sync            call sync() before starting a run
    -B, --big-num         print large numbers with thousands' separators
    -C, --cpu <cpu>       list of cpus to monitor in system-wide
    -A, --no-aggr         disable CPU count aggregation
    -x, --field-separator <separator>
                          print counts with custom separator
    -G, --cgroup <name>   monitor event in cgroup name only
    -o, --output <file>   output file name
        --append          append to the output file
        --log-fd <n>      log output to fd, instead of stderr
        --pre <command>   command to run prior to the measured command
        --post <command>  command to run after to the measured command
    -I, --interval-print <n>
                          print counts at regular interval in ms (>= 100)
        --per-socket      aggregate counts per processor socket
        --per-core        aggregate counts per physical processor core


On 12/11/2013 19:06, Loic Dachary wrote:
> Hi Andreas,
> 
> On 12/11/2013 02:11, Andreas Joachim Peters wrote:
>> Hi Loic,
>>
>> I am finally doing the benchmark tool and I found a bunch of wrong parameter checks which can make the whole thing SEGV.
>>
>> All the RAID-6 codes have restrictions on the parameters but they are not correctly enforced for Liberation & Blaum-Roth codes in the CEPH wrapper class ... see text from PDF
>>
>> "Minimal Density RAID-6 codes are MDS codes based on binary matrices which satisfy a lower-bound on the number  of non-zero entries. Unlike Cauchy coding, the bit-matrix elements do not correspond to elements in GF (2 w ). Instead, the bit-matrix itself has the proper MDS property. Minimal Density RAID-6 codes perform faster than Reed-Solomon and Cauchy Reed-Solomon codes for the same parameters. Liberation coding, Liber8tion coding, and Blaum-Roth coding are three examples of this kind of coding that are supported in jerasure.
>>
>> With each of these codes, m must be equal to two and k must be less than or equal to w. The value of w has restrictions based on the code:
>>
>> • With Liberation coding, w must be a prime number [Pla08b].
>> • With Blaum-Roth coding, w + 1 must be a prime number [BR99]. • With Liber8tion coding, w must equal 8 [Pla08a].
>>
>> ...
>>
>> Will you add these fixes?
> 
> Nice catch. I created and assigned to myself : http://tracker.ceph.com/issues/6754
>>
>> For the benchmark suite: it currently runs 308 different configurations of the 2 algorithms which make sense from a performance point of view, and produces this output:
>>
>>
>> # -----------------------------------------------------------------
>> # Erasure Coding Benchmark - (C) CERN 2013 - Andreas.Joachim.Peters@cern.ch
>> # Ram-Size=12614856704 Allocation-Size=100000000
>> # -----------------------------------------------------------------
>> # [ -BENCH- ] [       ] technique=memcpy                                                            speed=5.408 [GB/s] latency=9.245 ms
>> # [ -BENCH- ] [       ] technique=d=a^b^c-xor                                                       speed=4.377 [GB/s] latency=17.136 ms
>> # [ -BENCH- ] [001/304] technique=cauchy_good:k=05:m=2:w=8:lp=0:packet=00064:size=50000000          speed=1.308 [GB/s] latency=038	[ms] size-overhead=40	[%]
>> ..
>> ..
>> # [ -BENCH- ] [304/304] technique=liberation:k=24:m=2:w=29:lp=2:packet=65536:size=50000000          speed=0.083 [GB/s] latency=604	[ms] size-overhead=16	[%]
>> # -----------------------------------------------------------------
>> # Erasure Code Performance Summary::
>> # -----------------------------------------------------------------
>> # RAM:                   12.61 GB
>> # Allocation-Size        0.10 GB
>> # -----------------------------------------------------------------
>> # Byte Initialization:   29.35 MB/s
>> # Memcpy:                5.41 GB/s
>> # Triple-XOR:            4.38 GB/s
>> # -----------------------------------------------------------------
>> # Fastest RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
>> # Fastest Triple Failure 0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
>> # Fastest Quadr. Failure 0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
>> # -----------------------------------------------------------------
>> # .................................................................
>> # Top 1  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
>> # Top 2  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=16384:size=50000000
>> # Top 3  RAID6          2.64 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=65536:size=50000000
>> # Top 4  RAID6          2.60 GB/s liberation:k=07:m=2:w=7:lp=0:packet=16384:size=50000000
>> # Top 5  RAID6          2.59 GB/s liberation:k=05:m=2:w=7:lp=0:packet=04096:size=50000000
>> # .................................................................
>> # Top 1  Triple         0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
>> # Top 2  Triple         0.94 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=16384:size=50000000
>> # Top 3  Triple         0.93 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=65536:size=50000000
>> # Top 4  Triple         0.89 GB/s cauchy_good:k=07:m=3:w=8:lp=0:packet=04096:size=50000000
>> # Top 5  Triple         0.87 GB/s cauchy_good:k=05:m=3:w=8:lp=0:packet=04096:size=50000000
>> # .................................................................
>> # Top 1  Quadr.         0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
>> # Top 2  Quadr.         0.65 GB/s cauchy_good:k=07:m=4:w=8:lp=0:packet=04096:size=50000000
>> # Top 3  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=16384:size=50000000
>> # Top 4  Quadr.         0.64 GB/s cauchy_good:k=05:m=4:w=8:lp=0:packet=04096:size=50000000
>> # Top 5  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=65536:size=50000000
>> # .................................................................
>>
>> It takes around 30 seconds on my box.
> 
> 
> That looks great :-) If I understand correctly, it means https://github.com/ceph/ceph/pull/740 will no longer have benchmarks as they are moved to a separate program. Correct ?
> 
>> I will add a measurement of how the XOR and the 3 top algorithms scale with the number of cores and make the object-size configurable from the command line. Anything else?
> 
> It would be convenient to run this from a "workunit" ( i.e. a script in ceph/qa/workunits/ ) so that it can later be run by teuthology integration tests. That could be used to show regression.
> 
> Shall I add the possibility to test a single user-specified configuration via command line arguments?
>>
> I would need to play with it to comment usefully.
> 
> Cheers
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
  2013-11-12 18:06               ` Loic Dachary
@ 2013-11-19 11:35                 ` Andreas Joachim Peters
  2013-12-09 16:45                 ` Loic Dachary
  1 sibling, 0 replies; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-11-19 11:35 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Hi Loic et al, 
Dan pointed me to this:

http://sourceforge.net/p/snapraid/code/ci/master/tree/raid.c

It has a very straightforward API and a GPL license ...

The implementation seems more performant than the current Jerasure library, probably due to its use of SSSE3 extensions and slightly less flexibility ... maybe it is worth a plugin, or worth becoming "the" plugin? It also seems worthwhile to rewrite the XOR function I use with the SSE2 assembler XOR ...

It also comes with a nice benchmark tool; here are the results on my 'standard' Xeon for a 4 MB block with 8 data disks + parity disks:

./snapraid -T
snapraid v5.0 by Andrea Mazzoleni, http://snapraid.sourceforge.net
Compiler gcc 4.8.1
CPU GenuineIntel, family 6, model 26, flags mmx sse2 ssse3 sse42
Memory is little-endian 64-bit
Support nanosecond timestamps with futimens()


Speed test using 8 buffers of 524288 bytes, for a total of 4096 KiB.
The reported value is the sustainable aggregate bandwidth of all data disks in MiB/s (not counting parity disks).

Memory write speed using the C memset() function:
  memset   15873

CRC used to check the content file integrity:
   table     857
   intel    6689

Hash used to check the data blocks integrity:
            best murmur3 spooky2
    hash spooky2    2987    6998

RAID functions used for computing the parity with 'sync':
            best    int8   int32   int64    sse2   sse2e   ssse3  ssse3e
    par1    sse2            6201   11080   19404
    par2   sse2e            1851    3462    9949   10359
    parz   sse2e            1134    2020    5157    5738
    par3  ssse3e     421                                    4766    5225
    par4  ssse3e     303                                    3449    3844
    par5  ssse3e     241                                    2750    2830
    par6  ssse3e     198                                    2189    2261

RAID functions used for recovering with 'fix':
            best    int8   ssse3
    rec1   ssse3     496    1029
    rec2   ssse3     208     477
    rec3   ssse3      51     261
    rec4   ssse3      33     170
    rec5   ssse3      22     112
    rec6   ssse3      16      86
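For reference, the `par1` row above is plain XOR parity over the data disks, the same operation as the triple-XOR baseline, and what an SSE2/SSSE3 rewrite would vectorize. A minimal, unvectorized sketch (illustrative only; snapraid's raid.c works on raw buffers with SIMD, and the function names here are hypothetical):

```python
def xor_parity(bufs):
    """Single-parity ('par1' style): byte-wise XOR of equal-sized buffers."""
    par = bytearray(len(bufs[0]))
    for buf in bufs:
        for i, b in enumerate(buf):
            par[i] ^= b
    return bytes(par)

def recover_one(survivors, parity):
    """Rebuild the single missing buffer: XOR of the parity and all survivors."""
    return xor_parity(list(survivors) + [parity])

# Four 8-byte "disks"; losing any one of them is recoverable from the rest.
disks = [bytes([d]) * 8 for d in (0x11, 0x22, 0x44, 0x88)]
par = xor_parity(disks)                        # 0x11 ^ 0x22 ^ 0x44 ^ 0x88 = 0xff
assert recover_one(disks[:2] + disks[3:], par) == disks[2]
```

Roughly speaking, the higher parity levels (par2 and up) replace the plain XOR with multiplications in GF(2^8), which is where the SSSE3 shuffle-based implementations in the table pay off.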



________________________________________
From: Loic Dachary [loic@dachary.org]
Sent: 12 November 2013 19:06
To: Andreas Joachim Peters
Cc: ceph-devel@vger.kernel.org
Subject: Re: CEPH Erasure Encoding + OSD Scalability

Hi Andreas,

On 12/11/2013 02:11, Andreas Joachim Peters wrote:
> Hi Loic,
>
> I am finally doing the benchmark tool and I found a bunch of wrong parameter checks which can make the whole thing SEGV.
>
> All the RAID-6 codes have restrictions on the parameters, but they are not correctly enforced for the Liberation & Blaum-Roth codes in the CEPH wrapper class ... see this text from the PDF:
>
> "Minimal Density RAID-6 codes are MDS codes based on binary matrices which satisfy a lower-bound on the number of non-zero entries. Unlike Cauchy coding, the bit-matrix elements do not correspond to elements in GF(2^w). Instead, the bit-matrix itself has the proper MDS property. Minimal Density RAID-6 codes perform faster than Reed-Solomon and Cauchy Reed-Solomon codes for the same parameters. Liberation coding, Liber8tion coding, and Blaum-Roth coding are three examples of this kind of coding that are supported in jerasure.
>
> With each of these codes, m must be equal to two and k must be less than or equal to w. The value of w has restrictions based on the code:
>
> • With Liberation coding, w must be a prime number [Pla08b].
> • With Blaum-Roth coding, w + 1 must be a prime number [BR99].
> • With Liber8tion coding, w must equal 8 [Pla08a].
>
> ...
>
> Will you add these fixes?

Nice catch. I created and assigned to myself : http://tracker.ceph.com/issues/6754
>
> For the benchmark suite, it currently runs 308 different configurations for the 2 algorithms which make sense from a performance point of view and provides this output:
>
>
> # -----------------------------------------------------------------
> # Erasure Coding Benchmark - (C) CERN 2013 - Andreas.Joachim.Peters@cern.ch
> # Ram-Size=12614856704 Allocation-Size=100000000
> # -----------------------------------------------------------------
> # [ -BENCH- ] [       ] technique=memcpy                                                            speed=5.408 [GB/s] latency=9.245 ms
> # [ -BENCH- ] [       ] technique=d=a^b^c-xor                                                       speed=4.377 [GB/s] latency=17.136 ms
> # [ -BENCH- ] [001/304] technique=cauchy_good:k=05:m=2:w=8:lp=0:packet=00064:size=50000000          speed=1.308 [GB/s] latency=038    [ms] size-overhead=40   [%]
> ..
> ..
> # [ -BENCH- ] [304/304] technique=liberation:k=24:m=2:w=29:lp=2:packet=65536:size=50000000          speed=0.083 [GB/s] latency=604    [ms] size-overhead=16   [%]
> # -----------------------------------------------------------------
> # Erasure Code Performance Summary::
> # -----------------------------------------------------------------
> # RAM:                   12.61 GB
> # Allocation-Size        0.10 GB
> # -----------------------------------------------------------------
> # Byte Initialization:   29.35 MB/s
> # Memcpy:                5.41 GB/s
> # Triple-XOR:            4.38 GB/s
> # -----------------------------------------------------------------
> # Fastest RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
> # Fastest Triple Failure 0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
> # Fastest Quadr. Failure 0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
> # -----------------------------------------------------------------
> # .................................................................
> # Top 1  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
> # Top 2  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=16384:size=50000000
> # Top 3  RAID6          2.64 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=65536:size=50000000
> # Top 4  RAID6          2.60 GB/s liberation:k=07:m=2:w=7:lp=0:packet=16384:size=50000000
> # Top 5  RAID6          2.59 GB/s liberation:k=05:m=2:w=7:lp=0:packet=04096:size=50000000
> # .................................................................
> # Top 1  Triple         0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
> # Top 2  Triple         0.94 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=16384:size=50000000
> # Top 3  Triple         0.93 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=65536:size=50000000
> # Top 4  Triple         0.89 GB/s cauchy_good:k=07:m=3:w=8:lp=0:packet=04096:size=50000000
> # Top 5  Triple         0.87 GB/s cauchy_good:k=05:m=3:w=8:lp=0:packet=04096:size=50000000
> # .................................................................
> # Top 1  Quadr.         0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
> # Top 2  Quadr.         0.65 GB/s cauchy_good:k=07:m=4:w=8:lp=0:packet=04096:size=50000000
> # Top 3  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=16384:size=50000000
> # Top 4  Quadr.         0.64 GB/s cauchy_good:k=05:m=4:w=8:lp=0:packet=04096:size=50000000
> # Top 5  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=65536:size=50000000
> # .................................................................
>
> It takes around 30 seconds on my box.


That looks great :-) If I understand correctly, it means https://github.com/ceph/ceph/pull/740 will no longer have benchmarks as they are moved to a separate program. Correct ?

> I will add a measurement of how the XOR and the 3 top algorithms scale with the number of cores and make the object-size configurable from the command line. Anything else?

It would be convenient to run this from a "workunit" ( i.e. a script in ceph/qa/workunits/ ) so that it can later be run by teuthology integration tests. That could be used to show regression.

Shall I add the possibility to test a single user-specified configuration via command line arguments?
>
I would need to play with it to comment usefully.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-11-12  1:11             ` Andreas Joachim Peters
@ 2013-11-12 18:06               ` Loic Dachary
  2013-11-19 11:35                 ` Andreas Joachim Peters
  2013-12-09 16:45                 ` Loic Dachary
  0 siblings, 2 replies; 52+ messages in thread
From: Loic Dachary @ 2013-11-12 18:06 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

Hi Andreas,

On 12/11/2013 02:11, Andreas Joachim Peters wrote:
> Hi Loic,
> 
> I am finally doing the benchmark tool and I found a bunch of wrong parameter checks which can make the whole thing SEGV.
> 
> All the RAID-6 codes have restrictions on the parameters, but they are not correctly enforced for the Liberation & Blaum-Roth codes in the CEPH wrapper class ... see this text from the PDF:
> 
> "Minimal Density RAID-6 codes are MDS codes based on binary matrices which satisfy a lower-bound on the number of non-zero entries. Unlike Cauchy coding, the bit-matrix elements do not correspond to elements in GF(2^w). Instead, the bit-matrix itself has the proper MDS property. Minimal Density RAID-6 codes perform faster than Reed-Solomon and Cauchy Reed-Solomon codes for the same parameters. Liberation coding, Liber8tion coding, and Blaum-Roth coding are three examples of this kind of coding that are supported in jerasure.
> 
> With each of these codes, m must be equal to two and k must be less than or equal to w. The value of w has restrictions based on the code:
> 
> • With Liberation coding, w must be a prime number [Pla08b].
> • With Blaum-Roth coding, w + 1 must be a prime number [BR99].
> • With Liber8tion coding, w must equal 8 [Pla08a].
> 
> ...
> 
> Will you add these fixes?

Nice catch. I created and assigned to myself : http://tracker.ceph.com/issues/6754
> 
> For the benchmark suite, it currently runs 308 different configurations for the 2 algorithms which make sense from a performance point of view and provides this output:
> 
> 
> # -----------------------------------------------------------------
> # Erasure Coding Benchmark - (C) CERN 2013 - Andreas.Joachim.Peters@cern.ch
> # Ram-Size=12614856704 Allocation-Size=100000000
> # -----------------------------------------------------------------
> # [ -BENCH- ] [       ] technique=memcpy                                                            speed=5.408 [GB/s] latency=9.245 ms
> # [ -BENCH- ] [       ] technique=d=a^b^c-xor                                                       speed=4.377 [GB/s] latency=17.136 ms
> # [ -BENCH- ] [001/304] technique=cauchy_good:k=05:m=2:w=8:lp=0:packet=00064:size=50000000          speed=1.308 [GB/s] latency=038	[ms] size-overhead=40	[%]
> ..
> ..
> # [ -BENCH- ] [304/304] technique=liberation:k=24:m=2:w=29:lp=2:packet=65536:size=50000000          speed=0.083 [GB/s] latency=604	[ms] size-overhead=16	[%]
> # -----------------------------------------------------------------
> # Erasure Code Performance Summary::
> # -----------------------------------------------------------------
> # RAM:                   12.61 GB
> # Allocation-Size        0.10 GB
> # -----------------------------------------------------------------
> # Byte Initialization:   29.35 MB/s
> # Memcpy:                5.41 GB/s
> # Triple-XOR:            4.38 GB/s
> # -----------------------------------------------------------------
> # Fastest RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
> # Fastest Triple Failure 0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
> # Fastest Quadr. Failure 0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
> # -----------------------------------------------------------------
> # .................................................................
> # Top 1  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
> # Top 2  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=16384:size=50000000
> # Top 3  RAID6          2.64 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=65536:size=50000000
> # Top 4  RAID6          2.60 GB/s liberation:k=07:m=2:w=7:lp=0:packet=16384:size=50000000
> # Top 5  RAID6          2.59 GB/s liberation:k=05:m=2:w=7:lp=0:packet=04096:size=50000000
> # .................................................................
> # Top 1  Triple         0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
> # Top 2  Triple         0.94 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=16384:size=50000000
> # Top 3  Triple         0.93 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=65536:size=50000000
> # Top 4  Triple         0.89 GB/s cauchy_good:k=07:m=3:w=8:lp=0:packet=04096:size=50000000
> # Top 5  Triple         0.87 GB/s cauchy_good:k=05:m=3:w=8:lp=0:packet=04096:size=50000000
> # .................................................................
> # Top 1  Quadr.         0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
> # Top 2  Quadr.         0.65 GB/s cauchy_good:k=07:m=4:w=8:lp=0:packet=04096:size=50000000
> # Top 3  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=16384:size=50000000
> # Top 4  Quadr.         0.64 GB/s cauchy_good:k=05:m=4:w=8:lp=0:packet=04096:size=50000000
> # Top 5  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=65536:size=50000000
> # .................................................................
> 
> It takes around 30 seconds on my box.


That looks great :-) If I understand correctly, it means https://github.com/ceph/ceph/pull/740 will no longer have benchmarks as they are moved to a separate program. Correct ?

> I will add a measurement of how the XOR and the 3 top algorithms scale with the number of cores and make the object-size configurable from the command line. Anything else?

It would be convenient to run this from a "workunit" ( i.e. a script in ceph/qa/workunits/ ) so that it can later be run by teuthology integration tests. That could be used to show regression.

Shall I add the possibility to test a single user-specified configuration via command line arguments?
> 
I would need to play with it to comment usefully.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
  2013-09-22  7:26           ` Andreas Joachim Peters
  2013-09-22  9:41             ` Loic Dachary
@ 2013-11-12  1:11             ` Andreas Joachim Peters
  2013-11-12 18:06               ` Loic Dachary
  1 sibling, 1 reply; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-11-12  1:11 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Hi Loic, 

I am finally doing the benchmark tool and I found a bunch of wrong parameter checks which can make the whole thing SEGV.

All the RAID-6 codes have restrictions on the parameters, but they are not correctly enforced for the Liberation & Blaum-Roth codes in the CEPH wrapper class ... see this text from the PDF:

"Minimal Density RAID-6 codes are MDS codes based on binary matrices which satisfy a lower-bound on the number of non-zero entries. Unlike Cauchy coding, the bit-matrix elements do not correspond to elements in GF(2^w). Instead, the bit-matrix itself has the proper MDS property. Minimal Density RAID-6 codes perform faster than Reed-Solomon and Cauchy Reed-Solomon codes for the same parameters. Liberation coding, Liber8tion coding, and Blaum-Roth coding are three examples of this kind of coding that are supported in jerasure.

With each of these codes, m must be equal to two and k must be less than or equal to w. The value of w has restrictions based on the code:

• With Liberation coding, w must be a prime number [Pla08b].
• With Blaum-Roth coding, w + 1 must be a prime number [BR99].
• With Liber8tion coding, w must equal 8 [Pla08a].

...
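The restrictions quoted above translate into a small validity check. A sketch only (hypothetical helper, not the actual Ceph wrapper API; technique names follow the benchmark output below):

```python
def is_prime(n):
    """Trial-division primality test; fine for the small w values used here."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def check_minimal_density(technique, k, m, w):
    """True if (k, m, w) is a valid parameter set for the given code."""
    if m != 2 or k > w:            # common to all Minimal Density RAID-6 codes
        return False
    if technique == "liberation":
        return is_prime(w)         # w must be prime
    if technique == "blaum_roth":
        return is_prime(w + 1)     # w + 1 must be prime
    if technique == "liber8tion":
        return w == 8              # w must equal 8
    raise ValueError("unknown technique: %s" % technique)
```

For instance, the liberation configuration k=24, w=29 from the benchmark output passes (29 is prime and 24 <= 29), while liberation with w=8 would be rejected.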

Will you add these fixes?

For the benchmark suite, it currently runs 308 different configurations for the 2 algorithms which make sense from a performance point of view and provides this output:


# -----------------------------------------------------------------
# Erasure Coding Benchmark - (C) CERN 2013 - Andreas.Joachim.Peters@cern.ch
# Ram-Size=12614856704 Allocation-Size=100000000
# -----------------------------------------------------------------
# [ -BENCH- ] [       ] technique=memcpy                                                            speed=5.408 [GB/s] latency=9.245 ms
# [ -BENCH- ] [       ] technique=d=a^b^c-xor                                                       speed=4.377 [GB/s] latency=17.136 ms
# [ -BENCH- ] [001/304] technique=cauchy_good:k=05:m=2:w=8:lp=0:packet=00064:size=50000000          speed=1.308 [GB/s] latency=038	[ms] size-overhead=40	[%]
..
..
# [ -BENCH- ] [304/304] technique=liberation:k=24:m=2:w=29:lp=2:packet=65536:size=50000000          speed=0.083 [GB/s] latency=604	[ms] size-overhead=16	[%]
# -----------------------------------------------------------------
# Erasure Code Performance Summary::
# -----------------------------------------------------------------
# RAM:                   12.61 GB
# Allocation-Size        0.10 GB
# -----------------------------------------------------------------
# Byte Initialization:   29.35 MB/s
# Memcpy:                5.41 GB/s
# Triple-XOR:            4.38 GB/s
# -----------------------------------------------------------------
# Fastest RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
# Fastest Triple Failure 0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
# Fastest Quadr. Failure 0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
# -----------------------------------------------------------------
# .................................................................
# Top 1  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
# Top 2  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=16384:size=50000000
# Top 3  RAID6          2.64 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=65536:size=50000000
# Top 4  RAID6          2.60 GB/s liberation:k=07:m=2:w=7:lp=0:packet=16384:size=50000000
# Top 5  RAID6          2.59 GB/s liberation:k=05:m=2:w=7:lp=0:packet=04096:size=50000000
# .................................................................
# Top 1  Triple         0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
# Top 2  Triple         0.94 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=16384:size=50000000
# Top 3  Triple         0.93 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=65536:size=50000000
# Top 4  Triple         0.89 GB/s cauchy_good:k=07:m=3:w=8:lp=0:packet=04096:size=50000000
# Top 5  Triple         0.87 GB/s cauchy_good:k=05:m=3:w=8:lp=0:packet=04096:size=50000000
# .................................................................
# Top 1  Quadr.         0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
# Top 2  Quadr.         0.65 GB/s cauchy_good:k=07:m=4:w=8:lp=0:packet=04096:size=50000000
# Top 3  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=16384:size=50000000
# Top 4  Quadr.         0.64 GB/s cauchy_good:k=05:m=4:w=8:lp=0:packet=04096:size=50000000
# Top 5  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=65536:size=50000000
# .................................................................
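A side note on the output above: the size-overhead column is apparently just the parity-to-data chunk ratio, truncated to a whole percentage. This is an assumption inferred from the printed lines, e.g. k=5, m=2 gives 40% and k=24, m=2, lp=2 gives 16%:

```python
def size_overhead_pct(k, m, lp=0):
    """Parity volume (including local parities) as a truncated percentage
    of data volume -- a guess at how the benchmark computes its column."""
    return int((m + lp) * 100 / k)

assert size_overhead_pct(5, 2) == 40         # matches bench line 001/304
assert size_overhead_pct(24, 2, lp=2) == 16  # matches bench line 304/304
```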

It takes around 30 seconds on my box. I will add a measurement of how the XOR and the 3 top algorithms scale with the number of cores and make the object-size configurable from the command line. Anything else? Shall I add the possibility to test a single user-specified configuration via command line arguments?

Cheers Andreas.




--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-10-01 23:00                   ` Andreas Joachim Peters
  2013-10-02 10:04                     ` Loic Dachary
@ 2013-10-02 10:15                     ` Loic Dachary
  1 sibling, 0 replies; 52+ messages in thread
From: Loic Dachary @ 2013-10-02 10:15 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 9277 bytes --]

Hi Andreas,

You should include the copyright holder. If you are a CERN employee it will probably look like this:

Copyright (C) 2013 CERN <????@cern.ch>
Author: Andreas Joachim Peters <andreas.joachim.peters@cern.ch>

unless your contract specifies otherwise. If you are not an employee you should update to

Copyright (C) 2013 Andreas Joachim Peters <andreas.joachim.peters@cern.ch>

unless there is a contract (freelance or ...) that specifies otherwise.

Cheers

On 02/10/2013 01:00, Andreas Joachim Peters wrote:
> Hi Loic, 
> 
> here is the patch implementing the basic pyramid code adding local parity to erasure encoding. I tried to keep it 100% identical to the behaviour of the original version, except that I changed the alignment to 128-bit words. At least your unit tests work ;-)
> 
> https://github.com/apeters1971/ceph/commit/b2de7af1a49dc98940d5685eab00a339bf81a0e5
> 
> in src: 
> 
> make unittest_erasure_code_pyramid_jerasure
> 
> ./unittest_erasure_code_pyramid_jerasure --gtest_filter=*.* --log-to-stderr=true --object-size=64
> 
> It tests (8,2,2)
> 
> [ -TIMING- ] technique=cauchy_good      [           encode ] speed=1.840 [GB/s] latency=34.791 ms
> [ -TIMING- ] technique=cauchy_good      [        encode-lp ] speed=1.305 [GB/s] latency=49.057 ms
> [ -TIMING- ] technique=cauchy_good      [      encode-lp-3 ] speed=1.307 [GB/s] latency=48.956 ms
> [ -TIMING- ] technique=cauchy_good      [ encode-lp-crc32c ] speed=1.036 [GB/s] latency=61.752 ms
> [ -TIMING- ] technique=cauchy_good      [             reco ] speed=1.780 [GB/s] latency=35.959 ms
> [ -TIMING- ] technique=cauchy_good      [          reco-lp ] speed=4.348 [GB/s] latency=14.720 ms
> [ -TIMING- ] technique=cauchy_good      [        reco-lp-3 ] speed=1.256 [GB/s] latency=50.962 ms
> [ -TIMING- ] technique=cauchy_good      [   reco-lp-crc32c ] speed=2.300 [GB/s] latency=27.832 ms
> [ -TIMING- ] technique=liber8tion       [           encode ] speed=2.297 [GB/s] latency=27.865 ms
> [ -TIMING- ] technique=liber8tion       [        encode-lp ] speed=1.498 [GB/s] latency=42.731 ms
> [ -TIMING- ] technique=liber8tion       [      encode-lp-3 ] speed=1.505 [GB/s] latency=42.513 ms
> [ -TIMING- ] technique=liber8tion       [ encode-lp-crc32c ] speed=1.142 [GB/s] latency=56.018 ms
> [ -TIMING- ] technique=liber8tion       [             reco ] speed=2.238 [GB/s] latency=28.601 ms
> [ -TIMING- ] technique=liber8tion       [          reco-lp ] speed=4.399 [GB/s] latency=14.550 ms
> [ -TIMING- ] technique=liber8tion       [        reco-lp-3 ] speed=1.878 [GB/s] latency=34.070 ms
> [ -TIMING- ] technique=liber8tion       [   reco-lp-crc32c ] speed=2.307 [GB/s] latency=27.737 ms
> 
> Cheers Andreas.
> 
> 
> ________________________________________
> From: Loic Dachary [loic@dachary.org]
> Sent: 27 September 2013 11:40
> To: Andreas Joachim Peters
> Cc: Ceph Development
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
> 
> On 26/09/2013 23:49, Andreas Joachim Peters wrote:> Sure,
>> this text is clear, but it does not talk about the cost of reconstruction: selecting a parity chunk rather than a data chunk costs CPU and increases latency, but this is not reflected by the external cost parameter. E.g. if you have RS(3,2), 3 data and 2 parity chunks [0,1,2,3,4] with equal cost values, I would select [0,1,2] since it avoids computation; the retrieval cost for [2,3,4] would be the same, but the computational cost is higher.
> 
> The implementation knows about the computational cost already and is able to figure out that [0,1,2] is going to be cheaper. It does not need input from the caller and the minimum_to_decode method (without the cost)
> https://github.com/ceph/ceph/blob/master/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L45
> does this. If you want to read [0,1,2] and have [0,1,2,3,4] available it will return that you need to retrieve [0,1,2] and not [2,3,4], although both would allow recovering the content of [0,1,2].
> 
>>
>> Now if [0] has, for example, double the cost of chunk [3], it is not clear to me whether [1,2,3] is a better set than [0,1,2] ... is the meaning of a higher cost actually more of a binary flag saying 'avoid reading this chunk if possible'?
>>
>> Could you give a practical example when a chunk can have a higher cost in a CEPH setup and a rough range for the 'cost' parameter?
> 
> At the moment I can't because it depends on the implementation of the erasure code placement group and it's not complete yet. You are correct : the interpretation of the cost by the plugin cannot be fully described without an intimate knowledge of the implementation. It also means that if the implementation of the caller changes, the semantic of the cost will change an may require a different strategy.
> 
> Cheers
> 
>> Thanks Andreas.
>>
>>
>>
>>
>> ________________________________________
>> From: Loic Dachary [loic@dachary.org]
>> Sent: 26 September 2013 21:18
>> To: Andreas Joachim Peters
>> Cc: Ceph Development
>> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>>
>> [re-adding ceph-devel to the cc]
>>
>> On 26/09/2013 20:36, Andreas-Joachim Peters wrote:> Hi Loic,
>>> today I forked he CEPH repository and will commit my changes to my GitHub fork asap ... (I am not familiar with GitHub in particular).
>>> I was finalizing the minimum_to_decode function today with test cases (it is more sophisticated in this case ...) ... I didn't fully get what the 'with cost' function is supposed to do differently from the one without cost?
>>
>> I'd be happy to explain if
>> https://github.com/ceph/ceph/blob/master/src/osd/ErasureCodeInterface.h#L131
>> is unclear. Would you be so kind as to tell me what is confusing in the description ?
>>
>>>
>>>
>>> Cheers Andreas.
>>>
>>> On Wed, Sep 25, 2013 at 8:48 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>>
>>>
>>>
>>>     On 25/09/2013 20:33, Andreas Joachim Peters wrote:> Yes, sure. I actually thought the same in the meanwhile ...  I have some questions:
>>>     >
>>>     > Q: Can/should it stay in the framework of google test's or you would prefer just a plain executable ?
>>>     >
>>>
>>>     A plain executable would make sense. An simple example from src/test/Makefile.am :
>>>
>>>     ceph_test_trans_SOURCES = test/test_trans.cc
>>>     ceph_test_trans_LDADD = $(LIBOS) $(CEPH_GLOBAL)
>>>     bin_DEBUGPROGRAMS += ceph_test_trans
>>>
>>>
>>>     > I have added local parity support to your erasure class adding a new argument: "erasure-code-lp" and
>>>     > two new methods:
>>>     >
>>>     > localparity_encode(...)
>>>     > localparity_decode(...)
>>>     >
>>>     > I made a more complex benchmark of (8,2) + 2 local parities (1^2^3^4, 5^6^7^8) which benchmarks performance of encoding/decoding as speed & effective write-latency for three cases (each for liberation & cauchy_good codecs):
>>>     >
>>>     > 1 (8,2)
>>>     > 2 (8,2,lp=2)
>>>     > 3 (8,2,lp=2) + crc32c (blocks)
>>>     >
>>>     > and several failure scenarios ... single, double, triple disk failures. Probably the best is if I make all these parameters configurable.
>>>
>>>     Great :-) Do you have a public git repository where I could clone this & give it a try ?
>>>
>>>     > Q: For the local parity implementation .... shall I inherit from your erasure plugin and overwrite the encode/decode method or you would consider a patch to the original class?
>>>
>>>     It is a perfect timing for a patch to the original class.
>>>
>>>     > I have also a 128-bit XOR implementation for the local parities. This will work with new gcc's & clang compilers ...
>>>     >
>>>     > Q: Which compilers/platforms are supported by CEPH? Is there a minimal GCC version?
>>>
>>>     You can see all supported platforms here:
>>>
>>>     http://ceph.com/gitbuilder.cgi
>>>
>>>     I don't think the GCC version shows in the logs but you can probably figure it out from the corresponding distribution.
>>>
>>>     > Q: is there some policy restricting comments within code? In general I see very few or no comments within the code ..
>>>
>>>     :-) The mon code tends to be more heavily commented than the osd code (IMO) but I'm not aware of any policy. When I feel the need to comment, I write a unit test. If the unit test is difficult, I tend to comment to clarify its purpose. The problem with comments is that they quickly become obsolete and/or misleading. That being said, I don't think anyone will object if you heavily comment your code.
>>>
>>>     Cheers
>>>
>>>     > Cheers Andreas.
>>>     >
>>>     >
>>>     >
>>>     >
>>>
>>>     --
>>>     Loïc Dachary, Artisan Logiciel Libre
>>>     All that is necessary for the triumph of evil is that good people do nothing.
>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do nothing.
>>
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-10-01 23:00                   ` Andreas Joachim Peters
@ 2013-10-02 10:04                     ` Loic Dachary
  2013-10-02 10:15                     ` Loic Dachary
  1 sibling, 0 replies; 52+ messages in thread
From: Loic Dachary @ 2013-10-02 10:04 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 8870 bytes --]

Cool :-) Could you create a pull request so I can review it ?

Cheers

On 02/10/2013 01:00, Andreas Joachim Peters wrote:
> Hi Loic, 
> 
> here is the patch implementing the basic pyramid code, adding local parity to erasure encoding. I tried to keep it 100% identical to the behaviour of the original version, except that I changed the alignment to 128-bit words. At least your unit tests work ;-)
> 
> https://github.com/apeters1971/ceph/commit/b2de7af1a49dc98940d5685eab00a339bf81a0e5
> 
> in src: 
> 
> make unittest_erasure_code_pyramid_jerasure
> 
> ./unittest_erasure_code_pyramid_jerasure --gtest_filter=*.* --log-to-stderr=true --object-size=64
> 
> It tests (8,2,2)
> 
> [ -TIMING- ] technique=cauchy_good      [           encode ] speed=1.840 [GB/s] latency=34.791 ms
> [ -TIMING- ] technique=cauchy_good      [        encode-lp ] speed=1.305 [GB/s] latency=49.057 ms
> [ -TIMING- ] technique=cauchy_good      [      encode-lp-3 ] speed=1.307 [GB/s] latency=48.956 ms
> [ -TIMING- ] technique=cauchy_good      [ encode-lp-crc32c ] speed=1.036 [GB/s] latency=61.752 ms
> [ -TIMING- ] technique=cauchy_good      [             reco ] speed=1.780 [GB/s] latency=35.959 ms
> [ -TIMING- ] technique=cauchy_good      [          reco-lp ] speed=4.348 [GB/s] latency=14.720 ms
> [ -TIMING- ] technique=cauchy_good      [        reco-lp-3 ] speed=1.256 [GB/s] latency=50.962 ms
> [ -TIMING- ] technique=cauchy_good      [   reco-lp-crc32c ] speed=2.300 [GB/s] latency=27.832 ms
> [ -TIMING- ] technique=liber8tion       [           encode ] speed=2.297 [GB/s] latency=27.865 ms
> [ -TIMING- ] technique=liber8tion       [        encode-lp ] speed=1.498 [GB/s] latency=42.731 ms
> [ -TIMING- ] technique=liber8tion       [      encode-lp-3 ] speed=1.505 [GB/s] latency=42.513 ms
> [ -TIMING- ] technique=liber8tion       [ encode-lp-crc32c ] speed=1.142 [GB/s] latency=56.018 ms
> [ -TIMING- ] technique=liber8tion       [             reco ] speed=2.238 [GB/s] latency=28.601 ms
> [ -TIMING- ] technique=liber8tion       [          reco-lp ] speed=4.399 [GB/s] latency=14.550 ms
> [ -TIMING- ] technique=liber8tion       [        reco-lp-3 ] speed=1.878 [GB/s] latency=34.070 ms
> [ -TIMING- ] technique=liber8tion       [   reco-lp-crc32c ] speed=2.307 [GB/s] latency=27.737 ms
> 
> Cheers Andreas.
> 
> 
> ________________________________________
> From: Loic Dachary [loic@dachary.org]
> Sent: 27 September 2013 11:40
> To: Andreas Joachim Peters
> Cc: Ceph Development
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
> 
> On 26/09/2013 23:49, Andreas Joachim Peters wrote:> Sure,
>> this text is clear, but it does not talk about the cost of reconstruction: selecting a parity chunk instead of a data chunk costs CPU and increases latency, yet this is not reflected by the external cost parameter. E.g. if you have RS(3,2), i.e. 3 data and 2 parity chunks [0,1,2,3,4] with equal cost values, I would select [0,1,2] since it avoids computation; the retrieval cost for [2,3,4] would be the same, but the computational cost is higher.
> 
> The implementation knows about the computational cost already and is able to figure out that [0,1,2] is going to be cheaper. It does not need input from the caller and the minimum_to_decode method (without the cost)
> https://github.com/ceph/ceph/blob/master/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L45
> does this. If you want to read [0,1,2] and have [0,1,2,3,4] available, it will return that you need to retrieve [0,1,2] and not [2,3,4], although both would allow recovering the content of [0,1,2].
> 
>>
>> Now if [0] has, for example, double the cost compared to chunk [3], it is not clear to me if [1,2,3] is a better set than [0,1,2] ... is the meaning of a higher cost actually more a binary flag saying 'avoid reading this chunk if possible'?
>>
>> Could you give a practical example when a chunk can have a higher cost in a CEPH setup and a rough range for the 'cost' parameter?
> 
> At the moment I can't because it depends on the implementation of the erasure code placement group and it's not complete yet. You are correct: the interpretation of the cost by the plugin cannot be fully described without an intimate knowledge of the implementation. It also means that if the implementation of the caller changes, the semantics of the cost will change and may require a different strategy.
> 
> Cheers
> 
>> Thanks Andreas.
>>
>>
>>
>>
>> ________________________________________
>> From: Loic Dachary [loic@dachary.org]
>> Sent: 26 September 2013 21:18
>> To: Andreas Joachim Peters
>> Cc: Ceph Development
>> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>>
>> [re-adding ceph-devel to the cc]
>>
>> On 26/09/2013 20:36, Andreas-Joachim Peters wrote:> Hi Loic,
>> today I forked the CEPH repository and will commit my changes to my GitHub fork asap ... (I am not familiar with GitHub in particular).
>> I was finalizing the minimum_to_decode function today with test cases (it is more sophisticated in this case ...) ... I didn't fully get what the 'with cost' function is supposed to do differently from the one without cost?
>>
>> I'd be happy to explain if
>> https://github.com/ceph/ceph/blob/master/src/osd/ErasureCodeInterface.h#L131
>> is unclear. Would you be so kind as to tell me what is confusing in the description ?
>>
>>>
>>>
>>> Cheers Andreas.
>>>
>>> On Wed, Sep 25, 2013 at 8:48 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>>
>>>
>>>
>>>     On 25/09/2013 20:33, Andreas Joachim Peters wrote:> Yes, sure. I actually thought the same in the meanwhile ...  I have some questions:
>>>     >
>>>     > Q: Can/should it stay in the framework of google test's or you would prefer just a plain executable ?
>>>     >
>>>
>>>     A plain executable would make sense. A simple example from src/test/Makefile.am :
>>>
>>>     ceph_test_trans_SOURCES = test/test_trans.cc
>>>     ceph_test_trans_LDADD = $(LIBOS) $(CEPH_GLOBAL)
>>>     bin_DEBUGPROGRAMS += ceph_test_trans
>>>
>>>
>>>     > I have added local parity support to your erasure class adding a new argument: "erasure-code-lp" and
>>>     > two new methods:
>>>     >
>>>     > localparity_encode(...)
>>>     > localparity_decode(...)
>>>     >
>>>     > I made a more complex benchmark of (8,2) + 2 local parities (1^2^3^4, 5^6^7^8) which benchmarks performance of encoding/decoding as speed & effective write-latency for three cases (each for liberation & cauchy_good codecs):
>>>     >
>>>     > 1 (8,2)
>>>     > 2 (8,2,lp=2)
>>>     > 3 (8,2,lp=2) + crc32c (blocks)
>>>     >
>>>     > and several failure scenarios ... single, double, triple disk failures. Probably the best would be to make all these parameters configurable.
>>>
>>>     Great :-) Do you have a public git repository where I could clone this & give it a try ?
>>>
>>>     > Q: For the local parity implementation .... shall I inherit from your erasure plugin and overwrite the encode/decode method or you would consider a patch to the original class?
>>>
>>>     It is perfect timing for a patch to the original class.
>>>
>>>     > I have also a 128-bit XOR implementation for the local parities. This will work with new gcc's & clang compilers ...
>>>     >
>>>     > Q: Which compilers/platforms are supported by CEPH? Is there a minimal GCC version?
>>>
>>>     You can see all supported platforms here:
>>>
>>>     http://ceph.com/gitbuilder.cgi
>>>
>>>     I don't think the GCC version shows in the logs but you can probably figure it out from the corresponding distribution.
>>>
>>>     > Q: is there some policy restricting comments within code? In general I see very few or no comments within the code ..
>>>
>>>     :-) The mon code tends to be more heavily commented than the osd code (IMO) but I'm not aware of any policy. When I feel the need to comment, I write a unit test. If the unit test is difficult, I tend to comment to clarify its purpose. The problem with comments is that they quickly become obsolete and/or misleading. That being said, I don't think anyone will object if you heavily comment your code.
>>>
>>>     Cheers
>>>
>>>     > Cheers Andreas.
>>>     >
>>>     >
>>>     >
>>>     >
>>>
>>>     --
>>>     Loïc Dachary, Artisan Logiciel Libre
>>>     All that is necessary for the triumph of evil is that good people do nothing.
>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do nothing.
>>
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
  2013-09-27  9:40                 ` Loic Dachary
@ 2013-10-01 23:00                   ` Andreas Joachim Peters
  2013-10-02 10:04                     ` Loic Dachary
  2013-10-02 10:15                     ` Loic Dachary
  0 siblings, 2 replies; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-10-01 23:00 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

Hi Loic, 

here is the patch implementing the basic pyramid code, adding local parity to erasure encoding. I tried to keep it 100% identical to the behaviour of the original version, except that I changed the alignment to 128-bit words. At least your unit tests work ;-)

https://github.com/apeters1971/ceph/commit/b2de7af1a49dc98940d5685eab00a339bf81a0e5

in src: 

make unittest_erasure_code_pyramid_jerasure

./unittest_erasure_code_pyramid_jerasure --gtest_filter=*.* --log-to-stderr=true --object-size=64

It tests (8,2,2)

[ -TIMING- ] technique=cauchy_good      [           encode ] speed=1.840 [GB/s] latency=34.791 ms
[ -TIMING- ] technique=cauchy_good      [        encode-lp ] speed=1.305 [GB/s] latency=49.057 ms
[ -TIMING- ] technique=cauchy_good      [      encode-lp-3 ] speed=1.307 [GB/s] latency=48.956 ms
[ -TIMING- ] technique=cauchy_good      [ encode-lp-crc32c ] speed=1.036 [GB/s] latency=61.752 ms
[ -TIMING- ] technique=cauchy_good      [             reco ] speed=1.780 [GB/s] latency=35.959 ms
[ -TIMING- ] technique=cauchy_good      [          reco-lp ] speed=4.348 [GB/s] latency=14.720 ms
[ -TIMING- ] technique=cauchy_good      [        reco-lp-3 ] speed=1.256 [GB/s] latency=50.962 ms
[ -TIMING- ] technique=cauchy_good      [   reco-lp-crc32c ] speed=2.300 [GB/s] latency=27.832 ms
[ -TIMING- ] technique=liber8tion       [           encode ] speed=2.297 [GB/s] latency=27.865 ms
[ -TIMING- ] technique=liber8tion       [        encode-lp ] speed=1.498 [GB/s] latency=42.731 ms
[ -TIMING- ] technique=liber8tion       [      encode-lp-3 ] speed=1.505 [GB/s] latency=42.513 ms
[ -TIMING- ] technique=liber8tion       [ encode-lp-crc32c ] speed=1.142 [GB/s] latency=56.018 ms
[ -TIMING- ] technique=liber8tion       [             reco ] speed=2.238 [GB/s] latency=28.601 ms
[ -TIMING- ] technique=liber8tion       [          reco-lp ] speed=4.399 [GB/s] latency=14.550 ms
[ -TIMING- ] technique=liber8tion       [        reco-lp-3 ] speed=1.878 [GB/s] latency=34.070 ms
[ -TIMING- ] technique=liber8tion       [   reco-lp-crc32c ] speed=2.307 [GB/s] latency=27.737 ms

Cheers Andreas.


________________________________________
From: Loic Dachary [loic@dachary.org]
Sent: 27 September 2013 11:40
To: Andreas Joachim Peters
Cc: Ceph Development
Subject: Re: CEPH Erasure Encoding + OSD Scalability

On 26/09/2013 23:49, Andreas Joachim Peters wrote:> Sure,
> this text is clear, but it does not talk about the cost of reconstruction: selecting a parity chunk instead of a data chunk costs CPU and increases latency, yet this is not reflected by the external cost parameter. E.g. if you have RS(3,2), i.e. 3 data and 2 parity chunks [0,1,2,3,4] with equal cost values, I would select [0,1,2] since it avoids computation; the retrieval cost for [2,3,4] would be the same, but the computational cost is higher.

The implementation knows about the computational cost already and is able to figure out that [0,1,2] is going to be cheaper. It does not need input from the caller and the minimum_to_decode method (without the cost)
https://github.com/ceph/ceph/blob/master/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L45
does this. If you want to read [0,1,2] and have [0,1,2,3,4] available, it will return that you need to retrieve [0,1,2] and not [2,3,4], although both would allow recovering the content of [0,1,2].

>
> Now if [0] has, for example, double the cost compared to chunk [3], it is not clear to me if [1,2,3] is a better set than [0,1,2] ... is the meaning of a higher cost actually more a binary flag saying 'avoid reading this chunk if possible'?
>
> Could you give a practical example when a chunk can have a higher cost in a CEPH setup and a rough range for the 'cost' parameter?

At the moment I can't because it depends on the implementation of the erasure code placement group and it's not complete yet. You are correct: the interpretation of the cost by the plugin cannot be fully described without an intimate knowledge of the implementation. It also means that if the implementation of the caller changes, the semantics of the cost will change and may require a different strategy.

Cheers

> Thanks Andreas.
>
>
>
>
> ________________________________________
> From: Loic Dachary [loic@dachary.org]
> Sent: 26 September 2013 21:18
> To: Andreas Joachim Peters
> Cc: Ceph Development
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>
> [re-adding ceph-devel to the cc]
>
> On 26/09/2013 20:36, Andreas-Joachim Peters wrote:> Hi Loic,
>> today I forked the CEPH repository and will commit my changes to my GitHub fork asap ... (I am not familiar with GitHub in particular).
>> I was finalizing the minimum_to_decode function today with test cases (it is more sophisticated in this case ...) ... I didn't fully get what the 'with cost' function is supposed to do differently from the one without cost?
>
> I'd be happy to explain if
> https://github.com/ceph/ceph/blob/master/src/osd/ErasureCodeInterface.h#L131
> is unclear. Would you be so kind as to tell me what is confusing in the description ?
>
>>
>>
>> Cheers Andreas.
>>
>> On Wed, Sep 25, 2013 at 8:48 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>
>>
>>
>>     On 25/09/2013 20:33, Andreas Joachim Peters wrote:> Yes, sure. I actually thought the same in the meanwhile ...  I have some questions:
>>     >
>>     > Q: Can/should it stay in the framework of google test's or you would prefer just a plain executable ?
>>     >
>>
>>     A plain executable would make sense. A simple example from src/test/Makefile.am :
>>
>>     ceph_test_trans_SOURCES = test/test_trans.cc
>>     ceph_test_trans_LDADD = $(LIBOS) $(CEPH_GLOBAL)
>>     bin_DEBUGPROGRAMS += ceph_test_trans
>>
>>
>>     > I have added local parity support to your erasure class adding a new argument: "erasure-code-lp" and
>>     > two new methods:
>>     >
>>     > localparity_encode(...)
>>     > localparity_decode(...)
>>     >
>>     > I made a more complex benchmark of (8,2) + 2 local parities (1^2^3^4, 5^6^7^8) which benchmarks performance of encoding/decoding as speed & effective write-latency for three cases (each for liberation & cauchy_good codecs):
>>     >
>>     > 1 (8,2)
>>     > 2 (8,2,lp=2)
>>     > 3 (8,2,lp=2) + crc32c (blocks)
>>     >
>>     > and several failure scenarios ... single, double, triple disk failures. Probably the best would be to make all these parameters configurable.
>>
>>     Great :-) Do you have a public git repository where I could clone this & give it a try ?
>>
>>     > Q: For the local parity implementation .... shall I inherit from your erasure plugin and overwrite the encode/decode method or you would consider a patch to the original class?
>>
>>     It is perfect timing for a patch to the original class.
>>
>>     > I have also a 128-bit XOR implementation for the local parities. This will work with new gcc's & clang compilers ...
>>     >
>>     > Q: Which compilers/platforms are supported by CEPH? Is there a minimal GCC version?
>>
>>     You can see all supported platforms here:
>>
>>     http://ceph.com/gitbuilder.cgi
>>
>>     I don't think the GCC version shows in the logs but you can probably figure it out from the corresponding distribution.
>>
>>     > Q: is there some policy restricting comments within code? In general I see very few or no comments within the code ..
>>
>>     :-) The mon code tends to be more heavily commented than the osd code (IMO) but I'm not aware of any policy. When I feel the need to comment, I write a unit test. If the unit test is difficult, I tend to comment to clarify its purpose. The problem with comments is that they quickly become obsolete and/or misleading. That being said, I don't think anyone will object if you heavily comment your code.
>>
>>     Cheers
>>
>>     > Cheers Andreas.
>>     >
>>     >
>>     >
>>     >
>>
>>     --
>>     Loïc Dachary, Artisan Logiciel Libre
>>     All that is necessary for the triumph of evil is that good people do nothing.
>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-09-26 21:49               ` Andreas Joachim Peters
@ 2013-09-27  9:40                 ` Loic Dachary
  2013-10-01 23:00                   ` Andreas Joachim Peters
  0 siblings, 1 reply; 52+ messages in thread
From: Loic Dachary @ 2013-09-27  9:40 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 6061 bytes --]



On 26/09/2013 23:49, Andreas Joachim Peters wrote:> Sure, 
> this text is clear, but it does not talk about the cost of reconstruction: selecting a parity chunk instead of a data chunk costs CPU and increases latency, yet this is not reflected by the external cost parameter. E.g. if you have RS(3,2), i.e. 3 data and 2 parity chunks [0,1,2,3,4] with equal cost values, I would select [0,1,2] since it avoids computation; the retrieval cost for [2,3,4] would be the same, but the computational cost is higher.

The implementation knows about the computational cost already and is able to figure out that [0,1,2] is going to be cheaper. It does not need input from the caller and the minimum_to_decode method (without the cost)
https://github.com/ceph/ceph/blob/master/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L45
does this. If you want to read [0,1,2] and have [0,1,2,3,4] available, it will return that you need to retrieve [0,1,2] and not [2,3,4], although both would allow recovering the content of [0,1,2].

> 
> Now if [0] has, for example, double the cost compared to chunk [3], it is not clear to me if [1,2,3] is a better set than [0,1,2] ... is the meaning of a higher cost actually more a binary flag saying 'avoid reading this chunk if possible'?
> 
> Could you give a practical example when a chunk can have a higher cost in a CEPH setup and a rough range for the 'cost' parameter?

At the moment I can't because it depends on the implementation of the erasure code placement group and it's not complete yet. You are correct: the interpretation of the cost by the plugin cannot be fully described without an intimate knowledge of the implementation. It also means that if the implementation of the caller changes, the semantics of the cost will change and may require a different strategy.

Cheers

> Thanks Andreas.
> 
> 
> 
> 
> ________________________________________
> From: Loic Dachary [loic@dachary.org]
> Sent: 26 September 2013 21:18
> To: Andreas Joachim Peters
> Cc: Ceph Development
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
> 
> [re-adding ceph-devel to the cc]
> 
> On 26/09/2013 20:36, Andreas-Joachim Peters wrote:> Hi Loic,
>> today I forked the CEPH repository and will commit my changes to my GitHub fork asap ... (I am not familiar with GitHub in particular).
>> I was finalizing the minimum_to_decode function today with test cases (it is more sophisticated in this case ...) ... I didn't fully get what the 'with cost' function is supposed to do differently from the one without cost?
> 
> I'd be happy to explain if
> https://github.com/ceph/ceph/blob/master/src/osd/ErasureCodeInterface.h#L131
> is unclear. Would you be so kind as to tell me what is confusing in the description ?
> 
>>
>>
>> Cheers Andreas.
>>
>> On Wed, Sep 25, 2013 at 8:48 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>
>>
>>
>>     On 25/09/2013 20:33, Andreas Joachim Peters wrote:> Yes, sure. I actually thought the same in the meanwhile ...  I have some questions:
>>     >
>>     > Q: Can/should it stay in the framework of google test's or you would prefer just a plain executable ?
>>     >
>>
>>     A plain executable would make sense. A simple example from src/test/Makefile.am :
>>
>>     ceph_test_trans_SOURCES = test/test_trans.cc
>>     ceph_test_trans_LDADD = $(LIBOS) $(CEPH_GLOBAL)
>>     bin_DEBUGPROGRAMS += ceph_test_trans
>>
>>
>>     > I have added local parity support to your erasure class adding a new argument: "erasure-code-lp" and
>>     > two new methods:
>>     >
>>     > localparity_encode(...)
>>     > localparity_decode(...)
>>     >
>>     > I made a more complex benchmark of (8,2) + 2 local parities (1^2^3^4, 5^6^7^8) which benchmarks performance of encoding/decoding as speed & effective write-latency for three cases (each for liberation & cauchy_good codecs):
>>     >
>>     > 1 (8,2)
>>     > 2 (8,2,lp=2)
>>     > 3 (8,2,lp=2) + crc32c (blocks)
>>     >
>>     > and several failure scenarios ... single, double, triple disk failures. Probably the best would be to make all these parameters configurable.
>>
>>     Great :-) Do you have a public git repository where I could clone this & give it a try ?
>>
>>     > Q: For the local parity implementation .... shall I inherit from your erasure plugin and overwrite the encode/decode method or you would consider a patch to the original class?
>>
>>     It is perfect timing for a patch to the original class.
>>
>>     > I have also a 128-bit XOR implementation for the local parities. This will work with new gcc's & clang compilers ...
>>     >
>>     > Q: Which compilers/platforms are supported by CEPH? Is there a minimal GCC version?
>>
>>     You can see all supported platforms here:
>>
>>     http://ceph.com/gitbuilder.cgi
>>
>>     I don't think the GCC version shows in the logs but you can probably figure it out from the corresponding distribution.
>>
>>     > Q: is there some policy restricting comments within code? In general I see very few or no comments within the code ..
>>
>>     :-) The mon code tends to be more heavily commented than the osd code (IMO) but I'm not aware of any policy. When I feel the need to comment, I write a unit test. If the unit test is difficult, I tend to comment to clarify its purpose. The problem with comments is that they quickly become obsolete and/or misleading. That being said, I don't think anyone will object if you heavily comment your code.
>>
>>     Cheers
>>
>>     > Cheers Andreas.
>>     >
>>     >
>>     >
>>     >
>>
>>     --
>>     Loïc Dachary, Artisan Logiciel Libre
>>     All that is necessary for the triumph of evil is that good people do nothing.
>>
>>
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
  2013-09-26 19:18             ` Loic Dachary
@ 2013-09-26 21:49               ` Andreas Joachim Peters
  2013-09-27  9:40                 ` Loic Dachary
  0 siblings, 1 reply; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-09-26 21:49 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

Sure, 
this text is clear, but it does not talk about the cost of reconstruction: selecting a parity chunk instead of a data chunk costs CPU and increases latency, yet this is not reflected by the external cost parameter. E.g. if you have RS(3,2), i.e. 3 data and 2 parity chunks [0,1,2,3,4] with equal cost values, I would select [0,1,2] since it avoids computation; the retrieval cost for [2,3,4] would be the same, but the computational cost is higher.

Now if [0] has, for example, double the cost compared to chunk [3], it is not clear to me if [1,2,3] is a better set than [0,1,2] ... is the meaning of a higher cost actually more a binary flag saying 'avoid reading this chunk if possible'?

Could you give a practical example when a chunk can have a higher cost in a CEPH setup and a rough range for the 'cost' parameter?

Thanks Andreas.




________________________________________
From: Loic Dachary [loic@dachary.org]
Sent: 26 September 2013 21:18
To: Andreas Joachim Peters
Cc: Ceph Development
Subject: Re: CEPH Erasure Encoding + OSD Scalability

[re-adding ceph-devel to the cc]

On 26/09/2013 20:36, Andreas-Joachim Peters wrote:> Hi Loic,
> today I forked the CEPH repository and will commit my changes to my GitHub fork asap ... (I am not familiar with GitHub in particular).
> I was finalizing the minimum_to_decode function today with test cases (it is more sophisticated in this case ...) ... I didn't fully get what the 'with cost' function is supposed to do differently from the one without cost?

I'd be happy to explain if
https://github.com/ceph/ceph/blob/master/src/osd/ErasureCodeInterface.h#L131
is unclear. Would you be so kind as to tell me what is confusing in the description ?

>
>
> Cheers Andreas.
>
> On Wed, Sep 25, 2013 at 8:48 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>
>
>
>     On 25/09/2013 20:33, Andreas Joachim Peters wrote:> Yes, sure. I actually thought the same in the meanwhile ...  I have some questions:
>     >
>     > Q: Can/should it stay in the framework of google test's or you would prefer just a plain executable ?
>     >
>
>     A plain executable would make sense. A simple example from src/test/Makefile.am :
>
>     ceph_test_trans_SOURCES = test/test_trans.cc
>     ceph_test_trans_LDADD = $(LIBOS) $(CEPH_GLOBAL)
>     bin_DEBUGPROGRAMS += ceph_test_trans
>
>
>     > I have added local parity support to your erasure class adding a new argument: "erasure-code-lp" and
>     > two new methods:
>     >
>     > localparity_encode(...)
>     > localparity_decode(...)
>     >
>     > I made a more complex benchmark of (8,2) + 2 local parities (1^2^3^4, 5^6^7^8) which benchmarks performance of encoding/decoding as speed & effective write-latency for three cases (each for liberation & cauchy_good codecs):
>     >
>     > 1 (8,2)
>     > 2 (8,2,lp=2)
>     > 3 (8,2,lp=2) + crc32c (blocks)
>     >
>     > and several failure scenarios ... single, double, triple disk failures. Probably the best would be to make all these parameters configurable.
>
>     Great :-) Do you have a public git repository where I could clone this & give it a try ?
>
>     > Q: For the local parity implementation .... shall I inherit from your erasure plugin and overwrite the encode/decode method or you would consider a patch to the original class?
>
>     It is perfect timing for a patch to the original class.
>
>     > I have also a 128-bit XOR implementation for the local parities. This will work with new gcc's & clang compilers ...
>     >
>     > Q: Which compilers/platforms are supported by CEPH? Is there a minimal GCC version?
>
>     You can see all supported platforms here:
>
>     http://ceph.com/gitbuilder.cgi
>
>     I don't think the GCC version shows in the logs but you can probably figure it out from the corresponding distribution.
>
>     > Q: is there some policy restricting comments within code? In general I see very few or no comments within the code ..
>
>     :-) The mon code tends to be more heavily commented than the osd code (IMO) but I'm not aware of any policy. When I feel the need to comment, I write a unit test. If the unit test is difficult, I tend to comment to clarify its purpose. The problem with comments is that they quickly become obsolete and/or misleading. That being said, I don't think anyone will object if you heavily comment your code.
>
>     Cheers
>
>     > Cheers Andreas.
>     >
>     >
>     >
>     >
>
>     --
>     Loïc Dachary, Artisan Logiciel Libre
>     All that is necessary for the triumph of evil is that good people do nothing.
>
>

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
       [not found]           ` <CAGhffvz1TYYLoqn0tps1HiLObSCv7H0ZNVgOd0raicGqgRuukA@mail.gmail.com>
@ 2013-09-26 19:18             ` Loic Dachary
  2013-09-26 21:49               ` Andreas Joachim Peters
  0 siblings, 1 reply; 52+ messages in thread
From: Loic Dachary @ 2013-09-26 19:18 UTC (permalink / raw)
  To: Andreas-Joachim Peters; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 3749 bytes --]

[re-adding ceph-devel to the cc]

On 26/09/2013 20:36, Andreas-Joachim Peters wrote:
> Hi Loic,
> today I forked the CEPH repository and will commit my changes to my GitHub fork asap ... (I am not familiar with GitHub in particular).
> I was finalizing the minimum_to_decode function today with test cases (it is more sophisticated in this case ...) ... I didn't fully get what the 'with cost' function is supposed to do differently from the one without cost?

I'd be happy to explain if 
https://github.com/ceph/ceph/blob/master/src/osd/ErasureCodeInterface.h#L131
is unclear. Would you be so kind as to tell me what is confusing in the description ? 

> 
> 
> Cheers Andreas.
> 
> On Wed, Sep 25, 2013 at 8:48 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
> 
> 
> 
>     On 25/09/2013 20:33, Andreas Joachim Peters wrote:
>     > Yes, sure. I actually thought the same in the meantime ... I have some questions:
>     >
>     > Q: Can/should it stay in the framework of google test's or you would prefer just a plain executable ?
>     >
> 
>     A plain executable would make sense. A simple example from src/test/Makefile.am :
> 
>     ceph_test_trans_SOURCES = test/test_trans.cc
>     ceph_test_trans_LDADD = $(LIBOS) $(CEPH_GLOBAL)
>     bin_DEBUGPROGRAMS += ceph_test_trans
> 
> 
>     > I have added local parity support to your erasure class adding a new argument: "erasure-code-lp" and
>     > two new methods:
>     >
>     > localparity_encode(...)
>     > localparity_decode(...)
>     >
>     > I made a more complex benchmark of (8,2) + 2 local parities (1^2^3^4, 5^6^7^8) which benchmarks performance of encoding/decoding as speed & effective write-latency for three cases (each for liberation & cauchy_good codecs):
>     >
>     > 1 (8,2)
>     > 2 (8,2,lp=2)
>     > 3 (8,2,lp=2) + crc32c (blocks)
>     >
>     > and several failure scenarios ... single, double, triple disk failures. Probably it would be best to make all these parameters configurable.
> 
>     Great :-) Do you have a public git repository where I could clone this & give it a try ?
> 
>     > Q: For the local parity implementation .... shall I inherit from your erasure plugin and overwrite the encode/decode method or you would consider a patch to the original class?
> 
>     It is perfect timing for a patch to the original class.
> 
>     > I have also a 128-bit XOR implementation for the local parities. This will work with new gcc's & clang compilers ...
>     >
>     > Q: Which compilers/platforms are supported by CEPH? Is there a minimal GCC version?
> 
>     You can see all supported platforms here:
> 
>     http://ceph.com/gitbuilder.cgi
> 
>     I don't think the GCC version shows in the logs but you can probably figure it out from the corresponding distribution.
> 
>     > Q: is there some policy restricting comments within code? In general I see very few or no comments within the code ..
> 
>     :-) The mon code tends to be more heavily commented than the osd code (IMO) but I'm not aware of any policy. When I feel the need to comment, I write a unit test. If the unit test is difficult, I tend to comment to clarify its purpose. The problem with comments is that they quickly become obsolete and/or misleading. That being said, I don't think anyone will object if you heavily comment your code.
> 
>     Cheers
> 
>     > Cheers Andreas.
>     >
>     >
>     >
>     >
> 
>     --
>     Loïc Dachary, Artisan Logiciel Libre
>     All that is necessary for the triumph of evil is that good people do nothing.
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]


* Re: CEPH Erasure Encoding + OSD Scalability
  2013-09-25 18:48         ` Loic Dachary
@ 2013-09-25 18:53           ` Sage Weil
       [not found]           ` <CAGhffvz1TYYLoqn0tps1HiLObSCv7H0ZNVgOd0raicGqgRuukA@mail.gmail.com>
  1 sibling, 0 replies; 52+ messages in thread
From: Sage Weil @ 2013-09-25 18:53 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Andreas Joachim Peters, ceph-devel

On Wed, 25 Sep 2013, Loic Dachary wrote:
> 
> 
> On 25/09/2013 20:33, Andreas Joachim Peters wrote:
> > Yes, sure. I actually thought the same in the meantime ... I have some questions:
> > 
> > Q: Can/should it stay in the framework of google test's or you would prefer just a plain executable ?
> > 
> 
> A plain executable would make sense. A simple example from src/test/Makefile.am :
> 
> ceph_test_trans_SOURCES = test/test_trans.cc
> ceph_test_trans_LDADD = $(LIBOS) $(CEPH_GLOBAL)
> bin_DEBUGPROGRAMS += ceph_test_trans

FWIW there are a few tools that use gtest that aren't strictly unit tests 
(ceph_test_rados_api_*, for example) just because the framework is 
convenient.  

There are also a few things that do simple, low-cost benchmarks that are 
run as unit tests (e.g., unittest_crc32c).  I think it just depends on how 
expensive the tests you're considering are.

sage


> 
> 
> > I have added local parity support to your erasure class adding a new argument: "erasure-code-lp" and
> > two new methods:
> > 
> > localparity_encode(...)
> > localparity_decode(...)
> > 
> > I made a more complex benchmark of (8,2) + 2 local parities (1^2^3^4, 5^6^7^8) which benchmarks performance of encoding/decoding as speed & effective write-latency for three cases (each for liberation & cauchy_good codecs):
> > 
> > 1 (8,2)
> > 2 (8,2,lp=2)
> > 3 (8,2,lp=2) + crc32c (blocks)
> > 
> > and several failure scenarios ... single, double, triple disk failures. Probably it would be best to make all these parameters configurable.
> 
> Great :-) Do you have a public git repository where I could clone this & give it a try ?
> 
> > Q: For the local parity implementation .... shall I inherit from your erasure plugin and overwrite the encode/decode method or you would consider a patch to the original class?
> 
> It is perfect timing for a patch to the original class.
> 
> > I have also a 128-bit XOR implementation for the local parities. This will work with new gcc's & clang compilers ... 
> > 
> > Q: Which compilers/platforms are supported by CEPH? Is there a minimal GCC version?
> 
> You can see all supported platforms here:
> 
> http://ceph.com/gitbuilder.cgi
> 
> I don't think the GCC version shows in the logs but you can probably figure it out from the corresponding distribution. 
> 
> > Q: is there some policy restricting comments within code? In general I see very few or no comments within the code ..
> 
> :-) The mon code tends to be more heavily commented than the osd code (IMO) but I'm not aware of any policy. When I feel the need to comment, I write a unit test. If the unit test is difficult, I tend to comment to clarify its purpose. The problem with comments is that they quickly become obsolete and/or misleading. That being said, I don't think anyone will object if you heavily comment your code.
> 
> Cheers
> 
> > Cheers Andreas.
> > 
> > 
> > 
> > 
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
> 
> 


* Re: CEPH Erasure Encoding + OSD Scalability
  2013-09-25 18:33       ` Andreas Joachim Peters
@ 2013-09-25 18:48         ` Loic Dachary
  2013-09-25 18:53           ` Sage Weil
       [not found]           ` <CAGhffvz1TYYLoqn0tps1HiLObSCv7H0ZNVgOd0raicGqgRuukA@mail.gmail.com>
  0 siblings, 2 replies; 52+ messages in thread
From: Loic Dachary @ 2013-09-25 18:48 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 2555 bytes --]



On 25/09/2013 20:33, Andreas Joachim Peters wrote:
> Yes, sure. I actually thought the same in the meantime ... I have some questions:
> 
> Q: Can/should it stay in the framework of google test's or you would prefer just a plain executable ?
> 

A plain executable would make sense. A simple example from src/test/Makefile.am :

ceph_test_trans_SOURCES = test/test_trans.cc
ceph_test_trans_LDADD = $(LIBOS) $(CEPH_GLOBAL)
bin_DEBUGPROGRAMS += ceph_test_trans


> I have added local parity support to your erasure class adding a new argument: "erasure-code-lp" and
> two new methods:
> 
> localparity_encode(...)
> localparity_decode(...)
> 
> I made a more complex benchmark of (8,2) + 2 local parities (1^2^3^4, 5^6^7^8) which benchmarks performance of encoding/decoding as speed & effective write-latency for three cases (each for liberation & cauchy_good codecs):
> 
> 1 (8,2)
> 2 (8,2,lp=2)
> 3 (8,2,lp=2) + crc32c (blocks)
> 
> and several failure scenarios ... single, double, triple disk failures. Probably it would be best to make all these parameters configurable.

Great :-) Do you have a public git repository where I could clone this & give it a try ?

> Q: For the local parity implementation .... shall I inherit from your erasure plugin and overwrite the encode/decode method or you would consider a patch to the original class?

It is perfect timing for a patch to the original class.

> I have also a 128-bit XOR implementation for the local parities. This will work with new gcc's & clang compilers ... 
> 
> Q: Which compilers/platforms are supported by CEPH? Is there a minimal GCC version?

You can see all supported platforms here:

http://ceph.com/gitbuilder.cgi

I don't think the GCC version shows in the logs but you can probably figure it out from the corresponding distribution. 

> Q: is there some policy restricting comments within code? In general I see very few or no comments within the code ..

:-) The mon code tends to be more heavily commented than the osd code (IMO) but I'm not aware of any policy. When I feel the need to comment, I write a unit test. If the unit test is difficult, I tend to comment to clarify its purpose. The problem with comments is that they quickly become obsolete and/or misleading. That being said, I don't think anyone will object if you heavily comment your code.

Cheers

> Cheers Andreas.
> 
> 
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]


* RE: CEPH Erasure Encoding + OSD Scalability
  2013-09-25 15:14     ` Loic Dachary
@ 2013-09-25 18:33       ` Andreas Joachim Peters
  2013-09-25 18:48         ` Loic Dachary
  0 siblings, 1 reply; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-09-25 18:33 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Yes, sure. I actually thought the same in the meantime ... I have some questions:

Q: Can/should it stay in the framework of google test's or you would prefer just a plain executable ?

I have added local parity support to your erasure class adding a new argument: "erasure-code-lp" and
two new methods:

localparity_encode(...)
localparity_decode(...)

I made a more complex benchmark of (8,2) + 2 local parities (1^2^3^4, 5^6^7^8) which benchmarks performance of encoding/decoding as speed & effective write-latency for three cases (each for liberation & cauchy_good codecs):

1 (8,2)
2 (8,2,lp=2)
3 (8,2,lp=2) + crc32c (blocks)

and several failure scenarios ... single, double, triple disk failures. Probably it would be best to make all these parameters configurable.

Q: For the local parity implementation .... shall I inherit from your erasure plugin and overwrite the encode/decode method or you would consider a patch to the original class?

I have also a 128-bit XOR implementation for the local parities. This will work with recent gcc & clang compilers ... 

Q: Which compilers/platforms are supported by CEPH? Is there a minimal GCC version?

Q: is there some policy restricting comments within code? In general I see very few or no comments within the code ..

Cheers Andreas.






* Re: CEPH Erasure Encoding + OSD Scalability
  2013-09-23 15:43   ` Andreas Joachim Peters
@ 2013-09-25 15:14     ` Loic Dachary
  2013-09-25 18:33       ` Andreas Joachim Peters
  0 siblings, 1 reply; 52+ messages in thread
From: Loic Dachary @ 2013-09-25 15:14 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 5686 bytes --]

Hi Andreas,

It looks like this code would be useful as a standalone program instead of being integrated within the unit tests. There are a few support programs of that kind. The unit tests are probably not the best place for benchmarks. What do you think ?

Cheers

Note: got the updated file ;-)

> I fixed some more issues in the test:
> 
> 1 strncmp => memcmp (the object now contains binary data and 0s ..)
> 2 I fixed the length of the second block, which was wrong after the change from 2,2 to 4,2
> 3 I have put 1MB as the default object size, otherwise running with valgrind takes too long
> 4 I have simplified the time measurement code etc.
> 

On 23/09/2013 17:43, Andreas Joachim Peters wrote:
> Hi Loic, 
> 
> I have modified the Jerasure unit test to record the encoding & reconstruction performance and to store this value in the optional google-test XML output file. I have put (4,2) with a 4MB random object as default, and one can pass a different object size via '--object-size=1000' (for 1G).
> 
> It looks like this:
> 
> ./unittest_erasure_code_jerasure --gtest_filter=*.* --log-to-stderr=true --gtest_output="xml:erasure.xml" --object-size=1000
> Note: Google Test filter = *.*
> [==========] Running 8 tests from 8 test cases.
> [----------] Global test environment set-up.
> [----------] 1 test from ErasureCodeTest/0, where TypeParam = ErasureCodeJerasureReedSolomonVandermonde
> [ RUN      ] ErasureCodeTest/0.encode_decode
> [       OK ] ErasureCodeTest/0.encode_decode (35231 ms)
> [----------] 1 test from ErasureCodeTest/0 (35231 ms total)
> 
> [----------] 1 test from ErasureCodeTest/1, where TypeParam = ErasureCodeJerasureReedSolomonRAID6
> [ RUN      ] ErasureCodeTest/1.encode_decode
> [       OK ] ErasureCodeTest/1.encode_decode (35594 ms)
> [----------] 1 test from ErasureCodeTest/1 (35594 ms total)
> 
> [----------] 1 test from ErasureCodeTest/2, where TypeParam = ErasureCodeJerasureCauchyOrig
> [ RUN      ] ErasureCodeTest/2.encode_decode
> [       OK ] ErasureCodeTest/2.encode_decode (33009 ms)
> [----------] 1 test from ErasureCodeTest/2 (33010 ms total)
> 
> [----------] 1 test from ErasureCodeTest/3, where TypeParam = ErasureCodeJerasureCauchyGood
> [ RUN      ] ErasureCodeTest/3.encode_decode
> [       OK ] ErasureCodeTest/3.encode_decode (31917 ms)
> [----------] 1 test from ErasureCodeTest/3 (31920 ms total)
> 
> [----------] 1 test from ErasureCodeTest/4, where TypeParam = ErasureCodeJerasureLiberation
> [ RUN      ] ErasureCodeTest/4.encode_decode
> [       OK ] ErasureCodeTest/4.encode_decode (31801 ms)
> [----------] 1 test from ErasureCodeTest/4 (31801 ms total)
> 
> [----------] 1 test from ErasureCodeTest/5, where TypeParam = ErasureCodeJerasureBlaumRoth
> [ RUN      ] ErasureCodeTest/5.encode_decode
> [       OK ] ErasureCodeTest/5.encode_decode (31927 ms)
> [----------] 1 test from ErasureCodeTest/5 (31927 ms total)
> 
> [----------] 1 test from ErasureCodeTest/6, where TypeParam = ErasureCodeJerasureLiber8tion
> [ RUN      ] ErasureCodeTest/6.encode_decode
> [       OK ] ErasureCodeTest/6.encode_decode (31824 ms)
> [----------] 1 test from ErasureCodeTest/6 (31824 ms total)
> 
> [----------] 1 test from ErasureCodeTiming
> [ RUN      ] ErasureCodeTiming.PropertyOutput
> [ -TIMING- ] technique=blaum_roth       speed [ encode ]=2.902 [GB/s]
> [ -TIMING- ] technique=blaum_roth       speed [   reco ]=1.701 [GB/s]
> [ -TIMING- ] technique=cauchy_good      speed [ encode ]=2.551 [GB/s]
> [ -TIMING- ] technique=cauchy_good      speed [   reco ]=1.571 [GB/s]
> [ -TIMING- ] technique=cauchy_orig      speed [ encode ]=1.401 [GB/s]
> [ -TIMING- ] technique=cauchy_orig      speed [   reco ]=0.911 [GB/s]
> [ -TIMING- ] technique=liber8tion       speed [ encode ]=2.861 [GB/s]
> [ -TIMING- ] technique=liber8tion       speed [   reco ]=1.822 [GB/s]
> [ -TIMING- ] technique=liberation       speed [ encode ]=2.863 [GB/s]
> [ -TIMING- ] technique=liberation       speed [   reco ]=1.815 [GB/s]
> [ -TIMING- ] technique=reed_sol_r6_op   speed [ encode ]=1.194 [GB/s]
> [ -TIMING- ] technique=reed_sol_r6_op   speed [   reco ]=0.489 [GB/s]
> [ -TIMING- ] technique=reed_sol_van     speed [ encode ]=0.600 [GB/s]
> [ -TIMING- ] technique=reed_sol_van     speed [   reco ]=0.429 [GB/s]
> [       OK ] ErasureCodeTiming.PropertyOutput (0 ms)
> [----------] 1 test from ErasureCodeTiming (0 ms total)
> 
> [----------] Global test environment tear-down
> [==========] 8 tests from 8 test cases ran. (231307 ms total)
> [  PASSED  ] 8 tests.
> 
> 
> [----------] Global test environment tear-down
> [==========] 8 tests from 8 test cases ran. (31351 ms total)
> [  PASSED  ] 8 tests.
> 
> And the XML:
> 
> <testsuite name="ErasureCodeTiming" tests="1" failures="0" disabled="0" errors="0" time="0">
>     <testcase name="PropertyOutput" status="run" time="0" classname="ErasureCodeTiming" jerasure::blaum_roth::encode="2902" jerasure::blaum_roth::reco="1700" jerasure::cauchy_good::encode="2551" jerasure::cauchy_good::reco="1571" jerasure::cauchy_orig::encode="1401" jerasure::cauchy_orig::reco="910" jerasure::liber8tion::encode="2861" jerasure::liber8tion::reco="1821" jerasure::liberation::encode="2862" jerasure::liberation::reco="1814" jerasure::reed_sol_r6_op::encode="1194" jerasure::reed_sol_r6_op::reco="489" jerasure::reed_sol_van::encode="599" jerasure::reed_sol_van::reco="428" object-size="1000000000" />
>   </testsuite>
> 
> Maybe you could use this directly for QA.
> 
> Cheers Andreas.
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]


* RE: CEPH Erasure Encoding + OSD Scalability
  2013-09-23  7:27 ` Loic Dachary
  2013-09-23  9:37   ` Andreas Joachim Peters
@ 2013-09-23 15:43   ` Andreas Joachim Peters
  2013-09-25 15:14     ` Loic Dachary
  1 sibling, 1 reply; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-09-23 15:43 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 4691 bytes --]

Hi Loic, 

I have modified the Jerasure unit test to record the encoding & reconstruction performance and to store this value in the optional google-test XML output file. I have put (4,2) with a 4MB random object as default, and one can pass a different object size via '--object-size=1000' (for 1G).

It looks like this:

./unittest_erasure_code_jerasure --gtest_filter=*.* --log-to-stderr=true --gtest_output="xml:erasure.xml" --object-size=1000
Note: Google Test filter = *.*
[==========] Running 8 tests from 8 test cases.
[----------] Global test environment set-up.
[----------] 1 test from ErasureCodeTest/0, where TypeParam = ErasureCodeJerasureReedSolomonVandermonde
[ RUN      ] ErasureCodeTest/0.encode_decode
[       OK ] ErasureCodeTest/0.encode_decode (35231 ms)
[----------] 1 test from ErasureCodeTest/0 (35231 ms total)

[----------] 1 test from ErasureCodeTest/1, where TypeParam = ErasureCodeJerasureReedSolomonRAID6
[ RUN      ] ErasureCodeTest/1.encode_decode
[       OK ] ErasureCodeTest/1.encode_decode (35594 ms)
[----------] 1 test from ErasureCodeTest/1 (35594 ms total)

[----------] 1 test from ErasureCodeTest/2, where TypeParam = ErasureCodeJerasureCauchyOrig
[ RUN      ] ErasureCodeTest/2.encode_decode
[       OK ] ErasureCodeTest/2.encode_decode (33009 ms)
[----------] 1 test from ErasureCodeTest/2 (33010 ms total)

[----------] 1 test from ErasureCodeTest/3, where TypeParam = ErasureCodeJerasureCauchyGood
[ RUN      ] ErasureCodeTest/3.encode_decode
[       OK ] ErasureCodeTest/3.encode_decode (31917 ms)
[----------] 1 test from ErasureCodeTest/3 (31920 ms total)

[----------] 1 test from ErasureCodeTest/4, where TypeParam = ErasureCodeJerasureLiberation
[ RUN      ] ErasureCodeTest/4.encode_decode
[       OK ] ErasureCodeTest/4.encode_decode (31801 ms)
[----------] 1 test from ErasureCodeTest/4 (31801 ms total)

[----------] 1 test from ErasureCodeTest/5, where TypeParam = ErasureCodeJerasureBlaumRoth
[ RUN      ] ErasureCodeTest/5.encode_decode
[       OK ] ErasureCodeTest/5.encode_decode (31927 ms)
[----------] 1 test from ErasureCodeTest/5 (31927 ms total)

[----------] 1 test from ErasureCodeTest/6, where TypeParam = ErasureCodeJerasureLiber8tion
[ RUN      ] ErasureCodeTest/6.encode_decode
[       OK ] ErasureCodeTest/6.encode_decode (31824 ms)
[----------] 1 test from ErasureCodeTest/6 (31824 ms total)

[----------] 1 test from ErasureCodeTiming
[ RUN      ] ErasureCodeTiming.PropertyOutput
[ -TIMING- ] technique=blaum_roth       speed [ encode ]=2.902 [GB/s]
[ -TIMING- ] technique=blaum_roth       speed [   reco ]=1.701 [GB/s]
[ -TIMING- ] technique=cauchy_good      speed [ encode ]=2.551 [GB/s]
[ -TIMING- ] technique=cauchy_good      speed [   reco ]=1.571 [GB/s]
[ -TIMING- ] technique=cauchy_orig      speed [ encode ]=1.401 [GB/s]
[ -TIMING- ] technique=cauchy_orig      speed [   reco ]=0.911 [GB/s]
[ -TIMING- ] technique=liber8tion       speed [ encode ]=2.861 [GB/s]
[ -TIMING- ] technique=liber8tion       speed [   reco ]=1.822 [GB/s]
[ -TIMING- ] technique=liberation       speed [ encode ]=2.863 [GB/s]
[ -TIMING- ] technique=liberation       speed [   reco ]=1.815 [GB/s]
[ -TIMING- ] technique=reed_sol_r6_op   speed [ encode ]=1.194 [GB/s]
[ -TIMING- ] technique=reed_sol_r6_op   speed [   reco ]=0.489 [GB/s]
[ -TIMING- ] technique=reed_sol_van     speed [ encode ]=0.600 [GB/s]
[ -TIMING- ] technique=reed_sol_van     speed [   reco ]=0.429 [GB/s]
[       OK ] ErasureCodeTiming.PropertyOutput (0 ms)
[----------] 1 test from ErasureCodeTiming (0 ms total)

[----------] Global test environment tear-down
[==========] 8 tests from 8 test cases ran. (231307 ms total)
[  PASSED  ] 8 tests.


[----------] Global test environment tear-down
[==========] 8 tests from 8 test cases ran. (31351 ms total)
[  PASSED  ] 8 tests.

And the XML:

<testsuite name="ErasureCodeTiming" tests="1" failures="0" disabled="0" errors="0" time="0">
    <testcase name="PropertyOutput" status="run" time="0" classname="ErasureCodeTiming" jerasure::blaum_roth::encode="2902" jerasure::blaum_roth::reco="1700" jerasure::cauchy_good::encode="2551" jerasure::cauchy_good::reco="1571" jerasure::cauchy_orig::encode="1401" jerasure::cauchy_orig::reco="910" jerasure::liber8tion::encode="2861" jerasure::liber8tion::reco="1821" jerasure::liberation::encode="2862" jerasure::liberation::reco="1814" jerasure::reed_sol_r6_op::encode="1194" jerasure::reed_sol_r6_op::reco="489" jerasure::reed_sol_van::encode="599" jerasure::reed_sol_van::reco="428" object-size="1000000000" />
  </testsuite>

Maybe you could use this directly for QA.

Cheers Andreas.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: TestErasureCodeJerasure.cc --]
[-- Type: text/x-c++src; name="TestErasureCodeJerasure.cc", Size: 6271 bytes --]

// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*- 
// vim: ts=8 sw=2 smarttab
/*
 * Ceph - scalable distributed file system
 *
 * Copyright (C) 2013 Cloudwatt <libre.licensing@cloudwatt.com>
 *
 * Author: Loic Dachary <loic@dachary.org>
 *
 *  This library is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU Lesser General Public
 *  License as published by the Free Software Foundation; either
 *  version 2.1 of the License, or (at your option) any later version.
 * 
 */

#include "global/global_init.h"
#include "osd/ErasureCodePluginJerasure/ErasureCodeJerasure.h"
#include "common/ceph_argparse.h"
#include "common/Clock.h"
#include "common/ceph_context.h"
#include "global/global_context.h"
#include "gtest/gtest.h"

typedef std::map<std::string,utime_t> timing_t;
typedef std::map<std::string,timing_t > timing_map_t;
timing_map_t timing;
unsigned object_size = 4*1000*1000ll;

template <typename T>
class ErasureCodeTest : public ::testing::Test {
 public:
};

typedef ::testing::Types<
  ErasureCodeJerasureReedSolomonVandermonde,
  ErasureCodeJerasureReedSolomonRAID6,
  ErasureCodeJerasureCauchyOrig,
  ErasureCodeJerasureCauchyGood,
  ErasureCodeJerasureLiberation,
  ErasureCodeJerasureBlaumRoth,
  ErasureCodeJerasureLiber8tion
> JerasureTypes;
TYPED_TEST_CASE(ErasureCodeTest, JerasureTypes);

TYPED_TEST(ErasureCodeTest, encode_decode) {
  TypeParam jerasure;
  map<std::string,std::string> parameters;
  parameters["erasure-code-k"] = "4";
  parameters["erasure-code-m"] = "2";

  if ( ((std::string(jerasure.technique)=="liberation")) ||
       ((std::string(jerasure.technique)=="blaum_roth")) )
    parameters["erasure-code-w"] = "7";
  else
    parameters["erasure-code-w"] = "8";

  parameters["erasure-code-packetsize"] = "4096";
  jerasure.init(parameters);

#define LARGE_ENOUGH (7*object_size)

  bufferptr in_ptr(LARGE_ENOUGH);
  in_ptr.zero();
  in_ptr.set_length(0);
  for (size_t i=0; i< object_size; i++) {
    char c = random();
    in_ptr.append(&c,1);
  }

  bufferlist in;
  in.push_front(in_ptr);
  int want_to_encode[] = { 0, 1, 2, 3, 4, 5 };
  map<int, bufferlist> encoded;

  timing[jerasure.technique]["encode-start"] = ceph_clock_now(0);
  EXPECT_EQ(0, jerasure.encode(set<int>(want_to_encode, want_to_encode+6),
                              in,
                              &encoded));

  timing[jerasure.technique]["encode-stop"] = ceph_clock_now(0);
  EXPECT_EQ(6u, encoded.size());
  unsigned length =  encoded[0].length();
  EXPECT_EQ(0, strncmp(encoded[0].c_str(), in.c_str(), length));
  EXPECT_EQ(0, strncmp(encoded[1].c_str(), in.c_str() + length, in.length() - length));


  // all chunks are available
  {
    int want_to_decode[] = { 0, 1 };
    map<int, bufferlist> decoded;
    EXPECT_EQ(0, jerasure.decode(set<int>(want_to_decode, want_to_decode+2),
                                encoded,
                                &decoded));
    // always decode all, regardless of want_to_decode
    EXPECT_EQ(6u, decoded.size()); 
    EXPECT_EQ(length, decoded[0].length());
    EXPECT_EQ(0, strncmp(decoded[0].c_str(), in.c_str(), length));
    EXPECT_EQ(0, strncmp(decoded[1].c_str(), in.c_str() + length, in.length() - length));
  }

  // two chunks are missing 
  {
    map<int, bufferlist> degraded = encoded;
    degraded.erase(0);
    degraded.erase(1);
    EXPECT_EQ(4u, degraded.size());
    int want_to_decode[] = { 0, 1 };
    map<int, bufferlist> decoded;
    timing[jerasure.technique]["reco-start"] = ceph_clock_now(0);
    EXPECT_EQ(0, jerasure.decode(set<int>(want_to_decode, want_to_decode+2),
                                degraded,
                                &decoded));
    timing[jerasure.technique]["reco-stop"] = ceph_clock_now(0);
    // always decode all, regardless of want_to_decode
    EXPECT_EQ(6u, decoded.size()); 
    EXPECT_EQ(length, decoded[0].length());
    EXPECT_EQ(0, strncmp(decoded[0].c_str(), in.c_str(), length));
    EXPECT_EQ(0, strncmp(decoded[1].c_str(), in.c_str() + length, in.length() - length));
  }
    
  timing[jerasure.technique]["encode"] = timing[jerasure.technique]["encode-stop"]-timing[jerasure.technique]["encode-start"];
  timing[jerasure.technique]["reco"] = timing[jerasure.technique]["reco-stop"]-timing[jerasure.technique]["reco-start"];

}

class ErasureCodeTiming : public ::testing::Test {
 public:
};

TEST_F(ErasureCodeTiming, PropertyOutput) {
  for (timing_map_t::const_iterator techniqueit=timing.begin(); techniqueit!=timing.end(); ++techniqueit) {
    for (timing_t::const_iterator modeit=techniqueit->second.begin(); modeit!=techniqueit->second.end(); ++modeit) {
      char timingout[4096];
      if (modeit->first.find("-start") != std::string::npos)
	continue;
      if (modeit->first.find("-stop") != std::string::npos)
	continue;
      double speed = object_size/1000000l/((double)modeit->second)/1000.0;
      snprintf(timingout,
	       sizeof(timingout)-1, 
	       "[ -TIMING- ] technique=%-16s speed [ %6s ]=%02.03f [GB/s]\n",
	       techniqueit->first.c_str(),
	       modeit->first.c_str(), 
	       speed);

      cout << timingout;
      std::string property= std::string("jerasure::") + techniqueit->first.c_str() + "::" + modeit->first.c_str();
      RecordProperty(property.c_str(), speed *1000);
    }
  }  
  RecordProperty("object-size", object_size);
}

int main(int argc, char **argv) {
  vector<const char*> args;
  argv_to_vec(argc, (const char **)argv, args);

  global_init(NULL, args, CEPH_ENTITY_TYPE_CLIENT, CODE_ENVIRONMENT_UTILITY, 0);
  common_init_finish(g_ceph_context);

  ::testing::InitGoogleTest(&argc, argv);
  
  for (int i=0; i< argc; i++) {
    std::string arg=argv[i];
    if (arg.substr(0,14)=="--object-size=") {
      arg.erase(0,14);
      object_size = atoi(arg.c_str())*1000*1000ll;
    }
    if ( !object_size || (object_size > 2000000000ll) ) {
      fprintf(stderr,"error: --object-size=MB ==> ( 0 < MB <= 2000 )\n");
      exit(EINVAL);
    }
  }
  return RUN_ALL_TESTS();
}

// Local Variables:
// compile-command: "cd ../.. ; make -j4 && make unittest_erasure_code_jerasure && valgrind --tool=memcheck ./unittest_erasure_code_jerasure --gtest_filter=*.* --log-to-stderr=true --debug-osd=20 [--object-size=4]"
// End:


* RE: CEPH Erasure Encoding + OSD Scalability
  2013-09-23  7:27 ` Loic Dachary
@ 2013-09-23  9:37   ` Andreas Joachim Peters
  2013-09-23 15:43   ` Andreas Joachim Peters
  1 sibling, 0 replies; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-09-23  9:37 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Hi Loic,
I suggest one more tiny change: if in.length() needs no padding, the current code still increases the in-length by a full alignment:
 
   unsigned alignment = get_alignment();
-  unsigned in_length = in.length() + alignment - ( in.length() % alignment );
+ unsigned tail = in.length() % alignment;
+ unsigned in_length = in.length() + (tail ? (alignment - tail) : 0);

Cheers Andreas.


______________________________________
From: Loic Dachary [loic@dachary.org]
Sent: 23 September 2013 09:27
To: Andreas Joachim Peters
Cc: ceph-devel@vger.kernel.org
Subject: Re: CEPH Erasure Encoding + OSD Scalability

Hi Andreas,

Very inefficient implementation indeed :-) I've integrated your change in

https://github.com/ceph/ceph/pull/619

Cheers

On 23/09/2013 01:00, Andreas Joachim Peters wrote:
>   alignment = k*w*packetsize*sizeof(int);
>   in_length += alignment - (in_length%alignment);

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: CEPH Erasure Encoding + OSD Scalability
  2013-09-22 23:00 Andreas Joachim Peters
@ 2013-09-23  7:27 ` Loic Dachary
  2013-09-23  9:37   ` Andreas Joachim Peters
  2013-09-23 15:43   ` Andreas Joachim Peters
  0 siblings, 2 replies; 52+ messages in thread
From: Loic Dachary @ 2013-09-23  7:27 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 420 bytes --]

Hi Andreas,

Very inefficient implementation indeed :-) I've integrated your change in

https://github.com/ceph/ceph/pull/619

Cheers

On 23/09/2013 01:00, Andreas Joachim Peters wrote:
>   alignment = k*w*packetsize*sizeof(int);
>   in_length += (alignment - (in_length % alignment));

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
@ 2013-09-22 23:00 Andreas Joachim Peters
  2013-09-23  7:27 ` Loic Dachary
  0 siblings, 1 reply; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-09-22 23:00 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Hi Loic, 
I was applying the changes and the situation improves. However, there is still one important thing which actually dominated all the measurements that needed larger packet sizes (everything besides RAID-6):

    pad_in_length(unsigned in_length)


The implementation is sort of 'unlucky' and slow when one increases the packetsize.

   while (in_length%(k*w*packetsize*sizeof(int)) != 0)
      in_length++;

Better to do it like this:

  alignment = k*w*packetsize*sizeof(int);
  in_length += (alignment - (in_length % alignment));

E.g. for the CauchyGood algorithm one should increase the packetsize, and with the changed pad_in_length implementation one gets excellent (pure encoding) performance for (3+2): 2.6 GB/s, and it scales well with the number of cores to > 8 GB/s.

I compared this with the output of the 'encode' example from the jerasure distribution and it gives the same result for (3+2), so that now looks good and consistent! (10,4) is ~610 MB/s.

... 
Finally the description of Jerasure 2.0 looks really great and will probably shift all the performance problems upstream  ;-)

Do you perhaps want to add support in the plugin for local parities (like Xorbas does) to improve the disk-draining performance?

Cheers Andreas.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-09-22  7:26           ` Andreas Joachim Peters
@ 2013-09-22  9:41             ` Loic Dachary
  2013-11-12  1:11             ` Andreas Joachim Peters
  1 sibling, 0 replies; 52+ messages in thread
From: Loic Dachary @ 2013-09-22  9:41 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 5960 bytes --]

Hi Andreas,

That sounds reasonable. Would you be so kind as to send a patch with your changes ? I'll rework it into something that fits the test infrastructure of Ceph.

Cheers

On 22/09/2013 09:26, Andreas Joachim Peters wrote:
> Hi Loic, 
> I'll run a benchmark with the changed code tomorrow ... I actually had to insert some of my realtime benchmark macros into your Jerasure code to see the time split between the buffer-preparation and encoding steps, but for your QA suite it is probably enough to get a total value after your fix. I will send you a program sampling the performance at different buffer sizes and encoding types.
> 
> I changed my code to use vector operations (128-bit XORs) and it gives another 10% gain. I also want to try out whether it makes sense to do the CRC32C computation in-line in the encoding step and compare it with the two-step procedure of first encoding all blocks, then computing CRC32C on all blocks.
> 
> Cheers Andreas.
> 
> 
> 
> ________________________________________
> From: Loic Dachary [loic@dachary.org]
> Sent: 21 September 2013 17:11
> To: Andreas Joachim Peters
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
> 
> Hi Andreas,
> 
> It's probably too soon to be smart about reducing the number of copies, but you're right : this copy is not necessary. The following pull request gets rid of it:
> 
> https://github.com/ceph/ceph/pull/615
> 
> Cheers
> 
> On 20/09/2013 18:49, Loic Dachary wrote:
>> Hi,
>>
>> This is a first attempt at avoiding unnecessary copy:
>>
>> https://github.com/dachary/ceph/blob/03445a5926cd073c11cd8693fb110729e40f35fa/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L66
>>
>> I'm not sure how it could be made more readable / terse with bufferlist iterators. Any kind of hint would be welcome :-)
>>
>> Cheers
>>
>> On 20/09/2013 17:36, Sage Weil wrote:
>>> On Fri, 20 Sep 2013, Loic Dachary wrote:
>>>> Hi Andreas,
>>>>
>>>> Great work on these benchmarks ! It's definitely an incentive to improve as much as possible. Could you push / send the scripts and sequence of operations you've used ? I'll reproduce this locally while getting rid of the extra copy. It would be useful to capture that into a script that can be conveniently run from the teuthology integrations tests to check against performance regressions.
>>>>
>>>> Regarding the 3P implementation, in my opinion it would be very valuable for some people who prefer low CPU consumption. And I'm eager to see more than one plugin in the erasure code plugin directory ;-)
>>>
>>> One way to approach this might be to make a bufferlist 'multi-iterator'
>>> to which you give bufferlist::iterators and which will give you back a
>>> pointer and length for each contiguous segment.  This would capture the
>>> annoying iterator details and let the user focus on processing chunks that
>>> are as large as possible.
>>>
>>> sage
>>>
>>>
>>>  >
>>>> Cheers
>>>>
>>>> On 20/09/2013 13:35, Andreas Joachim Peters wrote:
>>>>> Hi Loic,
>>>>>
>>>>> I have now some benchmarks on a Xeon 2.27 GHz 4-core with gcc 4.4 (-O2) for ENCODING based on the CEPH Jerasure port.
>>>>> I measured for objects from 128k to 512 MB with random contents (if you encode 1 GB objects you see slow downs due to caching inefficiencies ...), otherwise results are stable for the given object sizes.
>>>>>
>>>>> I quote only the benchmark for ErasureCodeJerasureReedSolomonRAID6 (3,2) (the others are significantly slower, 2-3x) and my 3P(3,2,1) implementation, which provides the same redundancy level as RS-RAID6[3,2] (double disk failure) but uses more space (66% vs 100% overhead).
>>>>>
>>>>> The effect of out.c_str() is significant (it contributes a factor-2 slow-down for the best jerasure algorithm for [3,2]).
>>>>>
>>>>> Averaged results for Objects Size 4MB:
>>>>>
>>>>> 1) Erasure CRS [3,2] - 2.6 ms buffer preparation (out.c_str()) - 2.4 ms encoding => ~780 MB/s
>>>>> 2) 3P [3,2,1] - 0.005 ms buffer preparation (3P adjusts the padding in the algorithm) - 0.87 ms encoding => ~4.4 GB/s
>>>>>
>>>>> I think it pays off to avoid the copy in the encoding if it does not matter for the buffer handling upstream and pad only the last chunk.
>>>>>
>>>>> The last thing I tested is how performance scales with the number of cores, running 4 tests in parallel:
>>>>>
>>>>> Jerasure (3,2) limits at ~2.0 GB/s for a 4-core CPU (Xeon 2.27 GHz).
>>>>> 3P(3,2,1) limits at ~8 GB/s for a 4-core CPU (Xeon 2.27 GHz).
>>>>>
>>>>> I also implemented the decoding for 3P, but didn't test yet all reconstruction cases. There is probably room for improvements using AVX support for XOR operations in both implementations.
>>>>>
>>>>> Before I invest more time, do you think it is useful to have this fast 3P algorithm for double disk failures with 100% space overhead? Because I believe that people will always optimize for space and would rather use something like (10,2) even if the performance degrades and CPU consumption goes up?!? Let me know, no problem in any case!
>>>>>
>>>>> Finally I tested some combinations for ErasureCodeJerasureReedSolomonRAID6:
>>>>>
>>>>> (3,2) (4,2) (6,2) (8,2) (10,2) they all run around 780-800 MB/s
>>>>>
>>>>> Cheers Andreas.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>> All that is necessary for the triumph of evil is that good people do nothing.
>>>>
>>>>
>>
> 
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
  2013-09-21 15:11         ` Loic Dachary
@ 2013-09-22  7:26           ` Andreas Joachim Peters
  2013-09-22  9:41             ` Loic Dachary
  2013-11-12  1:11             ` Andreas Joachim Peters
  0 siblings, 2 replies; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-09-22  7:26 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Hi Loic, 
I'll run a benchmark with the changed code tomorrow ... I actually had to insert some of my realtime benchmark macros into your Jerasure code to see the time split between the buffer-preparation and encoding steps, but for your QA suite it is probably enough to get a total value after your fix. I will send you a program sampling the performance at different buffer sizes and encoding types.

I changed my code to use vector operations (128-bit XORs) and it gives another 10% gain. I also want to try out whether it makes sense to do the CRC32C computation in-line in the encoding step and compare it with the two-step procedure of first encoding all blocks, then computing CRC32C on all blocks.

Cheers Andreas.



________________________________________
From: Loic Dachary [loic@dachary.org]
Sent: 21 September 2013 17:11
To: Andreas Joachim Peters
Cc: ceph-devel@vger.kernel.org
Subject: Re: CEPH Erasure Encoding + OSD Scalability

Hi Andreas,

It's probably too soon to be smart about reducing the number of copies, but you're right : this copy is not necessary. The following pull request gets rid of it:

https://github.com/ceph/ceph/pull/615

Cheers

On 20/09/2013 18:49, Loic Dachary wrote:
> Hi,
>
> This is a first attempt at avoiding unnecessary copy:
>
> https://github.com/dachary/ceph/blob/03445a5926cd073c11cd8693fb110729e40f35fa/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L66
>
> I'm not sure how it could be made more readable / terse with bufferlist iterators. Any kind of hint would be welcome :-)
>
> Cheers
>
> On 20/09/2013 17:36, Sage Weil wrote:
>> On Fri, 20 Sep 2013, Loic Dachary wrote:
>>> Hi Andreas,
>>>
>>> Great work on these benchmarks ! It's definitely an incentive to improve as much as possible. Could you push / send the scripts and sequence of operations you've used ? I'll reproduce this locally while getting rid of the extra copy. It would be useful to capture that into a script that can be conveniently run from the teuthology integrations tests to check against performance regressions.
>>>
>>> Regarding the 3P implementation, in my opinion it would be very valuable for some people who prefer low CPU consumption. And I'm eager to see more than one plugin in the erasure code plugin directory ;-)
>>
>> One way to approach this might be to make a bufferlist 'multi-iterator'
>> to which you give bufferlist::iterators and which will give you back a
>> pointer and length for each contiguous segment.  This would capture the
>> annoying iterator details and let the user focus on processing chunks that
>> are as large as possible.
>>
>> sage
>>
>>
>>  >
>>> Cheers
>>>
>>> On 20/09/2013 13:35, Andreas Joachim Peters wrote:
>>>> Hi Loic,
>>>>
>>>> I have now some benchmarks on a Xeon 2.27 GHz 4-core with gcc 4.4 (-O2) for ENCODING based on the CEPH Jerasure port.
>>>> I measured for objects from 128k to 512 MB with random contents (if you encode 1 GB objects you see slow downs due to caching inefficiencies ...), otherwise results are stable for the given object sizes.
>>>>
>>>> I quote only the benchmark for ErasureCodeJerasureReedSolomonRAID6 (3,2) (the others are significantly slower, 2-3x) and my 3P(3,2,1) implementation, which provides the same redundancy level as RS-RAID6[3,2] (double disk failure) but uses more space (66% vs 100% overhead).
>>>>
>>>> The effect of out.c_str() is significant (it contributes a factor-2 slow-down for the best jerasure algorithm for [3,2]).
>>>>
>>>> Averaged results for Objects Size 4MB:
>>>>
>>>> 1) Erasure CRS [3,2] - 2.6 ms buffer preparation (out.c_str()) - 2.4 ms encoding => ~780 MB/s
>>>> 2) 3P [3,2,1] - 0.005 ms buffer preparation (3P adjusts the padding in the algorithm) - 0.87 ms encoding => ~4.4 GB/s
>>>>
>>>> I think it pays off to avoid the copy in the encoding if it does not matter for the buffer handling upstream and pad only the last chunk.
>>>>
>>>> The last thing I tested is how performance scales with the number of cores, running 4 tests in parallel:
>>>>
>>>> Jerasure (3,2) limits at ~2.0 GB/s for a 4-core CPU (Xeon 2.27 GHz).
>>>> 3P(3,2,1) limits at ~8 GB/s for a 4-core CPU (Xeon 2.27 GHz).
>>>>
>>>> I also implemented the decoding for 3P, but didn't test yet all reconstruction cases. There is probably room for improvements using AVX support for XOR operations in both implementations.
>>>>
>>>> Before I invest more time, do you think it is useful to have this fast 3P algorithm for double disk failures with 100% space overhead? Because I believe that people will always optimize for space and would rather use something like (10,2) even if the performance degrades and CPU consumption goes up?!? Let me know, no problem in any case!
>>>>
>>>> Finally I tested some combinations for ErasureCodeJerasureReedSolomonRAID6:
>>>>
>>>> (3,2) (4,2) (6,2) (8,2) (10,2) they all run around 780-800 MB/s
>>>>
>>>> Cheers Andreas.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do nothing.
>>>
>>>
>

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-09-20 16:49       ` Loic Dachary
@ 2013-09-21 15:11         ` Loic Dachary
  2013-09-22  7:26           ` Andreas Joachim Peters
  0 siblings, 1 reply; 52+ messages in thread
From: Loic Dachary @ 2013-09-21 15:11 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 4317 bytes --]

Hi Andreas,

It's probably too soon to be smart about reducing the number of copies, but you're right : this copy is not necessary. The following pull request gets rid of it:

https://github.com/ceph/ceph/pull/615

Cheers

On 20/09/2013 18:49, Loic Dachary wrote:
> Hi,
> 
> This is a first attempt at avoiding unnecessary copy:
> 
> https://github.com/dachary/ceph/blob/03445a5926cd073c11cd8693fb110729e40f35fa/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L66
> 
> I'm not sure how it could be made more readable / terse with bufferlist iterators. Any kind of hint would be welcome :-)
> 
> Cheers
> 
> On 20/09/2013 17:36, Sage Weil wrote:
>> On Fri, 20 Sep 2013, Loic Dachary wrote:
>>> Hi Andreas,
>>>
>>> Great work on these benchmarks ! It's definitely an incentive to improve as much as possible. Could you push / send the scripts and sequence of operations you've used ? I'll reproduce this locally while getting rid of the extra copy. It would be useful to capture that into a script that can be conveniently run from the teuthology integrations tests to check against performance regressions.
>>>
>>> Regarding the 3P implementation, in my opinion it would be very valuable for some people who prefer low CPU consumption. And I'm eager to see more than one plugin in the erasure code plugin directory ;-)
>>
>> One way to approach this might be to make a bufferlist 'multi-iterator'
>> to which you give bufferlist::iterators and which will give you back a
>> pointer and length for each contiguous segment.  This would capture the
>> annoying iterator details and let the user focus on processing chunks that
>> are as large as possible.
>>
>> sage
>>
>>
>>  > 
>>> Cheers
>>>
>>> On 20/09/2013 13:35, Andreas Joachim Peters wrote:
>>>> Hi Loic, 
>>>>
>>>> I have now some benchmarks on a Xeon 2.27 GHz 4-core with gcc 4.4 (-O2) for ENCODING based on the CEPH Jerasure port.
>>>> I measured for objects from 128k to 512 MB with random contents (if you encode 1 GB objects you see slow downs due to caching inefficiencies ...), otherwise results are stable for the given object sizes.
>>>>
>>>> I quote only the benchmark for ErasureCodeJerasureReedSolomonRAID6 (3,2) (the others are significantly slower, 2-3x) and my 3P(3,2,1) implementation, which provides the same redundancy level as RS-RAID6[3,2] (double disk failure) but uses more space (66% vs 100% overhead).
>>>>
>>>> The effect of out.c_str() is significant (it contributes a factor-2 slow-down for the best jerasure algorithm for [3,2]).
>>>>
>>>> Averaged results for Objects Size 4MB:
>>>>
>>>> 1) Erasure CRS [3,2] - 2.6 ms buffer preparation (out.c_str()) - 2.4 ms encoding => ~780 MB/s
>>>> 2) 3P [3,2,1] - 0.005 ms buffer preparation (3P adjusts the padding in the algorithm) - 0.87 ms encoding => ~4.4 GB/s
>>>>
>>>> I think it pays off to avoid the copy in the encoding if it does not matter for the buffer handling upstream and pad only the last chunk.
>>>>
>>>> The last thing I tested is how performance scales with the number of cores, running 4 tests in parallel:
>>>>
>>>> Jerasure (3,2) limits at ~2.0 GB/s for a 4-core CPU (Xeon 2.27 GHz).
>>>> 3P(3,2,1) limits at ~8 GB/s for a 4-core CPU (Xeon 2.27 GHz).
>>>>
>>>> I also implemented the decoding for 3P, but didn't test yet all reconstruction cases. There is probably room for improvements using AVX support for XOR operations in both implementations.
>>>>
>>>> Before I invest more time, do you think it is useful to have this fast 3P algorithm for double disk failures with 100% space overhead? Because I believe that people will always optimize for space and would rather use something like (10,2) even if the performance degrades and CPU consumption goes up?!? Let me know, no problem in any case!
>>>>
>>>> Finally I tested some combinations for ErasureCodeJerasureReedSolomonRAID6:
>>>>
>>>> (3,2) (4,2) (6,2) (8,2) (10,2) they all run around 780-800 MB/s
>>>>
>>>> Cheers Andreas.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> -- 
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do nothing.
>>>
>>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-09-20 15:36     ` Sage Weil
@ 2013-09-20 16:49       ` Loic Dachary
  2013-09-21 15:11         ` Loic Dachary
  0 siblings, 1 reply; 52+ messages in thread
From: Loic Dachary @ 2013-09-20 16:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: Andreas Joachim Peters, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 3961 bytes --]

Hi,

This is a first attempt at avoiding unnecessary copy:

https://github.com/dachary/ceph/blob/03445a5926cd073c11cd8693fb110729e40f35fa/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L66

I'm not sure how it could be made more readable / terse with bufferlist iterators. Any kind of hint would be welcome :-)

Cheers

On 20/09/2013 17:36, Sage Weil wrote:
> On Fri, 20 Sep 2013, Loic Dachary wrote:
>> Hi Andreas,
>>
>> Great work on these benchmarks ! It's definitely an incentive to improve as much as possible. Could you push / send the scripts and sequence of operations you've used ? I'll reproduce this locally while getting rid of the extra copy. It would be useful to capture that into a script that can be conveniently run from the teuthology integrations tests to check against performance regressions.
>>
>> Regarding the 3P implementation, in my opinion it would be very valuable for some people who prefer low CPU consumption. And I'm eager to see more than one plugin in the erasure code plugin directory ;-)
> 
> One way to approach this might be to make a bufferlist 'multi-iterator'
> to which you give bufferlist::iterators and which will give you back a
> pointer and length for each contiguous segment.  This would capture the
> annoying iterator details and let the user focus on processing chunks that
> are as large as possible.
> 
> sage
> 
> 
>  > 
>> Cheers
>>
>> On 20/09/2013 13:35, Andreas Joachim Peters wrote:
>>> Hi Loic, 
>>>
>>> I have now some benchmarks on a Xeon 2.27 GHz 4-core with gcc 4.4 (-O2) for ENCODING based on the CEPH Jerasure port.
>>> I measured for objects from 128k to 512 MB with random contents (if you encode 1 GB objects you see slow downs due to caching inefficiencies ...), otherwise results are stable for the given object sizes.
>>>
>>> I quote only the benchmark for ErasureCodeJerasureReedSolomonRAID6 (3,2) (the others are significantly slower, 2-3x) and my 3P(3,2,1) implementation, which provides the same redundancy level as RS-RAID6[3,2] (double disk failure) but uses more space (66% vs 100% overhead).
>>>
>>> The effect of out.c_str() is significant (it contributes a factor-2 slow-down for the best jerasure algorithm for [3,2]).
>>>
>>> Averaged results for Objects Size 4MB:
>>>
>>> 1) Erasure CRS [3,2] - 2.6 ms buffer preparation (out.c_str()) - 2.4 ms encoding => ~780 MB/s
>>> 2) 3P [3,2,1] - 0.005 ms buffer preparation (3P adjusts the padding in the algorithm) - 0.87 ms encoding => ~4.4 GB/s
>>>
>>> I think it pays off to avoid the copy in the encoding if it does not matter for the buffer handling upstream and pad only the last chunk.
>>>
>>> The last thing I tested is how performance scales with the number of cores, running 4 tests in parallel:
>>>
>>> Jerasure (3,2) limits at ~2.0 GB/s for a 4-core CPU (Xeon 2.27 GHz).
>>> 3P(3,2,1) limits at ~8 GB/s for a 4-core CPU (Xeon 2.27 GHz).
>>>
>>> I also implemented the decoding for 3P, but didn't test yet all reconstruction cases. There is probably room for improvements using AVX support for XOR operations in both implementations.
>>>
>>> Before I invest more time, do you think it is useful to have this fast 3P algorithm for double disk failures with 100% space overhead? Because I believe that people will always optimize for space and would rather use something like (10,2) even if the performance degrades and CPU consumption goes up?!? Let me know, no problem in any case!
>>>
>>> Finally I tested some combinations for ErasureCodeJerasureReedSolomonRAID6:
>>>
>>> (3,2) (4,2) (6,2) (8,2) (10,2) they all run around 780-800 MB/s
>>>
>>> Cheers Andreas.
>>>
>>>
>>>
>>>
>>>
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> All that is necessary for the triumph of evil is that good people do nothing.
>>
>>

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-09-20 12:33   ` Loic Dachary
  2013-09-20 13:19     ` Mark Nelson
@ 2013-09-20 15:36     ` Sage Weil
  2013-09-20 16:49       ` Loic Dachary
  1 sibling, 1 reply; 52+ messages in thread
From: Sage Weil @ 2013-09-20 15:36 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Andreas Joachim Peters, ceph-devel

On Fri, 20 Sep 2013, Loic Dachary wrote:
> Hi Andreas,
> 
> Great work on these benchmarks ! It's definitely an incentive to improve as much as possible. Could you push / send the scripts and sequence of operations you've used ? I'll reproduce this locally while getting rid of the extra copy. It would be useful to capture that into a script that can be conveniently run from the teuthology integrations tests to check against performance regressions.
> 
> Regarding the 3P implementation, in my opinion it would be very valuable for some people who prefer low CPU consumption. And I'm eager to see more than one plugin in the erasure code plugin directory ;-)

One way to approach this might be to make a bufferlist 'multi-iterator'
to which you give bufferlist::iterators and which will give you back a
pointer and length for each contiguous segment.  This would capture the
annoying iterator details and let the user focus on processing chunks that
are as large as possible.

sage


 > 
> Cheers
> 
> On 20/09/2013 13:35, Andreas Joachim Peters wrote:
> > Hi Loic, 
> > 
> > I have now some benchmarks on a Xeon 2.27 GHz 4-core with gcc 4.4 (-O2) for ENCODING based on the CEPH Jerasure port.
> > I measured for objects from 128k to 512 MB with random contents (if you encode 1 GB objects you see slow downs due to caching inefficiencies ...), otherwise results are stable for the given object sizes.
> > 
> > I quote only the benchmark for ErasureCodeJerasureReedSolomonRAID6 (3,2) (the others are significantly slower, 2-3x) and my 3P(3,2,1) implementation, which provides the same redundancy level as RS-RAID6[3,2] (double disk failure) but uses more space (66% vs 100% overhead).
> > 
> > The effect of out.c_str() is significant (it contributes a factor-2 slow-down for the best jerasure algorithm for [3,2]).
> > 
> > Averaged results for Objects Size 4MB:
> > 
> > 1) Erasure CRS [3,2] - 2.6 ms buffer preparation (out.c_str()) - 2.4 ms encoding => ~780 MB/s
> > 2) 3P [3,2,1] - 0.005 ms buffer preparation (3P adjusts the padding in the algorithm) - 0.87 ms encoding => ~4.4 GB/s
> > 
> > I think it pays off to avoid the copy in the encoding if it does not matter for the buffer handling upstream and pad only the last chunk.
> > 
> > The last thing I tested is how performance scales with the number of cores, running 4 tests in parallel:
> > 
> > Jerasure (3,2) limits at ~2.0 GB/s for a 4-core CPU (Xeon 2.27 GHz).
> > 3P(3,2,1) limits at ~8 GB/s for a 4-core CPU (Xeon 2.27 GHz).
> > 
> > I also implemented the decoding for 3P, but didn't test yet all reconstruction cases. There is probably room for improvements using AVX support for XOR operations in both implementations.
> > 
> > Before I invest more time, do you think it is useful to have this fast 3P algorithm for double disk failures with 100% space overhead? Because I believe that people will always optimize for space and would rather use something like (10,2) even if the performance degrades and CPU consumption goes up?!? Let me know, no problem in any case!
> > 
> > Finally I tested some combinations for ErasureCodeJerasureReedSolomonRAID6:
> > 
> > (3,2) (4,2) (6,2) (8,2) (10,2) they all run around 780-800 MB/s
> > 
> > Cheers Andreas.
> > 
> > 
> > 
> > 
> > 
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-09-20 12:33   ` Loic Dachary
@ 2013-09-20 13:19     ` Mark Nelson
  2013-09-20 15:36     ` Sage Weil
  1 sibling, 0 replies; 52+ messages in thread
From: Mark Nelson @ 2013-09-20 13:19 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Andreas Joachim Peters, ceph-devel

Very exciting work guys!

I suspect there will definitely be people who want lower CPU 
consumption, especially as ARM and Atom processors become more prolific. :)

Mark

On 09/20/2013 07:33 AM, Loic Dachary wrote:
> Hi Andreas,
>
> Great work on these benchmarks ! It's definitely an incentive to improve as much as possible. Could you push / send the scripts and sequence of operations you've used ? I'll reproduce this locally while getting rid of the extra copy. It would be useful to capture that into a script that can be conveniently run from the teuthology integrations tests to check against performance regressions.
>
> Regarding the 3P implementation, in my opinion it would be very valuable for some people who prefer low CPU consumption. And I'm eager to see more than one plugin in the erasure code plugin directory ;-)
>
> Cheers
>
> On 20/09/2013 13:35, Andreas Joachim Peters wrote:
>> Hi Loic,
>>
>> I have now some benchmarks on a Xeon 2.27 GHz 4-core with gcc 4.4 (-O2) for ENCODING based on the CEPH Jerasure port.
>> I measured for objects from 128k to 512 MB with random contents (if you encode 1 GB objects you see slow downs due to caching inefficiencies ...), otherwise results are stable for the given object sizes.
>>
>> I quote only the benchmark for ErasureCodeJerasureReedSolomonRAID6 (3,2) (the others are significantly slower, 2-3x) and my 3P(3,2,1) implementation, which provides the same redundancy level as RS-RAID6[3,2] (double disk failure) but uses more space (66% vs 100% overhead).
>>
>> The effect of out.c_str() is significant (it contributes a factor-2 slow-down for the best jerasure algorithm for [3,2]).
>>
>> Averaged results for Objects Size 4MB:
>>
>> 1) Erasure CRS [3,2] - 2.6 ms buffer preparation (out.c_str()) - 2.4 ms encoding => ~780 MB/s
>> 2) 3P [3,2,1] - 0.005 ms buffer preparation (3P adjusts the padding in the algorithm) - 0.87 ms encoding => ~4.4 GB/s
>>
>> I think it pays off to avoid the copy in the encoding if it does not matter for the buffer handling upstream and pad only the last chunk.
>>
>> The last thing I tested is how performance scales with the number of cores, running 4 tests in parallel:
>>
>> Jerasure (3,2) limits at ~2.0 GB/s for a 4-core CPU (Xeon 2.27 GHz).
>> 3P(3,2,1) limits at ~8 GB/s for a 4-core CPU (Xeon 2.27 GHz).
>>
>> I also implemented the decoding for 3P, but didn't test yet all reconstruction cases. There is probably room for improvements using AVX support for XOR operations in both implementations.
>>
>> Before I invest more time, do you think it is useful to have this fast 3P algorithm for double disk failures with 100% space overhead? Because I believe that people will always optimize for space and would rather use something like (10,2) even if the performance degrades and CPU consumption goes up?!? Let me know, no problem in any case!
>>
>> Finally I tested some combinations for ErasureCodeJerasureReedSolomonRAID6:
>>
>> (3,2) (4,2) (6,2) (8,2) (10,2) they all run around 780-800 MB/s
>>
>> Cheers Andreas.
>>
>>
>>
>>
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
  2013-09-20 11:35 ` Andreas Joachim Peters
@ 2013-09-20 12:33   ` Loic Dachary
  2013-09-20 13:19     ` Mark Nelson
  2013-09-20 15:36     ` Sage Weil
  0 siblings, 2 replies; 52+ messages in thread
From: Loic Dachary @ 2013-09-20 12:33 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 2956 bytes --]

Hi Andreas,

Great work on these benchmarks ! It's definitely an incentive to improve as much as possible. Could you push / send the scripts and sequence of operations you've used ? I'll reproduce this locally while getting rid of the extra copy. It would be useful to capture that into a script that can be conveniently run from the teuthology integrations tests to check against performance regressions.

Regarding the 3P implementation, in my opinion it would be very valuable for some people who prefer low CPU consumption. And I'm eager to see more than one plugin in the erasure code plugin directory ;-)

Cheers

On 20/09/2013 13:35, Andreas Joachim Peters wrote:
> Hi Loic, 
> 
> I now have some benchmarks on a Xeon 2.27 GHz 4-core with gcc 4.4 (-O2) for ENCODING based on the CEPH Jerasure port.
> I measured objects from 128k to 512 MB with random contents (if you encode 1 GB objects you see slowdowns due to caching inefficiencies ...); otherwise results are stable for the given object sizes.
> 
> I quote only the benchmark for ErasureCodeJerasureReedSolomonRAID6 (3,2), since the others are significantly slower (2-3x), and for my 3P(3,2,1) implementation, which provides the same redundancy level as RS-RAID6[3,2] (double disk failure) but uses more space (100% vs. 66% overhead).
> 
> The effect of out.c_str() is significant: it contributes a factor-2 slowdown for the best jerasure algorithm for [3,2].
> 
> Averaged results for Objects Size 4MB:
> 
> 1) Erasure CRS [3,2] - 2.6 ms buffer preparation (out.c_str()) - 2.4 ms encoding => ~780 MB/s
> 2) 3P [3,2,1] - 0.005 ms buffer preparation (3P adjusts the padding in the algorithm) - 0.87 ms encoding => ~4.4 GB/s
> 
> I think it pays off to avoid the copy in the encoding if it does not matter for the buffer handling upstream and pad only the last chunk.
> 
> Last thing I tested is how performance scales with the number of cores, running 4 tests in parallel:
> 
> Jerasure (3,2) tops out at ~2.0 GB/s on a 4-core CPU (Xeon 2.27 GHz).
> 3P(3,2,1) tops out at ~8 GB/s on a 4-core CPU (Xeon 2.27 GHz).
> 
> I also implemented the decoding for 3P, but haven't yet tested all reconstruction cases. There is probably room for improvement using AVX support for XOR operations in both implementations.
> 
> Before I invest more time, do you think it is useful to have this fast 3P algorithm for double disk failures with 100% space overhead? I believe people will always optimize for space and would rather use something like (10,2), even if performance degrades and CPU consumption goes up. Let me know, no problem in any case!
> 
> Finally I tested some combinations for ErasureCodeJerasureReedSolomonRAID6:
> 
> (3,2) (4,2) (6,2) (8,2) (10,2) they all run around 780-800 MB/s
> 
> Cheers Andreas.
> 
> 
> 
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: CEPH Erasure Encoding + OSD Scalability
       [not found] <-7369304096744919226@unknownmsgid>
@ 2013-09-20 11:35 ` Andreas Joachim Peters
  2013-09-20 12:33   ` Loic Dachary
  0 siblings, 1 reply; 52+ messages in thread
From: Andreas Joachim Peters @ 2013-09-20 11:35 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Hi Loic, 

I now have some benchmarks on a Xeon 2.27 GHz 4-core with gcc 4.4 (-O2) for ENCODING based on the CEPH Jerasure port.
I measured objects from 128k to 512 MB with random contents (if you encode 1 GB objects you see slowdowns due to caching inefficiencies ...); otherwise results are stable for the given object sizes.

I quote only the benchmark for ErasureCodeJerasureReedSolomonRAID6 (3,2), since the others are significantly slower (2-3x), and for my 3P(3,2,1) implementation, which provides the same redundancy level as RS-RAID6[3,2] (double disk failure) but uses more space (100% vs. 66% overhead).
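For concreteness, the overhead figures follow from simple arithmetic: a code with k data chunks and m parity chunks stores m/k extra bytes per byte of data. This can be checked with a tiny (hypothetical) helper:

```python
def rs_overhead(k: int, m: int) -> float:
    """Extra space stored per byte of payload for a (k, m) erasure code."""
    return m / k

# RS-RAID6(3,2): 2 parity chunks over 3 data chunks -> ~66% overhead.
# 3P(3,2,1) as described above stores one extra byte per byte -> 100%.
# A (10,2) layout drops the overhead to 20%, which is why people
# tend to optimize for space at the cost of CPU.
```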

The effect of out.c_str() is significant: it contributes a factor-2 slowdown for the best jerasure algorithm for [3,2].

Averaged results for Objects Size 4MB:

1) Erasure CRS [3,2] - 2.6 ms buffer preparation (out.c_str()) - 2.4 ms encoding => ~780 MB/s
2) 3P [3,2,1] - 0.005 ms buffer preparation (3P adjusts the padding in the algorithm) - 0.87 ms encoding => ~4.4 GB/s

I think it pays off to avoid the copy in the encoding if it does not matter for the buffer handling upstream and pad only the last chunk.
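A minimal sketch of the "pad only the last chunk" idea, using Python memoryviews to stand in for zero-copy buffer slices (the `chunk_no_copy` name and the approach are illustrative assumptions; Ceph's bufferlist handling differs):

```python
def chunk_no_copy(buf: bytes, k: int):
    """Split buf into k equal-sized chunks, padding only the last one.

    memoryview slices reference the original buffer, so the first k-1
    chunks involve no copy; only the padded tail is materialized.
    """
    chunk = -(-len(buf) // k)  # ceiling division: chunk size
    view = memoryview(buf)
    full = [view[i * chunk:(i + 1) * chunk] for i in range(k - 1)]
    tail = bytes(view[(k - 1) * chunk:]).ljust(chunk, b"\0")  # pad with zeros
    return full, tail
```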

Last thing I tested is how performance scales with the number of cores, running 4 tests in parallel:

Jerasure (3,2) tops out at ~2.0 GB/s on a 4-core CPU (Xeon 2.27 GHz).
3P(3,2,1) tops out at ~8 GB/s on a 4-core CPU (Xeon 2.27 GHz).

I also implemented the decoding for 3P, but haven't yet tested all reconstruction cases. There is probably room for improvement using AVX support for XOR operations in both implementations.

Before I invest more time, do you think it is useful to have this fast 3P algorithm for double disk failures with 100% space overhead? I believe people will always optimize for space and would rather use something like (10,2), even if performance degrades and CPU consumption goes up. Let me know, no problem in any case!

Finally I tested some combinations for ErasureCodeJerasureReedSolomonRAID6:

(3,2) (4,2) (6,2) (8,2) (10,2) they all run around 780-800 MB/s

Cheers Andreas.






^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CEPH Erasure Encoding + OSD Scalability
       [not found] <CAGhffvws=OabwJHi+7n=SOg+YNxAnU=Zt8WLVZtvf1neHZQYhw@mail.gmail.com>
@ 2013-07-04 13:07 ` Loic Dachary
  0 siblings, 0 replies; 52+ messages in thread
From: Loic Dachary @ 2013-07-04 13:07 UTC (permalink / raw)
  To: Andreas-Joachim Peters; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 6200 bytes --]

Hi Andreas,

On 03/07/2013 18:55, Andreas-Joachim Peters wrote:
> Dear Loic et al.,
> 
> I have/had some questions about the idea's of Erasure Encoding plans and OSD scalability. 
> Please forgive me for not yet having studied much of the source code or the details of the current CEPH implementation.
> 
> Some of my questions I found already answered here,
> 
> ( https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst )
> 
> but they also created some more ;-)
> 
> *ERASURE ENCODING*
> 
> 1.) I understand that you will cover only OSD outages with the implementation and will delegate block-corruption detection to the file system (as BTRFS would do). Is that correct?

Ceph also does scrubbing to detect block ( I assume you mean chunk ) corruption. The idea is to adapt the logic, which currently assumes replicas, so that it detects corruption ( for instance more than K missing chunks if M+K is used ).
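A hypothetical sketch of such chunk-level scrubbing, using zlib.crc32 as a stand-in for the CRC32C mentioned earlier in the thread (the `scrub` helper is illustrative, not Ceph's actual scrub logic):

```python
import zlib

def scrub(chunks, stored_crcs, m):
    """Flag chunks whose checksum mismatches the stored one.

    With K data + M parity chunks, the object stays recoverable as long
    as no more than m chunks are bad. Returns (bad_indices, recoverable).
    """
    bad = [i for i, (chunk, crc) in enumerate(zip(chunks, stored_crcs))
           if zlib.crc32(chunk) != crc]
    return bad, len(bad) <= m
```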

> 2.) Blocks would always be assembled on the OSDs (?)

Yes. 

> 3.) I understood that the (3,2) RS sketched in the blog is the easiest to implement since it can be done with simple parity (XOR) operations, but do you intend to have a generic (M,K) implementation?

Yes. The idea is to use the jerasure library, which provides Reed-Solomon coding and can be configured in various ways.
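The simplest special case, a single XOR parity chunk, can be sketched in a few lines (illustrative only; the generic (M,K) case needs Reed-Solomon arithmetic over a Galois field, which is what jerasure provides):

```python
def xor_parity(chunks):
    """Byte-wise XOR of equal-sized chunks, yielding one parity chunk."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)

def recover(chunks, lost, parity):
    """Rebuild the chunk at index `lost` from the survivors plus parity.

    XORing the survivors with the parity cancels everything except the
    lost chunk -- the single-erasure case; RS generalizes to M erasures.
    """
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    return xor_parity(survivors + [parity])
```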
 
> 4.) Would you split a 4M object into M x (4/M) objects? Would this not (even more) degrade single-disk performance to random-IO performance when many clients retrieve objects at random disk positions? Is 4M just a default or a hard-coded parameter of CEPHFS/S3?

It is just a default. I hope the updated document ( look for "Partials" ) https://github.com/dachary/ceph/blob/5efcac8fa6e08119f0deaaf1ae9919080e90cf0a/doc/dev/osd_internals/erasure-code.rst answers the rest of the question.

> 5.) Local parity like in Xorbas makes sense for large M, but would a large M not hit scalability limits imposed by a single OSD in terms of object bookkeeping/scrubbing/synchronization, network packet limitations (at least in 1 GBit networks), etc.? 1 TB = 250k objects => M=10 => 2.5 Mio object fragments (a 100 TB disk server would have 250 Mio object fragments ?!?!)

We are looking at M+K < 2^8 at the moment, which significantly reduces the problem you mention as well as the CPU consumption issues.
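The fragment counts in the question check out under the 4 MB default object size mentioned elsewhere in the thread (back-of-the-envelope arithmetic, assuming binary units):

```python
# 1 TiB of data split into 4 MiB objects, each erasure-coded into
# M = 10 fragments. These are the numbers behind question 5 above.
MiB = 1024 * 1024
objects_per_tb = (1024 * 1024 * MiB) // (4 * MiB)  # 1 TiB / 4 MiB = 262144 (~250k)
fragments_per_tb = objects_per_tb * 10             # M = 10 -> ~2.6 Mio fragments
fragments_100tb = fragments_per_tb * 100           # ~260 Mio on a 100 TB server
```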

> 6.) Does a CEPH object know something like a parent object so it could understand if it is still a 'connected' object (like part of a block collection implementing a block, a file or container?)

At the level where erasure coding is implemented ( librados ) there is no notion of relationships between objects.

> *OSD SCALABILITY*

Please take my answers there with a grain of salt because there are many people with much more knowledge than I have :-)

> 1.) Are there some deployment numbers about the largest number of OSDs per placement group and the number of objects you can handle well in a placement group?

The acceptable range seems to be ( number of OSDs ) * 100 up to ( number of OSDs ) * 1000
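Expressed as a tiny helper (hypothetical name, just restating the rule of thumb above):

```python
def pg_range(num_osds: int):
    """Acceptable total placement-group count for a given number of OSDs,
    per the 100x-1000x rule of thumb quoted in this thread."""
    return num_osds * 100, num_osds * 1000
```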

> 2.) What is the largest number of OSDs people have ever tried out? Many presentations say 10-10k nodes, but presumably the OSD count would be even higher?

The largest deployment I'm aware of is Dream{Object,Compute} but I don't know the actual numbers.

> 3.) In our CC we operate disk servers with up to 100 TB (25 disks), next year 200 TB (50 disks), and in the future even bigger. 
> If I remember right the recommendation is to have 2GB of memory per OSD. 
> Can the memory footprint be lowered or is it a 'feature' of the OSD architecture?
> Is there in-memory information limiting scalability?

The OSD memory usage varies from a few hundred megabytes during normal operations to about 2GB when recovering, which can be a problem if you have a large number of OSDs running on the same hardware. You can control this by grouping the disks together. For instance, if your machine has 50 disks you could group them into 10 RAID0 arrays of 5 physical disks each and run 10 OSDs instead of 50. Of course it means that you'll lose 5 disks at once if one fails, but when grouping 50 disks on a single machine you already made a decision that leans in this direction.
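The memory argument behind grouping is simple arithmetic (illustrative `osd_recovery_memory_gb` helper; the ~2 GB per-OSD recovery figure above is the assumption):

```python
def osd_recovery_memory_gb(num_osds: int, per_osd_gb: float = 2.0) -> float:
    """Worst-case memory if every OSD on one machine recovers at once."""
    return num_osds * per_osd_gb

# 50 OSDs -> 100 GB worst case; 10 RAID0-backed OSDs -> 20 GB worst case.
```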

> 4.) Today we run disk-only storage with 20k disks and 24 to 33 disks per node. There is weekly repair & replacement activity and reboots.

I assume that's across 1,000 machines, right? How many disks / machines do you need to replace on a weekly basis?

> A typical scenario is that after a reboot filesystem contents were not synced and information is lost. Does a CEPH OSD sync every block, or if not, use a quorum on block contents when reading, or would it just return the block as-is, with only scrubbing marking a block as corrupted?

I don't think Ceph can ever return a corrupted object as if it were intact. That would require either a manual intervention by the operator to tamper with the file without notifying Ceph ( which would be the equivalent of shooting himself in the foot ;-) or a bug in XFS ( or whatever underlying file system the objects are stored on ) that similarly corrupts the file. And all this would have to happen before deep scrubbing discovers the problem.

> 5.) When rebalancing is needed is there some time slice or scheduling mechanism which regulates the block relocation with respect to the 'normal' IO activity on the source and target OSD? Is there an overload protection in particular on the block target OSD?

There is a reservation mechanism to avoid creating too many communication paths during recovery ( see http://ceph.com/docs/master/dev/osd_internals/backfill_reservation/ for instance ) and throttling to regulate the bandwidth usage ( not 100% sure how that works though ). In addition it is recommended when operating a large cluster to dedicate an interface to internal communications ( check http://ceph.com/docs/master/rados/configuration/network-config-ref/ for more information ).

Cheers

> 
> Thanks.
> 
> Andreas.
> 
> 
> 
> 
> 
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 261 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2013-12-13 16:42 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <3472A07E6605974CBC9BC573F1BC02E494B06990@PLOXCHG04.cern.ch>
2013-07-05 21:23 ` CEPH Erasure Encoding + OSD Scalability Loic Dachary
2013-07-06 13:45   ` Andreas Joachim Peters
2013-07-06 15:28     ` Mark Nelson
2013-07-06 20:43       ` Loic Dachary
2013-07-08 15:38         ` Mark Nelson
     [not found]   ` <CAGhffvx5-xmprT-vL1VNrz12+pJSikg1WsUqy_JRdW0JNm5auQ@mail.gmail.com>
2013-07-06 20:47     ` Loic Dachary
2013-07-07 21:04       ` Andreas Joachim Peters
2013-07-08  3:37         ` Sage Weil
2013-07-08 10:00           ` Andreas Joachim Peters
2013-07-08 10:31             ` Loic Dachary
2013-07-08 15:47             ` Sage Weil
2013-08-19 10:35         ` Loic Dachary
2013-08-22 21:50           ` Andreas Joachim Peters
     [not found]           ` <CAGhffvwB87a+1294BjmPrfu0a9hYdu17N-eHOvYCHWMXDLcJmA@mail.gmail.com>
2013-08-22 23:03             ` Loic Dachary
     [not found]               ` <CAGhffvxW9sG5LtcF-tU1YGkCMAQUfh2WW_3N=f=-vWs48vyxkQ@mail.gmail.com>
2013-08-24 19:41                 ` Loic Dachary
2013-08-25 11:49                   ` Loic Dachary
2013-09-14 14:59                     ` Andreas Joachim Peters
2013-09-14 18:04                       ` Loic Dachary
2013-09-22 23:00 Andreas Joachim Peters
2013-09-23  7:27 ` Loic Dachary
2013-09-23  9:37   ` Andreas Joachim Peters
2013-09-23 15:43   ` Andreas Joachim Peters
2013-09-25 15:14     ` Loic Dachary
2013-09-25 18:33       ` Andreas Joachim Peters
2013-09-25 18:48         ` Loic Dachary
2013-09-25 18:53           ` Sage Weil
     [not found]           ` <CAGhffvz1TYYLoqn0tps1HiLObSCv7H0ZNVgOd0raicGqgRuukA@mail.gmail.com>
2013-09-26 19:18             ` Loic Dachary
2013-09-26 21:49               ` Andreas Joachim Peters
2013-09-27  9:40                 ` Loic Dachary
2013-10-01 23:00                   ` Andreas Joachim Peters
2013-10-02 10:04                     ` Loic Dachary
2013-10-02 10:15                     ` Loic Dachary
     [not found] <-7369304096744919226@unknownmsgid>
2013-09-20 11:35 ` Andreas Joachim Peters
2013-09-20 12:33   ` Loic Dachary
2013-09-20 13:19     ` Mark Nelson
2013-09-20 15:36     ` Sage Weil
2013-09-20 16:49       ` Loic Dachary
2013-09-21 15:11         ` Loic Dachary
2013-09-22  7:26           ` Andreas Joachim Peters
2013-09-22  9:41             ` Loic Dachary
2013-11-12  1:11             ` Andreas Joachim Peters
2013-11-12 18:06               ` Loic Dachary
2013-11-19 11:35                 ` Andreas Joachim Peters
2013-12-09 16:45                 ` Loic Dachary
2013-12-09 17:03                   ` Mark Nelson
2013-12-10  8:43                   ` Loic Dachary
2013-12-11  9:49                     ` Andreas Joachim Peters
2013-12-11 12:28                       ` Loic Dachary
2013-12-11 13:00                         ` Mark Nelson
2013-12-13 15:47                           ` Andreas Joachim Peters
2013-12-13 16:42                             ` Loic Dachary
     [not found] <CAGhffvws=OabwJHi+7n=SOg+YNxAnU=Zt8WLVZtvf1neHZQYhw@mail.gmail.com>
2013-07-04 13:07 ` Loic Dachary
