* CRC32 of messages
@ 2015-06-26 16:49 Erik G. Burrows
  2015-06-26 17:51 ` Somnath Roy
  2015-06-29  6:31 ` Dałek, Piotr
  0 siblings, 2 replies; 11+ messages in thread
From: Erik G. Burrows @ 2015-06-26 16:49 UTC (permalink / raw)
  To: ceph-devel

All,
Can someone explain to me the rationale for performing in-software CRC32
hashes of all messages through the Pipe and AsyncMessage classes?

On my servers, operf shows that 20% of the total CPU time in my benchmark
tests is being spent in the librados ceph_crc32c_sctp function. I can see
that the library is trying to use CPU accelerations if available, but what
I'd like to understand is: why checksum the messages at all?

If the messages are local, there should not be any corruption at all, and
if they are coming in over IP, then the kernel and NIC should do Layer-2/3
CRCs and reject any corrupted packets. So why re-CRC the messages at the
Ceph layer?

Thanks,
  Erik Burrows




* RE: CRC32 of messages
  2015-06-26 16:49 CRC32 of messages Erik G. Burrows
@ 2015-06-26 17:51 ` Somnath Roy
  2015-06-29  6:27   ` Dałek, Piotr
  2015-06-29  6:31 ` Dałek, Piotr
  1 sibling, 1 reply; 11+ messages in thread
From: Somnath Roy @ 2015-06-26 17:51 UTC (permalink / raw)
  To: Erik G. Burrows, ceph-devel

ceph_crc32c_intel_fast is ~6 times faster than ceph_crc32c_sctp. If you are not using Intel CPUs, or you have older Intel CPUs where the SSE4 instruction set is not enabled, performance will be badly impacted, as you saw. If you are building Ceph yourself, make sure you have 'yasm' installed so that Ceph can detect the CPU architecture properly.

BTW, I hope you are aware that this CRC calculation can be turned off with 'ms_nocrc = true' in Giant, and with 'ms_crc_data = false' / 'ms_crc_header = false' in releases after Giant.

Thanks & Regards
Somnath

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

________________________________

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).



* RE: CRC32 of messages
  2015-06-26 17:51 ` Somnath Roy
@ 2015-06-29  6:27   ` Dałek, Piotr
  2015-06-29  7:00     ` Somnath Roy
  0 siblings, 1 reply; 11+ messages in thread
From: Dałek, Piotr @ 2015-06-29  6:27 UTC (permalink / raw)
  To: ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Friday, June 26, 2015 7:52 PM
> 
> ceph_crc32c_intel_fast is ~6 times faster than ceph_crc32c_sctp. If you are
> not using intel cpus or you have older intel cpus where this sse4 instruction

Not exactly true: AMD CPUs released after October 2011 support SSE 4.2 (which includes the CRC32 instruction) as well.
See this: http://www.cpu-world.com/Glossary/S/SSE4.2.html
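
To illustrate the point, here is a minimal sketch of choosing between a hardware and a software CRC-32C at run time based on the CPU's feature flag rather than its vendor. This is not Ceph's code: the names crc32c_sw, crc32c_hw and crc32c are hypothetical, the seed/inversion convention is the common CRC-32C one and may differ from Ceph's, and the sketch assumes GCC or Clang (for __builtin_cpu_supports and the target attribute).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <nmmintrin.h>   /* _mm_crc32_u8 (SSE 4.2) */
#define HAVE_X86 1
#endif

/* Portable bit-at-a-time CRC-32C (Castagnoli, reflected polynomial). */
static uint32_t crc32c_sw(uint32_t crc, const uint8_t *p, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#ifdef HAVE_X86
/* Same CRC computed with the SSE 4.2 CRC32 instruction, one byte per step. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(uint32_t crc, const uint8_t *p, size_t len)
{
    crc = ~crc;
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
    return ~crc;
}
#endif

/* Use the hardware path whenever the running CPU reports SSE 4.2,
 * regardless of whether it is an Intel or an AMD part. */
static uint32_t crc32c(uint32_t crc, const uint8_t *p, size_t len)
{
    assert(p != NULL || len == 0);
#ifdef HAVE_X86
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw(crc, p, len);
#endif
    return crc32c_sw(crc, p, len);
}
```

Both paths should return the standard CRC-32C check value 0xE3069283 for the nine ASCII bytes "123456789". A production version would feed the instruction 8 bytes at a time; the byte-wise loop is kept here only for brevity.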

With best regards / Pozdrawiam
Piotr Dałek

* RE: CRC32 of messages
  2015-06-26 16:49 CRC32 of messages Erik G. Burrows
  2015-06-26 17:51 ` Somnath Roy
@ 2015-06-29  6:31 ` Dałek, Piotr
  2015-06-29  6:55   ` Dan van der Ster
  1 sibling, 1 reply; 11+ messages in thread
From: Dałek, Piotr @ 2015-06-29  6:31 UTC (permalink / raw)
  To: ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Erik G. Burrows
> Sent: Friday, June 26, 2015 6:49 PM
> 
> All,
> Can someone explain to me the rationale for performing in-software CRC32
> hashes of all messages through the Pipe and AsyncMessage classes?
> 
> On my servers, operf shows that 20% of the total CPU time in my benchmark
> tests are being spent in the librados ceph_crc32c_sctp function. I can see that
> the library is trying to use CPU accelerations if available, but what I'd like to
> understand is: why checksum the messages at all?

As Somnath already wrote, you can disable CRC checking for messages. But CRCs are also used for journals, among other things, so you'll always see some CPU time spent on CRC32 calculations.

> If the messages are local, there should not be any corruption at all, and if
> they are coming in over IP, then the kernel and NIC should do Layer-2/3 CRCs
> and reject any corrupted packets. So why re-CRC the messages at the Ceph
> layer?

I can imagine data corruption coming from Ceph itself and not being caught by the IP layers, for example due to a bug in Ceph code or a mainboard/RAM failure. It's also a nice debugging feature when dealing with low-level code.

With best regards / Pozdrawiam
Piotr Dałek

* Re: CRC32 of messages
  2015-06-29  6:31 ` Dałek, Piotr
@ 2015-06-29  6:55   ` Dan van der Ster
  2015-06-29 10:51     ` Gregory Farnum
  0 siblings, 1 reply; 11+ messages in thread
From: Dan van der Ster @ 2015-06-29  6:55 UTC (permalink / raw)
  To: Dałek, Piotr; +Cc: ceph-devel

On Mon, Jun 29, 2015 at 8:31 AM, Dałek, Piotr
<Piotr.Dalek@ts.fujitsu.com> wrote:
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Erik G. Burrows
>> Sent: Friday, June 26, 2015 6:49 PM
>>
>> All,
>> Can someone explain to me the rationale for performing in-software CRC32
>> hashes of all messages through the Pipe and AsyncMessage classes?
>>
>> On my servers, operf shows that 20% of the total CPU time in my benchmark
>> tests are being spent in the librados ceph_crc32c_sctp function. I can see that
>> the library is trying to use CPU accelerations if available, but what I'd like to
>> understand is: why checksum the messages at all?
>
> As Somnath already wrote, you can disable CRC checking for messages. But they're also used for journals, among other things, so you'll always see some CPU usage spent on CRC32 calculations.
>
>> If the messages are local, there should not be any corruption at all, and if
>> they are coming in over IP, then the kernel and NIC should do Layer-2/3 CRCs
>> and reject any corrupted packets. So why re-CRC the messages at the Ceph
>> layer?
>
> I can imagine data corruption coming from Ceph itself and not caught by IP layers, for example due to bug in Ceph code or mainboard/RAM failure. And it's a nice debug feature you can use when dealing with low-level code.
>

That's not to mention that the TCP checksum is remarkably weak. We've
just had an incident where a broken router was quite efficiently
corrupting something like 1/66 packets in a way which was invisible to
the TCP checksum. Some example corruptions are in our report -- note
that it's still a work in progress:
https://cds.cern.ch/record/2026187/files/Adler32_Data_Corruption.pdf

Thankfully CRC32-C /probably/ prevented this broken router from
corrupting our Ceph volumes.
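
As a rough illustration of how weak a 16-bit ones'-complement sum is, the sketch below shows one class of corruption (two 16-bit words trading places) that leaves an RFC 1071 style checksum unchanged but changes the CRC-32C. It is a toy example, not a reconstruction of what the broken router did, and all function names are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum: ones'-complement sum of big-endian 16-bit words. */
static uint16_t inet_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    assert(p != NULL || len == 0);
    for (; len >= 2; p += 2, len -= 2)
        sum += ((uint32_t)p[0] << 8) | p[1];
    if (len)
        sum += (uint32_t)p[0] << 8;          /* pad an odd trailing byte with zero */
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16); /* fold carries back into the low 16 bits */
    return (uint16_t)~sum;
}

/* Bit-at-a-time CRC-32C (Castagnoli, reflected polynomial). */
static uint32_t crc32c(uint32_t crc, const uint8_t *p, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Returns 1 when swapping two 16-bit words fools the checksum but not the CRC. */
static int swap_fools_checksum_only(void)
{
    /* Same four 16-bit words; the first and third have traded places. */
    static const uint8_t sent[8]     = { 'c','e', 'p','h', 'd','a', 't','a' };
    static const uint8_t received[8] = { 'd','a', 'p','h', 'c','e', 't','a' };

    int checksum_same = inet_checksum(sent, 8) == inet_checksum(received, 8);
    int crc_differs   = crc32c(0, sent, 8) != crc32c(0, received, 8);
    return checksum_same && crc_differs;
}
```

Because addition is commutative, any reordering of 16-bit words is invisible to the sum, as are pairs of changes that cancel each other out. A CRC depends on the position of every bit, so it catches this case.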

Cheers, Dan

* RE: CRC32 of messages
  2015-06-29  6:27   ` Dałek, Piotr
@ 2015-06-29  7:00     ` Somnath Roy
  2015-06-29  7:31       ` Dałek, Piotr
  0 siblings, 1 reply; 11+ messages in thread
From: Somnath Roy @ 2015-06-29  7:00 UTC (permalink / raw)
  To: Dałek, Piotr, ceph-devel

Thanks for the info, Piotr, but I am not sure whether the asm instructions Ceph uses to probe the CPU are compatible with AMD.

Regards
Somnath


* RE: CRC32 of messages
  2015-06-29  7:00     ` Somnath Roy
@ 2015-06-29  7:31       ` Dałek, Piotr
  0 siblings, 0 replies; 11+ messages in thread
From: Dałek, Piotr @ 2015-06-29  7:31 UTC (permalink / raw)
  To: ceph-devel

> -----Original Message-----
> From: Somnath Roy [mailto:Somnath.Roy@sandisk.com]
> Sent: Monday, June 29, 2015 9:01 AM
> To: Dałek, Piotr; ceph-devel@vger.kernel.org
> Subject: RE: CRC32 of messages
> 
> Thanks Piotr for the info, but I am not sure the asm instructions ceph is using
> for probing cpu is compatible to AMD or not.

Looks like it's vendor-neutral:

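        /* Execute CPUID for leaf `id` and copy out all four result registers. */
        asm("movl %4, %%eax;"    /* EAX = requested leaf */
            "cpuid;"
            "movl %%eax, %0;"
            "movl %%ebx, %1;"
            "movl %%ecx, %2;"
            "movl %%edx, %3;"
                : "=r" (*eax), "=r" (*ebx), "=r" (*ecx), "=r" (*edx)
                : "r" (id)
                : "eax", "ebx", "ecx", "edx");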


This code reads the actual CPU feature flags, regardless of vendor. Later, the returned flags are checked for a select few features, including SSE 4.2.
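
For reference, a minimal sketch of that kind of feature check, written with the <cpuid.h> helper shipped by GCC and Clang instead of hand-written asm. It is not Ceph's probing code; ecx_has_sse42 and cpu_has_sse42 are hypothetical names. The only hardware fact it relies on is that CPUID leaf 1 reports SSE 4.2 in bit 20 of ECX.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper: __get_cpuid() */
#endif

/* CPUID leaf 1 reports SSE 4.2 in bit 20 of ECX on both Intel and AMD parts. */
#define CPUID_ECX_SSE42 (1u << 20)

/* Pure helper: decode the SSE 4.2 flag from a leaf-1 ECX value. */
static int ecx_has_sse42(uint32_t ecx)
{
    return (ecx & CPUID_ECX_SSE42) != 0;
}

/* Returns 1 if the running CPU advertises SSE 4.2, 0 otherwise
 * (including on non-x86 builds, where CPUID does not exist). */
static int cpu_has_sse42(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return ecx_has_sse42(ecx);
#endif
    return 0;
}
```

Nothing in the check looks at the vendor string, so an AMD CPU that sets the bit is treated exactly like an Intel one. For example, assert(ecx_has_sse42(1u << 20)) holds for any ECX value with bit 20 set.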

Old, but interesting, related read: http://arstechnica.com/gadgets/2008/07/atom-nano-review/6/


With best regards / Pozdrawiam
Piotr Dałek

* Re: CRC32 of messages
  2015-06-29  6:55   ` Dan van der Ster
@ 2015-06-29 10:51     ` Gregory Farnum
  2015-06-29 11:30       ` Daniel Swarbrick
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2015-06-29 10:51 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Dałek, Piotr, ceph-devel

On Mon, Jun 29, 2015 at 7:55 AM, Dan van der Ster <dan@vanderster.com> wrote:
> On Mon, Jun 29, 2015 at 8:31 AM, Dałek, Piotr
> <Piotr.Dalek@ts.fujitsu.com> wrote:
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Erik G. Burrows
>>> Sent: Friday, June 26, 2015 6:49 PM
>>>
>>> All,
>>> Can someone explain to me the rationale for performing in-software CRC32
>>> hashes of all messages through the Pipe and AsyncMessage classes?
>>>
>>> On my servers, operf shows that 20% of the total CPU time in my benchmark
>>> tests are being spent in the librados ceph_crc32c_sctp function. I can see that
>>> the library is trying to use CPU accelerations if available, but what I'd like to
>>> understand is: why checksum the messages at all?
>>
>> As Somnath already wrote, you can disable CRC checking for messages. But they're also used for journals, among other things, so you'll always see some CPU usage spent on CRC32 calculations.
>>
>>> If the messages are local, there should not be any corruption at all, and if
>>> they are coming in over IP, then the kernel and NIC should do Layer-2/3 CRCs
>>> and reject any corrupted packets. So why re-CRC the messages at the Ceph
>>> layer?
>>
>> I can imagine data corruption coming from Ceph itself and not caught by IP layers, for example due to bug in Ceph code or mainboard/RAM failure. And it's a nice debug feature you can use when dealing with low-level code.
>>
>
> That's not to mention that the TCP checksum is remarkably weak. We've
> just had an incident where a broken router was quite efficiently
> corrupting something like 1/66 packets in a way which was invisible to
> the TCP checksum. Some example corruptions are here our report -- note
> that it's still a work in progress:
> https://cds.cern.ch/record/2026187/files/Adler32_Data_Corruption.pdf
>
> Thankfully CRC32-C /probably/ prevented this broken router from
> corrupting our Ceph volumes.

Yes, we have our own CRC32 checksum because loooong ago (before I
started!) Sage saw a lot of network corruption that wasn't being
caught by the TCP checksums so he added some to the Ceph message
stream. I can't tell you with any authority whatsoever how common that
problem is, but I don't think we're turning them off by default in
upstream. :)
-Greg

* Re: CRC32 of messages
  2015-06-29 10:51     ` Gregory Farnum
@ 2015-06-29 11:30       ` Daniel Swarbrick
  2015-06-29 11:37         ` Gregory Farnum
  2015-06-29 12:20         ` Dałek, Piotr
  0 siblings, 2 replies; 11+ messages in thread
From: Daniel Swarbrick @ 2015-06-29 11:30 UTC (permalink / raw)
  To: ceph-devel

On 29/06/15 12:51, Gregory Farnum wrote:
> 
> Yes, we have our own CRC32 checksum because loooong ago (before I
> started!) Sage saw a lot of network corruption that wasn't being
> caught by the TCP checksums so he added some to the Ceph message
> stream. I can't tell you with any authority whatsoever how common that
> problem is, but I don't think we're turning them off by default in
> upstream. :)

If the CRC32 implementation in Ceph is that dated (particularly the
software implementations that will be used on AMD hardware), would it be
worth checking out some of the updated implementations, such as the
slice-by-16 or chunked methods?

I found this link http://create.stephan-brumme.com/crc32/ and tried
running the benchmark on an AMD Opteron 6386 SE system, with the
following results:

bitwise          : CRC=221F390F, 47.525s, 21.546 MB/s
half-byte        : CRC=221F390F, 11.828s, 86.576 MB/s
  1 byte  at once: CRC=221F390F, 6.347s, 161.332 MB/s
  4 bytes at once: CRC=221F390F, 2.875s, 356.178 MB/s
  8 bytes at once: CRC=221F390F, 2.004s, 510.932 MB/s
4x8 bytes at once: CRC=221F390F, 1.929s, 530.811 MB/s
 16 bytes at once: CRC=221F390F, 1.892s, 541.179 MB/s
 16 bytes at once: CRC=221F390F, 1.926s, 531.797 MB/s (including prefetching)
    chunked      : CRC=221F390F, 1.919s, 533.656 MB/s

AFAIK, Ceph uses the slice-by-8 method if no hardware crc32 is found.



* Re: CRC32 of messages
  2015-06-29 11:30       ` Daniel Swarbrick
@ 2015-06-29 11:37         ` Gregory Farnum
  2015-06-29 12:20         ` Dałek, Piotr
  1 sibling, 0 replies; 11+ messages in thread
From: Gregory Farnum @ 2015-06-29 11:37 UTC (permalink / raw)
  To: Daniel Swarbrick; +Cc: ceph-devel

On Mon, Jun 29, 2015 at 12:30 PM, Daniel Swarbrick
<daniel.swarbrick@profitbricks.com> wrote:
> On 29/06/15 12:51, Gregory Farnum wrote:
>>
>> Yes, we have our own CRC32 checksum because loooong ago (before I
>> started!) Sage saw a lot of network corruption that wasn't being
>> caught by the TCP checksums so he added some to the Ceph message
>> stream. I can't tell you with any authority whatsoever how common that
>> problem is, but I don't think we're turning them off by default in
>> upstream. :)
>
> If the CRC32 implementation in Ceph is that dated (particularly the
> software implementations that will be used on AMD hardware), would it be
> worth checking out some of the updated implementations, such as the
> slice-by-16 or chunked methods?

Possibly? I really don't know anything about this. :) There have been
several changes to add new implementations (aarch64 and intel) since
then; pull requests welcome!
-Greg


* RE: CRC32 of messages
  2015-06-29 11:30       ` Daniel Swarbrick
  2015-06-29 11:37         ` Gregory Farnum
@ 2015-06-29 12:20         ` Dałek, Piotr
  1 sibling, 0 replies; 11+ messages in thread
From: Dałek, Piotr @ 2015-06-29 12:20 UTC (permalink / raw)
  To: ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Daniel Swarbrick
> Sent: Monday, June 29, 2015 1:31 PM

> > Yes, we have our own CRC32 checksum because loooong ago (before I
> > started!) Sage saw a lot of network corruption that wasn't being
> > caught by the TCP checksums so he added some to the Ceph message
> > stream. I can't tell you with any authority whatsoever how common that
> > problem is, but I don't think we're turning them off by default in
> > upstream. :)
> 
> If the CRC32 implementation in Ceph is that dated (particularly the software
> implementations that will be used on AMD hardware), would it be worth
> checking out some of the updated implementations, such as the
> slice-by-16 or chunked methods?
> I found this link http://create.stephan-brumme.com/crc32/ and tried running
> the benchmark on an AMD Opteron 6386 SE system, with the following
> results:

First of all, this processor actually supports SSE 4.2. See here:
http://www.cpu-world.com/CPUs/Bulldozer/AMD-Opteron%206386%20SE%20-%20OS6386YETGGHK.html
In other words, it *does* support hardware CRC32 calculation.

> bitwise          : CRC=221F390F, 47.525s, 21.546 MB/s
> half-byte        : CRC=221F390F, 11.828s, 86.576 MB/s
>   1 byte  at once: CRC=221F390F, 6.347s, 161.332 MB/s
>   4 bytes at once: CRC=221F390F, 2.875s, 356.178 MB/s
>   8 bytes at once: CRC=221F390F, 2.004s, 510.932 MB/s
> 4x8 bytes at once: CRC=221F390F, 1.929s, 530.811 MB/s
>  16 bytes at once: CRC=221F390F, 1.892s, 541.179 MB/s
>  16 bytes at once: CRC=221F390F, 1.926s, 531.797 MB/s (including prefetching)
>     chunked      : CRC=221F390F, 1.919s, 533.656 MB/s
> 
> AFAIK, Ceph uses the slice-by-8 method if no hardware crc32 is found.

Slicing-by-16 uses larger lookup tables (16 KB, versus the 8 KB used by slicing-by-8), so switching to it would cause more cache thrashing than slicing-by-8 and, in turn, decrease overall Ceph performance. Not to mention that in your case slicing-by-16 was only ~30 MB/s faster (541 vs. 511 MB/s), which is just 6%. IMHO, the increased memory usage is definitely not worth it.
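
To make the table-size argument concrete, here is a minimal sketch of slicing-by-8 for CRC-32C. It is an illustration under my own naming (crc32c_table, crc32c_slice8 and so on are hypothetical), not the implementation Ceph ships, and the lazy table setup is not thread-safe. Eight tables of 256 32-bit entries come to 8 x 256 x 4 = 8192 bytes; slicing-by-16 needs sixteen such tables, i.e. 16384 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLICES 8

/* SLICES tables x 256 entries x 4 bytes = 8 KiB; 16 slices would double this. */
static uint32_t crc32c_table[SLICES][256];
static int crc32c_table_ready;

static void crc32c_build_tables(void)
{
    /* Table 0 is the classic byte-at-a-time table. */
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t crc = i;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        crc32c_table[0][i] = crc;
    }
    /* Table t advances the entry of table t-1 by one more zero byte. */
    for (uint32_t i = 0; i < 256; i++)
        for (int t = 1; t < SLICES; t++) {
            uint32_t prev = crc32c_table[t - 1][i];
            crc32c_table[t][i] = (prev >> 8) ^ crc32c_table[0][prev & 0xFFu];
        }
    crc32c_table_ready = 1;
}

/* Endian-independent little-endian 32-bit load. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint32_t crc32c_slice8(uint32_t crc, const uint8_t *p, size_t len)
{
    assert(p != NULL || len == 0);
    if (!crc32c_table_ready)
        crc32c_build_tables();

    crc = ~crc;
    while (len >= 8) {
        /* Consume 8 input bytes with 8 independent table lookups. */
        uint32_t lo = load_le32(p) ^ crc;
        uint32_t hi = load_le32(p + 4);
        crc = crc32c_table[7][lo & 0xFFu] ^ crc32c_table[6][(lo >> 8) & 0xFFu] ^
              crc32c_table[5][(lo >> 16) & 0xFFu] ^ crc32c_table[4][lo >> 24] ^
              crc32c_table[3][hi & 0xFFu] ^ crc32c_table[2][(hi >> 8) & 0xFFu] ^
              crc32c_table[1][(hi >> 16) & 0xFFu] ^ crc32c_table[0][hi >> 24];
        p += 8;
        len -= 8;
    }
    while (len--)   /* byte-at-a-time tail */
        crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *p++) & 0xFFu];
    return ~crc;
}
```

Every 8-byte step touches all eight tables, so the whole 8 KiB has to stay cache-resident for the loop to run at full speed. Doubling the table count doubles that working set and competes with the rest of the process for L1 data cache, which is the trade-off described above.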



With best regards / Pozdrawiam
Piotr Dałek


