linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
@ 2014-05-28 14:40 George Spelvin
  2014-05-28 15:32 ` George Spelvin
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: George Spelvin @ 2014-05-28 14:40 UTC (permalink / raw)
  To: herbert, JBeulich, tim.c.chen; +Cc: linux, linux-kernel, sandyw

While following a number of tangents in the code (I was figuring out
how to edit lib/Kconfig; don't ask), I came across a table of 256 64-bit
words, all of which had the high half set to zero.

Since the code depends on both pclmulq and crc32, SSE 4.1 is obviously
present, so it could use pmovzxdq and save 1K of kernel data.
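
As a rough illustration of the idea (a portable C model, not the kernel
code itself): pmovzxdq with a memory operand reads two adjacent 32-bit
values and zero-extends each into its own 64-bit lane.  The helper name
load_k_pair and the two-element array standing in for %xmm0 are
assumptions made for this sketch.

```c
#include <stdint.h>

/*
 * Portable model of "pmovzxdq (bufp,%rax), %xmm0" (hypothetical helper,
 * not part of the kernel source).  Each table entry is 2 x 32-bit
 * constants (8 bytes); the first entry is for idx 1, hence the "- 1",
 * which mirrors the (K_table-8) base plus idx*8 in the assembly.
 */
static void load_k_pair(const uint32_t *k_table, unsigned idx,
			uint64_t lanes[2])
{
	const uint32_t *entry = k_table + 2 * (idx - 1);

	lanes[0] = entry[0];	/* low 64-bit lane, upper half zero */
	lanes[1] = entry[1];	/* high 64-bit lane, upper half zero */
}
```

With 64-bit table entries the same two lanes would come from a plain
16-byte load; the 32-bit layout halves the storage per entry.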

The following patch obviously lacks the kludges for old binutils,
but should convey the general idea.

Jan: Is support for SLE10's pre-2.18 binutils still required?
Your PEXTRD fix was only a year ago, so I expect so, but I wanted to ask.

Two other minor additional changes:

1. The current code unnecessarily puts the table in the read-write
   .data section.  Moved to .text.
2. I'm also not sure why it's necessary to force such large alignment
   on K_table.  Comments on reducing it?

Signed-off-by: George Spelvin <linux@horizon.com>


diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index dbc4339b..9f885ee4 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -216,15 +216,11 @@ LABEL crc_ %i
 	## 4) Combine three results:
 	################################################################
 
-	lea	(K_table-16)(%rip), bufp	# first entry is for idx 1
+	lea	(K_table-8)(%rip), bufp		# first entry is for idx 1
 	shlq    $3, %rax			# rax *= 8
-	subq    %rax, tmp			# tmp -= rax*8
-	shlq    $1, %rax
-	subq    %rax, tmp			# tmp -= rax*16
-						# (total tmp -= rax*24)
-	addq    %rax, bufp
-
-	movdqa  (bufp), %xmm0			# 2 consts: K1:K2
+	pmovzxdq (bufp,%rax), %xmm0		# 2 consts: K1:K2
+	leal	(%eax,%eax,2), %eax		# rax *= 3 (total *24)
+	subq    %rax, tmp			# tmp -= rax*24
 
 	movq    crc_init, %xmm1			# CRC for block 1
 	PCLMULQDQ 0x00,%xmm0,%xmm1		# Multiply by K2
@@ -331,136 +327,135 @@ ENDPROC(crc_pcl)
 
 	################################################################
 	## PCLMULQDQ tables
-	## Table is 128 entries x 2 quad words each
+	## Table is 128 entries x 2 words (8 bytes) each
 	################################################################
-.data
-.align 64
+.align 8
 K_table:
-        .quad 0x14cd00bd6,0x105ec76f0
+        .long 0x14cd00bd6,0x105ec76f0
-        .quad 0x0ba4fc28e,0x14cd00bd6
+        .long 0x0ba4fc28e,0x14cd00bd6
-        .quad 0x1d82c63da,0x0f20c0dfe
+        .long 0x1d82c63da,0x0f20c0dfe
-        .quad 0x09e4addf8,0x0ba4fc28e
+        .long 0x09e4addf8,0x0ba4fc28e
-        .quad 0x039d3b296,0x1384aa63a
+        .long 0x039d3b296,0x1384aa63a
-        .quad 0x102f9b8a2,0x1d82c63da
+        .long 0x102f9b8a2,0x1d82c63da
-        .quad 0x14237f5e6,0x01c291d04
+        .long 0x14237f5e6,0x01c291d04
-        .quad 0x00d3b6092,0x09e4addf8
+        .long 0x00d3b6092,0x09e4addf8

(Remaining boring bits of this hunk elided.)

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-28 14:40 [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table George Spelvin
@ 2014-05-28 15:32 ` George Spelvin
  2014-05-28 22:15   ` [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words George Spelvin
  2014-05-28 20:47 ` [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table Jan Beulich
  2014-05-28 22:32 ` Tim Chen
  2 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2014-05-28 15:32 UTC (permalink / raw)
  To: herbert, JBeulich, linux, tim.c.chen; +Cc: linux-kernel, sandyw

Um, yeah, I just noticed the problem with that patch: half of the numbers
in that table are 33 bits, and cause a pile of warnings (not errors,
unfortunately!) from gas that scrolled by when I wasn't looking.

Logically, there should be no need for 33-bit values; they should all be
reducible modulo the polynomial.  But that is going to take a slightly
larger change.
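
A minimal sketch of what that reduction could look like, assuming the
table values are in the bit-reflected form where the CRC32C polynomial
is the 33-bit constant 0x105ec76f1 (0x82F63B78 shifted left by one, with
the low bit set).  That constant and the helper name reduce33 are
inferences for illustration; XORing in the polynomial leaves a value
congruent modulo the polynomial while clearing bit 32.

```c
#include <stdint.h>

/*
 * CRC32C polynomial in bit-reflected 33-bit form.  This is an assumption
 * inferred from comparing the 64-bit table entries with the 32-bit ones
 * in the v2 patch, not something taken from the kernel source.
 */
#define CRC32C_POLY33 0x105ec76f1ULL

/* Fold a 33-bit constant down to 32 bits by XORing in the polynomial. */
static uint32_t reduce33(uint64_t k)
{
	if (k >> 32)		/* bit 32 set: not yet reduced */
		k ^= CRC32C_POLY33;
	return (uint32_t)k;
}
```

For example, the old first entry 0x14cd00bd6 folds to 0x493c7d27, and
0x105ec76f0 folds to 0x00000001, matching the first line of the reduced
table in the v2 patch.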

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-28 14:40 [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table George Spelvin
  2014-05-28 15:32 ` George Spelvin
@ 2014-05-28 20:47 ` Jan Beulich
  2014-05-28 21:47   ` George Spelvin
  2014-05-28 22:32 ` Tim Chen
  2 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2014-05-28 20:47 UTC (permalink / raw)
  To: herbert, linux, tim.c.chen; +Cc: sandyw, linux-kernel

>>> "George Spelvin" <linux@horizon.com> 05/28/14 4:40 PM >>>
>Jan: Is support for SLE10's pre-2.18 binutils still required?
>Your PEXTRD fix was only a year ago, so I expect so, but I wanted to ask.

I'd much appreciate it if I could keep building the kernel that way for a while longer.

>Two other minor additional changes:
>
>1. The current code unnecessarily puts the table in the read-write
>   .data section.  Moved to .text.

Putting data into .text seems wrong - it should go into .rodata.

Jan


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-28 20:47 ` [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table Jan Beulich
@ 2014-05-28 21:47   ` George Spelvin
  2014-05-29  6:44     ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2014-05-28 21:47 UTC (permalink / raw)
  To: jbeulich, linux, tim.c.chen; +Cc: linux-kernel

Jan Beulich <jbeulich@suse.com> wrote:
> "George Spelvin" <linux@horizon.com> 05/28/14 4:40 PM
>> Jan: Is support for SLE10's pre-2.18 binutils still required?
>> Your PEXTRD fix was only a year ago, so I expect so, but I wanted to ask.

> I'd much appreciate it if I could keep building the kernel that way for
> a while longer.

Does it matter that the code I'm working on is 64-bit only?  It already
uses crc32q instruction (added with SSE4.2) with no assembler workarounds,
so I figure pmovzxdq (part of SSE 4.1) doesn't make it any worse.

The annoying thing about doing it with macros is that it would be a
PITA to support a memory operand; I'd probably have to punt to .byte.

> Putting data into .text seems wrong - it should go into .rodata.

I don't really care, but it's being accessed PC-relative the same as
a jump table that's already in .text, so I just figured I'd be lazy.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words
  2014-05-28 15:32 ` George Spelvin
@ 2014-05-28 22:15   ` George Spelvin
  2014-05-28 23:02     ` Tim Chen
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2014-05-28 22:15 UTC (permalink / raw)
  To: herbert, JBeulich, linux, tim.c.chen
  Cc: david.m.cote, james.guilford, linux-kernel, sandyw, wajdi.k.feghali

crypto: crc32c-pclmul - Shrink K_table to 32-bit words

There's no need for the K_table to be made of 64-bit words.  For some
reason, the original authors didn't fully reduce the values modulo the
CRC32C polynomial, and so had some 33-bit numbers in there.  They
can all be reduced to 32 bits.

Doing that cuts the table size in half.  Since the code depends on both
pclmulq and crc32, SSE 4.1 is obviously present, so we can use pmovzxdq
to fetch it in the correct format.

Two other related fixes:
* K_table is read-only, so belongs in .text, not .data, and
* There's no need for more than 8-byte alignment

Signed-off-by: George Spelvin <linux@horizon.com>
---
Fixed properly and tested with an exhaustive user-space test harness.

I filled a 4K byte buffer with pseudorandom bytes and computed CRCs
from i to j and from j to k for all 0 <= i < j < k < 4096, comparing
both the intermediate and final results against a basic bit-at-a-time
software algorithm.
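
For reference, a harness along these lines could be sketched as follows.
This is not the harness that was actually used: the function names, the
initial CRC of ~0, and the absence of any final inversion are assumptions
for illustration.  The function-pointer signature follows the one used by
the user-space benchmark later in this thread.

```c
#include <stdint.h>

/* Signature of a CRC implementation under test. */
typedef uint32_t (*crc_fn)(void const *buf, unsigned len, uint32_t crc);

/*
 * Basic bit-at-a-time CRC32C update (reflected polynomial 0x82F63B78).
 * No pre- or post-inversion: the caller passes the running CRC.
 */
static uint32_t crc32c_bitwise(void const *buf, unsigned len, uint32_t crc)
{
	const unsigned char *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return crc;
}

/*
 * For all 0 <= i < j < k < n, compute the CRC of [i,j) and then of
 * [j,k) with f, checking both the intermediate and the final value
 * against the bitwise reference.  Returns 1 if everything matches.
 */
static int check_all_splits(const unsigned char *buf, unsigned n, crc_fn f)
{
	for (unsigned i = 0; i < n; i++) {
		for (unsigned j = i + 1; j < n; j++) {
			uint32_t want = crc32c_bitwise(buf + i, j - i, ~0u);
			uint32_t got = f(buf + i, j - i, ~0u);

			if (got != want)
				return 0;
			for (unsigned k = j + 1; k < n; k++)
				if (f(buf + j, k - j, got) !=
				    crc32c_bitwise(buf + j, k - j, want))
					return 0;
		}
	}
	return 1;
}
```

The cost grows roughly with the cube of n for the number of splits, so a
small n is useful for a quick smoke test before running the full 4096.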

There's still room for improvement.  Some additional areas that
could use tweaking:
- If the SMALL_SIZE is set right, that should also be the size where
  we fall out of the 3-part algorithm.  As it is, a buffer of size
  3096 will do a 3072-byte chunk and then do 3 8-byte CRCs and
  mess around a lot combining them.
- Does it really warrant all the unrolling?  Surely any processor
  new enough to have a fully pipelined crc32 instruction can
  also handle some loop overhead instructions as well?
- Some reassignment of the registers would put 32-bit variables
  (like crc_init_dw) in low registers so that they can be addressed
  without REX prefixes and shrink the code.  But 64-bit pointers like
  block_0 and block_1 are only ever used with 64-bit operands and thus
  REX prefixes.

diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index dbc4339b..dcc50752 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -216,15 +216,11 @@ LABEL crc_ %i
 	## 4) Combine three results:
 	################################################################
 
-	lea	(K_table-16)(%rip), bufp	# first entry is for idx 1
+	lea	(K_table-8)(%rip), bufp		# first entry is for idx 1
 	shlq    $3, %rax			# rax *= 8
-	subq    %rax, tmp			# tmp -= rax*8
-	shlq    $1, %rax
-	subq    %rax, tmp			# tmp -= rax*16
-						# (total tmp -= rax*24)
-	addq    %rax, bufp
-
-	movdqa  (bufp), %xmm0			# 2 consts: K1:K2
+	pmovzxdq (bufp,%rax), %xmm0		# 2 consts: K1:K2
+	leal	(%eax,%eax,2), %eax		# rax *= 3 (total *24)
+	subq    %rax, tmp			# tmp -= rax*24
 
 	movq    crc_init, %xmm1			# CRC for block 1
 	PCLMULQDQ 0x00,%xmm0,%xmm1		# Multiply by K2
@@ -238,9 +234,9 @@ LABEL crc_ %i
 	mov     crc2, crc_init
 	crc32   %rax, crc_init
 
-################################################################
-## 5) Check for end:
-################################################################
+	################################################################
+	## 5) Check for end:
+	################################################################
 
 LABEL crc_ 0
 	mov     tmp, len
@@ -331,136 +327,135 @@ ENDPROC(crc_pcl)
 
 	################################################################
 	## PCLMULQDQ tables
-	## Table is 128 entries x 2 quad words each
+	## Table is 128 entries x 2 words (8 bytes) each
 	################################################################
-.data
-.align 64
+.align 8
 K_table:
-        .quad 0x14cd00bd6,0x105ec76f0
-        .quad 0x0ba4fc28e,0x14cd00bd6
-        .quad 0x1d82c63da,0x0f20c0dfe
-        .quad 0x09e4addf8,0x0ba4fc28e
-        .quad 0x039d3b296,0x1384aa63a
-        .quad 0x102f9b8a2,0x1d82c63da
-        .quad 0x14237f5e6,0x01c291d04
-        .quad 0x00d3b6092,0x09e4addf8
-        .quad 0x0c96cfdc0,0x0740eef02
-        .quad 0x18266e456,0x039d3b296
-        .quad 0x0daece73e,0x0083a6eec
-        .quad 0x0ab7aff2a,0x102f9b8a2
-        .quad 0x1248ea574,0x1c1733996
-        .quad 0x083348832,0x14237f5e6
-        .quad 0x12c743124,0x02ad91c30
-        .quad 0x0b9e02b86,0x00d3b6092
-        .quad 0x018b33a4e,0x06992cea2
-        .quad 0x1b331e26a,0x0c96cfdc0
-        .quad 0x17d35ba46,0x07e908048
-        .quad 0x1bf2e8b8a,0x18266e456
-        .quad 0x1a3e0968a,0x11ed1f9d8
-        .quad 0x0ce7f39f4,0x0daece73e
-        .quad 0x061d82e56,0x0f1d0f55e
-        .quad 0x0d270f1a2,0x0ab7aff2a
-        .quad 0x1c3f5f66c,0x0a87ab8a8
-        .quad 0x12ed0daac,0x1248ea574
-        .quad 0x065863b64,0x08462d800
-        .quad 0x11eef4f8e,0x083348832
-        .quad 0x1ee54f54c,0x071d111a8
-        .quad 0x0b3e32c28,0x12c743124
-        .quad 0x0064f7f26,0x0ffd852c6
-        .quad 0x0dd7e3b0c,0x0b9e02b86
-        .quad 0x0f285651c,0x0dcb17aa4
-        .quad 0x010746f3c,0x018b33a4e
-        .quad 0x1c24afea4,0x0f37c5aee
-        .quad 0x0271d9844,0x1b331e26a
-        .quad 0x08e766a0c,0x06051d5a2
-        .quad 0x093a5f730,0x17d35ba46
-        .quad 0x06cb08e5c,0x11d5ca20e
-        .quad 0x06b749fb2,0x1bf2e8b8a
-        .quad 0x1167f94f2,0x021f3d99c
-        .quad 0x0cec3662e,0x1a3e0968a
-        .quad 0x19329634a,0x08f158014
-        .quad 0x0e6fc4e6a,0x0ce7f39f4
-        .quad 0x08227bb8a,0x1a5e82106
-        .quad 0x0b0cd4768,0x061d82e56
-        .quad 0x13c2b89c4,0x188815ab2
-        .quad 0x0d7a4825c,0x0d270f1a2
-        .quad 0x10f5ff2ba,0x105405f3e
-        .quad 0x00167d312,0x1c3f5f66c
-        .quad 0x0f6076544,0x0e9adf796
-        .quad 0x026f6a60a,0x12ed0daac
-        .quad 0x1a2adb74e,0x096638b34
-        .quad 0x19d34af3a,0x065863b64
-        .quad 0x049c3cc9c,0x1e50585a0
-        .quad 0x068bce87a,0x11eef4f8e
-        .quad 0x1524fa6c6,0x19f1c69dc
-        .quad 0x16cba8aca,0x1ee54f54c
-        .quad 0x042d98888,0x12913343e
-        .quad 0x1329d9f7e,0x0b3e32c28
-        .quad 0x1b1c69528,0x088f25a3a
-        .quad 0x02178513a,0x0064f7f26
-        .quad 0x0e0ac139e,0x04e36f0b0
-        .quad 0x0170076fa,0x0dd7e3b0c
-        .quad 0x141a1a2e2,0x0bd6f81f8
-        .quad 0x16ad828b4,0x0f285651c
-        .quad 0x041d17b64,0x19425cbba
-        .quad 0x1fae1cc66,0x010746f3c
-        .quad 0x1a75b4b00,0x18db37e8a
-        .quad 0x0f872e54c,0x1c24afea4
-        .quad 0x01e41e9fc,0x04c144932
-        .quad 0x086d8e4d2,0x0271d9844
-        .quad 0x160f7af7a,0x052148f02
-        .quad 0x05bb8f1bc,0x08e766a0c
-        .quad 0x0a90fd27a,0x0a3c6f37a
-        .quad 0x0b3af077a,0x093a5f730
-        .quad 0x04984d782,0x1d22c238e
-        .quad 0x0ca6ef3ac,0x06cb08e5c
-        .quad 0x0234e0b26,0x063ded06a
-        .quad 0x1d88abd4a,0x06b749fb2
-        .quad 0x04597456a,0x04d56973c
-        .quad 0x0e9e28eb4,0x1167f94f2
-        .quad 0x07b3ff57a,0x19385bf2e
-        .quad 0x0c9c8b782,0x0cec3662e
-        .quad 0x13a9cba9e,0x0e417f38a
-        .quad 0x093e106a4,0x19329634a
-        .quad 0x167001a9c,0x14e727980
-        .quad 0x1ddffc5d4,0x0e6fc4e6a
-        .quad 0x00df04680,0x0d104b8fc
-        .quad 0x02342001e,0x08227bb8a
-        .quad 0x00a2a8d7e,0x05b397730
-        .quad 0x168763fa6,0x0b0cd4768
-        .quad 0x1ed5a407a,0x0e78eb416
-        .quad 0x0d2c3ed1a,0x13c2b89c4
-        .quad 0x0995a5724,0x1641378f0
-        .quad 0x19b1afbc4,0x0d7a4825c
-        .quad 0x109ffedc0,0x08d96551c
-        .quad 0x0f2271e60,0x10f5ff2ba
-        .quad 0x00b0bf8ca,0x00bf80dd2
-        .quad 0x123888b7a,0x00167d312
-        .quad 0x1e888f7dc,0x18dcddd1c
-        .quad 0x002ee03b2,0x0f6076544
-        .quad 0x183e8d8fe,0x06a45d2b2
-        .quad 0x133d7a042,0x026f6a60a
-        .quad 0x116b0f50c,0x1dd3e10e8
-        .quad 0x05fabe670,0x1a2adb74e
-        .quad 0x130004488,0x0de87806c
-        .quad 0x000bcf5f6,0x19d34af3a
-        .quad 0x18f0c7078,0x014338754
-        .quad 0x017f27698,0x049c3cc9c
-        .quad 0x058ca5f00,0x15e3e77ee
-        .quad 0x1af900c24,0x068bce87a
-        .quad 0x0b5cfca28,0x0dd07448e
-        .quad 0x0ded288f8,0x1524fa6c6
-        .quad 0x059f229bc,0x1d8048348
-        .quad 0x06d390dec,0x16cba8aca
-        .quad 0x037170390,0x0a3e3e02c
-        .quad 0x06353c1cc,0x042d98888
-        .quad 0x0c4584f5c,0x0d73c7bea
-        .quad 0x1f16a3418,0x1329d9f7e
-        .quad 0x0531377e2,0x185137662
-        .quad 0x1d8d9ca7c,0x1b1c69528
-        .quad 0x0b25b29f2,0x18a08b5bc
-        .quad 0x19fb2a8b0,0x02178513a
-        .quad 0x1a08fe6ac,0x1da758ae0
-        .quad 0x045cddf4e,0x0e0ac139e
-        .quad 0x1a91647f2,0x169cf9eb0
-        .quad 0x1a0f717c4,0x0170076fa
+	.long 0x493c7d27, 0x00000001
+	.long 0xba4fc28e, 0x493c7d27
+	.long 0xddc0152b, 0xf20c0dfe
+	.long 0x9e4addf8, 0xba4fc28e
+	.long 0x39d3b296, 0x3da6d0cb
+	.long 0x0715ce53, 0xddc0152b
+	.long 0x47db8317, 0x1c291d04
+	.long 0x0d3b6092, 0x9e4addf8
+	.long 0xc96cfdc0, 0x740eef02
+	.long 0x878a92a7, 0x39d3b296
+	.long 0xdaece73e, 0x083a6eec
+	.long 0xab7aff2a, 0x0715ce53
+	.long 0x2162d385, 0xc49f4f67
+	.long 0x83348832, 0x47db8317
+	.long 0x299847d5, 0x2ad91c30
+	.long 0xb9e02b86, 0x0d3b6092
+	.long 0x18b33a4e, 0x6992cea2
+	.long 0xb6dd949b, 0xc96cfdc0
+	.long 0x78d9ccb7, 0x7e908048
+	.long 0xbac2fd7b, 0x878a92a7
+	.long 0xa60ce07b, 0x1b3d8f29
+	.long 0xce7f39f4, 0xdaece73e
+	.long 0x61d82e56, 0xf1d0f55e
+	.long 0xd270f1a2, 0xab7aff2a
+	.long 0xc619809d, 0xa87ab8a8
+	.long 0x2b3cac5d, 0x2162d385
+	.long 0x65863b64, 0x8462d800
+	.long 0x1b03397f, 0x83348832
+	.long 0xebb883bd, 0x71d111a8
+	.long 0xb3e32c28, 0x299847d5
+	.long 0x064f7f26, 0xffd852c6
+	.long 0xdd7e3b0c, 0xb9e02b86
+	.long 0xf285651c, 0xdcb17aa4
+	.long 0x10746f3c, 0x18b33a4e
+	.long 0xc7a68855, 0xf37c5aee
+	.long 0x271d9844, 0xb6dd949b
+	.long 0x8e766a0c, 0x6051d5a2
+	.long 0x93a5f730, 0x78d9ccb7
+	.long 0x6cb08e5c, 0x18b0d4ff
+	.long 0x6b749fb2, 0xbac2fd7b
+	.long 0x1393e203, 0x21f3d99c
+	.long 0xcec3662e, 0xa60ce07b
+	.long 0x96c515bb, 0x8f158014
+	.long 0xe6fc4e6a, 0xce7f39f4
+	.long 0x8227bb8a, 0xa00457f7
+	.long 0xb0cd4768, 0x61d82e56
+	.long 0x39c7ff35, 0x8d6d2c43
+	.long 0xd7a4825c, 0xd270f1a2
+	.long 0x0ab3844b, 0x00ac29cf
+	.long 0x0167d312, 0xc619809d
+	.long 0xf6076544, 0xe9adf796
+	.long 0x26f6a60a, 0x2b3cac5d
+	.long 0xa741c1bf, 0x96638b34
+	.long 0x98d8d9cb, 0x65863b64
+	.long 0x49c3cc9c, 0xe0e9f351
+	.long 0x68bce87a, 0x1b03397f
+	.long 0x57a3d037, 0x9af01f2d
+	.long 0x6956fc3b, 0xebb883bd
+	.long 0x42d98888, 0x2cff42cf
+	.long 0x3771e98f, 0xb3e32c28
+	.long 0xb42ae3d9, 0x88f25a3a
+	.long 0x2178513a, 0x064f7f26
+	.long 0xe0ac139e, 0x4e36f0b0
+	.long 0x170076fa, 0xdd7e3b0c
+	.long 0x444dd413, 0xbd6f81f8
+	.long 0x6f345e45, 0xf285651c
+	.long 0x41d17b64, 0x91c9bd4b
+	.long 0xff0dba97, 0x10746f3c
+	.long 0xa2b73df1, 0x885f087b
+	.long 0xf872e54c, 0xc7a68855
+	.long 0x1e41e9fc, 0x4c144932
+	.long 0x86d8e4d2, 0x271d9844
+	.long 0x651bd98b, 0x52148f02
+	.long 0x5bb8f1bc, 0x8e766a0c
+	.long 0xa90fd27a, 0xa3c6f37a
+	.long 0xb3af077a, 0x93a5f730
+	.long 0x4984d782, 0xd7c0557f
+	.long 0xca6ef3ac, 0x6cb08e5c
+	.long 0x234e0b26, 0x63ded06a
+	.long 0xdd66cbbb, 0x6b749fb2
+	.long 0x4597456a, 0x4d56973c
+	.long 0xe9e28eb4, 0x1393e203
+	.long 0x7b3ff57a, 0x9669c9df
+	.long 0xc9c8b782, 0xcec3662e
+	.long 0x3f70cc6f, 0xe417f38a
+	.long 0x93e106a4, 0x96c515bb
+	.long 0x62ec6c6d, 0x4b9e0f71
+	.long 0xd813b325, 0xe6fc4e6a
+	.long 0x0df04680, 0xd104b8fc
+	.long 0x2342001e, 0x8227bb8a
+	.long 0x0a2a8d7e, 0x5b397730
+	.long 0x6d9a4957, 0xb0cd4768
+	.long 0xe8b6368b, 0xe78eb416
+	.long 0xd2c3ed1a, 0x39c7ff35
+	.long 0x995a5724, 0x61ff0e01
+	.long 0x9ef68d35, 0xd7a4825c
+	.long 0x0c139b31, 0x8d96551c
+	.long 0xf2271e60, 0x0ab3844b
+	.long 0x0b0bf8ca, 0x0bf80dd2
+	.long 0x2664fd8b, 0x0167d312
+	.long 0xed64812d, 0x8821abed
+	.long 0x02ee03b2, 0xf6076544
+	.long 0x8604ae0f, 0x6a45d2b2
+	.long 0x363bd6b3, 0x26f6a60a
+	.long 0x135c83fd, 0xd8d26619
+	.long 0x5fabe670, 0xa741c1bf
+	.long 0x35ec3279, 0xde87806c
+	.long 0x00bcf5f6, 0x98d8d9cb
+	.long 0x8ae00689, 0x14338754
+	.long 0x17f27698, 0x49c3cc9c
+	.long 0x58ca5f00, 0x5bd2011f
+	.long 0xaa7c7ad5, 0x68bce87a
+	.long 0xb5cfca28, 0xdd07448e
+	.long 0xded288f8, 0x57a3d037
+	.long 0x59f229bc, 0xdde8f5b9
+	.long 0x6d390dec, 0x6956fc3b
+	.long 0x37170390, 0xa3e3e02c
+	.long 0x6353c1cc, 0x42d98888
+	.long 0xc4584f5c, 0xd73c7bea
+	.long 0xf48642e9, 0x3771e98f
+	.long 0x531377e2, 0x80ff0093
+	.long 0xdd35bc8d, 0xb42ae3d9
+	.long 0xb25b29f2, 0x8fe4c34d
+	.long 0x9a5ede41, 0x2178513a
+	.long 0xa563905d, 0xdf99fc11
+	.long 0x45cddf4e, 0xe0ac139e
+	.long 0xacfa3103, 0x6c23e841
+	.long 0xa51b6135, 0x170076fa

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-28 14:40 [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table George Spelvin
  2014-05-28 15:32 ` George Spelvin
  2014-05-28 20:47 ` [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table Jan Beulich
@ 2014-05-28 22:32 ` Tim Chen
  2014-05-28 23:01   ` George Spelvin
  2 siblings, 1 reply; 29+ messages in thread
From: Tim Chen @ 2014-05-28 22:32 UTC (permalink / raw)
  To: George Spelvin; +Cc: herbert, JBeulich, linux-kernel, sandyw, James Guilford

On Wed, 2014-05-28 at 10:40 -0400, George Spelvin wrote:
> While following a number of tangents in the code (I was figuring out
> how to edit lib/Kconfig; don't ask), I came across a table of 256 64-bit
> words, all of which had the high half set to zero.
> 
> Since the code depends on both pclmulq and crc32, SSE 4.1 is obviously
> present, so it could use pmovzxdq and save 1K of kernel data.
> 
> The following patch obviously lacks the kludges for old binutils,
> but should convey the general idea.
> 
> Jan: Is support for SLE10's pre-2.18 binutils still required?
> Your PEXTRD fix was only a year ago, so I expect so, but I wanted to ask.
> 
> Two other minor additional changes:
> 
> 1. The current code unnecessarily puts the table in the read-write
>    .data section.  Moved to .text.
> 2. I'm also not sure why it's necessary to force such large alignment
>    on K_table.  Comments on reducing it?
> 
> Signed-off-by: George Spelvin <linux@horizon.com>
> 
> 
> diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
> index dbc4339b..9f885ee4 100644
> --- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
> +++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
> @@ -216,15 +216,11 @@ LABEL crc_ %i
>  	## 4) Combine three results:
>  	################################################################
>  
> -	lea	(K_table-16)(%rip), bufp	# first entry is for idx 1
> +	lea	(K_table-8)(%rip), bufp		# first entry is for idx 1
>  	shlq    $3, %rax			# rax *= 8
> -	subq    %rax, tmp			# tmp -= rax*8
> -	shlq    $1, %rax
> -	subq    %rax, tmp			# tmp -= rax*16
> -						# (total tmp -= rax*24)
> -	addq    %rax, bufp
> -
> -	movdqa  (bufp), %xmm0			# 2 consts: K1:K2
> +	pmovzxdq (bufp,%rax), %xmm0		# 2 consts: K1:K2

Changing from the aligned move (movdqa) to unaligned move and zeroing
(pmovzxdq), is going to make things slower.  If the table is aligned
on 8 byte boundary, some of the table can span 2 cache lines, which
can slow things further.

We are trading speed for only 4096 bytes of memory save,
which is likely not a good trade for most systems except for 
those really constrained of memory.  For this kind of non-performance
critical system, it may as well use the generic crc32c algorithm and
compile out this module.

Thanks.

Tim




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-28 22:32 ` Tim Chen
@ 2014-05-28 23:01   ` George Spelvin
  2014-05-28 23:28     ` Tim Chen
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2014-05-28 23:01 UTC (permalink / raw)
  To: linux, tim.c.chen; +Cc: herbert, james.guilford, JBeulich, linux-kernel, sandyw

Thanks for the reply!

> Changing from the aligned move (movdqa) to unaligned move and zeroing
> (pmovzxdq), is going to make things slower.  If the table is aligned
> on 8 byte boundary, some of the table can span 2 cache lines, which
> can slow things further.

Um, two notes:
1) This load is performed once per 3072-byte block, which
   is a minimum of 128 cycles just for the crc32q instructions,
   never mind all the pclmulqdq folderol.

   Is it really more than 2 cycles?  Heck, is it *any* overall
   time given that it's preceded by a stretch of 384 instructions
   that it's not data-dependent on?

   I'll do some benchmarking to find out.

2) The shrunk table entries are 8 bytes long, and so can't
   span a cache line.  Is there any benefit to using a
   larger alignment, other than the very small issue of the
   full table needing 1 more cache line to be fully cached?
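
To make point 2 concrete, here is a small sketch that counts how many
table entries straddle a cache line for a given placement.  The 64-byte
line size and all names are assumptions for illustration only.

```c
#include <stdint.h>

#define CACHE_LINE 64u	/* assumed cache line size in bytes */

/* 1 if the object [offset, offset + size) crosses a line boundary. */
static int crosses_line(uint64_t offset, unsigned size)
{
	return offset / CACHE_LINE != (offset + size - 1) / CACHE_LINE;
}

/* Number of entries that cross a line, for a table starting at base. */
static unsigned count_crossing(uint64_t base, unsigned entry_size,
			       unsigned entries)
{
	unsigned n = 0;

	for (unsigned i = 0; i < entries; i++)
		n += crosses_line(base + (uint64_t)i * entry_size,
				  entry_size);
	return n;
}

/* Number of cache lines touched by total bytes starting at base. */
static unsigned lines_touched(uint64_t base, unsigned total)
{
	return (unsigned)((base + total - 1) / CACHE_LINE -
			  base / CACHE_LINE + 1);
}
```

With 128 entries of 8 bytes at an 8-byte-aligned base, no entry crosses
a line, and the 1024-byte table touches 17 lines instead of 16.  The
same base with 16-byte entries would have every fourth entry crossing.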
   
> We are trading speed for only 4096 bytes of memory save,
> which is likely not a good trade for most systems except for 
> those really constrained of memory.  For this kind of non-performance
> critical system, it may as well use the generic crc32c algorithm and
> compile out this module.

I hadn't intended to cause any speed penalty at all.
Do you really think there will be one?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words
  2014-05-28 22:15   ` [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words George Spelvin
@ 2014-05-28 23:02     ` Tim Chen
  2014-05-28 23:55       ` George Spelvin
  2014-05-29  3:26       ` George Spelvin
  0 siblings, 2 replies; 29+ messages in thread
From: Tim Chen @ 2014-05-28 23:02 UTC (permalink / raw)
  To: George Spelvin
  Cc: herbert, JBeulich, david.m.cote, james.guilford, linux-kernel,
	sandyw, wajdi.k.feghali

On Wed, 2014-05-28 at 18:15 -0400, George Spelvin wrote:
> crypto: crc32c-pclmul - Shrink K_table to 32-bit words
> 
> There's no need for the K_table to be made of 64-bit words.  For some
> reason, the original authors didn't fully reduce the values modulo the
> CRC32C polynomial, and so had some 33-bit numbers in there.  They
> can all be reduced to 32 bits.
> 
> Doing that cuts the table size in half.  Since the code depends on both
> pclmulq and crc32, SSE 4.1 is obviously present, so we can use pmovzxdq
> to fetch it in the correct format.
> 
> Two other related fixes:
> * K_table is read-only, so belongs in .text, not .data, and
> * There's no need for more than 8-byte alignment

George,

Can you do a tcrypt speed measurement with and without your changes?
Check to see if there's any slowdown.  Please make sure you pin
the frequency of your cpu when running the test.  

e.g.
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

Thanks.

Tim



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-28 23:01   ` George Spelvin
@ 2014-05-28 23:28     ` Tim Chen
  2014-05-29 23:54       ` George Spelvin
  0 siblings, 1 reply; 29+ messages in thread
From: Tim Chen @ 2014-05-28 23:28 UTC (permalink / raw)
  To: George Spelvin; +Cc: herbert, james.guilford, JBeulich, linux-kernel, sandyw

On Wed, 2014-05-28 at 19:01 -0400, George Spelvin wrote:
> Thanks for the reply!
> 
> > Changing from the aligned move (movdqa) to unaligned move and zeroing
> > (pmovzxdq), is going to make things slower.  If the table is aligned
> > on 8 byte boundary, some of the table can span 2 cache lines, which
> > can slow things further.
> 
> Um, two notes:
> 1) This load is performed once per 3072-byte block, which
>    is a minimum of 128 cycles just for the crc32q instructions,
>    never mind all the pclmulqdq folderol.
> 
>    Is it really more than 2 cycles?  Heck, is it *any* overall
>    time given that it's preceded by a stretch of 384 instructions
>    that it's not data-dependent on?
> 
>    I'll do some benchmarking to find out.
> 
> 2) The shrunk table entries are 8 bytes long, and so can't
>    span a cache line.  Is there any benefit to using a
>    larger alignment, other than the very small issue of the
>    full table needing 1 more cache line to be fully cached?

I think you are fine.  Each entry should fit in a cache line
entirely.  With the reduced entry size, we will be fitting
twice as many entries per cache line so it may help to reduce
the cache miss.

>    
> > We are trading speed for only 4096 bytes of memory save,
> > which is likely not a good trade for most systems except for 
> > those really constrained of memory.  For this kind of non-performance
> > critical system, it may as well use the generic crc32c algorithm and
> > compile out this module.
> 
> I hadn't intended to cause any speed penalty at all.
> Do you really think there will be one?

If you can do some benchmarking to find out the change's
speed impact, that will help to eliminate concerns about
speed penalty.  

Thanks.

Tim





^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words
  2014-05-28 23:02     ` Tim Chen
@ 2014-05-28 23:55       ` George Spelvin
  2014-05-29  3:26       ` George Spelvin
  1 sibling, 0 replies; 29+ messages in thread
From: George Spelvin @ 2014-05-28 23:55 UTC (permalink / raw)
  To: linux, tim.c.chen
  Cc: david.m.cote, herbert, james.guilford, JBeulich, linux-kernel,
	sandyw, wajdi.k.feghali

> Can you do a tcrypt speed measurement with and without your changes?
> Check to see if there's any slowdown.  Please make sure you pin
> the frequency of your cpu when running the test.  

Sure thing; I was already inspired to do that based on your concerns.
Do you have any particular buffer sizes or alignments you'd suggest?

Since I'm changing only the three-part core, I was going to
avoid unaligned or short buffers, stick with a single buffer so
it stays in L1 D-cache, but vary the length so we use lots of
the K_table.

It's not the RAM I was worried about, but the D-cache wasted on
the K table.  Which doesn't affect the CRC code itself, but the
surrounding kernel code.


I'm also thinking of some ideas for handling even larger buffer sizes
without having to interrupt the 3-way main loop.  Pclmulqdq can
multiply up to 4 32-bit values to produce a 128-bit result, which
crc32 can efficiently reduce.  So if we have three tables, of
x^(64*n), x^(4096*n), and x^(262144*n), each for n=0..63, we can
multiply them all together to handle up to a 16 MiB chunk.

The other option is to schedule the pclmulqdq in parallel with
the crc32q iterations and, after arranging a staggered start,
have a 4-part main loop, where 3 parts are performing crc32q
iterations and the fourth is using SSE to shift itself
forward (at which point it gets XORed into the data stream
that one other part is working on).

I haven't got all the details of that idea worked out in my head, but
it seems possible.  I have to study the optimization guide in detail to
see how many micro-ops the crc32q instruction from memory is (and thus
how much of the decoder it requires).

As of Nehalem, a small inner loop that fits in the decoded uop cache
has the potential to be faster than a hugely unrolled one.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words
  2014-05-28 23:02     ` Tim Chen
  2014-05-28 23:55       ` George Spelvin
@ 2014-05-29  3:26       ` George Spelvin
  2014-05-29 16:33         ` Tim Chen
  1 sibling, 1 reply; 29+ messages in thread
From: George Spelvin @ 2014-05-29  3:26 UTC (permalink / raw)
  To: linux, tim.c.chen
  Cc: david.m.cote, herbert, james.guilford, JBeulich, linux-kernel,
	sandyw, wajdi.k.feghali

> Can you do a tcrypt speed measurement with and without your changes?
> Check to see if there's any slowdown.  Please make sure you pin
> the frequency of your cpu when running the test.  
> 
> e.g.
> echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

I just now re-read your e-mail and noticed you suggested a specific tool.
Oops, I haven't run that yet.  I just made up my own in user space.
As I mentioned, since the changes are to the main loop that operates on
aligned buffers in multiples of 24 bytes, I focused my benchmarking there:

#define BUFFER 6114
static unsigned char buf[BUFFER] __attribute__ ((aligned(8)));
#define ITER 24 /* Number of test iterations */

uint32_t
do_test(uint32_t crc, uint32_t (*f)(void const *, unsigned, uint32_t))
{
	int i, j;
	for (i = 0; i < BUFFER; i += 8)
		for (j = i+24; j <= BUFFER; j += 24)
			crc = f(buf+i, j-i, crc);
	return crc;
}

uint32_t
time_test(uint64_t *time, uint32_t crc,
	  uint32_t (*f)(void const *, unsigned, uint32_t))
{
	uint64_t start = rdtsc();
	crc = do_test(crc, f);
	*time = rdtsc() - start;
	return crc;
}

The actual test goes in ABBA order to reduce bias:

	for (i = 0; i < ITER; i += 2) {
		crc1 = time_test(times[i]+0, crc1, crc_pcl_1);
		crc2 = time_test(times[i]+1, crc2, crc_pcl_2);
		crc2 = time_test(times[i+1]+1, crc2, crc_pcl_2);
		crc1 = time_test(times[i+1]+0, crc1, crc_pcl_1);
	}

crc_pcl_1 is the old code, crc_pcl_2 is my revised version.


The results are as follows (the last line is a total):

        Old code     New code
 0:     85009953     71812457 (-13197496)
 1:     57408829     63361572 (+5952743)
 2:     52552399     49195266 (-3357133)
 3:     43595130     45988364 (+2393234)
 4:     41541760     39714198 (-1827562)
 5:     36576082     38021344 (+1445262)
 6:     35307854     34150656 (-1157198)
 7:     32182230     33134236 (+952006)
 8:     31341596     31307004 (-34592)
 9:     31340900     31329408 (-11492)
10:     31344884     31329144 (-15740)
11:     31334144     31312492 (-21652)
12:     31338992     31330356 (-8636)
13:     31343744     31311344 (-32400)
14:     31339000     31340196 (+1196)
15:     31337492     31313988 (-23504)
16:     31341688     31334040 (-7648)
17:     31341804     31308936 (-32868)
18:     31339936     31332020 (-7916)
19:     31323228     31324240 (+1012)
20:     31339744     31331768 (-7976)
21:     31321536     31332688 (+11152)
22:     31340280     31335212 (-5068)
23:     31332056     31335768 (+3712)
24:    885575261    876586697 (-8988564)

I swapped the link order of the two .o files in case cache
placement made a difference:

 0:     84305981     71483150 (-12822831)
 1:     57341376     63129024 (+5787648)
 2:     52361618     49240069 (-3121549)
 3:     43520576     45822670 (+2302094)
 4:     41500104     39684116 (-1815988)
 5:     36542864     37940196 (+1397332)
 6:     35281570     34144348 (-1137222)
 7:     32149420     33088652 (+939232)
 8:     31342368     31329056 (-13312)
 9:     31338788     31313212 (-25576)
10:     31336324     31335612 (-712)
11:     31341892     31319576 (-22316)
12:     31336224     31322808 (-13416)
13:     31338560     31315084 (-23476)
14:     31338332     31332976 (-5356)
15:     31337300     31315088 (-22212)
16:     31334300     31330884 (-3416)
17:     31318660     31329916 (+11256)
18:     31334984     31327740 (-7244)
19:     31315084     31327768 (+12684)
20:     31334708     31345872 (+11164)
21:     31325988     31330948 (+4960)
22:     31333956     31339800 (+5844)
23:     31322880     31327316 (+4436)
24:    884333857    875775881 (-8557976)

It doesn't look like a slowdown; more like a 1% speedup.

I'll figure out tcrypt in a bit.


* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-28 21:47   ` George Spelvin
@ 2014-05-29  6:44     ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2014-05-29  6:44 UTC (permalink / raw)
  To: linux, tim.c.chen; +Cc: linux-kernel

>>> "George Spelvin" <linux@horizon.com> 05/28/14 11:47 PM >>>
>Jan Beulich <jbeulich@suse.com> wrote:
>> "George Spelvin" <linux@horizon.com> 05/28/14 4:40 PM
>>> Jan: Is support for SLE10's pre-2.18 binutils still required?
>>> Your PEXTRD fix was only a year ago, so I expect, but I wanted to ask.
>
>> I'd much appreciate if I would be able to build the kernel that way for
>> another while.
>
>Does it matter that the code I'm working on is 64-bit only?

No.

>It already
>uses crc32q instruction (added with SSE4.2) with no assembler workarounds,
>so I figure pmovzxdq (part of SSE 4.1) doesn't make it any worse.

If that's the case, then adding another (earlier) one shouldn't be an issue.

Jan



* Re: [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words
  2014-05-29  3:26       ` George Spelvin
@ 2014-05-29 16:33         ` Tim Chen
  0 siblings, 0 replies; 29+ messages in thread
From: Tim Chen @ 2014-05-29 16:33 UTC (permalink / raw)
  To: George Spelvin
  Cc: david.m.cote, herbert, james.guilford, JBeulich, linux-kernel,
	sandyw, wajdi.k.feghali

On Wed, 2014-05-28 at 23:26 -0400, George Spelvin wrote:
> > Can you do a tcrypt speed measurement with and without your changes?
> > Check to see if there's any slowdown.  Please make sure you pin
> > the frequency of your cpu when running the test.  
> > 
> > e.g.
> > echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> 
> I just now re-read your e-mail and noticed you suggested a specific tool.

Try running the standard kernel crypto test with tcrypt.  For the speed
test of crc32c, use mode 319:

modprobe tcrypt mode=319

Then you will see the output in dmesg (or tail of /var/log/messages).
It will give you the cycles you spent for various block sizes.

For consistent test numbers, before testing,
disable the cpu's turbo mode in the BIOS and pin the
frequency of all your cpus to max with something like

i=0
num_cpus=`cat /proc/cpuinfo| grep "^processor"| wc -l `
while [ $i -lt $num_cpus ]
do
  echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor
  i=`expr $i + 1`
done

> Oops, I haven't run that yet.  I just made up my own in user space.
> As I mentioned, since the changes are to the main loop that operates on
> aligned buffers in multiples of 24 bytes, I focused my benchmarking there:
> 
> #define BUFFER 6114
> static unsigned char buf[BUFFER] __attribute__ ((aligned(8)));
> #define ITER 24 /* Number of test iterations */
> 
> uint32_t
> do_test(uint32_t crc, uint32_t (*f)(void const *, unsigned, uint32_t))
> {
> 	int i, j;
> 	for (i = 0; i < BUFFER; i += 8)
> 		for (j = i+24; j <= BUFFER; j += 24)
> 			crc = f(buf+i, j-i, crc);
> 	return crc;
> }
> 
> uint32_t
> time_test(uint64_t *time, uint32_t crc,
> 	  uint32_t (*f)(void const *, unsigned, uint32_t))
> {
> 	uint64_t start = rdtsc();
> 	crc = do_test(crc, f);
> 	*time = rdtsc() - start;
> 	return crc;
> }
> 
> The actual test goes in ABBA order to reduce bias:
> 
> 	for (i = 0; i < ITER; i += 2) {
> 		crc1 = time_test(times[i]+0, crc1, crc_pcl_1);
> 		crc2 = time_test(times[i]+1, crc2, crc_pcl_2);
> 		crc2 = time_test(times[i+1]+1, crc2, crc_pcl_2);
> 		crc1 = time_test(times[i+1]+0, crc1, crc_pcl_1);
> 	}
> 
> crc_pcl_1 is the old code, crc_pcl_2 is my revised version.
> 
> 
> The results are as follows (the last line is a total):
> 
>         Old code     New code
>  0:     85009953     71812457 (-13197496)
>  1:     57408829     63361572 (+5952743)

Maybe your cpu has not been pinned to a constant frequency?
The cycle counts are much higher in the first few iterations.
Likely the cpu frequency is going up when the governor detects
the load on the cpu.  Please also check that turbo is
turned off, as it can introduce a lot of variation
in your testing.

>  2:     52552399     49195266 (-3357133)
>  3:     43595130     45988364 (+2393234)
>  4:     41541760     39714198 (-1827562)
>  5:     36576082     38021344 (+1445262)
>  6:     35307854     34150656 (-1157198)
>  7:     32182230     33134236 (+952006)
>  8:     31341596     31307004 (-34592)
>  9:     31340900     31329408 (-11492)
> 10:     31344884     31329144 (-15740)
> 11:     31334144     31312492 (-21652)
> 12:     31338992     31330356 (-8636)
> 13:     31343744     31311344 (-32400)
> 14:     31339000     31340196 (+1196)
> 15:     31337492     31313988 (-23504)
> 16:     31341688     31334040 (-7648)
> 17:     31341804     31308936 (-32868)
> 18:     31339936     31332020 (-7916)
> 19:     31323228     31324240 (+1012)
> 20:     31339744     31331768 (-7976)
> 21:     31321536     31332688 (+11152)
> 22:     31340280     31335212 (-5068)
> 23:     31332056     31335768 (+3712)

Looks encouraging that the time difference is fairly
small between the two algorithms.

> 24:    885575261    876586697 (-8988564)

> 
> It doesn't look like a slowdown; more like a 1% speedup.

You will need to throw away the first few iterations of
the test to account for cache warming effects.

Thanks.

Tim



* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-28 23:28     ` Tim Chen
@ 2014-05-29 23:54       ` George Spelvin
  2014-05-30  1:07         ` Tim Chen
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2014-05-29 23:54 UTC (permalink / raw)
  To: linux, tim.c.chen; +Cc: herbert, james.guilford, JBeulich, linux-kernel

Sorry for the delay; my Ivy Bridge test machine isn't in my
office and getting to the console to tweak the BIOS is a
bit of a bother.

Anyway, i7-4930K, turbo boost & hyperthreading disabled,
$ cat /sys/devices/system/cpu/cpu?/cpufreq/scaling_governor
performance
performance
performance
performance
performance
performance

Oddly, though, CPU speed still seems to be fluctuating:
$ grep MHz /proc/cpuinfo
cpu MHz         : 1255.875
cpu MHz         : 3168.375
cpu MHz         : 3062.125
cpu MHz         : 1468.375
cpu MHz         : 1309.000
cpu MHz         : 2212.125
$ grep MHz /proc/cpuinfo
cpu MHz         : 1255.875
cpu MHz         : 2690.250
cpu MHz         : 1255.875
cpu MHz         : 2530.875
cpu MHz         : 2212.125
cpu MHz         : 1521.500

It does this even if I set scaling_min_freq to 3400000.
Very annoying.  Should I be using a different
scaling driver than intel_pstate?

>> It doesn't look like a slowdown; more like a 1% speedup.
>
> You will need to throw away the first few iterations of
> the test to account for cache warming effects.

You're absolutely right; that's exactly *why* I ran it 24 times and
listed them all separately.  The "1%" number was B.S. and I was not
thinking when I quoted it.

What I had legitimately noticed was that the code with the patch took
slightly fewer cycles most of the time, even after discounting the
first few.  Not statistically significant, but enough to argue that it
didn't cause a noticeable slowdown.
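
To put a number on "discounting the first few", here is a small sketch
that recomputes the totals from the first table in my earlier message
while skipping a warm-up prefix.  The function names and the choice of
8 warm-up runs are assumptions made for illustration (8 is simply where
the counts settle near 31.3M); the cycle counts are copied from that
table.

```c
#include <stddef.h>
#include <stdint.h>

#define NRUNS  24	/* iterations listed in the table */
#define WARMUP 8	/* assumption: runs 0-7 are still warming up */

/* Cycle counts from the first results table (old code, new code). */
static const uint64_t old_cyc[NRUNS] = {
	85009953, 57408829, 52552399, 43595130, 41541760, 36576082,
	35307854, 32182230, 31341596, 31340900, 31344884, 31334144,
	31338992, 31343744, 31339000, 31337492, 31341688, 31341804,
	31339936, 31323228, 31339744, 31321536, 31340280, 31332056,
};
static const uint64_t new_cyc[NRUNS] = {
	71812457, 63361572, 49195266, 45988364, 39714198, 38021344,
	34150656, 33134236, 31307004, 31329408, 31329144, 31312492,
	31330356, 31311344, 31340196, 31313988, 31334040, 31308936,
	31332020, 31324240, 31331768, 31332688, 31335212, 31335768,
};

/* Sum of (new - old) over runs [skip, n); negative means new is faster. */
static int64_t delta_from(const uint64_t *o, const uint64_t *n_,
			  size_t n, size_t skip)
{
	int64_t sum = 0;
	size_t i;

	for (i = skip; i < n; i++)
		sum += (int64_t)n_[i] - (int64_t)o[i];
	return sum;
}

/* Number of runs in [skip, n) where the new code took fewer cycles. */
static size_t faster_from(const uint64_t *o, const uint64_t *n_,
			  size_t n, size_t skip)
{
	size_t count = 0, i;

	for (i = skip; i < n; i++)
		if (n_[i] < o[i])
			count++;
	return count;
}
```

With skip = 0 this reproduces the -8988564 total from the table.  With
skip = WARMUP, delta_from() gives -192420 cycles over 16 runs, about
0.04% of the old total, and faster_from() reports 12 of the 16 runs in
favour of the new code.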


Anyway, two iterations each of "modprobe tcrypt mode=319".

Old code:
[ 1530.513529] 
[ 1530.513529] testing speed of crc32c
[ 1530.513535] test  0 (   16 byte blocks,   16 bytes per update,   1 updates):     75 cycles/operation,    4 cycles/byte
[ 1530.513537] test  1 (   64 byte blocks,   16 bytes per update,   4 updates):    413 cycles/operation,    6 cycles/byte
[ 1530.513540] test  2 (   64 byte blocks,   64 bytes per update,   1 updates):     88 cycles/operation,    1 cycles/byte
[ 1530.513542] test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   1327 cycles/operation,    5 cycles/byte
[ 1530.513548] test  4 (  256 byte blocks,   64 bytes per update,   4 updates):    503 cycles/operation,    1 cycles/byte
[ 1530.513551] test  5 (  256 byte blocks,  256 bytes per update,   1 updates):    178 cycles/operation,    0 cycles/byte
[ 1530.513553] test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):   4972 cycles/operation,    4 cycles/byte
[ 1530.513572] test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):    806 cycles/operation,    0 cycles/byte
[ 1530.513576] test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):    370 cycles/operation,    0 cycles/byte
[ 1530.513579] test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):   9835 cycles/operation,    4 cycles/byte
[ 1530.513615] test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):   1461 cycles/operation,    0 cycles/byte
[ 1530.513622] test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):    847 cycles/operation,    0 cycles/byte
[ 1530.513626] test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):    495 cycles/operation,    0 cycles/byte
[ 1530.513630] test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  19571 cycles/operation,    4 cycles/byte
[ 1530.513700] test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):   2758 cycles/operation,    0 cycles/byte
[ 1530.513711] test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):   1676 cycles/operation,    0 cycles/byte
[ 1530.513718] test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):    859 cycles/operation,    0 cycles/byte
[ 1530.513722] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39012 cycles/operation,    4 cycles/byte
[ 1530.513861] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5417 cycles/operation,    0 cycles/byte
[ 1530.513882] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3162 cycles/operation,    0 cycles/byte
[ 1530.513894] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1678 cycles/operation,    0 cycles/byte
[ 1530.513901] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1653 cycles/operation,    0 cycles/byte

[ 1662.359717] 
[ 1662.359717] testing speed of crc32c
[ 1662.359723] test  0 (   16 byte blocks,   16 bytes per update,   1 updates):     80 cycles/operation,    5 cycles/byte
[ 1662.359725] test  1 (   64 byte blocks,   16 bytes per update,   4 updates):    430 cycles/operation,    6 cycles/byte
[ 1662.359729] test  2 (   64 byte blocks,   64 bytes per update,   1 updates):     81 cycles/operation,    1 cycles/byte
[ 1662.359730] test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   1324 cycles/operation,    5 cycles/byte
[ 1662.359736] test  4 (  256 byte blocks,   64 bytes per update,   4 updates):    503 cycles/operation,    1 cycles/byte
[ 1662.359740] test  5 (  256 byte blocks,  256 bytes per update,   1 updates):    171 cycles/operation,    0 cycles/byte
[ 1662.359741] test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):   4983 cycles/operation,    4 cycles/byte
[ 1662.359760] test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):    832 cycles/operation,    0 cycles/byte
[ 1662.359764] test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):    366 cycles/operation,    0 cycles/byte
[ 1662.359768] test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):   9839 cycles/operation,    4 cycles/byte
[ 1662.359804] test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):   1437 cycles/operation,    0 cycles/byte
[ 1662.359810] test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):    862 cycles/operation,    0 cycles/byte
[ 1662.359815] test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):    494 cycles/operation,    0 cycles/byte
[ 1662.359818] test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  19553 cycles/operation,    4 cycles/byte
[ 1662.359901] test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):   2761 cycles/operation,    0 cycles/byte
[ 1662.359912] test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):   1715 cycles/operation,    0 cycles/byte
[ 1662.359919] test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):    852 cycles/operation,    0 cycles/byte
[ 1662.359928] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39016 cycles/operation,    4 cycles/byte
[ 1662.360069] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5538 cycles/operation,    0 cycles/byte
[ 1662.360090] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3280 cycles/operation,    0 cycles/byte
[ 1662.360102] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1695 cycles/operation,    0 cycles/byte
[ 1662.360110] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1639 cycles/operation,    0 cycles/byte

New code:
[  710.814463] 
[  710.814463] testing speed of crc32c
[  710.814469] test  0 (   16 byte blocks,   16 bytes per update,   1 updates):     80 cycles/operation,    5 cycles/byte
[  710.814472] test  1 (   64 byte blocks,   16 bytes per update,   4 updates):    410 cycles/operation,    6 cycles/byte
[  710.814476] test  2 (   64 byte blocks,   64 bytes per update,   1 updates):     94 cycles/operation,    1 cycles/byte
[  710.814477] test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   1327 cycles/operation,    5 cycles/byte
[  710.814483] test  4 (  256 byte blocks,   64 bytes per update,   4 updates):    492 cycles/operation,    1 cycles/byte
[  710.814486] test  5 (  256 byte blocks,  256 bytes per update,   1 updates):    175 cycles/operation,    0 cycles/byte
[  710.814488] test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):   4970 cycles/operation,    4 cycles/byte
[  710.814507] test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):    797 cycles/operation,    0 cycles/byte
[  710.814511] test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):    370 cycles/operation,    0 cycles/byte
[  710.814514] test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):   9846 cycles/operation,    4 cycles/byte
[  710.814551] test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):   1452 cycles/operation,    0 cycles/byte
[  710.814557] test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):    840 cycles/operation,    0 cycles/byte
[  710.814561] test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):    497 cycles/operation,    0 cycles/byte
[  710.814564] test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  19563 cycles/operation,    4 cycles/byte
[  710.814635] test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):   2764 cycles/operation,    0 cycles/byte
[  710.814646] test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):   1646 cycles/operation,    0 cycles/byte
[  710.814653] test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):    858 cycles/operation,    0 cycles/byte
[  710.814657] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39020 cycles/operation,    4 cycles/byte
[  710.814796] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5422 cycles/operation,    0 cycles/byte
[  710.814816] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3182 cycles/operation,    0 cycles/byte
[  710.814829] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1669 cycles/operation,    0 cycles/byte
[  710.814836] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1636 cycles/operation,    0 cycles/byte

[ 1751.451733] 
[ 1751.451733] testing speed of crc32c
[ 1751.451739] test  0 (   16 byte blocks,   16 bytes per update,   1 updates):     75 cycles/operation,    4 cycles/byte
[ 1751.451741] test  1 (   64 byte blocks,   16 bytes per update,   4 updates):    414 cycles/operation,    6 cycles/byte
[ 1751.451745] test  2 (   64 byte blocks,   64 bytes per update,   1 updates):     87 cycles/operation,    1 cycles/byte
[ 1751.451746] test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   1329 cycles/operation,    5 cycles/byte
[ 1751.451752] test  4 (  256 byte blocks,   64 bytes per update,   4 updates):    499 cycles/operation,    1 cycles/byte
[ 1751.451756] test  5 (  256 byte blocks,  256 bytes per update,   1 updates):    170 cycles/operation,    0 cycles/byte
[ 1751.451757] test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):   4964 cycles/operation,    4 cycles/byte
[ 1751.451776] test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):    836 cycles/operation,    0 cycles/byte
[ 1751.451780] test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):    370 cycles/operation,    0 cycles/byte
[ 1751.451784] test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):   9844 cycles/operation,    4 cycles/byte
[ 1751.451820] test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):   1468 cycles/operation,    0 cycles/byte
[ 1751.451826] test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):    835 cycles/operation,    0 cycles/byte
[ 1751.451830] test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):    493 cycles/operation,    0 cycles/byte
[ 1751.451834] test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  19564 cycles/operation,    4 cycles/byte
[ 1751.451904] test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):   2776 cycles/operation,    0 cycles/byte
[ 1751.451915] test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):   1662 cycles/operation,    0 cycles/byte
[ 1751.451922] test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):    858 cycles/operation,    0 cycles/byte
[ 1751.451927] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39531 cycles/operation,    4 cycles/byte
[ 1751.452067] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5427 cycles/operation,    0 cycles/byte
[ 1751.452088] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3175 cycles/operation,    0 cycles/byte
[ 1751.452100] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1666 cycles/operation,    0 cycles/byte
[ 1751.452107] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1634 cycles/operation,    0 cycles/byte

The tests are pretty short, but there's no obvious slowdown, particularly
on the tests with > 200 bytes per update, where the modified code paths
are found.
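
The cycles/byte column above appears to be an integer quotient (75
cycles over 16 bytes prints as 4), so the large-update tests all show 0.
A sketch for comparing those rows from the cycles/operation column
instead; the helper names are hypothetical and are not part of tcrypt:

```c
/* Cycles per byte as a floating-point ratio, avoiding truncation to 0. */
static double cycles_per_byte(unsigned long cycles, unsigned long bytes)
{
	return (double)cycles / (double)bytes;
}

/* Relative change from old to new, in percent; negative means faster. */
static double percent_change(unsigned long old_cycles,
			     unsigned long new_cycles)
{
	return 100.0 * ((double)new_cycles - (double)old_cycles) /
	       (double)old_cycles;
}
```

For test 21 (8192 bytes in one update), the first old run at 1653
cycles and the first new run at 1636 cycles work out to roughly 0.202
and 0.200 cycles/byte, a change of about -1.0%.  The second old run
came in at 1639, so that gap is within run-to-run noise.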

Of course, whether the timing is valid is an interesting question.


* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-29 23:54       ` George Spelvin
@ 2014-05-30  1:07         ` Tim Chen
  2014-05-30  1:16           ` Dave Jones
  2014-05-30  1:37           ` George Spelvin
  0 siblings, 2 replies; 29+ messages in thread
From: Tim Chen @ 2014-05-30  1:07 UTC (permalink / raw)
  To: George Spelvin; +Cc: herbert, james.guilford, JBeulich, linux-kernel

On Thu, 2014-05-29 at 19:54 -0400, George Spelvin wrote:
> Sorry for the delay; my Ivy Bridge test machine isn't in my
> office and getting to the console to tweak the BIOS is a
> bit of a bother.
> 
> Anyway, i7-4930K, turbo boost & hyperthreading disabled,
> $ cat /sys/devices/system/cpu/cpu?/cpufreq/scaling_governor
> performance
> performance
> performance
> performance
> performance
> performance
> 
> Oddly, though, CPU speed still seems to be fluctuating:
> $ grep MHz /proc/cpuinfo
> cpu MHz         : 1255.875
> cpu MHz         : 3168.375
> cpu MHz         : 3062.125
> cpu MHz         : 1468.375
> cpu MHz         : 1309.000
> cpu MHz         : 2212.125
> $ grep MHz /proc/cpuinfo
> cpu MHz         : 1255.875
> cpu MHz         : 2690.250
> cpu MHz         : 1255.875
> cpu MHz         : 2530.875
> cpu MHz         : 2212.125
> cpu MHz         : 1521.500

This is odd.  On my Ivy Bridge system the CPU speed from /proc/cpuinfo 
is at max freq once I set the performance governor.  
The numbers above almost look like
the cpu frequency is fluctuating and an average is taken.
What version of the kernel are you running?  Is 
CONFIG_CPU_FREQ_GOV_PERFORMANCE compiled in?

Does /sys/devices/system/cpu/cpu?/cpufreq/scaling_cur_freq
also change?

Can you check which governors and frequencies are available
on your system?

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies

If the userspace governor is available, you can try setting the governor
to userspace, then pinning the frequency to 3400 MHz (assuming that's your
max) with commands like:

i=0
num_cpus=`cat /proc/cpuinfo| grep "^processor"| wc -l `
while [ $i -lt $num_cpus ]
do
  echo userspace > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor
  echo 3400000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_setspeed
  i=`expr $i + 1`
done


> 
> It does this even if I set scaling_min_freq to 3400000.
> Very annoying.  Should I be using a different
> scaling driver than intel_pstate?
> 
> >> It doesn't look like a slowdown; more like a 1% speedup.
> >
> > You will need to throw away the first few iterations of
> > the test to account for cache warming effects.
> 
> You're absolutely right; that's exactly *why* I ran it 24 times and
> listed them all separately.  The "1%" number was B.S. and I was not
> thinking when I quoted it.
> 
> What I had legitimately noticed was that the code with the patch took
> slightly fewer cycles most of the time, even after discounting the
> first few.  Not statistically significant, but enough to argue that it
> didn't cause a noticeable slowdown.
> 
> 
> Anyway, two iterations each of "modprobe tcrypt mode=319".
> 
> Old code:
> [ 1530.513529] 
> [ 1530.513529] testing speed of crc32c
> [ 1530.513535] test  0 (   16 byte blocks,   16 bytes per update,   1 updates):     75 cycles/operation,    4 cycles/byte
> [ 1530.513537] test  1 (   64 byte blocks,   16 bytes per update,   4 updates):    413 cycles/operation,    6 cycles/byte
> [ 1530.513540] test  2 (   64 byte blocks,   64 bytes per update,   1 updates):     88 cycles/operation,    1 cycles/byte
> [ 1530.513542] test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   1327 cycles/operation,    5 cycles/byte
> [ 1530.513548] test  4 (  256 byte blocks,   64 bytes per update,   4 updates):    503 cycles/operation,    1 cycles/byte
> [ 1530.513551] test  5 (  256 byte blocks,  256 bytes per update,   1 updates):    178 cycles/operation,    0 cycles/byte
> [ 1530.513553] test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):   4972 cycles/operation,    4 cycles/byte
> [ 1530.513572] test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):    806 cycles/operation,    0 cycles/byte
> [ 1530.513576] test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):    370 cycles/operation,    0 cycles/byte
> [ 1530.513579] test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):   9835 cycles/operation,    4 cycles/byte
> [ 1530.513615] test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):   1461 cycles/operation,    0 cycles/byte
> [ 1530.513622] test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):    847 cycles/operation,    0 cycles/byte
> [ 1530.513626] test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):    495 cycles/operation,    0 cycles/byte
> [ 1530.513630] test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  19571 cycles/operation,    4 cycles/byte
> [ 1530.513700] test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):   2758 cycles/operation,    0 cycles/byte
> [ 1530.513711] test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):   1676 cycles/operation,    0 cycles/byte
> [ 1530.513718] test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):    859 cycles/operation,    0 cycles/byte
> [ 1530.513722] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39012 cycles/operation,    4 cycles/byte
> [ 1530.513861] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5417 cycles/operation,    0 cycles/byte
> [ 1530.513882] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3162 cycles/operation,    0 cycles/byte
> [ 1530.513894] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1678 cycles/operation,    0 cycles/byte
> [ 1530.513901] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1653 cycles/operation,    0 cycles/byte
> 
> [ 1662.359717] 
> [ 1662.359717] testing speed of crc32c
> [ 1662.359723] test  0 (   16 byte blocks,   16 bytes per update,   1 updates):     80 cycles/operation,    5 cycles/byte
> [ 1662.359725] test  1 (   64 byte blocks,   16 bytes per update,   4 updates):    430 cycles/operation,    6 cycles/byte
> [ 1662.359729] test  2 (   64 byte blocks,   64 bytes per update,   1 updates):     81 cycles/operation,    1 cycles/byte
> [ 1662.359730] test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   1324 cycles/operation,    5 cycles/byte
> [ 1662.359736] test  4 (  256 byte blocks,   64 bytes per update,   4 updates):    503 cycles/operation,    1 cycles/byte
> [ 1662.359740] test  5 (  256 byte blocks,  256 bytes per update,   1 updates):    171 cycles/operation,    0 cycles/byte
> [ 1662.359741] test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):   4983 cycles/operation,    4 cycles/byte
> [ 1662.359760] test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):    832 cycles/operation,    0 cycles/byte
> [ 1662.359764] test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):    366 cycles/operation,    0 cycles/byte
> [ 1662.359768] test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):   9839 cycles/operation,    4 cycles/byte
> [ 1662.359804] test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):   1437 cycles/operation,    0 cycles/byte
> [ 1662.359810] test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):    862 cycles/operation,    0 cycles/byte
> [ 1662.359815] test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):    494 cycles/operation,    0 cycles/byte
> [ 1662.359818] test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  19553 cycles/operation,    4 cycles/byte
> [ 1662.359901] test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):   2761 cycles/operation,    0 cycles/byte
> [ 1662.359912] test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):   1715 cycles/operation,    0 cycles/byte
> [ 1662.359919] test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):    852 cycles/operation,    0 cycles/byte
> [ 1662.359928] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39016 cycles/operation,    4 cycles/byte
> [ 1662.360069] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5538 cycles/operation,    0 cycles/byte
> [ 1662.360090] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3280 cycles/operation,    0 cycles/byte
> [ 1662.360102] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1695 cycles/operation,    0 cycles/byte
> [ 1662.360110] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1639 cycles/operation,    0 cycles/byte
> 
> New code:
> [  710.814463] 
> [  710.814463] testing speed of crc32c
> [  710.814469] test  0 (   16 byte blocks,   16 bytes per update,   1 updates):     80 cycles/operation,    5 cycles/byte
> [  710.814472] test  1 (   64 byte blocks,   16 bytes per update,   4 updates):    410 cycles/operation,    6 cycles/byte
> [  710.814476] test  2 (   64 byte blocks,   64 bytes per update,   1 updates):     94 cycles/operation,    1 cycles/byte
> [  710.814477] test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   1327 cycles/operation,    5 cycles/byte
> [  710.814483] test  4 (  256 byte blocks,   64 bytes per update,   4 updates):    492 cycles/operation,    1 cycles/byte
> [  710.814486] test  5 (  256 byte blocks,  256 bytes per update,   1 updates):    175 cycles/operation,    0 cycles/byte
> [  710.814488] test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):   4970 cycles/operation,    4 cycles/byte
> [  710.814507] test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):    797 cycles/operation,    0 cycles/byte
> [  710.814511] test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):    370 cycles/operation,    0 cycles/byte
> [  710.814514] test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):   9846 cycles/operation,    4 cycles/byte
> [  710.814551] test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):   1452 cycles/operation,    0 cycles/byte
> [  710.814557] test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):    840 cycles/operation,    0 cycles/byte
> [  710.814561] test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):    497 cycles/operation,    0 cycles/byte
> [  710.814564] test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  19563 cycles/operation,    4 cycles/byte
> [  710.814635] test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):   2764 cycles/operation,    0 cycles/byte
> [  710.814646] test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):   1646 cycles/operation,    0 cycles/byte
> [  710.814653] test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):    858 cycles/operation,    0 cycles/byte
> [  710.814657] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39020 cycles/operation,    4 cycles/byte
> [  710.814796] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5422 cycles/operation,    0 cycles/byte
> [  710.814816] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3182 cycles/operation,    0 cycles/byte
> [  710.814829] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1669 cycles/operation,    0 cycles/byte
> [  710.814836] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1636 cycles/operation,    0 cycles/byte
> 
> [ 1751.451733] 
> [ 1751.451733] testing speed of crc32c
> [ 1751.451739] test  0 (   16 byte blocks,   16 bytes per update,   1 updates):     75 cycles/operation,    4 cycles/byte
> [ 1751.451741] test  1 (   64 byte blocks,   16 bytes per update,   4 updates):    414 cycles/operation,    6 cycles/byte
> [ 1751.451745] test  2 (   64 byte blocks,   64 bytes per update,   1 updates):     87 cycles/operation,    1 cycles/byte
> [ 1751.451746] test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   1329 cycles/operation,    5 cycles/byte
> [ 1751.451752] test  4 (  256 byte blocks,   64 bytes per update,   4 updates):    499 cycles/operation,    1 cycles/byte
> [ 1751.451756] test  5 (  256 byte blocks,  256 bytes per update,   1 updates):    170 cycles/operation,    0 cycles/byte
> [ 1751.451757] test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):   4964 cycles/operation,    4 cycles/byte
> [ 1751.451776] test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):    836 cycles/operation,    0 cycles/byte
> [ 1751.451780] test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):    370 cycles/operation,    0 cycles/byte
> [ 1751.451784] test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):   9844 cycles/operation,    4 cycles/byte
> [ 1751.451820] test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):   1468 cycles/operation,    0 cycles/byte
> [ 1751.451826] test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):    835 cycles/operation,    0 cycles/byte
> [ 1751.451830] test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):    493 cycles/operation,    0 cycles/byte
> [ 1751.451834] test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  19564 cycles/operation,    4 cycles/byte
> [ 1751.451904] test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):   2776 cycles/operation,    0 cycles/byte
> [ 1751.451915] test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):   1662 cycles/operation,    0 cycles/byte
> [ 1751.451922] test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):    858 cycles/operation,    0 cycles/byte
> [ 1751.451927] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39531 cycles/operation,    4 cycles/byte
> [ 1751.452067] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5427 cycles/operation,    0 cycles/byte
> [ 1751.452088] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3175 cycles/operation,    0 cycles/byte
> [ 1751.452100] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1666 cycles/operation,    0 cycles/byte
> [ 1751.452107] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1634 cycles/operation,    0 cycles/byte
> 
> The tests are pretty short, but there's no obvious slowdown, particularly
> on the tests with > 200 bytes per update, where the modified code paths are
> found.

So far, the numbers look good.

BTW, why do you place the K table in .text, instead of .rodata? 

Thanks.

Tim




* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-30  1:07         ` Tim Chen
@ 2014-05-30  1:16           ` Dave Jones
  2014-05-30 17:56             ` Tim Chen
  2014-05-30  1:37           ` George Spelvin
  1 sibling, 1 reply; 29+ messages in thread
From: Dave Jones @ 2014-05-30  1:16 UTC (permalink / raw)
  To: Tim Chen; +Cc: George Spelvin, herbert, james.guilford, JBeulich, linux-kernel

On Thu, May 29, 2014 at 06:07:16PM -0700, Tim Chen wrote:
 > On Thu, 2014-05-29 at 19:54 -0400, George Spelvin wrote:
 > > Sorry for the delay; my Ivy Bridge test machine isn't in my
 > > office and getting to the console to tweak the BIOS is a
 > > bit of a bother.
 > > 
 > > Anyway, i7-4930K, turbo boost & hyperthreading disabled,
 > > $ cat /sys/devices/system/cpu/cpu?/cpufreq/scaling_governor
 > > performance
 > > performance
 > > performance
 > > performance
 > > performance
 > > performance
 > > 
 > > Oddly, though, CPU speed still seems to be fluctuating:
 > > $ grep MHz /proc/cpuinfo
 > > cpu MHz         : 1255.875
 > > cpu MHz         : 3168.375
 > > cpu MHz         : 3062.125
 > > cpu MHz         : 1468.375
 > > cpu MHz         : 1309.000
 > > cpu MHz         : 2212.125
 > > $ grep MHz /proc/cpuinfo
 > > cpu MHz         : 1255.875
 > > cpu MHz         : 2690.250
 > > cpu MHz         : 1255.875
 > > cpu MHz         : 2530.875
 > > cpu MHz         : 2212.125
 > > cpu MHz         : 1521.500
 > 
 > This is odd.  On my Ivy Bridge system the CPU speed from /proc/cpuinfo 
 > is at max freq once I set the performance governor.  
 > The numbers above almost look like
 > the cpu frequency is fluctuating and an average is taken.
 > What version of the kernel are you running?  Is 
 > CONFIG_CPU_FREQ_GOV_PERFORMANCE compiled in?
 > 
 > Does /sys/devices/system/cpu/cpu?/cpufreq/scaling_cur_freq
 > also change?
 > 
 > Can you check what are the available governors in your system
 > and available frequencies?
 > 
 > cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
 > cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
 > 
 > If userspace governor is available, you can try set the governor
 > to userspace, then pin frequency to 3400 MHz (assuming that's your
 > max) with command like:
 
intel_pstate overrides any governor choice you make through sysfs.
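
A quick way to see which driver is in charge is to read the cpufreq
sysfs attributes directly.  The following is only a minimal sketch: the
helper name cpufreq_info and its parameters are made up for illustration,
and it assumes the usual /sys/devices/system/cpu/cpuN/cpufreq layout shown
elsewhere in this thread.  Attributes that do not exist are reported as
None rather than treated as errors.

```python
from pathlib import Path

# Hypothetical helper (not part of any kernel tool): collect the cpufreq
# attributes that tell you which scaling driver and governor are active.
def cpufreq_info(cpu=0, root="/sys/devices/system/cpu"):
    base = Path(root) / f"cpu{cpu}" / "cpufreq"
    info = {}
    for name in ("scaling_driver", "scaling_governor",
                 "scaling_available_governors"):
        path = base / name
        # Some attributes are absent depending on the driver in use.
        info[name] = path.read_text().strip() if path.is_file() else None
    return info

if __name__ == "__main__":
    for name, value in cpufreq_info().items():
        print(f"{name}: {value}")
```

If scaling_driver reports intel_pstate, the governor setting should be
read with the caveat above in mind.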

	Dave



* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-30  1:07         ` Tim Chen
  2014-05-30  1:16           ` Dave Jones
@ 2014-05-30  1:37           ` George Spelvin
  2014-05-30  5:25             ` George Spelvin
  1 sibling, 1 reply; 29+ messages in thread
From: George Spelvin @ 2014-05-30  1:37 UTC (permalink / raw)
  To: linux, tim.c.chen; +Cc: herbert, james.guilford, JBeulich, linux-kernel

> This is odd.  On my Ivy Bridge system the CPU speed from /proc/cpuinfo 
> is at max freq once I set the performance governor.  
> The numbers above almost look like
> the cpu frequency is fluctuating and an average is taken.
> What version of the kernel are you running?  Is 
> CONFIG_CPU_FREQ_GOV_PERFORMANCE compiled in?

Yes; I have

CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_COMMON=y
CONFIG_CPU_FREQ_STAT=y
# CONFIG_CPU_FREQ_STAT_DETAILS is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
# CONFIG_CPU_FREQ_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

However, scaling_available_governors only lists "performance powersave".

> Does /sys/devices/system/cpu/cpu?/cpufreq/scaling_cur_freq
> also change?

That file does not exist.  However,
/sys/devices/system/cpu/cpu?/cpufreq/cpuinfo_cur_freq
exists and changes.  Several snapshots:

	Snap1	Snap2	Snap3	Snap4
cpu0	1255875	1255875	1255875	1255875
cpu1	1202750	1202750	1202750	1415250
cpu2	1680875	1255875	1468375	1468375
cpu3	1202750	1255875	1521500	1521500
cpu4	1946500	1255875	1255875	1255875
cpu5	2690250	2371500	1946500	1734000

> Can you check what are the available governors in your system
> and available frequencies?

> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
performance powersave
> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
cat: /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies: No such file or directory
$ ls /sys/devices/system/cpu/cpu0/cpufreq/
affected_cpus     cpuinfo_transition_latency   scaling_governor
cpuinfo_cur_freq  related_cpus                 scaling_max_freq
cpuinfo_max_freq  scaling_available_governors  scaling_min_freq
cpuinfo_min_freq  scaling_driver               scaling_setspeed
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
<unsupported>

> If userspace governor is available, you can try set the governor
> to userspace, then pin frequency to 3400 MHz (assuming that's your
> max) with command like:

I'll have to recompile and reboot, but sure.

Do you want me to change from the intel_pstate driver while I'm at it?

> BTW, why do you place the K table in .text, instead of .rodata? 

Because the jump table before it was in .text, and if I try to move
*that* to .rodata I get a linker error.  So I just put the K_table
right next to it.

However, it's all moot: my current v3 does move K_table to .rodata.


* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-30  1:37           ` George Spelvin
@ 2014-05-30  5:25             ` George Spelvin
  2014-05-30 16:10               ` Tim Chen
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2014-05-30  5:25 UTC (permalink / raw)
  To: linux, tim.c.chen; +Cc: herbert, james.guilford, JBeulich, linux-kernel

Okay, recompiled with the acpi-cpufreq driver, so the performance governor
actually works, pegging the frequency at 3900 MHz.

Existing (old) code:
[  455.641397] 
[  455.641397] testing speed of crc32c
[  455.641403] test  0 (   16 byte blocks,   16 bytes per update,   1 updates):     73 cycles/operation,    4 cycles/byte
[  455.641406] test  1 (   64 byte blocks,   16 bytes per update,   4 updates):    418 cycles/operation,    6 cycles/byte
[  455.641409] test  2 (   64 byte blocks,   64 bytes per update,   1 updates):     89 cycles/operation,    1 cycles/byte
[  455.641411] test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   1330 cycles/operation,    5 cycles/byte
[  455.641417] test  4 (  256 byte blocks,   64 bytes per update,   4 updates):    502 cycles/operation,    1 cycles/byte
[  455.641420] test  5 (  256 byte blocks,  256 bytes per update,   1 updates):    170 cycles/operation,    0 cycles/byte
[  455.641422] test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):   4971 cycles/operation,    4 cycles/byte
[  455.641440] test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):    805 cycles/operation,    0 cycles/byte
[  455.641445] test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):    371 cycles/operation,    0 cycles/byte
[  455.641448] test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):   9839 cycles/operation,    4 cycles/byte
[  455.641484] test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):   1436 cycles/operation,    0 cycles/byte
[  455.641490] test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):    824 cycles/operation,    0 cycles/byte
[  455.641494] test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):    494 cycles/operation,    0 cycles/byte
[  455.641498] test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  19561 cycles/operation,    4 cycles/byte
[  455.641568] test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):   2757 cycles/operation,    0 cycles/byte
[  455.641579] test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):   1633 cycles/operation,    0 cycles/byte
[  455.641586] test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):    861 cycles/operation,    0 cycles/byte
[  455.641590] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39015 cycles/operation,    4 cycles/byte
[  455.641729] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5412 cycles/operation,    0 cycles/byte
[  455.641749] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3106 cycles/operation,    0 cycles/byte
[  455.641762] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1656 cycles/operation,    0 cycles/byte
[  455.641769] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1639 cycles/operation,    0 cycles/byte
[  480.885336] 
[  480.885336] testing speed of crc32c
[  480.885342] test  0 (   16 byte blocks,   16 bytes per update,   1 updates):     81 cycles/operation,    5 cycles/byte
[  480.885345] test  1 (   64 byte blocks,   16 bytes per update,   4 updates):    426 cycles/operation,    6 cycles/byte
[  480.885348] test  2 (   64 byte blocks,   64 bytes per update,   1 updates):     96 cycles/operation,    1 cycles/byte
[  480.885350] test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   1331 cycles/operation,    5 cycles/byte
[  480.885356] test  4 (  256 byte blocks,   64 bytes per update,   4 updates):    497 cycles/operation,    1 cycles/byte
[  480.885359] test  5 (  256 byte blocks,  256 bytes per update,   1 updates):    179 cycles/operation,    0 cycles/byte
[  480.885361] test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):   4961 cycles/operation,    4 cycles/byte
[  480.885380] test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):    795 cycles/operation,    0 cycles/byte
[  480.885384] test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):    366 cycles/operation,    0 cycles/byte
[  480.885387] test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):   9827 cycles/operation,    4 cycles/byte
[  480.885423] test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):   1445 cycles/operation,    0 cycles/byte
[  480.885430] test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):    834 cycles/operation,    0 cycles/byte
[  480.885434] test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):    495 cycles/operation,    0 cycles/byte
[  480.885437] test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  19560 cycles/operation,    4 cycles/byte
[  480.885507] test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):   2767 cycles/operation,    0 cycles/byte
[  480.885518] test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):   1643 cycles/operation,    0 cycles/byte
[  480.885525] test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):    862 cycles/operation,    0 cycles/byte
[  480.885530] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39013 cycles/operation,    4 cycles/byte
[  480.885669] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5417 cycles/operation,    0 cycles/byte
[  480.885689] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3113 cycles/operation,    0 cycles/byte
[  480.885701] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1665 cycles/operation,    0 cycles/byte
[  480.885708] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1646 cycles/operation,    0 cycles/byte

Proposed (new) code:
[  800.253907] 
[  800.253907] testing speed of crc32c
[  800.253913] test  0 (   16 byte blocks,   16 bytes per update,   1 updates):     75 cycles/operation,    4 cycles/byte
[  800.253915] test  1 (   64 byte blocks,   16 bytes per update,   4 updates):    421 cycles/operation,    6 cycles/byte
[  800.253919] test  2 (   64 byte blocks,   64 bytes per update,   1 updates):     88 cycles/operation,    1 cycles/byte
[  800.253920] test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   1339 cycles/operation,    5 cycles/byte
[  800.253942] test  4 (  256 byte blocks,   64 bytes per update,   4 updates):    511 cycles/operation,    1 cycles/byte
[  800.253945] test  5 (  256 byte blocks,  256 bytes per update,   1 updates):    180 cycles/operation,    0 cycles/byte
[  800.253947] test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):   4972 cycles/operation,    4 cycles/byte
[  800.253966] test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):    789 cycles/operation,    0 cycles/byte
[  800.253970] test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):    371 cycles/operation,    0 cycles/byte
[  800.253973] test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):  10093 cycles/operation,    4 cycles/byte
[  800.254010] test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):   1443 cycles/operation,    0 cycles/byte
[  800.254017] test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):    829 cycles/operation,    0 cycles/byte
[  800.254021] test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):    495 cycles/operation,    0 cycles/byte
[  800.254024] test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  19556 cycles/operation,    4 cycles/byte
[  800.254094] test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):   2762 cycles/operation,    0 cycles/byte
[  800.254105] test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):   1640 cycles/operation,    0 cycles/byte
[  800.254113] test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):    854 cycles/operation,    0 cycles/byte
[  800.254117] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39015 cycles/operation,    4 cycles/byte
[  800.254256] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5415 cycles/operation,    0 cycles/byte
[  800.254276] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3113 cycles/operation,    0 cycles/byte
[  800.254288] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1666 cycles/operation,    0 cycles/byte
[  800.254295] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1638 cycles/operation,    0 cycles/byte
[  808.113346] 
[  808.113346] testing speed of crc32c
[  808.113353] test  0 (   16 byte blocks,   16 bytes per update,   1 updates):     70 cycles/operation,    4 cycles/byte
[  808.113355] test  1 (   64 byte blocks,   16 bytes per update,   4 updates):    432 cycles/operation,    6 cycles/byte
[  808.113359] test  2 (   64 byte blocks,   64 bytes per update,   1 updates):     89 cycles/operation,    1 cycles/byte
[  808.113360] test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   1330 cycles/operation,    5 cycles/byte
[  808.113366] test  4 (  256 byte blocks,   64 bytes per update,   4 updates):    514 cycles/operation,    2 cycles/byte
[  808.113369] test  5 (  256 byte blocks,  256 bytes per update,   1 updates):    171 cycles/operation,    0 cycles/byte
[  808.113371] test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):   4968 cycles/operation,    4 cycles/byte
[  808.113390] test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):    833 cycles/operation,    0 cycles/byte
[  808.113394] test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):    368 cycles/operation,    0 cycles/byte
[  808.113398] test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):   9842 cycles/operation,    4 cycles/byte
[  808.113434] test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):   1462 cycles/operation,    0 cycles/byte
[  808.113440] test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):    827 cycles/operation,    0 cycles/byte
[  808.113444] test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):    494 cycles/operation,    0 cycles/byte
[  808.113448] test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  19556 cycles/operation,    4 cycles/byte
[  808.113518] test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):   2783 cycles/operation,    0 cycles/byte
[  808.113529] test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):   1645 cycles/operation,    0 cycles/byte
[  808.113536] test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):    853 cycles/operation,    0 cycles/byte
[  808.113540] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39019 cycles/operation,    4 cycles/byte
[  808.113679] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5437 cycles/operation,    0 cycles/byte
[  808.113700] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3118 cycles/operation,    0 cycles/byte
[  808.113712] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1652 cycles/operation,    0 cycles/byte
[  808.113719] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1639 cycles/operation,    0 cycles/byte

As you can see, the differences look comparable to the spread in the
measured values.  Considering just the last 4 tests, 18 through 21 (since
only blocks of at least 200 bytes are affected by the change), here are 10
more runs of each; test 17 appears in the output too, but it uses 16-byte
updates:

Existing (old) code:
[ 2168.651975] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39013 cycles/operation,    4 cycles/byte
[ 2168.652114] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5425 cycles/operation,    0 cycles/byte
[ 2168.652134] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3121 cycles/operation,    0 cycles/byte
[ 2168.652146] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1654 cycles/operation,    0 cycles/byte
[ 2168.652153] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1644 cycles/operation,    0 cycles/byte
[ 2168.672956] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39025 cycles/operation,    4 cycles/byte
[ 2168.673095] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5494 cycles/operation,    0 cycles/byte
[ 2168.673116] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3113 cycles/operation,    0 cycles/byte
[ 2168.673157] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1674 cycles/operation,    0 cycles/byte
[ 2168.673169] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1636 cycles/operation,    0 cycles/byte
[ 2168.696197] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39012 cycles/operation,    4 cycles/byte
[ 2168.696336] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5410 cycles/operation,    0 cycles/byte
[ 2168.696356] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3119 cycles/operation,    0 cycles/byte
[ 2168.696368] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1667 cycles/operation,    0 cycles/byte
[ 2168.696375] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1635 cycles/operation,    0 cycles/byte
[ 2168.716198] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39015 cycles/operation,    4 cycles/byte
[ 2168.716337] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5543 cycles/operation,    0 cycles/byte
[ 2168.716358] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3111 cycles/operation,    0 cycles/byte
[ 2168.716370] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1673 cycles/operation,    0 cycles/byte
[ 2168.716377] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1636 cycles/operation,    0 cycles/byte
[ 2168.739520] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39022 cycles/operation,    4 cycles/byte
[ 2168.739659] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5484 cycles/operation,    0 cycles/byte
[ 2168.739680] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3110 cycles/operation,    0 cycles/byte
[ 2168.739692] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1687 cycles/operation,    0 cycles/byte
[ 2168.739699] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1640 cycles/operation,    0 cycles/byte
[ 2168.762814] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39015 cycles/operation,    4 cycles/byte
[ 2168.762953] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5412 cycles/operation,    0 cycles/byte
[ 2168.762973] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3109 cycles/operation,    0 cycles/byte
[ 2168.762985] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1675 cycles/operation,    0 cycles/byte
[ 2168.762992] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1634 cycles/operation,    0 cycles/byte
[ 2168.796244] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39008 cycles/operation,    4 cycles/byte
[ 2168.796383] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5413 cycles/operation,    0 cycles/byte
[ 2168.796403] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3106 cycles/operation,    0 cycles/byte
[ 2168.796415] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1658 cycles/operation,    0 cycles/byte
[ 2168.796422] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1637 cycles/operation,    0 cycles/byte
[ 2168.819616] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39040 cycles/operation,    4 cycles/byte
[ 2168.819757] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5416 cycles/operation,    0 cycles/byte
[ 2168.819777] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3109 cycles/operation,    0 cycles/byte
[ 2168.819814] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1674 cycles/operation,    0 cycles/byte
[ 2168.819823] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1649 cycles/operation,    0 cycles/byte
[ 2168.859652] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39011 cycles/operation,    4 cycles/byte
[ 2168.859806] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5412 cycles/operation,    0 cycles/byte
[ 2168.859826] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3117 cycles/operation,    0 cycles/byte
[ 2168.859841] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1665 cycles/operation,    0 cycles/byte
[ 2168.859850] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1639 cycles/operation,    0 cycles/byte
[ 2168.896378] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39023 cycles/operation,    4 cycles/byte
[ 2168.896532] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5424 cycles/operation,    0 cycles/byte
[ 2168.896554] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3126 cycles/operation,    0 cycles/byte
[ 2168.896567] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1664 cycles/operation,    0 cycles/byte
[ 2168.896574] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1634 cycles/operation,    0 cycles/byte

Proposed (new) code:
[ 2061.715381] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39018 cycles/operation,    4 cycles/byte
[ 2061.715520] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5420 cycles/operation,    0 cycles/byte
[ 2061.715540] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3101 cycles/operation,    0 cycles/byte
[ 2061.715552] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1662 cycles/operation,    0 cycles/byte
[ 2061.715559] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1647 cycles/operation,    0 cycles/byte
[ 2061.734935] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39029 cycles/operation,    4 cycles/byte
[ 2061.735074] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5416 cycles/operation,    0 cycles/byte
[ 2061.735094] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3253 cycles/operation,    0 cycles/byte
[ 2061.735107] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1685 cycles/operation,    0 cycles/byte
[ 2061.735114] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1642 cycles/operation,    0 cycles/byte
[ 2061.761667] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39027 cycles/operation,    4 cycles/byte
[ 2061.761806] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5415 cycles/operation,    0 cycles/byte
[ 2061.761826] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3112 cycles/operation,    0 cycles/byte
[ 2061.761838] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1673 cycles/operation,    0 cycles/byte
[ 2061.761845] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1644 cycles/operation,    0 cycles/byte
[ 2061.781846] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39010 cycles/operation,    4 cycles/byte
[ 2061.781985] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5424 cycles/operation,    0 cycles/byte
[ 2061.782005] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3242 cycles/operation,    0 cycles/byte
[ 2061.782018] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1687 cycles/operation,    0 cycles/byte
[ 2061.782025] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1641 cycles/operation,    0 cycles/byte
[ 2061.801881] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39020 cycles/operation,    4 cycles/byte
[ 2061.802020] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5424 cycles/operation,    0 cycles/byte
[ 2061.802041] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3113 cycles/operation,    0 cycles/byte
[ 2061.802053] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1673 cycles/operation,    0 cycles/byte
[ 2061.802060] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1638 cycles/operation,    0 cycles/byte
[ 2061.822194] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39034 cycles/operation,    4 cycles/byte
[ 2061.822333] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5414 cycles/operation,    0 cycles/byte
[ 2061.822353] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3246 cycles/operation,    0 cycles/byte
[ 2061.822366] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1703 cycles/operation,    0 cycles/byte
[ 2061.822373] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1643 cycles/operation,    0 cycles/byte
[ 2061.842361] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39017 cycles/operation,    4 cycles/byte
[ 2061.842500] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5411 cycles/operation,    0 cycles/byte
[ 2061.842520] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3115 cycles/operation,    0 cycles/byte
[ 2061.842532] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1671 cycles/operation,    0 cycles/byte
[ 2061.842539] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1641 cycles/operation,    0 cycles/byte
[ 2061.875909] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39020 cycles/operation,    4 cycles/byte
[ 2061.876048] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5414 cycles/operation,    0 cycles/byte
[ 2061.876068] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3250 cycles/operation,    0 cycles/byte
[ 2061.876081] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1699 cycles/operation,    0 cycles/byte
[ 2061.876088] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1640 cycles/operation,    0 cycles/byte
[ 2061.899397] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39017 cycles/operation,    4 cycles/byte
[ 2061.899536] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5420 cycles/operation,    0 cycles/byte
[ 2061.899556] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3109 cycles/operation,    0 cycles/byte
[ 2061.899568] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1683 cycles/operation,    0 cycles/byte
[ 2061.899576] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1640 cycles/operation,    0 cycles/byte
[ 2061.922872] test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates):  39016 cycles/operation,    4 cycles/byte
[ 2061.923010] test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):   5538 cycles/operation,    0 cycles/byte
[ 2061.923032] test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):   3248 cycles/operation,    0 cycles/byte
[ 2061.923044] test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):   1683 cycles/operation,    0 cycles/byte
[ 2061.923052] test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):   1640 cycles/operation,    0 cycles/byte

Averaging the 8K bytes per update, I do see an average of 3.2 cycles per
operation (that is, per 8K of data processed) lost, or about 1 cycle per
(3K or less) block processed.  I'm hoping the reduced D-cache pollution
makes it up somewhere else.

Comments?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-30  5:25             ` George Spelvin
@ 2014-05-30 16:10               ` Tim Chen
  2014-05-30 16:52                 ` George Spelvin
  0 siblings, 1 reply; 29+ messages in thread
From: Tim Chen @ 2014-05-30 16:10 UTC (permalink / raw)
  To: George Spelvin; +Cc: herbert, james.guilford, JBeulich, linux-kernel

On Fri, 2014-05-30 at 01:25 -0400, George Spelvin wrote:

> 
> Averaging the 8K bytes per update, I do see an average of 3.2 cycles per
> operation (that is, per 8K of data processed) lost, or about 1 cycle per
> (3K or less) block processed.  I'm hoping the reduced D-cache pollution
> makes it up somewhere else.

That's very small (less than 0.2%) so I think it's acceptable.
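
As a rough check of that figure, here is a minimal sketch that uses only
numbers quoted in this thread: the six "test 21" (8192 bytes per update)
results in the log above and the 3.2-cycle loss George reports.  The
three-blocks-per-8K figure is an assumption derived from the "3K or less"
block size, not something measured here.

```python
# Cycle counts for "test 21" (8192-byte blocks, one update) from the log above.
test21_cycles = [1638, 1643, 1641, 1640, 1640, 1640]
lost_cycles = 3.2  # average loss per 8K operation, as reported by George

mean_cycles = sum(test21_cycles) / len(test21_cycles)
relative_loss = lost_cycles / mean_cycles

# Assumption: an 8192-byte buffer is split into blocks of at most 3072 bytes.
blocks_per_op = -(-8192 // 3072)  # ceiling division
loss_per_block = lost_cycles / blocks_per_op

print(f"mean cycles/operation: {mean_cycles:.1f}")
print(f"relative loss: {relative_loss:.3%}")
print(f"loss per block: {loss_per_block:.2f} cycles")
```

The relative loss stays just under 0.2%, and the per-block loss comes out
at about one cycle, consistent with both statements above.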

Tim



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-30 16:10               ` Tim Chen
@ 2014-05-30 16:52                 ` George Spelvin
  2014-05-30 17:01                   ` Tim Chen
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2014-05-30 16:52 UTC (permalink / raw)
  To: linux, tim.c.chen; +Cc: herbert, james.guilford, JBeulich, linux-kernel

> That's very small (less than 0.2%) so I think it's acceptable.

Thank you!  May I take this as an Acked-by?

I'll work on some performance improvements, but they probably
won't be ready for the 3.16 merge window.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-30 16:52                 ` George Spelvin
@ 2014-05-30 17:01                   ` Tim Chen
  2014-06-07  3:08                     ` [PATCH v3] crypto: crc32c-pclmul - Shrink K_table to 32-bit words George Spelvin
  0 siblings, 1 reply; 29+ messages in thread
From: Tim Chen @ 2014-05-30 17:01 UTC (permalink / raw)
  To: George Spelvin; +Cc: herbert, james.guilford, JBeulich, linux-kernel

On Fri, 2014-05-30 at 12:52 -0400, George Spelvin wrote:
> > That's very small (less than 0.2%) so I think it's acceptable.
> 
> Thank you!  May I take this as an Acked-by?

Yes, with the caveat that you still have to send a v3 of this patch
that moves the K table to rodata.

Tim
> 
> I'll work on some performance improvements, but they probably
> won't be ready for the 3.16 merge window.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-30  1:16           ` Dave Jones
@ 2014-05-30 17:56             ` Tim Chen
  2014-05-30 18:45               ` Dirk Brandewie
  0 siblings, 1 reply; 29+ messages in thread
From: Tim Chen @ 2014-05-30 17:56 UTC (permalink / raw)
  To: Dave Jones, dirk.brandewie
  Cc: George Spelvin, herbert, james.guilford, JBeulich, linux-kernel,
	Jacob jun Pan

On Thu, 2014-05-29 at 21:16 -0400, Dave Jones wrote:
> On Thu, May 29, 2014 at 06:07:16PM -0700, Tim Chen wrote:
>  > On Thu, 2014-05-29 at 19:54 -0400, George Spelvin wrote:
>  > > Sorry for the delay; my Ivy Bridge test machine isn't in my
>  > > office and getting to the console to tweak the BIOS is a
>  > > bit of a bother.
>  > > 
>  > > Anyway, i7-4930K, turbo boost & hyperthreading disabled,
>  > > $ cat /sys/devices/system/cpu/cpu?/cpufreq/scaling_governor
>  > > performance
>  > > performance
>  > > performance
>  > > performance
>  > > performance
>  > > performance
>  > > 
>  > > Oddly, though, CPU speed still seems to be fluctuating:
>  > > $ grep MHz /proc/cpuinfo
>  > > cpu MHz         : 1255.875
>  > > cpu MHz         : 3168.375
>  > > cpu MHz         : 3062.125
>  > > cpu MHz         : 1468.375
>  > > cpu MHz         : 1309.000
>  > > cpu MHz         : 2212.125
>  > > $ grep MHz /proc/cpuinfo
>  > > cpu MHz         : 1255.875
>  > > cpu MHz         : 2690.250
>  > > cpu MHz         : 1255.875
>  > > cpu MHz         : 2530.875
>  > > cpu MHz         : 2212.125
>  > > cpu MHz         : 1521.500
>  > 
>  > This is odd.  On my Ivy Bridge system the CPU speed from /proc/cpuinfo 
>  > is at max freq once I set the performance governor.  
>  > The numbers above almost look like
>  > the cpu frequency is fluctuating and an average is taken.
>  > What version of the kernel are you running?  Is 
>  > CONFIG_CPU_FREQ_GOV_PERFORMANCE compiled in?
>  > 
>  > Does /sys/devices/system/cpu/cpu?/cpufreq/scaling_cur_freq
>  > also change?
>  > 
>  > Can you check what are the available governors in your system
>  > and available frequencies?
>  > 
>  > cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
>  > cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
>  > 
>  > If the userspace governor is available, you can try setting the governor
>  > to userspace, then pin the frequency to 3400 MHz (assuming that's your
>  > max) with a command like:
>  
> intel_pstate overrides any governor choice you make through sysfs.
> 
> 	Dave
> 

Dirk,

I wonder whether this is the right behavior for intel_pstate: when I set
the governor to performance, the intel_pstate driver still moves
the cpu frequencies around.

Turbostat also confirms that the frequencies are not at max,
even though max_perf_pct and min_perf_pct are both set to 100.

I ran on my HSW system with a 3.15-rc7 kernel and see an issue
similar to the one George reported.

It is really a pain when we need to do performance benchmarking and 
need to have a constant cpu frequency.  

Thanks.

Tim


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-30 17:56             ` Tim Chen
@ 2014-05-30 18:45               ` Dirk Brandewie
  2014-05-30 19:32                 ` Tim Chen
  0 siblings, 1 reply; 29+ messages in thread
From: Dirk Brandewie @ 2014-05-30 18:45 UTC (permalink / raw)
  To: Tim Chen, Dave Jones
  Cc: dirk.brandewie, George Spelvin, herbert, james.guilford,
	JBeulich, linux-kernel, Jacob jun Pan

On 05/30/2014 10:56 AM, Tim Chen wrote:
> On Thu, 2014-05-29 at 21:16 -0400, Dave Jones wrote:
>> On Thu, May 29, 2014 at 06:07:16PM -0700, Tim Chen wrote:
>>   > On Thu, 2014-05-29 at 19:54 -0400, George Spelvin wrote:
>>   > > Sorry for the delay; my Ivy Bridge test machine isn't in my
>>   > > office and getting to the console to tweak the BIOS is a
>>   > > bit of a bother.
>>   > >
>>   > > Anyway, i7-4930K, turbo boost & hyperthreading disabled,
>>   > > $ cat /sys/devices/system/cpu/cpu?/cpufreq/scaling_governor
>>   > > performance
>>   > > performance
>>   > > performance
>>   > > performance
>>   > > performance
>>   > > performance
>>   > >
>>   > > Oddly, though, CPU speed still seems to be fluctuating:
>>   > > $ grep MHz /proc/cpuinfo
>>   > > cpu MHz         : 1255.875
>>   > > cpu MHz         : 3168.375
>>   > > cpu MHz         : 3062.125
>>   > > cpu MHz         : 1468.375
>>   > > cpu MHz         : 1309.000
>>   > > cpu MHz         : 2212.125
>>   > > $ grep MHz /proc/cpuinfo
>>   > > cpu MHz         : 1255.875
>>   > > cpu MHz         : 2690.250
>>   > > cpu MHz         : 1255.875
>>   > > cpu MHz         : 2530.875
>>   > > cpu MHz         : 2212.125
>>   > > cpu MHz         : 1521.500
>>   >
>>   > This is odd.  On my Ivy Bridge system the CPU speed from /proc/cpuinfo
>>   > is at max freq once I set the performance governor.
>>   > The numbers above almost look like
>>   > the cpu frequency is fluctuating and an average is taken.
>>   > What version of the kernel are you running?  Is
>>   > CONFIG_CPU_FREQ_GOV_PERFORMANCE compiled in?
>>   >
>>   > Does /sys/devices/system/cpu/cpu?/cpufreq/scaling_cur_freq
>>   > also change?
>>   >
>>   > Can you check what are the available governors in your system
>>   > and available frequencies?
>>   >
>>   > cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
>>   > cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
>>   >
>>   > If the userspace governor is available, you can try setting the governor
>>   > to userspace, then pin the frequency to 3400 MHz (assuming that's your
>>   > max) with a command like:
>>   
>> intel_pstate overrides any governor choice you make through sysfs.
>>
>> 	Dave
>>
> 
> Dirk,
> 
> I wonder whether this is the right behavior for intel_pstate: when I set
> the governor to performance, the intel_pstate driver still moves
> the cpu frequencies around.

No.  The value returned is the measured (delivered) frequency rather than
the requested P state, which is what the other governors return.

> 
> Turbostat also confirms that the frequencies are not at max,
> even though max_perf_pct and min_perf_pct are both set to 100.
> 

I calculate frequency the same way turbostat does but my samples are a *lot* 
shorter.
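
To illustrate what a delivered-frequency reading looks like, here is a
minimal sketch of a turbostat-style calculation.  It assumes the usual
APERF/MPERF/TSC counter deltas over one sample window; the function name
and the delta values are hypothetical (the deltas were picked to reproduce
the first summary row of the turbostat output in this message) and are not
taken from the intel_pstate source.

```python
def delivered_mhz(d_aperf, d_mperf, d_tsc, interval_s):
    """Turbostat-style figures from counter deltas over one sample window."""
    tsc_mhz = d_tsc / interval_s / 1e6
    busy_pct = 100.0 * d_mperf / d_tsc      # share of the window spent in C0
    bzy_mhz = tsc_mhz * d_aperf / d_mperf   # average clock while not idle
    avg_mhz = d_aperf / interval_s / 1e6    # average clock over the whole window
    return avg_mhz, busy_pct, bzy_mhz


# Hypothetical one-second window on a nearly idle CPU with a 3492 MHz TSC.
avg, busy, bzy = delivered_mhz(
    d_aperf=942_400, d_mperf=2_793_600, d_tsc=3_492_000_000, interval_s=1.0
)
print(f"Avg_MHz={avg:.0f} %Busy={busy:.2f} Bzy_MHz={bzy:.0f}")
```

Because the result is a ratio of counters accumulated while the CPU was
busy, a very short window on a mostly idle CPU contains few busy cycles,
so the reading can differ noticeably from one sample to the next.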
 

> I ran on my HSW system with a 3.15-rc7 kernel and see an issue
> similar to the one George reported.
> 
> It is really a pain when we need to do performance benchmarking and
> need to have a constant cpu frequency.
> 

With turbostat from rc7.
[root@echolake turbostat]# ./turbostat 
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
       -       -       1    0.08    1178    3492       0    0.12    0.08    0.01   99.71      29      29   99.23    0.00    0.00    0.00    2.18    0.00    0.00
       0       0       2    0.19    1189    3492       0    0.22    0.30    0.00   99.29      29      29   99.24    0.00    0.00    0.00    2.18    0.00    0.00
       0       4       1    0.12    1253    3492       0    0.29
       1       1       0    0.03    1065    3492       0    0.03    0.00    0.00   99.93      23
       1       5       0    0.01    1104    3492       0    0.05
       2       2       0    0.02    1275    3492       0    0.22    0.00    0.03   99.73      24
       2       6       2    0.18    1220    3492       0    0.06
       3       3       0    0.01     992    3492       0    0.07    0.00    0.01   99.90      23
       3       7       0    0.05     915    3492       0    0.04
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
       -       -       1    0.06    1034    3492       0    0.09    5.15    0.00   94.70      28      28   99.49    0.00    0.00    0.00    2.48    0.01    0.00
       0       0       1    0.09    1066    3492       0    0.17    0.01    0.00   99.73      28      28   99.49    0.00    0.00    0.00    2.48    0.01    0.00
       0       4       1    0.12    1036    3492       0    0.14
       1       1       0    0.04    1009    3492       0    0.05   20.59    0.00   79.32      24
       1       5       0    0.02     922    3492       0    0.07
       2       2       0    0.03     924    3492       0    0.15    0.00    0.00   99.82      25
       2       6       1    0.12    1117    3492       0    0.06
       3       3       0    0.01     911    3492       0    0.04    0.01    0.00   99.94      22
       3       7       0    0.03     856    3492       0    0.02
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
       -       -       1    0.08     889    3492       0    0.12    0.03    0.06   99.71      29      29   99.32    0.00    0.00    0.00    2.21    0.00    0.00
       0       0       1    0.11     867    3492       0    0.20    0.02    0.22   99.44      29      29   99.32    0.00    0.00    0.00    2.21    0.00    0.00
       0       4       1    0.14     907    3492       0    0.17
       1       1       1    0.12     809    3492       0    0.04    0.11    0.01   99.73      24
       1       5       0    0.01     798    3492       0    0.14
       2       2       0    0.03     863    3492       0    0.18    0.00    0.01   99.78      24
       2       6       1    0.14    1013    3492       0    0.07
       3       3       0    0.02     853    3492       0    0.09    0.00    0.00   99.89      23
       3       7       1    0.06     815    3492       0    0.05
^C
[root@echolake turbostat]# echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct 
[root@echolake turbostat]# ./turbostat 
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
       -       -       1    0.03    3489    3492       0    2.43    0.01    0.00   97.53      30      30   90.20    0.00    0.00    0.00    2.85    0.06    0.00
       0       0       1    0.04    3470    3492       0    0.09    0.00    0.00   99.88      30      30   90.20    0.00    0.00    0.00    2.85    0.06    0.00
       0       4       2    0.06    3492    3492       0    0.07
       1       1       1    0.02    3495    3492       0    0.05    0.03    0.00   99.90      25
       1       5       0    0.00    3494    3492       0    0.07
       2       2       0    0.01    3492    3492       0    9.53    0.00    0.01   90.45      25
       2       6       1    0.04    3492    3492       0    9.50
       3       3       1    0.03    3492    3492       0    0.05    0.01    0.00   99.91      23
       3       7       1    0.02    3493    3492       0    0.06
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
       -       -       1    0.02    3492    3492       0    4.93    0.00    0.00   95.04      30      30   80.19    0.00    0.00    0.00    3.54    0.10    0.00
       0       0       1    0.02    3491    3492       0    0.08    0.01    0.00   99.89      30      30   80.19    0.00    0.00    0.00    3.54    0.10    0.00
       0       4       2    0.05    3492    3492       0    0.05
       1       1       0    0.01    3492    3492       0    0.02    0.00    0.00   99.97      24
       1       5       0    0.01    3493    3492       0    0.02
       2       2       0    0.01    3493    3492       0   19.65    0.01    0.00   80.34      24
       2       6       2    0.05    3493    3492       0   19.61
       3       3       1    0.01    3492    3492       0    0.02    0.00    0.00   99.97      23
       3       7       0    0.01    3494    3492       0    0.02
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
       -       -       2    0.05    3493    3492       0    1.64    0.01    0.00   98.29      30      30   93.25    0.00    0.00    0.00    2.64    0.04    0.00
       0       0       4    0.12    3492    3492       0    0.13    0.01    0.00   99.74      30      30   93.25    0.00    0.00    0.00    2.64    0.04    0.00
       0       4       2    0.06    3493    3492       0    0.19
       1       1       1    0.02    3492    3492       0    0.03    0.04    0.00   99.91      23
       1       5       0    0.01    3494    3492       0    0.04
       2       2       0    0.01    3492    3492       0    6.42    0.00    0.00   93.57      25
       2       6       6    0.16    3492    3492       0    6.27
       3       3       0    0.01    3501    3492       0    0.05    0.01    0.00   99.93      22
       3       7       1    0.03    3492    3492       0    0.03
[root@echolake turbostat]# grep MH /proc/cpuinfo
cpu MHz		: 997.089
cpu MHz		: 797.480
cpu MHz		: 998.320
cpu MHz		: 800.078
cpu MHz		: 845.878
cpu MHz		: 801.445
cpu MHz		: 800.078
cpu MHz		: 800.351
[root@echolake turbostat]# echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct 
[root@echolake turbostat]# grep MH /proc/cpuinfo
cpu MHz		: 3497.128
cpu MHz		: 3506.699
cpu MHz		: 3500.273
cpu MHz		: 3500.273
cpu MHz		: 3500.000
cpu MHz		: 3500.000
cpu MHz		: 3500.000
cpu MHz		: 3495.898
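
For benchmarking, the same check can be scripted.  The following is a
minimal sketch, not part of any kernel tool: the helper names, the
3500 MHz target and the 1% tolerance are assumptions chosen to match the
readings above.

```python
def cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' values found in /proc/cpuinfo-style text."""
    values = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu MHz":
            values.append(float(value))
    return values


def is_pinned(values, target_mhz, tolerance=0.01):
    """True if every reading is within `tolerance` (a fraction) of target_mhz."""
    return bool(values) and all(
        abs(v - target_mhz) <= tolerance * target_mhz for v in values
    )


# Readings copied from the two greps above (first three CPUs of each).
before = "cpu MHz\t\t: 997.089\ncpu MHz\t\t: 797.480\ncpu MHz\t\t: 998.320\n"
after = "cpu MHz\t\t: 3497.128\ncpu MHz\t\t: 3506.699\ncpu MHz\t\t: 3500.273\n"

print(is_pinned(cpu_mhz(before), 3500.0))
print(is_pinned(cpu_mhz(after), 3500.0))
```

On a live system the text would come from reading /proc/cpuinfo instead of
the sample strings.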


> Thanks.
> 
> Tim
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-30 18:45               ` Dirk Brandewie
@ 2014-05-30 19:32                 ` Tim Chen
  2014-05-30 19:38                   ` Dirk Brandewie
  0 siblings, 1 reply; 29+ messages in thread
From: Tim Chen @ 2014-05-30 19:32 UTC (permalink / raw)
  To: Dirk Brandewie
  Cc: Dave Jones, George Spelvin, herbert, james.guilford, JBeulich,
	linux-kernel, Jacob jun Pan

On Fri, 2014-05-30 at 11:45 -0700, Dirk Brandewie wrote:

> 
> With turbostat from rc7.
> [root@echolake turbostat]# ./turbostat 
>     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
>        -       -       1    0.08    1178    3492       0    0.12    0.08    0.01   99.71      29      29   99.23    0.00    0.00    0.00    2.18    0.00    0.00
>        0       0       2    0.19    1189    3492       0    0.22    0.30    0.00   99.29      29      29   99.24    0.00    0.00    0.00    2.18    0.00    0.00
>        0       4       1    0.12    1253    3492       0    0.29
>        1       1       0    0.03    1065    3492       0    0.03    0.00    0.00   99.93      23
>        1       5       0    0.01    1104    3492       0    0.05
>        2       2       0    0.02    1275    3492       0    0.22    0.00    0.03   99.73      24
>        2       6       2    0.18    1220    3492       0    0.06
>        3       3       0    0.01     992    3492       0    0.07    0.00    0.01   99.90      23
>        3       7       0    0.05     915    3492       0    0.04
>     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
>        -       -       1    0.06    1034    3492       0    0.09    5.15    0.00   94.70      28      28   99.49    0.00    0.00    0.00    2.48    0.01    0.00
>        0       0       1    0.09    1066    3492       0    0.17    0.01    0.00   99.73      28      28   99.49    0.00    0.00    0.00    2.48    0.01    0.00
>        0       4       1    0.12    1036    3492       0    0.14
>        1       1       0    0.04    1009    3492       0    0.05   20.59    0.00   79.32      24
>        1       5       0    0.02     922    3492       0    0.07
>        2       2       0    0.03     924    3492       0    0.15    0.00    0.00   99.82      25
>        2       6       1    0.12    1117    3492       0    0.06
>        3       3       0    0.01     911    3492       0    0.04    0.01    0.00   99.94      22
>        3       7       0    0.03     856    3492       0    0.02
>     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
>        -       -       1    0.08     889    3492       0    0.12    0.03    0.06   99.71      29      29   99.32    0.00    0.00    0.00    2.21    0.00    0.00
>        0       0       1    0.11     867    3492       0    0.20    0.02    0.22   99.44      29      29   99.32    0.00    0.00    0.00    2.21    0.00    0.00
>        0       4       1    0.14     907    3492       0    0.17
>        1       1       1    0.12     809    3492       0    0.04    0.11    0.01   99.73      24
>        1       5       0    0.01     798    3492       0    0.14
>        2       2       0    0.03     863    3492       0    0.18    0.00    0.01   99.78      24
>        2       6       1    0.14    1013    3492       0    0.07
>        3       3       0    0.02     853    3492       0    0.09    0.00    0.00   99.89      23
>        3       7       1    0.06     815    3492       0    0.05
> ^C
> [root@echolake turbostat]# echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct 
> [root@echolake turbostat]# ./turbostat 
>     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
>        -       -       1    0.03    3489    3492       0    2.43    0.01    0.00   97.53      30      30   90.20    0.00    0.00    0.00    2.85    0.06    0.00
>        0       0       1    0.04    3470    3492       0    0.09    0.00    0.00   99.88      30      30   90.20    0.00    0.00    0.00    2.85    0.06    0.00
>        0       4       2    0.06    3492    3492       0    0.07
>        1       1       1    0.02    3495    3492       0    0.05    0.03    0.00   99.90      25
>        1       5       0    0.00    3494    3492       0    0.07
>        2       2       0    0.01    3492    3492       0    9.53    0.00    0.01   90.45      25
>        2       6       1    0.04    3492    3492       0    9.50
>        3       3       1    0.03    3492    3492       0    0.05    0.01    0.00   99.91      23
>        3       7       1    0.02    3493    3492       0    0.06
>     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
>        -       -       1    0.02    3492    3492       0    4.93    0.00    0.00   95.04      30      30   80.19    0.00    0.00    0.00    3.54    0.10    0.00
>        0       0       1    0.02    3491    3492       0    0.08    0.01    0.00   99.89      30      30   80.19    0.00    0.00    0.00    3.54    0.10    0.00
>        0       4       2    0.05    3492    3492       0    0.05
>        1       1       0    0.01    3492    3492       0    0.02    0.00    0.00   99.97      24
>        1       5       0    0.01    3493    3492       0    0.02
>        2       2       0    0.01    3493    3492       0   19.65    0.01    0.00   80.34      24
>        2       6       2    0.05    3493    3492       0   19.61
>        3       3       1    0.01    3492    3492       0    0.02    0.00    0.00   99.97      23
>        3       7       0    0.01    3494    3492       0    0.02
>     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt 
>        -       -       2    0.05    3493    3492       0    1.64    0.01    0.00   98.29      30      30   93.25    0.00    0.00    0.00    2.64    0.04    0.00
>        0       0       4    0.12    3492    3492       0    0.13    0.01    0.00   99.74      30      30   93.25    0.00    0.00    0.00    2.64    0.04    0.00
>        0       4       2    0.06    3493    3492       0    0.19
>        1       1       1    0.02    3492    3492       0    0.03    0.04    0.00   99.91      23
>        1       5       0    0.01    3494    3492       0    0.04
>        2       2       0    0.01    3492    3492       0    6.42    0.00    0.00   93.57      25
>        2       6       6    0.16    3492    3492       0    6.27
>        3       3       0    0.01    3501    3492       0    0.05    0.01    0.00   99.93      22
>        3       7       1    0.03    3492    3492       0    0.03
> [root@echolake turbostat]# grep MH /proc/cpuinfo
> cpu MHz		: 997.089
> cpu MHz		: 797.480
> cpu MHz		: 998.320
> cpu MHz		: 800.078
> cpu MHz		: 845.878
> cpu MHz		: 801.445
> cpu MHz		: 800.078
> cpu MHz		: 800.351
> [root@echolake turbostat]# echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct 
> [root@echolake turbostat]# grep MH /proc/cpuinfo
> cpu MHz		: 3497.128
> cpu MHz		: 3506.699
> cpu MHz		: 3500.273
> cpu MHz		: 3500.273
> cpu MHz		: 3500.000
> cpu MHz		: 3500.000
> cpu MHz		: 3500.000
> cpu MHz		: 3495.898
> 

Dirk,

Thanks for checking things out.

I tested on a Haswell system, and I see that the frequency
can dip below the max even when I set min_perf_pct to 100.
Let me know if you want to log on to my system and check whether
there's something I missed. It is odd that, once min_perf_pct is
set to 100, package 1's cores run at a much higher frequency,
closer to max, than package 0's.

Tim

[root@otc-grantly-02 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq 
3600000
[root@otc-grantly-02 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq 
1200000
[root@otc-grantly-02 ~]# echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
[root@otc-grantly-02 ~]# cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
100
[root@otc-grantly-02 ~]# uname -a
Linux otc-grantly-02 3.15.0-rc7+ #3 SMP Thu May 29 11:34:39 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
[root@otc-grantly-02 ~]# cpupower -c 0-1 frequency-info 
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 1.20 GHz - 3.60 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.20 GHz and 3.60 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 1.20 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
analyzing CPU 1:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 1
  CPUs which need to have their frequency coordinated by software: 1
  maximum transition latency: 0.97 ms.
  hardware limits: 1.20 GHz - 3.60 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.20 GHz and 3.60 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 2.02 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
[root@otc-grantly-02 ~]# turbostat
Package     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt   PKG_%   RAM_% 
       -       -       -       0    0.02    1964    2594       0    0.13    0.00   99.85    0.00      33      41    4.92    0.00   93.99    0.00   23.04    3.60    0.18    0.00
       0       0       0       1    0.07    2154    2594       0    0.21    0.00   99.72    0.00      32      41    4.42    0.00   94.00    0.00   17.16    1.73    0.10    0.00
       0       0      28       0    0.01    1465    2594       0    0.26
       0       1       1       1    0.04    1941    2594       0    0.18    0.00   99.78    0.00      33
       0       1      29       0    0.02    1587    2594       0    0.20
       0       2       2       1    0.04    1586    2594       0    0.15    0.00   99.81    0.00      28
       0       2      30       0    0.01    1539    2594       0    0.17
       0       3       3       1    0.04    1656    2594       0    0.17    0.00   99.79    0.00      31
       0       3      31       0    0.01    1723    2594       0    0.19
       0       4       4       1    0.06    1800    2594       0    0.21    0.00   99.74    0.00      33
       0       4      32       0    0.02    1725    2594       0    0.24
       0       5       5       1    0.04    1917    2594       0    0.15    0.00   99.81    0.00      29
       0       5      33       0    0.02    1707    2594       0    0.17
       0       6       6       1    0.04    1820    2594       0    0.17    0.00   99.79    0.00      33
       0       6      34       0    0.01    1564    2594       0    0.20
       0       8       7       0    0.02    1655    2594       0    0.11    0.00   99.86    0.00      29
       0       8      35       0    0.01    1687    2594       0    0.12
       0       9       8       0    0.03    1748    2594       0    0.15    0.00   99.83    0.00      32
       0       9      36       0    0.02    2001    2594       0    0.15
       0      10       9       1    0.06    1604    2594       0    0.20    0.00   99.74    0.00      32
       0      10      37       0    0.02    1679    2594       0    0.24
       0      11      10       1    0.04    1644    2594       0    0.12    0.00   99.84    0.00      30
       0      11      38       0    0.01    1509    2594       0    0.14
       0      12      11       1    0.04    1773    2594       0    0.13    0.00   99.83    0.00      30
       0      12      39       0    0.01    1529    2594       0    0.16
       0      13      12       0    0.02    1907    2594       0    0.11    0.00   99.87    0.00      30
       0      13      40       0    0.01    1574    2594       0    0.12
       0      14      13       1    0.04    1831    2594       0    0.19    0.00   99.77    0.00      31
       0      14      41       0    0.01    1735    2594       0    0.22
       1       0      14       1    0.04    1831    2594       0    0.11    0.00   99.85    0.00      28      37    5.43    0.00   93.98    0.00    5.88    1.87    0.08    0.00
       1       0      42       0    0.01    2238    2594       0    0.14
       1       1      15       1    0.04    1869    2594       0    0.15    0.00   99.81    0.00      31
       1       1      43       0    0.01    2407    2594       0    0.18
       1       2      16       0    0.02    2164    2594       0    0.10    0.00   99.88    0.00      28
       1       2      44       0    0.01    2326    2594       0    0.11
       1       3      17       1    0.04    2101    2594       0    0.10    0.00   99.86    0.00      30
       1       3      45       0    0.01    2355    2594       0    0.13
       1       4      18       0    0.01    2429    2594       0    0.08    0.00   99.90    0.00      29
       1       4      46       0    0.01    2545    2594       0    0.08
       1       5      19       0    0.01    2412    2594       0    0.08    0.00   99.91    0.00      29
       1       5      47       0    0.01    2392    2594       0    0.08
       1       6      20       0    0.01    2448    2594       0    0.08    0.00   99.90    0.00      29
       1       6      48       0    0.01    2430    2594       0    0.08
       1       8      21       0    0.01    2574    2594       0    0.08    0.00   99.90    0.00      29
       1       8      49       0    0.01    2450    2594       0    0.09
       1       9      22       0    0.02    2470    2594       0    0.08    0.00   99.90    0.00      31
       1       9      50       0    0.01    2555    2594       0    0.08
       1      10      23       0    0.01    2540    2594       0    0.07    0.00   99.92    0.00      26
       1      10      51       0    0.01    2672    2594       0    0.07
       1      11      24       0    0.01    2472    2594       0    0.08    0.00   99.91    0.00      28
       1      11      52       0    0.01    2461    2594       0    0.08
       1      12      25       0    0.01    2438    2594       0    0.07    0.00   99.92    0.00      29
       1      12      53       0    0.01    2316    2594       0    0.07
       1      13      26       0    0.01    2363    2594       0    0.08    0.00   99.90    0.00      28
       1      13      54       0    0.01    2586    2594       0    0.09
       1      14      27       0    0.01    2459    2594       0    0.09    0.00   99.90    0.00      27
       1      14      55       1    0.02    2939    2594       0    0.08

Tim



* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-30 19:32                 ` Tim Chen
@ 2014-05-30 19:38                   ` Dirk Brandewie
  2014-05-30 20:07                     ` Tim Chen
  0 siblings, 1 reply; 29+ messages in thread
From: Dirk Brandewie @ 2014-05-30 19:38 UTC (permalink / raw)
  To: Tim Chen
  Cc: dirk.brandewie, Dave Jones, George Spelvin, herbert,
	james.guilford, JBeulich, linux-kernel, Jacob jun Pan

On 05/30/2014 12:32 PM, Tim Chen wrote:
> On Fri, 2014-05-30 at 11:45 -0700, Dirk Brandewie wrote:
>
>>
>> With turbostat from rc7.
>> [root@echolake turbostat]# ./turbostat
>>      Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
>>         -       -       1    0.08    1178    3492       0    0.12    0.08    0.01   99.71      29      29   99.23    0.00    0.00    0.00    2.18    0.00    0.00
>>         0       0       2    0.19    1189    3492       0    0.22    0.30    0.00   99.29      29      29   99.24    0.00    0.00    0.00    2.18    0.00    0.00
>>         0       4       1    0.12    1253    3492       0    0.29
>>         1       1       0    0.03    1065    3492       0    0.03    0.00    0.00   99.93      23
>>         1       5       0    0.01    1104    3492       0    0.05
>>         2       2       0    0.02    1275    3492       0    0.22    0.00    0.03   99.73      24
>>         2       6       2    0.18    1220    3492       0    0.06
>>         3       3       0    0.01     992    3492       0    0.07    0.00    0.01   99.90      23
>>         3       7       0    0.05     915    3492       0    0.04
>>      Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
>>         -       -       1    0.06    1034    3492       0    0.09    5.15    0.00   94.70      28      28   99.49    0.00    0.00    0.00    2.48    0.01    0.00
>>         0       0       1    0.09    1066    3492       0    0.17    0.01    0.00   99.73      28      28   99.49    0.00    0.00    0.00    2.48    0.01    0.00
>>         0       4       1    0.12    1036    3492       0    0.14
>>         1       1       0    0.04    1009    3492       0    0.05   20.59    0.00   79.32      24
>>         1       5       0    0.02     922    3492       0    0.07
>>         2       2       0    0.03     924    3492       0    0.15    0.00    0.00   99.82      25
>>         2       6       1    0.12    1117    3492       0    0.06
>>         3       3       0    0.01     911    3492       0    0.04    0.01    0.00   99.94      22
>>         3       7       0    0.03     856    3492       0    0.02
>>      Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
>>         -       -       1    0.08     889    3492       0    0.12    0.03    0.06   99.71      29      29   99.32    0.00    0.00    0.00    2.21    0.00    0.00
>>         0       0       1    0.11     867    3492       0    0.20    0.02    0.22   99.44      29      29   99.32    0.00    0.00    0.00    2.21    0.00    0.00
>>         0       4       1    0.14     907    3492       0    0.17
>>         1       1       1    0.12     809    3492       0    0.04    0.11    0.01   99.73      24
>>         1       5       0    0.01     798    3492       0    0.14
>>         2       2       0    0.03     863    3492       0    0.18    0.00    0.01   99.78      24
>>         2       6       1    0.14    1013    3492       0    0.07
>>         3       3       0    0.02     853    3492       0    0.09    0.00    0.00   99.89      23
>>         3       7       1    0.06     815    3492       0    0.05
>> ^C
>> [root@echolake turbostat]# echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
>> [root@echolake turbostat]# ./turbostat
>>      Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
>>         -       -       1    0.03    3489    3492       0    2.43    0.01    0.00   97.53      30      30   90.20    0.00    0.00    0.00    2.85    0.06    0.00
>>         0       0       1    0.04    3470    3492       0    0.09    0.00    0.00   99.88      30      30   90.20    0.00    0.00    0.00    2.85    0.06    0.00
>>         0       4       2    0.06    3492    3492       0    0.07
>>         1       1       1    0.02    3495    3492       0    0.05    0.03    0.00   99.90      25
>>         1       5       0    0.00    3494    3492       0    0.07
>>         2       2       0    0.01    3492    3492       0    9.53    0.00    0.01   90.45      25
>>         2       6       1    0.04    3492    3492       0    9.50
>>         3       3       1    0.03    3492    3492       0    0.05    0.01    0.00   99.91      23
>>         3       7       1    0.02    3493    3492       0    0.06
>>      Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
>>         -       -       1    0.02    3492    3492       0    4.93    0.00    0.00   95.04      30      30   80.19    0.00    0.00    0.00    3.54    0.10    0.00
>>         0       0       1    0.02    3491    3492       0    0.08    0.01    0.00   99.89      30      30   80.19    0.00    0.00    0.00    3.54    0.10    0.00
>>         0       4       2    0.05    3492    3492       0    0.05
>>         1       1       0    0.01    3492    3492       0    0.02    0.00    0.00   99.97      24
>>         1       5       0    0.01    3493    3492       0    0.02
>>         2       2       0    0.01    3493    3492       0   19.65    0.01    0.00   80.34      24
>>         2       6       2    0.05    3493    3492       0   19.61
>>         3       3       1    0.01    3492    3492       0    0.02    0.00    0.00   99.97      23
>>         3       7       0    0.01    3494    3492       0    0.02
>>      Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
>>         -       -       2    0.05    3493    3492       0    1.64    0.01    0.00   98.29      30      30   93.25    0.00    0.00    0.00    2.64    0.04    0.00
>>         0       0       4    0.12    3492    3492       0    0.13    0.01    0.00   99.74      30      30   93.25    0.00    0.00    0.00    2.64    0.04    0.00
>>         0       4       2    0.06    3493    3492       0    0.19
>>         1       1       1    0.02    3492    3492       0    0.03    0.04    0.00   99.91      23
>>         1       5       0    0.01    3494    3492       0    0.04
>>         2       2       0    0.01    3492    3492       0    6.42    0.00    0.00   93.57      25
>>         2       6       6    0.16    3492    3492       0    6.27
>>         3       3       0    0.01    3501    3492       0    0.05    0.01    0.00   99.93      22
>>         3       7       1    0.03    3492    3492       0    0.03
>> [root@echolake turbostat]# grep MH /proc/cpuinfo
>> cpu MHz		: 997.089
>> cpu MHz		: 797.480
>> cpu MHz		: 998.320
>> cpu MHz		: 800.078
>> cpu MHz		: 845.878
>> cpu MHz		: 801.445
>> cpu MHz		: 800.078
>> cpu MHz		: 800.351
>> [root@echolake turbostat]# echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
>> [root@echolake turbostat]# grep MH /proc/cpuinfo
>> cpu MHz		: 3497.128
>> cpu MHz		: 3506.699
>> cpu MHz		: 3500.273
>> cpu MHz		: 3500.273
>> cpu MHz		: 3500.000
>> cpu MHz		: 3500.000
>> cpu MHz		: 3500.000
>> cpu MHz		: 3495.898
>>
>
> Dirk,
>
> Thanks for checking things out.
>
> I tested on a Haswell system, and I see that the frequency
> can dip below the max even when I set the min_perf_pct to 100.
> Let me know if you want to log on to my system and check if
> there's something I missed. It is odd that package 1's cores
> run at a much higher frequency, closer to max, than package 0's
> once min_perf_pct is set to 100.
>

Can you run turbostat for a few samples? It reports an average over the
sample time.
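
For comparing the two packages across several such samples, a small
script along these lines might help. It is only a sketch: the function
name and the SAMPLE rows (copied from the output quoted in this mail)
are illustrative and not part of turbostat, it assumes the multi-package
layout with a leading "Package" column, and it takes an unweighted mean
of the per-CPU Bzy_MHz values, which is a crude summary since each CPU
is busy for a different fraction of the interval.

```python
from collections import defaultdict

# A few rows copied from the turbostat output quoted in this thread.
SAMPLE = """\
Package     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI
       -       -       -       0    0.02    1964    2594       0
       0       0       0       1    0.07    2154    2594       0
       0       0      28       0    0.01    1465    2594       0
       1       0      14       1    0.04    1831    2594       0
       1       0      42       0    0.01    2238    2594       0
"""


def mean_bzy_mhz_by_package(text):
    """Return {package: unweighted mean Bzy_MHz} over all per-CPU rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    col = None  # index of the Bzy_MHz column, taken from the header row
    for raw in text.splitlines():
        # Drop any mail quoting ("> ") before splitting into columns.
        fields = raw.lstrip("> ").split()
        if not fields:
            continue
        if fields[0] == "Package":
            col = fields.index("Bzy_MHz")
            continue
        # Skip rows before the first header, the "-" summary row,
        # and anything too short to carry the column.
        if col is None or fields[0] == "-" or len(fields) <= col:
            continue
        sums[fields[0]] += float(fields[col])
        counts[fields[0]] += 1
    return {pkg: sums[pkg] / counts[pkg] for pkg in sums}


if __name__ == "__main__":
    for pkg, mhz in sorted(mean_bzy_mhz_by_package(SAMPLE).items()):
        print(f"package {pkg}: mean Bzy_MHz {mhz:.1f}")
```

Feeding it the concatenated output of several sample intervals averages
over all of them, since every repeated header row is simply re-parsed.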

> Tim
>
> [root@otc-grantly-02 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
> 3600000
> [root@otc-grantly-02 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
> 1200000
> [root@otc-grantly-02 ~]# echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
> [root@otc-grantly-02 ~]# cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
> 100
> [root@otc-grantly-02 ~]# uname -a
> Linux otc-grantly-02 3.15.0-rc7+ #3 SMP Thu May 29 11:34:39 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
> [root@otc-grantly-02 ~]# cpupower -c 0-1 frequency-info
> analyzing CPU 0:
>    driver: intel_pstate
>    CPUs which run at the same hardware frequency: 0
>    CPUs which need to have their frequency coordinated by software: 0
>    maximum transition latency: 0.97 ms.
>    hardware limits: 1.20 GHz - 3.60 GHz
>    available cpufreq governors: performance, powersave
>    current policy: frequency should be within 1.20 GHz and 3.60 GHz.
>                    The governor "powersave" may decide which speed to use
>                    within this range.
>    current CPU frequency is 1.20 GHz (asserted by call to hardware).
>    boost state support:
>      Supported: yes
>      Active: yes
> analyzing CPU 1:
>    driver: intel_pstate
>    CPUs which run at the same hardware frequency: 1
>    CPUs which need to have their frequency coordinated by software: 1
>    maximum transition latency: 0.97 ms.
>    hardware limits: 1.20 GHz - 3.60 GHz
>    available cpufreq governors: performance, powersave
>    current policy: frequency should be within 1.20 GHz and 3.60 GHz.
>                    The governor "powersave" may decide which speed to use
>                    within this range.
>    current CPU frequency is 2.02 GHz (asserted by call to hardware).
>    boost state support:
>      Supported: yes
>      Active: yes
> [root@otc-grantly-02 ~]# turbostat
> Package     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt   PKG_%   RAM_%
>         -       -       -       0    0.02    1964    2594       0    0.13    0.00   99.85    0.00      33      41    4.92    0.00   93.99    0.00   23.04    3.60    0.18    0.00
>         0       0       0       1    0.07    2154    2594       0    0.21    0.00   99.72    0.00      32      41    4.42    0.00   94.00    0.00   17.16    1.73    0.10    0.00
>         0       0      28       0    0.01    1465    2594       0    0.26
>         0       1       1       1    0.04    1941    2594       0    0.18    0.00   99.78    0.00      33
>         0       1      29       0    0.02    1587    2594       0    0.20
>         0       2       2       1    0.04    1586    2594       0    0.15    0.00   99.81    0.00      28
>         0       2      30       0    0.01    1539    2594       0    0.17
>         0       3       3       1    0.04    1656    2594       0    0.17    0.00   99.79    0.00      31
>         0       3      31       0    0.01    1723    2594       0    0.19
>         0       4       4       1    0.06    1800    2594       0    0.21    0.00   99.74    0.00      33
>         0       4      32       0    0.02    1725    2594       0    0.24
>         0       5       5       1    0.04    1917    2594       0    0.15    0.00   99.81    0.00      29
>         0       5      33       0    0.02    1707    2594       0    0.17
>         0       6       6       1    0.04    1820    2594       0    0.17    0.00   99.79    0.00      33
>         0       6      34       0    0.01    1564    2594       0    0.20
>         0       8       7       0    0.02    1655    2594       0    0.11    0.00   99.86    0.00      29
>         0       8      35       0    0.01    1687    2594       0    0.12
>         0       9       8       0    0.03    1748    2594       0    0.15    0.00   99.83    0.00      32
>         0       9      36       0    0.02    2001    2594       0    0.15
>         0      10       9       1    0.06    1604    2594       0    0.20    0.00   99.74    0.00      32
>         0      10      37       0    0.02    1679    2594       0    0.24
>         0      11      10       1    0.04    1644    2594       0    0.12    0.00   99.84    0.00      30
>         0      11      38       0    0.01    1509    2594       0    0.14
>         0      12      11       1    0.04    1773    2594       0    0.13    0.00   99.83    0.00      30
>         0      12      39       0    0.01    1529    2594       0    0.16
>         0      13      12       0    0.02    1907    2594       0    0.11    0.00   99.87    0.00      30
>         0      13      40       0    0.01    1574    2594       0    0.12
>         0      14      13       1    0.04    1831    2594       0    0.19    0.00   99.77    0.00      31
>         0      14      41       0    0.01    1735    2594       0    0.22
>         1       0      14       1    0.04    1831    2594       0    0.11    0.00   99.85    0.00      28      37    5.43    0.00   93.98    0.00    5.88    1.87    0.08    0.00
>         1       0      42       0    0.01    2238    2594       0    0.14
>         1       1      15       1    0.04    1869    2594       0    0.15    0.00   99.81    0.00      31
>         1       1      43       0    0.01    2407    2594       0    0.18
>         1       2      16       0    0.02    2164    2594       0    0.10    0.00   99.88    0.00      28
>         1       2      44       0    0.01    2326    2594       0    0.11
>         1       3      17       1    0.04    2101    2594       0    0.10    0.00   99.86    0.00      30
>         1       3      45       0    0.01    2355    2594       0    0.13
>         1       4      18       0    0.01    2429    2594       0    0.08    0.00   99.90    0.00      29
>         1       4      46       0    0.01    2545    2594       0    0.08
>         1       5      19       0    0.01    2412    2594       0    0.08    0.00   99.91    0.00      29
>         1       5      47       0    0.01    2392    2594       0    0.08
>         1       6      20       0    0.01    2448    2594       0    0.08    0.00   99.90    0.00      29
>         1       6      48       0    0.01    2430    2594       0    0.08
>         1       8      21       0    0.01    2574    2594       0    0.08    0.00   99.90    0.00      29
>         1       8      49       0    0.01    2450    2594       0    0.09
>         1       9      22       0    0.02    2470    2594       0    0.08    0.00   99.90    0.00      31
>         1       9      50       0    0.01    2555    2594       0    0.08
>         1      10      23       0    0.01    2540    2594       0    0.07    0.00   99.92    0.00      26
>         1      10      51       0    0.01    2672    2594       0    0.07
>         1      11      24       0    0.01    2472    2594       0    0.08    0.00   99.91    0.00      28
>         1      11      52       0    0.01    2461    2594       0    0.08
>         1      12      25       0    0.01    2438    2594       0    0.07    0.00   99.92    0.00      29
>         1      12      53       0    0.01    2316    2594       0    0.07
>         1      13      26       0    0.01    2363    2594       0    0.08    0.00   99.90    0.00      28
>         1      13      54       0    0.01    2586    2594       0    0.09
>         1      14      27       0    0.01    2459    2594       0    0.09    0.00   99.90    0.00      27
>         1      14      55       1    0.02    2939    2594       0    0.08
>
> Tim
>



* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-30 19:38                   ` Dirk Brandewie
@ 2014-05-30 20:07                     ` Tim Chen
  2014-05-30 20:15                       ` Dirk Brandewie
  0 siblings, 1 reply; 29+ messages in thread
From: Tim Chen @ 2014-05-30 20:07 UTC (permalink / raw)
  To: Dirk Brandewie
  Cc: Dave Jones, George Spelvin, herbert, james.guilford, JBeulich,
	linux-kernel, Jacob jun Pan

On Fri, 2014-05-30 at 12:38 -0700, Dirk Brandewie wrote:

> > Dirk,
> >
> > Thanks for checking things out.
> >
> > I tested on a Haswell system, and I see that the frequency
> > can dip below the max even when I set the min_perf_pct to 100.
> > Let me know if you want to log on to my system and check if
> > there's something I missed. It is odd that package 1's cores
> > run at a much higher frequency, closer to max, than package 0's
> > once min_perf_pct is set to 100.
> >
> 
> Can you run turbostat for a few samples? It reports an average over the
> sample time.
> 

Here it is.

Tim

Package     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt   PKG_%   RAM_% 
       -       -       -       0    0.02    2048    2594       0    0.23    0.00   99.75    0.00      33      42    5.93    0.00   91.52    0.00   23.22    4.15    0.12    0.00
       0       0       0       1    0.06    1997    2594       0    0.16    0.00   99.78    0.00      32      42    7.92    0.00   91.55    0.00   16.88    1.95    0.06    0.00
       0       0      28       0    0.01    1338    2594       0    0.21
       0       1       1       0    0.02    1696    2594       0    0.11    0.00   99.87    0.00      33
       0       1      29       0    0.01    1455    2594       0    0.11
       0       2       2       0    0.01    1618    2594       0    0.07    0.00   99.91    0.00      30
       0       2      30       0    0.01    1513    2594       0    0.07
       0       3       3       0    0.01    1724    2594       0    0.08    0.00   99.91    0.00      31
       0       3      31       0    0.01    1447    2594       0    0.08
       0       4       4       0    0.01    1769    2594       0    0.06    0.00   99.92    0.00      32
       0       4      32       0    0.01    1483    2594       0    0.06
       0       5       5       0    0.01    1670    2594       0    0.07    0.00   99.92    0.00      29
       0       5      33       0    0.01    1515    2594       0    0.07
       0       6       6       0    0.01    1600    2594       0    0.07    0.00   99.92    0.00      33
       0       6      34       0    0.01    1412    2594       0    0.07
       0       8       7       0    0.01    1588    2594       0    0.07    0.00   99.92    0.00      30
       0       8      35       0    0.01    1432    2594       0    0.07
       0       9       8       0    0.01    1662    2594       0    0.11    0.00   99.88    0.00      32
       0       9      36       0    0.02    1658    2594       0    0.10
       0      10       9       0    0.01    1570    2594       0    0.07    0.00   99.91    0.00      32
       0      10      37       0    0.01    1468    2594       0    0.07
       0      11      10       0    0.01    1680    2594       0    0.07    0.00   99.92    0.00      31
       0      11      38       0    0.01    1511    2594       0    0.07
       0      12      11       0    0.01    1690    2594       0    0.08    0.00   99.91    0.00      30
       0      12      39       0    0.01    1560    2594       0    0.08
       0      13      12       0    0.02    1604    2594       0    0.11    0.00   99.87    0.00      29
       0      13      40       0    0.02    1436    2594       0    0.11
       0      14      13       0    0.02    1620    2594       0    0.09    0.00   99.89    0.00      29
       0      14      41       0    0.02    1440    2594       0    0.09
       1       0      14       0    0.03    1666    2594       0    0.16    0.00   99.82    0.00      28      36    3.94    0.00   91.50    0.00    6.34    2.20    0.06    0.00
       1       0      42       3    0.08    3263    2594       0    0.11
       1       1      15       0    0.01    2194    2594       0    0.09    0.00   99.90    0.00      30
       1       1      43       0    0.01    2358    2594       0    0.09
       1       2      16       0    0.01    2650    2594       0    0.08    0.00   99.91    0.00      28
       1       2      44       0    0.01    2032    2594       0    0.08
       1       3      17       1    0.03    2305    2594       0    4.11    0.00   95.86    0.00      30
       1       3      45       0    0.01    2290    2594       0    4.13
       1       4      18       0    0.01    2362    2594       0    0.09    0.00   99.90    0.00      28
       1       4      46       0    0.01    2325    2594       0    0.09
       1       5      19       0    0.01    2374    2594       0    0.07    0.00   99.92    0.00      30
       1       5      47       0    0.01    2442    2594       0    0.07
       1       6      20       0    0.01    2476    2594       0    0.08    0.00   99.91    0.00      30
       1       6      48       0    0.01    2382    2594       0    0.07
       1       8      21       0    0.01    2669    2594       0    0.09    0.00   99.90    0.00      29
       1       8      49       0    0.02    1953    2594       0    0.09
       1       9      22       0    0.01    2537    2594       0    0.10    0.00   99.89    0.00      31
       1       9      50       0    0.01    2117    2594       0    0.10
       1      10      23       0    0.01    2531    2594       0    0.07    0.00   99.92    0.00      27
       1      10      51       0    0.01    2404    2594       0    0.08
       1      11      24       0    0.01    2315    2594       0    0.08    0.00   99.91    0.00      28
       1      11      52       0    0.01    2210    2594       0    0.08
       1      12      25       0    0.01    2434    2594       0    0.07    0.00   99.91    0.00      28
       1      12      53       0    0.01    2113    2594       0    0.08
       1      13      26       0    0.01    2070    2594       0    0.07    0.00   99.91    0.00      27
       1      13      54       0    0.01    2114    2594       0    0.08
       1      14      27       0    0.01    2324    2594       0    0.10    0.00   99.89    0.00      27
       1      14      55       1    0.03    2991    2594       0    0.08
Package     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt   PKG_%   RAM_% 
       -       -       -       0    0.01    2138    2594       0    0.10    0.01   99.88    0.00      33      42    4.32    0.09   94.88    0.00   22.45    3.56    0.12    0.00
       0       0       0       1    0.07    2106    2594       0    0.25    0.00   99.68    0.00      31      42    4.20    0.00   95.00    0.00   16.72    1.73    0.06    0.00
       0       0      28       0    0.01    2163    2594       0    0.31
       0       1       1       0    0.02    2005    2594       0    0.11    0.00   99.87    0.00      33
       0       1      29       0    0.01    1823    2594       0    0.12
       0       2       2       0    0.02    2008    2594       0    0.10    0.00   99.88    0.00      30
       0       2      30       0    0.01    1903    2594       0    0.10
       0       3       3       0    0.02    1953    2594       0    0.10    0.00   99.88    0.00      31
       0       3      31       0    0.01    1840    2594       0    0.11
       0       4       4       0    0.02    2220    2594       0    0.09    0.01   99.89    0.00      33
       0       4      32       0    0.01    1806    2594       0    0.09
       0       5       5       0    0.01    1723    2594       0    0.09    0.00   99.89    0.00      28
       0       5      33       0    0.01    1904    2594       0    0.09
       0       6       6       0    0.01    1806    2594       0    0.08    0.00   99.91    0.00      33
       0       6      34       0    0.01    1824    2594       0    0.08
       0       8       7       0    0.01    1910    2594       0    0.10    0.00   99.89    0.00      30
       0       8      35       0    0.01    1847    2594       0    0.10
       0       9       8       0    0.02    2204    2594       0    0.11    0.00   99.88    0.00      30
       0       9      36       0    0.02    1899    2594       0    0.11
       0      10       9       0    0.01    1967    2594       0    0.09    0.00   99.90    0.00      33
       0      10      37       0    0.01    1838    2594       0    0.09
       0      11      10       0    0.01    1696    2594       0    0.08    0.00   99.90    0.00      31
       0      11      38       0    0.01    1728    2594       0    0.08
       0      12      11       0    0.02    1863    2594       0    0.09    0.00   99.90    0.00      30
       0      12      39       0    0.01    1838    2594       0    0.09
       0      13      12       0    0.02    1856    2594       0    0.11    0.00   99.87    0.00      29
       0      13      40       0    0.01    1741    2594       0    0.12
       0      14      13       0    0.02    1887    2594       0    0.10    0.00   99.88    0.00      30
       0      14      41       0    0.01    1860    2594       0    0.11
       1       0      14       0    0.03    1875    2594       0    0.09    0.00   99.88    0.00      28      38    4.44    0.18   94.75    0.00    5.72    1.82    0.06    0.00
       1       0      42       0    0.01    2363    2594       0    0.11
       1       1      15       0    0.01    2368    2594       0    0.09    0.00   99.90    0.00      31
       1       1      43       0    0.01    2403    2594       0    0.09
       1       2      16       0    0.01    2501    2594       0    0.07    0.00   99.91    0.00      27
       1       2      44       0    0.01    2469    2594       0    0.07
       1       3      17       1    0.04    2674    2594       0    0.10    0.19   99.66    0.00      30
       1       3      45       0    0.01    2374    2594       0    0.13
       1       4      18       0    0.01    2446    2594       0    0.08    0.00   99.91    0.00      28
       1       4      46       0    0.01    2372    2594       0    0.08
       1       5      19       0    0.01    2479    2594       0    0.08    0.00   99.91    0.00      29
       1       5      47       0    0.01    2352    2594       0    0.08
       1       6      20       0    0.01    2436    2594       0    0.07    0.00   99.91    0.00      30
       1       6      48       0    0.01    2381    2594       0    0.08
       1       8      21       0    0.01    2377    2594       0    0.08    0.00   99.91    0.00      29
       1       8      49       0    0.01    2629    2594       0    0.08
       1       9      22       0    0.01    2407    2594       0    0.09    0.00   99.90    0.00      30
       1       9      50       0    0.01    2547    2594       0    0.09
       1      10      23       0    0.01    2254    2594       0    0.09    0.00   99.90    0.00      28
       1      10      51       0    0.01    2514    2594       0    0.09
       1      11      24       0    0.01    2204    2594       0    0.10    0.00   99.89    0.00      29
       1      11      52       0    0.01    2187    2594       0    0.09
       1      12      25       0    0.01    2310    2594       0    0.09    0.00   99.90    0.00      27
       1      12      53       0    0.01    2636    2594       0    0.09
       1      13      26       0    0.01    2325    2594       0    0.09    0.00   99.89    0.00      29
       1      13      54       0    0.02    1959    2594       0    0.09
       1      14      27       0    0.01    2273    2594       0    0.11    0.00   99.88    0.00      28
       1      14      55       1    0.02    2678    2594       0    0.10
Package     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt   PKG_%   RAM_% 
       -       -       -       0    0.02    2223    2594       0    0.12    0.00   99.86    0.00      34      41    4.47    0.00   94.54    0.00   22.70    3.64    0.14    0.00
       0       0       0       1    0.05    2251    2594       0    0.18    0.00   99.77    0.00      32      41    4.81    0.00   94.56    0.00   16.89    1.78    0.06    0.00
       0       0      28       0    0.01    1846    2594       0    0.22
       0       1       1       0    0.02    1758    2594       0    0.12    0.00   99.86    0.00      33
       0       1      29       0    0.02    1945    2594       0    0.12
       0       2       2       0    0.02    1635    2594       0    0.09    0.00   99.89    0.00      29
       0       2      30       0    0.01    1939    2594       0    0.10
       0       3       3       0    0.01    1834    2594       0    0.08    0.00   99.90    0.00      31
       0       3      31       0    0.01    1554    2594       0    0.09
       0       4       4       0    0.02    1827    2594       0    0.08    0.00   99.91    0.00      33
       0       4      32       0    0.01    1824    2594       0    0.08
       0       5       5       0    0.02    1925    2594       0    0.08    0.00   99.90    0.00      29
       0       5      33       0    0.01    1796    2594       0    0.08
       0       6       6       0    0.02    1801    2594       0    0.07    0.00   99.91    0.00      34
       0       6      34       0    0.01    1874    2594       0    0.08
       0       8       7       0    0.02    1930    2594       0    0.08    0.00   99.91    0.00      30
       0       8      35       0    0.01    1901    2594       0    0.08
       0       9       8       0    0.02    1874    2594       0    0.10    0.00   99.88    0.00      30
       0       9      36       0    0.02    1915    2594       0    0.10
       0      10       9       0    0.02    1779    2594       0    0.08    0.00   99.90    0.00      32
       0      10      37       0    0.01    1983    2594       0    0.09
       0      11      10       0    0.02    1754    2594       0    0.08    0.00   99.90    0.00      31
       0      11      38       0    0.01    1722    2594       0    0.09
       0      12      11       0    0.02    1730    2594       0    0.08    0.00   99.90    0.00      29
       0      12      39       0    0.01    1892    2594       0    0.09
       0      13      12       0    0.02    1943    2594       0    0.10    0.00   99.88    0.00      30
       0      13      40       0    0.02    2016    2594       0    0.10
       0      14      13       0    0.02    1893    2594       0    0.10    0.00   99.87    0.00      31
       0      14      41       0    0.01    1790    2594       0    0.11
       1       0      14       1    0.03    1998    2594       0    0.16    0.00   99.81    0.00      28      37    4.13    0.00   94.52    0.00    5.81    1.86    0.08    0.00
       1       0      42       3    0.08    3493    2594       0    0.11
       1       1      15       0    0.01    2483    2594       0    0.08    0.00   99.90    0.00      31
       1       1      43       0    0.01    2279    2594       0    0.09
       1       2      16       0    0.01    2454    2594       0    0.07    0.00   99.92    0.00      27
       1       2      44       0    0.01    2405    2594       0    0.07
       1       3      17       1    0.03    3069    2594       0    0.29    0.00   99.68    0.00      31
       1       3      45       0    0.01    2298    2594       0    0.31
       1       4      18       0    0.01    2515    2594       0    0.08    0.00   99.91    0.00      28
       1       4      46       0    0.01    2193    2594       0    0.08
       1       5      19       0    0.01    2547    2594       0    0.06    0.00   99.93    0.00      28
       1       5      47       0    0.01    2327    2594       0    0.06
       1       6      20       0    0.01    2315    2594       0    0.07    0.00   99.92    0.00      29
       1       6      48       0    0.01    2120    2594       0    0.07
       1       8      21       0    0.01    2482    2594       0    0.07    0.00   99.92    0.00      29
       1       8      49       0    0.01    2311    2594       0    0.07
       1       9      22       0    0.01    2372    2594       0    0.09    0.00   99.90    0.00      30
       1       9      50       0    0.01    2509    2594       0    0.09
       1      10      23       0    0.02    2147    2594       0    0.08    0.00   99.91    0.00      27
       1      10      51       0    0.01    2477    2594       0    0.08
       1      11      24       0    0.01    2138    2594       0    0.08    0.00   99.90    0.00      29
       1      11      52       0    0.01    2365    2594       0    0.09
       1      12      25       0    0.01    1965    2594       0    0.07    0.00   99.91    0.00      28
       1      12      53       0    0.01    2447    2594       0    0.08
       1      13      26       0    0.01    2476    2594       0    0.08    0.00   99.91    0.00      28
       1      13      54       0    0.01    2282    2594       0    0.08
       1      14      27       0    0.01    2386    2594       0    0.76    0.00   99.22    0.00      28
       1      14      55       1    0.02    3065    2594       0    0.75




* Re: [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
  2014-05-30 20:07                     ` Tim Chen
@ 2014-05-30 20:15                       ` Dirk Brandewie
  0 siblings, 0 replies; 29+ messages in thread
From: Dirk Brandewie @ 2014-05-30 20:15 UTC (permalink / raw)
  To: Tim Chen
  Cc: dirk.brandewie, Dave Jones, George Spelvin, herbert,
	james.guilford, JBeulich, linux-kernel, Jacob jun Pan

On 05/30/2014 01:07 PM, Tim Chen wrote:
> On Fri, 2014-05-30 at 12:38 -0700, Dirk Brandewie wrote:
>
>>> Dirk,
>>>
>>> Thanks for checking things out.
>>>
>>> I tested on a Haswell system, and I see that the frequency
>>> can dip below the max even when I set the min_perf_pct to 100.
>>> Let me know if you want to log on to my system and check if
>>> there's something I missed. It is odd that the package 1's
>>> cores are at a much higher frequency and close to
>>> max than package 0, once min_perf_pct is set to 100.
>>>
>>
>> Can you run turbostat for a few samples it reports an average over the sample
>> time.
>>
>
> Here it is.
>

You have me at a loss here.  I can come in on Monday if you are around,
and we can try to figure out what is happening.

--Dirk
> Tim
>
> Package     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt   PKG_%   RAM_%
>         -       -       -       0    0.02    2048    2594       0    0.23    0.00   99.75    0.00      33      42    5.93    0.00   91.52    0.00   23.22    4.15    0.12    0.00
>         0       0       0       1    0.06    1997    2594       0    0.16    0.00   99.78    0.00      32      42    7.92    0.00   91.55    0.00   16.88    1.95    0.06    0.00
>         0       0      28       0    0.01    1338    2594       0    0.21
>         0       1       1       0    0.02    1696    2594       0    0.11    0.00   99.87    0.00      33
>         0       1      29       0    0.01    1455    2594       0    0.11
>         0       2       2       0    0.01    1618    2594       0    0.07    0.00   99.91    0.00      30
>         0       2      30       0    0.01    1513    2594       0    0.07
>         0       3       3       0    0.01    1724    2594       0    0.08    0.00   99.91    0.00      31
>         0       3      31       0    0.01    1447    2594       0    0.08
>         0       4       4       0    0.01    1769    2594       0    0.06    0.00   99.92    0.00      32
>         0       4      32       0    0.01    1483    2594       0    0.06
>         0       5       5       0    0.01    1670    2594       0    0.07    0.00   99.92    0.00      29
>         0       5      33       0    0.01    1515    2594       0    0.07
>         0       6       6       0    0.01    1600    2594       0    0.07    0.00   99.92    0.00      33
>         0       6      34       0    0.01    1412    2594       0    0.07
>         0       8       7       0    0.01    1588    2594       0    0.07    0.00   99.92    0.00      30
>         0       8      35       0    0.01    1432    2594       0    0.07
>         0       9       8       0    0.01    1662    2594       0    0.11    0.00   99.88    0.00      32
>         0       9      36       0    0.02    1658    2594       0    0.10
>         0      10       9       0    0.01    1570    2594       0    0.07    0.00   99.91    0.00      32
>         0      10      37       0    0.01    1468    2594       0    0.07
>         0      11      10       0    0.01    1680    2594       0    0.07    0.00   99.92    0.00      31
>         0      11      38       0    0.01    1511    2594       0    0.07
>         0      12      11       0    0.01    1690    2594       0    0.08    0.00   99.91    0.00      30
>         0      12      39       0    0.01    1560    2594       0    0.08
>         0      13      12       0    0.02    1604    2594       0    0.11    0.00   99.87    0.00      29
>         0      13      40       0    0.02    1436    2594       0    0.11
>         0      14      13       0    0.02    1620    2594       0    0.09    0.00   99.89    0.00      29
>         0      14      41       0    0.02    1440    2594       0    0.09
>         1       0      14       0    0.03    1666    2594       0    0.16    0.00   99.82    0.00      28      36    3.94    0.00   91.50    0.00    6.34    2.20    0.06    0.00
>         1       0      42       3    0.08    3263    2594       0    0.11
>         1       1      15       0    0.01    2194    2594       0    0.09    0.00   99.90    0.00      30
>         1       1      43       0    0.01    2358    2594       0    0.09
>         1       2      16       0    0.01    2650    2594       0    0.08    0.00   99.91    0.00      28
>         1       2      44       0    0.01    2032    2594       0    0.08
>         1       3      17       1    0.03    2305    2594       0    4.11    0.00   95.86    0.00      30
>         1       3      45       0    0.01    2290    2594       0    4.13
>         1       4      18       0    0.01    2362    2594       0    0.09    0.00   99.90    0.00      28
>         1       4      46       0    0.01    2325    2594       0    0.09
>         1       5      19       0    0.01    2374    2594       0    0.07    0.00   99.92    0.00      30
>         1       5      47       0    0.01    2442    2594       0    0.07
>         1       6      20       0    0.01    2476    2594       0    0.08    0.00   99.91    0.00      30
>         1       6      48       0    0.01    2382    2594       0    0.07
>         1       8      21       0    0.01    2669    2594       0    0.09    0.00   99.90    0.00      29
>         1       8      49       0    0.02    1953    2594       0    0.09
>         1       9      22       0    0.01    2537    2594       0    0.10    0.00   99.89    0.00      31
>         1       9      50       0    0.01    2117    2594       0    0.10
>         1      10      23       0    0.01    2531    2594       0    0.07    0.00   99.92    0.00      27
>         1      10      51       0    0.01    2404    2594       0    0.08
>         1      11      24       0    0.01    2315    2594       0    0.08    0.00   99.91    0.00      28
>         1      11      52       0    0.01    2210    2594       0    0.08
>         1      12      25       0    0.01    2434    2594       0    0.07    0.00   99.91    0.00      28
>         1      12      53       0    0.01    2113    2594       0    0.08
>         1      13      26       0    0.01    2070    2594       0    0.07    0.00   99.91    0.00      27
>         1      13      54       0    0.01    2114    2594       0    0.08
>         1      14      27       0    0.01    2324    2594       0    0.10    0.00   99.89    0.00      27
>         1      14      55       1    0.03    2991    2594       0    0.08
> Package     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt   PKG_%   RAM_%
>         -       -       -       0    0.01    2138    2594       0    0.10    0.01   99.88    0.00      33      42    4.32    0.09   94.88    0.00   22.45    3.56    0.12    0.00
>         0       0       0       1    0.07    2106    2594       0    0.25    0.00   99.68    0.00      31      42    4.20    0.00   95.00    0.00   16.72    1.73    0.06    0.00
>         0       0      28       0    0.01    2163    2594       0    0.31
>         0       1       1       0    0.02    2005    2594       0    0.11    0.00   99.87    0.00      33
>         0       1      29       0    0.01    1823    2594       0    0.12
>         0       2       2       0    0.02    2008    2594       0    0.10    0.00   99.88    0.00      30
>         0       2      30       0    0.01    1903    2594       0    0.10
>         0       3       3       0    0.02    1953    2594       0    0.10    0.00   99.88    0.00      31
>         0       3      31       0    0.01    1840    2594       0    0.11
>         0       4       4       0    0.02    2220    2594       0    0.09    0.01   99.89    0.00      33
>         0       4      32       0    0.01    1806    2594       0    0.09
>         0       5       5       0    0.01    1723    2594       0    0.09    0.00   99.89    0.00      28
>         0       5      33       0    0.01    1904    2594       0    0.09
>         0       6       6       0    0.01    1806    2594       0    0.08    0.00   99.91    0.00      33
>         0       6      34       0    0.01    1824    2594       0    0.08
>         0       8       7       0    0.01    1910    2594       0    0.10    0.00   99.89    0.00      30
>         0       8      35       0    0.01    1847    2594       0    0.10
>         0       9       8       0    0.02    2204    2594       0    0.11    0.00   99.88    0.00      30
>         0       9      36       0    0.02    1899    2594       0    0.11
>         0      10       9       0    0.01    1967    2594       0    0.09    0.00   99.90    0.00      33
>         0      10      37       0    0.01    1838    2594       0    0.09
>         0      11      10       0    0.01    1696    2594       0    0.08    0.00   99.90    0.00      31
>         0      11      38       0    0.01    1728    2594       0    0.08
>         0      12      11       0    0.02    1863    2594       0    0.09    0.00   99.90    0.00      30
>         0      12      39       0    0.01    1838    2594       0    0.09
>         0      13      12       0    0.02    1856    2594       0    0.11    0.00   99.87    0.00      29
>         0      13      40       0    0.01    1741    2594       0    0.12
>         0      14      13       0    0.02    1887    2594       0    0.10    0.00   99.88    0.00      30
>         0      14      41       0    0.01    1860    2594       0    0.11
>         1       0      14       0    0.03    1875    2594       0    0.09    0.00   99.88    0.00      28      38    4.44    0.18   94.75    0.00    5.72    1.82    0.06    0.00
>         1       0      42       0    0.01    2363    2594       0    0.11
>         1       1      15       0    0.01    2368    2594       0    0.09    0.00   99.90    0.00      31
>         1       1      43       0    0.01    2403    2594       0    0.09
>         1       2      16       0    0.01    2501    2594       0    0.07    0.00   99.91    0.00      27
>         1       2      44       0    0.01    2469    2594       0    0.07
>         1       3      17       1    0.04    2674    2594       0    0.10    0.19   99.66    0.00      30
>         1       3      45       0    0.01    2374    2594       0    0.13
>         1       4      18       0    0.01    2446    2594       0    0.08    0.00   99.91    0.00      28
>         1       4      46       0    0.01    2372    2594       0    0.08
>         1       5      19       0    0.01    2479    2594       0    0.08    0.00   99.91    0.00      29
>         1       5      47       0    0.01    2352    2594       0    0.08
>         1       6      20       0    0.01    2436    2594       0    0.07    0.00   99.91    0.00      30
>         1       6      48       0    0.01    2381    2594       0    0.08
>         1       8      21       0    0.01    2377    2594       0    0.08    0.00   99.91    0.00      29
>         1       8      49       0    0.01    2629    2594       0    0.08
>         1       9      22       0    0.01    2407    2594       0    0.09    0.00   99.90    0.00      30
>         1       9      50       0    0.01    2547    2594       0    0.09
>         1      10      23       0    0.01    2254    2594       0    0.09    0.00   99.90    0.00      28
>         1      10      51       0    0.01    2514    2594       0    0.09
>         1      11      24       0    0.01    2204    2594       0    0.10    0.00   99.89    0.00      29
>         1      11      52       0    0.01    2187    2594       0    0.09
>         1      12      25       0    0.01    2310    2594       0    0.09    0.00   99.90    0.00      27
>         1      12      53       0    0.01    2636    2594       0    0.09
>         1      13      26       0    0.01    2325    2594       0    0.09    0.00   99.89    0.00      29
>         1      13      54       0    0.02    1959    2594       0    0.09
>         1      14      27       0    0.01    2273    2594       0    0.11    0.00   99.88    0.00      28
>         1      14      55       1    0.02    2678    2594       0    0.10
> Package     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt   PKG_%   RAM_%
>         -       -       -       0    0.02    2223    2594       0    0.12    0.00   99.86    0.00      34      41    4.47    0.00   94.54    0.00   22.70    3.64    0.14    0.00
>         0       0       0       1    0.05    2251    2594       0    0.18    0.00   99.77    0.00      32      41    4.81    0.00   94.56    0.00   16.89    1.78    0.06    0.00
>         0       0      28       0    0.01    1846    2594       0    0.22
>         0       1       1       0    0.02    1758    2594       0    0.12    0.00   99.86    0.00      33
>         0       1      29       0    0.02    1945    2594       0    0.12
>         0       2       2       0    0.02    1635    2594       0    0.09    0.00   99.89    0.00      29
>         0       2      30       0    0.01    1939    2594       0    0.10
>         0       3       3       0    0.01    1834    2594       0    0.08    0.00   99.90    0.00      31
>         0       3      31       0    0.01    1554    2594       0    0.09
>         0       4       4       0    0.02    1827    2594       0    0.08    0.00   99.91    0.00      33
>         0       4      32       0    0.01    1824    2594       0    0.08
>         0       5       5       0    0.02    1925    2594       0    0.08    0.00   99.90    0.00      29
>         0       5      33       0    0.01    1796    2594       0    0.08
>         0       6       6       0    0.02    1801    2594       0    0.07    0.00   99.91    0.00      34
>         0       6      34       0    0.01    1874    2594       0    0.08
>         0       8       7       0    0.02    1930    2594       0    0.08    0.00   99.91    0.00      30
>         0       8      35       0    0.01    1901    2594       0    0.08
>         0       9       8       0    0.02    1874    2594       0    0.10    0.00   99.88    0.00      30
>         0       9      36       0    0.02    1915    2594       0    0.10
>         0      10       9       0    0.02    1779    2594       0    0.08    0.00   99.90    0.00      32
>         0      10      37       0    0.01    1983    2594       0    0.09
>         0      11      10       0    0.02    1754    2594       0    0.08    0.00   99.90    0.00      31
>         0      11      38       0    0.01    1722    2594       0    0.09
>         0      12      11       0    0.02    1730    2594       0    0.08    0.00   99.90    0.00      29
>         0      12      39       0    0.01    1892    2594       0    0.09
>         0      13      12       0    0.02    1943    2594       0    0.10    0.00   99.88    0.00      30
>         0      13      40       0    0.02    2016    2594       0    0.10
>         0      14      13       0    0.02    1893    2594       0    0.10    0.00   99.87    0.00      31
>         0      14      41       0    0.01    1790    2594       0    0.11
>         1       0      14       1    0.03    1998    2594       0    0.16    0.00   99.81    0.00      28      37    4.13    0.00   94.52    0.00    5.81    1.86    0.08    0.00
>         1       0      42       3    0.08    3493    2594       0    0.11
>         1       1      15       0    0.01    2483    2594       0    0.08    0.00   99.90    0.00      31
>         1       1      43       0    0.01    2279    2594       0    0.09
>         1       2      16       0    0.01    2454    2594       0    0.07    0.00   99.92    0.00      27
>         1       2      44       0    0.01    2405    2594       0    0.07
>         1       3      17       1    0.03    3069    2594       0    0.29    0.00   99.68    0.00      31
>         1       3      45       0    0.01    2298    2594       0    0.31
>         1       4      18       0    0.01    2515    2594       0    0.08    0.00   99.91    0.00      28
>         1       4      46       0    0.01    2193    2594       0    0.08
>         1       5      19       0    0.01    2547    2594       0    0.06    0.00   99.93    0.00      28
>         1       5      47       0    0.01    2327    2594       0    0.06
>         1       6      20       0    0.01    2315    2594       0    0.07    0.00   99.92    0.00      29
>         1       6      48       0    0.01    2120    2594       0    0.07
>         1       8      21       0    0.01    2482    2594       0    0.07    0.00   99.92    0.00      29
>         1       8      49       0    0.01    2311    2594       0    0.07
>         1       9      22       0    0.01    2372    2594       0    0.09    0.00   99.90    0.00      30
>         1       9      50       0    0.01    2509    2594       0    0.09
>         1      10      23       0    0.02    2147    2594       0    0.08    0.00   99.91    0.00      27
>         1      10      51       0    0.01    2477    2594       0    0.08
>         1      11      24       0    0.01    2138    2594       0    0.08    0.00   99.90    0.00      29
>         1      11      52       0    0.01    2365    2594       0    0.09
>         1      12      25       0    0.01    1965    2594       0    0.07    0.00   99.91    0.00      28
>         1      12      53       0    0.01    2447    2594       0    0.08
>         1      13      26       0    0.01    2476    2594       0    0.08    0.00   99.91    0.00      28
>         1      13      54       0    0.01    2282    2594       0    0.08
>         1      14      27       0    0.01    2386    2594       0    0.76    0.00   99.22    0.00      28
>         1      14      55       1    0.02    3065    2594       0    0.75
>
>



* [PATCH v3] crypto: crc32c-pclmul - Shrink K_table to 32-bit words
  2014-05-30 17:01                   ` Tim Chen
@ 2014-06-07  3:08                     ` George Spelvin
  2014-06-20 18:42                       ` Herbert Xu
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2014-06-07  3:08 UTC (permalink / raw)
  To: linux, tim.c.chen; +Cc: herbert, james.guilford, JBeulich, linux-kernel

There's no need for the K_table to be made of 64-bit words.  For some
reason, the original authors didn't fully reduce the values modulo the
CRC32C polynomial, and so had some 33-bit values in there.  They can
all be reduced to 32 bits.

Doing that cuts the table size in half.  Since the code depends on both
pclmulq and crc32, SSE 4.1 is obviously present, so we can use pmovzxdq
to fetch it in the correct format.

This adds (measured on Ivy Bridge) 1 cycle per main loop iteration
(CRC of up to 3K bytes), less than 0.2%.  The hope is that the reduced
D-cache footprint will make up the loss in other code.

Two other related fixes:
* K_table is read-only, so belongs in .rodata, and
* There's no need for more than 8-byte alignment

Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: George Spelvin <linux@horizon.com>
---
Having been tweaked, benchmarked and acked, I think this is ready to
be merged.

My initial attempts at additional speedups resulted in slowdowns;
apparently Intel coders are fairly good at optimization. :-)
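
For anyone who wants to check the new table against the old one, here
is a minimal Python sketch (not part of the patch).  It assumes the old
entries were left unreduced against the bit-reflected CRC32C polynomial
with its x^32 term included, 0x105ec76f1.  The names POLY33, reduce33
and pmovzxdq are made up for illustration, and pmovzxdq here only
emulates the instruction's zero-extension of two 32-bit words.

```python
import struct

# Bit-reflected CRC32C polynomial including the implicit x^32 term:
# (0x82F63B78 << 1) | 1.  Assumption: this is the 33-bit modulus the
# old table entries were not fully reduced against.
POLY33 = 0x105EC76F1


def reduce33(k):
    """Reduce a value of at most 33 bits to 32 bits modulo POLY33."""
    assert k < (1 << 33)
    return k ^ POLY33 if k >> 32 else k


def pmovzxdq(mem):
    """Emulate pmovzxdq: zero-extend two 32-bit words to two 64-bit lanes."""
    lo, hi = struct.unpack("<II", mem)
    return struct.pack("<QQ", lo, hi)


# First three entries of the old table (.quad pairs) and of the new
# table (.long pairs), copied from the diff.
OLD = [(0x14cd00bd6, 0x105ec76f0),
       (0x0ba4fc28e, 0x14cd00bd6),
       (0x1d82c63da, 0x0f20c0dfe)]
NEW = [(0x493c7d27, 0x00000001),
       (0xba4fc28e, 0x493c7d27),
       (0xddc0152b, 0xf20c0dfe)]

for (o0, o1), (n0, n1) in zip(OLD, NEW):
    # Each new constant is the old one with bit 32 folded away.
    assert (reduce33(o0), reduce33(o1)) == (n0, n1)
    # An 8-byte entry loaded with pmovzxdq fills the register the same
    # way a 16-byte movdqa of the reduced values would.
    assert pmovzxdq(struct.pack("<II", n0, n1)) == struct.pack("<QQ", n0, n1)
```

Only the first three entries are checked here; the same loop should
apply to the rest of the table.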

 arch/x86/crypto/crc32c-pcl-intel-asm_64.S | 281 +++++++++++++++---------------
 1 file changed, 139 insertions(+), 142 deletions(-)

diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index dbc4339b..26d49eba 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -72,6 +72,7 @@
 
 # unsigned int crc_pcl(u8 *buffer, int len, unsigned int crc_init);
 
+.text
 ENTRY(crc_pcl)
 #define    bufp		%rdi
 #define    bufp_dw	%edi
@@ -216,15 +217,11 @@ LABEL crc_ %i
 	## 4) Combine three results:
 	################################################################
 
-	lea	(K_table-16)(%rip), bufp	# first entry is for idx 1
+	lea	(K_table-8)(%rip), bufp		# first entry is for idx 1
 	shlq    $3, %rax			# rax *= 8
-	subq    %rax, tmp			# tmp -= rax*8
-	shlq    $1, %rax
-	subq    %rax, tmp			# tmp -= rax*16
-						# (total tmp -= rax*24)
-	addq    %rax, bufp
-
-	movdqa  (bufp), %xmm0			# 2 consts: K1:K2
+	pmovzxdq (bufp,%rax), %xmm0		# 2 consts: K1:K2
+	leal	(%eax,%eax,2), %eax		# rax *= 3 (total *24)
+	subq    %rax, tmp			# tmp -= rax*24
 
 	movq    crc_init, %xmm1			# CRC for block 1
 	PCLMULQDQ 0x00,%xmm0,%xmm1		# Multiply by K2
@@ -238,9 +235,9 @@ LABEL crc_ %i
 	mov     crc2, crc_init
 	crc32   %rax, crc_init
 
-################################################################
-## 5) Check for end:
-################################################################
+	################################################################
+	## 5) Check for end:
+	################################################################
 
 LABEL crc_ 0
 	mov     tmp, len
@@ -331,136 +328,136 @@ ENDPROC(crc_pcl)
 
 	################################################################
 	## PCLMULQDQ tables
-	## Table is 128 entries x 2 quad words each
+	## Table is 128 entries x 2 words (8 bytes) each
 	################################################################
-.data
-.align 64
+.section	.rodata, "a", %progbits
+.align 8
 K_table:
-        .quad 0x14cd00bd6,0x105ec76f0
-        .quad 0x0ba4fc28e,0x14cd00bd6
-        .quad 0x1d82c63da,0x0f20c0dfe
-        .quad 0x09e4addf8,0x0ba4fc28e
-        .quad 0x039d3b296,0x1384aa63a
-        .quad 0x102f9b8a2,0x1d82c63da
-        .quad 0x14237f5e6,0x01c291d04
-        .quad 0x00d3b6092,0x09e4addf8
-        .quad 0x0c96cfdc0,0x0740eef02
-        .quad 0x18266e456,0x039d3b296
-        .quad 0x0daece73e,0x0083a6eec
-        .quad 0x0ab7aff2a,0x102f9b8a2
-        .quad 0x1248ea574,0x1c1733996
-        .quad 0x083348832,0x14237f5e6
-        .quad 0x12c743124,0x02ad91c30
-        .quad 0x0b9e02b86,0x00d3b6092
-        .quad 0x018b33a4e,0x06992cea2
-        .quad 0x1b331e26a,0x0c96cfdc0
-        .quad 0x17d35ba46,0x07e908048
-        .quad 0x1bf2e8b8a,0x18266e456
-        .quad 0x1a3e0968a,0x11ed1f9d8
-        .quad 0x0ce7f39f4,0x0daece73e
-        .quad 0x061d82e56,0x0f1d0f55e
-        .quad 0x0d270f1a2,0x0ab7aff2a
-        .quad 0x1c3f5f66c,0x0a87ab8a8
-        .quad 0x12ed0daac,0x1248ea574
-        .quad 0x065863b64,0x08462d800
-        .quad 0x11eef4f8e,0x083348832
-        .quad 0x1ee54f54c,0x071d111a8
-        .quad 0x0b3e32c28,0x12c743124
-        .quad 0x0064f7f26,0x0ffd852c6
-        .quad 0x0dd7e3b0c,0x0b9e02b86
-        .quad 0x0f285651c,0x0dcb17aa4
-        .quad 0x010746f3c,0x018b33a4e
-        .quad 0x1c24afea4,0x0f37c5aee
-        .quad 0x0271d9844,0x1b331e26a
-        .quad 0x08e766a0c,0x06051d5a2
-        .quad 0x093a5f730,0x17d35ba46
-        .quad 0x06cb08e5c,0x11d5ca20e
-        .quad 0x06b749fb2,0x1bf2e8b8a
-        .quad 0x1167f94f2,0x021f3d99c
-        .quad 0x0cec3662e,0x1a3e0968a
-        .quad 0x19329634a,0x08f158014
-        .quad 0x0e6fc4e6a,0x0ce7f39f4
-        .quad 0x08227bb8a,0x1a5e82106
-        .quad 0x0b0cd4768,0x061d82e56
-        .quad 0x13c2b89c4,0x188815ab2
-        .quad 0x0d7a4825c,0x0d270f1a2
-        .quad 0x10f5ff2ba,0x105405f3e
-        .quad 0x00167d312,0x1c3f5f66c
-        .quad 0x0f6076544,0x0e9adf796
-        .quad 0x026f6a60a,0x12ed0daac
-        .quad 0x1a2adb74e,0x096638b34
-        .quad 0x19d34af3a,0x065863b64
-        .quad 0x049c3cc9c,0x1e50585a0
-        .quad 0x068bce87a,0x11eef4f8e
-        .quad 0x1524fa6c6,0x19f1c69dc
-        .quad 0x16cba8aca,0x1ee54f54c
-        .quad 0x042d98888,0x12913343e
-        .quad 0x1329d9f7e,0x0b3e32c28
-        .quad 0x1b1c69528,0x088f25a3a
-        .quad 0x02178513a,0x0064f7f26
-        .quad 0x0e0ac139e,0x04e36f0b0
-        .quad 0x0170076fa,0x0dd7e3b0c
-        .quad 0x141a1a2e2,0x0bd6f81f8
-        .quad 0x16ad828b4,0x0f285651c
-        .quad 0x041d17b64,0x19425cbba
-        .quad 0x1fae1cc66,0x010746f3c
-        .quad 0x1a75b4b00,0x18db37e8a
-        .quad 0x0f872e54c,0x1c24afea4
-        .quad 0x01e41e9fc,0x04c144932
-        .quad 0x086d8e4d2,0x0271d9844
-        .quad 0x160f7af7a,0x052148f02
-        .quad 0x05bb8f1bc,0x08e766a0c
-        .quad 0x0a90fd27a,0x0a3c6f37a
-        .quad 0x0b3af077a,0x093a5f730
-        .quad 0x04984d782,0x1d22c238e
-        .quad 0x0ca6ef3ac,0x06cb08e5c
-        .quad 0x0234e0b26,0x063ded06a
-        .quad 0x1d88abd4a,0x06b749fb2
-        .quad 0x04597456a,0x04d56973c
-        .quad 0x0e9e28eb4,0x1167f94f2
-        .quad 0x07b3ff57a,0x19385bf2e
-        .quad 0x0c9c8b782,0x0cec3662e
-        .quad 0x13a9cba9e,0x0e417f38a
-        .quad 0x093e106a4,0x19329634a
-        .quad 0x167001a9c,0x14e727980
-        .quad 0x1ddffc5d4,0x0e6fc4e6a
-        .quad 0x00df04680,0x0d104b8fc
-        .quad 0x02342001e,0x08227bb8a
-        .quad 0x00a2a8d7e,0x05b397730
-        .quad 0x168763fa6,0x0b0cd4768
-        .quad 0x1ed5a407a,0x0e78eb416
-        .quad 0x0d2c3ed1a,0x13c2b89c4
-        .quad 0x0995a5724,0x1641378f0
-        .quad 0x19b1afbc4,0x0d7a4825c
-        .quad 0x109ffedc0,0x08d96551c
-        .quad 0x0f2271e60,0x10f5ff2ba
-        .quad 0x00b0bf8ca,0x00bf80dd2
-        .quad 0x123888b7a,0x00167d312
-        .quad 0x1e888f7dc,0x18dcddd1c
-        .quad 0x002ee03b2,0x0f6076544
-        .quad 0x183e8d8fe,0x06a45d2b2
-        .quad 0x133d7a042,0x026f6a60a
-        .quad 0x116b0f50c,0x1dd3e10e8
-        .quad 0x05fabe670,0x1a2adb74e
-        .quad 0x130004488,0x0de87806c
-        .quad 0x000bcf5f6,0x19d34af3a
-        .quad 0x18f0c7078,0x014338754
-        .quad 0x017f27698,0x049c3cc9c
-        .quad 0x058ca5f00,0x15e3e77ee
-        .quad 0x1af900c24,0x068bce87a
-        .quad 0x0b5cfca28,0x0dd07448e
-        .quad 0x0ded288f8,0x1524fa6c6
-        .quad 0x059f229bc,0x1d8048348
-        .quad 0x06d390dec,0x16cba8aca
-        .quad 0x037170390,0x0a3e3e02c
-        .quad 0x06353c1cc,0x042d98888
-        .quad 0x0c4584f5c,0x0d73c7bea
-        .quad 0x1f16a3418,0x1329d9f7e
-        .quad 0x0531377e2,0x185137662
-        .quad 0x1d8d9ca7c,0x1b1c69528
-        .quad 0x0b25b29f2,0x18a08b5bc
-        .quad 0x19fb2a8b0,0x02178513a
-        .quad 0x1a08fe6ac,0x1da758ae0
-        .quad 0x045cddf4e,0x0e0ac139e
-        .quad 0x1a91647f2,0x169cf9eb0
-        .quad 0x1a0f717c4,0x0170076fa
+	.long 0x493c7d27, 0x00000001
+	.long 0xba4fc28e, 0x493c7d27
+	.long 0xddc0152b, 0xf20c0dfe
+	.long 0x9e4addf8, 0xba4fc28e
+	.long 0x39d3b296, 0x3da6d0cb
+	.long 0x0715ce53, 0xddc0152b
+	.long 0x47db8317, 0x1c291d04
+	.long 0x0d3b6092, 0x9e4addf8
+	.long 0xc96cfdc0, 0x740eef02
+	.long 0x878a92a7, 0x39d3b296
+	.long 0xdaece73e, 0x083a6eec
+	.long 0xab7aff2a, 0x0715ce53
+	.long 0x2162d385, 0xc49f4f67
+	.long 0x83348832, 0x47db8317
+	.long 0x299847d5, 0x2ad91c30
+	.long 0xb9e02b86, 0x0d3b6092
+	.long 0x18b33a4e, 0x6992cea2
+	.long 0xb6dd949b, 0xc96cfdc0
+	.long 0x78d9ccb7, 0x7e908048
+	.long 0xbac2fd7b, 0x878a92a7
+	.long 0xa60ce07b, 0x1b3d8f29
+	.long 0xce7f39f4, 0xdaece73e
+	.long 0x61d82e56, 0xf1d0f55e
+	.long 0xd270f1a2, 0xab7aff2a
+	.long 0xc619809d, 0xa87ab8a8
+	.long 0x2b3cac5d, 0x2162d385
+	.long 0x65863b64, 0x8462d800
+	.long 0x1b03397f, 0x83348832
+	.long 0xebb883bd, 0x71d111a8
+	.long 0xb3e32c28, 0x299847d5
+	.long 0x064f7f26, 0xffd852c6
+	.long 0xdd7e3b0c, 0xb9e02b86
+	.long 0xf285651c, 0xdcb17aa4
+	.long 0x10746f3c, 0x18b33a4e
+	.long 0xc7a68855, 0xf37c5aee
+	.long 0x271d9844, 0xb6dd949b
+	.long 0x8e766a0c, 0x6051d5a2
+	.long 0x93a5f730, 0x78d9ccb7
+	.long 0x6cb08e5c, 0x18b0d4ff
+	.long 0x6b749fb2, 0xbac2fd7b
+	.long 0x1393e203, 0x21f3d99c
+	.long 0xcec3662e, 0xa60ce07b
+	.long 0x96c515bb, 0x8f158014
+	.long 0xe6fc4e6a, 0xce7f39f4
+	.long 0x8227bb8a, 0xa00457f7
+	.long 0xb0cd4768, 0x61d82e56
+	.long 0x39c7ff35, 0x8d6d2c43
+	.long 0xd7a4825c, 0xd270f1a2
+	.long 0x0ab3844b, 0x00ac29cf
+	.long 0x0167d312, 0xc619809d
+	.long 0xf6076544, 0xe9adf796
+	.long 0x26f6a60a, 0x2b3cac5d
+	.long 0xa741c1bf, 0x96638b34
+	.long 0x98d8d9cb, 0x65863b64
+	.long 0x49c3cc9c, 0xe0e9f351
+	.long 0x68bce87a, 0x1b03397f
+	.long 0x57a3d037, 0x9af01f2d
+	.long 0x6956fc3b, 0xebb883bd
+	.long 0x42d98888, 0x2cff42cf
+	.long 0x3771e98f, 0xb3e32c28
+	.long 0xb42ae3d9, 0x88f25a3a
+	.long 0x2178513a, 0x064f7f26
+	.long 0xe0ac139e, 0x4e36f0b0
+	.long 0x170076fa, 0xdd7e3b0c
+	.long 0x444dd413, 0xbd6f81f8
+	.long 0x6f345e45, 0xf285651c
+	.long 0x41d17b64, 0x91c9bd4b
+	.long 0xff0dba97, 0x10746f3c
+	.long 0xa2b73df1, 0x885f087b
+	.long 0xf872e54c, 0xc7a68855
+	.long 0x1e41e9fc, 0x4c144932
+	.long 0x86d8e4d2, 0x271d9844
+	.long 0x651bd98b, 0x52148f02
+	.long 0x5bb8f1bc, 0x8e766a0c
+	.long 0xa90fd27a, 0xa3c6f37a
+	.long 0xb3af077a, 0x93a5f730
+	.long 0x4984d782, 0xd7c0557f
+	.long 0xca6ef3ac, 0x6cb08e5c
+	.long 0x234e0b26, 0x63ded06a
+	.long 0xdd66cbbb, 0x6b749fb2
+	.long 0x4597456a, 0x4d56973c
+	.long 0xe9e28eb4, 0x1393e203
+	.long 0x7b3ff57a, 0x9669c9df
+	.long 0xc9c8b782, 0xcec3662e
+	.long 0x3f70cc6f, 0xe417f38a
+	.long 0x93e106a4, 0x96c515bb
+	.long 0x62ec6c6d, 0x4b9e0f71
+	.long 0xd813b325, 0xe6fc4e6a
+	.long 0x0df04680, 0xd104b8fc
+	.long 0x2342001e, 0x8227bb8a
+	.long 0x0a2a8d7e, 0x5b397730
+	.long 0x6d9a4957, 0xb0cd4768
+	.long 0xe8b6368b, 0xe78eb416
+	.long 0xd2c3ed1a, 0x39c7ff35
+	.long 0x995a5724, 0x61ff0e01
+	.long 0x9ef68d35, 0xd7a4825c
+	.long 0x0c139b31, 0x8d96551c
+	.long 0xf2271e60, 0x0ab3844b
+	.long 0x0b0bf8ca, 0x0bf80dd2
+	.long 0x2664fd8b, 0x0167d312
+	.long 0xed64812d, 0x8821abed
+	.long 0x02ee03b2, 0xf6076544
+	.long 0x8604ae0f, 0x6a45d2b2
+	.long 0x363bd6b3, 0x26f6a60a
+	.long 0x135c83fd, 0xd8d26619
+	.long 0x5fabe670, 0xa741c1bf
+	.long 0x35ec3279, 0xde87806c
+	.long 0x00bcf5f6, 0x98d8d9cb
+	.long 0x8ae00689, 0x14338754
+	.long 0x17f27698, 0x49c3cc9c
+	.long 0x58ca5f00, 0x5bd2011f
+	.long 0xaa7c7ad5, 0x68bce87a
+	.long 0xb5cfca28, 0xdd07448e
+	.long 0xded288f8, 0x57a3d037
+	.long 0x59f229bc, 0xdde8f5b9
+	.long 0x6d390dec, 0x6956fc3b
+	.long 0x37170390, 0xa3e3e02c
+	.long 0x6353c1cc, 0x42d98888
+	.long 0xc4584f5c, 0xd73c7bea
+	.long 0xf48642e9, 0x3771e98f
+	.long 0x531377e2, 0x80ff0093
+	.long 0xdd35bc8d, 0xb42ae3d9
+	.long 0xb25b29f2, 0x8fe4c34d
+	.long 0x9a5ede41, 0x2178513a
+	.long 0xa563905d, 0xdf99fc11
+	.long 0x45cddf4e, 0xe0ac139e
+	.long 0xacfa3103, 0x6c23e841
+	.long 0xa51b6135, 0x170076fa
-- 
2.0.0
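
For illustration, the widening that pmovzxdq performs on one table entry
can be modelled in Python.  This is only a sketch: it assumes each .long
pair in the table above is a single 8-byte entry, and pmovzxdq_model is a
hypothetical helper name, not anything in the kernel.

```python
import struct


def pmovzxdq_model(entry: bytes) -> bytes:
    """Model pmovzxdq: zero-extend two packed 32-bit words to two 64-bit lanes."""
    if len(entry) != 8:
        raise ValueError("expected 8 bytes (two 32-bit words)")
    lo, hi = struct.unpack("<II", entry)   # two little-endian 32-bit words
    return struct.pack("<QQ", lo, hi)      # same values, each widened to 64 bits


# One entry as emitted by ".long 0x10746f3c, 0x18b33a4e" on little-endian x86.
entry = struct.pack("<II", 0x10746F3C, 0x18B33A4E)
xmm = pmovzxdq_model(entry)
k_lo, k_hi = struct.unpack("<QQ", xmm)
print(len(entry), len(xmm), hex(k_lo), hex(k_hi))
```

The stored form is 8 bytes per entry, while the 16-byte form is what the
carry-less multiply consumes; storing only the narrow form is where the
halving of the table comes from.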


* Re: [PATCH v3] crypto: crc32c-pclmul - Shrink K_table to 32-bit words
  2014-06-07  3:08                     ` [PATCH v3] crypto: crc32c-pclmul - Shrink K_table to 32-bit words George Spelvin
@ 2014-06-20 18:42                       ` Herbert Xu
  0 siblings, 0 replies; 29+ messages in thread
From: Herbert Xu @ 2014-06-20 18:42 UTC (permalink / raw)
  To: George Spelvin; +Cc: tim.c.chen, james.guilford, JBeulich, linux-kernel

On Fri, Jun 06, 2014 at 11:08:58PM -0400, George Spelvin wrote:
> There's no need for the K_table to be made of 64-bit words.  For some
> reason, the original authors didn't fully reduce the values modulo the
> CRC32C polynomial, and so had some 33-bit values in there.  They can
> all be reduced to 32 bits.
> 
> Doing that cuts the table size in half.  Since the code depends on both
> pclmulq and crc32, SSE 4.1 is obviously present, so we can use pmovzxdq
> to fetch it in the correct format.
> 
> This adds (measured on Ivy Bridge) 1 cycle per main loop iteration
> (CRC of up to 3K bytes), less than 0.2%.  The hope is that the reduced
> D-cache footprint will make up the loss in other code.
> 
> Two other related fixes:
> * K_table is read-only, so belongs in .rodata, and
> * There's no need for more than 8-byte alignment
> 
> Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: George Spelvin <linux@horizon.com>

Patch applied.  Thanks!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
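
The reduction described in the quoted commit message can be sketched as
polynomial division over GF(2).  This is a minimal model, assuming the
usual CRC32C (Castagnoli) polynomial written MSB-first with its x^32 term
included (0x11EDC6F41).  The kernel table uses its own bit ordering, so
the numbers here are not meant to match K_table entries, and poly_mod is
a hypothetical helper name.

```python
CRC32C_POLY = 0x11EDC6F41  # Castagnoli polynomial, MSB-first, x^32 term included


def poly_mod(value: int, poly: int = CRC32C_POLY) -> int:
    """Reduce a GF(2) polynomial (bits of `value`) modulo `poly`."""
    degree = poly.bit_length() - 1
    while value.bit_length() > degree:
        # Align the modulus with the current top bit and cancel that bit.
        shift = value.bit_length() - 1 - degree
        value ^= poly << shift
    return value


# A 33-bit value (x^32) folds down to a residue that fits in 32 bits.
residue = poly_mod(1 << 32)
print(hex(residue), residue.bit_length())
```

Any value that already fits in 32 bits is returned unchanged, so applying
the reduction to a whole table only alters the entries that were not fully
reduced.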

Thread overview: 29+ messages
2014-05-28 14:40 [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table George Spelvin
2014-05-28 15:32 ` George Spelvin
2014-05-28 22:15   ` [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words George Spelvin
2014-05-28 23:02     ` Tim Chen
2014-05-28 23:55       ` George Spelvin
2014-05-29  3:26       ` George Spelvin
2014-05-29 16:33         ` Tim Chen
2014-05-28 20:47 ` [RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table Jan Beulich
2014-05-28 21:47   ` George Spelvin
2014-05-29  6:44     ` Jan Beulich
2014-05-28 22:32 ` Tim Chen
2014-05-28 23:01   ` George Spelvin
2014-05-28 23:28     ` Tim Chen
2014-05-29 23:54       ` George Spelvin
2014-05-30  1:07         ` Tim Chen
2014-05-30  1:16           ` Dave Jones
2014-05-30 17:56             ` Tim Chen
2014-05-30 18:45               ` Dirk Brandewie
2014-05-30 19:32                 ` Tim Chen
2014-05-30 19:38                   ` Dirk Brandewie
2014-05-30 20:07                     ` Tim Chen
2014-05-30 20:15                       ` Dirk Brandewie
2014-05-30  1:37           ` George Spelvin
2014-05-30  5:25             ` George Spelvin
2014-05-30 16:10               ` Tim Chen
2014-05-30 16:52                 ` George Spelvin
2014-05-30 17:01                   ` Tim Chen
2014-06-07  3:08                     ` [PATCH v3] crypto: crc32c-pclmul - Shrink K_table to 32-bit words George Spelvin
2014-06-20 18:42                       ` Herbert Xu