* [PATCH v2 0/3] powerpc/64: memcmp() optimization
@ 2017-09-20 23:34 wei.guo.simon
2017-09-20 23:34 ` [PATCH v2 1/3] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: wei.guo.simon @ 2017-09-20 23:34 UTC (permalink / raw)
To: linuxppc-dev
Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, David Laight,
Christophe LEROY, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
There is some room to optimize the powerpc 64-bit version of memcmp() for
the following two cases:
(1) Even when the src/dst addresses are not 8-byte aligned at the start,
memcmp() can align them and use the .Llong comparison mode instead of
falling back to the .Lshort mode, which compares the buffers byte by byte.
(2) VMX instructions can be used to speed up large-size comparisons.
This patch set also updates the memcmp selftest so that it compiles and
includes a large-size comparison case.
v1 -> v2:
- update the comparison method for bytes not aligned to an 8-byte boundary.
- fix a VMX comparison bug.
- enhance the original memcmp() selftest.
- add powerpc/64 to the subject/commit messages.
Simon Guo (3):
powerpc/64: Align bytes before fall back to .Lshort in powerpc64
memcmp().
powerpc/64: enhance memcmp() with VMX instruction for long bytes
comparision
powerpc:selftest update memcmp_64 selftest for VMX implementation
arch/powerpc/include/asm/asm-prototypes.h | 2 +-
arch/powerpc/lib/copypage_power7.S | 2 +-
arch/powerpc/lib/memcmp_64.S | 181 ++++++++++++++++++++-
arch/powerpc/lib/memcpy_power7.S | 2 +-
arch/powerpc/lib/vmx-helper.c | 2 +-
.../selftests/powerpc/copyloops/asm/ppc_asm.h | 2 +-
.../selftests/powerpc/stringloops/asm/ppc_asm.h | 31 ++++
.../testing/selftests/powerpc/stringloops/memcmp.c | 63 ++++---
8 files changed, 254 insertions(+), 31 deletions(-)
--
1.8.3.1
* [PATCH v2 1/3] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp().
2017-09-20 23:34 [PATCH v2 0/3] powerpc/64: memcmp() optimization wei.guo.simon
@ 2017-09-20 23:34 ` wei.guo.simon
2017-09-20 23:34 ` [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
2017-09-20 23:34 ` [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation wei.guo.simon
2 siblings, 0 replies; 18+ messages in thread
From: wei.guo.simon @ 2017-09-20 23:34 UTC (permalink / raw)
To: linuxppc-dev
Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, David Laight,
Christophe LEROY, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
Currently the powerpc 64-bit version of memcmp() falls back to .Lshort
(byte-by-byte compare mode) if either the src or dst address is not 8-byte
aligned. This can be optimized when both addresses have the same offset
from an 8-byte boundary.
memcmp() can first compare the unaligned bytes up to the 8-byte boundary
and then compare the remaining 8-byte-aligned content in .Llong mode.
This patch optimizes memcmp() for this situation.
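The idea, expressed as a rough C sketch (the function name and the byte-wise
head/tail handling below are illustrative only; the real implementation is
assembly and uses shifted doublewords for the unaligned head):
------
#include <stddef.h>

/* Sketch of the new flow.  Only reachable when s1 and s2 share the same
 * offset from an 8-byte boundary (checked with xor/andi. in the assembly).
 */
static int memcmp_idea(const unsigned char *s1, const unsigned char *s2,
		       size_t n)
{
	size_t head = (8 - ((unsigned long)s1 & 7)) & 7;

	if (head > n)
		head = n;
	n -= head;

	/* compare the unaligned head first */
	while (head--) {
		if (*s1 != *s2)
			return *s1 < *s2 ? -1 : 1;
		s1++;
		s2++;
	}

	/* both pointers are now 8-byte aligned: .Llong-style loop */
	while (n >= 8) {
		unsigned long a = *(const unsigned long *)s1;
		unsigned long b = *(const unsigned long *)s2;

		if (a != b)
			break;	/* differing doubleword, finish byte-wise */
		s1 += 8;
		s2 += 8;
		n -= 8;
	}

	/* tail (or the differing doubleword) byte by byte */
	while (n--) {
		if (*s1 != *s2)
			return *s1 < *s2 ? -1 : 1;
		s1++;
		s2++;
	}
	return 0;
}
------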
Test result:
(1) 256 bytes
Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
- without patch
50.996607479 seconds time elapsed ( +- 0.01% )
- with patch
28.033316997 seconds time elapsed ( +- 0.01% )
-> There is ~+81% improvement
(2) 32 bytes
To observe the performance impact on sizes below 32 bytes, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c as follows:
-------
#include <string.h>
#include "utils.h"
-#define SIZE 256
+#define SIZE 32
#define ITERATIONS 10000
int test_memcmp(const void *s1, const void *s2, size_t n);
--------
- Without patch
0.392578831 seconds time elapsed ( +- 0.05% )
- with patch
0.358446662 seconds time elapsed ( +- 0.04% )
-> There is ~+9% improvement
(3) 0~8 bytes
To observe the performance impact on sizes below 8 bytes, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c as follows:
-------
#include <string.h>
#include "utils.h"
-#define SIZE 256
-#define ITERATIONS 10000
+#define SIZE 8
+#define ITERATIONS 1000000
int test_memcmp(const void *s1, const void *s2, size_t n);
-------
- Without patch
3.168752060 seconds time elapsed ( +- 0.10% )
- With patch
3.153030138 seconds time elapsed ( +- 0.09% )
-> They are nearly the same. (-0.4%)
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
arch/powerpc/lib/memcmp_64.S | 99 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 93 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index d75d18b..6dccfb8 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -24,28 +24,35 @@
#define rH r31
#ifdef __LITTLE_ENDIAN__
+#define LH lhbrx
+#define LW lwbrx
#define LD ldbrx
#else
+#define LH lhzx
+#define LW lwzx
#define LD ldx
#endif
_GLOBAL(memcmp)
cmpdi cr1,r5,0
- /* Use the short loop if both strings are not 8B aligned */
- or r6,r3,r4
+ /* Use the short loop if the src/dst addresses are not
+ * with the same offset of 8 bytes align boundary.
+ */
+ xor r6,r3,r4
andi. r6,r6,7
- /* Use the short loop if length is less than 32B */
- cmpdi cr6,r5,31
+ /* fall back to short loop if compare at aligned addrs
+ * with less than 8 bytes.
+ */
+ cmpdi cr6,r5,7
beq cr1,.Lzero
bne .Lshort
- bgt cr6,.Llong
+ bgt cr6,.L8bytes_make_align_start
.Lshort:
mtctr r5
-
1: lbz rA,0(r3)
lbz rB,0(r4)
subf. rC,rB,rA
@@ -78,6 +85,78 @@ _GLOBAL(memcmp)
li r3,0
blr
+.L8bytes_make_align_start:
+ /* attempt to compare bytes not aligned with 8 bytes so that
+ * left comparison can run based on 8 bytes alignment.
+ */
+ andi. r6,r3,7
+ beq .L8bytes_aligned
+
+ /* Try to compare the first double word which is not 8 bytes aligned:
+ * load the first double word at (src & ~7UL) and shift left appropriate
+ * bits before comparision.
+ */
+ clrlwi r6,r3,29
+ rlwinm r6,r6,3,0,28
+ clrrdi r3,r3,3
+ clrrdi r4,r4,3
+ LD rA,0,r3
+ LD rB,0,r4
+ sld rA,rA,r6
+ sld rB,rB,r6
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ srwi r6,r6,3
+ subfic r6,r6,8
+ subfc. r5,r6,r5
+ beq .Lzero
+ addi r3,r3,8
+ addi r4,r4,8
+
+.L8bytes_aligned:
+ /* now we are aligned with 8 bytes.
+ * Use .Llong loop if left cmp bytes are equal or greater than 32B.
+ */
+ cmpdi cr6,r5,31
+ bgt cr6,.Llong
+
+ cmpdi cr6,r5,7
+ bgt cr6,.Lcmp_8bytes_31bytes
+
+.Lcmp_rest_lt8bytes:
+ /* Here we have only less than 8 bytes to compare with. Addresses
+ * are aligned with 8 bytes.
+ * The next double words are load and shift right with appropriate
+ * bits.
+ */
+ subfic r6,r5,8
+ rlwinm r6,r6,3,0,28
+ LD rA,0,r3
+ LD rB,0,r4
+ srd rA,rA,r6
+ srd rB,rB,r6
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ beq .Lzero
+
+.Lcmp_8bytes_31bytes:
+ /* compare 8 ~ 31 bytes with 8 bytes aligned */
+ srdi. r0,r5,3
+ clrldi r5,r5,61
+ mtctr r0
+831:
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ addi r3,r3,8
+ addi r4,r4,8
+ bdnz 831b
+
+ cmpwi r5,0
+ beq .Lzero
+ b .Lcmp_rest_lt8bytes
+
.Lnon_zero:
mr r3,rC
blr
@@ -232,4 +311,12 @@ _GLOBAL(memcmp)
ld r28,-32(r1)
ld r27,-40(r1)
blr
+
+.LcmpAB_lightweight: /* skip NV GPRS restore */
+ li r3,1
+ bgt cr0,8f
+ li r3,-1
+8:
+ blr
+
EXPORT_SYMBOL(memcmp)
--
1.8.3.1
* [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2017-09-20 23:34 [PATCH v2 0/3] powerpc/64: memcmp() optimization wei.guo.simon
2017-09-20 23:34 ` [PATCH v2 1/3] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
@ 2017-09-20 23:34 ` wei.guo.simon
2017-09-21 0:54 ` Simon Guo
2017-09-22 14:06 ` Cyril Bur
2017-09-20 23:34 ` [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation wei.guo.simon
2 siblings, 2 replies; 18+ messages in thread
From: wei.guo.simon @ 2017-09-20 23:34 UTC (permalink / raw)
To: linuxppc-dev
Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, David Laight,
Christophe LEROY, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
This patch adds VMX primitives to memcmp() for the case where the compare
size exceeds 4K bytes.
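A rough C model of the dispatch being added (enter_vmx_ops() is introduced by
this patch; the other helper names are made up for illustration, and the real
code is assembly in memcmp_64.S):
------
#include <stddef.h>

int enter_vmx_ops(void);

/* compares 32 bytes per iteration with lvx + vcmpequd. */
int memcmp_vmx_loop(const void *s1, const void *s2, size_t n);

/* the pre-existing .Llong doubleword loop */
int memcmp_dword_loop(const void *s1, const void *s2, size_t n);

int memcmp_long(const void *s1, const void *s2, size_t n)
{
#ifdef CONFIG_ALTIVEC
	/* Enabling VMX has a cost and can fail (e.g. in interrupt
	 * context), so only try it when the compare size exceeds 4K.
	 */
	if (n > 4096 && enter_vmx_ops())
		return memcmp_vmx_loop(s1, s2, n);
#endif
	return memcmp_dword_loop(s1, s2, n);
}
------
In the assembly, the VMX path additionally aligns the pointers to 16 bytes
before entering the 32-bytes-per-iteration loop.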
Test result with the following test program (replace the "^>" prefix with ""):
------
># cat tools/testing/selftests/powerpc/stringloops/memcmp.c
>#include <malloc.h>
>#include <stdlib.h>
>#include <string.h>
>#include <time.h>
>#include "utils.h"
>#define SIZE (1024 * 1024 * 900)
>#define ITERATIONS 40
int test_memcmp(const void *s1, const void *s2, size_t n);
static int testcase(void)
{
char *s1;
char *s2;
unsigned long i;
s1 = memalign(128, SIZE);
if (!s1) {
perror("memalign");
exit(1);
}
s2 = memalign(128, SIZE);
if (!s2) {
perror("memalign");
exit(1);
}
for (i = 0; i < SIZE; i++) {
s1[i] = i & 0xff;
s2[i] = i & 0xff;
}
for (i = 0; i < ITERATIONS; i++) {
int ret = test_memcmp(s1, s2, SIZE);
if (ret) {
printf("return %d at[%ld]! should have returned zero\n", ret, i);
abort();
}
}
return 0;
}
int main(void)
{
return test_harness(testcase, "memcmp");
}
------
Without VMX patch:
7.435191479 seconds time elapsed ( +- 0.51% )
With VMX patch:
6.802038938 seconds time elapsed ( +- 0.56% )
There is ~+8% improvement.
However, I am not yet aware of an in-kernel use case for memcmp() on such
large sizes.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
arch/powerpc/include/asm/asm-prototypes.h | 2 +-
arch/powerpc/lib/copypage_power7.S | 2 +-
arch/powerpc/lib/memcmp_64.S | 82 +++++++++++++++++++++++++++++++
arch/powerpc/lib/memcpy_power7.S | 2 +-
arch/powerpc/lib/vmx-helper.c | 2 +-
5 files changed, 86 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index 7330150..e6530d8 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -49,7 +49,7 @@ void __trace_hcall_exit(long opcode, unsigned long retval,
/* VMX copying */
int enter_vmx_usercopy(void);
int exit_vmx_usercopy(void);
-int enter_vmx_copy(void);
+int enter_vmx_ops(void);
void * exit_vmx_copy(void *dest);
/* Traps */
diff --git a/arch/powerpc/lib/copypage_power7.S b/arch/powerpc/lib/copypage_power7.S
index ca5fc8f..9e7729e 100644
--- a/arch/powerpc/lib/copypage_power7.S
+++ b/arch/powerpc/lib/copypage_power7.S
@@ -60,7 +60,7 @@ _GLOBAL(copypage_power7)
std r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
std r0,16(r1)
stdu r1,-STACKFRAMESIZE(r1)
- bl enter_vmx_copy
+ bl enter_vmx_ops
cmpwi r3,0
ld r0,STACKFRAMESIZE+16(r1)
ld r3,STK_REG(R31)(r1)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 6dccfb8..40218fc 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -162,6 +162,13 @@ _GLOBAL(memcmp)
blr
.Llong:
+#ifdef CONFIG_ALTIVEC
+ /* Try to use vmx loop if length is larger than 4K */
+ cmpldi cr6,r5,4096
+ bgt cr6,.Lvmx_cmp
+
+.Llong_novmx_cmp:
+#endif
li off8,8
li off16,16
li off24,24
@@ -319,4 +326,79 @@ _GLOBAL(memcmp)
8:
blr
+#ifdef CONFIG_ALTIVEC
+.Lvmx_cmp:
+ mflr r0
+ std r3,-STACKFRAMESIZE+STK_REG(R31)(r1)
+ std r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
+ std r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
+ std r0,16(r1)
+ stdu r1,-STACKFRAMESIZE(r1)
+ bl enter_vmx_ops
+ cmpwi cr1,r3,0
+ ld r0,STACKFRAMESIZE+16(r1)
+ ld r3,STK_REG(R31)(r1)
+ ld r4,STK_REG(R30)(r1)
+ ld r5,STK_REG(R29)(r1)
+ addi r1,r1,STACKFRAMESIZE
+ mtlr r0
+ beq cr1,.Llong_novmx_cmp
+
+3:
+ /* Enter with src/dst address 8 bytes aligned, and len is
+ * no less than 4KB. Need to align with 16 bytes further.
+ */
+ andi. rA,r3,8
+ beq 4f
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+
+ addi r3,r3,8
+ addi r4,r4,8
+ addi r5,r5,-8
+
+4:
+ /* compare 32 bytes for each loop */
+ srdi r0,r5,5
+ mtctr r0
+ andi. r5,r5,31
+ li off16,16
+
+.balign 16
+5:
+ lvx v0,0,r3
+ lvx v1,0,r4
+ vcmpequd. v0,v0,v1
+ bf 24,7f
+ lvx v0,off16,r3
+ lvx v1,off16,r4
+ vcmpequd. v0,v0,v1
+ bf 24,6f
+ addi r3,r3,32
+ addi r4,r4,32
+ bdnz 5b
+
+ cmpdi r5,0
+ beq .Lzero
+ b .L8bytes_aligned
+
+6:
+ addi r3,r3,16
+ addi r4,r4,16
+
+7:
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+
+ li off8,8
+ LD rA,off8,r3
+ LD rB,off8,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ b .Lzero
+#endif
EXPORT_SYMBOL(memcmp)
diff --git a/arch/powerpc/lib/memcpy_power7.S b/arch/powerpc/lib/memcpy_power7.S
index 193909a..682e386 100644
--- a/arch/powerpc/lib/memcpy_power7.S
+++ b/arch/powerpc/lib/memcpy_power7.S
@@ -230,7 +230,7 @@ _GLOBAL(memcpy_power7)
std r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
std r0,16(r1)
stdu r1,-STACKFRAMESIZE(r1)
- bl enter_vmx_copy
+ bl enter_vmx_ops
cmpwi cr1,r3,0
ld r0,STACKFRAMESIZE+16(r1)
ld r3,STK_REG(R31)(r1)
diff --git a/arch/powerpc/lib/vmx-helper.c b/arch/powerpc/lib/vmx-helper.c
index bf925cd..923a9ab 100644
--- a/arch/powerpc/lib/vmx-helper.c
+++ b/arch/powerpc/lib/vmx-helper.c
@@ -53,7 +53,7 @@ int exit_vmx_usercopy(void)
return 0;
}
-int enter_vmx_copy(void)
+int enter_vmx_ops(void)
{
if (in_interrupt())
return 0;
--
1.8.3.1
* [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation
2017-09-20 23:34 [PATCH v2 0/3] powerpc/64: memcmp() optimization wei.guo.simon
2017-09-20 23:34 ` [PATCH v2 1/3] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
2017-09-20 23:34 ` [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
@ 2017-09-20 23:34 ` wei.guo.simon
2017-09-25 9:30 ` David Laight
2 siblings, 1 reply; 18+ messages in thread
From: wei.guo.simon @ 2017-09-20 23:34 UTC (permalink / raw)
To: linuxppc-dev
Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, David Laight,
Christophe LEROY, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
This patch adjusts the memcmp_64 selftest so that it compiles successfully.
It also adds test cases for memcmp() on sizes over 4K bytes.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
.../selftests/powerpc/copyloops/asm/ppc_asm.h | 2 +-
.../selftests/powerpc/stringloops/asm/ppc_asm.h | 31 +++++++++++
.../testing/selftests/powerpc/stringloops/memcmp.c | 63 +++++++++++++++-------
3 files changed, 75 insertions(+), 21 deletions(-)
diff --git a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
index 80d34a9..a9da02d 100644
--- a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
@@ -35,7 +35,7 @@
li r3,0
blr
-FUNC_START(enter_vmx_copy)
+FUNC_START(enter_vmx_ops)
li r3,1
blr
diff --git a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
index 11bece8..c8bb360 100644
--- a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
@@ -1,3 +1,5 @@
+#ifndef _PPC_ASM_H
+#define _PPC_ASM_H
#include <ppc-asm.h>
#ifndef r1
@@ -5,3 +7,32 @@
#endif
#define _GLOBAL(A) FUNC_START(test_ ## A)
+
+#define CONFIG_ALTIVEC
+
+#define R14 r14
+#define R15 r15
+#define R16 r16
+#define R17 r17
+#define R18 r18
+#define R19 r19
+#define R20 r20
+#define R21 r21
+#define R22 r22
+#define R29 r29
+#define R30 r30
+#define R31 r31
+
+#define STACKFRAMESIZE 256
+#define STK_REG(i) (112 + ((i)-14)*8)
+
+#define _GLOBAL(A) FUNC_START(test_ ## A)
+#define _GLOBAL_TOC(A) _GLOBAL(A)
+
+#define PPC_MTOCRF(A, B) mtocrf A, B
+
+FUNC_START(enter_vmx_ops)
+ li r3, 1
+ blr
+
+#endif
diff --git a/tools/testing/selftests/powerpc/stringloops/memcmp.c b/tools/testing/selftests/powerpc/stringloops/memcmp.c
index 30b1222..4826669 100644
--- a/tools/testing/selftests/powerpc/stringloops/memcmp.c
+++ b/tools/testing/selftests/powerpc/stringloops/memcmp.c
@@ -1,20 +1,27 @@
#include <malloc.h>
#include <stdlib.h>
#include <string.h>
+#include <time.h>
#include "utils.h"
#define SIZE 256
#define ITERATIONS 10000
+#define LARGE_SIZE (5 * 1024)
+#define LARGE_ITERATIONS 1000
+#define LARGE_MAX_OFFSET 16
+#define LARGE_SIZE_START 4096
+
int test_memcmp(const void *s1, const void *s2, size_t n);
/* test all offsets and lengths */
-static void test_one(char *s1, char *s2)
+static void test_one(char *s1, char *s2, unsigned long max_offset,
+ unsigned long size_start)
{
unsigned long offset, size;
- for (offset = 0; offset < SIZE; offset++) {
- for (size = 0; size < (SIZE-offset); size++) {
+ for (offset = 0; offset < max_offset; offset++) {
+ for (size = size_start; size < (SIZE-offset); size++) {
int x, y;
unsigned long i;
@@ -38,66 +45,82 @@ static void test_one(char *s1, char *s2)
}
}
-static int testcase(void)
+static int testcase(bool islarge)
{
char *s1;
char *s2;
unsigned long i;
- s1 = memalign(128, SIZE);
+ unsigned long alloc_size = islarge ? LARGE_SIZE : SIZE;
+ int iterations = islarge ? LARGE_ITERATIONS : ITERATIONS;
+
+ s1 = memalign(128, alloc_size);
if (!s1) {
perror("memalign");
exit(1);
}
- s2 = memalign(128, SIZE);
+ s2 = memalign(128, alloc_size);
if (!s2) {
perror("memalign");
exit(1);
}
- srandom(1);
+ srandom(time(0));
- for (i = 0; i < ITERATIONS; i++) {
+ for (i = 0; i < iterations; i++) {
unsigned long j;
unsigned long change;
- for (j = 0; j < SIZE; j++)
+ for (j = 0; j < alloc_size; j++)
s1[j] = random();
- memcpy(s2, s1, SIZE);
+ memcpy(s2, s1, alloc_size);
/* change one byte */
- change = random() % SIZE;
+ change = random() % alloc_size;
s2[change] = random() & 0xff;
- test_one(s1, s2);
+ if (islarge)
+ test_one(s1, s2, LARGE_MAX_OFFSET, LARGE_SIZE_START);
+ else
+ test_one(s1, s2, SIZE, 0);
}
- srandom(1);
+ srandom(time(0));
- for (i = 0; i < ITERATIONS; i++) {
+ for (i = 0; i < iterations; i++) {
unsigned long j;
unsigned long change;
- for (j = 0; j < SIZE; j++)
+ for (j = 0; j < alloc_size; j++)
s1[j] = random();
- memcpy(s2, s1, SIZE);
+ memcpy(s2, s1, alloc_size);
/* change multiple bytes, 1/8 of total */
- for (j = 0; j < SIZE / 8; j++) {
- change = random() % SIZE;
+ for (j = 0; j < alloc_size / 8; j++) {
+ change = random() % alloc_size;
s2[change] = random() & 0xff;
}
- test_one(s1, s2);
+ if (islarge)
+ test_one(s1, s2, LARGE_MAX_OFFSET, LARGE_SIZE_START);
+ else
+ test_one(s1, s2, SIZE, 0);
}
return 0;
}
+static int testcases(void)
+{
+ testcase(0);
+ testcase(1);
+ return 0;
+}
+
int main(void)
{
- return test_harness(testcase, "memcmp");
+ return test_harness(testcases, "memcmp");
}
--
1.8.3.1
* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2017-09-20 23:34 ` [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
@ 2017-09-21 0:54 ` Simon Guo
2017-09-22 14:06 ` Cyril Bur
1 sibling, 0 replies; 18+ messages in thread
From: Simon Guo @ 2017-09-21 0:54 UTC (permalink / raw)
To: linuxppc-dev
Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, David Laight,
Christophe LEROY
Hi,
On Thu, Sep 21, 2017 at 07:34:39AM +0800, wei.guo.simon@gmail.com wrote:
> From: Simon Guo <wei.guo.simon@gmail.com>
>
> This patch add VMX primitives to do memcmp() in case the compare size
> exceeds 4K bytes.
>
> Test result with following test program(replace the "^>" with ""):
> ------
I missed the exit_vmx_ops() part and need to rework it for v3.
Thanks,
- Simon
* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2017-09-20 23:34 ` [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
2017-09-21 0:54 ` Simon Guo
@ 2017-09-22 14:06 ` Cyril Bur
2017-09-23 21:18 ` Simon Guo
1 sibling, 1 reply; 18+ messages in thread
From: Cyril Bur @ 2017-09-22 14:06 UTC (permalink / raw)
To: wei.guo.simon, linuxppc-dev; +Cc: David Laight, Naveen N. Rao
On Thu, 2017-09-21 at 07:34 +0800, wei.guo.simon@gmail.com wrote:
> From: Simon Guo <wei.guo.simon@gmail.com>
>
> This patch add VMX primitives to do memcmp() in case the compare size
> exceeds 4K bytes.
>
Hi Simon,
Sorry I didn't see this sooner. I've actually been working on a kernel
version of glibc commit dec4a7105e ("powerpc: Improve memcmp performance
for POWER8"); unfortunately I've been distracted and it still isn't done.
I wonder if we can consolidate our efforts here. One thing I did come
across in my testing is that for a memcmp() that will fail early (I
haven't narrowed down the optimal number yet), the cost of enabling
VMX actually turns out to be a performance regression. As such, I've
added a small check of the first 64 bytes at the start, before enabling
VMX, to ensure the penalty is worth taking.
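Roughly, the idea is something like this (the helper names are made up for
illustration; the 64-byte figure is from the measurement above and
enter_vmx_ops() is from the patch under discussion):
------
#include <stddef.h>

int enter_vmx_ops(void);
int memcmp_scalar(const void *s1, const void *s2, size_t n); /* no VMX */
int memcmp_vmx(const void *s1, const void *s2, size_t n);    /* VMX loop */

int memcmp_large(const void *s1, const void *s2, size_t n)
{
	size_t head = n < 64 ? n : 64;
	int ret;

	/* If the buffers differ within the first 64 bytes we would fail
	 * before finishing one iteration of the vectorised loop, so the
	 * cost of enabling VMX would never be recovered.
	 */
	ret = memcmp_scalar(s1, s2, head);
	if (ret || n <= 64)
		return ret;

	s1 = (const char *)s1 + 64;
	s2 = (const char *)s2 + 64;
	n -= 64;

	if (enter_vmx_ops())
		return memcmp_vmx(s1, s2, n);

	return memcmp_scalar(s1, s2, n);
}
------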
Also, you should consider handling 4K and greater: KSM (Kernel Samepage
Merging) uses PAGE_SIZE, which can be as small as 4K.
Cyril
> Test result with following test program(replace the "^>" with ""):
> ------
> > # cat tools/testing/selftests/powerpc/stringloops/memcmp.c
> > #include <malloc.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <time.h>
> > #include "utils.h"
> > #define SIZE (1024 * 1024 * 900)
> > #define ITERATIONS 40
>
> int test_memcmp(const void *s1, const void *s2, size_t n);
>
> static int testcase(void)
> {
> char *s1;
> char *s2;
> unsigned long i;
>
> s1 = memalign(128, SIZE);
> if (!s1) {
> perror("memalign");
> exit(1);
> }
>
> s2 = memalign(128, SIZE);
> if (!s2) {
> perror("memalign");
> exit(1);
> }
>
> for (i = 0; i < SIZE; i++) {
> s1[i] = i & 0xff;
> s2[i] = i & 0xff;
> }
> for (i = 0; i < ITERATIONS; i++) {
> int ret = test_memcmp(s1, s2, SIZE);
>
> if (ret) {
> printf("return %d at[%ld]! should have returned zero\n", ret, i);
> abort();
> }
> }
>
> return 0;
> }
>
> int main(void)
> {
> return test_harness(testcase, "memcmp");
> }
> ------
> Without VMX patch:
> 7.435191479 seconds time elapsed ( +- 0.51% )
> With VMX patch:
> 6.802038938 seconds time elapsed ( +- 0.56% )
> There is ~+8% improvement.
>
> However I am not aware whether there is use case in kernel for memcmp on
> large size yet.
>
> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
> ---
> arch/powerpc/include/asm/asm-prototypes.h | 2 +-
> arch/powerpc/lib/copypage_power7.S | 2 +-
> arch/powerpc/lib/memcmp_64.S | 82 +++++++++++++++++++++++++++++++
> arch/powerpc/lib/memcpy_power7.S | 2 +-
> arch/powerpc/lib/vmx-helper.c | 2 +-
> 5 files changed, 86 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
> index 7330150..e6530d8 100644
> --- a/arch/powerpc/include/asm/asm-prototypes.h
> +++ b/arch/powerpc/include/asm/asm-prototypes.h
> @@ -49,7 +49,7 @@ void __trace_hcall_exit(long opcode, unsigned long retval,
> /* VMX copying */
> int enter_vmx_usercopy(void);
> int exit_vmx_usercopy(void);
> -int enter_vmx_copy(void);
> +int enter_vmx_ops(void);
> void * exit_vmx_copy(void *dest);
>
> /* Traps */
> diff --git a/arch/powerpc/lib/copypage_power7.S b/arch/powerpc/lib/copypage_power7.S
> index ca5fc8f..9e7729e 100644
> --- a/arch/powerpc/lib/copypage_power7.S
> +++ b/arch/powerpc/lib/copypage_power7.S
> @@ -60,7 +60,7 @@ _GLOBAL(copypage_power7)
> std r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
> std r0,16(r1)
> stdu r1,-STACKFRAMESIZE(r1)
> - bl enter_vmx_copy
> + bl enter_vmx_ops
> cmpwi r3,0
> ld r0,STACKFRAMESIZE+16(r1)
> ld r3,STK_REG(R31)(r1)
> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index 6dccfb8..40218fc 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -162,6 +162,13 @@ _GLOBAL(memcmp)
> blr
>
> .Llong:
> +#ifdef CONFIG_ALTIVEC
> + /* Try to use vmx loop if length is larger than 4K */
> + cmpldi cr6,r5,4096
> + bgt cr6,.Lvmx_cmp
> +
> +.Llong_novmx_cmp:
> +#endif
> li off8,8
> li off16,16
> li off24,24
> @@ -319,4 +326,79 @@ _GLOBAL(memcmp)
> 8:
> blr
>
> +#ifdef CONFIG_ALTIVEC
> +.Lvmx_cmp:
> + mflr r0
> + std r3,-STACKFRAMESIZE+STK_REG(R31)(r1)
> + std r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
> + std r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
> + std r0,16(r1)
> + stdu r1,-STACKFRAMESIZE(r1)
> + bl enter_vmx_ops
> + cmpwi cr1,r3,0
> + ld r0,STACKFRAMESIZE+16(r1)
> + ld r3,STK_REG(R31)(r1)
> + ld r4,STK_REG(R30)(r1)
> + ld r5,STK_REG(R29)(r1)
> + addi r1,r1,STACKFRAMESIZE
> + mtlr r0
> + beq cr1,.Llong_novmx_cmp
> +
> +3:
> + /* Enter with src/dst address 8 bytes aligned, and len is
> + * no less than 4KB. Need to align with 16 bytes further.
> + */
> + andi. rA,r3,8
> + beq 4f
> + LD rA,0,r3
> + LD rB,0,r4
> + cmpld cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> +
> + addi r3,r3,8
> + addi r4,r4,8
> + addi r5,r5,-8
> +
> +4:
> + /* compare 32 bytes for each loop */
> + srdi r0,r5,5
> + mtctr r0
> + andi. r5,r5,31
> + li off16,16
> +
> +.balign 16
> +5:
> + lvx v0,0,r3
> + lvx v1,0,r4
> + vcmpequd. v0,v0,v1
> + bf 24,7f
> + lvx v0,off16,r3
> + lvx v1,off16,r4
> + vcmpequd. v0,v0,v1
> + bf 24,6f
> + addi r3,r3,32
> + addi r4,r4,32
> + bdnz 5b
> +
> + cmpdi r5,0
> + beq .Lzero
> + b .L8bytes_aligned
> +
> +6:
> + addi r3,r3,16
> + addi r4,r4,16
> +
> +7:
> + LD rA,0,r3
> + LD rB,0,r4
> + cmpld cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> +
> + li off8,8
> + LD rA,off8,r3
> + LD rB,off8,r4
> + cmpld cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> + b .Lzero
> +#endif
> EXPORT_SYMBOL(memcmp)
> diff --git a/arch/powerpc/lib/memcpy_power7.S b/arch/powerpc/lib/memcpy_power7.S
> index 193909a..682e386 100644
> --- a/arch/powerpc/lib/memcpy_power7.S
> +++ b/arch/powerpc/lib/memcpy_power7.S
> @@ -230,7 +230,7 @@ _GLOBAL(memcpy_power7)
> std r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
> std r0,16(r1)
> stdu r1,-STACKFRAMESIZE(r1)
> - bl enter_vmx_copy
> + bl enter_vmx_ops
> cmpwi cr1,r3,0
> ld r0,STACKFRAMESIZE+16(r1)
> ld r3,STK_REG(R31)(r1)
> diff --git a/arch/powerpc/lib/vmx-helper.c b/arch/powerpc/lib/vmx-helper.c
> index bf925cd..923a9ab 100644
> --- a/arch/powerpc/lib/vmx-helper.c
> +++ b/arch/powerpc/lib/vmx-helper.c
> @@ -53,7 +53,7 @@ int exit_vmx_usercopy(void)
> return 0;
> }
>
> -int enter_vmx_copy(void)
> +int enter_vmx_ops(void)
> {
> if (in_interrupt())
> return 0;
* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2017-09-22 14:06 ` Cyril Bur
@ 2017-09-23 21:18 ` Simon Guo
2017-09-25 23:59 ` Cyril Bur
0 siblings, 1 reply; 18+ messages in thread
From: Simon Guo @ 2017-09-23 21:18 UTC (permalink / raw)
To: Cyril Bur; +Cc: linuxppc-dev, David Laight, Naveen N. Rao
Hi Cyril,
On Sat, Sep 23, 2017 at 12:06:48AM +1000, Cyril Bur wrote:
> On Thu, 2017-09-21 at 07:34 +0800, wei.guo.simon@gmail.com wrote:
> > From: Simon Guo <wei.guo.simon@gmail.com>
> >
> > This patch add VMX primitives to do memcmp() in case the compare size
> > exceeds 4K bytes.
> >
>
> Hi Simon,
>
> Sorry I didn't see this sooner, I've actually been working on a kernel
> version of glibc commit dec4a7105e (powerpc: Improve memcmp performance
> for POWER8) unfortunately I've been distracted and it still isn't done.
Thanks for syncing with me. Let's consolidate our efforts together :)
I had a quick look at glibc commit dec4a7105e.
It looks like the aligned-case comparison with VSX is entered without any
limit on the rN size, which means it pays the VSX register load penalty
even when the length is only 9 bytes.
It also does some optimization when the src/dst addresses don't have the
same offset from the 8-byte alignment boundary. I need to read it more closely.
> I wonder if we can consolidate our efforts here. One thing I did come
> across in my testing is that for memcmp() that will fail early (I
> haven't narrowed down the the optimal number yet) the cost of enabling
> VMX actually turns out to be a performance regression, as such I've
> added a small check of the first 64 bytes to the start before enabling
> VMX to ensure the penalty is worth taking.
Will there still be a penalty if the 65th byte differs?
>
> Also, you should consider doing 4K and greater, KSM (Kernel Samepage
> Merging) uses PAGE_SIZE which can be as small as 4K.
Currently VMX is only applied when the size exceeds 4K. Are you
suggesting a bigger threshold than 4K?
We can sync up more offline for v3.
Thanks,
- Simon
* Re: [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation
2017-09-25 9:30 ` David Laight
@ 2017-09-24 6:19 ` Simon Guo
0 siblings, 0 replies; 18+ messages in thread
From: Simon Guo @ 2017-09-24 6:19 UTC (permalink / raw)
To: David Laight
Cc: linuxppc-dev, Paul Mackerras, Michael Ellerman, Naveen N. Rao,
Christophe LEROY
Hi David,
On Mon, Sep 25, 2017 at 09:30:28AM +0000, David Laight wrote:
> From: wei.guo.simon@gmail.com
> > Sent: 21 September 2017 00:35
> > This patch adjust selftest memcmp_64 so that memcmp selftest can be
> > compiled successfully.
> ...
> > #define ITERATIONS 10000
> >
> > +#define LARGE_SIZE (5 * 1024)
> > +#define LARGE_ITERATIONS 1000
> ...
>
> Measuring performance by doing a lot of iterations isn't ideal
> and is pretty pointless.
> Cold cache performance can be more useful.
> Also you don't really want any dynamic branch prediction logic
> tuned to the exact test you keep doing.
I think the (original) selftest aims at full coverage of functional
correctness, since each iteration generates a new data set at random.
- Simon
* RE: [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation
2017-09-20 23:34 ` [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation wei.guo.simon
@ 2017-09-25 9:30 ` David Laight
2017-09-24 6:19 ` Simon Guo
0 siblings, 1 reply; 18+ messages in thread
From: David Laight @ 2017-09-25 9:30 UTC (permalink / raw)
To: 'wei.guo.simon@gmail.com', linuxppc-dev
Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Christophe LEROY
From: wei.guo.simon@gmail.com
> Sent: 21 September 2017 00:35
> This patch adjust selftest memcmp_64 so that memcmp selftest can be
> compiled successfully.
...
> #define ITERATIONS 10000
>
> +#define LARGE_SIZE (5 * 1024)
> +#define LARGE_ITERATIONS 1000
...
Measuring performance by doing a lot of iterations isn't ideal
and is pretty pointless.
Cold cache performance can be more useful.
Also you don't really want any dynamic branch prediction logic
tuned to the exact test you keep doing.
David
* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2017-09-23 21:18 ` Simon Guo
@ 2017-09-25 23:59 ` Cyril Bur
2017-09-26 5:34 ` Michael Ellerman
0 siblings, 1 reply; 18+ messages in thread
From: Cyril Bur @ 2017-09-25 23:59 UTC (permalink / raw)
To: Simon Guo; +Cc: linuxppc-dev, David Laight, Naveen N. Rao
On Sun, 2017-09-24 at 05:18 +0800, Simon Guo wrote:
> Hi Cyril,
> On Sat, Sep 23, 2017 at 12:06:48AM +1000, Cyril Bur wrote:
> > On Thu, 2017-09-21 at 07:34 +0800, wei.guo.simon@gmail.com wrote:
> > > From: Simon Guo <wei.guo.simon@gmail.com>
> > >
> > > This patch add VMX primitives to do memcmp() in case the compare size
> > > exceeds 4K bytes.
> > >
> >
> > Hi Simon,
> >
> > Sorry I didn't see this sooner, I've actually been working on a kernel
> > version of glibc commit dec4a7105e (powerpc: Improve memcmp performance
> > for POWER8) unfortunately I've been distracted and it still isn't done.
>
> Thanks for sync with me. Let's consolidate our effort together :)
>
> I have a quick check on glibc commit dec4a7105e.
> Looks the aligned case comparison with VSX is launched without rN size
> limitation, which means it will have a VSX reg load penalty even when the
> length is 9 bytes.
>
This was written for userspace which doesn't have to explicitly enable
VMX in order to use it - we need to be smarter in the kernel.
> It did some optimization when src/dest addrs don't have the same offset
> on 8 bytes alignment boundary. I need to read more closely.
>
> > I wonder if we can consolidate our efforts here. One thing I did come
> > across in my testing is that for memcmp() that will fail early (I
> > haven't narrowed down the the optimal number yet) the cost of enabling
> > VMX actually turns out to be a performance regression, as such I've
> > added a small check of the first 64 bytes to the start before enabling
> > VMX to ensure the penalty is worth taking.
>
> Will there still be a penalty if the 65th byte differs?
>
I haven't benchmarked it exactly; my rationale for 64 bytes was that it
is the stride of the vectorised copy loop, so if we know we'll fail
before even completing one iteration of that loop there isn't any point
in using the vector regs.
> >
> > Also, you should consider doing 4K and greater, KSM (Kernel Samepage
> > Merging) uses PAGE_SIZE which can be as small as 4K.
>
> Currently the VMX will only be applied when size exceeds 4K. Are you
> suggesting a bigger threshold than 4K?
>
Equal to or greater than 4K, KSM will benefit.
> We can sync more offline for v3.
>
> Thanks,
> - Simon
* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2017-09-25 23:59 ` Cyril Bur
@ 2017-09-26 5:34 ` Michael Ellerman
2017-09-26 11:26 ` Segher Boessenkool
0 siblings, 1 reply; 18+ messages in thread
From: Michael Ellerman @ 2017-09-26 5:34 UTC (permalink / raw)
To: Cyril Bur, Simon Guo; +Cc: Naveen N. Rao, David Laight, linuxppc-dev, raji
Cyril Bur <cyrilbur@gmail.com> writes:
> On Sun, 2017-09-24 at 05:18 +0800, Simon Guo wrote:
>> Hi Cyril,
>> On Sat, Sep 23, 2017 at 12:06:48AM +1000, Cyril Bur wrote:
>> > On Thu, 2017-09-21 at 07:34 +0800, wei.guo.simon@gmail.com wrote:
>> > > From: Simon Guo <wei.guo.simon@gmail.com>
>> > >
>> > > This patch add VMX primitives to do memcmp() in case the compare size
>> > > exceeds 4K bytes.
>> >
>> > Sorry I didn't see this sooner, I've actually been working on a kernel
>> > version of glibc commit dec4a7105e (powerpc: Improve memcmp performance
>> > for POWER8) unfortunately I've been distracted and it still isn't done.
>>
>> Thanks for sync with me. Let's consolidate our effort together :)
>>
>> I have a quick check on glibc commit dec4a7105e.
>> Looks the aligned case comparison with VSX is launched without rN size
>> limitation, which means it will have a VSX reg load penalty even when the
>> length is 9 bytes.
>>
>
> This was written for userspace which doesn't have to explicitly enable
> VMX in order to use it - we need to be smarter in the kernel.
Well the kernel has to do it for them after a trap, which is actually
even more expensive, so arguably the glibc code should be smarter too
and the threshold before using VMX should probably be higher than in the
kernel (to cover the cost of the trap).
But I digress :)
cheers
* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2017-09-26 5:34 ` Michael Ellerman
@ 2017-09-26 11:26 ` Segher Boessenkool
2017-09-27 3:38 ` Michael Ellerman
0 siblings, 1 reply; 18+ messages in thread
From: Segher Boessenkool @ 2017-09-26 11:26 UTC (permalink / raw)
To: Michael Ellerman
Cc: Cyril Bur, Simon Guo, raji, Naveen N. Rao, David Laight, linuxppc-dev
On Tue, Sep 26, 2017 at 03:34:36PM +1000, Michael Ellerman wrote:
> Cyril Bur <cyrilbur@gmail.com> writes:
> > This was written for userspace which doesn't have to explicitly enable
> > VMX in order to use it - we need to be smarter in the kernel.
>
> Well the kernel has to do it for them after a trap, which is actually
> even more expensive, so arguably the glibc code should be smarter too
> and the threshold before using VMX should probably be higher than in the
> kernel (to cover the cost of the trap).
A lot of userspace code uses V*X, more and more with newer CPUs and newer
compiler versions. If you already paid the price for using vector
registers you do not need to again :-)
> But I digress :)
Yeah sorry :-)
Segher
* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2017-09-26 11:26 ` Segher Boessenkool
@ 2017-09-27 3:38 ` Michael Ellerman
2017-09-27 9:27 ` Segher Boessenkool
2017-09-27 16:22 ` Simon Guo
0 siblings, 2 replies; 18+ messages in thread
From: Michael Ellerman @ 2017-09-27 3:38 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Cyril Bur, Simon Guo, raji, Naveen N. Rao, David Laight, linuxppc-dev
Segher Boessenkool <segher@kernel.crashing.org> writes:
> On Tue, Sep 26, 2017 at 03:34:36PM +1000, Michael Ellerman wrote:
>> Cyril Bur <cyrilbur@gmail.com> writes:
>> > This was written for userspace which doesn't have to explicitly enable
>> > VMX in order to use it - we need to be smarter in the kernel.
>>
>> Well the kernel has to do it for them after a trap, which is actually
>> even more expensive, so arguably the glibc code should be smarter too
>> and the threshold before using VMX should probably be higher than in the
>> kernel (to cover the cost of the trap).
>
> A lot of userspace code uses V*X, more and more with newer CPUs and newer
> compiler versions. If you already paid the price for using vector
> registers you do not need to again :-)
True, but you don't know if you've paid the price already.
You also pay the price on every context switch (more state to switch),
so it's not free even once enabled. Which is why the kernel will
eventually turn it off if it's unused again.
But now that I've actually looked at the glibc version, it does do some
checks for minimum length before doing any vector instructions, so
that's probably all we want. The exact trade off between checking some
bytes without vector vs turning on vector depends on your input data, so
it's tricky to tune in general.
>> But I digress :)
>
> Yeah sorry :-)
:)
cheers
* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2017-09-27 3:38 ` Michael Ellerman
@ 2017-09-27 9:27 ` Segher Boessenkool
2017-09-27 9:43 ` David Laight
2017-09-27 16:22 ` Simon Guo
1 sibling, 1 reply; 18+ messages in thread
From: Segher Boessenkool @ 2017-09-27 9:27 UTC (permalink / raw)
To: Michael Ellerman
Cc: Cyril Bur, Simon Guo, raji, Naveen N. Rao, David Laight, linuxppc-dev
On Wed, Sep 27, 2017 at 01:38:09PM +1000, Michael Ellerman wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > A lot of userspace code uses V*X, more and more with newer CPUs and newer
> > compiler versions. If you already paid the price for using vector
> > registers you do not need to again :-)
>
> True, but you don't know if you've paid the price already.
>
> You also pay the price on every context switch (more state to switch),
> so it's not free even once enabled. Which is why the kernel will
> eventually turn it off if it's unused again.
Yup. But my point is that because user space code uses vector registers
more and more, the penalty for user space code to use vector registers
even more keeps shrinking.
> But now that I've actually looked at the glibc version, it does do some
> checks for minimum length before doing any vector instructions, so
> that's probably all we want. The exact trade off between checking some
> bytes without vector vs turning on vector depends on your input data, so
> it's tricky to tune in general.
You also need nasty code to deal with the start and end of strings, with
conditional branches and whatnot, which quickly overwhelms the benefit
of using vector registers at all. This tradeoff also changes with newer
ISA versions.
Things have to become *really* cheap before it will be good to often use
vector registers in the kernel though.
Segher
* RE: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2017-09-27 9:27 ` Segher Boessenkool
@ 2017-09-27 9:43 ` David Laight
2017-09-27 18:33 ` Simon Guo
0 siblings, 1 reply; 18+ messages in thread
From: David Laight @ 2017-09-27 9:43 UTC (permalink / raw)
To: 'Segher Boessenkool', Michael Ellerman
Cc: Cyril Bur, Simon Guo, raji, Naveen N. Rao, linuxppc-dev
From: Segher Boessenkool
> Sent: 27 September 2017 10:28
...
> You also need nasty code to deal with the start and end of strings, with
> conditional branches and whatnot, which quickly overwhelms the benefit
> of using vector registers at all. This tradeoff also changes with newer
> ISA versions.
The goal posts keep moving.
For instance, with modern Intel x86 CPUs, 'rep movsb' is by far the fastest
way to copy data (from cached memory).
> Things have to become *really* cheap before it will be good to often use
> vector registers in the kernel though.
I've had thoughts about this in the past.
If the vector registers belong to the current process then you might
get away with just saving the ones you want to use.
If they belong to a different process then you also need to tell the
FPU save code where you've saved the registers.
Then the IPI code can recover all the correct values.
On x86 all the AVX registers are caller-saved, so the system call
entry could issue the instruction that invalidates them all.
Kernel code running in the context of a user process could then
use the registers without saving them.
It would only need to set a mark to ensure they are invalidated
again on return to user (might be cheap enough to do anyway).
Dunno about PPC though.
David
* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2017-09-27 3:38 ` Michael Ellerman
2017-09-27 9:27 ` Segher Boessenkool
@ 2017-09-27 16:22 ` Simon Guo
1 sibling, 0 replies; 18+ messages in thread
From: Simon Guo @ 2017-09-27 16:22 UTC (permalink / raw)
To: Michael Ellerman
Cc: Segher Boessenkool, Cyril Bur, raji, Naveen N. Rao,
David Laight, linuxppc-dev
Hi Michael,
On Wed, Sep 27, 2017 at 01:38:09PM +1000, Michael Ellerman wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
>
> > On Tue, Sep 26, 2017 at 03:34:36PM +1000, Michael Ellerman wrote:
> >> Cyril Bur <cyrilbur@gmail.com> writes:
> >> > This was written for userspace which doesn't have to explicitly enable
> >> > VMX in order to use it - we need to be smarter in the kernel.
> >>
> >> Well the kernel has to do it for them after a trap, which is actually
> >> even more expensive, so arguably the glibc code should be smarter too
> >> and the threshold before using VMX should probably be higher than in the
> >> kernel (to cover the cost of the trap).
> >
> > A lot of userspace code uses V*X, more and more with newer CPUs and newer
> > compiler versions. If you already paid the price for using vector
> > registers you do not need to again :-)
>
> True, but you don't know if you've paid the price already.
>
> You also pay the price on every context switch (more state to switch),
> so it's not free even once enabled. Which is why the kernel will
> eventually turn it off if it's unused again.
>
> But now that I've actually looked at the glibc version, it does do some
> checks for minimum length before doing any vector instructions, so
It looks like the glibc version will use a VSX instruction, and hence cause
a trap, even for a 9-byte compare with src/dst 16-byte aligned.
132 /* Now both rSTR1 and rSTR2 are aligned to QW. */
133 .align 4
134 L(qw_align):
135 vspltisb v0, 0
136 srdi. r6, rN, 6
137 li r8, 16
138 li r10, 32
139 li r11, 48
140 ble cr0, L(lessthan64)
141 mtctr r6
142 vspltisb v8, 0
143 vspltisb v6, 0
144 /* Aligned vector loop. */
145 .align 4
Line 135 is the VSX instruction causing the trap. Did I miss anything?
> that's probably all we want. The exact trade off between checking some
> bytes without vector vs turning on vector depends on your input data, so
> it's tricky to tune in general.
I discussed this offline with Cyril. The plan for v3 is to use >= 4KB as the
minimum length before the vector registers step in. Cyril will consolidate
his existing work on KSM optimization later, which will probably add a
64-byte comparison ahead of time to determine whether it is an early- or
late-matching pattern.
Cyril also has some other valuable comments and I will rework the series for v3.
Is that OK with you?
Thanks,
- Simon
* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2017-09-27 9:43 ` David Laight
@ 2017-09-27 18:33 ` Simon Guo
2017-09-28 9:24 ` David Laight
0 siblings, 1 reply; 18+ messages in thread
From: Simon Guo @ 2017-09-27 18:33 UTC (permalink / raw)
To: David Laight
Cc: 'Segher Boessenkool',
Michael Ellerman, Cyril Bur, raji, Naveen N. Rao, linuxppc-dev
On Wed, Sep 27, 2017 at 09:43:44AM +0000, David Laight wrote:
> From: Segher Boessenkool
> > Sent: 27 September 2017 10:28
> ...
> > You also need nasty code to deal with the start and end of strings, with
> > conditional branches and whatnot, which quickly overwhelms the benefit
> > of using vector registers at all. This tradeoff also changes with newer
> > ISA versions.
>
> The goal posts keep moving.
> For instance with modern intel x86 cpus 'rep movsb' is by far the fastest
> way to copy data (from cached memory).
>
> > Things have to become *really* cheap before it will be good to often use
> > vector registers in the kernel though.
>
> I've had thoughts about this in the past.
> If the vector registers belong to the current process then you might
> get away with just saving the ones you want to use.
> If they belong to a different process then you also need to tell the
> FPU save code where you've saved the registers.
> Then the IPI code can recover all the correct values.
>
> On X86 all the AVX registers are caller saved, the system call
> entry could issue the instruction that invalidates them all.
> Kernel code running in the context of a user process could then
> use the registers without saving them.
> It would only need to set a mark to ensure they are invalidated
> again on return to user (might be cheap enough to do anyway).
> Dunno about PPC though.
I am not aware of any ppc instruction which can set a "mark" or provide
a fine-grained validity flag for a single vector register or a subgroup of them.
But ppc experts may want to correct me.
Thanks,
- Simon
* RE: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2017-09-27 18:33 ` Simon Guo
@ 2017-09-28 9:24 ` David Laight
0 siblings, 0 replies; 18+ messages in thread
From: David Laight @ 2017-09-28 9:24 UTC (permalink / raw)
To: 'Simon Guo'
Cc: 'Segher Boessenkool',
Michael Ellerman, Cyril Bur, raji, Naveen N. Rao, linuxppc-dev
From: Simon Guo
> Sent: 27 September 2017 19:34
...
> > On X86 all the AVX registers are caller saved, the system call
> > entry could issue the instruction that invalidates them all.
> > Kernel code running in the context of a user process could then
> > use the registers without saving them.
> > It would only need to set a mark to ensure they are invalidated
> > again on return to user (might be cheap enough to do anyway).
> > Dunno about PPC though.
>
> I am not aware of any ppc instruction which can set a "mark" or provide
> any high granularity flag against single or subgroup of vec regs' validity.
> But ppc experts may want to correct me.
I was just thinking of a software flag.
David