* [PATCH v6 0/4] powerpc/64: memcmp() optimization
@ 2018-05-25 4:07 wei.guo.simon
2018-05-25 4:07 ` [PATCH v6 1/4] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
` (3 more replies)
0 siblings, 4 replies; 17+ messages in thread
From: wei.guo.simon @ 2018-05-25 4:07 UTC (permalink / raw)
To: linuxppc-dev
Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Cyril Bur, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
There is some room to optimize the powerpc 64-bit memcmp() for the
following 2 cases:
(1) Even when the src/dst addresses are not 8-byte aligned at the beginning,
memcmp() can align them and use the .Llong comparison mode instead of
falling back to the .Lshort comparison mode, which compares the buffer
byte by byte.
(2) VMX instructions can be used to speed up large-size comparisons;
currently the threshold is set at 4K bytes. Note that the VMX instructions
incur a VMX register save/restore penalty. This patch set includes a
patch that adds a 32-byte pre-check to minimize that penalty.
This is similar to glibc commit dec4a7105e ("powerpc: Improve memcmp
performance for POWER8"). Thanks to Cyril Bur for the information.
This patch set also updates the memcmp selftest so that it compiles and
covers large-size comparison cases.
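For illustration, below is a rough C sketch of how the new code chooses a
path. The real implementation is PPC64 assembly in
arch/powerpc/lib/memcmp_64.S; the enum and function names here are made up
for this sketch only:
------
#include <stddef.h>
#include <stdint.h>

enum path { SHORT_BYTE_LOOP, SAMEOFFSET_8B_LOOP, SAMEOFFSET_VMX,
	    DIFFOFFSET_8B_LOOP, DIFFOFFSET_VMX };

/* n == 0 returns immediately in the real code; omitted here */
static enum path memcmp_path(const void *s1, const void *s2, size_t n)
{
	int same_offset = !(((uintptr_t)s1 ^ (uintptr_t)s2) & 7);

	if (n < 8)
		return SHORT_BYTE_LOOP;			/* .Lshort */
	if (same_offset)
		return (n >= 4096) ? SAMEOFFSET_VMX	/* VMX_THRESH */
				   : SAMEOFFSET_8B_LOOP; /* .Llong */
	return (n >= 4096) ? DIFFOFFSET_VMX : DIFFOFFSET_8B_LOOP;
}
------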
v5 -> v6:
- correct some comments/commit messages.
- rename VMX_OPS_THRES to VMX_THRESH
v4 -> v5:
- Expand the 32-byte pre-check to the src/dst different-offset case, and
remove the KSM-specific label/comment.
v3 -> v4:
- Add 32 bytes pre-checking before using VMX instructions.
v2 -> v3:
- add optimization for src/dst with different offsets against the 8-byte
boundary.
- renamed some label names.
- reworked some code per comments from Cyril Bur, such as filling the
pipeline, and using VMX when size == 4K.
- fixed an enter/exit_vmx_ops pairing bug, and revised the test
case to check whether enter/exit_vmx_ops calls are paired.
v1 -> v2:
- update the 8-byte unaligned comparison method.
- fix a VMX comparison bug.
- enhanced the original memcmp() selftest.
- add powerpc/64 to subject/commit message.
Simon Guo (4):
powerpc/64: Align bytes before fall back to .Lshort in powerpc64
memcmp()
powerpc/64: enhance memcmp() with VMX instruction for long bytes
comparision
powerpc/64: add 32 bytes prechecking before using VMX optimization on
memcmp()
powerpc:selftest update memcmp_64 selftest for VMX implementation
arch/powerpc/include/asm/asm-prototypes.h | 4 +-
arch/powerpc/lib/copypage_power7.S | 4 +-
arch/powerpc/lib/memcmp_64.S | 408 ++++++++++++++++++++-
arch/powerpc/lib/memcpy_power7.S | 6 +-
arch/powerpc/lib/vmx-helper.c | 4 +-
.../selftests/powerpc/copyloops/asm/ppc_asm.h | 4 +-
.../selftests/powerpc/stringloops/asm/ppc_asm.h | 22 ++
.../testing/selftests/powerpc/stringloops/memcmp.c | 98 +++--
8 files changed, 510 insertions(+), 40 deletions(-)
--
1.8.3.1
* [PATCH v6 1/4] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()
2018-05-25 4:07 [PATCH v6 0/4] powerpc/64: memcmp() optimization wei.guo.simon
@ 2018-05-25 4:07 ` wei.guo.simon
2018-05-28 10:35 ` Segher Boessenkool
2018-05-25 4:07 ` [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
` (2 subsequent siblings)
3 siblings, 1 reply; 17+ messages in thread
From: wei.guo.simon @ 2018-05-25 4:07 UTC (permalink / raw)
To: linuxppc-dev
Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Cyril Bur, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
Currently the powerpc 64-bit memcmp() falls back to .Lshort (compare per
byte mode) if either the src or dst address is not 8-byte aligned.
It can be optimized in 2 situations:
1) if both addresses have the same offset from the 8-byte boundary:
memcmp() can compare the unaligned leading bytes within the 8-byte boundary
first and then compare the rest of the 8-byte-aligned content in .Llong mode.
2) If the src/dst addresses do not have the same offset from the 8-byte
boundary: memcmp() can align the src addr to 8 bytes, adjust the dst addr
accordingly, then load src with aligned loads and dst with unaligned loads.
This patch optimizes memcmp() behavior in the above 2 situations.
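As an illustration of situation 1), the head handling can be modeled in C
roughly as below. This is only a sketch (load_be64_aligned/head_cmp are
made-up names); the assembly uses ldbrx on little endian so the doubleword
compare already has memcmp ordering, which the big-endian-order load mimics:
------
#include <stddef.h>
#include <stdint.h>

/* Load 8 bytes in big-endian (memcmp) order from an 8-byte-aligned address. */
static uint64_t load_be64_aligned(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Compare the (8 - off) unaligned leading bytes when s1/s2 share the same
 * offset 'off' within an 8-byte word.  Reading from the aligned base stays
 * within the same aligned doubleword, so it is safe.  The real code only
 * takes this path when off != 0. */
static int head_cmp(const unsigned char *s1, const unsigned char *s2)
{
	unsigned int off = (uintptr_t)s1 & 7;	/* equals (uintptr_t)s2 & 7 */
	uint64_t a = load_be64_aligned(s1 - off) << (8 * off);
	uint64_t b = load_be64_aligned(s2 - off) << (8 * off);

	return (a > b) - (a < b);
}
------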
Tested with both little and big endian. The performance results below are
based on little endian.
The following is the test result for the case where src/dst have the same
offset (a similar result was observed when src/dst have different offsets):
(1) 256 bytes
Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
- without patch
29.773018302 seconds time elapsed ( +- 0.09% )
- with patch
16.485568173 seconds time elapsed ( +- 0.02% )
-> There is ~+80% improvement
(2) 32 bytes
To observe the performance impact on < 32 bytes, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c as follows:
-------
#include <string.h>
#include "utils.h"
-#define SIZE 256
+#define SIZE 32
#define ITERATIONS 10000
int test_memcmp(const void *s1, const void *s2, size_t n);
--------
- Without patch
0.244746482 seconds time elapsed ( +- 0.36%)
- with patch
0.215069477 seconds time elapsed ( +- 0.51%)
-> There is ~+13% improvement
(3) 0~8 bytes
To observe the < 8 bytes performance impact, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c as follows:
-------
#include <string.h>
#include "utils.h"
-#define SIZE 256
-#define ITERATIONS 10000
+#define SIZE 8
+#define ITERATIONS 1000000
int test_memcmp(const void *s1, const void *s2, size_t n);
-------
- Without patch
1.845642503 seconds time elapsed ( +- 0.12% )
- With patch
1.849767135 seconds time elapsed ( +- 0.26% )
-> They are nearly the same. (-0.2%)
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
arch/powerpc/lib/memcmp_64.S | 143 ++++++++++++++++++++++++++++++++++++++++---
1 file changed, 136 insertions(+), 7 deletions(-)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index d75d18b..f20e883 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -24,28 +24,41 @@
#define rH r31
#ifdef __LITTLE_ENDIAN__
+#define LH lhbrx
+#define LW lwbrx
#define LD ldbrx
#else
+#define LH lhzx
+#define LW lwzx
#define LD ldx
#endif
+/*
+ * There are 2 categories for memcmp:
+ * 1) src/dst has the same offset to the 8 bytes boundary. The handlers
+ * are named like .Lsameoffset_xxxx
+ * 2) src/dst has different offset to the 8 bytes boundary. The handlers
+ * are named like .Ldiffoffset_xxxx
+ */
_GLOBAL(memcmp)
cmpdi cr1,r5,0
- /* Use the short loop if both strings are not 8B aligned */
- or r6,r3,r4
+ /* Use the short loop if the src/dst addresses are not
+ * with the same offset of 8 bytes align boundary.
+ */
+ xor r6,r3,r4
andi. r6,r6,7
- /* Use the short loop if length is less than 32B */
- cmpdi cr6,r5,31
+ /* Fall back to short loop if compare at aligned addrs
+ * with less than 8 bytes.
+ */
+ cmpdi cr6,r5,7
beq cr1,.Lzero
- bne .Lshort
- bgt cr6,.Llong
+ bgt cr6,.Lno_short
.Lshort:
mtctr r5
-
1: lbz rA,0(r3)
lbz rB,0(r4)
subf. rC,rB,rA
@@ -78,11 +91,90 @@ _GLOBAL(memcmp)
li r3,0
blr
+.Lno_short:
+ dcbt 0,r3
+ dcbt 0,r4
+ bne .Ldiffoffset_8bytes_make_align_start
+
+
+.Lsameoffset_8bytes_make_align_start:
+ /* attempt to compare bytes not aligned with 8 bytes so that
+ * rest comparison can run based on 8 bytes alignment.
+ */
+ andi. r6,r3,7
+
+ /* Try to compare the first double word which is not 8 bytes aligned:
+ * load the first double word at (src & ~7UL) and shift left appropriate
+ * bits before comparision.
+ */
+ clrlwi r6,r3,29
+ rlwinm r6,r6,3,0,28
+ beq .Lsameoffset_8bytes_aligned
+ clrrdi r3,r3,3
+ clrrdi r4,r4,3
+ LD rA,0,r3
+ LD rB,0,r4
+ sld rA,rA,r6
+ sld rB,rB,r6
+ cmpld cr0,rA,rB
+ srwi r6,r6,3
+ bne cr0,.LcmpAB_lightweight
+ subfic r6,r6,8
+ subfc. r5,r6,r5
+ addi r3,r3,8
+ addi r4,r4,8
+ beq .Lzero
+
+.Lsameoffset_8bytes_aligned:
+ /* now we are aligned with 8 bytes.
+ * Use .Llong loop if left cmp bytes are equal or greater than 32B.
+ */
+ cmpdi cr6,r5,31
+ bgt cr6,.Llong
+
+.Lcmp_lt32bytes:
+ /* compare 1 ~ 32 bytes, at least r3 addr is 8 bytes aligned now */
+ cmpdi cr5,r5,7
+ srdi r0,r5,3
+ ble cr5,.Lcmp_rest_lt8bytes
+
+ /* handle 8 ~ 31 bytes */
+ clrldi r5,r5,61
+ mtctr r0
+2:
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ addi r3,r3,8
+ addi r4,r4,8
+ bne cr0,.LcmpAB_lightweight
+ bdnz 2b
+
+ cmpwi r5,0
+ beq .Lzero
+
+.Lcmp_rest_lt8bytes:
+ /* Here we have less than 8 bytes to compare. At least the s1
+ * address is aligned with 8 bytes.
+ * The next double words are loaded and shifted right by the
+ * appropriate number of bits.
+ */
+ subfic r6,r5,8
+ rlwinm r6,r6,3,0,28
+ LD rA,0,r3
+ LD rB,0,r4
+ srd rA,rA,r6
+ srd rB,rB,r6
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ b .Lzero
+
.Lnon_zero:
mr r3,rC
blr
.Llong:
+ /* At least s1 addr is aligned with 8 bytes */
li off8,8
li off16,16
li off24,24
@@ -232,4 +324,41 @@ _GLOBAL(memcmp)
ld r28,-32(r1)
ld r27,-40(r1)
blr
+
+.LcmpAB_lightweight: /* skip NV GPRS restore */
+ li r3,1
+ bgt cr0,8f
+ li r3,-1
+8:
+ blr
+
+.Ldiffoffset_8bytes_make_align_start:
+ /* now try to align s1 with 8 bytes */
+ andi. r6,r3,0x7
+ rlwinm r6,r6,3,0,28
+ beq .Ldiffoffset_align_s1_8bytes
+
+ clrrdi r3,r3,3
+ LD rA,0,r3
+ LD rB,0,r4 /* unaligned load */
+ sld rA,rA,r6
+ srd rA,rA,r6
+ srd rB,rB,r6
+ cmpld cr0,rA,rB
+ srwi r6,r6,3
+ bne cr0,.LcmpAB_lightweight
+
+ subfic r6,r6,8
+ subfc. r5,r6,r5
+ addi r3,r3,8
+ add r4,r4,r6
+
+ beq .Lzero
+
+.Ldiffoffset_align_s1_8bytes:
+ /* now s1 is aligned with 8 bytes. */
+ cmpdi cr5,r5,31
+ ble cr5,.Lcmp_lt32bytes
+ b .Llong
+
EXPORT_SYMBOL(memcmp)
--
1.8.3.1
* [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2018-05-25 4:07 [PATCH v6 0/4] powerpc/64: memcmp() optimization wei.guo.simon
2018-05-25 4:07 ` [PATCH v6 1/4] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
@ 2018-05-25 4:07 ` wei.guo.simon
2018-05-28 11:05 ` Segher Boessenkool
2018-05-28 11:59 ` Michael Ellerman
2018-05-25 4:07 ` [PATCH v6 3/4] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp() wei.guo.simon
2018-05-25 4:07 ` [PATCH v6 4/4] powerpc:selftest update memcmp_64 selftest for VMX implementation wei.guo.simon
3 siblings, 2 replies; 17+ messages in thread
From: wei.guo.simon @ 2018-05-25 4:07 UTC (permalink / raw)
To: linuxppc-dev
Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Cyril Bur, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
This patch adds VMX primitives to do memcmp() when the compare size
is equal to or greater than 4K bytes. The KSM feature can benefit from this.
Test result with the following test program (replace the "^>" with ""):
------
># cat tools/testing/selftests/powerpc/stringloops/memcmp.c
>#include <malloc.h>
>#include <stdlib.h>
>#include <string.h>
>#include <time.h>
>#include "utils.h"
>#define SIZE (1024 * 1024 * 900)
>#define ITERATIONS 40
int test_memcmp(const void *s1, const void *s2, size_t n);
static int testcase(void)
{
char *s1;
char *s2;
unsigned long i;
s1 = memalign(128, SIZE);
if (!s1) {
perror("memalign");
exit(1);
}
s2 = memalign(128, SIZE);
if (!s2) {
perror("memalign");
exit(1);
}
for (i = 0; i < SIZE; i++) {
s1[i] = i & 0xff;
s2[i] = i & 0xff;
}
for (i = 0; i < ITERATIONS; i++) {
int ret = test_memcmp(s1, s2, SIZE);
if (ret) {
printf("return %d at[%ld]! should have returned zero\n", ret, i);
abort();
}
}
return 0;
}
int main(void)
{
return test_harness(testcase, "memcmp");
}
------
Without this patch (but with the first patch "powerpc/64: Align bytes
before fall back to .Lshort in powerpc64 memcmp()." in the series):
4.726728762 seconds time elapsed ( +- 3.54%)
With VMX patch:
4.234335473 seconds time elapsed ( +- 2.63%)
There is ~+10% improvement.
Testing with an unaligned and different-offset version (shifting s1 and s2
by a random offset within 16 bytes) can achieve a higher improvement than 10%.
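The key trick in the different-offset VMX path is reading 16 bytes from an
unaligned address with two aligned vector loads plus a permute (the
LD_VSR_CROSS16B macro added below). A rough C model of that idea, for
illustration only (load_cross16 is a made-up name):
------
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fetch 16 bytes starting at an unaligned address using two
 * 16-byte-aligned loads and a byte select, which is what
 * lvx + lvx + vperm do in the assembly. */
static void load_cross16(const unsigned char *p, unsigned char out[16])
{
	const unsigned char *base =
		(const unsigned char *)((uintptr_t)p & ~(uintptr_t)15);
	size_t off = (uintptr_t)p & 15;
	unsigned char a[16], b[16];
	size_t i;

	memcpy(a, base, 16);		/* 1st aligned quadword */
	memcpy(b, base + 16, 16);	/* 2nd aligned quadword */
	for (i = 0; i < 16; i++)
		out[i] = (off + i < 16) ? a[off + i] : b[off + i - 16];
}
------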
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
arch/powerpc/include/asm/asm-prototypes.h | 4 +-
arch/powerpc/lib/copypage_power7.S | 4 +-
arch/powerpc/lib/memcmp_64.S | 233 +++++++++++++++++++++++++++++-
arch/powerpc/lib/memcpy_power7.S | 6 +-
arch/powerpc/lib/vmx-helper.c | 4 +-
5 files changed, 241 insertions(+), 10 deletions(-)
diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index d9713ad..31fdcee 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -49,8 +49,8 @@ void __trace_hcall_exit(long opcode, unsigned long retval,
/* VMX copying */
int enter_vmx_usercopy(void);
int exit_vmx_usercopy(void);
-int enter_vmx_copy(void);
-void * exit_vmx_copy(void *dest);
+int enter_vmx_ops(void);
+void *exit_vmx_ops(void *dest);
/* Traps */
long machine_check_early(struct pt_regs *regs);
diff --git a/arch/powerpc/lib/copypage_power7.S b/arch/powerpc/lib/copypage_power7.S
index 8fa73b7..e38f956 100644
--- a/arch/powerpc/lib/copypage_power7.S
+++ b/arch/powerpc/lib/copypage_power7.S
@@ -57,7 +57,7 @@ _GLOBAL(copypage_power7)
std r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
std r0,16(r1)
stdu r1,-STACKFRAMESIZE(r1)
- bl enter_vmx_copy
+ bl enter_vmx_ops
cmpwi r3,0
ld r0,STACKFRAMESIZE+16(r1)
ld r3,STK_REG(R31)(r1)
@@ -100,7 +100,7 @@ _GLOBAL(copypage_power7)
addi r3,r3,128
bdnz 1b
- b exit_vmx_copy /* tail call optimise */
+ b exit_vmx_ops /* tail call optimise */
#else
li r0,(PAGE_SIZE/128)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index f20e883..4ba7bb6 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -27,12 +27,73 @@
#define LH lhbrx
#define LW lwbrx
#define LD ldbrx
+#define LVS lvsr
+#define VPERM(_VRT,_VRA,_VRB,_VRC) \
+ vperm _VRT,_VRB,_VRA,_VRC
#else
#define LH lhzx
#define LW lwzx
#define LD ldx
+#define LVS lvsl
+#define VPERM(_VRT,_VRA,_VRB,_VRC) \
+ vperm _VRT,_VRA,_VRB,_VRC
#endif
+#define VMX_THRESH 4096
+#define ENTER_VMX_OPS \
+ mflr r0; \
+ std r3,-STACKFRAMESIZE+STK_REG(R31)(r1); \
+ std r4,-STACKFRAMESIZE+STK_REG(R30)(r1); \
+ std r5,-STACKFRAMESIZE+STK_REG(R29)(r1); \
+ std r0,16(r1); \
+ stdu r1,-STACKFRAMESIZE(r1); \
+ bl enter_vmx_ops; \
+ cmpwi cr1,r3,0; \
+ ld r0,STACKFRAMESIZE+16(r1); \
+ ld r3,STK_REG(R31)(r1); \
+ ld r4,STK_REG(R30)(r1); \
+ ld r5,STK_REG(R29)(r1); \
+ addi r1,r1,STACKFRAMESIZE; \
+ mtlr r0
+
+#define EXIT_VMX_OPS \
+ mflr r0; \
+ std r3,-STACKFRAMESIZE+STK_REG(R31)(r1); \
+ std r4,-STACKFRAMESIZE+STK_REG(R30)(r1); \
+ std r5,-STACKFRAMESIZE+STK_REG(R29)(r1); \
+ std r0,16(r1); \
+ stdu r1,-STACKFRAMESIZE(r1); \
+ bl exit_vmx_ops; \
+ ld r0,STACKFRAMESIZE+16(r1); \
+ ld r3,STK_REG(R31)(r1); \
+ ld r4,STK_REG(R30)(r1); \
+ ld r5,STK_REG(R29)(r1); \
+ addi r1,r1,STACKFRAMESIZE; \
+ mtlr r0
+
+/*
+ * LD_VSR_CROSS16B load the 2nd 16 bytes for _vaddr which is unaligned with
+ * 16 bytes boundary and permute the result with the 1st 16 bytes.
+
+ * | y y y y y y y y y y y y y 0 1 2 | 3 4 5 6 7 8 9 a b c d e f z z z |
+ * ^ ^ ^
+ * 0xbbbb10 0xbbbb20 0xbbb30
+ * ^
+ * _vaddr
+ *
+ *
+ * _vmask is the mask generated by LVS
+ * _v1st_qw is the 1st aligned QW of current addr which is already loaded.
+ * for example: 0xyyyyyyyyyyyyy012 for big endian
+ * _v2nd_qw is the 2nd aligned QW of cur _vaddr to be loaded.
+ * for example: 0x3456789abcdefzzz for big endian
+ * The permute result is saved in _v_res.
+ * for example: 0x0123456789abcdef for big endian.
+ */
+#define LD_VSR_CROSS16B(_vaddr,_vmask,_v1st_qw,_v2nd_qw,_v_res) \
+ lvx _v2nd_qw,_vaddr,off16; \
+ VPERM(_v_res,_v1st_qw,_v2nd_qw,_vmask)
+
/*
* There are 2 categories for memcmp:
* 1) src/dst has the same offset to the 8 bytes boundary. The handlers
@@ -133,7 +194,7 @@ _GLOBAL(memcmp)
bgt cr6,.Llong
.Lcmp_lt32bytes:
- /* compare 1 ~ 32 bytes, at least r3 addr is 8 bytes aligned now */
+ /* compare 1 ~ 31 bytes, at least r3 addr is 8 bytes aligned now */
cmpdi cr5,r5,7
srdi r0,r5,3
ble cr5,.Lcmp_rest_lt8bytes
@@ -174,6 +235,13 @@ _GLOBAL(memcmp)
blr
.Llong:
+#ifdef CONFIG_ALTIVEC
+ /* Try to use vmx loop if length is equal or greater than 4K */
+ cmpldi cr6,r5,VMX_THRESH
+ bge cr6,.Lsameoffset_vmx_cmp
+
+.Llong_novmx_cmp:
+#endif
/* At least s1 addr is aligned with 8 bytes */
li off8,8
li off16,16
@@ -332,7 +400,94 @@ _GLOBAL(memcmp)
8:
blr
+#ifdef CONFIG_ALTIVEC
+.Lsameoffset_vmx_cmp:
+ /* Enter with src/dst addrs has the same offset with 8 bytes
+ * align boundary
+ */
+ ENTER_VMX_OPS
+ beq cr1,.Llong_novmx_cmp
+
+3:
+ /* need to check whether r4 has the same offset with r3
+ * for 16 bytes boundary.
+ */
+ xor r0,r3,r4
+ andi. r0,r0,0xf
+ bne .Ldiffoffset_vmx_cmp_start
+
+ /* len is no less than 4KB. Need to align with 16 bytes further.
+ */
+ andi. rA,r3,8
+ LD rA,0,r3
+ beq 4f
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ addi r3,r3,8
+ addi r4,r4,8
+ addi r5,r5,-8
+
+ beq cr0,4f
+ /* save and restore cr0 */
+ mfocrf r5,64
+ EXIT_VMX_OPS
+ mtocrf 64,r5
+ b .LcmpAB_lightweight
+
+4:
+ /* compare 32 bytes for each loop */
+ srdi r0,r5,5
+ mtctr r0
+ clrldi r5,r5,59
+ li off16,16
+
+.balign 16
+5:
+ lvx v0,0,r3
+ lvx v1,0,r4
+ vcmpequd. v0,v0,v1
+ bf 24,7f
+ lvx v0,off16,r3
+ lvx v1,off16,r4
+ vcmpequd. v0,v0,v1
+ bf 24,6f
+ addi r3,r3,32
+ addi r4,r4,32
+ bdnz 5b
+
+ EXIT_VMX_OPS
+ cmpdi r5,0
+ beq .Lzero
+ b .Lcmp_lt32bytes
+
+6:
+ addi r3,r3,16
+ addi r4,r4,16
+
+7:
+ /* diff the last 16 bytes */
+ EXIT_VMX_OPS
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ li off8,8
+ bne cr0,.LcmpAB_lightweight
+
+ LD rA,off8,r3
+ LD rB,off8,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ b .Lzero
+#endif
+
.Ldiffoffset_8bytes_make_align_start:
+#ifdef CONFIG_ALTIVEC
+ /* only do vmx ops when the size equal or greater than 4K bytes */
+ cmpdi cr5,r5,VMX_THRESH
+ bge cr5,.Ldiffoffset_vmx_cmp
+.Ldiffoffset_novmx_cmp:
+#endif
+
/* now try to align s1 with 8 bytes */
andi. r6,r3,0x7
rlwinm r6,r6,3,0,28
@@ -359,6 +514,82 @@ _GLOBAL(memcmp)
/* now s1 is aligned with 8 bytes. */
cmpdi cr5,r5,31
ble cr5,.Lcmp_lt32bytes
+
+#ifdef CONFIG_ALTIVEC
+ b .Llong_novmx_cmp
+#else
b .Llong
+#endif
+
+#ifdef CONFIG_ALTIVEC
+.Ldiffoffset_vmx_cmp:
+ ENTER_VMX_OPS
+ beq cr1,.Ldiffoffset_novmx_cmp
+
+.Ldiffoffset_vmx_cmp_start:
+ /* Firstly try to align r3 with 16 bytes */
+ andi. r6,r3,0xf
+ li off16,16
+ beq .Ldiffoffset_vmx_s1_16bytes_align
+ LVS v3,0,r3
+ LVS v4,0,r4
+
+ lvx v5,0,r3
+ lvx v6,0,r4
+ LD_VSR_CROSS16B(r3,v3,v5,v7,v9)
+ LD_VSR_CROSS16B(r4,v4,v6,v8,v10)
+
+ vcmpequb. v7,v9,v10
+ bnl cr6,.Ldiffoffset_vmx_diff_found
+
+ subfic r6,r6,16
+ subf r5,r6,r5
+ add r3,r3,r6
+ add r4,r4,r6
+
+.Ldiffoffset_vmx_s1_16bytes_align:
+ /* now s1 is aligned with 16 bytes */
+ lvx v6,0,r4
+ LVS v4,0,r4
+ srdi r6,r5,5 /* loop for 32 bytes each */
+ clrldi r5,r5,59
+ mtctr r6
+
+.balign 16
+.Ldiffoffset_vmx_32bytesloop:
+ /* the first qw of r4 was saved in v6 */
+ lvx v9,0,r3
+ LD_VSR_CROSS16B(r4,v4,v6,v8,v10)
+ vcmpequb. v7,v9,v10
+ vor v6,v8,v8
+ bnl cr6,.Ldiffoffset_vmx_diff_found
+
+ addi r3,r3,16
+ addi r4,r4,16
+
+ lvx v9,0,r3
+ LD_VSR_CROSS16B(r4,v4,v6,v8,v10)
+ vcmpequb. v7,v9,v10
+ vor v6,v8,v8
+ bnl cr6,.Ldiffoffset_vmx_diff_found
+
+ addi r3,r3,16
+ addi r4,r4,16
+
+ bdnz .Ldiffoffset_vmx_32bytesloop
+
+ EXIT_VMX_OPS
+
+ cmpdi r5,0
+ beq .Lzero
+ b .Lcmp_lt32bytes
+
+.Ldiffoffset_vmx_diff_found:
+ EXIT_VMX_OPS
+ /* anyway, the diff will appear in next 16 bytes */
+ li r5,16
+ b .Lcmp_lt32bytes
+
+#endif
EXPORT_SYMBOL(memcmp)
diff --git a/arch/powerpc/lib/memcpy_power7.S b/arch/powerpc/lib/memcpy_power7.S
index df7de9d..070cdf6 100644
--- a/arch/powerpc/lib/memcpy_power7.S
+++ b/arch/powerpc/lib/memcpy_power7.S
@@ -230,7 +230,7 @@ _GLOBAL(memcpy_power7)
std r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
std r0,16(r1)
stdu r1,-STACKFRAMESIZE(r1)
- bl enter_vmx_copy
+ bl enter_vmx_ops
cmpwi cr1,r3,0
ld r0,STACKFRAMESIZE+16(r1)
ld r3,STK_REG(R31)(r1)
@@ -445,7 +445,7 @@ _GLOBAL(memcpy_power7)
15: addi r1,r1,STACKFRAMESIZE
ld r3,-STACKFRAMESIZE+STK_REG(R31)(r1)
- b exit_vmx_copy /* tail call optimise */
+ b exit_vmx_ops /* tail call optimise */
.Lvmx_unaligned_copy:
/* Get the destination 16B aligned */
@@ -649,5 +649,5 @@ _GLOBAL(memcpy_power7)
15: addi r1,r1,STACKFRAMESIZE
ld r3,-STACKFRAMESIZE+STK_REG(R31)(r1)
- b exit_vmx_copy /* tail call optimise */
+ b exit_vmx_ops /* tail call optimise */
#endif /* CONFIG_ALTIVEC */
diff --git a/arch/powerpc/lib/vmx-helper.c b/arch/powerpc/lib/vmx-helper.c
index bf925cd..9f34049 100644
--- a/arch/powerpc/lib/vmx-helper.c
+++ b/arch/powerpc/lib/vmx-helper.c
@@ -53,7 +53,7 @@ int exit_vmx_usercopy(void)
return 0;
}
-int enter_vmx_copy(void)
+int enter_vmx_ops(void)
{
if (in_interrupt())
return 0;
@@ -70,7 +70,7 @@ int enter_vmx_copy(void)
* passed a pointer to the destination which we return as required by a
* memcpy implementation.
*/
-void *exit_vmx_copy(void *dest)
+void *exit_vmx_ops(void *dest)
{
disable_kernel_altivec();
preempt_enable();
--
1.8.3.1
* [PATCH v6 3/4] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp()
2018-05-25 4:07 [PATCH v6 0/4] powerpc/64: memcmp() optimization wei.guo.simon
2018-05-25 4:07 ` [PATCH v6 1/4] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
2018-05-25 4:07 ` [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
@ 2018-05-25 4:07 ` wei.guo.simon
2018-05-25 4:07 ` [PATCH v6 4/4] powerpc:selftest update memcmp_64 selftest for VMX implementation wei.guo.simon
3 siblings, 0 replies; 17+ messages in thread
From: wei.guo.simon @ 2018-05-25 4:07 UTC (permalink / raw)
To: linuxppc-dev
Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Cyril Bur, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
This patch is based on the previous VMX patch on memcmp().
To optimize ppc64 memcmp() with VMX instructions, we need to think about
the VMX penalty that comes with them: if the kernel uses VMX instructions,
it needs to save/restore the current thread's VMX registers. There are
32 x 128-bit VMX registers in PPC, which means 32 x 16 = 512 bytes to load
and store.
The major concern regarding memcmp() performance in the kernel is KSM,
which uses memcmp() frequently to merge identical pages. So it makes sense
to take some measures/enhancements around KSM to see whether any
improvement can be done here. Cyril Bur indicated in the following mail
that memcmp() for KSM has a high probability of failing (mismatching)
early, within the first bytes:
https://patchwork.ozlabs.org/patch/817322/#1773629
This patch is a follow-up on that.
Per some testing, KSM memcmp() tends to fail early within the first 32
bytes. More specifically:
- 76% of cases fail/mismatch before 16 bytes;
- 83% of cases fail/mismatch before 32 bytes;
- 84% of cases fail/mismatch before 64 bytes;
So 32 bytes looks like a better pre-check size than the alternatives.
The early failure also holds for memcmp() in the non-KSM case. With a
non-typical call load, ~73% of cases fail before the first 32 bytes.
This patch adds a 32-byte pre-check before jumping into VMX
operations, to avoid the unnecessary VMX penalty. It is not limited to
the KSM case. Testing shows ~20% improvement in memcmp() average
execution time with this patch.
Note that the 32B pre-check is only performed when the compare size
is long enough (>= 4K currently) to allow VMX operation.
The detailed data and analysis are at:
https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md
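For illustration, a rough C model of the 32-byte pre-check (not the kernel
implementation; memcmp_precheck_model is a made-up name and 4096 stands in
for VMX_THRESH):
------
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compare four 8-byte chunks before paying the VMX save/restore cost. */
static int memcmp_precheck_model(const unsigned char *s1,
				 const unsigned char *s2, size_t n)
{
	int i;

	if (n >= 4096) {
		for (i = 0; i < 4; i++) {
			uint64_t a, b;

			memcpy(&a, s1, 8);
			memcpy(&b, s2, 8);
			if (a != b)	/* ~80% of KSM-style calls end here */
				return memcmp(s1, s2, 8);
			s1 += 8;
			s2 += 8;
			n -= 8;
		}
		/* only now would the real code enter the VMX loop */
	}
	return memcmp(s1, s2, n);
}
------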
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
arch/powerpc/lib/memcmp_64.S | 50 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 42 insertions(+), 8 deletions(-)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 4ba7bb6..96eb08b 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -403,8 +403,27 @@ _GLOBAL(memcmp)
#ifdef CONFIG_ALTIVEC
.Lsameoffset_vmx_cmp:
/* Enter with src/dst addrs has the same offset with 8 bytes
- * align boundary
+ * align boundary.
+ *
+ * There is an optimization based on the following fact: memcmp()
+ * tends to fail early within the first 32 bytes.
+ * Before applying VMX instructions which will lead to 32x128bits
+ * VMX regs load/restore penalty, we compare the first 32 bytes
+ * so that we can catch the ~80% fail cases.
*/
+
+ li r0,4
+ mtctr r0
+.Lsameoffset_prechk_32B_loop:
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ addi r3,r3,8
+ addi r4,r4,8
+ bne cr0,.LcmpAB_lightweight
+ addi r5,r5,-8
+ bdnz .Lsameoffset_prechk_32B_loop
+
ENTER_VMX_OPS
beq cr1,.Llong_novmx_cmp
@@ -481,13 +500,6 @@ _GLOBAL(memcmp)
#endif
.Ldiffoffset_8bytes_make_align_start:
-#ifdef CONFIG_ALTIVEC
- /* only do vmx ops when the size equal or greater than 4K bytes */
- cmpdi cr5,r5,VMX_THRESH
- bge cr5,.Ldiffoffset_vmx_cmp
-.Ldiffoffset_novmx_cmp:
-#endif
-
/* now try to align s1 with 8 bytes */
andi. r6,r3,0x7
rlwinm r6,r6,3,0,28
@@ -512,6 +524,13 @@ _GLOBAL(memcmp)
.Ldiffoffset_align_s1_8bytes:
/* now s1 is aligned with 8 bytes. */
+#ifdef CONFIG_ALTIVEC
+ /* only do vmx ops when the size is equal or greater than 4K bytes */
+ cmpdi cr5,r5,VMX_THRESH
+ bge cr5,.Ldiffoffset_vmx_cmp
+.Ldiffoffset_novmx_cmp:
+#endif
+
cmpdi cr5,r5,31
ble cr5,.Lcmp_lt32bytes
@@ -523,6 +542,21 @@ _GLOBAL(memcmp)
#ifdef CONFIG_ALTIVEC
.Ldiffoffset_vmx_cmp:
+ /* perform a 32 bytes pre-checking before
+ * enable VMX operations.
+ */
+ li r0,4
+ mtctr r0
+.Ldiffoffset_prechk_32B_loop:
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ addi r3,r3,8
+ addi r4,r4,8
+ bne cr0,.LcmpAB_lightweight
+ addi r5,r5,-8
+ bdnz .Ldiffoffset_prechk_32B_loop
+
ENTER_VMX_OPS
beq cr1,.Ldiffoffset_novmx_cmp
--
1.8.3.1
* [PATCH v6 4/4] powerpc:selftest update memcmp_64 selftest for VMX implementation
2018-05-25 4:07 [PATCH v6 0/4] powerpc/64: memcmp() optimization wei.guo.simon
` (2 preceding siblings ...)
2018-05-25 4:07 ` [PATCH v6 3/4] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp() wei.guo.simon
@ 2018-05-25 4:07 ` wei.guo.simon
3 siblings, 0 replies; 17+ messages in thread
From: wei.guo.simon @ 2018-05-25 4:07 UTC (permalink / raw)
To: linuxppc-dev
Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Cyril Bur, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
This patch reworks the memcmp_64 selftest so that the memcmp selftest can
cover more test cases.
It adds test cases for:
- memcmp over 4K bytes in size.
- s1/s2 with different/random offsets from the 16-byte boundary.
- enter/exit_vmx_ops pairing.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
.../selftests/powerpc/copyloops/asm/ppc_asm.h | 4 +-
.../selftests/powerpc/stringloops/asm/ppc_asm.h | 22 +++++
.../testing/selftests/powerpc/stringloops/memcmp.c | 98 +++++++++++++++++-----
3 files changed, 100 insertions(+), 24 deletions(-)
diff --git a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
index 5ffe04d..dfce161 100644
--- a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
@@ -36,11 +36,11 @@
li r3,0
blr
-FUNC_START(enter_vmx_copy)
+FUNC_START(enter_vmx_ops)
li r3,1
blr
-FUNC_START(exit_vmx_copy)
+FUNC_START(exit_vmx_ops)
blr
FUNC_START(memcpy_power7)
diff --git a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
index 136242e..185d257 100644
--- a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
@@ -1,4 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _PPC_ASM_H
+#define _PPC_ASM_H
#include <ppc-asm.h>
#ifndef r1
@@ -6,3 +8,23 @@
#endif
#define _GLOBAL(A) FUNC_START(test_ ## A)
+
+#define CONFIG_ALTIVEC
+
+#define R14 r14
+#define R15 r15
+#define R16 r16
+#define R17 r17
+#define R18 r18
+#define R19 r19
+#define R20 r20
+#define R21 r21
+#define R22 r22
+#define R29 r29
+#define R30 r30
+#define R31 r31
+
+#define STACKFRAMESIZE 256
+#define STK_REG(i) (112 + ((i)-14)*8)
+
+#endif
diff --git a/tools/testing/selftests/powerpc/stringloops/memcmp.c b/tools/testing/selftests/powerpc/stringloops/memcmp.c
index 8250db2..b5cf717 100644
--- a/tools/testing/selftests/powerpc/stringloops/memcmp.c
+++ b/tools/testing/selftests/powerpc/stringloops/memcmp.c
@@ -2,20 +2,40 @@
#include <malloc.h>
#include <stdlib.h>
#include <string.h>
+#include <time.h>
#include "utils.h"
#define SIZE 256
#define ITERATIONS 10000
+#define LARGE_SIZE (5 * 1024)
+#define LARGE_ITERATIONS 1000
+#define LARGE_MAX_OFFSET 32
+#define LARGE_SIZE_START 4096
+
+#define MAX_OFFSET_DIFF_S1_S2 48
+
+int vmx_count;
+int enter_vmx_ops(void)
+{
+ vmx_count++;
+ return 1;
+}
+
+void exit_vmx_ops(void)
+{
+ vmx_count--;
+}
int test_memcmp(const void *s1, const void *s2, size_t n);
/* test all offsets and lengths */
-static void test_one(char *s1, char *s2)
+static void test_one(char *s1, char *s2, unsigned long max_offset,
+ unsigned long size_start, unsigned long max_size)
{
unsigned long offset, size;
- for (offset = 0; offset < SIZE; offset++) {
- for (size = 0; size < (SIZE-offset); size++) {
+ for (offset = 0; offset < max_offset; offset++) {
+ for (size = size_start; size < (max_size - offset); size++) {
int x, y;
unsigned long i;
@@ -35,70 +55,104 @@ static void test_one(char *s1, char *s2)
printf("\n");
abort();
}
+
+ if (vmx_count != 0) {
+ printf("vmx enter/exit not paired.(offset:%ld size:%ld s1:%p s2:%p vc:%d\n",
+ offset, size, s1, s2, vmx_count);
+ printf("\n");
+ abort();
+ }
}
}
}
-static int testcase(void)
+static int testcase(bool islarge)
{
char *s1;
char *s2;
unsigned long i;
- s1 = memalign(128, SIZE);
+ unsigned long comp_size = (islarge ? LARGE_SIZE : SIZE);
+ unsigned long alloc_size = comp_size + MAX_OFFSET_DIFF_S1_S2;
+ int iterations = islarge ? LARGE_ITERATIONS : ITERATIONS;
+
+ s1 = memalign(128, alloc_size);
if (!s1) {
perror("memalign");
exit(1);
}
- s2 = memalign(128, SIZE);
+ s2 = memalign(128, alloc_size);
if (!s2) {
perror("memalign");
exit(1);
}
- srandom(1);
+ srandom(time(0));
- for (i = 0; i < ITERATIONS; i++) {
+ for (i = 0; i < iterations; i++) {
unsigned long j;
unsigned long change;
+ char *rand_s1 = s1;
+ char *rand_s2 = s2;
- for (j = 0; j < SIZE; j++)
+ for (j = 0; j < alloc_size; j++)
s1[j] = random();
- memcpy(s2, s1, SIZE);
+ rand_s1 += random() % MAX_OFFSET_DIFF_S1_S2;
+ rand_s2 += random() % MAX_OFFSET_DIFF_S1_S2;
+ memcpy(rand_s2, rand_s1, comp_size);
/* change one byte */
- change = random() % SIZE;
- s2[change] = random() & 0xff;
-
- test_one(s1, s2);
+ change = random() % comp_size;
+ rand_s2[change] = random() & 0xff;
+
+ if (islarge)
+ test_one(rand_s1, rand_s2, LARGE_MAX_OFFSET,
+ LARGE_SIZE_START, comp_size);
+ else
+ test_one(rand_s1, rand_s2, SIZE, 0, comp_size);
}
- srandom(1);
+ srandom(time(0));
- for (i = 0; i < ITERATIONS; i++) {
+ for (i = 0; i < iterations; i++) {
unsigned long j;
unsigned long change;
+ char *rand_s1 = s1;
+ char *rand_s2 = s2;
- for (j = 0; j < SIZE; j++)
+ for (j = 0; j < alloc_size; j++)
s1[j] = random();
- memcpy(s2, s1, SIZE);
+ rand_s1 += random() % MAX_OFFSET_DIFF_S1_S2;
+ rand_s2 += random() % MAX_OFFSET_DIFF_S1_S2;
+ memcpy(rand_s2, rand_s1, comp_size);
/* change multiple bytes, 1/8 of total */
- for (j = 0; j < SIZE / 8; j++) {
- change = random() % SIZE;
+ for (j = 0; j < comp_size / 8; j++) {
+ change = random() % comp_size;
s2[change] = random() & 0xff;
}
- test_one(s1, s2);
+ if (islarge)
+ test_one(rand_s1, rand_s2, LARGE_MAX_OFFSET,
+ LARGE_SIZE_START, comp_size);
+ else
+ test_one(rand_s1, rand_s2, SIZE, 0, comp_size);
}
return 0;
}
+static int testcases(void)
+{
+ testcase(0);
+ testcase(1);
+ return 0;
+}
+
int main(void)
{
- return test_harness(testcase, "memcmp");
+ return test_harness(testcases, "memcmp");
}
--
1.8.3.1
* Re: [PATCH v6 1/4] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()
2018-05-25 4:07 ` [PATCH v6 1/4] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
@ 2018-05-28 10:35 ` Segher Boessenkool
2018-05-30 8:11 ` Simon Guo
0 siblings, 1 reply; 17+ messages in thread
From: Segher Boessenkool @ 2018-05-28 10:35 UTC (permalink / raw)
To: wei.guo.simon; +Cc: linuxppc-dev, Naveen N. Rao, Cyril Bur
On Fri, May 25, 2018 at 12:07:33PM +0800, wei.guo.simon@gmail.com wrote:
> _GLOBAL(memcmp)
> cmpdi cr1,r5,0
>
> - /* Use the short loop if both strings are not 8B aligned */
> - or r6,r3,r4
> + /* Use the short loop if the src/dst addresses are not
> + * with the same offset of 8 bytes align boundary.
> + */
> + xor r6,r3,r4
> andi. r6,r6,7
>
> - /* Use the short loop if length is less than 32B */
> - cmpdi cr6,r5,31
> + /* Fall back to short loop if compare at aligned addrs
> + * with less than 8 bytes.
> + */
> + cmpdi cr6,r5,7
>
> beq cr1,.Lzero
> - bne .Lshort
> - bgt cr6,.Llong
> + bgt cr6,.Lno_short
If this doesn't use cr0 anymore, you can do rlwinm r6,r6,0,7 instead of
andi r6,r6,7 .
> +.Lsameoffset_8bytes_make_align_start:
> + /* attempt to compare bytes not aligned with 8 bytes so that
> + * rest comparison can run based on 8 bytes alignment.
> + */
> + andi. r6,r3,7
> +
> + /* Try to compare the first double word which is not 8 bytes aligned:
> + * load the first double word at (src & ~7UL) and shift left appropriate
> + * bits before comparision.
> + */
> + clrlwi r6,r3,29
> + rlwinm r6,r6,3,0,28
Those last two lines are together just
rlwinm r6,r3,3,0x1c
> + subfc. r5,r6,r5
Why subfc? You don't use the carry.
> + rlwinm r6,r6,3,0,28
That's
slwi r6,r6,3
> + bgt cr0,8f
> + li r3,-1
> +8:
> + blr
blelr
li r3,-1
blr
(and more of the same things elsewhere).
Segher
* Re: [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2018-05-25 4:07 ` [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
@ 2018-05-28 11:05 ` Segher Boessenkool
2018-05-30 8:14 ` Simon Guo
2018-05-28 11:59 ` Michael Ellerman
1 sibling, 1 reply; 17+ messages in thread
From: Segher Boessenkool @ 2018-05-28 11:05 UTC (permalink / raw)
To: wei.guo.simon; +Cc: linuxppc-dev, Naveen N. Rao, Cyril Bur
On Fri, May 25, 2018 at 12:07:34PM +0800, wei.guo.simon@gmail.com wrote:
> + /* save and restore cr0 */
> + mfocrf r5,64
> + EXIT_VMX_OPS
> + mtocrf 64,r5
> + b .LcmpAB_lightweight
That's cr1, not cr0. You can use mcrf instead, it is cheaper (esp. if
you have it in a non-volatile CR field before so you need only one, if any).
> + vcmpequb. v7,v9,v10
> + bnl cr6,.Ldiffoffset_vmx_diff_found
In other places you say bf 24,... Dunno which is more readable, but
please pick one?
Segher
* Re: [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2018-05-25 4:07 ` [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
2018-05-28 11:05 ` Segher Boessenkool
@ 2018-05-28 11:59 ` Michael Ellerman
2018-05-30 8:15 ` Simon Guo
1 sibling, 1 reply; 17+ messages in thread
From: Michael Ellerman @ 2018-05-28 11:59 UTC (permalink / raw)
To: wei.guo.simon, linuxppc-dev
Cc: Paul Mackerras, Naveen N. Rao, Cyril Bur, Simon Guo
Hi Simon,
wei.guo.simon@gmail.com writes:
> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index f20e883..4ba7bb6 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -174,6 +235,13 @@ _GLOBAL(memcmp)
> blr
>
> .Llong:
> +#ifdef CONFIG_ALTIVEC
> + /* Try to use vmx loop if length is equal or greater than 4K */
> + cmpldi cr6,r5,VMX_THRESH
> + bge cr6,.Lsameoffset_vmx_cmp
> +
Here we decide to use vmx, but we don't do any CPU feature checks.
> @@ -332,7 +400,94 @@ _GLOBAL(memcmp)
> 8:
> blr
>
> +#ifdef CONFIG_ALTIVEC
> +.Lsameoffset_vmx_cmp:
> + /* Enter with src/dst addrs has the same offset with 8 bytes
> + * align boundary
> + */
> + ENTER_VMX_OPS
> + beq cr1,.Llong_novmx_cmp
> +
> +3:
> + /* need to check whether r4 has the same offset with r3
> + * for 16 bytes boundary.
> + */
> + xor r0,r3,r4
> + andi. r0,r0,0xf
> + bne .Ldiffoffset_vmx_cmp_start
> +
> + /* len is no less than 4KB. Need to align with 16 bytes further.
> + */
> + andi. rA,r3,8
> + LD rA,0,r3
> + beq 4f
> + LD rB,0,r4
> + cmpld cr0,rA,rB
> + addi r3,r3,8
> + addi r4,r4,8
> + addi r5,r5,-8
> +
> + beq cr0,4f
> + /* save and restore cr0 */
> + mfocrf r5,64
> + EXIT_VMX_OPS
> + mtocrf 64,r5
> + b .LcmpAB_lightweight
> +
> +4:
> + /* compare 32 bytes for each loop */
> + srdi r0,r5,5
> + mtctr r0
> + clrldi r5,r5,59
> + li off16,16
> +
> +.balign 16
> +5:
> + lvx v0,0,r3
> + lvx v1,0,r4
> + vcmpequd. v0,v0,v1
vcmpequd is only available on Power8 and later CPUs.
Which means this will crash on Power7 or earlier.
Something like this should fix it I think.
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 96eb08b2be2e..0a11ff14dcd9 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -236,9 +236,11 @@ _GLOBAL(memcmp)
.Llong:
#ifdef CONFIG_ALTIVEC
+BEGIN_FTR_SECTION
/* Try to use vmx loop if length is equal or greater than 4K */
cmpldi cr6,r5,VMX_THRESH
bge cr6,.Lsameoffset_vmx_cmp
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
.Llong_novmx_cmp:
#endif
There's another problem which is that old toolchains don't know about
vcmpequd. To fix that we'll need to add a macro that uses .long to
construct the instruction.
cheers
* Re: [PATCH v6 1/4] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()
2018-05-28 10:35 ` Segher Boessenkool
@ 2018-05-30 8:11 ` Simon Guo
2018-05-30 8:27 ` Segher Boessenkool
0 siblings, 1 reply; 17+ messages in thread
From: Simon Guo @ 2018-05-30 8:11 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: linuxppc-dev, Naveen N. Rao, Cyril Bur
Hi Segher,
On Mon, May 28, 2018 at 05:35:12AM -0500, Segher Boessenkool wrote:
> On Fri, May 25, 2018 at 12:07:33PM +0800, wei.guo.simon@gmail.com wrote:
> > _GLOBAL(memcmp)
> > cmpdi cr1,r5,0
> >
> > - /* Use the short loop if both strings are not 8B aligned */
> > - or r6,r3,r4
> > + /* Use the short loop if the src/dst addresses are not
> > + * with the same offset of 8 bytes align boundary.
> > + */
> > + xor r6,r3,r4
> > andi. r6,r6,7
> >
> > - /* Use the short loop if length is less than 32B */
> > - cmpdi cr6,r5,31
> > + /* Fall back to short loop if compare at aligned addrs
> > + * with less than 8 bytes.
> > + */
> > + cmpdi cr6,r5,7
> >
> > beq cr1,.Lzero
> > - bne .Lshort
> > - bgt cr6,.Llong
> > + bgt cr6,.Lno_short
>
> If this doesn't use cr0 anymore, you can do rlwinm r6,r6,0,7 instead of
> andi r6,r6,7 .
>
CR0 is used at .Lno_short handling.
> > +.Lsameoffset_8bytes_make_align_start:
> > + /* attempt to compare bytes not aligned with 8 bytes so that
> > + * rest comparison can run based on 8 bytes alignment.
> > + */
> > + andi. r6,r3,7
> > +
> > + /* Try to compare the first double word which is not 8 bytes aligned:
> > + * load the first double word at (src & ~7UL) and shift left appropriate
> > + * bits before comparision.
> > + */
> > + clrlwi r6,r3,29
> > + rlwinm r6,r6,3,0,28
>
> Those last two lines are together just
> rlwinm r6,r3,3,0x1c
>
Yes. I will combine them.
> > + subfc. r5,r6,r5
>
> Why subfc? You don't use the carry.
OK. I will use subfc instead.
>
> > + rlwinm r6,r6,3,0,28
>
> That's
> slwi r6,r6,3
Yes.
>
> > + bgt cr0,8f
> > + li r3,-1
> > +8:
> > + blr
>
> blelr
> li r3,-1
> blr
Sure. That looks more compact.
>
> (and more of the same things elsewhere).
>
>
> Segher
Thanks for your good comments.
BR,
- Simon
* Re: [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2018-05-28 11:05 ` Segher Boessenkool
@ 2018-05-30 8:14 ` Simon Guo
2018-05-30 8:35 ` Segher Boessenkool
0 siblings, 1 reply; 17+ messages in thread
From: Simon Guo @ 2018-05-30 8:14 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: linuxppc-dev, Naveen N. Rao, Cyril Bur
Hi Segher,
On Mon, May 28, 2018 at 06:05:59AM -0500, Segher Boessenkool wrote:
> On Fri, May 25, 2018 at 12:07:34PM +0800, wei.guo.simon@gmail.com wrote:
> > + /* save and restore cr0 */
> > + mfocrf r5,64
> > + EXIT_VMX_OPS
> > + mtocrf 64,r5
> > + b .LcmpAB_lightweight
>
> That's cr1, not cr0. You can use mcrf instead, it is cheaper (esp. if
> you have it in a non-volatile CR field before so you need only one, if any).
>
You are right :) How about using mtcr/mfcr instead, I think they are
fast as well and more readable.
> > + vcmpequb. v7,v9,v10
> > + bnl cr6,.Ldiffoffset_vmx_diff_found
>
> In other places you say bf 24,... Dunno which is more readable, but
> please pick one?
I will update to bnl for other cases.
>
>
> Segher
Thanks for your review.
BR,
- Simon
* Re: [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2018-05-28 11:59 ` Michael Ellerman
@ 2018-05-30 8:15 ` Simon Guo
0 siblings, 0 replies; 17+ messages in thread
From: Simon Guo @ 2018-05-30 8:15 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev, Paul Mackerras, Naveen N. Rao, Cyril Bur
Hi Michael,
On Mon, May 28, 2018 at 09:59:29PM +1000, Michael Ellerman wrote:
> Hi Simon,
>
> wei.guo.simon@gmail.com writes:
> > diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> > index f20e883..4ba7bb6 100644
> > --- a/arch/powerpc/lib/memcmp_64.S
> > +++ b/arch/powerpc/lib/memcmp_64.S
> > @@ -174,6 +235,13 @@ _GLOBAL(memcmp)
> > blr
> >
> > .Llong:
> > +#ifdef CONFIG_ALTIVEC
> > + /* Try to use vmx loop if length is equal or greater than 4K */
> > + cmpldi cr6,r5,VMX_THRESH
> > + bge cr6,.Lsameoffset_vmx_cmp
> > +
>
> Here we decide to use vmx, but we don't do any CPU feature checks.
>
>
> > @@ -332,7 +400,94 @@ _GLOBAL(memcmp)
> > 8:
> > blr
> >
> > +#ifdef CONFIG_ALTIVEC
> > +.Lsameoffset_vmx_cmp:
> > + /* Enter with src/dst addrs has the same offset with 8 bytes
> > + * align boundary
> > + */
> > + ENTER_VMX_OPS
> > + beq cr1,.Llong_novmx_cmp
> > +
> > +3:
> > + /* need to check whether r4 has the same offset with r3
> > + * for 16 bytes boundary.
> > + */
> > + xor r0,r3,r4
> > + andi. r0,r0,0xf
> > + bne .Ldiffoffset_vmx_cmp_start
> > +
> > + /* len is no less than 4KB. Need to align with 16 bytes further.
> > + */
> > + andi. rA,r3,8
> > + LD rA,0,r3
> > + beq 4f
> > + LD rB,0,r4
> > + cmpld cr0,rA,rB
> > + addi r3,r3,8
> > + addi r4,r4,8
> > + addi r5,r5,-8
> > +
> > + beq cr0,4f
> > + /* save and restore cr0 */
> > + mfocrf r5,64
> > + EXIT_VMX_OPS
> > + mtocrf 64,r5
> > + b .LcmpAB_lightweight
> > +
> > +4:
> > + /* compare 32 bytes for each loop */
> > + srdi r0,r5,5
> > + mtctr r0
> > + clrldi r5,r5,59
> > + li off16,16
> > +
> > +.balign 16
> > +5:
> > + lvx v0,0,r3
> > + lvx v1,0,r4
> > + vcmpequd. v0,v0,v1
>
> vcmpequd is only available on Power8 and later CPUs.
>
> Which means this will crash on Power7 or earlier.
>
> Something like this should fix it I think.
>
> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index 96eb08b2be2e..0a11ff14dcd9 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -236,9 +236,11 @@ _GLOBAL(memcmp)
>
> .Llong:
> #ifdef CONFIG_ALTIVEC
> +BEGIN_FTR_SECTION
> /* Try to use vmx loop if length is equal or greater than 4K */
> cmpldi cr6,r5,VMX_THRESH
> bge cr6,.Lsameoffset_vmx_cmp
> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>
> .Llong_novmx_cmp:
> #endif
Thanks for the good catch! I will update that.
>
>
> There's another problem which is that old toolchains don't know about
> vcmpequd. To fix that we'll need to add a macro that uses .long to
> construct the instruction.
Right. I will add the corresponding macros.
Thanks for your review.
BR,
- Simon
* Re: [PATCH v6 1/4] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()
2018-05-30 8:11 ` Simon Guo
@ 2018-05-30 8:27 ` Segher Boessenkool
2018-05-30 9:02 ` Simon Guo
0 siblings, 1 reply; 17+ messages in thread
From: Segher Boessenkool @ 2018-05-30 8:27 UTC (permalink / raw)
To: Simon Guo; +Cc: linuxppc-dev, Naveen N. Rao, Cyril Bur
Hi!
On Wed, May 30, 2018 at 04:11:50PM +0800, Simon Guo wrote:
> On Mon, May 28, 2018 at 05:35:12AM -0500, Segher Boessenkool wrote:
> > On Fri, May 25, 2018 at 12:07:33PM +0800, wei.guo.simon@gmail.com wrote:
> > If this doesn't use cr0 anymore, you can do rlwinm r6,r6,0,7 instead of
> > andi r6,r6,7 .
> >
> CR0 is used at .Lno_short handling.
Tricky.
> > > + subfc. r5,r6,r5
> >
> > Why subfc? You don't use the carry.
> OK. I will use subfc instead.
I meant subf -- no carry. If you want CR0 set there is subf. just fine.
> > > + bgt cr0,8f
> > > + li r3,-1
> > > +8:
> > > + blr
> >
> > blelr
> > li r3,-1
> > blr
> > Sure. That looks more compact.
Should have been bgtlr of course -- well check please :-)
Segher
* Re: [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2018-05-30 8:14 ` Simon Guo
@ 2018-05-30 8:35 ` Segher Boessenkool
2018-05-30 9:03 ` Simon Guo
0 siblings, 1 reply; 17+ messages in thread
From: Segher Boessenkool @ 2018-05-30 8:35 UTC (permalink / raw)
To: Simon Guo; +Cc: linuxppc-dev, Naveen N. Rao, Cyril Bur
On Wed, May 30, 2018 at 04:14:02PM +0800, Simon Guo wrote:
> Hi Segher,
> On Mon, May 28, 2018 at 06:05:59AM -0500, Segher Boessenkool wrote:
> > On Fri, May 25, 2018 at 12:07:34PM +0800, wei.guo.simon@gmail.com wrote:
> > > + /* save and restore cr0 */
> > > + mfocrf r5,64
> > > + EXIT_VMX_OPS
> > > + mtocrf 64,r5
> > > + b .LcmpAB_lightweight
> >
> > That's cr1, not cr0. You can use mcrf instead, it is cheaper (esp. if
> > you have it in a non-volatile CR field before so you need only one, if any).
> >
> You are right :) How about using mtcr/mfcr instead, I think they are
> fast as well and more readable.
Those are much worse than m[ft]ocrf.
You probably should just shuffle things around so that EXIT_VMX_OPS
does not clobber the CR field you need to keep.
Segher
* Re: [PATCH v6 1/4] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()
2018-05-30 8:27 ` Segher Boessenkool
@ 2018-05-30 9:02 ` Simon Guo
0 siblings, 0 replies; 17+ messages in thread
From: Simon Guo @ 2018-05-30 9:02 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: linuxppc-dev, Naveen N. Rao, Cyril Bur
On Wed, May 30, 2018 at 03:27:39AM -0500, Segher Boessenkool wrote:
> Hi!
>
> On Wed, May 30, 2018 at 04:11:50PM +0800, Simon Guo wrote:
> > On Mon, May 28, 2018 at 05:35:12AM -0500, Segher Boessenkool wrote:
> > > On Fri, May 25, 2018 at 12:07:33PM +0800, wei.guo.simon@gmail.com wrote:
> > > If this doesn't use cr0 anymore, you can do rlwinm r6,r6,0,7 instead of
> > > andi r6,r6,7 .
> > >
> > CR0 is used at .Lno_short handling.
>
> Tricky.
>
> > > > + subfc. r5,r6,r5
> > >
> > > Why subfc? You don't use the carry.
> > OK. I will use subfc instead.
>
> I meant subf -- no carry. If you want CR0 set there is subf. just fine.
>
> > > > + bgt cr0,8f
> > > > + li r3,-1
> > > > +8:
> > > > + blr
> > >
> > > blelr
> > > li r3,-1
> > > blr
> > Sure. That looks more impact.
>
> Should have been bgtlr of course -- well check please :-)
Yes :)
Thanks,
- Simon
* Re: [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2018-05-30 8:35 ` Segher Boessenkool
@ 2018-05-30 9:03 ` Simon Guo
2018-06-06 6:42 ` Simon Guo
0 siblings, 1 reply; 17+ messages in thread
From: Simon Guo @ 2018-05-30 9:03 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: linuxppc-dev, Naveen N. Rao, Cyril Bur
On Wed, May 30, 2018 at 03:35:40AM -0500, Segher Boessenkool wrote:
> On Wed, May 30, 2018 at 04:14:02PM +0800, Simon Guo wrote:
> > Hi Segher,
> > On Mon, May 28, 2018 at 06:05:59AM -0500, Segher Boessenkool wrote:
> > > On Fri, May 25, 2018 at 12:07:34PM +0800, wei.guo.simon@gmail.com wrote:
> > > > + /* save and restore cr0 */
> > > > + mfocrf r5,64
> > > > + EXIT_VMX_OPS
> > > > + mtocrf 64,r5
> > > > + b .LcmpAB_lightweight
> > >
> > > That's cr1, not cr0. You can use mcrf instead, it is cheaper (esp. if
> > > you have it in a non-volatile CR field before so you need only one, if any).
> > >
> > You are right :) How about using mtcr/mfcr instead, I think they are
> > fast as well and more readable.
>
> Those are much worse than m[ft]ocrf.
>
> You probably should just shuffle things around so that EXIT_VMX_OPS
> does not clobber the CR field you need to keep.
Let me use mcrf then :)
Thanks,
- Simon
* Re: [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2018-05-30 9:03 ` Simon Guo
@ 2018-06-06 6:42 ` Simon Guo
2018-06-06 20:00 ` Segher Boessenkool
0 siblings, 1 reply; 17+ messages in thread
From: Simon Guo @ 2018-06-06 6:42 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: linuxppc-dev, Naveen N. Rao, Cyril Bur
Hi segher,
On Wed, May 30, 2018 at 05:03:21PM +0800, Simon Guo wrote:
> On Wed, May 30, 2018 at 03:35:40AM -0500, Segher Boessenkool wrote:
> > On Wed, May 30, 2018 at 04:14:02PM +0800, Simon Guo wrote:
> > > Hi Segher,
> > > On Mon, May 28, 2018 at 06:05:59AM -0500, Segher Boessenkool wrote:
> > > > On Fri, May 25, 2018 at 12:07:34PM +0800, wei.guo.simon@gmail.com wrote:
> > > > > + /* save and restore cr0 */
> > > > > + mfocrf r5,64
> > > > > + EXIT_VMX_OPS
> > > > > + mtocrf 64,r5
> > > > > + b .LcmpAB_lightweight
> > > >
> > > > That's cr1, not cr0. You can use mcrf instead, it is cheaper (esp. if
> > > > you have it in a non-volatile CR field before so you need only one, if any).
> > > >
> > > You are right :) How about using mtcr/mfcr instead, I think they are
> > > fast as well and more readable.
> >
> > Those are much worse than m[ft]ocrf.
> >
> > You probably should just shuffle things around so that EXIT_VMX_OPS
> > does not clobber the CR field you need to keep.
> Let me use mcrf then :)
I now feel uncomfortable using mcrf like:
mcrf 7,0
since I cannot be 100% confident that the compiler will not use CR7 or
another CR field in exit_vmx_ops().
Can we switch back to mfocrf/mtocrf with correct CR0 value?
mfocrf r5,128
...
mtocrf 128,r5
Thanks,
- Simon
* Re: [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
2018-06-06 6:42 ` Simon Guo
@ 2018-06-06 20:00 ` Segher Boessenkool
0 siblings, 0 replies; 17+ messages in thread
From: Segher Boessenkool @ 2018-06-06 20:00 UTC (permalink / raw)
To: Simon Guo; +Cc: linuxppc-dev, Naveen N. Rao, Cyril Bur
On Wed, Jun 06, 2018 at 02:42:27PM +0800, Simon Guo wrote:
> I now feel uncomfortable using mcrf like:
> mcrf 7,0
>
> since I cannot be 100% confident that the compiler will not use CR7 or
> another CR field in exit_vmx_ops().
It wasn't clear to me this macro boils down to a function call.
You can use CR2,CR3,CR4, but you'll need to save and restore those at
the start and end of function then, which is just as nasty.
Better is to restructure some code so you don't need that CR field
there anymore.
> Can we switch back to mfocrf/mtocrf with correct CR0 value?
> mfocrf r5,128
> ...
> mtocrf 128,r5
Sure, I'm not your boss ;-) It seems a shame to me to have this 12 or
whatever cycle delay here, since the whole point of the patch is to
make things faster, that's all (but it still is faster, right, you
tested it).
Segher
end of thread, other threads:[~2018-06-06 20:00 UTC | newest]
Thread overview: 17+ messages
2018-05-25 4:07 [PATCH v6 0/4] powerpc/64: memcmp() optimization wei.guo.simon
2018-05-25 4:07 ` [PATCH v6 1/4] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
2018-05-28 10:35 ` Segher Boessenkool
2018-05-30 8:11 ` Simon Guo
2018-05-30 8:27 ` Segher Boessenkool
2018-05-30 9:02 ` Simon Guo
2018-05-25 4:07 ` [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
2018-05-28 11:05 ` Segher Boessenkool
2018-05-30 8:14 ` Simon Guo
2018-05-30 8:35 ` Segher Boessenkool
2018-05-30 9:03 ` Simon Guo
2018-06-06 6:42 ` Simon Guo
2018-06-06 20:00 ` Segher Boessenkool
2018-05-28 11:59 ` Michael Ellerman
2018-05-30 8:15 ` Simon Guo
2018-05-25 4:07 ` [PATCH v6 3/4] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp() wei.guo.simon
2018-05-25 4:07 ` [PATCH v6 4/4] powerpc:selftest update memcmp_64 selftest for VMX implementation wei.guo.simon