* [PATCH v7 0/5] powerpc/64: memcmp() optimization
@ 2018-05-30  9:20 wei.guo.simon
  2018-05-30  9:20 ` [PATCH v7 1/5] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: wei.guo.simon @ 2018-05-30  9:20 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Paul Mackerras, Michael Ellerman, Naveen N.  Rao, Cyril Bur, Simon Guo

From: Simon Guo <wei.guo.simon@gmail.com>

There is some room to optimize memcmp() in the powerpc 64-bit version for
the following 2 cases:
(1) Even if the src/dst addresses are not 8-byte aligned at the beginning,
memcmp() can align them and use the .Llong comparison mode without
falling back to the .Lshort comparison mode, which compares the buffer
byte by byte.
(2) VMX instructions can be used to speed up large size comparisons;
currently the threshold is set at 4K bytes. Note that the VMX instructions
will incur a VMX register save/restore penalty. This patch set includes a
patch to add a 32-byte pre-check to minimize that penalty.

This takes a similar approach to glibc commit dec4a7105e (powerpc: Improve
memcmp performance for POWER8). Thanks to Cyril Bur for the information.
This patch set also updates the memcmp selftest case to make it compile
and to incorporate a large size comparison case.
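
For reference, a rough C model of the resulting control flow is sketched
below. It is illustrative only: the actual implementation is hand-written
assembly in arch/powerpc/lib/memcmp_64.S, and the function name and
structure here are simplified assumptions.
------
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VMX_THRESH 4096	/* threshold used by this series */

/* memcmp_model() is an invented name, for illustration only. */
static int memcmp_model(const unsigned char *s1, const unsigned char *s2,
			size_t n)
{
	/* Case (1): get s1 onto an 8-byte boundary so the bulk loop can
	 * use doubleword loads instead of the byte-by-byte .Lshort loop.
	 */
	while (n && ((uintptr_t)s1 & 7)) {
		if (*s1 != *s2)
			return *s1 < *s2 ? -1 : 1;
		s1++, s2++, n--;
	}

	/* Case (2): for n >= VMX_THRESH the assembly switches to 16-byte
	 * VMX compares (preceded by a 32-byte pre-check); modelled here
	 * as the same 8-byte loop for simplicity.
	 */
	while (n >= 8) {
		int d = memcmp(s1, s2, 8);	/* one doubleword at a time */
		if (d)
			return d;
		s1 += 8, s2 += 8, n -= 8;
	}

	return n ? memcmp(s1, s2, n) : 0;
}
------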

v6 -> v7:
- add vcmpequd/vcmpequb .long macros
- add a CPU_FTR section pair so that POWER7 won't invoke Altivec instructions.
- rework some instructions for higher performance or better readability.

v5 -> v6:
- correct some comments/commit messages.
- rename VMX_OPS_THRES to VMX_THRESH

v4 -> v5:
- Expand the 32-byte pre-check to the src/dst different offset case, and
remove the KSM-specific label/comment.

v3 -> v4:
- Add 32 bytes pre-checking before using VMX instructions.

v2 -> v3:
- add optimization for src/dst with different offsets against the 8-byte
boundary.
- renamed some labels.
- addressed review comments from Cyril Bur, such as filling the pipeline
and using VMX when size == 4K.
- fixed an enter/exit_vmx_ops pairing bug, and revised the test case to
verify that enter/exit_vmx_ops are paired.

v1 -> v2:
- update the 8-byte unaligned bytes comparison method.
- fix a VMX comparison bug.
- enhanced the original memcmp() selftest.
- add powerpc/64 to subject/commit message.


Simon Guo (5):
  powerpc/64: Align bytes before fall back to .Lshort in powerpc64
    memcmp()
  powerpc: add vcmpequd/vcmpequb ppc instruction macro
  powerpc/64: enhance memcmp() with VMX instruction for long bytes
    comparision
  powerpc/64: add 32 bytes prechecking before using VMX optimization on
    memcmp()
  powerpc:selftest update memcmp_64 selftest for VMX implementation

 arch/powerpc/include/asm/asm-prototypes.h          |   4 +-
 arch/powerpc/include/asm/ppc-opcode.h              |  11 +
 arch/powerpc/lib/copypage_power7.S                 |   4 +-
 arch/powerpc/lib/memcmp_64.S                       | 412 ++++++++++++++++++++-
 arch/powerpc/lib/memcpy_power7.S                   |   6 +-
 arch/powerpc/lib/vmx-helper.c                      |   4 +-
 .../selftests/powerpc/copyloops/asm/ppc_asm.h      |   4 +-
 .../selftests/powerpc/stringloops/asm/ppc-opcode.h |  39 ++
 .../selftests/powerpc/stringloops/asm/ppc_asm.h    |  24 ++
 .../testing/selftests/powerpc/stringloops/memcmp.c |  98 +++--
 10 files changed, 566 insertions(+), 40 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v7 1/5] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()
  2018-05-30  9:20 [PATCH v7 0/5] powerpc/64: memcmp() optimization wei.guo.simon
@ 2018-05-30  9:20 ` wei.guo.simon
  2018-05-30  9:21 ` [PATCH v7 2/5] powerpc: add vcmpequd/vcmpequb ppc instruction macro wei.guo.simon
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: wei.guo.simon @ 2018-05-30  9:20 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Paul Mackerras, Michael Ellerman, Naveen N.  Rao, Cyril Bur, Simon Guo

From: Simon Guo <wei.guo.simon@gmail.com>

Currently the powerpc 64-bit version of memcmp() will fall back to .Lshort
(byte-by-byte compare mode) if either the src or dst address is not 8-byte
aligned. This can be optimized in 2 situations:

1) If both addresses have the same offset against the 8-byte boundary:
memcmp() can compare the unaligned bytes up to the 8-byte boundary first
and then compare the remaining 8-byte-aligned content in .Llong mode.

2) If the src/dst addresses have different offsets against the 8-byte
boundary: memcmp() can align the src address to 8 bytes, increment the dst
address accordingly, then load src with aligned loads and dst with
unaligned loads.

This patch optimizes memcmp() behavior in the above 2 situations.
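
As an illustration of situation 1), a hedged C sketch of the leading
doubleword trick follows. It is not part of the patch: the real code is the
.Lsameoffset_8bytes_make_align_start assembly below, and the helper names
here are invented. It assumes a little-endian host and that at least
(8 - offset) bytes remain to compare (the asm only enters this path when
the length is >= 8).
------
#include <stdint.h>
#include <string.h>

/* Model of LD (ldbrx on LE): load 8 bytes in memory (big-endian) order. */
static inline uint64_t load_memorder64(const void *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return __builtin_bswap64(v);	/* assumes a little-endian host */
}

/* Compare the bytes of s1/s2 that precede the next 8-byte boundary,
 * where s1 and s2 share the same offset against that boundary.
 */
static int cmp_leading_bytes(const unsigned char *s1, const unsigned char *s2)
{
	unsigned int off = (uintptr_t)s1 & 7;	/* same offset for s2 */
	unsigned int shift = off * 8;		/* rlwinm r6,r3,3,26,28 */

	/* Load the containing aligned doublewords (clrrdi + LD) ... */
	uint64_t a = load_memorder64((const void *)((uintptr_t)s1 & ~7UL));
	uint64_t b = load_memorder64((const void *)((uintptr_t)s2 & ~7UL));

	/* ... and shift out the bytes below s1/s2 (sld rA,rA,r6). */
	a <<= shift;
	b <<= shift;

	if (a == b)
		return 0;
	return a < b ? -1 : 1;
}
------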

Tested with both little/big endian. Performance result below is based on
little endian.

Following is the test result for the case where src/dst have the same offset
(a similar result was observed when src/dst have different offsets):
(1) 256 bytes
Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
- without patch
	29.773018302 seconds time elapsed                                          ( +- 0.09% )
- with patch
	16.485568173 seconds time elapsed                                          ( +-  0.02% )
		-> There is a ~80% improvement

(2) 32 bytes
To observe the performance impact on < 32 bytes, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c as follows:
-------
 #include <string.h>
 #include "utils.h"

-#define SIZE 256
+#define SIZE 32
 #define ITERATIONS 10000

 int test_memcmp(const void *s1, const void *s2, size_t n);
--------

- Without patch
	0.244746482 seconds time elapsed                                          ( +-  0.36%)
- with patch
	0.215069477 seconds time elapsed                                          ( +-  0.51%)
		-> There is a ~13% improvement

(3) 0~8 bytes
To observe the performance impact on < 8 bytes, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c as follows:
-------
 #include <string.h>
 #include "utils.h"

-#define SIZE 256
-#define ITERATIONS 10000
+#define SIZE 8
+#define ITERATIONS 1000000

 int test_memcmp(const void *s1, const void *s2, size_t n);
-------
- Without patch
       1.845642503 seconds time elapsed                                          ( +- 0.12% )
- With patch
       1.849767135 seconds time elapsed                                          ( +- 0.26% )
		-> They are nearly the same. (-0.2%)

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
 arch/powerpc/lib/memcmp_64.S | 140 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 133 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index d75d18b..5776f91 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -24,28 +24,41 @@
 #define rH	r31
 
 #ifdef __LITTLE_ENDIAN__
+#define LH	lhbrx
+#define LW	lwbrx
 #define LD	ldbrx
 #else
+#define LH	lhzx
+#define LW	lwzx
 #define LD	ldx
 #endif
 
+/*
+ * There are 2 categories for memcmp:
+ * 1) src/dst has the same offset to the 8 bytes boundary. The handlers
+ * are named like .Lsameoffset_xxxx
+ * 2) src/dst has different offset to the 8 bytes boundary. The handlers
+ * are named like .Ldiffoffset_xxxx
+ */
 _GLOBAL(memcmp)
 	cmpdi	cr1,r5,0
 
-	/* Use the short loop if both strings are not 8B aligned */
-	or	r6,r3,r4
+	/* Use the short loop if the src/dst addresses are not
+	 * with the same offset of 8 bytes align boundary.
+	 */
+	xor	r6,r3,r4
 	andi.	r6,r6,7
 
-	/* Use the short loop if length is less than 32B */
-	cmpdi	cr6,r5,31
+	/* Fall back to short loop if compare at aligned addrs
+	 * with less than 8 bytes.
+	 */
+	cmpdi   cr6,r5,7
 
 	beq	cr1,.Lzero
-	bne	.Lshort
-	bgt	cr6,.Llong
+	bgt	cr6,.Lno_short
 
 .Lshort:
 	mtctr	r5
-
 1:	lbz	rA,0(r3)
 	lbz	rB,0(r4)
 	subf.	rC,rB,rA
@@ -78,11 +91,89 @@ _GLOBAL(memcmp)
 	li	r3,0
 	blr
 
+.Lno_short:
+	dcbt	0,r3
+	dcbt	0,r4
+	bne	.Ldiffoffset_8bytes_make_align_start
+
+
+.Lsameoffset_8bytes_make_align_start:
+	/* attempt to compare bytes not aligned with 8 bytes so that
+	 * rest comparison can run based on 8 bytes alignment.
+	 */
+	andi.   r6,r3,7
+
+	/* Try to compare the first double word which is not 8 bytes aligned:
+	 * load the first double word at (src & ~7UL) and shift left appropriate
+	 * bits before comparision.
+	 */
+	rlwinm  r6,r3,3,26,28
+	beq     .Lsameoffset_8bytes_aligned
+	clrrdi	r3,r3,3
+	clrrdi	r4,r4,3
+	LD	rA,0,r3
+	LD	rB,0,r4
+	sld	rA,rA,r6
+	sld	rB,rB,r6
+	cmpld	cr0,rA,rB
+	srwi	r6,r6,3
+	bne	cr0,.LcmpAB_lightweight
+	subfic  r6,r6,8
+	subf.	r5,r6,r5
+	addi	r3,r3,8
+	addi	r4,r4,8
+	beq	.Lzero
+
+.Lsameoffset_8bytes_aligned:
+	/* now we are aligned with 8 bytes.
+	 * Use .Llong loop if left cmp bytes are equal or greater than 32B.
+	 */
+	cmpdi   cr6,r5,31
+	bgt	cr6,.Llong
+
+.Lcmp_lt32bytes:
+	/* compare 1 ~ 32 bytes, at least r3 addr is 8 bytes aligned now */
+	cmpdi   cr5,r5,7
+	srdi    r0,r5,3
+	ble	cr5,.Lcmp_rest_lt8bytes
+
+	/* handle 8 ~ 31 bytes */
+	clrldi  r5,r5,61
+	mtctr   r0
+2:
+	LD	rA,0,r3
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	addi	r3,r3,8
+	addi	r4,r4,8
+	bne	cr0,.LcmpAB_lightweight
+	bdnz	2b
+
+	cmpwi   r5,0
+	beq	.Lzero
+
+.Lcmp_rest_lt8bytes:
+	/* Here we have only less than 8 bytes to compare with. at least s1
+	 * Address is aligned with 8 bytes.
+	 * The next double words are load and shift right with appropriate
+	 * bits.
+	 */
+	subfic  r6,r5,8
+	slwi	r6,r6,3
+	LD	rA,0,r3
+	LD	rB,0,r4
+	srd	rA,rA,r6
+	srd	rB,rB,r6
+	cmpld	cr0,rA,rB
+	bne	cr0,.LcmpAB_lightweight
+	b	.Lzero
+
 .Lnon_zero:
 	mr	r3,rC
 	blr
 
 .Llong:
+	/* At least s1 addr is aligned with 8 bytes */
 	li	off8,8
 	li	off16,16
 	li	off24,24
@@ -232,4 +323,39 @@ _GLOBAL(memcmp)
 	ld	r28,-32(r1)
 	ld	r27,-40(r1)
 	blr
+
+.LcmpAB_lightweight:   /* skip NV GPRS restore */
+	li	r3,1
+	bgtlr
+	li	r3,-1
+	blr
+
+.Ldiffoffset_8bytes_make_align_start:
+	/* now try to align s1 with 8 bytes */
+	rlwinm  r6,r3,3,26,28
+	beq     .Ldiffoffset_align_s1_8bytes
+
+	clrrdi	r3,r3,3
+	LD	rA,0,r3
+	LD	rB,0,r4  /* unaligned load */
+	sld	rA,rA,r6
+	srd	rA,rA,r6
+	srd	rB,rB,r6
+	cmpld	cr0,rA,rB
+	srwi	r6,r6,3
+	bne	cr0,.LcmpAB_lightweight
+
+	subfic  r6,r6,8
+	subf.	r5,r6,r5
+	addi	r3,r3,8
+	add	r4,r4,r6
+
+	beq	.Lzero
+
+.Ldiffoffset_align_s1_8bytes:
+	/* now s1 is aligned with 8 bytes. */
+	cmpdi   cr5,r5,31
+	ble	cr5,.Lcmp_lt32bytes
+	b	.Llong
+
 EXPORT_SYMBOL(memcmp)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v7 2/5] powerpc: add vcmpequd/vcmpequb ppc instruction macro
  2018-05-30  9:20 [PATCH v7 0/5] powerpc/64: memcmp() optimization wei.guo.simon
  2018-05-30  9:20 ` [PATCH v7 1/5] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
@ 2018-05-30  9:21 ` wei.guo.simon
  2018-05-30  9:21 ` [PATCH v7 3/5] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: wei.guo.simon @ 2018-05-30  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Paul Mackerras, Michael Ellerman, Naveen N.  Rao, Cyril Bur, Simon Guo

From: Simon Guo <wei.guo.simon@gmail.com>

Some old toolchains don't know about instructions like vcmpequd.

This patch adds .long macros for vcmpequd and vcmpequb, in preparation
for optimizing ppc64 memcmp() with VMX instructions.
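
For illustration, the small user-space program below shows how such a .long
encoding is assembled from the instruction fields. The macros mirror the
ones added to ppc-opcode.h in this patch; the program itself is not part of
the patch.
------
#include <stdint.h>
#include <stdio.h>

#define PPC_INST_VCMPEQUD	0x100000c7
#define ___PPC_RA(a)		(((a) & 0x1f) << 16)
#define ___PPC_RB(b)		(((b) & 0x1f) << 11)
#define ___PPC_RT(t)		(((t) & 0x1f) << 21)
#define __PPC_RC21		(0x1 << 10)	/* Rc bit: record form */

int main(void)
{
	/* vcmpequd. v0,v0,v1 -- what VCMPEQUD_RC(v0, v0, v1) emits */
	uint32_t insn = PPC_INST_VCMPEQUD | ___PPC_RT(0) | ___PPC_RA(0) |
			___PPC_RB(1) | __PPC_RC21;

	printf(".long 0x%08x\n", insn);		/* prints 0x10000cc7 */
	return 0;
}
------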

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
 arch/powerpc/include/asm/ppc-opcode.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 18883b8..1866a97 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -366,6 +366,8 @@
 #define PPC_INST_STFDX			0x7c0005ae
 #define PPC_INST_LVX			0x7c0000ce
 #define PPC_INST_STVX			0x7c0001ce
+#define PPC_INST_VCMPEQUD		0x100000c7
+#define PPC_INST_VCMPEQUB		0x10000006
 
 /* macros to insert fields into opcodes */
 #define ___PPC_RA(a)	(((a) & 0x1f) << 16)
@@ -396,6 +398,7 @@
 #define __PPC_BI(s)	(((s) & 0x1f) << 16)
 #define __PPC_CT(t)	(((t) & 0x0f) << 21)
 #define __PPC_SPR(r)	((((r) & 0x1f) << 16) | ((((r) >> 5) & 0x1f) << 11))
+#define __PPC_RC21	(0x1 << 10)
 
 /*
  * Only use the larx hint bit on 64bit CPUs. e500v1/v2 based CPUs will treat a
@@ -567,4 +570,12 @@
 				       ((IH & 0x7) << 21))
 #define PPC_INVALIDATE_ERAT	PPC_SLBIA(7)
 
+#define VCMPEQUD_RC(vrt, vra, vrb)	stringify_in_c(.long PPC_INST_VCMPEQUD | \
+			      ___PPC_RT(vrt) | ___PPC_RA(vra) | \
+			      ___PPC_RB(vrb) | __PPC_RC21)
+
+#define VCMPEQUB_RC(vrt, vra, vrb)	stringify_in_c(.long PPC_INST_VCMPEQUB | \
+			      ___PPC_RT(vrt) | ___PPC_RA(vra) | \
+			      ___PPC_RB(vrb) | __PPC_RC21)
+
 #endif /* _ASM_POWERPC_PPC_OPCODE_H */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v7 3/5] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2018-05-30  9:20 [PATCH v7 0/5] powerpc/64: memcmp() optimization wei.guo.simon
  2018-05-30  9:20 ` [PATCH v7 1/5] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
  2018-05-30  9:21 ` [PATCH v7 2/5] powerpc: add vcmpequd/vcmpequb ppc instruction macro wei.guo.simon
@ 2018-05-30  9:21 ` wei.guo.simon
  2018-05-30  9:21 ` [PATCH v7 4/5] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp() wei.guo.simon
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: wei.guo.simon @ 2018-05-30  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Paul Mackerras, Michael Ellerman, Naveen N.  Rao, Cyril Bur, Simon Guo

From: Simon Guo <wei.guo.simon@gmail.com>

This patch adds VMX primitives to do memcmp() when the compare size is
equal to or greater than 4K bytes. The KSM feature can benefit from this.

Test result with the following test program (replace the "^>" prefix with ""):
------
># cat tools/testing/selftests/powerpc/stringloops/memcmp.c
>#include <malloc.h>
>#include <stdlib.h>
>#include <string.h>
>#include <time.h>
>#include "utils.h"
>#define SIZE (1024 * 1024 * 900)
>#define ITERATIONS 40

int test_memcmp(const void *s1, const void *s2, size_t n);

static int testcase(void)
{
        char *s1;
        char *s2;
        unsigned long i;

        s1 = memalign(128, SIZE);
        if (!s1) {
                perror("memalign");
                exit(1);
        }

        s2 = memalign(128, SIZE);
        if (!s2) {
                perror("memalign");
                exit(1);
        }

        for (i = 0; i < SIZE; i++)  {
                s1[i] = i & 0xff;
                s2[i] = i & 0xff;
        }
        for (i = 0; i < ITERATIONS; i++) {
		int ret = test_memcmp(s1, s2, SIZE);

		if (ret) {
			printf("return %d at[%ld]! should have returned zero\n", ret, i);
			abort();
		}
	}

        return 0;
}

int main(void)
{
        return test_harness(testcase, "memcmp");
}
------
Without this patch (but with the first patch "powerpc/64: Align bytes
before fall back to .Lshort in powerpc64 memcmp()." in the series):
	4.726728762 seconds time elapsed                                          ( +-  3.54%)
With VMX patch:
	4.234335473 seconds time elapsed                                          ( +-  2.63%)
		There is a ~10% improvement.

Testing with an unaligned and different-offset version (s1 and s2 shifted
by random offsets within 16 bytes) achieves a higher improvement than 10%.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
 arch/powerpc/include/asm/asm-prototypes.h |   4 +-
 arch/powerpc/lib/copypage_power7.S        |   4 +-
 arch/powerpc/lib/memcmp_64.S              | 239 +++++++++++++++++++++++++++++-
 arch/powerpc/lib/memcpy_power7.S          |   6 +-
 arch/powerpc/lib/vmx-helper.c             |   4 +-
 5 files changed, 247 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index d9713ad..31fdcee 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -49,8 +49,8 @@ void __trace_hcall_exit(long opcode, unsigned long retval,
 /* VMX copying */
 int enter_vmx_usercopy(void);
 int exit_vmx_usercopy(void);
-int enter_vmx_copy(void);
-void * exit_vmx_copy(void *dest);
+int enter_vmx_ops(void);
+void *exit_vmx_ops(void *dest);
 
 /* Traps */
 long machine_check_early(struct pt_regs *regs);
diff --git a/arch/powerpc/lib/copypage_power7.S b/arch/powerpc/lib/copypage_power7.S
index 8fa73b7..e38f956 100644
--- a/arch/powerpc/lib/copypage_power7.S
+++ b/arch/powerpc/lib/copypage_power7.S
@@ -57,7 +57,7 @@ _GLOBAL(copypage_power7)
 	std	r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
 	std	r0,16(r1)
 	stdu	r1,-STACKFRAMESIZE(r1)
-	bl	enter_vmx_copy
+	bl	enter_vmx_ops
 	cmpwi	r3,0
 	ld	r0,STACKFRAMESIZE+16(r1)
 	ld	r3,STK_REG(R31)(r1)
@@ -100,7 +100,7 @@ _GLOBAL(copypage_power7)
 	addi	r3,r3,128
 	bdnz	1b
 
-	b	exit_vmx_copy		/* tail call optimise */
+	b	exit_vmx_ops		/* tail call optimise */
 
 #else
 	li	r0,(PAGE_SIZE/128)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 5776f91..aef0e41 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -9,6 +9,7 @@
  */
 #include <asm/ppc_asm.h>
 #include <asm/export.h>
+#include <asm/ppc-opcode.h>
 
 #define off8	r6
 #define off16	r7
@@ -27,12 +28,73 @@
 #define LH	lhbrx
 #define LW	lwbrx
 #define LD	ldbrx
+#define LVS	lvsr
+#define VPERM(_VRT,_VRA,_VRB,_VRC) \
+	vperm _VRT,_VRB,_VRA,_VRC
 #else
 #define LH	lhzx
 #define LW	lwzx
 #define LD	ldx
+#define LVS	lvsl
+#define VPERM(_VRT,_VRA,_VRB,_VRC) \
+	vperm _VRT,_VRA,_VRB,_VRC
 #endif
 
+#define VMX_THRESH 4096
+#define ENTER_VMX_OPS	\
+	mflr    r0;	\
+	std     r3,-STACKFRAMESIZE+STK_REG(R31)(r1); \
+	std     r4,-STACKFRAMESIZE+STK_REG(R30)(r1); \
+	std     r5,-STACKFRAMESIZE+STK_REG(R29)(r1); \
+	std     r0,16(r1); \
+	stdu    r1,-STACKFRAMESIZE(r1); \
+	bl      enter_vmx_ops; \
+	cmpwi   cr1,r3,0; \
+	ld      r0,STACKFRAMESIZE+16(r1); \
+	ld      r3,STK_REG(R31)(r1); \
+	ld      r4,STK_REG(R30)(r1); \
+	ld      r5,STK_REG(R29)(r1); \
+	addi	r1,r1,STACKFRAMESIZE; \
+	mtlr    r0
+
+#define EXIT_VMX_OPS \
+	mflr    r0; \
+	std     r3,-STACKFRAMESIZE+STK_REG(R31)(r1); \
+	std     r4,-STACKFRAMESIZE+STK_REG(R30)(r1); \
+	std     r5,-STACKFRAMESIZE+STK_REG(R29)(r1); \
+	std     r0,16(r1); \
+	stdu    r1,-STACKFRAMESIZE(r1); \
+	bl      exit_vmx_ops; \
+	ld      r0,STACKFRAMESIZE+16(r1); \
+	ld      r3,STK_REG(R31)(r1); \
+	ld      r4,STK_REG(R30)(r1); \
+	ld      r5,STK_REG(R29)(r1); \
+	addi	r1,r1,STACKFRAMESIZE; \
+	mtlr    r0
+
+/*
+ * LD_VSR_CROSS16B load the 2nd 16 bytes for _vaddr which is unaligned with
+ * 16 bytes boundary and permute the result with the 1st 16 bytes.
+
+ *    |  y y y y y y y y y y y y y 0 1 2 | 3 4 5 6 7 8 9 a b c d e f z z z |
+ *    ^                                  ^                                 ^
+ * 0xbbbb10                          0xbbbb20                          0xbbb30
+ *                                 ^
+ *                                _vaddr
+ *
+ *
+ * _vmask is the mask generated by LVS
+ * _v1st_qw is the 1st aligned QW of current addr which is already loaded.
+ *   for example: 0xyyyyyyyyyyyyy012 for big endian
+ * _v2nd_qw is the 2nd aligned QW of cur _vaddr to be loaded.
+ *   for example: 0x3456789abcdefzzz for big endian
+ * The permute result is saved in _v_res.
+ *   for example: 0x0123456789abcdef for big endian.
+ */
+#define LD_VSR_CROSS16B(_vaddr,_vmask,_v1st_qw,_v2nd_qw,_v_res) \
+        lvx     _v2nd_qw,_vaddr,off16; \
+        VPERM(_v_res,_v1st_qw,_v2nd_qw,_vmask)
+
 /*
  * There are 2 categories for memcmp:
  * 1) src/dst has the same offset to the 8 bytes boundary. The handlers
@@ -132,7 +194,7 @@ _GLOBAL(memcmp)
 	bgt	cr6,.Llong
 
 .Lcmp_lt32bytes:
-	/* compare 1 ~ 32 bytes, at least r3 addr is 8 bytes aligned now */
+	/* compare 1 ~ 31 bytes, at least r3 addr is 8 bytes aligned now */
 	cmpdi   cr5,r5,7
 	srdi    r0,r5,3
 	ble	cr5,.Lcmp_rest_lt8bytes
@@ -173,6 +235,15 @@ _GLOBAL(memcmp)
 	blr
 
 .Llong:
+#ifdef CONFIG_ALTIVEC
+BEGIN_FTR_SECTION
+	/* Try to use vmx loop if length is equal or greater than 4K */
+	cmpldi  cr6,r5,VMX_THRESH
+	bge	cr6,.Lsameoffset_vmx_cmp
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
+
+.Llong_novmx_cmp:
+#endif
 	/* At least s1 addr is aligned with 8 bytes */
 	li	off8,8
 	li	off16,16
@@ -330,7 +401,97 @@ _GLOBAL(memcmp)
 	li	r3,-1
 	blr
 
+#ifdef CONFIG_ALTIVEC
+.Lsameoffset_vmx_cmp:
+	/* Enter with src/dst addrs has the same offset with 8 bytes
+	 * align boundary
+	 */
+	ENTER_VMX_OPS
+	beq     cr1,.Llong_novmx_cmp
+
+3:
+	/* need to check whether r4 has the same offset with r3
+	 * for 16 bytes boundary.
+	 */
+	xor	r0,r3,r4
+	andi.	r0,r0,0xf
+	bne	.Ldiffoffset_vmx_cmp_start
+
+	/* len is no less than 4KB. Need to align with 16 bytes further.
+	 */
+	andi.	rA,r3,8
+	LD	rA,0,r3
+	beq	4f
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	addi	r3,r3,8
+	addi	r4,r4,8
+	addi	r5,r5,-8
+
+	beq	cr0,4f
+	/* save and restore cr0 */
+	mcrf	7,0
+	EXIT_VMX_OPS
+	mcrf	0,7
+	b	.LcmpAB_lightweight
+
+4:
+	/* compare 32 bytes for each loop */
+	srdi	r0,r5,5
+	mtctr	r0
+	clrldi  r5,r5,59
+	li	off16,16
+
+.balign 16
+5:
+	lvx 	v0,0,r3
+	lvx 	v1,0,r4
+	VCMPEQUD_RC(v0,v0,v1)
+	bnl	cr6,7f
+	lvx 	v0,off16,r3
+	lvx 	v1,off16,r4
+	VCMPEQUD_RC(v0,v0,v1)
+	bnl	cr6,6f
+	addi	r3,r3,32
+	addi	r4,r4,32
+	bdnz	5b
+
+	EXIT_VMX_OPS
+	cmpdi	r5,0
+	beq	.Lzero
+	b	.Lcmp_lt32bytes
+
+6:
+	addi	r3,r3,16
+	addi	r4,r4,16
+
+7:
+	/* diff the last 16 bytes */
+	EXIT_VMX_OPS
+	LD	rA,0,r3
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	li	off8,8
+	bne	cr0,.LcmpAB_lightweight
+
+	LD	rA,off8,r3
+	LD	rB,off8,r4
+	cmpld	cr0,rA,rB
+	bne	cr0,.LcmpAB_lightweight
+	b	.Lzero
+#endif
+
 .Ldiffoffset_8bytes_make_align_start:
+#ifdef CONFIG_ALTIVEC
+BEGIN_FTR_SECTION
+	/* only do vmx ops when the size equal or greater than 4K bytes */
+	cmpdi	cr5,r5,VMX_THRESH
+	bge	cr5,.Ldiffoffset_vmx_cmp
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
+
+.Ldiffoffset_novmx_cmp:
+#endif
+
 	/* now try to align s1 with 8 bytes */
 	rlwinm  r6,r3,3,26,28
 	beq     .Ldiffoffset_align_s1_8bytes
@@ -356,6 +517,82 @@ _GLOBAL(memcmp)
 	/* now s1 is aligned with 8 bytes. */
 	cmpdi   cr5,r5,31
 	ble	cr5,.Lcmp_lt32bytes
+
+#ifdef CONFIG_ALTIVEC
+	b	.Llong_novmx_cmp
+#else
 	b	.Llong
+#endif
+
+#ifdef CONFIG_ALTIVEC
+.Ldiffoffset_vmx_cmp:
+	ENTER_VMX_OPS
+	beq     cr1,.Ldiffoffset_novmx_cmp
+
+.Ldiffoffset_vmx_cmp_start:
+	/* Firstly try to align r3 with 16 bytes */
+	andi.   r6,r3,0xf
+	li	off16,16
+	beq     .Ldiffoffset_vmx_s1_16bytes_align
 
+	LVS	v3,0,r3
+	LVS	v4,0,r4
+
+	lvx     v5,0,r3
+	lvx     v6,0,r4
+	LD_VSR_CROSS16B(r3,v3,v5,v7,v9)
+	LD_VSR_CROSS16B(r4,v4,v6,v8,v10)
+
+	VCMPEQUB_RC(v7,v9,v10)
+	bnl	cr6,.Ldiffoffset_vmx_diff_found
+
+	subfic  r6,r6,16
+	subf    r5,r6,r5
+	add     r3,r3,r6
+	add     r4,r4,r6
+
+.Ldiffoffset_vmx_s1_16bytes_align:
+	/* now s1 is aligned with 16 bytes */
+	lvx     v6,0,r4
+	LVS	v4,0,r4
+	srdi	r6,r5,5  /* loop for 32 bytes each */
+	clrldi  r5,r5,59
+	mtctr	r6
+
+.balign	16
+.Ldiffoffset_vmx_32bytesloop:
+	/* the first qw of r4 was saved in v6 */
+	lvx	v9,0,r3
+	LD_VSR_CROSS16B(r4,v4,v6,v8,v10)
+	VCMPEQUB_RC(v7,v9,v10)
+	vor	v6,v8,v8
+	bnl	cr6,.Ldiffoffset_vmx_diff_found
+
+	addi	r3,r3,16
+	addi	r4,r4,16
+
+	lvx	v9,0,r3
+	LD_VSR_CROSS16B(r4,v4,v6,v8,v10)
+	VCMPEQUB_RC(v7,v9,v10)
+	vor	v6,v8,v8
+	bnl	cr6,.Ldiffoffset_vmx_diff_found
+
+	addi	r3,r3,16
+	addi	r4,r4,16
+
+	bdnz	.Ldiffoffset_vmx_32bytesloop
+
+	EXIT_VMX_OPS
+
+	cmpdi	r5,0
+	beq	.Lzero
+	b	.Lcmp_lt32bytes
+
+.Ldiffoffset_vmx_diff_found:
+	EXIT_VMX_OPS
+	/* anyway, the diff will appear in next 16 bytes */
+	li	r5,16
+	b	.Lcmp_lt32bytes
+
+#endif
 EXPORT_SYMBOL(memcmp)
diff --git a/arch/powerpc/lib/memcpy_power7.S b/arch/powerpc/lib/memcpy_power7.S
index df7de9d..070cdf6 100644
--- a/arch/powerpc/lib/memcpy_power7.S
+++ b/arch/powerpc/lib/memcpy_power7.S
@@ -230,7 +230,7 @@ _GLOBAL(memcpy_power7)
 	std	r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
 	std	r0,16(r1)
 	stdu	r1,-STACKFRAMESIZE(r1)
-	bl	enter_vmx_copy
+	bl	enter_vmx_ops
 	cmpwi	cr1,r3,0
 	ld	r0,STACKFRAMESIZE+16(r1)
 	ld	r3,STK_REG(R31)(r1)
@@ -445,7 +445,7 @@ _GLOBAL(memcpy_power7)
 
 15:	addi	r1,r1,STACKFRAMESIZE
 	ld	r3,-STACKFRAMESIZE+STK_REG(R31)(r1)
-	b	exit_vmx_copy		/* tail call optimise */
+	b	exit_vmx_ops		/* tail call optimise */
 
 .Lvmx_unaligned_copy:
 	/* Get the destination 16B aligned */
@@ -649,5 +649,5 @@ _GLOBAL(memcpy_power7)
 
 15:	addi	r1,r1,STACKFRAMESIZE
 	ld	r3,-STACKFRAMESIZE+STK_REG(R31)(r1)
-	b	exit_vmx_copy		/* tail call optimise */
+	b	exit_vmx_ops		/* tail call optimise */
 #endif /* CONFIG_ALTIVEC */
diff --git a/arch/powerpc/lib/vmx-helper.c b/arch/powerpc/lib/vmx-helper.c
index bf925cd..9f34049 100644
--- a/arch/powerpc/lib/vmx-helper.c
+++ b/arch/powerpc/lib/vmx-helper.c
@@ -53,7 +53,7 @@ int exit_vmx_usercopy(void)
 	return 0;
 }
 
-int enter_vmx_copy(void)
+int enter_vmx_ops(void)
 {
 	if (in_interrupt())
 		return 0;
@@ -70,7 +70,7 @@ int enter_vmx_copy(void)
  * passed a pointer to the destination which we return as required by a
  * memcpy implementation.
  */
-void *exit_vmx_copy(void *dest)
+void *exit_vmx_ops(void *dest)
 {
 	disable_kernel_altivec();
 	preempt_enable();
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v7 4/5] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp()
  2018-05-30  9:20 [PATCH v7 0/5] powerpc/64: memcmp() optimization wei.guo.simon
                   ` (2 preceding siblings ...)
  2018-05-30  9:21 ` [PATCH v7 3/5] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
@ 2018-05-30  9:21 ` wei.guo.simon
  2018-05-30  9:21 ` [PATCH v7 5/5] powerpc:selftest update memcmp_64 selftest for VMX implementation wei.guo.simon
  2018-06-05  2:16 ` [PATCH v7 0/5] powerpc/64: memcmp() optimization Michael Ellerman
  5 siblings, 0 replies; 11+ messages in thread
From: wei.guo.simon @ 2018-05-30  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Paul Mackerras, Michael Ellerman, Naveen N.  Rao, Cyril Bur, Simon Guo

From: Simon Guo <wei.guo.simon@gmail.com>

This patch is based on the previous VMX patch on memcmp().

To optimize ppc64 memcmp() with VMX instructions, we need to consider the
VMX penalty that comes with them: if the kernel uses VMX instructions, it
needs to save/restore the current thread's VMX registers. There are
32 x 128-bit VMX registers in PPC, which means 32 x 16 = 512 bytes to load
and store.

The major concern regarding memcmp() performance in the kernel is KSM,
which uses memcmp() frequently to merge identical pages. So it makes sense
to take some measures/enhancements for KSM to see whether any improvement
can be made here. Cyril Bur indicated in the following mail that memcmp()
for KSM has a high likelihood of failing (mismatching) early, within the
first bytes:
	https://patchwork.ozlabs.org/patch/817322/#1773629
This patch is a follow-up on that.

Per some testing, KSM memcmp() fails early, within the first 32 bytes.
More specifically:
    - 76% of cases fail/mismatch before 16 bytes;
    - 83% of cases fail/mismatch before 32 bytes;
    - 84% of cases fail/mismatch before 64 bytes;
So 32 bytes looks like a better pre-check size than the other choices.

The early failure also holds for memcmp() in the non-KSM case. With a
non-typical call load, ~73% of cases fail before the first 32 bytes.

This patch adds a 32-byte pre-check before jumping into VMX operations, to
avoid the unnecessary VMX penalty. It is not limited to the KSM case.
Testing shows a ~20% improvement in memcmp() average execution time with
this patch.

Note that the 32-byte pre-check is only performed when the compare size is
long enough (>= 4K currently) to allow VMX operation.
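
A hedged C sketch of the pre-check idea follows; the real code is the
.Lsameoffset_prechk_32B_loop/.Ldiffoffset_prechk_32B_loop assembly added by
this patch, and the helper name below is invented.
------
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compare the first 32 bytes before paying the VMX save/restore cost.
 * Returns true and sets *ret if a mismatch was found in those bytes;
 * otherwise advances the pointers/length past them so the caller can
 * continue with the VMX loop.
 */
static bool precheck_32B(const unsigned char **s1, const unsigned char **s2,
			 size_t *n, int *ret)
{
	int i;

	for (i = 0; i < 4; i++) {		/* li r0,4; mtctr r0 */
		uint64_t a, b;

		memcpy(&a, *s1, 8);		/* LD rA,0,r3 */
		memcpy(&b, *s2, 8);		/* LD rB,0,r4 */
		*s1 += 8;
		*s2 += 8;
		*n -= 8;
		if (a != b) {			/* bne .LcmpAB_lightweight */
			*ret = memcmp(*s1 - 8, *s2 - 8, 8);
			return true;
		}
	}
	return false;				/* fall through to the VMX path */
}
------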

The detailed data and analysis are at:
https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
 arch/powerpc/lib/memcmp_64.S | 57 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 46 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index aef0e41..5eba497 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -404,8 +404,27 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 #ifdef CONFIG_ALTIVEC
 .Lsameoffset_vmx_cmp:
 	/* Enter with src/dst addrs has the same offset with 8 bytes
-	 * align boundary
+	 * align boundary.
+	 *
+	 * There is an optimization based on following fact: memcmp()
+	 * prones to fail early at the first 32 bytes.
+	 * Before applying VMX instructions which will lead to 32x128bits
+	 * VMX regs load/restore penalty, we compare the first 32 bytes
+	 * so that we can catch the ~80% fail cases.
 	 */
+
+	li	r0,4
+	mtctr	r0
+.Lsameoffset_prechk_32B_loop:
+	LD	rA,0,r3
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	addi	r3,r3,8
+	addi	r4,r4,8
+	bne     cr0,.LcmpAB_lightweight
+	addi	r5,r5,-8
+	bdnz	.Lsameoffset_prechk_32B_loop
+
 	ENTER_VMX_OPS
 	beq     cr1,.Llong_novmx_cmp
 
@@ -482,16 +501,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 #endif
 
 .Ldiffoffset_8bytes_make_align_start:
-#ifdef CONFIG_ALTIVEC
-BEGIN_FTR_SECTION
-	/* only do vmx ops when the size equal or greater than 4K bytes */
-	cmpdi	cr5,r5,VMX_THRESH
-	bge	cr5,.Ldiffoffset_vmx_cmp
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
-
-.Ldiffoffset_novmx_cmp:
-#endif
-
 	/* now try to align s1 with 8 bytes */
 	rlwinm  r6,r3,3,26,28
 	beq     .Ldiffoffset_align_s1_8bytes
@@ -515,6 +524,17 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 
 .Ldiffoffset_align_s1_8bytes:
 	/* now s1 is aligned with 8 bytes. */
+#ifdef CONFIG_ALTIVEC
+BEGIN_FTR_SECTION
+	/* only do vmx ops when the size equal or greater than 4K bytes */
+	cmpdi	cr5,r5,VMX_THRESH
+	bge	cr5,.Ldiffoffset_vmx_cmp
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
+
+.Ldiffoffset_novmx_cmp:
+#endif
+
+
 	cmpdi   cr5,r5,31
 	ble	cr5,.Lcmp_lt32bytes
 
@@ -526,6 +546,21 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 
 #ifdef CONFIG_ALTIVEC
 .Ldiffoffset_vmx_cmp:
+	/* perform a 32 bytes pre-checking before
+	 * enable VMX operations.
+	 */
+	li	r0,4
+	mtctr	r0
+.Ldiffoffset_prechk_32B_loop:
+	LD	rA,0,r3
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	addi	r3,r3,8
+	addi	r4,r4,8
+	bne     cr0,.LcmpAB_lightweight
+	addi	r5,r5,-8
+	bdnz	.Ldiffoffset_prechk_32B_loop
+
 	ENTER_VMX_OPS
 	beq     cr1,.Ldiffoffset_novmx_cmp
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v7 5/5] powerpc:selftest update memcmp_64 selftest for VMX implementation
  2018-05-30  9:20 [PATCH v7 0/5] powerpc/64: memcmp() optimization wei.guo.simon
                   ` (3 preceding siblings ...)
  2018-05-30  9:21 ` [PATCH v7 4/5] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp() wei.guo.simon
@ 2018-05-30  9:21 ` wei.guo.simon
  2018-06-05  2:16 ` [PATCH v7 0/5] powerpc/64: memcmp() optimization Michael Ellerman
  5 siblings, 0 replies; 11+ messages in thread
From: wei.guo.simon @ 2018-05-30  9:21 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Paul Mackerras, Michael Ellerman, Naveen N.  Rao, Cyril Bur, Simon Guo

From: Simon Guo <wei.guo.simon@gmail.com>

This patch reworks the memcmp_64 selftest so that it covers more test
cases.

It adds test cases for:
- memcmp() over 4K bytes in size.
- s1/s2 with different/random offsets against the 16-byte boundary.
- enter/exit_vmx_ops pairing.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
 .../selftests/powerpc/copyloops/asm/ppc_asm.h      |  4 +-
 .../selftests/powerpc/stringloops/asm/ppc-opcode.h | 39 +++++++++
 .../selftests/powerpc/stringloops/asm/ppc_asm.h    | 24 ++++++
 .../testing/selftests/powerpc/stringloops/memcmp.c | 98 +++++++++++++++++-----
 4 files changed, 141 insertions(+), 24 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h

diff --git a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
index 5ffe04d..dfce161 100644
--- a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
@@ -36,11 +36,11 @@
 	li	r3,0
 	blr
 
-FUNC_START(enter_vmx_copy)
+FUNC_START(enter_vmx_ops)
 	li	r3,1
 	blr
 
-FUNC_START(exit_vmx_copy)
+FUNC_START(exit_vmx_ops)
 	blr
 
 FUNC_START(memcpy_power7)
diff --git a/tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h b/tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h
new file mode 100644
index 0000000..9de413c
--- /dev/null
+++ b/tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright 2009 Freescale Semiconductor, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * provides masks and opcode images for use by code generation, emulation
+ * and for instructions that older assemblers might not know about
+ */
+#ifndef _ASM_POWERPC_PPC_OPCODE_H
+#define _ASM_POWERPC_PPC_OPCODE_H
+
+
+#  define stringify_in_c(...)	__VA_ARGS__
+#  define ASM_CONST(x)		x
+
+
+#define PPC_INST_VCMPEQUD_RC		0x100000c7
+#define PPC_INST_VCMPEQUB_RC		0x10000006
+
+#define __PPC_RC21     (0x1 << 10)
+
+/* macros to insert fields into opcodes */
+#define ___PPC_RA(a)	(((a) & 0x1f) << 16)
+#define ___PPC_RB(b)	(((b) & 0x1f) << 11)
+#define ___PPC_RS(s)	(((s) & 0x1f) << 21)
+#define ___PPC_RT(t)	___PPC_RS(t)
+
+#define VCMPEQUD_RC(vrt, vra, vrb)	stringify_in_c(.long PPC_INST_VCMPEQUD_RC | \
+			      ___PPC_RT(vrt) | ___PPC_RA(vra) | \
+			      ___PPC_RB(vrb) | __PPC_RC21)
+
+#define VCMPEQUB_RC(vrt, vra, vrb)	stringify_in_c(.long PPC_INST_VCMPEQUB_RC | \
+			      ___PPC_RT(vrt) | ___PPC_RA(vra) | \
+			      ___PPC_RB(vrb) | __PPC_RC21)
+
+#endif /* _ASM_POWERPC_PPC_OPCODE_H */
diff --git a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
index 136242e..33912bb 100644
--- a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
@@ -1,4 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _PPC_ASM_H
+#define __PPC_ASM_H
 #include <ppc-asm.h>
 
 #ifndef r1
@@ -6,3 +8,25 @@
 #endif
 
 #define _GLOBAL(A) FUNC_START(test_ ## A)
+
+#define CONFIG_ALTIVEC
+
+#define R14 r14
+#define R15 r15
+#define R16 r16
+#define R17 r17
+#define R18 r18
+#define R19 r19
+#define R20 r20
+#define R21 r21
+#define R22 r22
+#define R29 r29
+#define R30 r30
+#define R31 r31
+
+#define STACKFRAMESIZE	256
+#define STK_REG(i)	(112 + ((i)-14)*8)
+
+#define BEGIN_FTR_SECTION
+#define END_FTR_SECTION_IFSET(val)
+#endif
diff --git a/tools/testing/selftests/powerpc/stringloops/memcmp.c b/tools/testing/selftests/powerpc/stringloops/memcmp.c
index 8250db2..b5cf717 100644
--- a/tools/testing/selftests/powerpc/stringloops/memcmp.c
+++ b/tools/testing/selftests/powerpc/stringloops/memcmp.c
@@ -2,20 +2,40 @@
 #include <malloc.h>
 #include <stdlib.h>
 #include <string.h>
+#include <time.h>
 #include "utils.h"
 
 #define SIZE 256
 #define ITERATIONS 10000
 
+#define LARGE_SIZE (5 * 1024)
+#define LARGE_ITERATIONS 1000
+#define LARGE_MAX_OFFSET 32
+#define LARGE_SIZE_START 4096
+
+#define MAX_OFFSET_DIFF_S1_S2 48
+
+int vmx_count;
+int enter_vmx_ops(void)
+{
+	vmx_count++;
+	return 1;
+}
+
+void exit_vmx_ops(void)
+{
+	vmx_count--;
+}
 int test_memcmp(const void *s1, const void *s2, size_t n);
 
 /* test all offsets and lengths */
-static void test_one(char *s1, char *s2)
+static void test_one(char *s1, char *s2, unsigned long max_offset,
+		unsigned long size_start, unsigned long max_size)
 {
 	unsigned long offset, size;
 
-	for (offset = 0; offset < SIZE; offset++) {
-		for (size = 0; size < (SIZE-offset); size++) {
+	for (offset = 0; offset < max_offset; offset++) {
+		for (size = size_start; size < (max_size - offset); size++) {
 			int x, y;
 			unsigned long i;
 
@@ -35,70 +55,104 @@ static void test_one(char *s1, char *s2)
 				printf("\n");
 				abort();
 			}
+
+			if (vmx_count != 0) {
+				printf("vmx enter/exit not paired.(offset:%ld size:%ld s1:%p s2:%p vc:%d\n",
+					offset, size, s1, s2, vmx_count);
+				printf("\n");
+				abort();
+			}
 		}
 	}
 }
 
-static int testcase(void)
+static int testcase(bool islarge)
 {
 	char *s1;
 	char *s2;
 	unsigned long i;
 
-	s1 = memalign(128, SIZE);
+	unsigned long comp_size = (islarge ? LARGE_SIZE : SIZE);
+	unsigned long alloc_size = comp_size + MAX_OFFSET_DIFF_S1_S2;
+	int iterations = islarge ? LARGE_ITERATIONS : ITERATIONS;
+
+	s1 = memalign(128, alloc_size);
 	if (!s1) {
 		perror("memalign");
 		exit(1);
 	}
 
-	s2 = memalign(128, SIZE);
+	s2 = memalign(128, alloc_size);
 	if (!s2) {
 		perror("memalign");
 		exit(1);
 	}
 
-	srandom(1);
+	srandom(time(0));
 
-	for (i = 0; i < ITERATIONS; i++) {
+	for (i = 0; i < iterations; i++) {
 		unsigned long j;
 		unsigned long change;
+		char *rand_s1 = s1;
+		char *rand_s2 = s2;
 
-		for (j = 0; j < SIZE; j++)
+		for (j = 0; j < alloc_size; j++)
 			s1[j] = random();
 
-		memcpy(s2, s1, SIZE);
+		rand_s1 += random() % MAX_OFFSET_DIFF_S1_S2;
+		rand_s2 += random() % MAX_OFFSET_DIFF_S1_S2;
+		memcpy(rand_s2, rand_s1, comp_size);
 
 		/* change one byte */
-		change = random() % SIZE;
-		s2[change] = random() & 0xff;
-
-		test_one(s1, s2);
+		change = random() % comp_size;
+		rand_s2[change] = random() & 0xff;
+
+		if (islarge)
+			test_one(rand_s1, rand_s2, LARGE_MAX_OFFSET,
+					LARGE_SIZE_START, comp_size);
+		else
+			test_one(rand_s1, rand_s2, SIZE, 0, comp_size);
 	}
 
-	srandom(1);
+	srandom(time(0));
 
-	for (i = 0; i < ITERATIONS; i++) {
+	for (i = 0; i < iterations; i++) {
 		unsigned long j;
 		unsigned long change;
+		char *rand_s1 = s1;
+		char *rand_s2 = s2;
 
-		for (j = 0; j < SIZE; j++)
+		for (j = 0; j < alloc_size; j++)
 			s1[j] = random();
 
-		memcpy(s2, s1, SIZE);
+		rand_s1 += random() % MAX_OFFSET_DIFF_S1_S2;
+		rand_s2 += random() % MAX_OFFSET_DIFF_S1_S2;
+		memcpy(rand_s2, rand_s1, comp_size);
 
 		/* change multiple bytes, 1/8 of total */
-		for (j = 0; j < SIZE / 8; j++) {
-			change = random() % SIZE;
+		for (j = 0; j < comp_size / 8; j++) {
+			change = random() % comp_size;
 			s2[change] = random() & 0xff;
 		}
 
-		test_one(s1, s2);
+		if (islarge)
+			test_one(rand_s1, rand_s2, LARGE_MAX_OFFSET,
+					LARGE_SIZE_START, comp_size);
+		else
+			test_one(rand_s1, rand_s2, SIZE, 0, comp_size);
 	}
 
 	return 0;
 }
 
+static int testcases(void)
+{
+	testcase(0);
+	testcase(1);
+	return 0;
+}
+
 int main(void)
 {
-	return test_harness(testcase, "memcmp");
+	return test_harness(testcases, "memcmp");
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization
  2018-06-05  2:16 ` [PATCH v7 0/5] powerpc/64: memcmp() optimization Michael Ellerman
@ 2018-06-04 10:27   ` Simon Guo
  2018-06-06  6:21   ` Simon Guo
  1 sibling, 0 replies; 11+ messages in thread
From: Simon Guo @ 2018-06-04 10:27 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev, Paul Mackerras, Naveen N.  Rao, Cyril Bur

Hi Michael,
On Tue, Jun 05, 2018 at 12:16:22PM +1000, Michael Ellerman wrote:
> Hi Simon,
> 
> wei.guo.simon@gmail.com writes:
> > From: Simon Guo <wei.guo.simon@gmail.com>
> >
> > There is some room to optimize memcmp() in powerpc 64 bits version for
> > following 2 cases:
> > (1) Even src/dst addresses are not aligned with 8 bytes at the beginning,
> > memcmp() can align them and go with .Llong comparision mode without
> > fallback to .Lshort comparision mode do compare buffer byte by byte.
> > (2) VMX instructions can be used to speed up for large size comparision,
> > currently the threshold is set for 4K bytes. Notes the VMX instructions
> > will lead to VMX regs save/load penalty. This patch set includes a
> > patch to add a 32 bytes pre-checking to minimize the penalty.
> >
> > It did the similar with glibc commit dec4a7105e (powerpc: Improve memcmp 
> > performance for POWER8). Thanks Cyril Bur's information.
> > This patch set also updates memcmp selftest case to make it compiled and
> > incorporate large size comparison case.
> 
> I'm seeing a few crashes with this applied, I haven't had time to look
> into what is happening yet, sorry.
Sorry I didn't catch this in my testing. I will check the root cause
and update later.

Thanks,
- Simon

> 
> [ 2471.300595] kselftest: Running tests in user
> [ 2471.302785] calling  test_user_copy_init+0x0/0xd14 [test_user_copy] @ 44883
> [ 2471.302892] Unable to handle kernel paging request for data at address 0xc008000018553005
> [ 2471.303014] Faulting instruction address: 0xc00000000001f29c
> [ 2471.303119] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 2471.303193] LE SMP NR_CPUS=2048 NUMA PowerNV


> [ 2471.303256] Modules linked in: test_user_copy(+) vxlan ip6_udp_tunnel udp_tunnel 8021q bridge stp llc dummy test_printf test_firmware vmx_crypto crct10dif_vpmsum crct10dif_common crc32c_vpmsum veth [last unloaded: test_static_key_base]
> [ 2471.303532] CPU: 4 PID: 44883 Comm: modprobe Tainted: G        W         4.17.0-rc3-gcc7x-g7204012 #1
> [ 2471.303644] NIP:  c00000000001f29c LR: c00000000001f6e4 CTR: 0000000000000000
> [ 2471.303754] REGS: c000001fddc2b560 TRAP: 0300   Tainted: G        W          (4.17.0-rc3-gcc7x-g7204012)
> [ 2471.303873] MSR:  9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 24222844  XER: 00000000
> [ 2471.303996] CFAR: c00000000001f6e0 DAR: c008000018553005 DSISR: 40000000 IRQMASK: 0 
> [ 2471.303996] GPR00: c00000000001f6e4 c000001fddc2b7e0 c008000018529900 0000000002000000 
> [ 2471.303996] GPR04: c000001fe4b90020 000000000000ffe0 0000000000000000 03fffffe01b48000 
> [ 2471.303996] GPR08: 0000000080000000 c008000018553005 c000001fddc28000 c008000018520df0 
> [ 2471.303996] GPR12: c00000000009c430 c000001fffffbc00 0000000020000000 0000000000000000 
> [ 2471.303996] GPR16: c000001fddc2bc20 0000000000000030 c0000000001f7ba0 0000000000000001 
> [ 2471.303996] GPR20: 0000000000000000 c000000000c772b0 c0000000010b4018 0000000000000000 
> [ 2471.303996] GPR24: 0000000000000000 c008000018521c98 0000000000000000 c000001fe4b90000 
> [ 2471.303996] GPR28: fffffffffffffff4 0000000002000000 9000000002009033 9000000002009033 
> [ 2471.304930] NIP [c00000000001f29c] msr_check_and_set+0x3c/0xc0
> [ 2471.305008] LR [c00000000001f6e4] enable_kernel_altivec+0x44/0x100
> [ 2471.305084] Call Trace:
> [ 2471.305122] [c000001fddc2b7e0] [c00000000009baa8] __copy_tofrom_user_base+0x9c/0x574 (unreliable)
> [ 2471.305240] [c000001fddc2b860] [c00000000001f6e4] enable_kernel_altivec+0x44/0x100
> [ 2471.305336] [c000001fddc2b890] [c00000000009ce40] enter_vmx_ops+0x50/0x70
> [ 2471.305418] [c000001fddc2b8b0] [c00000000009c768] memcmp+0x338/0x680
> [ 2471.305501] [c000001fddc2b9b0] [c008000018520190] test_user_copy_init+0x188/0xd14 [test_user_copy]
> [ 2471.305617] [c000001fddc2ba60] [c00000000000de20] do_one_initcall+0x90/0x560
> [ 2471.305710] [c000001fddc2bb30] [c000000000200630] do_init_module+0x90/0x260
> [ 2471.305795] [c000001fddc2bbc0] [c0000000001fec88] load_module+0x1a28/0x1ce0
> [ 2471.305875] [c000001fddc2bd70] [c0000000001ff1e8] sys_finit_module+0xc8/0x110
> [ 2471.305983] [c000001fddc2be30] [c00000000000b528] system_call+0x58/0x6c
> [ 2471.306066] Instruction dump:
> [ 2471.306112] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff81 7c7d1b78 60000000 60000000 
> [ 2471.306216] 7fe000a6 3d220003 39299705 7ffeeb78 <89290000> 2f890000 419e0044 60000000 
> [ 2471.306326] ---[ end trace daf8d409e65b9841 ]---
> 
> And:
> 
> [   19.096709] test_bpf: test_skb_segment: success in skb_segment!
> [   19.096799] initcall test_bpf_init+0x0/0xae0 [test_bpf] returned 0 after 591217 usecs
> [   19.115869] calling  test_user_copy_init+0x0/0xd14 [test_user_copy] @ 3159
> [   19.116165] Unable to handle kernel paging request for data at address 0xd000000003852805
> [   19.116352] Faulting instruction address: 0xc00000000001f44c
> [   19.116483] Oops: Kernel access of bad area, sig: 11 [#1]
> [   19.116583] LE SMP NR_CPUS=2048 NUMA pSeries
> [   19.116684] Modules linked in: test_user_copy(+) lzo_compress crc_itu_t zstd_compress zstd_decompress test_bpf test_static_keys test_static_key_base xxhash test_firmware af_key cls_bpf act_bpf bridge nf_nat_irc xt_NFLOG nfnetlink_log xt_policy nf_conntrack_netlink nfnetlink xt_nat nf_conntrack_irc xt_mark xt_tcpudp nf_nat_sip xt_TCPMSS xt_LOG nf_nat_ftp nf_conntrack_ftp xt_conntrack nf_conntrack_sip xt_addrtype xt_state 8021q iptable_filter ipt_MASQUERADE nf_log_ipv4 iptable_mangle nf_nat_masquerade_ipv4 ipt_REJECT nf_reject_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables nf_log_arp nf_log_common ah4 ipcomp xfrm4_tunnel esp4 rpcrdma stp p8022 psnap llc xfrm_ipcomp xfrm_user xfrm_algo platform_lcd lcd ocxl virtio_balloon virtio_crypto crypto_engine
> [   19.118040]  vmx_crypto nbd zram zsmalloc virtio_blk st be2iscsi cxgb3i cxgb4i libcxgbi bnx2i ibmvfc sym53c8xx scsi_transport_spi scsi_dh_alua scsi_dh_rdac qla4xxx mpt3sas scsi_transport_sas cxlflash cxl libiscsi_tcp lpfc crc_t10dif crct10dif_generic crct10dif_common qla2xxx iscsi_boot_sysfs raid_class parport_pc parport powernv_op_panel powernv_rng pseries_rng rng_core virtio_console pcspkr input_leds evdev dm_round_robin dm_mirror dm_region_hash dm_log raid10 dm_service_time multipath dm_queue_length dm_multipath dm_thin_pool faulty dm_persistent_data dm_zero dm_crypt dm_bio_prison dm_snapshot dm_bufio raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq rpadlpar_io rpaphp jsm icom hvcs ib_ipoib ib_srp ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_ucm ib_ucm ib_uverbs
> [   19.119505]  rdma_cm iw_cm ib_cm mlx4_ib iw_cxgb3 iw_cxgb4 ib_mthca ib_core leds_powernv led_class vhost_net vhost macvtap macvlan dummy bsd_comp ppp_async crc_ccitt pppoe ppp_synctty pppox ppp_deflate ppp_generic 3c59x s2io bnx2 cnic uio bnx2x libcrc32c i40e ixgbe ixgb cxgb3 libcxgb cxgb cxgb4 pcnet32 netxen_nic qlge be2net acenic mlx4_en mlx4_core myri10ge bonding slhc tap mdio veth vxlan udp_tunnel tun usb_storage usbmon oprofile sha1_powerpc md5_ppc crc32c_vpmsum kvm hvcserver
> [   19.120358] CPU: 4 PID: 3159 Comm: modprobe Not tainted 4.17.0-rc3-gcc7x-g7204012 #1
> [   19.120508] NIP:  c00000000001f44c LR: c00000000001f894 CTR: 0000000000000000
> [   19.120666] REGS: c0000000f8d9f570 TRAP: 0300   Not tainted  (4.17.0-rc3-gcc7x-g7204012)
> [   19.120817] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24222844  XER: 00000000
> [   19.120984] CFAR: c00000000000c03c DAR: d000000003852805 DSISR: 40000000 IRQMASK: 0 
>                GPR00: c00000000001f894 c0000000f8d9f7f0 d000000003829900 0000000002000000 
>                GPR04: c0000000f9a30048 000000000000ffe0 0000000000000000 03fffffff065dffd 
>                GPR08: 0000000080000000 d000000003852805 c0000000f8d9c000 d000000003820df0 
>                GPR12: c00000000009ebb0 c00000003fffb300 c0000000f8d9fd90 d000000003840000 
>                GPR16: d000000003840000 0000000000000000 c0000000011d6900 d000000003821ad0 
>                GPR20: c000000000bd7860 0000000000000000 c000000000ff9060 00000000014000c0 
>                GPR24: 0000000000000000 0000000000000000 0000000000000100 c0000000f9a30028 
>                GPR28: fffffffffffffff4 0000000002000000 8000000002009033 8000000000009033 
> [   19.122454] NIP [c00000000001f44c] msr_check_and_set+0x3c/0xc0
> [   19.122580] LR [c00000000001f894] enable_kernel_altivec+0x44/0x100
> [   19.122707] Call Trace:
> [   19.122789] [c0000000f8d9f7f0] [c00000000009e228] __copy_tofrom_user_base+0x9c/0x574 (unreliable)
> [   19.122962] [c0000000f8d9f870] [c00000000001f894] enable_kernel_altivec+0x44/0x100
> [   19.123344] [c0000000f8d9f8a0] [c00000000009f740] enter_vmx_ops+0x50/0x70
> [   19.123583] [c0000000f8d9f8c0] [c00000000009eee8] memcmp+0x338/0x680
> [   19.123728] [c0000000f8d9f9c0] [d000000003820190] test_user_copy_init+0x188/0xd14 [test_user_copy]
> [   19.123909] [c0000000f8d9fa70] [c00000000000e37c] do_one_initcall+0x5c/0x2d0
> [   19.124094] [c0000000f8d9fb30] [c00000000020066c] do_init_module+0x90/0x264
> [   19.124234] [c0000000f8d9fbc0] [c0000000001ff084] load_module+0x2f64/0x3600
> [   19.124371] [c0000000f8d9fd70] [c0000000001ff9c8] sys_finit_module+0xc8/0x110
> [   19.124530] [c0000000f8d9fe30] [c00000000000b868] system_call+0x58/0x6c
> [   19.124648] Instruction dump:
> [   19.124721] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff81 7c7d1b78 60000000 60000000 
> [   19.124869] 7fe000a6 3d220003 39298f05 7ffeeb78 <89290000> 2f890000 419e0044 60000000 
> [   19.125034] ---[ end trace 7c08acedd4b4e6aa ]---
> 
> 
> cheers

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization
  2018-05-30  9:20 [PATCH v7 0/5] powerpc/64: memcmp() optimization wei.guo.simon
                   ` (4 preceding siblings ...)
  2018-05-30  9:21 ` [PATCH v7 5/5] powerpc:selftest update memcmp_64 selftest for VMX implementation wei.guo.simon
@ 2018-06-05  2:16 ` Michael Ellerman
  2018-06-04 10:27   ` Simon Guo
  2018-06-06  6:21   ` Simon Guo
  5 siblings, 2 replies; 11+ messages in thread
From: Michael Ellerman @ 2018-06-05  2:16 UTC (permalink / raw)
  To: wei.guo.simon, linuxppc-dev
  Cc: Paul Mackerras, Naveen N.  Rao, Cyril Bur, Simon Guo

Hi Simon,

wei.guo.simon@gmail.com writes:
> From: Simon Guo <wei.guo.simon@gmail.com>
>
> There is some room to optimize memcmp() in powerpc 64 bits version for
> following 2 cases:
> (1) Even src/dst addresses are not aligned with 8 bytes at the beginning,
> memcmp() can align them and go with .Llong comparision mode without
> fallback to .Lshort comparision mode do compare buffer byte by byte.
> (2) VMX instructions can be used to speed up for large size comparision,
> currently the threshold is set for 4K bytes. Notes the VMX instructions
> will lead to VMX regs save/load penalty. This patch set includes a
> patch to add a 32 bytes pre-checking to minimize the penalty.
>
> It did the similar with glibc commit dec4a7105e (powerpc: Improve memcmp 
> performance for POWER8). Thanks Cyril Bur's information.
> This patch set also updates memcmp selftest case to make it compiled and
> incorporate large size comparison case.

I'm seeing a few crashes with this applied, I haven't had time to look
into what is happening yet, sorry.

[ 2471.300595] kselftest: Running tests in user
[ 2471.302785] calling  test_user_copy_init+0x0/0xd14 [test_user_copy] @ 44883
[ 2471.302892] Unable to handle kernel paging request for data at address 0xc008000018553005
[ 2471.303014] Faulting instruction address: 0xc00000000001f29c
[ 2471.303119] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2471.303193] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 2471.303256] Modules linked in: test_user_copy(+) vxlan ip6_udp_tunnel udp_tunnel 8021q bridge stp llc dummy test_printf test_firmware vmx_crypto crct10dif_vpmsum crct10dif_common crc32c_vpmsum veth [last unloaded: test_static_key_base]
[ 2471.303532] CPU: 4 PID: 44883 Comm: modprobe Tainted: G        W         4.17.0-rc3-gcc7x-g7204012 #1
[ 2471.303644] NIP:  c00000000001f29c LR: c00000000001f6e4 CTR: 0000000000000000
[ 2471.303754] REGS: c000001fddc2b560 TRAP: 0300   Tainted: G        W          (4.17.0-rc3-gcc7x-g7204012)
[ 2471.303873] MSR:  9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 24222844  XER: 00000000
[ 2471.303996] CFAR: c00000000001f6e0 DAR: c008000018553005 DSISR: 40000000 IRQMASK: 0 
[ 2471.303996] GPR00: c00000000001f6e4 c000001fddc2b7e0 c008000018529900 0000000002000000 
[ 2471.303996] GPR04: c000001fe4b90020 000000000000ffe0 0000000000000000 03fffffe01b48000 
[ 2471.303996] GPR08: 0000000080000000 c008000018553005 c000001fddc28000 c008000018520df0 
[ 2471.303996] GPR12: c00000000009c430 c000001fffffbc00 0000000020000000 0000000000000000 
[ 2471.303996] GPR16: c000001fddc2bc20 0000000000000030 c0000000001f7ba0 0000000000000001 
[ 2471.303996] GPR20: 0000000000000000 c000000000c772b0 c0000000010b4018 0000000000000000 
[ 2471.303996] GPR24: 0000000000000000 c008000018521c98 0000000000000000 c000001fe4b90000 
[ 2471.303996] GPR28: fffffffffffffff4 0000000002000000 9000000002009033 9000000002009033 
[ 2471.304930] NIP [c00000000001f29c] msr_check_and_set+0x3c/0xc0
[ 2471.305008] LR [c00000000001f6e4] enable_kernel_altivec+0x44/0x100
[ 2471.305084] Call Trace:
[ 2471.305122] [c000001fddc2b7e0] [c00000000009baa8] __copy_tofrom_user_base+0x9c/0x574 (unreliable)
[ 2471.305240] [c000001fddc2b860] [c00000000001f6e4] enable_kernel_altivec+0x44/0x100
[ 2471.305336] [c000001fddc2b890] [c00000000009ce40] enter_vmx_ops+0x50/0x70
[ 2471.305418] [c000001fddc2b8b0] [c00000000009c768] memcmp+0x338/0x680
[ 2471.305501] [c000001fddc2b9b0] [c008000018520190] test_user_copy_init+0x188/0xd14 [test_user_copy]
[ 2471.305617] [c000001fddc2ba60] [c00000000000de20] do_one_initcall+0x90/0x560
[ 2471.305710] [c000001fddc2bb30] [c000000000200630] do_init_module+0x90/0x260
[ 2471.305795] [c000001fddc2bbc0] [c0000000001fec88] load_module+0x1a28/0x1ce0
[ 2471.305875] [c000001fddc2bd70] [c0000000001ff1e8] sys_finit_module+0xc8/0x110
[ 2471.305983] [c000001fddc2be30] [c00000000000b528] system_call+0x58/0x6c
[ 2471.306066] Instruction dump:
[ 2471.306112] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff81 7c7d1b78 60000000 60000000 
[ 2471.306216] 7fe000a6 3d220003 39299705 7ffeeb78 <89290000> 2f890000 419e0044 60000000 
[ 2471.306326] ---[ end trace daf8d409e65b9841 ]---

And:

[   19.096709] test_bpf: test_skb_segment: success in skb_segment!
[   19.096799] initcall test_bpf_init+0x0/0xae0 [test_bpf] returned 0 after 591217 usecs
[   19.115869] calling  test_user_copy_init+0x0/0xd14 [test_user_copy] @ 3159
[   19.116165] Unable to handle kernel paging request for data at address 0xd000000003852805
[   19.116352] Faulting instruction address: 0xc00000000001f44c
[   19.116483] Oops: Kernel access of bad area, sig: 11 [#1]
[   19.116583] LE SMP NR_CPUS=2048 NUMA pSeries
[   19.116684] Modules linked in: test_user_copy(+) lzo_compress crc_itu_t zstd_compress zstd_decompress test_bpf test_static_keys test_static_key_base xxhash test_firmware af_key cls_bpf act_bpf bridge nf_nat_irc xt_NFLOG nfnetlink_log xt_policy nf_conntrack_netlink nfnetlink xt_nat nf_conntrack_irc xt_mark xt_tcpudp nf_nat_sip xt_TCPMSS xt_LOG nf_nat_ftp nf_conntrack_ftp xt_conntrack nf_conntrack_sip xt_addrtype xt_state 8021q iptable_filter ipt_MASQUERADE nf_log_ipv4 iptable_mangle nf_nat_masquerade_ipv4 ipt_REJECT nf_reject_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables nf_log_arp nf_log_common ah4 ipcomp xfrm4_tunnel esp4 rpcrdma stp p8022 psnap llc xfrm_ipcomp xfrm_user xfrm_algo platform_lcd lcd ocxl virtio_balloon virtio_crypto crypto_engine
[   19.118040]  vmx_crypto nbd zram zsmalloc virtio_blk st be2iscsi cxgb3i cxgb4i libcxgbi bnx2i ibmvfc sym53c8xx scsi_transport_spi scsi_dh_alua scsi_dh_rdac qla4xxx mpt3sas scsi_transport_sas cxlflash cxl libiscsi_tcp lpfc crc_t10dif crct10dif_generic crct10dif_common qla2xxx iscsi_boot_sysfs raid_class parport_pc parport powernv_op_panel powernv_rng pseries_rng rng_core virtio_console pcspkr input_leds evdev dm_round_robin dm_mirror dm_region_hash dm_log raid10 dm_service_time multipath dm_queue_length dm_multipath dm_thin_pool faulty dm_persistent_data dm_zero dm_crypt dm_bio_prison dm_snapshot dm_bufio raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq rpadlpar_io rpaphp jsm icom hvcs ib_ipoib ib_srp ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_ucm ib_ucm ib_uverbs
[   19.119505]  rdma_cm iw_cm ib_cm mlx4_ib iw_cxgb3 iw_cxgb4 ib_mthca ib_core leds_powernv led_class vhost_net vhost macvtap macvlan dummy bsd_comp ppp_async crc_ccitt pppoe ppp_synctty pppox ppp_deflate ppp_generic 3c59x s2io bnx2 cnic uio bnx2x libcrc32c i40e ixgbe ixgb cxgb3 libcxgb cxgb cxgb4 pcnet32 netxen_nic qlge be2net acenic mlx4_en mlx4_core myri10ge bonding slhc tap mdio veth vxlan udp_tunnel tun usb_storage usbmon oprofile sha1_powerpc md5_ppc crc32c_vpmsum kvm hvcserver
[   19.120358] CPU: 4 PID: 3159 Comm: modprobe Not tainted 4.17.0-rc3-gcc7x-g7204012 #1
[   19.120508] NIP:  c00000000001f44c LR: c00000000001f894 CTR: 0000000000000000
[   19.120666] REGS: c0000000f8d9f570 TRAP: 0300   Not tainted  (4.17.0-rc3-gcc7x-g7204012)
[   19.120817] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24222844  XER: 00000000
[   19.120984] CFAR: c00000000000c03c DAR: d000000003852805 DSISR: 40000000 IRQMASK: 0 
               GPR00: c00000000001f894 c0000000f8d9f7f0 d000000003829900 0000000002000000 
               GPR04: c0000000f9a30048 000000000000ffe0 0000000000000000 03fffffff065dffd 
               GPR08: 0000000080000000 d000000003852805 c0000000f8d9c000 d000000003820df0 
               GPR12: c00000000009ebb0 c00000003fffb300 c0000000f8d9fd90 d000000003840000 
               GPR16: d000000003840000 0000000000000000 c0000000011d6900 d000000003821ad0 
               GPR20: c000000000bd7860 0000000000000000 c000000000ff9060 00000000014000c0 
               GPR24: 0000000000000000 0000000000000000 0000000000000100 c0000000f9a30028 
               GPR28: fffffffffffffff4 0000000002000000 8000000002009033 8000000000009033 
[   19.122454] NIP [c00000000001f44c] msr_check_and_set+0x3c/0xc0
[   19.122580] LR [c00000000001f894] enable_kernel_altivec+0x44/0x100
[   19.122707] Call Trace:
[   19.122789] [c0000000f8d9f7f0] [c00000000009e228] __copy_tofrom_user_base+0x9c/0x574 (unreliable)
[   19.122962] [c0000000f8d9f870] [c00000000001f894] enable_kernel_altivec+0x44/0x100
[   19.123344] [c0000000f8d9f8a0] [c00000000009f740] enter_vmx_ops+0x50/0x70
[   19.123583] [c0000000f8d9f8c0] [c00000000009eee8] memcmp+0x338/0x680
[   19.123728] [c0000000f8d9f9c0] [d000000003820190] test_user_copy_init+0x188/0xd14 [test_user_copy]
[   19.123909] [c0000000f8d9fa70] [c00000000000e37c] do_one_initcall+0x5c/0x2d0
[   19.124094] [c0000000f8d9fb30] [c00000000020066c] do_init_module+0x90/0x264
[   19.124234] [c0000000f8d9fbc0] [c0000000001ff084] load_module+0x2f64/0x3600
[   19.124371] [c0000000f8d9fd70] [c0000000001ff9c8] sys_finit_module+0xc8/0x110
[   19.124530] [c0000000f8d9fe30] [c00000000000b868] system_call+0x58/0x6c
[   19.124648] Instruction dump:
[   19.124721] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff81 7c7d1b78 60000000 60000000 
[   19.124869] 7fe000a6 3d220003 39298f05 7ffeeb78 <89290000> 2f890000 419e0044 60000000 
[   19.125034] ---[ end trace 7c08acedd4b4e6aa ]---


cheers

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization
  2018-06-05  2:16 ` [PATCH v7 0/5] powerpc/64: memcmp() optimization Michael Ellerman
  2018-06-04 10:27   ` Simon Guo
@ 2018-06-06  6:21   ` Simon Guo
  2018-06-06  6:36     ` Naveen N. Rao
  1 sibling, 1 reply; 11+ messages in thread
From: Simon Guo @ 2018-06-06  6:21 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev, Paul Mackerras, Naveen N.  Rao, Cyril Bur

Hi Michael,
On Tue, Jun 05, 2018 at 12:16:22PM +1000, Michael Ellerman wrote:
> Hi Simon,
> 
> wei.guo.simon@gmail.com writes:
> > From: Simon Guo <wei.guo.simon@gmail.com>
> >
> > There is some room to optimize memcmp() in powerpc 64 bits version for
> > following 2 cases:
> > (1) Even src/dst addresses are not aligned with 8 bytes at the beginning,
> > memcmp() can align them and go with .Llong comparision mode without
> > fallback to .Lshort comparision mode do compare buffer byte by byte.
> > (2) VMX instructions can be used to speed up for large size comparision,
> > currently the threshold is set for 4K bytes. Notes the VMX instructions
> > will lead to VMX regs save/load penalty. This patch set includes a
> > patch to add a 32 bytes pre-checking to minimize the penalty.
> >
> > It did the similar with glibc commit dec4a7105e (powerpc: Improve memcmp 
> > performance for POWER8). Thanks Cyril Bur's information.
> > This patch set also updates memcmp selftest case to make it compiled and
> > incorporate large size comparison case.
> 
> I'm seeing a few crashes with this applied, I haven't had time to look
> into what is happening yet, sorry.
> 

The bug is that memcmp() invokes a C function, enter_vmx_ops(), which loads
PIC (TOC-relative) values based on r2.

memcmp() itself doesn't use r2, so when memcmp() is invoked from the kernel
image everything is fine. But when memcmp() is invoked from a module (here
test_user_copy), r2 needs to be set up correctly first. Otherwise
enter_vmx_ops() will reference an incorrect/nonexistent data location based
on the wrong r2 value.
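
To illustrate (rough sketch only; "some_kernel_var" is a made-up symbol,
not a real kernel variable): a TOC-using kernel C function addresses its
globals relative to r2, much like the addis-from-r2 / load pair visible in
the instruction dumps of the oopses, so the first such access faults when
r2 still holds the module's TOC:

	addis	r9,r2,some_kernel_var@toc@ha	/* r2 must be the kernel TOC */
	lbz	r9,some_kernel_var@toc@l(r9)	/* faults if r2 is the module TOC */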

The following patch fixes this issue:
------------
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 5eba49744a5a..24d093fa89bb 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -102,7 +102,7 @@
  * 2) src/dst has different offset to the 8 bytes boundary. The handlers
  * are named like .Ldiffoffset_xxxx
  */
-_GLOBAL(memcmp)
+_GLOBAL_TOC(memcmp)
        cmpdi   cr1,r5,0

        /* Use the short loop if the src/dst addresses are not
----------

It means the memcmp() function entry will gain 2 additional instructions. Is
there any way to save these 2 instructions when memcmp() is actually invoked
from the kernel itself?
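
For reference, a rough sketch of those 2 extra instructions (an
approximation of the ELFv2 _GLOBAL_TOC() expansion; the real macro lives in
arch/powerpc/include/asm/ppc_asm.h):

memcmp:
0:	addis	r2,r12,(.TOC.-0b)@ha	/* global entry: derive the kernel  */
	addi	r2,r2,(.TOC.-0b)@l	/*   TOC pointer in r2 from r12     */
	.localentry memcmp,.-memcmp	/* local entry point starts here    */
	cmpdi	cr1,r5,0		/* original first instruction       */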

Thanks again for finding this issue.

Thanks,
- Simon

> [ 2471.300595] kselftest: Running tests in user
> [ 2471.302785] calling  test_user_copy_init+0x0/0xd14 [test_user_copy] @ 44883
> [ 2471.302892] Unable to handle kernel paging request for data at address 0xc008000018553005
> [ 2471.303014] Faulting instruction address: 0xc00000000001f29c
> [ 2471.303119] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 2471.303193] LE SMP NR_CPUS=2048 NUMA PowerNV
> [ 2471.303256] Modules linked in: test_user_copy(+) vxlan ip6_udp_tunnel udp_tunnel 8021q bridge stp llc dummy test_printf test_firmware vmx_crypto crct10dif_vpmsum crct10dif_common crc32c_vpmsum veth [last unloaded: test_static_key_base]
> [ 2471.303532] CPU: 4 PID: 44883 Comm: modprobe Tainted: G        W         4.17.0-rc3-gcc7x-g7204012 #1
> [ 2471.303644] NIP:  c00000000001f29c LR: c00000000001f6e4 CTR: 0000000000000000
> [ 2471.303754] REGS: c000001fddc2b560 TRAP: 0300   Tainted: G        W          (4.17.0-rc3-gcc7x-g7204012)
> [ 2471.303873] MSR:  9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 24222844  XER: 00000000
> [ 2471.303996] CFAR: c00000000001f6e0 DAR: c008000018553005 DSISR: 40000000 IRQMASK: 0 
> [ 2471.303996] GPR00: c00000000001f6e4 c000001fddc2b7e0 c008000018529900 0000000002000000 
> [ 2471.303996] GPR04: c000001fe4b90020 000000000000ffe0 0000000000000000 03fffffe01b48000 
> [ 2471.303996] GPR08: 0000000080000000 c008000018553005 c000001fddc28000 c008000018520df0 
> [ 2471.303996] GPR12: c00000000009c430 c000001fffffbc00 0000000020000000 0000000000000000 
> [ 2471.303996] GPR16: c000001fddc2bc20 0000000000000030 c0000000001f7ba0 0000000000000001 
> [ 2471.303996] GPR20: 0000000000000000 c000000000c772b0 c0000000010b4018 0000000000000000 
> [ 2471.303996] GPR24: 0000000000000000 c008000018521c98 0000000000000000 c000001fe4b90000 
> [ 2471.303996] GPR28: fffffffffffffff4 0000000002000000 9000000002009033 9000000002009033 
> [ 2471.304930] NIP [c00000000001f29c] msr_check_and_set+0x3c/0xc0
> [ 2471.305008] LR [c00000000001f6e4] enable_kernel_altivec+0x44/0x100
> [ 2471.305084] Call Trace:
> [ 2471.305122] [c000001fddc2b7e0] [c00000000009baa8] __copy_tofrom_user_base+0x9c/0x574 (unreliable)
> [ 2471.305240] [c000001fddc2b860] [c00000000001f6e4] enable_kernel_altivec+0x44/0x100
> [ 2471.305336] [c000001fddc2b890] [c00000000009ce40] enter_vmx_ops+0x50/0x70
> [ 2471.305418] [c000001fddc2b8b0] [c00000000009c768] memcmp+0x338/0x680
> [ 2471.305501] [c000001fddc2b9b0] [c008000018520190] test_user_copy_init+0x188/0xd14 [test_user_copy]
> [ 2471.305617] [c000001fddc2ba60] [c00000000000de20] do_one_initcall+0x90/0x560
> [ 2471.305710] [c000001fddc2bb30] [c000000000200630] do_init_module+0x90/0x260
> [ 2471.305795] [c000001fddc2bbc0] [c0000000001fec88] load_module+0x1a28/0x1ce0
> [ 2471.305875] [c000001fddc2bd70] [c0000000001ff1e8] sys_finit_module+0xc8/0x110
> [ 2471.305983] [c000001fddc2be30] [c00000000000b528] system_call+0x58/0x6c
> [ 2471.306066] Instruction dump:
> [ 2471.306112] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff81 7c7d1b78 60000000 60000000 
> [ 2471.306216] 7fe000a6 3d220003 39299705 7ffeeb78 <89290000> 2f890000 419e0044 60000000 
> [ 2471.306326] ---[ end trace daf8d409e65b9841 ]---
> 
> And:
> 
> [   19.096709] test_bpf: test_skb_segment: success in skb_segment!
> [   19.096799] initcall test_bpf_init+0x0/0xae0 [test_bpf] returned 0 after 591217 usecs
> [   19.115869] calling  test_user_copy_init+0x0/0xd14 [test_user_copy] @ 3159
> [   19.116165] Unable to handle kernel paging request for data at address 0xd000000003852805
> [   19.116352] Faulting instruction address: 0xc00000000001f44c
> [   19.116483] Oops: Kernel access of bad area, sig: 11 [#1]
> [   19.116583] LE SMP NR_CPUS=2048 NUMA pSeries
> [   19.116684] Modules linked in: test_user_copy(+) lzo_compress crc_itu_t zstd_compress zstd_decompress test_bpf test_static_keys test_static_key_base xxhash test_firmware af_key cls_bpf act_bpf bridge nf_nat_irc xt_NFLOG nfnetlink_log xt_policy nf_conntrack_netlink nfnetlink xt_nat nf_conntrack_irc xt_mark xt_tcpudp nf_nat_sip xt_TCPMSS xt_LOG nf_nat_ftp nf_conntrack_ftp xt_conntrack nf_conntrack_sip xt_addrtype xt_state 8021q iptable_filter ipt_MASQUERADE nf_log_ipv4 iptable_mangle nf_nat_masquerade_ipv4 ipt_REJECT nf_reject_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables nf_log_arp nf_log_common ah4 ipcomp xfrm4_tunnel esp4 rpcrdma stp p8022 psnap llc xfrm_ipcomp xfrm_user xfrm_algo platform_lcd lcd ocxl virtio_balloon virtio_crypto crypto_engine
> [   19.118040]  vmx_crypto nbd zram zsmalloc virtio_blk st be2iscsi cxgb3i cxgb4i libcxgbi bnx2i ibmvfc sym53c8xx scsi_transport_spi scsi_dh_alua scsi_dh_rdac qla4xxx mpt3sas scsi_transport_sas cxlflash cxl libiscsi_tcp lpfc crc_t10dif crct10dif_generic crct10dif_common qla2xxx iscsi_boot_sysfs raid_class parport_pc parport powernv_op_panel powernv_rng pseries_rng rng_core virtio_console pcspkr input_leds evdev dm_round_robin dm_mirror dm_region_hash dm_log raid10 dm_service_time multipath dm_queue_length dm_multipath dm_thin_pool faulty dm_persistent_data dm_zero dm_crypt dm_bio_prison dm_snapshot dm_bufio raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq rpadlpar_io rpaphp jsm icom hvcs ib_ipoib ib_srp ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_ucm ib_ucm ib_uverbs
> [   19.119505]  rdma_cm iw_cm ib_cm mlx4_ib iw_cxgb3 iw_cxgb4 ib_mthca ib_core leds_powernv led_class vhost_net vhost macvtap macvlan dummy bsd_comp ppp_async crc_ccitt pppoe ppp_synctty pppox ppp_deflate ppp_generic 3c59x s2io bnx2 cnic uio bnx2x libcrc32c i40e ixgbe ixgb cxgb3 libcxgb cxgb cxgb4 pcnet32 netxen_nic qlge be2net acenic mlx4_en mlx4_core myri10ge bonding slhc tap mdio veth vxlan udp_tunnel tun usb_storage usbmon oprofile sha1_powerpc md5_ppc crc32c_vpmsum kvm hvcserver
> [   19.120358] CPU: 4 PID: 3159 Comm: modprobe Not tainted 4.17.0-rc3-gcc7x-g7204012 #1
> [   19.120508] NIP:  c00000000001f44c LR: c00000000001f894 CTR: 0000000000000000
> [   19.120666] REGS: c0000000f8d9f570 TRAP: 0300   Not tainted  (4.17.0-rc3-gcc7x-g7204012)
> [   19.120817] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24222844  XER: 00000000
> [   19.120984] CFAR: c00000000000c03c DAR: d000000003852805 DSISR: 40000000 IRQMASK: 0 
>                GPR00: c00000000001f894 c0000000f8d9f7f0 d000000003829900 0000000002000000 
>                GPR04: c0000000f9a30048 000000000000ffe0 0000000000000000 03fffffff065dffd 
>                GPR08: 0000000080000000 d000000003852805 c0000000f8d9c000 d000000003820df0 
>                GPR12: c00000000009ebb0 c00000003fffb300 c0000000f8d9fd90 d000000003840000 
>                GPR16: d000000003840000 0000000000000000 c0000000011d6900 d000000003821ad0 
>                GPR20: c000000000bd7860 0000000000000000 c000000000ff9060 00000000014000c0 
>                GPR24: 0000000000000000 0000000000000000 0000000000000100 c0000000f9a30028 
>                GPR28: fffffffffffffff4 0000000002000000 8000000002009033 8000000000009033 
> [   19.122454] NIP [c00000000001f44c] msr_check_and_set+0x3c/0xc0
> [   19.122580] LR [c00000000001f894] enable_kernel_altivec+0x44/0x100
> [   19.122707] Call Trace:
> [   19.122789] [c0000000f8d9f7f0] [c00000000009e228] __copy_tofrom_user_base+0x9c/0x574 (unreliable)
> [   19.122962] [c0000000f8d9f870] [c00000000001f894] enable_kernel_altivec+0x44/0x100
> [   19.123344] [c0000000f8d9f8a0] [c00000000009f740] enter_vmx_ops+0x50/0x70
> [   19.123583] [c0000000f8d9f8c0] [c00000000009eee8] memcmp+0x338/0x680
> [   19.123728] [c0000000f8d9f9c0] [d000000003820190] test_user_copy_init+0x188/0xd14 [test_user_copy]
> [   19.123909] [c0000000f8d9fa70] [c00000000000e37c] do_one_initcall+0x5c/0x2d0
> [   19.124094] [c0000000f8d9fb30] [c00000000020066c] do_init_module+0x90/0x264
> [   19.124234] [c0000000f8d9fbc0] [c0000000001ff084] load_module+0x2f64/0x3600
> [   19.124371] [c0000000f8d9fd70] [c0000000001ff9c8] sys_finit_module+0xc8/0x110
> [   19.124530] [c0000000f8d9fe30] [c00000000000b868] system_call+0x58/0x6c
> [   19.124648] Instruction dump:
> [   19.124721] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff81 7c7d1b78 60000000 60000000 
> [   19.124869] 7fe000a6 3d220003 39298f05 7ffeeb78 <89290000> 2f890000 419e0044 60000000 
> [   19.125034] ---[ end trace 7c08acedd4b4e6aa ]---
> 
> 
> cheers

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization
  2018-06-06  6:21   ` Simon Guo
@ 2018-06-06  6:36     ` Naveen N. Rao
  2018-06-06  6:53       ` Simon Guo
  0 siblings, 1 reply; 11+ messages in thread
From: Naveen N. Rao @ 2018-06-06  6:36 UTC (permalink / raw)
  To: Michael Ellerman, Simon Guo; +Cc: Cyril Bur, linuxppc-dev

Simon Guo wrote:
> Hi Michael,
> On Tue, Jun 05, 2018 at 12:16:22PM +1000, Michael Ellerman wrote:
>> Hi Simon,
>>
>> wei.guo.simon@gmail.com writes:
>> > From: Simon Guo <wei.guo.simon@gmail.com>
>> >
>> > There is some room to optimize memcmp() in powerpc 64 bits version for
>> > following 2 cases:
>> > (1) Even src/dst addresses are not aligned with 8 bytes at the beginning,
>> > memcmp() can align them and go with .Llong comparision mode without
>> > fallback to .Lshort comparision mode do compare buffer byte by byte.
>> > (2) VMX instructions can be used to speed up for large size comparision,
>> > currently the threshold is set for 4K bytes. Notes the VMX instructions
>> > will lead to VMX regs save/load penalty. This patch set includes a
>> > patch to add a 32 bytes pre-checking to minimize the penalty.
>> >
>> > It did the similar with glibc commit dec4a7105e (powerpc: Improve memcmp
>> > performance for POWER8). Thanks Cyril Bur's information.
>> > This patch set also updates memcmp selftest case to make it compiled and
>> > incorporate large size comparison case.
>>
>> I'm seeing a few crashes with this applied, I haven't had time to look
>> into what is happening yet, sorry.
>>
>
> The bug is due to memcmp() invokes a C function enter_vmx_ops() who will load
> some PIC value based on r2.
>
> memcmp() doesn't use r2 and if the memcmp() is invoked from kernel
> itself, everything is fine. But if memcmp() is invoked from modules[test_user_copy],
> r2 will be required to be setup correctly. Otherwise the enter_vmx_ops() will refer
> to an incorrect/unexisting data location based on wrong r2 value.
>
> Following patch will fix this issue:
> ------------
> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index 5eba49744a5a..24d093fa89bb 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -102,7 +102,7 @@
>   * 2) src/dst has different offset to the 8 bytes boundary. The handlers
>   * are named like .Ldiffoffset_xxxx
>   */
> -_GLOBAL(memcmp)
> +_GLOBAL_TOC(memcmp)
>         cmpdi   cr1,r5,0
>
>         /* Use the short loop if the src/dst addresses are not
> ----------
>
> It means the memcmp() fun entry will have additional 2 instructions. Is there
> any way to save these 2 instructions when the memcmp() is actually invoked
> from kernel itself?

That will be the case. We will end up entering the function via the
local entry point, skipping the first two instructions. The global entry
point is only used for cross-module calls.
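
As a rough illustration of the calling convention (sketch, not exact
linker/module-loader output):

	/* caller built into the kernel image: the call resolves to the
	 * local entry point, r2 already holds the kernel TOC, and the
	 * following nop remains a nop */
	bl	memcmp
	nop

	/* caller in a module: the call is redirected through a trampoline
	 * that loads memcmp's address into r12 and branches to the global
	 * entry point; the nop slot is patched so the module's own TOC is
	 * restored after the call returns */
	bl	memcmp
	ld	r2,24(r1)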

- Naveen


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization
  2018-06-06  6:36     ` Naveen N. Rao
@ 2018-06-06  6:53       ` Simon Guo
  0 siblings, 0 replies; 11+ messages in thread
From: Simon Guo @ 2018-06-06  6:53 UTC (permalink / raw)
  To: Naveen N. Rao; +Cc: Michael Ellerman, Cyril Bur, linuxppc-dev

Hi Naveen,
On Wed, Jun 06, 2018 at 12:06:09PM +0530, Naveen N. Rao wrote:
> Simon Guo wrote:
> >Hi Michael,
> >On Tue, Jun 05, 2018 at 12:16:22PM +1000, Michael Ellerman wrote:
> >>Hi Simon,
> >>
> >>wei.guo.simon@gmail.com writes:
> >>> From: Simon Guo <wei.guo.simon@gmail.com>
> >>>
> >>> There is some room to optimize memcmp() in powerpc 64 bits version for
> >>> following 2 cases:
> >>> (1) Even src/dst addresses are not aligned with 8 bytes at the beginning,
> >>> memcmp() can align them and go with .Llong comparision mode without
> >>> fallback to .Lshort comparision mode do compare buffer byte by byte.
> >>> (2) VMX instructions can be used to speed up for large size comparision,
> >>> currently the threshold is set for 4K bytes. Notes the VMX instructions
> >>> will lead to VMX regs save/load penalty. This patch set includes a
> >>> patch to add a 32 bytes pre-checking to minimize the penalty.
> >>>
> >>> It did the similar with glibc commit dec4a7105e (powerpc:
> >>Improve memcmp > performance for POWER8). Thanks Cyril Bur's
> >>information.
> >>> This patch set also updates memcmp selftest case to make it compiled and
> >>> incorporate large size comparison case.
> >>
> >>I'm seeing a few crashes with this applied, I haven't had time to look
> >>into what is happening yet, sorry.
> >>
> >
> >The bug is due to memcmp() invokes a C function enter_vmx_ops()
> >who will load some PIC value based on r2.
> >
> >memcmp() doesn't use r2 and if the memcmp() is invoked from kernel
> >itself, everything is fine. But if memcmp() is invoked from
> >modules[test_user_copy], r2 will be required to be setup
> >correctly. Otherwise the enter_vmx_ops() will refer to an
> >incorrect/unexisting data location based on wrong r2 value.
> >
> >Following patch will fix this issue:
> >------------
> >diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> >index 5eba49744a5a..24d093fa89bb 100644
> >--- a/arch/powerpc/lib/memcmp_64.S
> >+++ b/arch/powerpc/lib/memcmp_64.S
> >@@ -102,7 +102,7 @@
> >  * 2) src/dst has different offset to the 8 bytes boundary. The handlers
> >  * are named like .Ldiffoffset_xxxx
> >  */
> >-_GLOBAL(memcmp)
> >+_GLOBAL_TOC(memcmp)
> >        cmpdi   cr1,r5,0
> >
> >        /* Use the short loop if the src/dst addresses are not
> >----------
> >
> >It means the memcmp() fun entry will have additional 2 instructions. Is there
> >any way to save these 2 instructions when the memcmp() is actually invoked
> >from kernel itself?
> 
> That will be the case. We will end up entering the function via the
> local entry point skipping the first two instructions. The Global
> entry point is only used for cross-module calls.
> 

Yes. Thanks :)

- Simon

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-06-06  6:53 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-30  9:20 [PATCH v7 0/5] powerpc/64: memcmp() optimization wei.guo.simon
2018-05-30  9:20 ` [PATCH v7 1/5] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
2018-05-30  9:21 ` [PATCH v7 2/5] powerpc: add vcmpequd/vcmpequb ppc instruction macro wei.guo.simon
2018-05-30  9:21 ` [PATCH v7 3/5] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
2018-05-30  9:21 ` [PATCH v7 4/5] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp() wei.guo.simon
2018-05-30  9:21 ` [PATCH v7 5/5] powerpc:selftest update memcmp_64 selftest for VMX implementation wei.guo.simon
2018-06-05  2:16 ` [PATCH v7 0/5] powerpc/64: memcmp() optimization Michael Ellerman
2018-06-04 10:27   ` Simon Guo
2018-06-06  6:21   ` Simon Guo
2018-06-06  6:36     ` Naveen N. Rao
2018-06-06  6:53       ` Simon Guo
