* [PATCH v2 0/3] powerpc/64: memcmp() optimization
@ 2017-09-20 23:34 wei.guo.simon
  2017-09-20 23:34 ` [PATCH v2 1/3] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: wei.guo.simon @ 2017-09-20 23:34 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Paul Mackerras, Michael Ellerman, Naveen N.  Rao, David Laight,
	Christophe LEROY, Simon Guo

From: Simon Guo <wei.guo.simon@gmail.com>

There is some room to optimize the powerpc 64-bit memcmp() for the
following 2 cases:
(1) Even when the src/dst addresses are not 8-byte aligned at the start,
memcmp() can align them and use the .Llong comparison mode instead of
falling back to the .Lshort comparison mode, which compares the buffers
byte by byte.
(2) VMX instructions can be used to speed up large-size comparisons.

This patch set also updates the memcmp selftest so that it compiles and
covers a large-size comparison case.
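
For reference, here is a rough C model of the strategy this series aims for
(the real implementation is PPC64 assembly in arch/powerpc/lib/memcmp_64.S;
the function below is only an illustration with a made-up name, not the
actual code):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model only; return value follows the memcmp() sign convention. */
static int memcmp_model(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	/* Case (1): aligning helps only when both pointers share the same
	 * offset within an 8-byte word. */
	if ((((uintptr_t)p1 ^ (uintptr_t)p2) & 7) == 0) {
		while (n && ((uintptr_t)p1 & 7)) {	/* unaligned head */
			if (*p1 != *p2)
				return *p1 < *p2 ? -1 : 1;
			p1++; p2++; n--;
		}
		/* Case (2): for sizes above 4K the asm would switch to a
		 * VMX 32-bytes-per-iteration loop here; this model simply
		 * continues 8 bytes at a time (the .Llong idea). */
		while (n >= 8) {
			uint64_t a, b;

			memcpy(&a, p1, 8);
			memcpy(&b, p2, 8);
			if (a != b)
				break;	/* resolve byte-wise below */
			p1 += 8; p2 += 8; n -= 8;
		}
	}
	while (n--) {	/* byte-by-byte, i.e. the .Lshort loop */
		if (*p1 != *p2)
			return *p1 < *p2 ? -1 : 1;
		p1++; p2++;
	}
	return 0;
}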

v1 -> v2:
- update the comparison method for bytes not aligned to 8 bytes.
- fix a VMX comparison bug.
- enhance the original memcmp() selftest.
- add powerpc/64 to subject/commit message.

Simon Guo (3):
  powerpc/64: Align bytes before fall back to .Lshort in powerpc64
    memcmp().
  powerpc/64: enhance memcmp() with VMX instruction for long bytes
    comparision
  powerpc:selftest update memcmp_64 selftest for VMX implementation

 arch/powerpc/include/asm/asm-prototypes.h          |   2 +-
 arch/powerpc/lib/copypage_power7.S                 |   2 +-
 arch/powerpc/lib/memcmp_64.S                       | 181 ++++++++++++++++++++-
 arch/powerpc/lib/memcpy_power7.S                   |   2 +-
 arch/powerpc/lib/vmx-helper.c                      |   2 +-
 .../selftests/powerpc/copyloops/asm/ppc_asm.h      |   2 +-
 .../selftests/powerpc/stringloops/asm/ppc_asm.h    |  31 ++++
 .../testing/selftests/powerpc/stringloops/memcmp.c |  63 ++++---
 8 files changed, 254 insertions(+), 31 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/3] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp().
  2017-09-20 23:34 [PATCH v2 0/3] powerpc/64: memcmp() optimization wei.guo.simon
@ 2017-09-20 23:34 ` wei.guo.simon
  2017-09-20 23:34 ` [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
  2017-09-20 23:34 ` [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation wei.guo.simon
  2 siblings, 0 replies; 18+ messages in thread
From: wei.guo.simon @ 2017-09-20 23:34 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Paul Mackerras, Michael Ellerman, Naveen N.  Rao, David Laight,
	Christophe LEROY, Simon Guo

From: Simon Guo <wei.guo.simon@gmail.com>

Currently the powerpc 64-bit memcmp() falls back to .Lshort (byte-by-byte
compare mode) if either the src or the dst address is not 8-byte aligned.
This can be optimized when both addresses share the same offset from an
8-byte boundary.

memcmp() can first compare the unaligned bytes up to the 8-byte boundary
and then compare the remaining 8-byte-aligned content in .Llong mode.

This patch optimizes memcmp() behavior in this situation.
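
As a simplified C model of that head-comparison step (a sketch only:
load_be64() is a hypothetical helper standing in for the LD macro, and the
read below s1/s2 stays within the containing aligned doubleword as in the
asm, so it is not strictly portable ISO C):

#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the 8 bytes at an 8-byte-aligned address, in
 * big-endian byte order, which is what LD (ldx/ldbrx) gives the asm. */
static uint64_t load_be64(const unsigned char *p8)
{
	uint64_t v;

	memcpy(&v, p8, 8);
#ifdef __LITTLE_ENDIAN__
	v = __builtin_bswap64(v);
#endif
	return v;
}

/* Compare the (8 - offset) bytes up to the next 8-byte boundary, assuming
 * n >= 8 and that s1/s2 share the same non-zero offset, as in the asm. */
static int cmp_unaligned_head(const unsigned char *s1, const unsigned char *s2)
{
	unsigned int off = (uintptr_t)s1 & 7;
	/* Load the aligned doublewords containing s1/s2 and shift out the
	 * bytes that sit before them (the high-order bytes). */
	uint64_t a = load_be64(s1 - off) << (off * 8);
	uint64_t b = load_be64(s2 - off) << (off * 8);

	return a == b ? 0 : (a < b ? -1 : 1);
}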

Test result:

(1) 256 bytes
Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
- without patch
      50.996607479 seconds time elapsed                                          ( +- 0.01% )
- with patch
      28.033316997 seconds time elapsed                                          ( +- 0.01% )
		-> There is a ~81% improvement

(2) 32 bytes
To observe performance impact on < 32 bytes, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
-------
 #include <string.h>
 #include "utils.h"

-#define SIZE 256
+#define SIZE 32
 #define ITERATIONS 10000

 int test_memcmp(const void *s1, const void *s2, size_t n);
--------

- Without patch
       0.392578831 seconds time elapsed                                          ( +- 0.05% )
- with patch
       0.358446662 seconds time elapsed                                          ( +- 0.04% )
		-> There is ~+9% improvement

(3) 0~8 bytes
To observe <8 bytes performance impact, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
-------
 #include <string.h>
 #include "utils.h"

-#define SIZE 256
-#define ITERATIONS 10000
+#define SIZE 8
+#define ITERATIONS 1000000

 int test_memcmp(const void *s1, const void *s2, size_t n);
-------
- Without patch
       3.168752060 seconds time elapsed                                          ( +- 0.10% )
- With patch
       3.153030138 seconds time elapsed                                          ( +- 0.09% )
		-> They are nearly the same. (-0.4%)

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
 arch/powerpc/lib/memcmp_64.S | 99 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 93 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index d75d18b..6dccfb8 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -24,28 +24,35 @@
 #define rH	r31
 
 #ifdef __LITTLE_ENDIAN__
+#define LH	lhbrx
+#define LW	lwbrx
 #define LD	ldbrx
 #else
+#define LH	lhzx
+#define LW	lwzx
 #define LD	ldx
 #endif
 
 _GLOBAL(memcmp)
 	cmpdi	cr1,r5,0
 
-	/* Use the short loop if both strings are not 8B aligned */
-	or	r6,r3,r4
+	/* Use the short loop if the src/dst addresses are not
+	 * with the same offset of 8 bytes align boundary.
+	 */
+	xor	r6,r3,r4
 	andi.	r6,r6,7
 
-	/* Use the short loop if length is less than 32B */
-	cmpdi	cr6,r5,31
+	/* fall back to short loop if compare at aligned addrs
+	 * with less than 8 bytes.
+	 */
+	cmpdi   cr6,r5,7
 
 	beq	cr1,.Lzero
 	bne	.Lshort
-	bgt	cr6,.Llong
+	bgt	cr6,.L8bytes_make_align_start
 
 .Lshort:
 	mtctr	r5
-
 1:	lbz	rA,0(r3)
 	lbz	rB,0(r4)
 	subf.	rC,rB,rA
@@ -78,6 +85,78 @@ _GLOBAL(memcmp)
 	li	r3,0
 	blr
 
+.L8bytes_make_align_start:
+	/* attempt to compare bytes not aligned with 8 bytes so that
+	 * left comparison can run based on 8 bytes alignment.
+	 */
+	andi.   r6,r3,7
+	beq     .L8bytes_aligned
+
+	/* Try to compare the first double word which is not 8 bytes aligned:
+	 * load the first double word at (src & ~7UL) and shift left appropriate
+	 * bits before comparision.
+	 */
+	clrlwi  r6,r3,29
+	rlwinm  r6,r6,3,0,28
+	clrrdi	r3,r3,3
+	clrrdi	r4,r4,3
+	LD	rA,0,r3
+	LD	rB,0,r4
+	sld	rA,rA,r6
+	sld	rB,rB,r6
+	cmpld	cr0,rA,rB
+	bne	cr0,.LcmpAB_lightweight
+	srwi	r6,r6,3
+	subfic  r6,r6,8
+	subfc.	r5,r6,r5
+	beq	.Lzero
+	addi	r3,r3,8
+	addi	r4,r4,8
+
+.L8bytes_aligned:
+	/* now we are aligned with 8 bytes.
+	 * Use .Llong loop if left cmp bytes are equal or greater than 32B.
+	 */
+	cmpdi   cr6,r5,31
+	bgt	cr6,.Llong
+
+	cmpdi   cr6,r5,7
+	bgt	cr6,.Lcmp_8bytes_31bytes
+
+.Lcmp_rest_lt8bytes:
+	/* Here we have only less than 8 bytes to compare with. Addresses
+	 * are aligned with 8 bytes.
+	 * The next double words are load and shift right with appropriate
+	 * bits.
+	 */
+	subfic  r6,r5,8
+	rlwinm  r6,r6,3,0,28
+	LD	rA,0,r3
+	LD	rB,0,r4
+	srd	rA,rA,r6
+	srd	rB,rB,r6
+	cmpld	cr0,rA,rB
+	bne	cr0,.LcmpAB_lightweight
+	beq	.Lzero
+
+.Lcmp_8bytes_31bytes:
+	/* compare 8 ~ 31 bytes with 8 bytes aligned */
+	srdi.   r0,r5,3
+	clrldi  r5,r5,61
+	mtctr   r0
+831:
+	LD	rA,0,r3
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	bne	cr0,.LcmpAB_lightweight
+	addi	r3,r3,8
+	addi	r4,r4,8
+	bdnz	831b
+
+	cmpwi   r5,0
+	beq	.Lzero
+	b	.Lcmp_rest_lt8bytes
+
 .Lnon_zero:
 	mr	r3,rC
 	blr
@@ -232,4 +311,12 @@ _GLOBAL(memcmp)
 	ld	r28,-32(r1)
 	ld	r27,-40(r1)
 	blr
+
+.LcmpAB_lightweight:   /* skip NV GPRS restore */
+	li	r3,1
+	bgt	cr0,8f
+	li	r3,-1
+8:
+	blr
+
 EXPORT_SYMBOL(memcmp)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2017-09-20 23:34 [PATCH v2 0/3] powerpc/64: memcmp() optimization wei.guo.simon
  2017-09-20 23:34 ` [PATCH v2 1/3] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
@ 2017-09-20 23:34 ` wei.guo.simon
  2017-09-21  0:54   ` Simon Guo
  2017-09-22 14:06   ` Cyril Bur
  2017-09-20 23:34 ` [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation wei.guo.simon
  2 siblings, 2 replies; 18+ messages in thread
From: wei.guo.simon @ 2017-09-20 23:34 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Paul Mackerras, Michael Ellerman, Naveen N.  Rao, David Laight,
	Christophe LEROY, Simon Guo

From: Simon Guo <wei.guo.simon@gmail.com>

This patch adds VMX primitives to do memcmp() in case the compare size
exceeds 4K bytes.
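
The shape of the new path, modelled in C (a sketch under the assumption that
plain memcmp() stands in for both the 32-byte lvx/vcmpequd. test and the
scalar tail; enter_vmx_ops() is the helper this patch introduces, stubbed
here so the sketch is self-contained):

#include <stddef.h>
#include <string.h>

/* Stub: the real enter_vmx_ops() lives in arch/powerpc/lib/vmx-helper.c
 * and returns 0 when VMX cannot be used (e.g. in interrupt context). */
static int enter_vmx_ops(void) { return 1; }

static int memcmp_long(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	/* Only take the VMX path for compares larger than 4K, and only
	 * when the VMX context can actually be entered. */
	if (n > 4096 && enter_vmx_ops()) {
		/* 32 bytes per iteration; the asm drops out of the loop
		 * on the first mismatching 16-byte block. */
		while (n >= 32 && memcmp(p1, p2, 32) == 0) {
			p1 += 32;
			p2 += 32;
			n -= 32;
		}
	}
	/* The tail and any mismatching block are resolved by scalar code. */
	return memcmp(p1, p2, n);
}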

Test result with the following test program (replace the "^>" with ""):
------
># cat tools/testing/selftests/powerpc/stringloops/memcmp.c
>#include <malloc.h>
>#include <stdlib.h>
>#include <string.h>
>#include <time.h>
>#include "utils.h"
>#define SIZE (1024 * 1024 * 900)
>#define ITERATIONS 40

int test_memcmp(const void *s1, const void *s2, size_t n);

static int testcase(void)
{
        char *s1;
        char *s2;
        unsigned long i;

        s1 = memalign(128, SIZE);
        if (!s1) {
                perror("memalign");
                exit(1);
        }

        s2 = memalign(128, SIZE);
        if (!s2) {
                perror("memalign");
                exit(1);
        }

        for (i = 0; i < SIZE; i++)  {
                s1[i] = i & 0xff;
                s2[i] = i & 0xff;
        }
        for (i = 0; i < ITERATIONS; i++) {
		int ret = test_memcmp(s1, s2, SIZE);

		if (ret) {
			printf("return %d at[%ld]! should have returned zero\n", ret, i);
			abort();
		}
	}

        return 0;
}

int main(void)
{
        return test_harness(testcase, "memcmp");
}
------
Without VMX patch:
       7.435191479 seconds time elapsed                                          ( +- 0.51% )
With VMX patch:
       6.802038938 seconds time elapsed                                          ( +- 0.56% )
		There is ~+8% improvement.

However, I am not aware of any use case in the kernel for memcmp() on
such large sizes yet.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
 arch/powerpc/include/asm/asm-prototypes.h |  2 +-
 arch/powerpc/lib/copypage_power7.S        |  2 +-
 arch/powerpc/lib/memcmp_64.S              | 82 +++++++++++++++++++++++++++++++
 arch/powerpc/lib/memcpy_power7.S          |  2 +-
 arch/powerpc/lib/vmx-helper.c             |  2 +-
 5 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index 7330150..e6530d8 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -49,7 +49,7 @@ void __trace_hcall_exit(long opcode, unsigned long retval,
 /* VMX copying */
 int enter_vmx_usercopy(void);
 int exit_vmx_usercopy(void);
-int enter_vmx_copy(void);
+int enter_vmx_ops(void);
 void * exit_vmx_copy(void *dest);
 
 /* Traps */
diff --git a/arch/powerpc/lib/copypage_power7.S b/arch/powerpc/lib/copypage_power7.S
index ca5fc8f..9e7729e 100644
--- a/arch/powerpc/lib/copypage_power7.S
+++ b/arch/powerpc/lib/copypage_power7.S
@@ -60,7 +60,7 @@ _GLOBAL(copypage_power7)
 	std	r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
 	std	r0,16(r1)
 	stdu	r1,-STACKFRAMESIZE(r1)
-	bl	enter_vmx_copy
+	bl	enter_vmx_ops
 	cmpwi	r3,0
 	ld	r0,STACKFRAMESIZE+16(r1)
 	ld	r3,STK_REG(R31)(r1)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 6dccfb8..40218fc 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -162,6 +162,13 @@ _GLOBAL(memcmp)
 	blr
 
 .Llong:
+#ifdef CONFIG_ALTIVEC
+	/* Try to use vmx loop if length is larger than 4K */
+	cmpldi  cr6,r5,4096
+	bgt	cr6,.Lvmx_cmp
+
+.Llong_novmx_cmp:
+#endif
 	li	off8,8
 	li	off16,16
 	li	off24,24
@@ -319,4 +326,79 @@ _GLOBAL(memcmp)
 8:
 	blr
 
+#ifdef CONFIG_ALTIVEC
+.Lvmx_cmp:
+	mflr    r0
+	std     r3,-STACKFRAMESIZE+STK_REG(R31)(r1)
+	std     r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
+	std     r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
+	std     r0,16(r1)
+	stdu    r1,-STACKFRAMESIZE(r1)
+	bl      enter_vmx_ops
+	cmpwi   cr1,r3,0
+	ld      r0,STACKFRAMESIZE+16(r1)
+	ld      r3,STK_REG(R31)(r1)
+	ld      r4,STK_REG(R30)(r1)
+	ld      r5,STK_REG(R29)(r1)
+	addi	r1,r1,STACKFRAMESIZE
+	mtlr    r0
+	beq     cr1,.Llong_novmx_cmp
+
+3:
+	/* Enter with src/dst address 8 bytes aligned, and len is
+	 * no less than 4KB. Need to align with 16 bytes further.
+	 */
+	andi.	rA,r3,8
+	beq	4f
+	LD	rA,0,r3
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	bne	cr0,.LcmpAB_lightweight
+
+	addi	r3,r3,8
+	addi	r4,r4,8
+	addi	r5,r5,-8
+
+4:
+	/* compare 32 bytes for each loop */
+	srdi	r0,r5,5
+	mtctr	r0
+	andi.	r5,r5,31
+	li	off16,16
+
+.balign 16
+5:
+	lvx 	v0,0,r3
+	lvx 	v1,0,r4
+	vcmpequd. v0,v0,v1
+	bf	24,7f
+	lvx 	v0,off16,r3
+	lvx 	v1,off16,r4
+	vcmpequd. v0,v0,v1
+	bf	24,6f
+	addi	r3,r3,32
+	addi	r4,r4,32
+	bdnz	5b
+
+	cmpdi	r5,0
+	beq	.Lzero
+	b	.L8bytes_aligned
+
+6:
+	addi	r3,r3,16
+	addi	r4,r4,16
+
+7:
+	LD	rA,0,r3
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	bne	cr0,.LcmpAB_lightweight
+
+	li	off8,8
+	LD	rA,off8,r3
+	LD	rB,off8,r4
+	cmpld	cr0,rA,rB
+	bne	cr0,.LcmpAB_lightweight
+	b	.Lzero
+#endif
 EXPORT_SYMBOL(memcmp)
diff --git a/arch/powerpc/lib/memcpy_power7.S b/arch/powerpc/lib/memcpy_power7.S
index 193909a..682e386 100644
--- a/arch/powerpc/lib/memcpy_power7.S
+++ b/arch/powerpc/lib/memcpy_power7.S
@@ -230,7 +230,7 @@ _GLOBAL(memcpy_power7)
 	std	r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
 	std	r0,16(r1)
 	stdu	r1,-STACKFRAMESIZE(r1)
-	bl	enter_vmx_copy
+	bl	enter_vmx_ops
 	cmpwi	cr1,r3,0
 	ld	r0,STACKFRAMESIZE+16(r1)
 	ld	r3,STK_REG(R31)(r1)
diff --git a/arch/powerpc/lib/vmx-helper.c b/arch/powerpc/lib/vmx-helper.c
index bf925cd..923a9ab 100644
--- a/arch/powerpc/lib/vmx-helper.c
+++ b/arch/powerpc/lib/vmx-helper.c
@@ -53,7 +53,7 @@ int exit_vmx_usercopy(void)
 	return 0;
 }
 
-int enter_vmx_copy(void)
+int enter_vmx_ops(void)
 {
 	if (in_interrupt())
 		return 0;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation
  2017-09-20 23:34 [PATCH v2 0/3] powerpc/64: memcmp() optimization wei.guo.simon
  2017-09-20 23:34 ` [PATCH v2 1/3] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
  2017-09-20 23:34 ` [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
@ 2017-09-20 23:34 ` wei.guo.simon
  2017-09-25  9:30   ` David Laight
  2 siblings, 1 reply; 18+ messages in thread
From: wei.guo.simon @ 2017-09-20 23:34 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Paul Mackerras, Michael Ellerman, Naveen N.  Rao, David Laight,
	Christophe LEROY, Simon Guo

From: Simon Guo <wei.guo.simon@gmail.com>

This patch adjusts the memcmp_64 selftest so that the memcmp selftest can
be compiled successfully.

It also adds test cases for memcmp() over 4K bytes in size.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
 .../selftests/powerpc/copyloops/asm/ppc_asm.h      |  2 +-
 .../selftests/powerpc/stringloops/asm/ppc_asm.h    | 31 +++++++++++
 .../testing/selftests/powerpc/stringloops/memcmp.c | 63 +++++++++++++++-------
 3 files changed, 75 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
index 80d34a9..a9da02d 100644
--- a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
@@ -35,7 +35,7 @@
 	li	r3,0
 	blr
 
-FUNC_START(enter_vmx_copy)
+FUNC_START(enter_vmx_ops)
 	li	r3,1
 	blr
 
diff --git a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
index 11bece8..c8bb360 100644
--- a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
@@ -1,3 +1,5 @@
+#ifndef _PPC_ASM_H
+#define __PPC_ASM_H
 #include <ppc-asm.h>
 
 #ifndef r1
@@ -5,3 +7,32 @@
 #endif
 
 #define _GLOBAL(A) FUNC_START(test_ ## A)
+
+#define CONFIG_ALTIVEC
+
+#define R14 r14
+#define R15 r15
+#define R16 r16
+#define R17 r17
+#define R18 r18
+#define R19 r19
+#define R20 r20
+#define R21 r21
+#define R22 r22
+#define R29 r29
+#define R30 r30
+#define R31 r31
+
+#define STACKFRAMESIZE	256
+#define STK_REG(i)	(112 + ((i)-14)*8)
+
+#define _GLOBAL(A) FUNC_START(test_ ## A)
+#define _GLOBAL_TOC(A) _GLOBAL(A)
+
+#define PPC_MTOCRF(A, B)	mtocrf A, B
+
+FUNC_START(enter_vmx_ops)
+	li      r3, 1
+	blr
+
+#endif
diff --git a/tools/testing/selftests/powerpc/stringloops/memcmp.c b/tools/testing/selftests/powerpc/stringloops/memcmp.c
index 30b1222..4826669 100644
--- a/tools/testing/selftests/powerpc/stringloops/memcmp.c
+++ b/tools/testing/selftests/powerpc/stringloops/memcmp.c
@@ -1,20 +1,27 @@
 #include <malloc.h>
 #include <stdlib.h>
 #include <string.h>
+#include <time.h>
 #include "utils.h"
 
 #define SIZE 256
 #define ITERATIONS 10000
 
+#define LARGE_SIZE (5 * 1024)
+#define LARGE_ITERATIONS 1000
+#define LARGE_MAX_OFFSET 16
+#define LARGE_SIZE_START 4096
+
 int test_memcmp(const void *s1, const void *s2, size_t n);
 
 /* test all offsets and lengths */
-static void test_one(char *s1, char *s2)
+static void test_one(char *s1, char *s2, unsigned long max_offset,
+		unsigned long size_start)
 {
 	unsigned long offset, size;
 
-	for (offset = 0; offset < SIZE; offset++) {
-		for (size = 0; size < (SIZE-offset); size++) {
+	for (offset = 0; offset < max_offset; offset++) {
+		for (size = size_start; size < (SIZE-offset); size++) {
 			int x, y;
 			unsigned long i;
 
@@ -38,66 +45,82 @@ static void test_one(char *s1, char *s2)
 	}
 }
 
-static int testcase(void)
+static int testcase(bool islarge)
 {
 	char *s1;
 	char *s2;
 	unsigned long i;
 
-	s1 = memalign(128, SIZE);
+	unsigned long alloc_size = islarge ? LARGE_SIZE : SIZE;
+	int iterations = islarge ? LARGE_ITERATIONS : ITERATIONS;
+
+	s1 = memalign(128, alloc_size);
 	if (!s1) {
 		perror("memalign");
 		exit(1);
 	}
 
-	s2 = memalign(128, SIZE);
+	s2 = memalign(128, alloc_size);
 	if (!s2) {
 		perror("memalign");
 		exit(1);
 	}
 
-	srandom(1);
+	srandom(time(0));
 
-	for (i = 0; i < ITERATIONS; i++) {
+	for (i = 0; i < iterations; i++) {
 		unsigned long j;
 		unsigned long change;
 
-		for (j = 0; j < SIZE; j++)
+		for (j = 0; j < alloc_size; j++)
 			s1[j] = random();
 
-		memcpy(s2, s1, SIZE);
+		memcpy(s2, s1, alloc_size);
 
 		/* change one byte */
-		change = random() % SIZE;
+		change = random() % alloc_size;
 		s2[change] = random() & 0xff;
 
-		test_one(s1, s2);
+		if (islarge)
+			test_one(s1, s2, LARGE_MAX_OFFSET, LARGE_SIZE_START);
+		else
+			test_one(s1, s2, SIZE, 0);
 	}
 
-	srandom(1);
+	srandom(time(0));
 
-	for (i = 0; i < ITERATIONS; i++) {
+	for (i = 0; i < iterations; i++) {
 		unsigned long j;
 		unsigned long change;
 
-		for (j = 0; j < SIZE; j++)
+		for (j = 0; j < alloc_size; j++)
 			s1[j] = random();
 
-		memcpy(s2, s1, SIZE);
+		memcpy(s2, s1, alloc_size);
 
 		/* change multiple bytes, 1/8 of total */
-		for (j = 0; j < SIZE / 8; j++) {
-			change = random() % SIZE;
+		for (j = 0; j < alloc_size / 8; j++) {
+			change = random() % alloc_size;
 			s2[change] = random() & 0xff;
 		}
 
-		test_one(s1, s2);
+		if (islarge)
+			test_one(s1, s2, LARGE_MAX_OFFSET, LARGE_SIZE_START);
+		else
+			test_one(s1, s2, SIZE, 0);
 	}
 
 	return 0;
 }
 
+static int testcases(void)
+{
+	testcase(0);
+	testcase(1);
+	return 0;
+}
+
 int main(void)
 {
-	return test_harness(testcase, "memcmp");
+	return test_harness(testcases, "memcmp");
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2017-09-20 23:34 ` [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
@ 2017-09-21  0:54   ` Simon Guo
  2017-09-22 14:06   ` Cyril Bur
  1 sibling, 0 replies; 18+ messages in thread
From: Simon Guo @ 2017-09-21  0:54 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Paul Mackerras, Michael Ellerman, Naveen N.  Rao, David Laight,
	Christophe LEROY

Hi,
On Thu, Sep 21, 2017 at 07:34:39AM +0800, wei.guo.simon@gmail.com wrote:
> From: Simon Guo <wei.guo.simon@gmail.com>
> 
> This patch add VMX primitives to do memcmp() in case the compare size
> exceeds 4K bytes.
> 
> Test result with following test program(replace the "^>" with ""):
> ------
I missed the exit_vmx_ops() part and need to rework on v3.

Thanks,
- Simon

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2017-09-20 23:34 ` [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
  2017-09-21  0:54   ` Simon Guo
@ 2017-09-22 14:06   ` Cyril Bur
  2017-09-23 21:18     ` Simon Guo
  1 sibling, 1 reply; 18+ messages in thread
From: Cyril Bur @ 2017-09-22 14:06 UTC (permalink / raw)
  To: wei.guo.simon, linuxppc-dev; +Cc: David Laight, Naveen N.  Rao

On Thu, 2017-09-21 at 07:34 +0800, wei.guo.simon@gmail.com wrote:
> From: Simon Guo <wei.guo.simon@gmail.com>
> 
> This patch add VMX primitives to do memcmp() in case the compare size
> exceeds 4K bytes.
> 

Hi Simon,

Sorry I didn't see this sooner, I've actually been working on a kernel
version of glibc commit dec4a7105e (powerpc: Improve memcmp performance
for POWER8) unfortunately I've been distracted and it still isn't done.
I wonder if we can consolidate our efforts here. One thing I did come
across in my testing is that for memcmp() that will fail early (I
haven't narrowed down the optimal number yet) the cost of enabling
VMX actually turns out to be a performance regression, as such I've
added a small check of the first 64 bytes to the start before enabling
VMX to ensure the penalty is worth taking.
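
A minimal sketch of that pre-check idea (hypothetical name, and plain
memcmp() standing in for both the scalar check and the vectorised path):

#include <stddef.h>
#include <string.h>

static int memcmp_vmx_gated(const void *s1, const void *s2, size_t n)
{
	size_t head = n < 64 ? n : 64;
	int ret = memcmp(s1, s2, head);		/* cheap scalar pre-check */

	if (ret || n <= 64)
		return ret;			/* failed early: VMX never enabled */

	/* Only now is it worth paying the cost of enabling VMX; the rest
	 * would go to the vectorised compare (memcmp() stands in here). */
	return memcmp((const char *)s1 + 64, (const char *)s2 + 64, n - 64);
}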

Also, you should consider doing 4K and greater, KSM (Kernel Samepage
Merging) uses PAGE_SIZE which can be as small as 4K.

Cyril

> Test result with following test program(replace the "^>" with ""):
> ------
> > # cat tools/testing/selftests/powerpc/stringloops/memcmp.c
> > #include <malloc.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <time.h>
> > #include "utils.h"
> > #define SIZE (1024 * 1024 * 900)
> > #define ITERATIONS 40
> 
> int test_memcmp(const void *s1, const void *s2, size_t n);
> 
> static int testcase(void)
> {
>         char *s1;
>         char *s2;
>         unsigned long i;
> 
>         s1 = memalign(128, SIZE);
>         if (!s1) {
>                 perror("memalign");
>                 exit(1);
>         }
> 
>         s2 = memalign(128, SIZE);
>         if (!s2) {
>                 perror("memalign");
>                 exit(1);
>         }
> 
>         for (i = 0; i < SIZE; i++)  {
>                 s1[i] = i & 0xff;
>                 s2[i] = i & 0xff;
>         }
>         for (i = 0; i < ITERATIONS; i++) {
> 		int ret = test_memcmp(s1, s2, SIZE);
> 
> 		if (ret) {
> 			printf("return %d at[%ld]! should have returned zero\n", ret, i);
> 			abort();
> 		}
> 	}
> 
>         return 0;
> }
> 
> int main(void)
> {
>         return test_harness(testcase, "memcmp");
> }
> ------
> Without VMX patch:
>        7.435191479 seconds time elapsed                                          ( +- 0.51% )
> With VMX patch:
>        6.802038938 seconds time elapsed                                          ( +- 0.56% )
> 		There is ~+8% improvement.
> 
> However I am not aware whether there is use case in kernel for memcmp on
> large size yet.
> 
> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
> ---
>  arch/powerpc/include/asm/asm-prototypes.h |  2 +-
>  arch/powerpc/lib/copypage_power7.S        |  2 +-
>  arch/powerpc/lib/memcmp_64.S              | 82 +++++++++++++++++++++++++++++++
>  arch/powerpc/lib/memcpy_power7.S          |  2 +-
>  arch/powerpc/lib/vmx-helper.c             |  2 +-
>  5 files changed, 86 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
> index 7330150..e6530d8 100644
> --- a/arch/powerpc/include/asm/asm-prototypes.h
> +++ b/arch/powerpc/include/asm/asm-prototypes.h
> @@ -49,7 +49,7 @@ void __trace_hcall_exit(long opcode, unsigned long retval,
>  /* VMX copying */
>  int enter_vmx_usercopy(void);
>  int exit_vmx_usercopy(void);
> -int enter_vmx_copy(void);
> +int enter_vmx_ops(void);
>  void * exit_vmx_copy(void *dest);
>  
>  /* Traps */
> diff --git a/arch/powerpc/lib/copypage_power7.S b/arch/powerpc/lib/copypage_power7.S
> index ca5fc8f..9e7729e 100644
> --- a/arch/powerpc/lib/copypage_power7.S
> +++ b/arch/powerpc/lib/copypage_power7.S
> @@ -60,7 +60,7 @@ _GLOBAL(copypage_power7)
>  	std	r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
>  	std	r0,16(r1)
>  	stdu	r1,-STACKFRAMESIZE(r1)
> -	bl	enter_vmx_copy
> +	bl	enter_vmx_ops
>  	cmpwi	r3,0
>  	ld	r0,STACKFRAMESIZE+16(r1)
>  	ld	r3,STK_REG(R31)(r1)
> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index 6dccfb8..40218fc 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -162,6 +162,13 @@ _GLOBAL(memcmp)
>  	blr
>  
>  .Llong:
> +#ifdef CONFIG_ALTIVEC
> +	/* Try to use vmx loop if length is larger than 4K */
> +	cmpldi  cr6,r5,4096
> +	bgt	cr6,.Lvmx_cmp
> +
> +.Llong_novmx_cmp:
> +#endif
>  	li	off8,8
>  	li	off16,16
>  	li	off24,24
> @@ -319,4 +326,79 @@ _GLOBAL(memcmp)
>  8:
>  	blr
>  
> +#ifdef CONFIG_ALTIVEC
> +.Lvmx_cmp:
> +	mflr    r0
> +	std     r3,-STACKFRAMESIZE+STK_REG(R31)(r1)
> +	std     r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
> +	std     r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
> +	std     r0,16(r1)
> +	stdu    r1,-STACKFRAMESIZE(r1)
> +	bl      enter_vmx_ops
> +	cmpwi   cr1,r3,0
> +	ld      r0,STACKFRAMESIZE+16(r1)
> +	ld      r3,STK_REG(R31)(r1)
> +	ld      r4,STK_REG(R30)(r1)
> +	ld      r5,STK_REG(R29)(r1)
> +	addi	r1,r1,STACKFRAMESIZE
> +	mtlr    r0
> +	beq     cr1,.Llong_novmx_cmp
> +
> +3:
> +	/* Enter with src/dst address 8 bytes aligned, and len is
> +	 * no less than 4KB. Need to align with 16 bytes further.
> +	 */
> +	andi.	rA,r3,8
> +	beq	4f
> +	LD	rA,0,r3
> +	LD	rB,0,r4
> +	cmpld	cr0,rA,rB
> +	bne	cr0,.LcmpAB_lightweight
> +
> +	addi	r3,r3,8
> +	addi	r4,r4,8
> +	addi	r5,r5,-8
> +
> +4:
> +	/* compare 32 bytes for each loop */
> +	srdi	r0,r5,5
> +	mtctr	r0
> +	andi.	r5,r5,31
> +	li	off16,16
> +
> +.balign 16
> +5:
> +	lvx 	v0,0,r3
> +	lvx 	v1,0,r4
> +	vcmpequd. v0,v0,v1
> +	bf	24,7f
> +	lvx 	v0,off16,r3
> +	lvx 	v1,off16,r4
> +	vcmpequd. v0,v0,v1
> +	bf	24,6f
> +	addi	r3,r3,32
> +	addi	r4,r4,32
> +	bdnz	5b
> +
> +	cmpdi	r5,0
> +	beq	.Lzero
> +	b	.L8bytes_aligned
> +
> +6:
> +	addi	r3,r3,16
> +	addi	r4,r4,16
> +
> +7:
> +	LD	rA,0,r3
> +	LD	rB,0,r4
> +	cmpld	cr0,rA,rB
> +	bne	cr0,.LcmpAB_lightweight
> +
> +	li	off8,8
> +	LD	rA,off8,r3
> +	LD	rB,off8,r4
> +	cmpld	cr0,rA,rB
> +	bne	cr0,.LcmpAB_lightweight
> +	b	.Lzero
> +#endif
>  EXPORT_SYMBOL(memcmp)
> diff --git a/arch/powerpc/lib/memcpy_power7.S b/arch/powerpc/lib/memcpy_power7.S
> index 193909a..682e386 100644
> --- a/arch/powerpc/lib/memcpy_power7.S
> +++ b/arch/powerpc/lib/memcpy_power7.S
> @@ -230,7 +230,7 @@ _GLOBAL(memcpy_power7)
>  	std	r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
>  	std	r0,16(r1)
>  	stdu	r1,-STACKFRAMESIZE(r1)
> -	bl	enter_vmx_copy
> +	bl	enter_vmx_ops
>  	cmpwi	cr1,r3,0
>  	ld	r0,STACKFRAMESIZE+16(r1)
>  	ld	r3,STK_REG(R31)(r1)
> diff --git a/arch/powerpc/lib/vmx-helper.c b/arch/powerpc/lib/vmx-helper.c
> index bf925cd..923a9ab 100644
> --- a/arch/powerpc/lib/vmx-helper.c
> +++ b/arch/powerpc/lib/vmx-helper.c
> @@ -53,7 +53,7 @@ int exit_vmx_usercopy(void)
>  	return 0;
>  }
>  
> -int enter_vmx_copy(void)
> +int enter_vmx_ops(void)
>  {
>  	if (in_interrupt())
>  		return 0;

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2017-09-22 14:06   ` Cyril Bur
@ 2017-09-23 21:18     ` Simon Guo
  2017-09-25 23:59       ` Cyril Bur
  0 siblings, 1 reply; 18+ messages in thread
From: Simon Guo @ 2017-09-23 21:18 UTC (permalink / raw)
  To: Cyril Bur; +Cc: linuxppc-dev, David Laight, Naveen N.  Rao

Hi Cyril,
On Sat, Sep 23, 2017 at 12:06:48AM +1000, Cyril Bur wrote:
> On Thu, 2017-09-21 at 07:34 +0800, wei.guo.simon@gmail.com wrote:
> > From: Simon Guo <wei.guo.simon@gmail.com>
> > 
> > This patch add VMX primitives to do memcmp() in case the compare size
> > exceeds 4K bytes.
> > 
> 
> Hi Simon,
> 
> Sorry I didn't see this sooner, I've actually been working on a kernel
> version of glibc commit dec4a7105e (powerpc: Improve memcmp performance
> for POWER8) unfortunately I've been distracted and it still isn't done.
Thanks for sync with me. Let's consolidate our effort together :)

I have a quick check on glibc commit dec4a7105e. 
Looks the aligned case comparison with VSX is launched without rN size
limitation, which means it will have a VSX reg load penalty even when the 
length is 9 bytes.

It did some optimization when src/dest addrs don't have the same offset 
on 8 bytes alignment boundary. I need to read more closely.

> I wonder if we can consolidate our efforts here. One thing I did come
> across in my testing is that for memcmp() that will fail early (I
> haven't narrowed down the optimal number yet) the cost of enabling
> VMX actually turns out to be a performance regression, as such I've
> added a small check of the first 64 bytes to the start before enabling
> VMX to ensure the penalty is worth taking.
Will there still be a penalty if the 65th byte differs?  

> 
> Also, you should consider doing 4K and greater, KSM (Kernel Samepage
> Merging) uses PAGE_SIZE which can be as small as 4K.
Currently the VMX will only be applied when size exceeds 4K. Are you
suggesting a bigger threshold than 4K?

We can sync more offline for v3.

Thanks,
- Simon

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation
  2017-09-25  9:30   ` David Laight
@ 2017-09-24  6:19     ` Simon Guo
  0 siblings, 0 replies; 18+ messages in thread
From: Simon Guo @ 2017-09-24  6:19 UTC (permalink / raw)
  To: David Laight
  Cc: linuxppc-dev, Paul Mackerras, Michael Ellerman, Naveen N.  Rao,
	Christophe LEROY

Hi David,
On Mon, Sep 25, 2017 at 09:30:28AM +0000, David Laight wrote:
> From: wei.guo.simon@gmail.com
> > Sent: 21 September 2017 00:35
> > This patch adjust selftest memcmp_64 so that memcmp selftest can be
> > compiled successfully.
> ...
> >  #define ITERATIONS 10000
> > 
> > +#define LARGE_SIZE (5 * 1024)
> > +#define LARGE_ITERATIONS 1000
> ...
> 
> Measuring performance by doing a lot of iterations isn't ideal
> and is pretty pointless.
> Cold cache performance can be more useful.
> Also you don't really want any dynamic branch prediction logic
> tuned to the exact test you keep doing.

I think the (original) selftest aims at full coverage of functional
correctness, since each iteration generates a new data set at random.

Thanks,
- Simon

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation
  2017-09-20 23:34 ` [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation wei.guo.simon
@ 2017-09-25  9:30   ` David Laight
  2017-09-24  6:19     ` Simon Guo
  0 siblings, 1 reply; 18+ messages in thread
From: David Laight @ 2017-09-25  9:30 UTC (permalink / raw)
  To: 'wei.guo.simon@gmail.com', linuxppc-dev
  Cc: Paul Mackerras, Michael Ellerman, Naveen N.  Rao, Christophe LEROY

From: wei.guo.simon@gmail.com
> Sent: 21 September 2017 00:35
> This patch adjust selftest memcmp_64 so that memcmp selftest can be
> compiled successfully.
...
>  #define ITERATIONS 10000
>
> +#define LARGE_SIZE (5 * 1024)
> +#define LARGE_ITERATIONS 1000
...

Measuring performance by doing a lot of iterations isn't ideal
and is pretty pointless.
Cold cache performance can be more useful.
Also you don't really want any dynamic branch prediction logic
tuned to the exact test you keep doing.

	David

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2017-09-23 21:18     ` Simon Guo
@ 2017-09-25 23:59       ` Cyril Bur
  2017-09-26  5:34         ` Michael Ellerman
  0 siblings, 1 reply; 18+ messages in thread
From: Cyril Bur @ 2017-09-25 23:59 UTC (permalink / raw)
  To: Simon Guo; +Cc: linuxppc-dev, David Laight, Naveen N.  Rao

On Sun, 2017-09-24 at 05:18 +0800, Simon Guo wrote:
> Hi Cyril,
> On Sat, Sep 23, 2017 at 12:06:48AM +1000, Cyril Bur wrote:
> > On Thu, 2017-09-21 at 07:34 +0800, wei.guo.simon@gmail.com wrote:
> > > From: Simon Guo <wei.guo.simon@gmail.com>
> > > 
> > > This patch add VMX primitives to do memcmp() in case the compare size
> > > exceeds 4K bytes.
> > > 
> > 
> > Hi Simon,
> > 
> > Sorry I didn't see this sooner, I've actually been working on a kernel
> > version of glibc commit dec4a7105e (powerpc: Improve memcmp performance
> > for POWER8) unfortunately I've been distracted and it still isn't done.
> 
> Thanks for sync with me. Let's consolidate our effort together :)
> 
> I have a quick check on glibc commit dec4a7105e. 
> Looks the aligned case comparison with VSX is launched without rN size
> limitation, which means it will have a VSX reg load penalty even when the 
> length is 9 bytes.
> 

This was written for userspace which doesn't have to explicitly enable
VMX in order to use it - we need to be smarter in the kernel.

> It did some optimization when src/dest addrs don't have the same offset 
> on 8 bytes alignment boundary. I need to read more closely.
> 
> > I wonder if we can consolidate our efforts here. One thing I did come
> > across in my testing is that for memcmp() that will fail early (I
> > haven't narrowed down the optimal number yet) the cost of enabling
> > VMX actually turns out to be a performance regression, as such I've
> > added a small check of the first 64 bytes to the start before enabling
> > VMX to ensure the penalty is worth taking.
> 
> Will there still be a penalty if the 65th byte differs?  
> 

I haven't benchmarked it exactly; my rationale for 64 bytes was that it
is the stride of the vectorised copy loop, so if we know we'll fail
before even completing one iteration of the vectorised loop there isn't
any point using the vector regs.

> > 
> > Also, you should consider doing 4K and greater, KSM (Kernel Samepage
> > Merging) uses PAGE_SIZE which can be as small as 4K.
> 
> Currently the VMX will only be applied when size exceeds 4K. Are you
> suggesting a bigger threshold than 4K?
> 

Equal to or greater than 4K, KSM will benefit.

> We can sync more offline for v3.
> 
> Thanks,
> - Simon

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2017-09-25 23:59       ` Cyril Bur
@ 2017-09-26  5:34         ` Michael Ellerman
  2017-09-26 11:26           ` Segher Boessenkool
  0 siblings, 1 reply; 18+ messages in thread
From: Michael Ellerman @ 2017-09-26  5:34 UTC (permalink / raw)
  To: Cyril Bur, Simon Guo; +Cc: Naveen N.  Rao, David Laight, linuxppc-dev, raji

Cyril Bur <cyrilbur@gmail.com> writes:

> On Sun, 2017-09-24 at 05:18 +0800, Simon Guo wrote:
>> Hi Cyril,
>> On Sat, Sep 23, 2017 at 12:06:48AM +1000, Cyril Bur wrote:
>> > On Thu, 2017-09-21 at 07:34 +0800, wei.guo.simon@gmail.com wrote:
>> > > From: Simon Guo <wei.guo.simon@gmail.com>
>> > > 
>> > > This patch add VMX primitives to do memcmp() in case the compare size
>> > > exceeds 4K bytes.
>> > 
>> > Sorry I didn't see this sooner, I've actually been working on a kernel
>> > version of glibc commit dec4a7105e (powerpc: Improve memcmp performance
>> > for POWER8) unfortunately I've been distracted and it still isn't done.
>> 
>> Thanks for sync with me. Let's consolidate our effort together :)
>> 
>> I have a quick check on glibc commit dec4a7105e. 
>> Looks the aligned case comparison with VSX is launched without rN size
>> limitation, which means it will have a VSX reg load penalty even when the 
>> length is 9 bytes.
>> 
>
> This was written for userspace which doesn't have to explicitly enable
> VMX in order to use it - we need to be smarter in the kernel.

Well the kernel has to do it for them after a trap, which is actually
even more expensive, so arguably the glibc code should be smarter too
and the threshold before using VMX should probably be higher than in the
kernel (to cover the cost of the trap).

But I digress :)

cheers

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2017-09-26  5:34         ` Michael Ellerman
@ 2017-09-26 11:26           ` Segher Boessenkool
  2017-09-27  3:38             ` Michael Ellerman
  0 siblings, 1 reply; 18+ messages in thread
From: Segher Boessenkool @ 2017-09-26 11:26 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Cyril Bur, Simon Guo, raji, Naveen N.  Rao, David Laight, linuxppc-dev

On Tue, Sep 26, 2017 at 03:34:36PM +1000, Michael Ellerman wrote:
> Cyril Bur <cyrilbur@gmail.com> writes:
> > This was written for userspace which doesn't have to explicitly enable
> > VMX in order to use it - we need to be smarter in the kernel.
> 
> Well the kernel has to do it for them after a trap, which is actually
> even more expensive, so arguably the glibc code should be smarter too
> and the threshold before using VMX should probably be higher than in the
> kernel (to cover the cost of the trap).

A lot of userspace code uses V*X, more and more with newer CPUs and newer
compiler versions.  If you already paid the price for using vector
registers you do not need to again :-)

> But I digress :)

Yeah sorry :-)


Segher

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2017-09-26 11:26           ` Segher Boessenkool
@ 2017-09-27  3:38             ` Michael Ellerman
  2017-09-27  9:27               ` Segher Boessenkool
  2017-09-27 16:22               ` Simon Guo
  0 siblings, 2 replies; 18+ messages in thread
From: Michael Ellerman @ 2017-09-27  3:38 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Cyril Bur, Simon Guo, raji, Naveen N.  Rao, David Laight, linuxppc-dev

Segher Boessenkool <segher@kernel.crashing.org> writes:

> On Tue, Sep 26, 2017 at 03:34:36PM +1000, Michael Ellerman wrote:
>> Cyril Bur <cyrilbur@gmail.com> writes:
>> > This was written for userspace which doesn't have to explicitly enable
>> > VMX in order to use it - we need to be smarter in the kernel.
>> 
>> Well the kernel has to do it for them after a trap, which is actually
>> even more expensive, so arguably the glibc code should be smarter too
>> and the threshold before using VMX should probably be higher than in the
>> kernel (to cover the cost of the trap).
>
> A lot of userspace code uses V*X, more and more with newer CPUs and newer
> compiler versions.  If you already paid the price for using vector
> registers you do not need to again :-)

True, but you don't know if you've paid the price already.

You also pay the price on every context switch (more state to switch),
so it's not free even once enabled. Which is why the kernel will
eventually turn it off if it's unused again.

But now that I've actually looked at the glibc version, it does do some
checks for minimum length before doing any vector instructions, so
that's probably all we want. The exact trade off between checking some
bytes without vector vs turning on vector depends on your input data, so
it's tricky to tune in general.

>> But I digress :)
>
> Yeah sorry :-)

:)

cheers

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2017-09-27  3:38             ` Michael Ellerman
@ 2017-09-27  9:27               ` Segher Boessenkool
  2017-09-27  9:43                 ` David Laight
  2017-09-27 16:22               ` Simon Guo
  1 sibling, 1 reply; 18+ messages in thread
From: Segher Boessenkool @ 2017-09-27  9:27 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Cyril Bur, Simon Guo, raji, Naveen N.  Rao, David Laight, linuxppc-dev

On Wed, Sep 27, 2017 at 01:38:09PM +1000, Michael Ellerman wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > A lot of userspace code uses V*X, more and more with newer CPUs and newer
> > compiler versions.  If you already paid the price for using vector
> > registers you do not need to again :-)
> 
> True, but you don't know if you've paid the price already.
> 
> You also pay the price on every context switch (more state to switch),
> so it's not free even once enabled. Which is why the kernel will
> eventually turn it off if it's unused again.

Yup.  But my point is that because user space code uses vector registers
more and more, the penalty for user space code to use vector registers
even more keeps shrinking.

> But now that I've actually looked at the glibc version, it does do some
> checks for minimum length before doing any vector instructions, so
> that's probably all we want. The exact trade off between checking some
> bytes without vector vs turning on vector depends on your input data, so
> it's tricky to tune in general.

You also need nasty code to deal with the start and end of strings, with
conditional branches and whatnot, which quickly overwhelms the benefit
of using vector registers at all.  This tradeoff also changes with newer
ISA versions.

Things have to become *really* cheap before it will be good to often use
vector registers in the kernel though.


Segher

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2017-09-27  9:27               ` Segher Boessenkool
@ 2017-09-27  9:43                 ` David Laight
  2017-09-27 18:33                   ` Simon Guo
  0 siblings, 1 reply; 18+ messages in thread
From: David Laight @ 2017-09-27  9:43 UTC (permalink / raw)
  To: 'Segher Boessenkool', Michael Ellerman
  Cc: Cyril Bur, Simon Guo, raji, Naveen N.  Rao, linuxppc-dev

From: Segher Boessenkool
> Sent: 27 September 2017 10:28
...
> You also need nasty code to deal with the start and end of strings, with
> conditional branches and whatnot, which quickly overwhelms the benefit
> of using vector registers at all.  This tradeoff also changes with newer
> ISA versions.

The goal posts keep moving.
For instance with modern intel x86 cpus 'rep movsb' is by far the fastest
way to copy data (from cached memory).

> Things have to become *really* cheap before it will be good to often use
> vector registers in the kernel though.

I've had thoughts about this in the past.
If the vector registers belong to the current process then you might
get away with just saving the ones you want to use.
If they belong to a different process then you also need to tell the
FPU save code where you've saved the registers.
Then the IPI code can recover all the correct values.

On X86 all the AVX registers are caller saved, the system call
entry could issue the instruction that invalidates them all.
Kernel code running in the context of a user process could then
use the registers without saving them.
It would only need to set a mark to ensure they are invalidated
again on return to user (might be cheap enough to do anyway).
Dunno about PPC though.

	David

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2017-09-27  3:38             ` Michael Ellerman
  2017-09-27  9:27               ` Segher Boessenkool
@ 2017-09-27 16:22               ` Simon Guo
  1 sibling, 0 replies; 18+ messages in thread
From: Simon Guo @ 2017-09-27 16:22 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Segher Boessenkool, Cyril Bur, raji, Naveen N.  Rao,
	David Laight, linuxppc-dev

Hi Michael,
On Wed, Sep 27, 2017 at 01:38:09PM +1000, Michael Ellerman wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> 
> > On Tue, Sep 26, 2017 at 03:34:36PM +1000, Michael Ellerman wrote:
> >> Cyril Bur <cyrilbur@gmail.com> writes:
> >> > This was written for userspace which doesn't have to explicitly enable
> >> > VMX in order to use it - we need to be smarter in the kernel.
> >> 
> >> Well the kernel has to do it for them after a trap, which is actually
> >> even more expensive, so arguably the glibc code should be smarter too
> >> and the threshold before using VMX should probably be higher than in the
> >> kernel (to cover the cost of the trap).
> >
> > A lot of userspace code uses V*X, more and more with newer CPUs and newer
> > compiler versions.  If you already paid the price for using vector
> > registers you do not need to again :-)
> 
> True, but you don't know if you've paid the price already.
> 
> You also pay the price on every context switch (more state to switch),
> so it's not free even once enabled. Which is why the kernel will
> eventually turn it off if it's unused again.
> 
> But now that I've actually looked at the glibc version, it does do some
> checks for minimum length before doing any vector instructions, so

It looks like the glibc version will use VSX instructions and lead to a trap
for a 9-byte compare with src/dst 16-byte aligned.
 132         /* Now both rSTR1 and rSTR2 are aligned to QW.  */
 133         .align  4
 134 L(qw_align):
 135         vspltisb        v0, 0
 136         srdi.   r6, rN, 6
 137         li      r8, 16
 138         li      r10, 32
 139         li      r11, 48
 140         ble     cr0, L(lessthan64)
 141         mtctr   r6
 142         vspltisb        v8, 0
 143         vspltisb        v6, 0
 144         /* Aligned vector loop.  */
 145         .align  4
Line 135 is the VSX instruction causing the trap. Did I miss anything?

> that's probably all we want. The exact trade off between checking some
> bytes without vector vs turning on vector depends on your input data, so
> it's tricky to tune in general.

I discussed offline with Cyril. The plan for v3 is to use >= 4KB as the
minimum length before the vector regs step in. Cyril will later consolidate
his existing work on KSM optimization, which will probably add a 64-byte
comparison ahead of time to determine whether it is an early- or
late-matching pattern.

Cyril also has some other valuable comments and I will rework them for v3.

Is that OK with you?

Thanks,
- Simon

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2017-09-27  9:43                 ` David Laight
@ 2017-09-27 18:33                   ` Simon Guo
  2017-09-28  9:24                     ` David Laight
  0 siblings, 1 reply; 18+ messages in thread
From: Simon Guo @ 2017-09-27 18:33 UTC (permalink / raw)
  To: David Laight
  Cc: 'Segher Boessenkool',
	Michael Ellerman, Cyril Bur, raji, Naveen N.  Rao, linuxppc-dev

On Wed, Sep 27, 2017 at 09:43:44AM +0000, David Laight wrote:
> From: Segher Boessenkool
> > Sent: 27 September 2017 10:28
> ...
> > You also need nasty code to deal with the start and end of strings, with
> > conditional branches and whatnot, which quickly overwhelms the benefit
> > of using vector registers at all.  This tradeoff also changes with newer
> > ISA versions.
> 
> The goal posts keep moving.
> For instance with modern intel x86 cpus 'rep movsb' is by far the fastest
> way to copy data (from cached memory).
> 
> > Things have to become *really* cheap before it will be good to often use
> > vector registers in the kernel though.
> 
> I've had thoughts about this in the past.
> If the vector registers belong to the current process then you might
> get away with just saving the ones you want to use.
> If they belong to a different process then you also need to tell the
> FPU save code where you've saved the registers.
> Then the IPI code can recover all the correct values.
> 
> On X86 all the AVX registers are caller saved, the system call
> entry could issue the instruction that invalidates them all.
> Kernel code running in the context of a user process could then
> use the registers without saving them.
> It would only need to set a mark to ensure they are invalidated
> again on return to user (might be cheap enough to do anyway).
> Dunno about PPC though.

I am not aware of any ppc instruction which can set a "mark" or provide 
any high granularity flag against single or subgroup of vec regs' validity.
But ppc experts may want to correct me.

Thanks,
- Simon

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
  2017-09-27 18:33                   ` Simon Guo
@ 2017-09-28  9:24                     ` David Laight
  0 siblings, 0 replies; 18+ messages in thread
From: David Laight @ 2017-09-28  9:24 UTC (permalink / raw)
  To: 'Simon Guo'
  Cc: 'Segher Boessenkool',
	Michael Ellerman, Cyril Bur, raji, Naveen N.  Rao, linuxppc-dev

From: Simon Guo
> Sent: 27 September 2017 19:34
...
> > On X86 all the AVX registers are caller saved, the system call
> > entry could issue the instruction that invalidates them all.
> > Kernel code running in the context of a user process could then
> > use the registers without saving them.
> > It would only need to set a mark to ensure they are invalidated
> > again on return to user (might be cheap enough to do anyway).
> > Dunno about PPC though.
>
> I am not aware of any ppc instruction which can set a "mark" or provide
> any high granularity flag against single or subgroup of vec regs' validity.
> But ppc experts may want to correct me.

I was just thinking of a software flag.

	David

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-09-28  9:24 UTC | newest]

Thread overview: 18+ messages
2017-09-20 23:34 [PATCH v2 0/3] powerpc/64: memcmp() optimization wei.guo.simon
2017-09-20 23:34 ` [PATCH v2 1/3] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() wei.guo.simon
2017-09-20 23:34 ` [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision wei.guo.simon
2017-09-21  0:54   ` Simon Guo
2017-09-22 14:06   ` Cyril Bur
2017-09-23 21:18     ` Simon Guo
2017-09-25 23:59       ` Cyril Bur
2017-09-26  5:34         ` Michael Ellerman
2017-09-26 11:26           ` Segher Boessenkool
2017-09-27  3:38             ` Michael Ellerman
2017-09-27  9:27               ` Segher Boessenkool
2017-09-27  9:43                 ` David Laight
2017-09-27 18:33                   ` Simon Guo
2017-09-28  9:24                     ` David Laight
2017-09-27 16:22               ` Simon Guo
2017-09-20 23:34 ` [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation wei.guo.simon
2017-09-25  9:30   ` David Laight
2017-09-24  6:19     ` Simon Guo
