* [PATCH 0/4] Optimize memcpy for AVX512 platforms
@ 2016-01-14  6:13 Zhihong Wang
  2016-01-14  6:13 ` [PATCH 1/4] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-14  6:13 UTC (permalink / raw)
  To: dev

This patch set optimizes DPDK memcpy for AVX512 platforms so that it makes
full use of the hardware and delivers higher performance.

In current DPDK, memcpy accounts for a large proportion of execution time
in libraries such as Vhost, especially for large packets, so this patch set
can bring considerable benefits.

The implementation is based on the current DPDK memcpy framework. Some
background can be found in these threads:
http://dpdk.org/ml/archives/dev/2014-November/008158.html
http://dpdk.org/ml/archives/dev/2015-January/011800.html

Code changes are:

  1. Read CPUID to check if AVX512 is supported by CPU

  2. Predefine AVX512 macro if AVX512 is enabled by compiler

  3. Implement AVX512 memcpy and choose the right implementation based on
     predefined macros

  4. Decide alignment unit for memcpy perf test based on predefined macros

Zhihong Wang (4):
  lib/librte_eal: Identify AVX512 CPU flag
  mk: Predefine AVX512 macro for compiler
  lib/librte_eal: Optimize memcpy for AVX512 platforms
  app/test: Adjust alignment unit for memcpy perf test

 app/test/test_memcpy_perf.c                        |   6 +
 .../common/include/arch/x86/rte_cpuflags.h         |   2 +
 .../common/include/arch/x86/rte_memcpy.h           | 247 ++++++++++++++++++++-
 mk/rte.cpuflags.mk                                 |   4 +
 4 files changed, 255 insertions(+), 4 deletions(-)

-- 
2.5.0


* [PATCH 1/4] lib/librte_eal: Identify AVX512 CPU flag
  2016-01-14  6:13 [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
@ 2016-01-14  6:13 ` Zhihong Wang
  2016-01-14  6:13 ` [PATCH 2/4] mk: Predefine AVX512 macro for compiler Zhihong Wang
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-14  6:13 UTC (permalink / raw)
  To: dev

Read CPUID leaf 7 (EBX bit 16) to check whether the CPU supports AVX512F,
and add the matching RTE_CPUFLAG_AVX512F flag.

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 lib/librte_eal/common/include/arch/x86/rte_cpuflags.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
index dd56553..89c0d9d 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
@@ -131,6 +131,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_ERMS,                   /**< ERMS */
 	RTE_CPUFLAG_INVPCID,                /**< INVPCID */
 	RTE_CPUFLAG_RTM,                    /**< Transactional memory */
+	RTE_CPUFLAG_AVX512F,                /**< AVX512F */
 
 	/* (EAX 80000001h) ECX features */
 	RTE_CPUFLAG_LAHF_SAHF,              /**< LAHF_SAHF */
@@ -238,6 +239,7 @@ static const struct feature_entry cpu_feature_table[] = {
 	FEAT_DEF(ERMS, 0x00000007, 0, RTE_REG_EBX,  8)
 	FEAT_DEF(INVPCID, 0x00000007, 0, RTE_REG_EBX, 10)
 	FEAT_DEF(RTM, 0x00000007, 0, RTE_REG_EBX, 11)
+	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
-- 
2.5.0
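
As an illustration of the check this patch adds: the new FEAT_DEF entry says that AVX512F is reported in CPUID leaf 7, subleaf 0, register EBX, bit 16. The sketch below shows only that bit test. The helper names `cpu_feature_bit()` and `has_avx512f()` are hypothetical and are not DPDK APIs, and the EBX value is passed in as a parameter so the snippet does not depend on the host CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit position taken from FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16). */
#define AVX512F_EBX_BIT 16u

/* Hypothetical helper: test one feature bit of a 32-bit CPUID register. */
static bool cpu_feature_bit(uint32_t reg, unsigned int bit)
{
	assert(bit < 32);	/* shifting by 32 or more is undefined */
	return (reg >> bit) & 1u;
}

/*
 * Hypothetical helper (not a DPDK API): given the EBX value returned by
 * CPUID leaf 7, subleaf 0, report whether the AVX512F bit is set.
 */
static bool has_avx512f(uint32_t leaf7_ebx)
{
	return cpu_feature_bit(leaf7_ebx, AVX512F_EBX_BIT);
}
```

In a real program the EBX value would come from executing CPUID; inside DPDK the feature table is queried through the existing cpuflags interface rather than through a helper like this one.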


* [PATCH 2/4] mk: Predefine AVX512 macro for compiler
  2016-01-14  6:13 [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
  2016-01-14  6:13 ` [PATCH 1/4] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
@ 2016-01-14  6:13 ` Zhihong Wang
  2016-01-14  6:13 ` [PATCH 3/4] lib/librte_eal: Optimize memcpy for AVX512 platforms Zhihong Wang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-14  6:13 UTC (permalink / raw)
  To: dev

Add AVX512F to CPUFLAGS when the compiler predefines __AVX512F__, so that
the RTE_MACHINE_CPUFLAG_AVX512F macro is available to the build.

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 mk/rte.cpuflags.mk | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk
index 28f203b..19a3e7e 100644
--- a/mk/rte.cpuflags.mk
+++ b/mk/rte.cpuflags.mk
@@ -89,6 +89,10 @@ ifneq ($(filter $(AUTO_CPUFLAGS),__AVX2__),)
 CPUFLAGS += AVX2
 endif
 
+ifneq ($(filter $(AUTO_CPUFLAGS),__AVX512F__),)
+CPUFLAGS += AVX512F
+endif
+
 # IBM Power CPU flags
 ifneq ($(filter $(AUTO_CPUFLAGS),__PPC64__),)
 CPUFLAGS += PPC64
-- 
2.5.0
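
To show what the predefined macro is used for, here is a minimal sketch of compile-time selection keyed directly on the compiler's `__AVX512F__` and `__AVX2__` macros. The enum and function names are illustrative and are not DPDK identifiers; the series itself tests `RTE_MACHINE_CPUFLAG_*`, which the makefile derives from these compiler macros.

```c
#include <assert.h>

/* Illustrative names only; none of these are DPDK identifiers. */
enum memcpy_flavor { FLAVOR_SSE, FLAVOR_AVX2, FLAVOR_AVX512F };

/*
 * Pick a flavor at compile time, in the same order of preference as the
 * series: AVX512F first, then AVX2, then the SSE/AVX fallback.
 */
static enum memcpy_flavor build_flavor(void)
{
#if defined(__AVX512F__)
	return FLAVOR_AVX512F;
#elif defined(__AVX2__)
	return FLAVOR_AVX2;
#else
	return FLAVOR_SSE;
#endif
}

/* Human-readable name for a flavor, e.g. for a startup log line. */
static const char *flavor_name(enum memcpy_flavor f)
{
	static const char *const names[] = { "sse", "avx2", "avx512f" };

	assert((unsigned int)f <= FLAVOR_AVX512F);
	return names[f];
}
```

Because the choice is made by the preprocessor, the result depends on the flags the file is compiled with (for example -march=native), not on the machine the binary later runs on.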


* [PATCH 3/4] lib/librte_eal: Optimize memcpy for AVX512 platforms
  2016-01-14  6:13 [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
  2016-01-14  6:13 ` [PATCH 1/4] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
  2016-01-14  6:13 ` [PATCH 2/4] mk: Predefine AVX512 macro for compiler Zhihong Wang
@ 2016-01-14  6:13 ` Zhihong Wang
  2016-01-14  6:13 ` [PATCH 4/4] app/test: Adjust alignment unit for memcpy perf test Zhihong Wang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-14  6:13 UTC (permalink / raw)
  To: dev

Implement AVX512 memcpy and choose the right implementation based on
predefined macros, so that memcpy makes full use of the hardware and
delivers higher performance.

In current DPDK, memcpy accounts for a large proportion of execution time
in libraries such as Vhost, especially for large packets, so this patch
can bring considerable benefits on AVX512 platforms.

The implementation is based on the current DPDK memcpy framework. Some
background can be found in these threads:
http://dpdk.org/ml/archives/dev/2014-November/008158.html
http://dpdk.org/ml/archives/dev/2015-January/011800.html

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 .../common/include/arch/x86/rte_memcpy.h           | 247 ++++++++++++++++++++-
 1 file changed, 243 insertions(+), 4 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 6a57426..fee954a 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -37,7 +37,7 @@
 /**
  * @file
  *
- * Functions for SSE/AVX/AVX2 implementation of memcpy().
+ * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
  */
 
 #include <stdio.h>
@@ -67,7 +67,246 @@ extern "C" {
 static inline void *
 rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));
 
-#ifdef RTE_MACHINE_CPUFLAG_AVX2
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+
+/**
+ * AVX512 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_loadu_si512((const void *)src);
+	_mm512_storeu_si512((void *)dst, zmm0);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+	rte_mov64(dst + 2 * 64, src + 2 * 64);
+	rte_mov64(dst + 3 * 64, src + 3 * 64);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1;
+
+	while (n >= 128) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 128;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		src = src + 128;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		dst = dst + 128;
+	}
+}
+
+/**
+ * Copy 512-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
+
+	while (n >= 512) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 512;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
+		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
+		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
+		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
+		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
+		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
+		src = src + 512;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
+		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
+		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
+		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
+		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
+		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
+		dst = dst + 512;
+	}
+}
+
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				  (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				  (const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK63:
+		if (n > 64) {
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+			return ret;
+		}
+		if (n > 0)
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes
+	 */
+	dstofss = ((uintptr_t)dst & 0x3F);
+	if (dstofss > 0) {
+		dstofss = 64 - dstofss;
+		n -= dstofss;
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 512-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 511;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy 128-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	if (n >= 128) {
+		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+		bits = n;
+		n = n & 127;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
+	}
+
+	/**
+	 * Copy whatever left
+	 */
+	goto COPY_BLOCK_128_BACK63;
+}
+
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
 
 /**
  * AVX2 implementation below
@@ -311,7 +550,7 @@ COPY_BLOCK_64_BACK31:
 	goto COPY_BLOCK_64_BACK31;
 }
 
-#else /* RTE_MACHINE_CPUFLAG_AVX2 */
+#else /* RTE_MACHINE_CPUFLAG */
 
 /**
  * SSE & AVX implementation below
@@ -630,7 +869,7 @@ COPY_BLOCK_64_BACK15:
 	goto COPY_BLOCK_64_BACK15;
 }
 
-#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
+#endif /* RTE_MACHINE_CPUFLAG */
 
 #ifdef __cplusplus
 }
-- 
2.5.0
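
Three ideas in the patch above can be sketched in portable C without any AVX512 intrinsics: decomposing a sub-16-byte copy by the bits of n, covering 16 to 32 bytes with two overlapping fixed-size moves, and computing how far to advance so that stores become 64-byte aligned. This is a sketch under assumptions, not the patch's code: the function names are hypothetical, and plain `memcpy()` of a fixed size stands in for `rte_mov16()` and the typed loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for rte_mov16(): one fixed-size 16-byte move. */
static void mov16(uint8_t *dst, const uint8_t *src)
{
	memcpy(dst, src, 16);
}

/* Copy n < 16 bytes by decomposing n into its 1-, 2-, 4- and 8-byte bits. */
static void copy_lt16(uint8_t *dst, const uint8_t *src, size_t n)
{
	assert(n < 16);
	for (size_t step = 1; step <= 8; step <<= 1) {
		if (n & step) {
			memcpy(dst, src, step);
			dst += step;
			src += step;
		}
	}
}

/*
 * Copy n bytes, 16 <= n <= 32, with exactly two 16-byte moves.
 * The second move is anchored at the end of the buffer, so it overlaps
 * the first one whenever n < 32. Source and destination must not overlap.
 */
static void copy_16_to_32(uint8_t *dst, const uint8_t *src, size_t n)
{
	assert(n >= 16 && n <= 32);
	mov16(dst, src);
	mov16(dst + n - 16, src + n - 16);
}

/*
 * Bytes to advance so that dst becomes 64-byte aligned, mirroring the
 * "dstofss" computation in the patch (0 when already aligned).
 */
static size_t bytes_to_align64(uintptr_t dst)
{
	size_t off = dst & 0x3F;

	return off ? 64 - off : 0;
}
```

The overlapping-move trick avoids a byte loop for the tail: some bytes are written twice, which is harmless because both moves write the same source data. The patch applies the same idea with 32-byte and 64-byte moves for the larger size classes.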


* [PATCH 4/4] app/test: Adjust alignment unit for memcpy perf test
  2016-01-14  6:13 [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
                   ` (2 preceding siblings ...)
  2016-01-14  6:13 ` [PATCH 3/4] lib/librte_eal: Optimize memcpy for AVX512 platforms Zhihong Wang
@ 2016-01-14  6:13 ` Zhihong Wang
  2016-01-14 16:48 ` [PATCH 0/4] Optimize memcpy for AVX512 platforms Stephen Hemminger
  2016-01-18  3:05 ` [PATCH v2 0/5] " Zhihong Wang
  5 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-14  6:13 UTC (permalink / raw)
  To: dev

Select the alignment unit for the memcpy perf test based on predefined
macros: 64 bytes for AVX512F, 32 bytes for AVX2, and 16 bytes otherwise.

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 app/test/test_memcpy_perf.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
index 754828e..73babec 100644
--- a/app/test/test_memcpy_perf.c
+++ b/app/test/test_memcpy_perf.c
@@ -79,7 +79,13 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define TEST_BATCH_SIZE         100
 
 /* Data is aligned on this many bytes (power of 2) */
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+#define ALIGNMENT_UNIT          64
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
 #define ALIGNMENT_UNIT          32
+#else /* RTE_MACHINE_CPUFLAG */
+#define ALIGNMENT_UNIT          16
+#endif /* RTE_MACHINE_CPUFLAG */
 
 /*
  * Pointers used in performance tests. The two large buffers are for uncached
-- 
2.5.0
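
A minimal sketch of how such an alignment unit can be used: the three-way choice from the patch, followed by two small helpers that work with it. The sketch keys on the compiler macros `__AVX512F__` and `__AVX2__` so that it builds outside DPDK, whereas the patch tests `RTE_MACHINE_CPUFLAG_*`. The helpers `align_up()` and `is_unit_aligned()` are hypothetical and do not appear in test_memcpy_perf.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same three-way choice as the patch, keyed on compiler macros. */
#if defined(__AVX512F__)
#define ALIGNMENT_UNIT 64
#elif defined(__AVX2__)
#define ALIGNMENT_UNIT 32
#else
#define ALIGNMENT_UNIT 16
#endif

/* Round off up to the next multiple of unit; unit must be a power of two. */
static size_t align_up(size_t off, size_t unit)
{
	assert(unit != 0 && (unit & (unit - 1)) == 0);
	return (off + unit - 1) & ~(unit - 1);
}

/* True when addr sits on an ALIGNMENT_UNIT boundary. */
static int is_unit_aligned(uintptr_t addr)
{
	return (addr & (ALIGNMENT_UNIT - 1)) == 0;
}
```

Matching the unit to the widest vector register means an "aligned" test case is aligned for the widest load and store the selected implementation issues.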


* Re: [PATCH 0/4] Optimize memcpy for AVX512 platforms
  2016-01-14  6:13 [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
                   ` (3 preceding siblings ...)
  2016-01-14  6:13 ` [PATCH 4/4] app/test: Adjust alignment unit for memcpy perf test Zhihong Wang
@ 2016-01-14 16:48 ` Stephen Hemminger
  2016-01-15  6:39   ` Wang, Zhihong
  2016-01-18  3:05 ` [PATCH v2 0/5] " Zhihong Wang
  5 siblings, 1 reply; 23+ messages in thread
From: Stephen Hemminger @ 2016-01-14 16:48 UTC (permalink / raw)
  To: Zhihong Wang; +Cc: dev

On Thu, 14 Jan 2016 01:13:18 -0500
Zhihong Wang <zhihong.wang@intel.com> wrote:

> This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> utilization of hardware resources and deliver high performance.
> 
> In current DPDK, memcpy holds a large proportion of execution time in
> libs like Vhost, especially for large packets, and this patch can bring
> considerable benefits.
> 
> The implementation is based on the current DPDK memcpy framework, some
> background introduction can be found in these threads:
> http://dpdk.org/ml/archives/dev/2014-November/008158.html
> http://dpdk.org/ml/archives/dev/2015-January/011800.html
> 
> Code changes are:
> 
>   1. Read CPUID to check if AVX512 is supported by CPU
> 
>   2. Predefine AVX512 macro if AVX512 is enabled by compiler
> 
>   3. Implement AVX512 memcpy and choose the right implementation based on
>      predefined macros
> 
>   4. Decide alignment unit for memcpy perf test based on predefined macros
> 
> Zhihong Wang (4):
>   lib/librte_eal: Identify AVX512 CPU flag
>   mk: Predefine AVX512 macro for compiler
>   lib/librte_eal: Optimize memcpy for AVX512 platforms
>   app/test: Adjust alignment unit for memcpy perf test
> 
>  app/test/test_memcpy_perf.c                        |   6 +
>  .../common/include/arch/x86/rte_cpuflags.h         |   2 +
>  .../common/include/arch/x86/rte_memcpy.h           | 247 ++++++++++++++++++++-
>  mk/rte.cpuflags.mk                                 |   4 +
>  4 files changed, 255 insertions(+), 4 deletions(-)
> 

This really looks like code that could benefit from GCC function
multiversioning. The current cpuflags model is useless/flawed in real
product deployment.


* Re: [PATCH 0/4] Optimize memcpy for AVX512 platforms
  2016-01-14 16:48 ` [PATCH 0/4] Optimize memcpy for AVX512 platforms Stephen Hemminger
@ 2016-01-15  6:39   ` Wang, Zhihong
  2016-01-15 22:03     ` Vincent JARDIN
  0 siblings, 1 reply; 23+ messages in thread
From: Wang, Zhihong @ 2016-01-15  6:39 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev



> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, January 15, 2016 12:49 AM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> Richardson, Bruce <bruce.richardson@intel.com>; Xie, Huawei
> <huawei.xie@intel.com>
> Subject: Re: [PATCH 0/4] Optimize memcpy for AVX512 platforms
> 
> On Thu, 14 Jan 2016 01:13:18 -0500
> Zhihong Wang <zhihong.wang@intel.com> wrote:
> 
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > utilization of hardware resources and deliver high performance.
> >
> > In current DPDK, memcpy holds a large proportion of execution time in
> > libs like Vhost, especially for large packets, and this patch can bring
> > considerable benefits.
> >
> > The implementation is based on the current DPDK memcpy framework, some
> > background introduction can be found in these threads:
> > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> >
> > Code changes are:
> >
> >   1. Read CPUID to check if AVX512 is supported by CPU
> >
> >   2. Predefine AVX512 macro if AVX512 is enabled by compiler
> >
> >   3. Implement AVX512 memcpy and choose the right implementation based on
> >      predefined macros
> >
> >   4. Decide alignment unit for memcpy perf test based on predefined macros
> >
> > Zhihong Wang (4):
> >   lib/librte_eal: Identify AVX512 CPU flag
> >   mk: Predefine AVX512 macro for compiler
> >   lib/librte_eal: Optimize memcpy for AVX512 platforms
> >   app/test: Adjust alignment unit for memcpy perf test
> >
> >  app/test/test_memcpy_perf.c                        |   6 +
> >  .../common/include/arch/x86/rte_cpuflags.h         |   2 +
> >  .../common/include/arch/x86/rte_memcpy.h           | 247 ++++++++++++++++++++-
> >  mk/rte.cpuflags.mk                                 |   4 +
> >  4 files changed, 255 insertions(+), 4 deletions(-)
> >
> 
> This really looks like code that could benefit from Gcc
> function multiversioning. The current cpuflags model is useless/flawed
> in real product deployment


I've tried GCC function multiversioning with a simple add() function
that returns a + b and a loop calling it millions of times. It turned
out that this mechanism adds 17% to the execution time, which is a lot
of extra overhead.

To quote the GCC wiki: "GCC takes care of doing the dispatching to call
the right version at runtime". So it loses inlining and adds extra
dispatching overhead.

Also, this mechanism works only for C++, right?

I think using predefined macros at compile time is more efficient and
suits DPDK better.

Could you please give an example of when the current CPU flags model
stops working, so I can fix it?
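
To make the trade-off concrete, here is a rough sketch of the two call shapes being compared. It is not the benchmark described above and it does not use GCC's multiversioning attributes; an explicit function pointer stands in for the dispatch that the compiler and loader would otherwise set up. All names are hypothetical, and the `cpu_has_feature` parameter stands in for a real CPUID-based check.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*add_fn)(int, int);

/* Two interchangeable "versions"; real ones would differ in the ISA used. */
static int add_generic(int a, int b) { return a + b; }
static int add_tuned(int a, int b)   { return a + b; }

/* Resolved once at startup, then every call goes through the pointer. */
static add_fn add_impl;

/* cpu_has_feature stands in for a real CPUID-based check. */
static void add_select(int cpu_has_feature)
{
	add_impl = cpu_has_feature ? add_tuned : add_generic;
}

/* Runtime-dispatched call: an indirect call that generally cannot be inlined. */
static int add_dispatched(int a, int b)
{
	assert(add_impl != NULL);
	return add_impl(a, b);
}

/* Compile-time selected call: a direct call that the compiler may inline. */
static inline int add_static(int a, int b)
{
	return add_generic(a, b);
}
```

For a function as small as add(), the indirect call can cost more than the work itself, which is consistent with the overhead reported above. For large copies the relative cost of the dispatch would be smaller.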


* Re: [PATCH 0/4] Optimize memcpy for AVX512 platforms
  2016-01-15  6:39   ` Wang, Zhihong
@ 2016-01-15 22:03     ` Vincent JARDIN
  0 siblings, 0 replies; 23+ messages in thread
From: Vincent JARDIN @ 2016-01-15 22:03 UTC (permalink / raw)
  To: Wang, Zhihong; +Cc: dev

On 14 Jan 2016 22:39, "Wang, Zhihong" <zhihong.wang@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Friday, January 15, 2016 12:49 AM
> > To: Wang, Zhihong <zhihong.wang@intel.com>
> > Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > Richardson, Bruce <bruce.richardson@intel.com>; Xie, Huawei
> > <huawei.xie@intel.com>
> > Subject: Re: [PATCH 0/4] Optimize memcpy for AVX512 platforms
> >
> > On Thu, 14 Jan 2016 01:13:18 -0500
> > Zhihong Wang <zhihong.wang@intel.com> wrote:
> >
> > > > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > > utilization of hardware resources and deliver high performance.
> > >
> > > In current DPDK, memcpy holds a large proportion of execution time in
> > > > libs like Vhost, especially for large packets, and this patch can bring
> > > considerable benefits.
> > >
> > > The implementation is based on the current DPDK memcpy framework, some
> > > background introduction can be found in these threads:
> > > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> > >
> > > Code changes are:
> > >
> > >   1. Read CPUID to check if AVX512 is supported by CPU
> > >
> > >   2. Predefine AVX512 macro if AVX512 is enabled by compiler
> > >
> > >   3. Implement AVX512 memcpy and choose the right implementation based
> > on
> > >      predefined macros
> > >
> > >   4. Decide alignment unit for memcpy perf test based on predefined
macros
> > >
> > > Zhihong Wang (4):
> > >   lib/librte_eal: Identify AVX512 CPU flag
> > >   mk: Predefine AVX512 macro for compiler
> > >   lib/librte_eal: Optimize memcpy for AVX512 platforms
> > >   app/test: Adjust alignment unit for memcpy perf test
> > >
> > >  app/test/test_memcpy_perf.c                        |   6 +
> > >  .../common/include/arch/x86/rte_cpuflags.h         |   2 +
> > >  .../common/include/arch/x86/rte_memcpy.h           | 247
> > ++++++++++++++++++++-
> > >  mk/rte.cpuflags.mk                                 |   4 +
> > >  4 files changed, 255 insertions(+), 4 deletions(-)
> > >
> >
> > This really looks like code that could benefit from Gcc
> > function multiversioning. The current cpuflags model is useless/flawed
> > in real product deployment
>
>
> I've tried gcc function multi versioning, with a simple add() function
> which returns a + b, and a loop calling it for millions of times. Turned
> out this mechanism adds 17% extra time to execute, overall it's a lot
> of extra overhead.
>
> Quote the gcc wiki: "GCC takes care of doing the dispatching to call
> the right version at runtime". So it loses inlining and adds extra
> dispatching overhead.
>
> Also this mechanism works only for C++, right?
>
> I think using predefined macros at compile time is more efficient and
> suits DPDK more.
>

I agree with you: performance first.

So a mix of runtime and compile-time selection would work. Those who are
OK with some performance drop can go with runtime selection.

> Could you please give an example when the current CPU flags model
> stop working? So I can fix it.
>


* [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
  2016-01-14  6:13 [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
                   ` (4 preceding siblings ...)
  2016-01-14 16:48 ` [PATCH 0/4] Optimize memcpy for AVX512 platforms Stephen Hemminger
@ 2016-01-18  3:05 ` Zhihong Wang
  2016-01-18  3:05   ` [PATCH v2 1/5] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
                     ` (8 more replies)
  5 siblings, 9 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-18  3:05 UTC (permalink / raw)
  To: dev

This patch set optimizes DPDK memcpy for AVX512 platforms so that it makes
full use of the hardware and delivers higher performance.

In current DPDK, memcpy accounts for a large proportion of execution time
in libraries such as Vhost, especially for large packets, so this patch set
can bring considerable benefits.

The implementation is based on the current DPDK memcpy framework. Some
background can be found in these threads:
http://dpdk.org/ml/archives/dev/2014-November/008158.html
http://dpdk.org/ml/archives/dev/2015-January/011800.html

Code changes are:

  1. Read CPUID to check if AVX512 is supported by CPU

  2. Predefine AVX512 macro if AVX512 is enabled by compiler

  3. Implement AVX512 memcpy and choose the right implementation based on
     predefined macros

  4. Decide alignment unit for memcpy perf test based on predefined macros

--------------
Changes in v2:

  1. Tune performance for prior platforms

Zhihong Wang (5):
  lib/librte_eal: Identify AVX512 CPU flag
  mk: Predefine AVX512 macro for compiler
  lib/librte_eal: Optimize memcpy for AVX512 platforms
  app/test: Adjust alignment unit for memcpy perf test
  lib/librte_eal: Tune memcpy for prior platforms

 app/test/test_memcpy_perf.c                        |   6 +
 .../common/include/arch/x86/rte_cpuflags.h         |   2 +
 .../common/include/arch/x86/rte_memcpy.h           | 269 ++++++++++++++++++++-
 mk/rte.cpuflags.mk                                 |   4 +
 4 files changed, 268 insertions(+), 13 deletions(-)

-- 
2.5.0


* [PATCH v2 1/5] lib/librte_eal: Identify AVX512 CPU flag
  2016-01-18  3:05 ` [PATCH v2 0/5] " Zhihong Wang
@ 2016-01-18  3:05   ` Zhihong Wang
  2016-01-18  3:05   ` [PATCH v2 2/5] mk: Predefine AVX512 macro for compiler Zhihong Wang
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-18  3:05 UTC (permalink / raw)
  To: dev

Read CPUID leaf 7 (EBX bit 16) to check whether the CPU supports AVX512F,
and add the matching RTE_CPUFLAG_AVX512F flag.

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 lib/librte_eal/common/include/arch/x86/rte_cpuflags.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
index dd56553..89c0d9d 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
@@ -131,6 +131,7 @@ enum rte_cpu_flag_t {
 	RTE_CPUFLAG_ERMS,                   /**< ERMS */
 	RTE_CPUFLAG_INVPCID,                /**< INVPCID */
 	RTE_CPUFLAG_RTM,                    /**< Transactional memory */
+	RTE_CPUFLAG_AVX512F,                /**< AVX512F */
 
 	/* (EAX 80000001h) ECX features */
 	RTE_CPUFLAG_LAHF_SAHF,              /**< LAHF_SAHF */
@@ -238,6 +239,7 @@ static const struct feature_entry cpu_feature_table[] = {
 	FEAT_DEF(ERMS, 0x00000007, 0, RTE_REG_EBX,  8)
 	FEAT_DEF(INVPCID, 0x00000007, 0, RTE_REG_EBX, 10)
 	FEAT_DEF(RTM, 0x00000007, 0, RTE_REG_EBX, 11)
+	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
-- 
2.5.0


* [PATCH v2 2/5] mk: Predefine AVX512 macro for compiler
  2016-01-18  3:05 ` [PATCH v2 0/5] " Zhihong Wang
  2016-01-18  3:05   ` [PATCH v2 1/5] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
@ 2016-01-18  3:05   ` Zhihong Wang
  2016-01-18  3:05   ` [PATCH v2 3/5] lib/librte_eal: Optimize memcpy for AVX512 platforms Zhihong Wang
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-18  3:05 UTC (permalink / raw)
  To: dev

Add AVX512F to CPUFLAGS when the compiler predefines __AVX512F__, so that
the RTE_MACHINE_CPUFLAG_AVX512F macro is available to the build.

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 mk/rte.cpuflags.mk | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk
index 28f203b..19a3e7e 100644
--- a/mk/rte.cpuflags.mk
+++ b/mk/rte.cpuflags.mk
@@ -89,6 +89,10 @@ ifneq ($(filter $(AUTO_CPUFLAGS),__AVX2__),)
 CPUFLAGS += AVX2
 endif
 
+ifneq ($(filter $(AUTO_CPUFLAGS),__AVX512F__),)
+CPUFLAGS += AVX512F
+endif
+
 # IBM Power CPU flags
 ifneq ($(filter $(AUTO_CPUFLAGS),__PPC64__),)
 CPUFLAGS += PPC64
-- 
2.5.0


* [PATCH v2 3/5] lib/librte_eal: Optimize memcpy for AVX512 platforms
  2016-01-18  3:05 ` [PATCH v2 0/5] " Zhihong Wang
  2016-01-18  3:05   ` [PATCH v2 1/5] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
  2016-01-18  3:05   ` [PATCH v2 2/5] mk: Predefine AVX512 macro for compiler Zhihong Wang
@ 2016-01-18  3:05   ` Zhihong Wang
  2016-01-18  3:05   ` [PATCH v2 4/5] app/test: Adjust alignment unit for memcpy perf test Zhihong Wang
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-18  3:05 UTC (permalink / raw)
  To: dev

Implement AVX512 memcpy and choose the right implementation based on
predefined macros, so that memcpy makes full use of the hardware and
delivers higher performance.

In current DPDK, memcpy accounts for a large proportion of execution time
in libraries such as Vhost, especially for large packets, so this patch
can bring considerable benefits on AVX512 platforms.

The implementation is based on the current DPDK memcpy framework. Some
background can be found in these threads:
http://dpdk.org/ml/archives/dev/2014-November/008158.html
http://dpdk.org/ml/archives/dev/2015-January/011800.html

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 .../common/include/arch/x86/rte_memcpy.h           | 247 ++++++++++++++++++++-
 1 file changed, 243 insertions(+), 4 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 6a57426..fee954a 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -37,7 +37,7 @@
 /**
  * @file
  *
- * Functions for SSE/AVX/AVX2 implementation of memcpy().
+ * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
  */
 
 #include <stdio.h>
@@ -67,7 +67,246 @@ extern "C" {
 static inline void *
 rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));
 
-#ifdef RTE_MACHINE_CPUFLAG_AVX2
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+
+/**
+ * AVX512 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_loadu_si512((const void *)src);
+	_mm512_storeu_si512((void *)dst, zmm0);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+	rte_mov64(dst + 2 * 64, src + 2 * 64);
+	rte_mov64(dst + 3 * 64, src + 3 * 64);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1;
+
+	while (n >= 128) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 128;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		src = src + 128;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		dst = dst + 128;
+	}
+}
+
+/**
+ * Copy 512-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
+
+	while (n >= 512) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 512;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
+		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
+		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
+		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
+		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
+		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
+		src = src + 512;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
+		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
+		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
+		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
+		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
+		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
+		dst = dst + 512;
+	}
+}
+
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				  (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				  (const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK63:
+		if (n > 64) {
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+			return ret;
+		}
+		if (n > 0)
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes
+	 */
+	dstofss = ((uintptr_t)dst & 0x3F);
+	if (dstofss > 0) {
+		dstofss = 64 - dstofss;
+		n -= dstofss;
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 512-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 511;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy 128-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	if (n >= 128) {
+		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+		bits = n;
+		n = n & 127;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
+	}
+
+	/**
+	 * Copy whatever left
+	 */
+	goto COPY_BLOCK_128_BACK63;
+}
+
+#elif RTE_MACHINE_CPUFLAG_AVX2
 
 /**
  * AVX2 implementation below
@@ -311,7 +550,7 @@ COPY_BLOCK_64_BACK31:
 	goto COPY_BLOCK_64_BACK31;
 }
 
-#else /* RTE_MACHINE_CPUFLAG_AVX2 */
+#else /* RTE_MACHINE_CPUFLAG */
 
 /**
  * SSE & AVX implementation below
@@ -630,7 +869,7 @@ COPY_BLOCK_64_BACK15:
 	goto COPY_BLOCK_64_BACK15;
 }
 
-#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
+#endif /* RTE_MACHINE_CPUFLAG */
 
 #ifdef __cplusplus
 }
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 4/5] app/test: Adjust alignment unit for memcpy perf test
  2016-01-18  3:05 ` [PATCH v2 0/5] " Zhihong Wang
                     ` (2 preceding siblings ...)
  2016-01-18  3:05   ` [PATCH v2 3/5] lib/librte_eal: Optimize memcpy for AVX512 platforms Zhihong Wang
@ 2016-01-18  3:05   ` Zhihong Wang
  2016-01-18  3:05   ` [PATCH v2 5/5] lib/librte_eal: Tune memcpy for prior platforms Zhihong Wang
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-18  3:05 UTC (permalink / raw)
  To: dev

Decide alignment unit for memcpy perf test based on predefined macros.

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 app/test/test_memcpy_perf.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
index 754828e..73babec 100644
--- a/app/test/test_memcpy_perf.c
+++ b/app/test/test_memcpy_perf.c
@@ -79,7 +79,13 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define TEST_BATCH_SIZE         100
 
 /* Data is aligned on this many bytes (power of 2) */
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+#define ALIGNMENT_UNIT          64
+#elif RTE_MACHINE_CPUFLAG_AVX2
 #define ALIGNMENT_UNIT          32
+#else /* RTE_MACHINE_CPUFLAG */
+#define ALIGNMENT_UNIT          16
+#endif /* RTE_MACHINE_CPUFLAG */
 
 /*
  * Pointers used in performance tests. The two large buffers are for uncached
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 5/5] lib/librte_eal: Tune memcpy for prior platforms
  2016-01-18  3:05 ` [PATCH v2 0/5] " Zhihong Wang
                     ` (3 preceding siblings ...)
  2016-01-18  3:05   ` [PATCH v2 4/5] app/test: Adjust alignment unit for memcpy perf test Zhihong Wang
@ 2016-01-18  3:05   ` Zhihong Wang
  2016-01-18 20:06   ` [PATCH v2 0/5] Optimize memcpy for AVX512 platforms Stephen Hemminger
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-18  3:05 UTC (permalink / raw)
  To: dev

For prior platforms (the AVX2 and SSE/AVX paths), make the unaligned-store
handling conditional, so that it no longer interrupts the batch copy loop
in aligned cases.
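
As an illustration only (not part of the patch), the head-offset computation
can be sketched as below. The helper name sketch_head_bytes is hypothetical.
Note that an expression of the form 32 - (addr & 0x1F) yields a value in
1..32 and is never zero, so the sketch masks first and subtracts only when
the remainder is nonzero, which is the form the AVX512 path in patch 3/5
uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Number of bytes needed to advance address p to the next `align`-byte
 * boundary (align must be a power of two). Returns 0 when p is already
 * aligned, which is what lets a caller skip the head copy entirely.
 */
static inline size_t
sketch_head_bytes(uintptr_t p, size_t align)
{
	size_t off;

	assert(align != 0 && (align & (align - 1)) == 0);
	off = (size_t)(p & (uintptr_t)(align - 1));
	return off > 0 ? align - off : 0;
}
```

A caller would copy one full vector from the unaligned start, then advance
src and dst by the returned count and reduce n by the same amount before
entering the block-copy loop.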

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 .../common/include/arch/x86/rte_memcpy.h           | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index fee954a..d965957 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -513,10 +513,12 @@ COPY_BLOCK_64_BACK31:
 	 * Make store aligned when copy size exceeds 512 bytes
 	 */
 	dstofss = 32 - ((uintptr_t)dst & 0x1F);
-	n -= dstofss;
-	rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-	src = (const uint8_t *)src + dstofss;
-	dst = (uint8_t *)dst + dstofss;
+	if (dstofss > 0) {
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
 
 	/**
 	 * Copy 256-byte blocks.
@@ -833,11 +835,13 @@ COPY_BLOCK_64_BACK15:
 	 * backwards access.
 	 */
 	dstofss = 16 - ((uintptr_t)dst & 0x0F) + 16;
-	n -= dstofss;
-	rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-	src = (const uint8_t *)src + dstofss;
-	dst = (uint8_t *)dst + dstofss;
-	srcofs = ((uintptr_t)src & 0x0F);
+	if (dstofss > 0) {
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+		srcofs = ((uintptr_t)src & 0x0F);
+	}
 
 	/**
 	 * For aligned copy
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
  2016-01-18  3:05 ` [PATCH v2 0/5] " Zhihong Wang
                     ` (4 preceding siblings ...)
  2016-01-18  3:05   ` [PATCH v2 5/5] lib/librte_eal: Tune memcpy for prior platforms Zhihong Wang
@ 2016-01-18 20:06   ` Stephen Hemminger
  2016-01-19  2:37     ` Wang, Zhihong
  2016-01-27 15:23   ` Thomas Monjalon
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Stephen Hemminger @ 2016-01-18 20:06 UTC (permalink / raw)
  To: Zhihong Wang; +Cc: dev

On Sun, 17 Jan 2016 22:05:09 -0500
Zhihong Wang <zhihong.wang@intel.com> wrote:

> This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> utilization of hardware resources and deliver high performance.
> 
> In current DPDK, memcpy holds a large proportion of execution time in
> libs like Vhost, especially for large packets, and this patch can bring
> considerable benefits.
> 
> The implementation is based on the current DPDK memcpy framework, some
> background introduction can be found in these threads:
> http://dpdk.org/ml/archives/dev/2014-November/008158.html
> http://dpdk.org/ml/archives/dev/2015-January/011800.html
> 
> Code changes are:
> 
>   1. Read CPUID to check if AVX512 is supported by CPU
> 
>   2. Predefine AVX512 macro if AVX512 is enabled by compiler
> 
>   3. Implement AVX512 memcpy and choose the right implementation based on
>      predefined macros
> 
>   4. Decide alignment unit for memcpy perf test based on predefined macros

Cool, I like it. How much impact does this have on VHOST?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
  2016-01-18 20:06   ` [PATCH v2 0/5] Optimize memcpy for AVX512 platforms Stephen Hemminger
@ 2016-01-19  2:37     ` Wang, Zhihong
  0 siblings, 0 replies; 23+ messages in thread
From: Wang, Zhihong @ 2016-01-19  2:37 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Tuesday, January 19, 2016 4:06 AM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> Richardson, Bruce <bruce.richardson@intel.com>; Xie, Huawei
> <huawei.xie@intel.com>
> Subject: Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
> 
> On Sun, 17 Jan 2016 22:05:09 -0500
> Zhihong Wang <zhihong.wang@intel.com> wrote:
> 
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > utilization of hardware resources and deliver high performance.
> >
> > In current DPDK, memcpy holds a large proportion of execution time in
> > libs like Vhost, especially for large packets, and this patch can bring
> > considerable benefits.
> >
> > The implementation is based on the current DPDK memcpy framework, some
> > background introduction can be found in these threads:
> > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> >
> > Code changes are:
> >
> >   1. Read CPUID to check if AVX512 is supported by CPU
> >
> >   2. Predefine AVX512 macro if AVX512 is enabled by compiler
> >
> >   3. Implement AVX512 memcpy and choose the right implementation based on
> >      predefined macros
> >
> >   4. Decide alignment unit for memcpy perf test based on predefined macros
> 
> Cool, I like it. How much impact does this have on VHOST?

The impact is significant, especially for enqueue, because VHOST actually
spends a lot of time doing memcpy. (Detailed numbers might not be appropriate
here due to policy :-), so I can only describe how I test it.) Simply measure
the 1024B RX/TX time cost and compare it with the 64B one and you'll get a
sense of it, although not a precise one.

My test cases include NIC2VM2NIC and VM2VM scenarios, which are the main
use cases currently, and use both throughput and RX/TX cycles for evaluation.
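
A very rough way to get a feel for the size dependence outside of a full
RX/TX setup is sketched below. This is an assumption-laden stand-in: it times
plain memcpy() with clock_gettime() rather than measuring vhost enqueue or
dequeue cycles, and the sketch_* names are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t
sketch_now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Average cost in nanoseconds of one `size`-byte copy over `iters` runs.
 * size must not exceed the 2048-byte scratch buffers.
 */
static double
sketch_avg_copy_ns(size_t size, unsigned int iters)
{
	static uint8_t src[2048], dst[2048];
	volatile uint8_t sink = 0;
	uint64_t start, end;
	unsigned int i;

	assert(size > 0 && size <= sizeof(src) && iters > 0);
	start = sketch_now_ns();
	for (i = 0; i < iters; i++) {
		src[0] = (uint8_t)i;		/* vary the input each round */
		memcpy(dst, src, size);
		sink ^= dst[size - 1];		/* consume the copied data */
	}
	end = sketch_now_ns();
	(void)sink;
	return (double)(end - start) / iters;
}
```

Comparing sketch_avg_copy_ns(1024, N) against sketch_avg_copy_ns(64, N) for
some large N gives only a coarse indication; both buffers stay in cache here,
which is not representative of a real packet path.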

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
  2016-01-18  3:05 ` [PATCH v2 0/5] " Zhihong Wang
                     ` (5 preceding siblings ...)
  2016-01-18 20:06   ` [PATCH v2 0/5] Optimize memcpy for AVX512 platforms Stephen Hemminger
@ 2016-01-27 15:23   ` Thomas Monjalon
  2016-01-28  6:09     ` Wang, Zhihong
  2016-01-27 15:30   ` Thomas Monjalon
  2017-08-30  9:37   ` linhaifeng
  8 siblings, 1 reply; 23+ messages in thread
From: Thomas Monjalon @ 2016-01-27 15:23 UTC (permalink / raw)
  To: Zhihong Wang; +Cc: dev

2016-01-17 22:05, Zhihong Wang:
> This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> utilization of hardware resources and deliver high performance.

On a related note, your expertise would be very valuable to review
these patches please:
(memcpy) http://dpdk.org/dev/patchwork/patch/4396/
(memcmp) http://dpdk.org/dev/patchwork/patch/4788/

Thanks

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
  2016-01-18  3:05 ` [PATCH v2 0/5] " Zhihong Wang
                     ` (6 preceding siblings ...)
  2016-01-27 15:23   ` Thomas Monjalon
@ 2016-01-27 15:30   ` Thomas Monjalon
  2016-01-27 18:48     ` Ananyev, Konstantin
  2017-08-30  9:37   ` linhaifeng
  8 siblings, 1 reply; 23+ messages in thread
From: Thomas Monjalon @ 2016-01-27 15:30 UTC (permalink / raw)
  To: bruce.richardson, konstantin.ananyev; +Cc: dev

> Zhihong Wang (5):
>   lib/librte_eal: Identify AVX512 CPU flag
>   mk: Predefine AVX512 macro for compiler
>   lib/librte_eal: Optimize memcpy for AVX512 platforms
>   app/test: Adjust alignment unit for memcpy perf test
>   lib/librte_eal: Tune memcpy for prior platforms
> 
>  app/test/test_memcpy_perf.c                        |   6 +
>  .../common/include/arch/x86/rte_cpuflags.h         |   2 +
>  .../common/include/arch/x86/rte_memcpy.h           | 269 ++++++++++++++++++++-
>  mk/rte.cpuflags.mk                                 |   4 +
>  4 files changed, 268 insertions(+), 13 deletions(-)

The maintainers of arch/x86 are Bruce and Konstantin.
I guess there is no comment and we can apply this cool series?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
  2016-01-27 15:30   ` Thomas Monjalon
@ 2016-01-27 18:48     ` Ananyev, Konstantin
  2016-01-27 20:18       ` Thomas Monjalon
  0 siblings, 1 reply; 23+ messages in thread
From: Ananyev, Konstantin @ 2016-01-27 18:48 UTC (permalink / raw)
  To: Thomas Monjalon, Richardson, Bruce; +Cc: dev

Hi Thomas,

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Wednesday, January 27, 2016 3:31 PM
> To: Richardson, Bruce; Ananyev, Konstantin
> Cc: dev@dpdk.org; Wang, Zhihong
> Subject: Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
> 
> > Zhihong Wang (5):
> >   lib/librte_eal: Identify AVX512 CPU flag
> >   mk: Predefine AVX512 macro for compiler
> >   lib/librte_eal: Optimize memcpy for AVX512 platforms
> >   app/test: Adjust alignment unit for memcpy perf test
> >   lib/librte_eal: Tune memcpy for prior platforms
> >
> >  app/test/test_memcpy_perf.c                        |   6 +
> >  .../common/include/arch/x86/rte_cpuflags.h         |   2 +
> >  .../common/include/arch/x86/rte_memcpy.h           | 269 ++++++++++++++++++++-
> >  mk/rte.cpuflags.mk                                 |   4 +
> >  4 files changed, 268 insertions(+), 13 deletions(-)
> 
> The maintainers of arch/x86 are Bruce and Konstantin.
> I guess there is no comment and we can apply this cool series?

Yes, looks ok to me.
Konstantin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
  2016-01-27 18:48     ` Ananyev, Konstantin
@ 2016-01-27 20:18       ` Thomas Monjalon
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Monjalon @ 2016-01-27 20:18 UTC (permalink / raw)
  To: Wang, Zhihong; +Cc: dev

2016-01-27 18:48, Ananyev, Konstantin:
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > 
> > > Zhihong Wang (5):
> > >   lib/librte_eal: Identify AVX512 CPU flag
> > >   mk: Predefine AVX512 macro for compiler
> > >   lib/librte_eal: Optimize memcpy for AVX512 platforms
> > >   app/test: Adjust alignment unit for memcpy perf test
> > >   lib/librte_eal: Tune memcpy for prior platforms
> > >
> > >  app/test/test_memcpy_perf.c                        |   6 +
> > >  .../common/include/arch/x86/rte_cpuflags.h         |   2 +
> > >  .../common/include/arch/x86/rte_memcpy.h           | 269 ++++++++++++++++++++-
> > >  mk/rte.cpuflags.mk                                 |   4 +
> > >  4 files changed, 268 insertions(+), 13 deletions(-)
> > 
> > The maintainers of arch/x86 are Bruce and Konstantin.
> > I guess there is no comment and we can apply this cool series?
> 
> Yes, looks ok to me.

Applied, thanks

Some benchmark feedback would be welcome.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
  2016-01-27 15:23   ` Thomas Monjalon
@ 2016-01-28  6:09     ` Wang, Zhihong
  0 siblings, 0 replies; 23+ messages in thread
From: Wang, Zhihong @ 2016-01-28  6:09 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Wednesday, January 27, 2016 11:24 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; Ravi Kerur <rkerur@gmail.com>
> Subject: Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
> 
> 2016-01-17 22:05, Zhihong Wang:
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > utilization of hardware resources and deliver high performance.
> 
> On a related note, your expertise would be very valuable to review
> these patches please:
> (memcpy) http://dpdk.org/dev/patchwork/patch/4396/
> (memcmp) http://dpdk.org/dev/patchwork/patch/4788/

Will do, thanks.

> 
> Thanks

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
  2016-01-18  3:05 ` [PATCH v2 0/5] " Zhihong Wang
                     ` (7 preceding siblings ...)
  2016-01-27 15:30   ` Thomas Monjalon
@ 2017-08-30  9:37   ` linhaifeng
  2017-09-18  5:10     ` Wang, Zhihong
  8 siblings, 1 reply; 23+ messages in thread
From: linhaifeng @ 2017-08-30  9:37 UTC (permalink / raw)
  To: Zhihong Wang, dev

On 2016/1/18 11:05, Zhihong Wang wrote:
> This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> utilization of hardware resources and deliver high performance.
>
> In current DPDK, memcpy holds a large proportion of execution time in
> libs like Vhost, especially for large packets, and this patch can bring
> considerable benefits.
>
> The implementation is based on the current DPDK memcpy framework, some
> background introduction can be found in these threads:
> http://dpdk.org/ml/archives/dev/2014-November/008158.html
> http://dpdk.org/ml/archives/dev/2015-January/011800.html
>
> Code changes are:
>
>   1. Read CPUID to check if AVX512 is supported by CPU
>
>   2. Predefine AVX512 macro if AVX512 is enabled by compiler
>
>   3. Implement AVX512 memcpy and choose the right implementation based on
>      predefined macros
>
>   4. Decide alignment unit for memcpy perf test based on predefined macros
>
> --------------
> Changes in v2:
>
>   1. Tune performance for prior platforms
>
> Zhihong Wang (5):
>   lib/librte_eal: Identify AVX512 CPU flag
>   mk: Predefine AVX512 macro for compiler
>   lib/librte_eal: Optimize memcpy for AVX512 platforms
>   app/test: Adjust alignment unit for memcpy perf test
>   lib/librte_eal: Tune memcpy for prior platforms
>
>  app/test/test_memcpy_perf.c                        |   6 +
>  .../common/include/arch/x86/rte_cpuflags.h         |   2 +
>  .../common/include/arch/x86/rte_memcpy.h           | 269 ++++++++++++++++++++-
>  mk/rte.cpuflags.mk                                 |   4 +
>  4 files changed, 268 insertions(+), 13 deletions(-)
>

Hi Zhihong Wang

I tested the avx512 rte_memcpy and found that the performance for ovs dpdk is lower than with the avx2 rte_memcpy.

The vm loop test for ovs dpdk results:
avx512 is *15*Gbps
perf data:
  0.52 │      vmovdq (%r8,%r10,1),%zmm0
 95.33 │      sub    $0x40,%r9
  0.45 │      add    $0x40,%r8
  0.60 │      vmovdq %zmm0,-0x40(%r8)
  1.84 │      cmp    $0x3f,%r9
       │    ↓ ja     f20
       │      lea    -0x40(%rsi),%r8
  0.15 │      or     $0xffffffffffffffc0,%rsi
  0.21 │      and    $0xffffffffffffffc0,%r8
  0.00 │      lea    0x40(%rsi,%r8,1),%rsi
  0.00 │      vmovdq (%rcx,%rsi,1),%zmm0
  0.22 │      vmovdq %zmm0,(%rdx,%rsi,1)
  0.67 │    ↓ jmpq   c78
       │      mov    -0x128(%rbp),%rdi
       │      rex.R
       │      .byte  0x89
       │      popfq

avx2 is *18.8*Gbps
perf data:
  0.96 │      add    %r9,%r13
 66.04 │      vmovdq (%rdx),%ymm0
  1.20 │      sub    $0x40,%rdi
  1.53 │      add    $0x40,%rdx
 10.83 │      vmovdq %ymm0,-0x40(%rdx,%r15,1)
  8.64 │      vmovdq -0x20(%rdx),%ymm0
  7.58 │      vmovdq %ymm0,-0x40(%rdx,%r13,1)


dpdk version: v17.05
ovs version: 2.8.90
qemu version: QEMU emulator version 2.9.94 (v2.10.0-rc4-dirty)

gcc version: gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
kernel version: 3.10.0


compile dpdk:
CONFIG_RTE_ENABLE_AVX512=y
export DPDK_DIR=$PWD
export DPDK_TARGET=x86_64-native-linuxapp-gcc
export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
make install T=$DPDK_TARGET DESTDIR=install

compile ovs:
sh boot.sh
./configure  CFLAGS="-g -O2" --with-dpdk=$DPDK_BUILD --prefix=/usr --localstatedir=/var --sysconfdir=/etc
make -j
make install

The test for dpdk test_memcpy_perf:
avx2:
** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
======= ============== ============== ============== ==============
   Size Cache to cache   Cache to mem   Mem to cache     Mem to mem
(bytes)        (ticks)        (ticks)        (ticks)        (ticks)
------- -------------- -------------- -------------- --------------
========================== 32B aligned ============================
     64       6 -   10      27 -   52      30 -   39      56 -   97
    512      24 -   44     251 -  271     145 -  217     396 -  447
   1024      35 -   78     394 -  433     252 -  319     609 -  670
------- -------------- -------------- -------------- --------------
C    64       3 -    9      28 -   31      29 -   40      55 -   66
C   512      25 -   55     253 -  268     139 -  268     397 -  410
C  1024      32 -   83     394 -  416     250 -  396     612 -  687
=========================== Unaligned =============================
     64       8 -    9      85 -   71      45 -   45     125 -  121
    512      33 -   49     282 -  305     153 -  252     420 -  478
   1024      42 -   83     409 -  491     259 -  389     640 -  748
------- -------------- -------------- -------------- --------------
C    64       4 -    9      42 -   46      39 -   46      76 -   90
C   512      33 -   55     280 -  272     153 -  281     421 -  415
C  1024      41 -   83     407 -  427     258 -  405     578 -  701
======= ============== ============== ============== ==============

avx512:
** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
======= ============== ============== ============== ==============
   Size Cache to cache   Cache to mem   Mem to cache     Mem to mem
(bytes)        (ticks)        (ticks)        (ticks)        (ticks)
------- -------------- -------------- -------------- --------------
========================== 64B aligned ============================
     64       6 -    9      18 -   33      24 -   38      40 -   65
    512      18 -   44     178 -  262     138 -  218     309 -  429
   1024      27 -   79     338 -  430     250 -  322     560 -  674
------- -------------- -------------- -------------- --------------
C    64       3 -    9      18 -   20      23 -   41      39 -   50
C   512      15 -   54     205 -  270     134 -  268     304 -  409
C  1024      24 -   83     371 -  414     242 -  400     550 -  692
=========================== Unaligned =============================
     64       8 -    9      87 -   74      45 -   48     125 -  118
    512      23 -   49     298 -  311     150 -  250     437 -  482
   1024      36 -   83     427 -  505     259 -  406     633 -  754
------- -------------- -------------- -------------- --------------
C    64       4 -    9      42 -   46      39 -   46      76 -   94
C   512      23 -   55     246 -  277     152 -  290     349 -  426
C  1024      38 -   83     398 -  431     258 -  416     634 -  725
======= ============== ============== ============== ==============

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
  2017-08-30  9:37   ` linhaifeng
@ 2017-09-18  5:10     ` Wang, Zhihong
  0 siblings, 0 replies; 23+ messages in thread
From: Wang, Zhihong @ 2017-09-18  5:10 UTC (permalink / raw)
  To: linhaifeng, dev

> Hi Zhihong Wang
> 
> I tested the avx512 rte_memcpy and found that the performance for ovs dpdk
> is lower than with the avx2 rte_memcpy.

Hi Haifeng,

AVX512 memcpy is marked as experimental and disabled by default; its
benefit varies from case to case. So enable it only when the case
(SW + HW setup with the expected data pattern) has been verified.

BTW, it's not recommended to use micro benchmarks like test_memcpy_perf
for memcpy performance reports, as they are unlikely to reflect the
performance of real-world applications. Please find more details at
https://software.intel.com/en-us/articles/performance-optimization-of-memcpy-in-dpdk


Thanks
Zhihong

> 
> The vm loop test for ovs dpdk results:
> avx512 is *15*Gbps
> perf data:
>   0.52 │      vmovdq (%r8,%r10,1),%zmm0
>  95.33 │      sub    $0x40,%r9
>   0.45 │      add    $0x40,%r8
>   0.60 │      vmovdq %zmm0,-0x40(%r8)
>   1.84 │      cmp    $0x3f,%r9
>        │    ↓ ja     f20
>        │      lea    -0x40(%rsi),%r8
>   0.15 │      or     $0xffffffffffffffc0,%rsi
>   0.21 │      and    $0xffffffffffffffc0,%r8
>   0.00 │      lea    0x40(%rsi,%r8,1),%rsi
>   0.00 │      vmovdq (%rcx,%rsi,1),%zmm0
>   0.22 │      vmovdq %zmm0,(%rdx,%rsi,1)
>   0.67 │    ↓ jmpq   c78
>        │      mov    -0x128(%rbp),%rdi
>        │      rex.R
>        │      .byte  0x89
>        │      popfq
> 
> avx2 is *18.8*Gbps
> perf data:
>   0.96 │      add    %r9,%r13
>  66.04 │      vmovdq (%rdx),%ymm0
>   1.20 │      sub    $0x40,%rdi
>   1.53 │      add    $0x40,%rdx
>  10.83 │      vmovdq %ymm0,-0x40(%rdx,%r15,1)
>   8.64 │      vmovdq -0x20(%rdx),%ymm0
>   7.58 │      vmovdq %ymm0,-0x40(%rdx,%r13,1)
> 
> 
> dpdk version: v17.05
> ovs version: 2.8.90
> qemu version: QEMU emulator version 2.9.94 (v2.10.0-rc4-dirty)
> 
> gcc version: gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
> kernel version: 3.10.0
> 
> 
> compile dpdk:
> CONFIG_RTE_ENABLE_AVX512=y
> export DPDK_DIR=$PWD
> export DPDK_TARGET=x86_64-native-linuxapp-gcc
> export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
> make install T=$DPDK_TARGET DESTDIR=install
> 
> compile ovs:
> sh boot.sh
> ./configure  CFLAGS="-g -O2" --with-dpdk=$DPDK_BUILD --prefix=/usr --localstatedir=/var --sysconfdir=/etc
> make -j
> make install
> 
> The test for dpdk test_memcpy_perf:
> avx2:
> ** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
> ======= ============== ============== ============== ==============
>    Size Cache to cache   Cache to mem   Mem to cache     Mem to mem
> (bytes)        (ticks)        (ticks)        (ticks)        (ticks)
> ------- -------------- -------------- -------------- --------------
> ========================== 32B aligned ============================
>      64       6 -   10      27 -   52      30 -   39      56 -   97
>     512      24 -   44     251 -  271     145 -  217     396 -  447
>    1024      35 -   78     394 -  433     252 -  319     609 -  670
> ------- -------------- -------------- -------------- --------------
> C    64       3 -    9      28 -   31      29 -   40      55 -   66
> C   512      25 -   55     253 -  268     139 -  268     397 -  410
> C  1024      32 -   83     394 -  416     250 -  396     612 -  687
> =========================== Unaligned =============================
>      64       8 -    9      85 -   71      45 -   45     125 -  121
>     512      33 -   49     282 -  305     153 -  252     420 -  478
>    1024      42 -   83     409 -  491     259 -  389     640 -  748
> ------- -------------- -------------- -------------- --------------
> C    64       4 -    9      42 -   46      39 -   46      76 -   90
> C   512      33 -   55     280 -  272     153 -  281     421 -  415
> C  1024      41 -   83     407 -  427     258 -  405     578 -  701
> ======= ============== ============== ============== ==============
> 
> avx512:
> ** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
> ======= ============== ============== ============== ==============
>    Size Cache to cache   Cache to mem   Mem to cache     Mem to mem
> (bytes)        (ticks)        (ticks)        (ticks)        (ticks)
> ------- -------------- -------------- -------------- --------------
> ========================== 64B aligned ============================
>      64       6 -    9      18 -   33      24 -   38      40 -   65
>     512      18 -   44     178 -  262     138 -  218     309 -  429
>    1024      27 -   79     338 -  430     250 -  322     560 -  674
> ------- -------------- -------------- -------------- --------------
> C    64       3 -    9      18 -   20      23 -   41      39 -   50
> C   512      15 -   54     205 -  270     134 -  268     304 -  409
> C  1024      24 -   83     371 -  414     242 -  400     550 -  692
> =========================== Unaligned =============================
>      64       8 -    9      87 -   74      45 -   48     125 -  118
>     512      23 -   49     298 -  311     150 -  250     437 -  482
>    1024      36 -   83     427 -  505     259 -  406     633 -  754
> ------- -------------- -------------- -------------- --------------
> C    64       4 -    9      42 -   46      39 -   46      76 -   94
> C   512      23 -   55     246 -  277     152 -  290     349 -  426
> C  1024      38 -   83     398 -  431     258 -  416     634 -  725
> ======= ============== ============== ============== ==============
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-09-18  5:11 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-14  6:13 [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
2016-01-14  6:13 ` [PATCH 1/4] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
2016-01-14  6:13 ` [PATCH 2/4] mk: Predefine AVX512 macro for compiler Zhihong Wang
2016-01-14  6:13 ` [PATCH 3/4] lib/librte_eal: Optimize memcpy for AVX512 platforms Zhihong Wang
2016-01-14  6:13 ` [PATCH 4/4] app/test: Adjust alignment unit for memcpy perf test Zhihong Wang
2016-01-14 16:48 ` [PATCH 0/4] Optimize memcpy for AVX512 platforms Stephen Hemminger
2016-01-15  6:39   ` Wang, Zhihong
2016-01-15 22:03     ` Vincent JARDIN
2016-01-18  3:05 ` [PATCH v2 0/5] " Zhihong Wang
2016-01-18  3:05   ` [PATCH v2 1/5] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
2016-01-18  3:05   ` [PATCH v2 2/5] mk: Predefine AVX512 macro for compiler Zhihong Wang
2016-01-18  3:05   ` [PATCH v2 3/5] lib/librte_eal: Optimize memcpy for AVX512 platforms Zhihong Wang
2016-01-18  3:05   ` [PATCH v2 4/5] app/test: Adjust alignment unit for memcpy perf test Zhihong Wang
2016-01-18  3:05   ` [PATCH v2 5/5] lib/librte_eal: Tune memcpy for prior platforms Zhihong Wang
2016-01-18 20:06   ` [PATCH v2 0/5] Optimize memcpy for AVX512 platforms Stephen Hemminger
2016-01-19  2:37     ` Wang, Zhihong
2016-01-27 15:23   ` Thomas Monjalon
2016-01-28  6:09     ` Wang, Zhihong
2016-01-27 15:30   ` Thomas Monjalon
2016-01-27 18:48     ` Ananyev, Konstantin
2016-01-27 20:18       ` Thomas Monjalon
2017-08-30  9:37   ` linhaifeng
2017-09-18  5:10     ` Wang, Zhihong
