* [PATCH 0/4] eal/common: introduce rte_memset and related test
@ 2016-12-02  8:36 Zhiyong Yang
  2016-12-02  8:36 ` [PATCH 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang
                   ` (4 more replies)
  0 siblings, 5 replies; 44+ messages in thread
From: Zhiyong Yang @ 2016-12-02  8:36 UTC (permalink / raw)
  To: dev; +Cc: yuanhan.liu, bruce.richardson, konstantin.ananyev

DPDK has suffered noticeable performance drops in some cases when calling
the glibc function memset. See the discussion about memset at
http://dpdk.org/ml/archives/dev/2016-October/048628.html
A more efficient function is needed to fix this. One important property
of rte_memset is that it gives us clear control over which instruction
flow is used.

This patchset introduces rte_memset as a more efficient implementation.
It brings a clear performance improvement, especially for small N bytes,
which dominate most application scenarios.

Patch 1 implements rte_memset in the file rte_memset.h on the IA platform.
The file supports three instruction-set levels: sse & avx (128 bits),
avx2 (256 bits) and avx512 (512 bits). rte_memset uses vectorization and
inline functions to improve performance on IA. In addition, cache-line
and memory alignment are fully taken into account.

Patch 2 implements a functional autotest that validates that rte_memset
works correctly.

Patch 3 implements a performance autotest, measured separately for cached
and uncached (memory) access.

Patch 4 uses rte_memset instead of copy_virtio_net_hdr, which brings 3%~4%
performance improvement on the IA platform in virtio/vhost non-mergeable
loopback testing.

Zhiyong Yang (4):
  eal/common: introduce rte_memset on IA platform
  app/test: add functional autotest for rte_memset
  app/test: add performance autotest for rte_memset
  lib/librte_vhost: improve vhost perf using rte_memset

 app/test/Makefile                                  |   3 +
 app/test/test_memset.c                             | 158 +++++++++
 app/test/test_memset_perf.c                        | 348 +++++++++++++++++++
 doc/guides/rel_notes/release_17_02.rst             |  11 +
 .../common/include/arch/x86/rte_memset.h           | 376 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_memset.h |  51 +++
 lib/librte_vhost/virtio_net.c                      |  18 +-
 7 files changed, 958 insertions(+), 7 deletions(-)
 create mode 100644 app/test/test_memset.c
 create mode 100644 app/test/test_memset_perf.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h

-- 
2.7.4


* [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-02  8:36 [PATCH 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
@ 2016-12-02  8:36 ` Zhiyong Yang
  2016-12-02 10:25   ` Thomas Monjalon
  2016-12-27 10:04   ` [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
  2016-12-02  8:36 ` [PATCH 2/4] app/test: add functional autotest for rte_memset Zhiyong Yang
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 44+ messages in thread
From: Zhiyong Yang @ 2016-12-02  8:36 UTC (permalink / raw)
  To: dev; +Cc: yuanhan.liu, bruce.richardson, konstantin.ananyev, Zhiyong Yang

Performance drops have been observed in some cases when DPDK code calls
the glibc function memset; see the discussion about memset at
http://dpdk.org/ml/archives/dev/2016-October/048628.html
A more efficient function is needed to fix this. One important property
of rte_memset is that it gives us clear control over which instruction
flow is used. This patch supports the sse & avx (128 bits), avx2
(256 bits) and avx512 (512 bits) instruction sets. rte_memset makes full
use of vectorization and inline functions to improve performance on IA.
In addition, cache-line and memory alignment are fully taken into
account.

Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
---
 .../common/include/arch/x86/rte_memset.h           | 376 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_memset.h |  51 +++
 2 files changed, 427 insertions(+)
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memset.h b/lib/librte_eal/common/include/arch/x86/rte_memset.h
new file mode 100644
index 0000000..3b2d3a3
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memset.h
@@ -0,0 +1,376 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMSET_X86_64_H_
+#define _RTE_MEMSET_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file
+ *
+ * Functions for vectorised implementation of memset().
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <rte_vect.h>
+
+static inline void *
+rte_memset(void *dst, int a, size_t n) __attribute__((always_inline));
+
+static inline void
+rte_memset_less16(void *dst, int a, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+
+	if (n & 0x01) {
+		*(uint8_t *)dstu = (uint8_t)a;
+		dstu = (uintptr_t)((uint8_t *)dstu + 1);
+	}
+	if (n & 0x02) {
+		*(uint16_t *)dstu = (uint8_t)a | (((uint8_t)a) << 8);
+		dstu = (uintptr_t)((uint16_t *)dstu + 1);
+	}
+	if (n & 0x04) {
+		uint16_t b = ((uint8_t)a | (((uint8_t)a) << 8));
+
+		*(uint32_t *)dstu = (uint32_t)(b | (b << 16));
+		dstu = (uintptr_t)((uint32_t *)dstu + 1);
+	}
+	if (n & 0x08) {
+		uint16_t b = ((uint8_t)a | (((uint8_t)a) << 8));
+		uint32_t c = b | (b << 16);
+
+		*(uint32_t *)dstu = c;
+		*((uint32_t *)dstu + 1) = c;
+		dstu = (uintptr_t)((uint32_t *)dstu + 2);
+	}
+}
+
+static inline void
+rte_memset16(uint8_t *dst, int8_t a)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_set1_epi8(a);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+static inline void
+rte_memset_17to32(void *dst, int a, size_t n)
+{
+	rte_memset16((uint8_t *)dst, a);
+	rte_memset16((uint8_t *)dst - 16 + n, a);
+}
+
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+
+/**
+ * AVX512 implementation below
+ */
+
+static inline void
+rte_memset32(uint8_t *dst, int8_t a)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_set1_epi8(a);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+static inline void
+rte_memset64(uint8_t *dst, int8_t a)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_set1_epi8(a);
+	_mm512_storeu_si512((void *)dst, zmm0);
+}
+
+static inline void
+rte_memset128blocks(uint8_t *dst, int8_t a, size_t n)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_set1_epi8(a);
+	while (n >= 128) {
+		n -= 128;
+		_mm512_store_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_store_si512((void *)(dst + 1 * 64), zmm0);
+		dst = dst + 128;
+	}
+}
+
+static inline void *
+rte_memset(void *dst, int a, size_t n)
+{
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	if (n < 16) {
+		rte_memset_less16(dst, a, n);
+		return ret;
+	} else if (n == 16) {
+		rte_memset16((uint8_t *)dst, a);
+		return ret;
+	}
+	if (n <= 32) {
+		rte_memset_17to32(dst, a, n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_memset32((uint8_t *)dst, a);
+		rte_memset32((uint8_t *)dst - 32 + n, a);
+		return ret;
+	}
+	if (n >= 256) {
+		dstofss = ((uintptr_t)dst & 0x3F);
+		if (dstofss > 0) {
+			dstofss = 64 - dstofss;
+			n -= dstofss;
+			rte_memset64((uint8_t *)dst, a);
+			dst = (uint8_t *)dst + dstofss;
+		}
+		rte_memset128blocks((uint8_t *)dst, a, n);
+		bits = n;
+		n = n & 127;
+		bits -= n;
+		dst = (uint8_t *)dst + bits;
+	}
+	if (n > 128) {
+		n -= 128;
+		rte_memset64((uint8_t *)dst, a);
+		rte_memset64((uint8_t *)dst + 64, a);
+		dst = (uint8_t *)dst + 128;
+	}
+	if (n > 64) {
+		rte_memset64((uint8_t *)dst, a);
+		rte_memset64((uint8_t *)dst - 64 + n, a);
+		return ret;
+	}
+	if (n > 0)
+		rte_memset64((uint8_t *)dst - 64 + n, a);
+	return ret;
+}
+
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
+
+/**
+ *  AVX2 implementation below
+ */
+
+static inline void
+rte_memset32(uint8_t *dst, int8_t a)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_set1_epi8(a);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+static inline void
+rte_memset_33to64(void *dst, int a, size_t n)
+{
+	rte_memset32((uint8_t *)dst, a);
+	rte_memset32((uint8_t *)dst - 32 + n, a);
+}
+
+static inline void
+rte_memset64blocks(uint8_t *dst, int8_t a, size_t n)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_set1_epi8(a);
+	while (n >= 64) {
+		n -= 64;
+		_mm256_store_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+		_mm256_store_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm0);
+		dst = (uint8_t *)dst + 64;
+	}
+}
+
+static inline void *
+rte_memset(void *dst, int a, size_t n)
+{
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	if (n < 16) {
+		rte_memset_less16(dst, a, n);
+		return ret;
+	} else if (n == 16) {
+		rte_memset16((uint8_t *)dst, a);
+		return ret;
+	}
+	if (n <= 32) {
+		rte_memset_17to32(dst, a, n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_memset_33to64(dst, a, n);
+		return ret;
+	}
+	if (n > 64) {
+		dstofss = (uintptr_t)dst & 0x1F;
+		if (dstofss > 0) {
+			dstofss = 32 - dstofss;
+			n -= dstofss;
+			rte_memset32((uint8_t *)dst, a);
+			dst = (uint8_t *)dst + dstofss;
+		}
+		rte_memset64blocks((uint8_t *)dst, a, n);
+		bits = n;
+		n = n & 63;
+		bits -= n;
+		dst = (uint8_t *)dst + bits;
+	}
+	if (n > 32) {
+		rte_memset_33to64(dst, a, n);
+		return ret;
+	}
+	if (n > 0)
+		rte_memset32((uint8_t *)dst - 32 + n, a);
+	return ret;
+}
+
+#else /* RTE_MACHINE_CPUFLAG */
+
+/**
+ * SSE && AVX implementation below
+ */
+
+static inline void
+rte_memset32(uint8_t *dst, int8_t a)
+{
+	__m128i xmm0 = _mm_set1_epi8(a);
+
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+	_mm_storeu_si128((__m128i *)(dst + 16), xmm0);
+}
+
+static inline void
+rte_memset16blocks(uint8_t *dst, int8_t a, size_t n)
+{
+	__m128i xmm0 = _mm_set1_epi8(a);
+
+	while (n >= 16) {
+		n -= 16;
+		_mm_store_si128((__m128i *)(dst + 0 * 16), xmm0);
+		dst = (uint8_t *)dst + 16;
+	}
+}
+
+static inline void
+rte_memset64blocks(uint8_t *dst, int8_t a, size_t n)
+{
+	__m128i xmm0 = _mm_set1_epi8(a);
+
+	while (n >= 64) {
+		n -= 64;
+		_mm_store_si128((__m128i *)(dst + 0 * 16), xmm0);
+		_mm_store_si128((__m128i *)(dst + 1 * 16), xmm0);
+		_mm_store_si128((__m128i *)(dst + 2 * 16), xmm0);
+		_mm_store_si128((__m128i *)(dst + 3 * 16), xmm0);
+		dst = (uint8_t *)dst + 64;
+	}
+}
+
+static inline void *
+rte_memset(void *dst, int a, size_t n)
+{
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	if (n < 16) {
+		rte_memset_less16(dst, a, n);
+		return ret;
+	} else if (n == 16) {
+		rte_memset16((uint8_t *)dst, a);
+		return ret;
+	}
+	if (n <= 32) {
+		rte_memset_17to32(dst, a, n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_memset32((uint8_t *)dst, a);
+		rte_memset16((uint8_t *)dst - 16 + n, a);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_memset32((uint8_t *)dst, a);
+		rte_memset16((uint8_t *)dst + 32, a);
+		rte_memset16((uint8_t *)dst - 16 + n, a);
+		return ret;
+	}
+	if (n > 64) {
+		dstofss = (uintptr_t)dst & 0xF;
+		if (dstofss > 0) {
+			dstofss = 16 - dstofss;
+			n -= dstofss;
+			rte_memset16((uint8_t *)dst, a);
+			dst = (uint8_t *)dst + dstofss;
+		}
+		rte_memset64blocks((uint8_t *)dst, a, n);
+		bits = n;
+		n &= 63;
+		bits -= n;
+		dst = (uint8_t *)dst + bits;
+		rte_memset16blocks((uint8_t *)dst, a, n);
+		bits = n;
+		n &= 0xf;
+		bits -= n;
+		dst = (uint8_t *)dst + bits;
+		if (n > 0) {
+			rte_memset16((uint8_t *)dst - 16 + n, a);
+			return ret;
+		}
+	}
+	return ret;
+}
+
+#endif /* RTE_MACHINE_CPUFLAG */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMSET_X86_64_H_ */
diff --git a/lib/librte_eal/common/include/generic/rte_memset.h b/lib/librte_eal/common/include/generic/rte_memset.h
new file mode 100644
index 0000000..416a638
--- /dev/null
+++ b/lib/librte_eal/common/include/generic/rte_memset.h
@@ -0,0 +1,51 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMSET_H_
+#define _RTE_MEMSET_H_
+
+/**
+ * @file
+ *
+ * Functions for vectorised implementation of memset().
+ */
+#ifndef _RTE_MEMSET_X86_64_H_
+
+#define rte_memset memset
+
+#else
+
+static inline void *
+rte_memset(void *dst, int a, size_t n);
+
+#endif
+#endif /* _RTE_MEMSET_H_ */
-- 
2.7.4


* [PATCH 2/4] app/test: add functional autotest for rte_memset
  2016-12-02  8:36 [PATCH 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
  2016-12-02  8:36 ` [PATCH 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang
@ 2016-12-02  8:36 ` Zhiyong Yang
  2016-12-02  8:36 ` [PATCH 3/4] app/test: add performance " Zhiyong Yang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 44+ messages in thread
From: Zhiyong Yang @ 2016-12-02  8:36 UTC (permalink / raw)
  To: dev; +Cc: yuanhan.liu, bruce.richardson, konstantin.ananyev, Zhiyong Yang

The file implements the functional autotest for rte_memset, which
validates that the new function rte_memset works correctly. The
implementation of test_memcpy.c is used as a reference.

Usage:
step 1: run ./x86_64-native-linuxapp-gcc/app/test
step 2: run the command memset_autotest at the test app prompt.

Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
---
 app/test/Makefile      |   2 +
 app/test/test_memset.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 160 insertions(+)
 create mode 100644 app/test/test_memset.c

diff --git a/app/test/Makefile b/app/test/Makefile
index 5be023a..82da3f3 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -123,6 +123,8 @@ SRCS-y += test_logs.c
 SRCS-y += test_memcpy.c
 SRCS-y += test_memcpy_perf.c
 
+SRCS-y += test_memset.c
+
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash.c
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_thash.c
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_perf.c
diff --git a/app/test/test_memset.c b/app/test/test_memset.c
new file mode 100644
index 0000000..c9020bf
--- /dev/null
+++ b/app/test/test_memset.c
@@ -0,0 +1,158 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <rte_common.h>
+#include <rte_random.h>
+#include <rte_memset.h>
+#include "test.h"
+
+/*
+ * Set this to the maximum buffer size you want to test. If it is 0, then the
+ * values in the buf_sizes[] array below will be used.
+ */
+#define TEST_VALUE_RANGE 0
+#define MAX_INT8 127
+#define MIN_INT8 -128
+/* List of buffer sizes to test */
+#if TEST_VALUE_RANGE == 0
+static size_t buf_sizes[] = {
+	0, 1, 7, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129,
+	255, 256, 257, 320, 384, 511, 512, 513, 1023, 1024, 1025, 1518,
+	1522, 1600, 2048, 3072, 4096, 5120, 6144, 7168, 8192
+};
+/* MUST be as large as largest packet size above */
+#define BUFFER_SIZE       8192
+#else /* TEST_VALUE_RANGE != 0 */
+static size_t buf_sizes[TEST_VALUE_RANGE];
+#define BUFFER_SIZE       TEST_VALUE_RANGE
+#endif /* TEST_VALUE_RANGE == 0 */
+
+/* Data is aligned on this many bytes (power of 2) */
+#define ALIGNMENT_UNIT 32
+
+/*
+ * Create two buffers and initialize one as the reference buffer with
+ * random values; the other (dest_buff) starts as a copy of the reference
+ * buffer. Set a memory area of dest_buff using ch and then compare to
+ * check whether rte_memset worked. The bytes outside the set area are
+ * also checked to make sure they are not changed.
+ */
+static int
+test_single_memset(unsigned int off_dst, int ch, size_t size)
+{
+	unsigned int i;
+	uint8_t dest_buff[BUFFER_SIZE + ALIGNMENT_UNIT];
+	uint8_t ref_buff[BUFFER_SIZE + ALIGNMENT_UNIT];
+	void *ret;
+
+	/* Setup buffers */
+	for (i = 0; i < BUFFER_SIZE + ALIGNMENT_UNIT; i++) {
+		ref_buff[i] = (uint8_t) rte_rand();
+		dest_buff[i] = ref_buff[i];
+	}
+	/* Do the rte_memset */
+	ret = rte_memset(dest_buff + off_dst, ch, size);
+	if (ret != (dest_buff + off_dst)) {
+		printf("rte_memset() returned %p, not %p\n",
+		       ret, dest_buff + off_dst);
+	}
+	/* Check nothing before offset was affected */
+	for (i = 0; i < off_dst; i++) {
+		if (dest_buff[i] != ref_buff[i]) {
+			printf("rte_memset() failed for %u bytes "
+			       "(offsets=%u): [modified before start of dst].\n",
+			       (unsigned int)size, off_dst);
+			return -1;
+		}
+	}
+	/* Check every byte was set */
+	for (i = 0; i < size; i++) {
+		if (dest_buff[i + off_dst] != (uint8_t)ch) {
+			printf("rte_memset() failed for %u bytes "
+			       "(offsets=%u): [didn't memset byte %u].\n",
+			       (unsigned int)size, off_dst, i);
+			return -1;
+		}
+	}
+	/* Check nothing after memset was affected */
+	for (i = off_dst + size; i < BUFFER_SIZE + ALIGNMENT_UNIT; i++) {
+		if (dest_buff[i] != ref_buff[i]) {
+			printf("rte_memset() failed for %u bytes "
+			       "(offsets=%u): [memset too many].\n",
+			       (unsigned int)size, off_dst);
+			return -1;
+		}
+	}
+	return 0;
+}
+
+/*
+ * Check functionality for various buffer sizes and data offsets/alignments.
+ */
+static int
+func_test(void)
+{
+	unsigned int off_dst, i;
+	unsigned int num_buf_sizes = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+	int ret;
+	int j;
+
+	for (j = MIN_INT8; j <= MAX_INT8; j++) {
+		for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
+			for (i = 0; i < num_buf_sizes; i++) {
+				ret = test_single_memset(off_dst, j,
+							 buf_sizes[i]);
+				if (ret != 0)
+					return -1;
+			}
+		}
+	}
+	return 0;
+}
+
+static int
+test_memset(void)
+{
+	int ret;
+
+	ret = func_test();
+	if (ret != 0)
+		return -1;
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(memset_autotest, test_memset);
-- 
2.7.4


* [PATCH 3/4] app/test: add performance autotest for rte_memset
  2016-12-02  8:36 [PATCH 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
  2016-12-02  8:36 ` [PATCH 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang
  2016-12-02  8:36 ` [PATCH 2/4] app/test: add functional autotest for rte_memset Zhiyong Yang
@ 2016-12-02  8:36 ` Zhiyong Yang
  2016-12-02  8:36 ` [PATCH 4/4] lib/librte_vhost: improve vhost perf using rte_memset Zhiyong Yang
  2016-12-02 10:00 ` [PATCH 0/4] eal/common: introduce rte_memset and related test Maxime Coquelin
  4 siblings, 0 replies; 44+ messages in thread
From: Zhiyong Yang @ 2016-12-02  8:36 UTC (permalink / raw)
  To: dev; +Cc: yuanhan.liu, bruce.richardson, konstantin.ananyev, Zhiyong Yang

The file implements the perf autotest for rte_memset. Running it reports
performance numbers for rte_memset and memset side by side for comparison.
The first column shows the size N passed to the functions.
The second column shows the cycle counts (rte_memset - memset) for cached
access, and the third column shows the same pair for uncached memory access.

Usage:
step 1: run ./x86_64-native-linuxapp-gcc/app/test
step 2: run the command memset_perf_autotest at the test app prompt.

Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
---
 app/test/Makefile           |   1 +
 app/test/test_memset_perf.c | 348 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 349 insertions(+)
 create mode 100644 app/test/test_memset_perf.c

diff --git a/app/test/Makefile b/app/test/Makefile
index 82da3f3..1c3e7f1 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -124,6 +124,7 @@ SRCS-y += test_memcpy.c
 SRCS-y += test_memcpy_perf.c
 
 SRCS-y += test_memset.c
+SRCS-y += test_memset_perf.c
 
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash.c
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_thash.c
diff --git a/app/test/test_memset_perf.c b/app/test/test_memset_perf.c
new file mode 100644
index 0000000..83b15b5
--- /dev/null
+++ b/app/test/test_memset_perf.c
@@ -0,0 +1,348 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_random.h>
+#include <rte_malloc.h>
+#include <rte_memset.h>
+#include "test.h"
+
+/*
+ * Set this to the maximum buffer size you want to test. If it is 0, then the
+ * values in the buf_sizes[] array below will be used.
+ */
+#define TEST_VALUE_RANGE        0
+
+/* List of buffer sizes to test */
+#if TEST_VALUE_RANGE == 0
+static size_t buf_sizes[] = {
+	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65,
+	70, 85, 96, 105, 115, 127, 128, 129, 161, 191, 192, 193, 255, 256,
+	257, 319, 320, 321, 383, 384, 385, 447, 448, 449, 511, 512, 513,
+	767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600, 2048, 2560,
+	3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
+};
+/* MUST be as large as largest packet size above */
+#define SMALL_BUFFER_SIZE 8192
+#else /* TEST_VALUE_RANGE != 0 */
+static size_t buf_sizes[TEST_VALUE_RANGE];
+#define SMALL_BUFFER_SIZE       TEST_VALUE_RANGE
+#endif /* TEST_VALUE_RANGE == 0 */
+
+/*
+ * Arrays of this size are used for measuring uncached memory accesses by
+ * picking a random location within the buffer. Make this smaller if there are
+ * memory allocation errors.
+ */
+#define LARGE_BUFFER_SIZE       (100 * 1024 * 1024)
+
+/* How many times to run timing loop for performance tests */
+#define TEST_ITERATIONS         1000000
+#define TEST_BATCH_SIZE         100
+
+/* Data is aligned on this many bytes (power of 2) */
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+#define ALIGNMENT_UNIT          64
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
+#define ALIGNMENT_UNIT          32
+#else /* RTE_MACHINE_CPUFLAG */
+#define ALIGNMENT_UNIT          16
+#endif /* RTE_MACHINE_CPUFLAG */
+
+/*
+ * Pointers used in performance tests. The two large buffers are for uncached
+ * access where random addresses within the buffer are used for each
+ * memset. The two small buffers are for cached access.
+ */
+static uint8_t *large_buf_read, *large_buf_write;
+static uint8_t *small_buf_read, *small_buf_write;
+
+/* Initialise data buffers. */
+static int
+init_buffers(void)
+{
+	unsigned int i;
+
+	large_buf_read = rte_malloc("memset", LARGE_BUFFER_SIZE
+					+ ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	if (large_buf_read == NULL)
+		goto error_large_buf_read;
+
+	large_buf_write = rte_malloc("memset", LARGE_BUFFER_SIZE
+					+ ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	if (large_buf_write == NULL)
+		goto error_large_buf_write;
+
+	small_buf_read = rte_malloc("memset", SMALL_BUFFER_SIZE
+					+ ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	if (small_buf_read == NULL)
+		goto error_small_buf_read;
+
+	small_buf_write = rte_malloc("memset", SMALL_BUFFER_SIZE
+					+ ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	if (small_buf_write == NULL)
+		goto error_small_buf_write;
+
+	for (i = 0; i < LARGE_BUFFER_SIZE; i++)
+		large_buf_read[i] = rte_rand();
+	for (i = 0; i < SMALL_BUFFER_SIZE; i++)
+		small_buf_read[i] = rte_rand();
+
+	return 0;
+
+error_small_buf_write:
+	rte_free(small_buf_read);
+error_small_buf_read:
+	rte_free(large_buf_write);
+error_large_buf_write:
+	rte_free(large_buf_read);
+error_large_buf_read:
+	printf("ERROR: not enough memory\n");
+	return -1;
+}
+
+/* Cleanup data buffers */
+static void
+free_buffers(void)
+{
+	rte_free(large_buf_read);
+	rte_free(large_buf_write);
+	rte_free(small_buf_read);
+	rte_free(small_buf_write);
+}
+
+/*
+ * Get a random offset into the large array, leaving enough space for the
+ * largest memset. The offset is aligned; uoffset adds an unaligned shift.
+ */
+static inline size_t
+get_rand_offset(size_t uoffset)
+{
+	return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
+			~(ALIGNMENT_UNIT - 1)) + uoffset;
+}
+
+/* Fill in destination addresses. */
+static inline void
+fill_addr_arrays(size_t *dst_addr, int is_dst_cached, size_t dst_uoffset)
+{
+	unsigned int i;
+
+	for (i = 0; i < TEST_BATCH_SIZE; i++)
+		dst_addr[i] = (is_dst_cached) ? dst_uoffset :
+					get_rand_offset(dst_uoffset);
+}
+
+/*
+ * WORKAROUND: For some reason the first test doing an uncached write
+ * takes a very long time (~25 times longer than is expected). So we do
+ * it once without timing.
+ */
+static void
+do_uncached_write(uint8_t *dst, int is_dst_cached, size_t size)
+{
+	unsigned int i, j;
+	size_t dst_addrs[TEST_BATCH_SIZE];
+	int ch = rte_rand() & 0xff;
+
+	for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
+		fill_addr_arrays(dst_addrs, is_dst_cached, 0);
+		for (j = 0; j < TEST_BATCH_SIZE; j++)
+			rte_memset(dst+dst_addrs[j], ch, size);
+	}
+}
+
+/*
+ * Run a single memset performance test. This is a macro to ensure that if
+ * the "size" parameter is a constant it won't be converted to a variable.
+ */
+#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset, size)             \
+do {                                                                        \
+	unsigned int iter, t;                                               \
+	size_t dst_addrs[TEST_BATCH_SIZE];                                  \
+	uint64_t start_time, total_time = 0;                                \
+	uint64_t total_time2 = 0;                                           \
+	int ch = rte_rand() & 0xff;                                         \
+									    \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {\
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset);    \
+		start_time = rte_rdtsc();                                   \
+		for (t = 0; t < TEST_BATCH_SIZE; t++)                       \
+			rte_memset(dst + dst_addrs[t], ch, size);           \
+		total_time += rte_rdtsc() - start_time;                     \
+	}                                                                   \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {\
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset);   \
+		start_time = rte_rdtsc();                                   \
+		for (t = 0; t < TEST_BATCH_SIZE; t++)                       \
+			memset(dst + dst_addrs[t], ch, size);               \
+		total_time2 += rte_rdtsc() - start_time;                    \
+	}                                                                   \
+	printf("%8.0f -",  (double)total_time / TEST_ITERATIONS);           \
+	printf("%5.0f",  (double)total_time2 / TEST_ITERATIONS);            \
+} while (0)
+
+/* Run aligned memset tests. */
+#define ALL_PERF_TESTS_FOR_SIZE(n)                                       \
+do {                                                                     \
+	if (__builtin_constant_p(n))                                     \
+		printf("\nC%6u", (unsigned int)n);                       \
+	else                                                             \
+		printf("\n%7u", (unsigned int)n);                        \
+	SINGLE_PERF_TEST(small_buf_write, 1, 0, n);                      \
+	SINGLE_PERF_TEST(large_buf_write, 0, 0, n);                      \
+} while (0)
+
+/* Run unaligned memset tests */
+#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)                             \
+do {                                                                     \
+	if (__builtin_constant_p(n))                                     \
+		printf("\nC%6u", (unsigned int)n);                       \
+	else                                                             \
+		printf("\n%7u", (unsigned int)n);                        \
+	SINGLE_PERF_TEST(small_buf_write, 1, 1, n);                      \
+	SINGLE_PERF_TEST(large_buf_write, 0, 1, n);                      \
+} while (0)
+
+/* Run memset tests for constant length */
+#define ALL_PERF_TEST_FOR_CONSTANT                                       \
+do {                                                                     \
+	TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);      \
+	TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);   \
+	TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U); \
+} while (0)
+
+/* Run all memset tests for aligned constant cases */
+static inline void
+perf_test_constant_aligned(void)
+{
+#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE
+	ALL_PERF_TEST_FOR_CONSTANT;
+#undef TEST_CONSTANT
+}
+
+/* Run all memset tests for unaligned constant cases */
+static inline void
+perf_test_constant_unaligned(void)
+{
+#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE_UNALIGNED
+	ALL_PERF_TEST_FOR_CONSTANT;
+#undef TEST_CONSTANT
+}
+
+/* Run all memset tests for aligned variable cases */
+static inline void
+perf_test_variable_aligned(void)
+{
+	unsigned int n = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
+}
+
+/* Run all memset tests for unaligned variable cases */
+static inline void
+perf_test_variable_unaligned(void)
+{
+	unsigned int n = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
+}
+
+/* Run all memset tests */
+static int
+perf_test(void)
+{
+	int ret;
+
+	ret = init_buffers();
+	if (ret != 0)
+		return ret;
+
+#if TEST_VALUE_RANGE != 0
+	/* Set up buf_sizes array, if required */
+	unsigned int i;
+
+	for (i = 0; i < TEST_VALUE_RANGE; i++)
+		buf_sizes[i] = i;
+#endif
+
+	/* See function comment */
+	do_uncached_write(large_buf_write, 0, SMALL_BUFFER_SIZE);
+
+	printf("\n** rte_memset() - memset perf tests"
+		" (C = compile-time constant) **\n"
+		"======== ======= ======== ======= ========\n"
+		"   Size memset in cache  memset in mem\n"
+		"(bytes)        (ticks)        (ticks)\n"
+		"------- -------------- ---------------");
+
+	printf("\n============= %2dB aligned ================", ALIGNMENT_UNIT);
+	/* Do aligned tests where size is a variable */
+	perf_test_variable_aligned();
+	printf("\n------ -------------- -------------- ------");
+	/* Do aligned tests where size is a compile-time constant */
+	perf_test_constant_aligned();
+	printf("\n============= Unaligned ===================");
+	/* Do unaligned tests where size is a variable */
+	perf_test_variable_unaligned();
+	printf("\n------ -------------- -------------- ------");
+	/* Do unaligned tests where size is a compile-time constant */
+	perf_test_constant_unaligned();
+	printf("\n====== ============== ============== =======\n\n");
+
+	free_buffers();
+
+	return 0;
+}
+
+static int
+test_memset_perf(void)
+{
+	int ret;
+
+	ret = perf_test();
+	if (ret != 0)
+		return -1;
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(memset_perf_autotest, test_memset_perf);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 4/4] lib/librte_vhost: improve vhost perf using rte_memset
  2016-12-02  8:36 [PATCH 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
                   ` (2 preceding siblings ...)
  2016-12-02  8:36 ` [PATCH 3/4] app/test: add performance " Zhiyong Yang
@ 2016-12-02  8:36 ` Zhiyong Yang
  2016-12-02  9:46   ` Thomas Monjalon
  2016-12-02 10:00 ` [PATCH 0/4] eal/common: introduce rte_memset and related test Maxime Coquelin
  4 siblings, 1 reply; 44+ messages in thread
From: Zhiyong Yang @ 2016-12-02  8:36 UTC (permalink / raw)
  To: dev; +Cc: yuanhan.liu, bruce.richardson, konstantin.ananyev, Zhiyong Yang

Using rte_memset instead of copy_virtio_net_hdr brings 3%~4%
performance improvement on the IA platform in virtio/vhost
non-mergeable loopback testing.

Two key points account for the gain:
1. The initialization of one stack variable, which involves a memory
store, is saved.
2. copy_virtio_net_hdr involves both a load (from the stack, of the
virtio_hdr variable) and a store (to virtio driver memory), while
rte_memset involves only stores.

Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
---
 doc/guides/rel_notes/release_17_02.rst | 11 +++++++++++
 lib/librte_vhost/virtio_net.c          | 18 +++++++++++-------
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/doc/guides/rel_notes/release_17_02.rst b/doc/guides/rel_notes/release_17_02.rst
index 3b65038..eecf857 100644
--- a/doc/guides/rel_notes/release_17_02.rst
+++ b/doc/guides/rel_notes/release_17_02.rst
@@ -38,6 +38,17 @@ New Features
      Also, make sure to start the actual text at the margin.
      =========================================================
 
+* **Introduced rte_memset and related test on IA platform.**
+
+  A bad performance drop was observed in some cases on Ivybridge when DPDK code calls the
+  glibc function memset, so a more efficient replacement was needed.
+  The function rte_memset supports three instruction-set variants: sse & avx (128 bits),
+  avx2 (256 bits) and avx512 (512 bits).
+
+  * Added rte_memset support on IA platform.
+  * Added functional autotest support for rte_memset.
+  * Added performance autotest support for rte_memset.
+  * Improved performance to use rte_memset instead of copy_virtio_net_hdr in lib/librte_vhost.
 
 Resolved Issues
 ---------------
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 595f67c..392b31b 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -37,6 +37,7 @@
 
 #include <rte_mbuf.h>
 #include <rte_memcpy.h>
+#include <rte_memset.h>
 #include <rte_ether.h>
 #include <rte_ip.h>
 #include <rte_virtio_net.h>
@@ -194,7 +195,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vring_desc *descs,
 	uint32_t cpy_len;
 	struct vring_desc *desc;
 	uint64_t desc_addr;
-	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
+	struct virtio_net_hdr *virtio_hdr;
 
 	desc = &descs[desc_idx];
 	desc_addr = gpa_to_vva(dev, desc->addr);
@@ -208,8 +209,9 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vring_desc *descs,
 
 	rte_prefetch0((void *)(uintptr_t)desc_addr);
 
-	virtio_enqueue_offload(m, &virtio_hdr.hdr);
-	copy_virtio_net_hdr(dev, desc_addr, virtio_hdr);
+	virtio_hdr = (struct virtio_net_hdr *)(uintptr_t)desc_addr;
+	rte_memset(virtio_hdr, 0, sizeof(*virtio_hdr));
+	virtio_enqueue_offload(m, virtio_hdr);
 	vhost_log_write(dev, desc->addr, dev->vhost_hlen);
 	PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
 
@@ -459,7 +461,6 @@ static inline int __attribute__((always_inline))
 copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m,
 			    struct buf_vector *buf_vec, uint16_t num_buffers)
 {
-	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
 	uint32_t vec_idx = 0;
 	uint64_t desc_addr;
 	uint32_t mbuf_offset, mbuf_avail;
@@ -480,7 +481,6 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m,
 	hdr_phys_addr = buf_vec[vec_idx].buf_addr;
 	rte_prefetch0((void *)(uintptr_t)hdr_addr);
 
-	virtio_hdr.num_buffers = num_buffers;
 	LOG_DEBUG(VHOST_DATA, "(%d) RX: num merge buffers %d\n",
 		dev->vid, num_buffers);
 
@@ -512,8 +512,12 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m,
 		}
 
 		if (hdr_addr) {
-			virtio_enqueue_offload(hdr_mbuf, &virtio_hdr.hdr);
-			copy_virtio_net_hdr(dev, hdr_addr, virtio_hdr);
+			struct virtio_net_hdr_mrg_rxbuf *hdr =
+			(struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)hdr_addr;
+
+			rte_memset(&(hdr->hdr), 0, sizeof(hdr->hdr));
+			hdr->num_buffers = num_buffers;
+			virtio_enqueue_offload(hdr_mbuf, &(hdr->hdr));
 			vhost_log_write(dev, hdr_phys_addr, dev->vhost_hlen);
 			PRINT_PACKET(dev, (uintptr_t)hdr_addr,
 				     dev->vhost_hlen, 0);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/4] lib/librte_vhost: improve vhost perf using rte_memset
  2016-12-02  8:36 ` [PATCH 4/4] lib/librte_vhost: improve vhost perf using rte_memset Zhiyong Yang
@ 2016-12-02  9:46   ` Thomas Monjalon
  2016-12-06  8:04     ` Yang, Zhiyong
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Monjalon @ 2016-12-02  9:46 UTC (permalink / raw)
  To: Zhiyong Yang; +Cc: dev, yuanhan.liu, bruce.richardson, konstantin.ananyev

2016-12-05 16:26, Zhiyong Yang:
> +* **Introduced rte_memset and related test on IA platform.**
> +
> +  Performance drop had been caused in some cases on Ivybridge when DPDK code calls glibc
> +  function memset. It was necessary to introduce more high efficient function to fix it.
> +  The function rte_memset supported three types of instruction sets including sse & avx(128 bits),
> +  avx2(256 bits) and avx512(512bits).
> +
> +  * Added rte_memset support on IA platform.
> +  * Added functional autotest support for rte_memset.
> +  * Added performance autotest support for rte_memset.

No need to reference autotests in the release notes.

> +  * Improved performance to use rte_memset instead of copy_virtio_net_hdr in lib/librte_vhost.

Please describe this change at a higher level. Which case it is improving?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/4] eal/common: introduce rte_memset and related test
  2016-12-02  8:36 [PATCH 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
                   ` (3 preceding siblings ...)
  2016-12-02  8:36 ` [PATCH 4/4] lib/librte_vhost: improve vhost perf using rte_memset Zhiyong Yang
@ 2016-12-02 10:00 ` Maxime Coquelin
  2016-12-06  6:33   ` Yang, Zhiyong
  4 siblings, 1 reply; 44+ messages in thread
From: Maxime Coquelin @ 2016-12-02 10:00 UTC (permalink / raw)
  To: Zhiyong Yang, dev; +Cc: yuanhan.liu, bruce.richardson, konstantin.ananyev

Hi Zhiyong,

On 12/05/2016 09:26 AM, Zhiyong Yang wrote:
> DPDK code has met performance drop badly in some case when calling glibc
> function memset. Reference to discussions about memset in
> http://dpdk.org/ml/archives/dev/2016-October/048628.html
> It is necessary to introduce more high efficient function to fix it.
> One important thing about rte_memset is that we can get clear control
> on what instruction flow is used.
>
> This patchset introduces rte_memset to bring more high efficient
> implementation, and will bring obvious perf improvement, especially
> for small N bytes in the most application scenarios.
>
> Patch 1 implements rte_memset in the file rte_memset.h on IA platform
> The file supports three types of instruction sets including sse & avx
> (128bits), avx2(256bits) and avx512(512bits). rte_memset makes use of
> vectorization and inline function to improve the perf on IA. In addition,
> cache line and memory alignment are fully taken into consideration.
>
> Patch 2 implements functional autotest to validates the function whether
> to work in a right way.
>
> Patch 3 implements performance autotest separately in cache and memory.
>
> Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring 3%~4%
> performance improvements on IA platform from virtio/vhost non-mergeable
> loopback testing.
>
> Zhiyong Yang (4):
>   eal/common: introduce rte_memset on IA platform
>   app/test: add functional autotest for rte_memset
>   app/test: add performance autotest for rte_memset
>   lib/librte_vhost: improve vhost perf using rte_memset
>
>  app/test/Makefile                                  |   3 +
>  app/test/test_memset.c                             | 158 +++++++++
>  app/test/test_memset_perf.c                        | 348 +++++++++++++++++++
>  doc/guides/rel_notes/release_17_02.rst             |  11 +
>  .../common/include/arch/x86/rte_memset.h           | 376 +++++++++++++++++++++
>  lib/librte_eal/common/include/generic/rte_memset.h |  51 +++
>  lib/librte_vhost/virtio_net.c                      |  18 +-
>  7 files changed, 958 insertions(+), 7 deletions(-)
>  create mode 100644 app/test/test_memset.c
>  create mode 100644 app/test/test_memset_perf.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset.h
>  create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h
>

Thanks for the series, idea looks good to me.

Wouldn't be worth to also use rte_memset in Virtio PMD (not
compiled/tested)? :

diff --git a/drivers/net/virtio/virtio_rxtx.c 
b/drivers/net/virtio/virtio_rxtx.c
index 22d97a4..a5f70c4 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -287,7 +287,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, 
struct rte_mbuf *cookie,
                         rte_pktmbuf_prepend(cookie, head_size);
                 /* if offload disabled, it is not zeroed below, do it 
now */
                 if (offload == 0)
-                       memset(hdr, 0, head_size);
+                       rte_memset(hdr, 0, head_size);
         } else if (use_indirect) {
                 /* setup tx ring slot to point to indirect
                  * descriptor list stored in reserved region.

Cheers,
Maxime

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-02  8:36 ` [PATCH 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang
@ 2016-12-02 10:25   ` Thomas Monjalon
  2016-12-08  7:41     ` Yang, Zhiyong
  2016-12-27 10:04   ` [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
  1 sibling, 1 reply; 44+ messages in thread
From: Thomas Monjalon @ 2016-12-02 10:25 UTC (permalink / raw)
  To: Zhiyong Yang
  Cc: dev, yuanhan.liu, bruce.richardson, konstantin.ananyev, Pablo de Lara

2016-12-05 16:26, Zhiyong Yang:
> +#ifndef _RTE_MEMSET_X86_64_H_

Is this implementation specific to 64-bit?

> +
> +#define rte_memset memset
> +
> +#else
> +
> +static void *
> +rte_memset(void *dst, int a, size_t n);
> +
> +#endif

If I understand well, rte_memset (as rte_memcpy) is using the most recent
instructions available (and enabled) when compiling.
It is not adapting the instructions to the run-time CPU.
There is no need to downgrade at run-time the instruction set as it is
obviously not a supported case, but it would be nice to be able to
upgrade a "default compilation" at run-time as it is done in rte_acl.
I explain this case more clearly for reference:

We can have AVX512 supported in the compiler but disable it when compiling
(CONFIG_RTE_MACHINE=snb) in order to build a binary running almost everywhere.
When running this binary on a CPU having AVX512 support, it will not
benefit of the AVX512 improvement.
Though, we can compile an AVX512 version of some functions and use them only
if the running CPU is capable.
This kind of miracle can be achieved in two ways:

1/ For generic C code compiled with a recent GCC, a function can be built
for several CPUs thanks to the attribute target_clones.

2/ For manually optimized functions using CPU-specific intrinsics or asm,
it is possible to build them with non-default flags thanks to the
attribute target.

3/ For manually optimized files using CPU-specific intrinsics or asm,
we use specifics flags in the makefile.

The function clone in case 1/ is dynamically chosen at run-time
through ifunc resolver.
The specific functions in cases 2/ and 3/ must chosen at run-time
by initializing a function pointer thanks to rte_cpu_get_flag_enabled().

Note that rte_hash and software crypto PMDs have a run-time check
with rte_cpu_get_flag_enabled() but do not override CFLAGS
in the Makefile. Next step for these libraries?

Back to rte_memset, I think you should try the solution 2/.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/4] eal/common: introduce rte_memset and related test
  2016-12-02 10:00 ` [PATCH 0/4] eal/common: introduce rte_memset and related test Maxime Coquelin
@ 2016-12-06  6:33   ` Yang, Zhiyong
  2016-12-06  8:29     ` Maxime Coquelin
  0 siblings, 1 reply; 44+ messages in thread
From: Yang, Zhiyong @ 2016-12-06  6:33 UTC (permalink / raw)
  To: Maxime Coquelin, dev; +Cc: yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin

Hi, Maxime:

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Friday, December 2, 2016 6:01 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org
> Cc: yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset
> and related test
> 
> Hi Zhiyong,
> 
> On 12/05/2016 09:26 AM, Zhiyong Yang wrote:
> > DPDK code has met performance drop badly in some case when calling
> > glibc function memset. Reference to discussions about memset in
> > http://dpdk.org/ml/archives/dev/2016-October/048628.html
> > It is necessary to introduce more high efficient function to fix it.
> > One important thing about rte_memset is that we can get clear control
> > on what instruction flow is used.
> >
> > This patchset introduces rte_memset to bring more high efficient
> > implementation, and will bring obvious perf improvement, especially
> > for small N bytes in the most application scenarios.
> >
> > Patch 1 implements rte_memset in the file rte_memset.h on IA platform
> > The file supports three types of instruction sets including sse & avx
> > (128bits), avx2(256bits) and avx512(512bits). rte_memset makes use of
> > vectorization and inline function to improve the perf on IA. In
> > addition, cache line and memory alignment are fully taken into
> consideration.
> >
> > Patch 2 implements functional autotest to validates the function
> > whether to work in a right way.
> >
> > Patch 3 implements performance autotest separately in cache and memory.
> >
> > Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring
> > 3%~4% performance improvements on IA platform from virtio/vhost
> > non-mergeable loopback testing.
> >
> > Zhiyong Yang (4):
> >   eal/common: introduce rte_memset on IA platform
> >   app/test: add functional autotest for rte_memset
> >   app/test: add performance autotest for rte_memset
> >   lib/librte_vhost: improve vhost perf using rte_memset
> >
> >  app/test/Makefile                                  |   3 +
> >  app/test/test_memset.c                             | 158 +++++++++
> >  app/test/test_memset_perf.c                        | 348 +++++++++++++++++++
> >  doc/guides/rel_notes/release_17_02.rst             |  11 +
> >  .../common/include/arch/x86/rte_memset.h           | 376
> +++++++++++++++++++++
> >  lib/librte_eal/common/include/generic/rte_memset.h |  51 +++
> >  lib/librte_vhost/virtio_net.c                      |  18 +-
> >  7 files changed, 958 insertions(+), 7 deletions(-)  create mode
> > 100644 app/test/test_memset.c  create mode 100644
> > app/test/test_memset_perf.c  create mode 100644
> > lib/librte_eal/common/include/arch/x86/rte_memset.h
> >  create mode 100644
> lib/librte_eal/common/include/generic/rte_memset.h
> >
> 
> Thanks for the series, idea looks good to me.
> 
> Wouldn't be worth to also use rte_memset in Virtio PMD (not
> compiled/tested)? :
> 

I think rte_memset can bring some benefit here, but I'm not clear on how to
exercise that branch and test it. :)

thanks
Zhiyong

> diff --git a/drivers/net/virtio/virtio_rxtx.c
> b/drivers/net/virtio/virtio_rxtx.c
> index 22d97a4..a5f70c4 100644
> --- a/drivers/net/virtio/virtio_rxtx.c
> +++ b/drivers/net/virtio/virtio_rxtx.c
> @@ -287,7 +287,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq,
> struct rte_mbuf *cookie,
>                          rte_pktmbuf_prepend(cookie, head_size);
>                  /* if offload disabled, it is not zeroed below, do it now */
>                  if (offload == 0)
> -                       memset(hdr, 0, head_size);
> +                       rte_memset(hdr, 0, head_size);
>          } else if (use_indirect) {
>                  /* setup tx ring slot to point to indirect
>                   * descriptor list stored in reserved region.
> 
> Cheers,
> Maxime

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/4] lib/librte_vhost: improve vhost perf using rte_memset
  2016-12-02  9:46   ` Thomas Monjalon
@ 2016-12-06  8:04     ` Yang, Zhiyong
  0 siblings, 0 replies; 44+ messages in thread
From: Yang, Zhiyong @ 2016-12-06  8:04 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin

Hi, Thomas:

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Friday, December 2, 2016 5:46 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 4/4] lib/librte_vhost: improve vhost perf
> using rte_memset
> 
> 2016-12-05 16:26, Zhiyong Yang:
> > +* **Introduced rte_memset and related test on IA platform.**
> > +
> > +  Performance drop had been caused in some cases on Ivybridge when
> > + DPDK code calls glibc  function memset. It was necessary to introduce
> more high efficient function to fix it.
> > +  The function rte_memset supported three types of instruction sets
> > + including sse & avx(128 bits),
> > +  avx2(256 bits) and avx512(512bits).
> > +
> > +  * Added rte_memset support on IA platform.
> > +  * Added functional autotest support for rte_memset.
> > +  * Added performance autotest support for rte_memset.
> 
> No need to reference autotests in the release notes.

Ok.
I will remove the two lines.
> 
> > +  * Improved performance to use rte_memset instead of
> copy_virtio_net_hdr in lib/librte_vhost.
> 
> Please describe this change at a higher level. Which case it is improving?

Ok, good comments.

* Improved vhost performance by about 3% on the IA platform by using
rte_memset instead of copy_virtio_net_hdr, measured with the virtio/vhost
non-mergeable loopback test (no NIC involved).

Thanks
Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/4] eal/common: introduce rte_memset and related test
  2016-12-06  6:33   ` Yang, Zhiyong
@ 2016-12-06  8:29     ` Maxime Coquelin
  2016-12-07  9:28       ` Yang, Zhiyong
  0 siblings, 1 reply; 44+ messages in thread
From: Maxime Coquelin @ 2016-12-06  8:29 UTC (permalink / raw)
  To: Yang, Zhiyong, dev
  Cc: yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin,
	Pierre Pfister (ppfister)



On 12/06/2016 07:33 AM, Yang, Zhiyong wrote:
> Hi, Maxime:
>
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Friday, December 2, 2016 6:01 PM
>> To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org
>> Cc: yuanhan.liu@linux.intel.com; Richardson, Bruce
>> <bruce.richardson@intel.com>; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>
>> Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset
>> and related test
>>
>> Hi Zhiyong,
>>
>> On 12/05/2016 09:26 AM, Zhiyong Yang wrote:
>>> DPDK code has met performance drop badly in some case when calling
>>> glibc function memset. Reference to discussions about memset in
>>> http://dpdk.org/ml/archives/dev/2016-October/048628.html
>>> It is necessary to introduce more high efficient function to fix it.
>>> One important thing about rte_memset is that we can get clear control
>>> on what instruction flow is used.
>>>
>>> This patchset introduces rte_memset to bring more high efficient
>>> implementation, and will bring obvious perf improvement, especially
>>> for small N bytes in the most application scenarios.
>>>
>>> Patch 1 implements rte_memset in the file rte_memset.h on IA platform
>>> The file supports three types of instruction sets including sse & avx
>>> (128bits), avx2(256bits) and avx512(512bits). rte_memset makes use of
>>> vectorization and inline function to improve the perf on IA. In
>>> addition, cache line and memory alignment are fully taken into
>> consideration.
>>>
>>> Patch 2 implements functional autotest to validates the function
>>> whether to work in a right way.
>>>
>>> Patch 3 implements performance autotest separately in cache and memory.
>>>
>>> Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring
>>> 3%~4% performance improvements on IA platform from virtio/vhost
>>> non-mergeable loopback testing.
>>>
>>> Zhiyong Yang (4):
>>>   eal/common: introduce rte_memset on IA platform
>>>   app/test: add functional autotest for rte_memset
>>>   app/test: add performance autotest for rte_memset
>>>   lib/librte_vhost: improve vhost perf using rte_memset
>>>
>>>  app/test/Makefile                                  |   3 +
>>>  app/test/test_memset.c                             | 158 +++++++++
>>>  app/test/test_memset_perf.c                        | 348 +++++++++++++++++++
>>>  doc/guides/rel_notes/release_17_02.rst             |  11 +
>>>  .../common/include/arch/x86/rte_memset.h           | 376
>> +++++++++++++++++++++
>>>  lib/librte_eal/common/include/generic/rte_memset.h |  51 +++
>>>  lib/librte_vhost/virtio_net.c                      |  18 +-
>>>  7 files changed, 958 insertions(+), 7 deletions(-)  create mode
>>> 100644 app/test/test_memset.c  create mode 100644
>>> app/test/test_memset_perf.c  create mode 100644
>>> lib/librte_eal/common/include/arch/x86/rte_memset.h
>>>  create mode 100644
>> lib/librte_eal/common/include/generic/rte_memset.h
>>>
>>
>> Thanks for the series, idea looks good to me.
>>
>> Wouldn't be worth to also use rte_memset in Virtio PMD (not
>> compiled/tested)? :
>>
>
> I think  rte_memset  maybe can bring some benefit here,  but , I'm not clear how to
> enter the branch and test it. :)

Indeed, you will need Pierre's patch:
[dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set

Thanks,
Maxime
>
> thanks
> Zhiyong
>
>> diff --git a/drivers/net/virtio/virtio_rxtx.c
>> b/drivers/net/virtio/virtio_rxtx.c
>> index 22d97a4..a5f70c4 100644
>> --- a/drivers/net/virtio/virtio_rxtx.c
>> +++ b/drivers/net/virtio/virtio_rxtx.c
>> @@ -287,7 +287,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq,
>> struct rte_mbuf *cookie,
>>                          rte_pktmbuf_prepend(cookie, head_size);
>>                  /* if offload disabled, it is not zeroed below, do it now */
>>                  if (offload == 0)
>> -                       memset(hdr, 0, head_size);
>> +                       rte_memset(hdr, 0, head_size);
>>          } else if (use_indirect) {
>>                  /* setup tx ring slot to point to indirect
>>                   * descriptor list stored in reserved region.
>>
>> Cheers,
>> Maxime

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/4] eal/common: introduce rte_memset and related test
  2016-12-06  8:29     ` Maxime Coquelin
@ 2016-12-07  9:28       ` Yang, Zhiyong
  2016-12-07  9:37         ` Yuanhan Liu
  0 siblings, 1 reply; 44+ messages in thread
From: Yang, Zhiyong @ 2016-12-07  9:28 UTC (permalink / raw)
  To: Maxime Coquelin, dev
  Cc: yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin,
	Pierre Pfister (ppfister)

Hi, Maxime:

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Tuesday, December 6, 2016 4:30 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org
> Cc: yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Pierre Pfister (ppfister)
> <ppfister@cisco.com>
> Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset
> and related test
> 
> 
> 
> On 12/06/2016 07:33 AM, Yang, Zhiyong wrote:
> > Hi, Maxime:
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >> Sent: Friday, December 2, 2016 6:01 PM
> >> To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org
> >> Cc: yuanhan.liu@linux.intel.com; Richardson, Bruce
> >> <bruce.richardson@intel.com>; Ananyev, Konstantin
> >> <konstantin.ananyev@intel.com>
> >> Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset
> >> and related test
> >>
> >> Hi Zhiyong,
> >>
> >> On 12/05/2016 09:26 AM, Zhiyong Yang wrote:
> >>> DPDK code has met performance drop badly in some case when calling
> >>> glibc function memset. Reference to discussions about memset in
> >>> http://dpdk.org/ml/archives/dev/2016-October/048628.html
> >>> It is necessary to introduce more high efficient function to fix it.
> >>> One important thing about rte_memset is that we can get clear
> >>> control on what instruction flow is used.
> >>>
> >>> This patchset introduces rte_memset to bring more high efficient
> >>> implementation, and will bring obvious perf improvement, especially
> >>> for small N bytes in the most application scenarios.
> >>>
> >>> Patch 1 implements rte_memset in the file rte_memset.h on IA
> >>> platform The file supports three types of instruction sets including
> >>> sse & avx (128bits), avx2(256bits) and avx512(512bits). rte_memset
> >>> makes use of vectorization and inline function to improve the perf
> >>> on IA. In addition, cache line and memory alignment are fully taken
> >>> into
> >> consideration.
> >>>
> >>> Patch 2 implements functional autotest to validates the function
> >>> whether to work in a right way.
> >>>
> >>> Patch 3 implements performance autotest separately in cache and
> memory.
> >>>
> >>> Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring
> >>> 3%~4% performance improvements on IA platform from virtio/vhost
> >>> non-mergeable loopback testing.
> >>>
> >>> Zhiyong Yang (4):
> >>>   eal/common: introduce rte_memset on IA platform
> >>>   app/test: add functional autotest for rte_memset
> >>>   app/test: add performance autotest for rte_memset
> >>>   lib/librte_vhost: improve vhost perf using rte_memset
> >>>
> >>>  app/test/Makefile                                  |   3 +
> >>>  app/test/test_memset.c                             | 158 +++++++++
> >>>  app/test/test_memset_perf.c                        | 348
> +++++++++++++++++++
> >>>  doc/guides/rel_notes/release_17_02.rst             |  11 +
> >>>  .../common/include/arch/x86/rte_memset.h           | 376
> >> +++++++++++++++++++++
> >>>  lib/librte_eal/common/include/generic/rte_memset.h |  51 +++
> >>>  lib/librte_vhost/virtio_net.c                      |  18 +-
> >>>  7 files changed, 958 insertions(+), 7 deletions(-)  create mode
> >>> 100644 app/test/test_memset.c  create mode 100644
> >>> app/test/test_memset_perf.c  create mode 100644
> >>> lib/librte_eal/common/include/arch/x86/rte_memset.h
> >>>  create mode 100644
> >> lib/librte_eal/common/include/generic/rte_memset.h
> >>>
> >>
> >> Thanks for the series, idea looks good to me.
> >>
> >> Wouldn't be worth to also use rte_memset in Virtio PMD (not
> >> compiled/tested)? :
> >>
> >
> > I think  rte_memset  maybe can bring some benefit here,  but , I'm not
> > clear how to enter the branch and test it. :)
> 
> Indeed, you will need Pierre's patch:
> [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set
> 
> Thanks,
> Maxime
> >
Thank you Maxime.
I can see a small, but not obvious, performance improvement here.
As you know, memset(hdr, 0, head_size); consumes only a few cycles in the virtio pmd,
since head_size is only 10 or 12 bytes.
I have optimized rte_memset perf further for N=8~15 bytes.
The main purpose of introducing rte_memset, I think, is that we can use it
to avoid the perf drop issue seen with glibc memset on some platforms.

> >
> >> diff --git a/drivers/net/virtio/virtio_rxtx.c
> >> b/drivers/net/virtio/virtio_rxtx.c
> >> index 22d97a4..a5f70c4 100644
> >> --- a/drivers/net/virtio/virtio_rxtx.c
> >> +++ b/drivers/net/virtio/virtio_rxtx.c
> >> @@ -287,7 +287,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq,
> >> struct rte_mbuf *cookie,
> >>                          rte_pktmbuf_prepend(cookie, head_size);
> >>                  /* if offload disabled, it is not zeroed below, do it now */
> >>                  if (offload == 0)
> >> -                       memset(hdr, 0, head_size);
> >> +                       rte_memset(hdr, 0, head_size);
> >>          } else if (use_indirect) {
> >>                  /* setup tx ring slot to point to indirect
> >>                   * descriptor list stored in reserved region.
> >>
> >> Cheers,
> >> Maxime

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/4] eal/common: introduce rte_memset and related test
  2016-12-07  9:28       ` Yang, Zhiyong
@ 2016-12-07  9:37         ` Yuanhan Liu
  2016-12-07  9:43           ` Yang, Zhiyong
  0 siblings, 1 reply; 44+ messages in thread
From: Yuanhan Liu @ 2016-12-07  9:37 UTC (permalink / raw)
  To: Yang, Zhiyong
  Cc: Maxime Coquelin, dev, Richardson, Bruce, Ananyev, Konstantin,
	Pierre Pfister (ppfister)

On Wed, Dec 07, 2016 at 09:28:17AM +0000, Yang, Zhiyong wrote:
> > >> Wouldn't be worth to also use rte_memset in Virtio PMD (not
> > >> compiled/tested)? :
> > >>
> > >
> > > I think  rte_memset  maybe can bring some benefit here,  but , I'm not
> > > clear how to enter the branch and test it. :)
> > 
> > Indeed, you will need Pierre's patch:
> > [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set

I will apply it shortly.

> > Thanks,
> > Maxime
> > >
> Thank you Maxime.
> I can see a little, but not obviously  performance improvement here.  

Are you sure you have run into that code piece? FYI, you have to enable
virtio 1.0 explicitly, which is disabled by default.

> You know, memset(hdr, 0, head_size); only consumes  fewer cycles for virtio pmd. 
> head_size only  10 or 12 bytes.
> I optimize rte_memset perf further for N=8~15 bytes.
> The main purpose of Introducing rte_memset is that we can use it
> to avoid perf drop issue instead of glibc memset on some platform, I think. 

For this case (as well as the 4th patch), it's more about making sure
rte_memset is inlined.

	--yliu


* Re: [PATCH 0/4] eal/common: introduce rte_memset and related test
  2016-12-07  9:37         ` Yuanhan Liu
@ 2016-12-07  9:43           ` Yang, Zhiyong
  2016-12-07  9:48             ` Yuanhan Liu
  0 siblings, 1 reply; 44+ messages in thread
From: Yang, Zhiyong @ 2016-12-07  9:43 UTC (permalink / raw)
  To: Yuanhan Liu
  Cc: Maxime Coquelin, dev, Richardson, Bruce, Ananyev, Konstantin,
	Pierre Pfister (ppfister)

Hi, Yuanhan:

> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Wednesday, December 7, 2016 5:38 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org;
> Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Pierre Pfister (ppfister)
> <ppfister@cisco.com>
> Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset
> and related test
> 
> On Wed, Dec 07, 2016 at 09:28:17AM +0000, Yang, Zhiyong wrote:
> > > >> Wouldn't be worth to also use rte_memset in Virtio PMD (not
> > > >> compiled/tested)? :
> > > >>
> > > >
> > > > I think  rte_memset  maybe can bring some benefit here,  but , I'm
> > > > not clear how to enter the branch and test it. :)
> > >
> > > Indeed, you will need Pierre's patch:
> > > [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set
> 
> I will apply it shortly.
> 
> > > Thanks,
> > > Maxime
> > > >
> > Thank you Maxime.
> > I can see a little, but not obviously  performance improvement here.
> 
> Are you you have run into that code piece? FYI, you have to enable virtio 1.0
> explicitly, which is disabled by deafault.

Yes. I used the patch from Pierre and set offload = 0.
Thanks
Zhiyong

> 
> > You know, memset(hdr, 0, head_size); only consumes  fewer cycles for
> virtio pmd.
> > head_size only  10 or 12 bytes.
> > I optimize rte_memset perf further for N=8~15 bytes.
> > The main purpose of Introducing rte_memset is that we can use it to
> > avoid perf drop issue instead of glibc memset on some platform, I think.
> 
> For this case (as well as the 4th patch), it's more about making sure
> rte_memset is inlined.
> 
> 	--yliu


* Re: [PATCH 0/4] eal/common: introduce rte_memset and related test
  2016-12-07  9:43           ` Yang, Zhiyong
@ 2016-12-07  9:48             ` Yuanhan Liu
  0 siblings, 0 replies; 44+ messages in thread
From: Yuanhan Liu @ 2016-12-07  9:48 UTC (permalink / raw)
  To: Yang, Zhiyong
  Cc: Maxime Coquelin, dev, Richardson, Bruce, Ananyev, Konstantin,
	Pierre Pfister (ppfister)

On Wed, Dec 07, 2016 at 09:43:06AM +0000, Yang, Zhiyong wrote:
> > On Wed, Dec 07, 2016 at 09:28:17AM +0000, Yang, Zhiyong wrote:
> > > > >> Wouldn't be worth to also use rte_memset in Virtio PMD (not
> > > > >> compiled/tested)? :
> > > > >>
> > > > >
> > > > > I think  rte_memset  maybe can bring some benefit here,  but , I'm
> > > > > not clear how to enter the branch and test it. :)
> > > >
> > > > Indeed, you will need Pierre's patch:
> > > > [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set
> > 
> > I will apply it shortly.
> > 
> > > > Thanks,
> > > > Maxime
> > > > >
> > > Thank you Maxime.
> > > I can see a little, but not obviously  performance improvement here.
> > 
> > Are you you have run into that code piece? FYI, you have to enable virtio 1.0
> > explicitly, which is disabled by deafault.
> 
> Yes. I use the patch from Pierre and set offload  = 0 ; 

I meant virtio 1.0. Have you added following options for the QEMU virtio-net
device?

    disable-modern=false
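
For reference, that option goes on the QEMU virtio-net device. A hedged sketch of
a vhost-user invocation (socket path, netdev names, and memory sizes are
hypothetical; only disable-modern=false is the point here):

```
qemu-system-x86_64 -m 1G \
    -object memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages,share=on \
    -numa node,memdev=mem \
    -chardev socket,id=char0,path=/tmp/vhost-user.sock \
    -netdev vhost-user,id=net0,chardev=char0 \
    -device virtio-net-pci,netdev=net0,disable-modern=false
```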

	--yliu


* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-02 10:25   ` Thomas Monjalon
@ 2016-12-08  7:41     ` Yang, Zhiyong
  2016-12-08  9:26       ` Ananyev, Konstantin
  2016-12-08 15:09       ` Thomas Monjalon
  0 siblings, 2 replies; 44+ messages in thread
From: Yang, Zhiyong @ 2016-12-08  7:41 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin,
	De Lara Guarch, Pablo

Hi, Thomas:
	Sorry for the late reply. I have been considering your suggestion all this time.

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Friday, December 2, 2016 6:25 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> 2016-12-05 16:26, Zhiyong Yang:
> > +#ifndef _RTE_MEMSET_X86_64_H_
> 
> Is this implementation specific to 64-bit?
> 

Yes.

> > +
> > +#define rte_memset memset
> > +
> > +#else
> > +
> > +static void *
> > +rte_memset(void *dst, int a, size_t n);
> > +
> > +#endif
> 
> If I understand well, rte_memset (as rte_memcpy) is using the most recent
> instructions available (and enabled) when compiling.
> It is not adapting the instructions to the run-time CPU.
> There is no need to downgrade at run-time the instruction set as it is
> obviously not a supported case, but it would be nice to be able to upgrade a
> "default compilation" at run-time as it is done in rte_acl.
> I explain this case more clearly for reference:
> 
> We can have AVX512 supported in the compiler but disable it when compiling
> (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost
> everywhere.
> When running this binary on a CPU having AVX512 support, it will not benefit
> of the AVX512 improvement.
> Though, we can compile an AVX512 version of some functions and use them
> only if the running CPU is capable.
> This kind of miracle can be achieved in two ways:
> 
> 1/ For generic C code compiled with a recent GCC, a function can be built for
> several CPUs thanks to the attribute target_clones.
> 
> 2/ For manually optimized functions using CPU-specific intrinsics or asm, it is
> possible to build them with non-default flags thanks to the attribute target.
> 
> 3/ For manually optimized files using CPU-specific intrinsics or asm, we use
> specifics flags in the makefile.
> 
> The function clone in case 1/ is dynamically chosen at run-time through ifunc
> resolver.
> The specific functions in cases 2/ and 3/ must chosen at run-time by
> initializing a function pointer thanks to rte_cpu_get_flag_enabled().
> 
> Note that rte_hash and software crypto PMDs have a run-time check with
> rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> Next step for these libraries?
> 
> Back to rte_memset, I think you should try the solution 2/.

I have read the ACL code. If I understand well, for a complex algo implementation
it is a good idea, but choosing functions at run time brings some overhead. For a frequently called function
which consumes few cycles, that overhead may exceed the gains the optimization brings.
For example, in most DPDK applications, memset only sets N = 10 or 12 bytes, which consumes few cycles.

Thanks
Zhiyong
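
For reference, Thomas's option 1/ above can be sketched as follows. This is
illustrative only, not part of the patchset; the function name is hypothetical
and the body is a naive byte loop standing in for the real implementation:

```c
#include <stddef.h>

/* GCC emits one clone of the function per listed target plus an ifunc
 * resolver, which picks the best clone for the running CPU at load time. */
__attribute__((target_clones("avx2", "default")))
void *demo_memset(void *dst, int c, size_t n)
{
	unsigned char *p = dst;

	/* naive byte-at-a-time fill; the clones differ only in codegen */
	while (n--)
		*p++ = (unsigned char)c;
	return dst;
}
```

Note this relies on GCC's function multiversioning (glibc ifunc support on
x86), which matches option 1/; options 2/ and 3/ instead select a function
pointer at startup via rte_cpu_get_flag_enabled().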


* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-08  7:41     ` Yang, Zhiyong
@ 2016-12-08  9:26       ` Ananyev, Konstantin
  2016-12-08  9:53         ` Yang, Zhiyong
  2016-12-08 15:09       ` Thomas Monjalon
  1 sibling, 1 reply; 44+ messages in thread
From: Ananyev, Konstantin @ 2016-12-08  9:26 UTC (permalink / raw)
  To: Yang, Zhiyong, Thomas Monjalon
  Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo


Hi Zhiyong,

> 
> HI, Thomas:
> 	Sorry for late reply. I have been being always considering your suggestion.
> 
> > -----Original Message-----
> > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > Sent: Friday, December 2, 2016 6:25 PM
> > To: Yang, Zhiyong <zhiyong.yang@intel.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> > IA platform
> >
> > 2016-12-05 16:26, Zhiyong Yang:
> > > +#ifndef _RTE_MEMSET_X86_64_H_
> >
> > Is this implementation specific to 64-bit?
> >
> 
> Yes.
> 
> > > +
> > > +#define rte_memset memset
> > > +
> > > +#else
> > > +
> > > +static void *
> > > +rte_memset(void *dst, int a, size_t n);
> > > +
> > > +#endif
> >
> > If I understand well, rte_memset (as rte_memcpy) is using the most recent
> > instructions available (and enabled) when compiling.
> > It is not adapting the instructions to the run-time CPU.
> > There is no need to downgrade at run-time the instruction set as it is
> > obviously not a supported case, but it would be nice to be able to upgrade a
> > "default compilation" at run-time as it is done in rte_acl.
> > I explain this case more clearly for reference:
> >
> > We can have AVX512 supported in the compiler but disable it when compiling
> > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost
> > everywhere.
> > When running this binary on a CPU having AVX512 support, it will not benefit
> > of the AVX512 improvement.
> > Though, we can compile an AVX512 version of some functions and use them
> > only if the running CPU is capable.
> > This kind of miracle can be achieved in two ways:
> >
> > 1/ For generic C code compiled with a recent GCC, a function can be built for
> > several CPUs thanks to the attribute target_clones.
> >
> > 2/ For manually optimized functions using CPU-specific intrinsics or asm, it is
> > possible to build them with non-default flags thanks to the attribute target.
> >
> > 3/ For manually optimized files using CPU-specific intrinsics or asm, we use
> > specifics flags in the makefile.
> >
> > The function clone in case 1/ is dynamically chosen at run-time through ifunc
> > resolver.
> > The specific functions in cases 2/ and 3/ must chosen at run-time by
> > initializing a function pointer thanks to rte_cpu_get_flag_enabled().
> >
> > Note that rte_hash and software crypto PMDs have a run-time check with
> > rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> > Next step for these libraries?
> >
> > Back to rte_memset, I think you should try the solution 2/.
> 
> I have read the ACL code, if I understand well , for complex algo implementation,
> it is good idea, but Choosing functions at run time will bring some overhead. For frequently  called function
> Which consumes small cycles, the overhead maybe is more than  the gains optimizations brings
> For example, for most applications in dpdk, memset only set N = 10 or 12bytes. It consumes fewer cycles.

But then what is the point of having an rte_memset() using vector instructions at all?
From what you are saying, the most common case is even less than the SSE register size.
Konstantin

> 
> Thanks
> Zhiyong


* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-08  9:26       ` Ananyev, Konstantin
@ 2016-12-08  9:53         ` Yang, Zhiyong
  2016-12-08 10:27           ` Bruce Richardson
  2016-12-08 10:30           ` Ananyev, Konstantin
  0 siblings, 2 replies; 44+ messages in thread
From: Yang, Zhiyong @ 2016-12-08  9:53 UTC (permalink / raw)
  To: Ananyev, Konstantin, Thomas Monjalon
  Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo

Hi, Konstantin:

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Thursday, December 8, 2016 5:26 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> 
> Hi Zhiyong,
> 
> >
> > HI, Thomas:
> > 	Sorry for late reply. I have been being always considering your
> suggestion.
> >
> > > -----Original Message-----
> > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > Sent: Friday, December 2, 2016 6:25 PM
> > > To: Yang, Zhiyong <zhiyong.yang@intel.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> rte_memset
> > > on IA platform
> > >
> > > 2016-12-05 16:26, Zhiyong Yang:
> > > > +#ifndef _RTE_MEMSET_X86_64_H_
> > >
> > > Is this implementation specific to 64-bit?
> > >
> >
> > Yes.
> >
> > > > +
> > > > +#define rte_memset memset
> > > > +
> > > > +#else
> > > > +
> > > > +static void *
> > > > +rte_memset(void *dst, int a, size_t n);
> > > > +
> > > > +#endif
> > >
> > > If I understand well, rte_memset (as rte_memcpy) is using the most
> > > recent instructions available (and enabled) when compiling.
> > > It is not adapting the instructions to the run-time CPU.
> > > There is no need to downgrade at run-time the instruction set as it
> > > is obviously not a supported case, but it would be nice to be able
> > > to upgrade a "default compilation" at run-time as it is done in rte_acl.
> > > I explain this case more clearly for reference:
> > >
> > > We can have AVX512 supported in the compiler but disable it when
> > > compiling
> > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost
> > > everywhere.
> > > When running this binary on a CPU having AVX512 support, it will not
> > > benefit of the AVX512 improvement.
> > > Though, we can compile an AVX512 version of some functions and use
> > > them only if the running CPU is capable.
> > > This kind of miracle can be achieved in two ways:
> > >
> > > 1/ For generic C code compiled with a recent GCC, a function can be
> > > built for several CPUs thanks to the attribute target_clones.
> > >
> > > 2/ For manually optimized functions using CPU-specific intrinsics or
> > > asm, it is possible to build them with non-default flags thanks to the
> attribute target.
> > >
> > > 3/ For manually optimized files using CPU-specific intrinsics or
> > > asm, we use specifics flags in the makefile.
> > >
> > > The function clone in case 1/ is dynamically chosen at run-time
> > > through ifunc resolver.
> > > The specific functions in cases 2/ and 3/ must chosen at run-time by
> > > initializing a function pointer thanks to rte_cpu_get_flag_enabled().
> > >
> > > Note that rte_hash and software crypto PMDs have a run-time check
> > > with
> > > rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> > > Next step for these libraries?
> > >
> > > Back to rte_memset, I think you should try the solution 2/.
> >
> > I have read the ACL code, if I understand well , for complex algo
> > implementation, it is good idea, but Choosing functions at run time
> > will bring some overhead. For frequently  called function Which
> > consumes small cycles, the overhead maybe is more than  the gains
> optimizations brings For example, for most applications in dpdk, memset only
> set N = 10 or 12bytes. It consumes fewer cycles.
> 
> But then what the point to have an rte_memset() using vector instructions at
> all?
> From what you are saying the most common case is even less then SSE
> register size.
> Konstantin

In most cases, memset is used like memset(address, 0, sizeof(struct xxx));
The use case here is small by accident; I only gave it as an example.
But rte_memset is introduced with the generic case in mind:
sizeof(struct xxx) is not limited to very small sizes, such as less than the SSE register size.
I just want to say that the size in the most common use cases is not very large, so the cycles consumed
are not large. Choosing the function at run time is not suitable once that overhead is considered.

thanks
Zhiyong


* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-08  9:53         ` Yang, Zhiyong
@ 2016-12-08 10:27           ` Bruce Richardson
  2016-12-08 10:30           ` Ananyev, Konstantin
  1 sibling, 0 replies; 44+ messages in thread
From: Bruce Richardson @ 2016-12-08 10:27 UTC (permalink / raw)
  To: Yang, Zhiyong
  Cc: Ananyev, Konstantin, Thomas Monjalon, dev, yuanhan.liu,
	De Lara Guarch, Pablo

On Thu, Dec 08, 2016 at 09:53:12AM +0000, Yang, Zhiyong wrote:
> Hi, Konstantin:
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Thursday, December 8, 2016 5:26 PM
> > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> > IA platform
> > 
> > 
> > Hi Zhiyong,
> > 
> > >
> > > HI, Thomas:
> > > 	Sorry for late reply. I have been being always considering your
> > suggestion.
> > >
> > > > -----Original Message-----
> > > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > > Sent: Friday, December 2, 2016 6:25 PM
> > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>
> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.guarch@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > rte_memset
> > > > on IA platform
> > > >
> > > > 2016-12-05 16:26, Zhiyong Yang:
> > > > > +#ifndef _RTE_MEMSET_X86_64_H_
> > > >
> > > > Is this implementation specific to 64-bit?
> > > >
> > >
> > > Yes.
> > >
> > > > > +
> > > > > +#define rte_memset memset
> > > > > +
> > > > > +#else
> > > > > +
> > > > > +static void *
> > > > > +rte_memset(void *dst, int a, size_t n);
> > > > > +
> > > > > +#endif
> > > >
> > > > If I understand well, rte_memset (as rte_memcpy) is using the most
> > > > recent instructions available (and enabled) when compiling.
> > > > It is not adapting the instructions to the run-time CPU.
> > > > There is no need to downgrade at run-time the instruction set as it
> > > > is obviously not a supported case, but it would be nice to be able
> > > > to upgrade a "default compilation" at run-time as it is done in rte_acl.
> > > > I explain this case more clearly for reference:
> > > >
> > > > We can have AVX512 supported in the compiler but disable it when
> > > > compiling
> > > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost
> > > > everywhere.
> > > > When running this binary on a CPU having AVX512 support, it will not
> > > > benefit of the AVX512 improvement.
> > > > Though, we can compile an AVX512 version of some functions and use
> > > > them only if the running CPU is capable.
> > > > This kind of miracle can be achieved in two ways:
> > > >
> > > > 1/ For generic C code compiled with a recent GCC, a function can be
> > > > built for several CPUs thanks to the attribute target_clones.
> > > >
> > > > 2/ For manually optimized functions using CPU-specific intrinsics or
> > > > asm, it is possible to build them with non-default flags thanks to the
> > attribute target.
> > > >
> > > > 3/ For manually optimized files using CPU-specific intrinsics or
> > > > asm, we use specifics flags in the makefile.
> > > >
> > > > The function clone in case 1/ is dynamically chosen at run-time
> > > > through ifunc resolver.
> > > > The specific functions in cases 2/ and 3/ must chosen at run-time by
> > > > initializing a function pointer thanks to rte_cpu_get_flag_enabled().
> > > >
> > > > Note that rte_hash and software crypto PMDs have a run-time check
> > > > with
> > > > rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> > > > Next step for these libraries?
> > > >
> > > > Back to rte_memset, I think you should try the solution 2/.
> > >
> > > I have read the ACL code, if I understand well , for complex algo
> > > implementation, it is good idea, but Choosing functions at run time
> > > will bring some overhead. For frequently  called function Which
> > > consumes small cycles, the overhead maybe is more than  the gains
> > optimizations brings For example, for most applications in dpdk, memset only
> > set N = 10 or 12bytes. It consumes fewer cycles.
> > 
> > But then what the point to have an rte_memset() using vector instructions at
> > all?
> > From what you are saying the most common case is even less then SSE
> > register size.
> > Konstantin
> 
> For most cases, memset is used such as memset(address, 0, sizeof(struct xxx)); 
> The use case here is small by accident, I only give an example here. 
> but rte_memset is introduced to need consider generic case. 
> sizeof(struct xxx) is not limited to very small size, such as  less than SSE register size.
> I just want to say that the size for the most use case is not very large,  So cycles consumed
> Is not large. It is not suited to choose function at run-time since overhead  is considered.
> 
For small fills with sizes specified at compile time, do compilers not
fully inline the memset call with a fixed-size equivalent? I believe
some compilers used to do so with memcpy - which is why we had a macro
for it in DPDK, so that compile-time constant copies would use regular
memcpy. If that is also the case for memset, then we should perhaps
specify that rte_memset is only for relatively large fills, e.g. >64
bytes. In that case, run-time detection may be worthwhile.

/Bruce
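
The macro trick Bruce mentions can be sketched with __builtin_constant_p: route
compile-time-constant sizes to the compiler builtin (which can be fully
inlined) and everything else to the hand-written version. Names here are
hypothetical, and plain memset stands in for the optimized implementation:

```c
#include <string.h>

/* stand-in for a hand-optimized runtime implementation */
static void *my_memset_runtime(void *dst, int c, size_t n)
{
	return memset(dst, c, n);
}

/* Constant n: let the compiler expand memset inline to a fixed-size fill.
 * Variable n: call the runtime implementation. */
#define my_memset(dst, c, n)                                   \
	(__builtin_constant_p(n) ? memset((dst), (c), (n))     \
				 : my_memset_runtime((dst), (c), (n)))
```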


* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-08  9:53         ` Yang, Zhiyong
  2016-12-08 10:27           ` Bruce Richardson
@ 2016-12-08 10:30           ` Ananyev, Konstantin
  2016-12-11 12:32             ` Yang, Zhiyong
  1 sibling, 1 reply; 44+ messages in thread
From: Ananyev, Konstantin @ 2016-12-08 10:30 UTC (permalink / raw)
  To: Yang, Zhiyong, Thomas Monjalon
  Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo



> -----Original Message-----
> From: Yang, Zhiyong
> Sent: Thursday, December 8, 2016 9:53 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas Monjalon <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform
> 
> Hi, Konstantin:
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Thursday, December 8, 2016 5:26 PM
> > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> > IA platform
> >
> >
> > Hi Zhiyong,
> >
> > >
> > > HI, Thomas:
> > > 	Sorry for late reply. I have been being always considering your
> > suggestion.
> > >
> > > > -----Original Message-----
> > > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > > Sent: Friday, December 2, 2016 6:25 PM
> > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>
> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.guarch@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > rte_memset
> > > > on IA platform
> > > >
> > > > 2016-12-05 16:26, Zhiyong Yang:
> > > > > +#ifndef _RTE_MEMSET_X86_64_H_
> > > >
> > > > Is this implementation specific to 64-bit?
> > > >
> > >
> > > Yes.
> > >
> > > > > +
> > > > > +#define rte_memset memset
> > > > > +
> > > > > +#else
> > > > > +
> > > > > +static void *
> > > > > +rte_memset(void *dst, int a, size_t n);
> > > > > +
> > > > > +#endif
> > > >
> > > > If I understand well, rte_memset (as rte_memcpy) is using the most
> > > > recent instructions available (and enabled) when compiling.
> > > > It is not adapting the instructions to the run-time CPU.
> > > > There is no need to downgrade at run-time the instruction set as it
> > > > is obviously not a supported case, but it would be nice to be able
> > > > to upgrade a "default compilation" at run-time as it is done in rte_acl.
> > > > I explain this case more clearly for reference:
> > > >
> > > > We can have AVX512 supported in the compiler but disable it when
> > > > compiling
> > > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost
> > > > everywhere.
> > > > When running this binary on a CPU having AVX512 support, it will not
> > > > benefit of the AVX512 improvement.
> > > > Though, we can compile an AVX512 version of some functions and use
> > > > them only if the running CPU is capable.
> > > > This kind of miracle can be achieved in two ways:
> > > >
> > > > 1/ For generic C code compiled with a recent GCC, a function can be
> > > > built for several CPUs thanks to the attribute target_clones.
> > > >
> > > > 2/ For manually optimized functions using CPU-specific intrinsics or
> > > > asm, it is possible to build them with non-default flags thanks to the
> > attribute target.
> > > >
> > > > 3/ For manually optimized files using CPU-specific intrinsics or
> > > > asm, we use specifics flags in the makefile.
> > > >
> > > > The function clone in case 1/ is dynamically chosen at run-time
> > > > through ifunc resolver.
> > > > The specific functions in cases 2/ and 3/ must chosen at run-time by
> > > > initializing a function pointer thanks to rte_cpu_get_flag_enabled().
> > > >
> > > > Note that rte_hash and software crypto PMDs have a run-time check
> > > > with
> > > > rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> > > > Next step for these libraries?
> > > >
> > > > Back to rte_memset, I think you should try the solution 2/.
> > >
> > > I have read the ACL code, if I understand well , for complex algo
> > > implementation, it is good idea, but Choosing functions at run time
> > > will bring some overhead. For frequently  called function Which
> > > consumes small cycles, the overhead maybe is more than  the gains
> > optimizations brings For example, for most applications in dpdk, memset only
> > set N = 10 or 12bytes. It consumes fewer cycles.
> >
> > But then what the point to have an rte_memset() using vector instructions at
> > all?
> > From what you are saying the most common case is even less then SSE
> > register size.
> > Konstantin
> 
> For most cases, memset is used such as memset(address, 0, sizeof(struct xxx));

Ok then I suppose for such cases you don't need any special function and memset()
would still be the best choice, right?

> The use case here is small by accident, I only give an example here.
> but rte_memset is introduced to need consider generic case.

We can have rte_memset_huge() or so instead, and document that
it should be used for sizes greater than some cutoff point.
Inside it you can just call a function pointer installed at startup (same as rte_acl_classify() does).
For big sizes, I suppose the price of an extra function pointer call would not affect performance much.
For sizes smaller than this cutoff point you can still use either rte_memset_scalar() or just normal rte_memset().
Something like that:

extern void *(*__rte_memset_vector)( (void *s, int c, size_t n);

static inline void*
rte_memset_huge(void *s, int c, size_t n)
{
   return __rte_memset_vector(s, c, n);
}

static inline void *
rte_memset(void *s, int c, size_t n)
{
	if (n < XXX)
		return rte_memset_scalar(s, c, n);
	else
		return rte_memset_huge(s, c, n);
}

XXX could be either a define, or could also be a variable, so it can be set up at startup,
depending on the architecture.

Would that work?
Konstantin

> sizeof(struct xxx) is not limited to very small sizes, such as less than the
> SSE register size. I just want to say that the size in most use cases is not
> very large, so the cycles consumed are not large either. Choosing the function
> at run time is not well suited here once the overhead is considered.
> 
> thanks
> Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-08  7:41     ` Yang, Zhiyong
  2016-12-08  9:26       ` Ananyev, Konstantin
@ 2016-12-08 15:09       ` Thomas Monjalon
  2016-12-11 12:04         ` Yang, Zhiyong
  1 sibling, 1 reply; 44+ messages in thread
From: Thomas Monjalon @ 2016-12-08 15:09 UTC (permalink / raw)
  To: Yang, Zhiyong
  Cc: dev, yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin,
	De Lara Guarch, Pablo

2016-12-08 07:41, Yang, Zhiyong:
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > 2016-12-05 16:26, Zhiyong Yang:
> > > +#ifndef _RTE_MEMSET_X86_64_H_
> > 
> > Is this implementation specific to 64-bit?
> > 
> 
> Yes.

So should we rename this file?
rte_memset.h -> rte_memset_64.h

You also need to create an rte_memset.h file for each arch.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-08 15:09       ` Thomas Monjalon
@ 2016-12-11 12:04         ` Yang, Zhiyong
  0 siblings, 0 replies; 44+ messages in thread
From: Yang, Zhiyong @ 2016-12-11 12:04 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, yuanhan.liu, Richardson, Bruce, Ananyev, Konstantin,
	De Lara Guarch, Pablo

Hi, Thomas:

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Thursday, December 8, 2016 11:10 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> 2016-12-08 07:41, Yang, Zhiyong:
> > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > 2016-12-05 16:26, Zhiyong Yang:
> > > > +#ifndef _RTE_MEMSET_X86_64_H_
> > >
> > > Is this implementation specific to 64-bit?
> > >
> >
> > Yes.
> 
> So should we rename this file?
> rte_memset.h -> rte_memset_64.h
> 
> You need also to create a file rte_memset.h for each arch.

Ok

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-08 10:30           ` Ananyev, Konstantin
@ 2016-12-11 12:32             ` Yang, Zhiyong
  2016-12-15  6:51               ` Yang, Zhiyong
  0 siblings, 1 reply; 44+ messages in thread
From: Yang, Zhiyong @ 2016-12-11 12:32 UTC (permalink / raw)
  To: Ananyev, Konstantin, Thomas Monjalon
  Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo

Hi, Konstantin, Bruce:

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Thursday, December 8, 2016 6:31 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> 
> 
> > -----Original Message-----
> > From: Yang, Zhiyong
> > Sent: Thursday, December 8, 2016 9:53 AM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > Monjalon <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > on IA platform
> >
> > Hi, Konstantin:
> >
> > > -----Original Message-----
> > > From: Ananyev, Konstantin
> > > Sent: Thursday, December 8, 2016 5:26 PM
> > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > > <thomas.monjalon@6wind.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > on IA platform
> > >
> > >
> > > Hi Zhiyong,
> > >
> > > >
> > > > HI, Thomas:
> > > > 	Sorry for the late reply. I have been considering your
> > > > suggestion.
> > > >
> > > > > -----Original Message-----
> > > > > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > > > Sent: Friday, December 2, 2016 6:25 PM
> > > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>
> > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > > > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> > > > > <pablo.de.lara.guarch@intel.com>
> > > > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > rte_memset
> > > > > on IA platform
> > > > >
> > > > > 2016-12-05 16:26, Zhiyong Yang:
> > > > > > +#ifndef _RTE_MEMSET_X86_64_H_
> > > > >
> > > > > Is this implementation specific to 64-bit?
> > > > >
> > > >
> > > > Yes.
> > > >
> > > > > > +
> > > > > > +#define rte_memset memset
> > > > > > +
> > > > > > +#else
> > > > > > +
> > > > > > +static void *
> > > > > > +rte_memset(void *dst, int a, size_t n);
> > > > > > +
> > > > > > +#endif
> > > > >
> > > > > If I understand well, rte_memset (as rte_memcpy) is using the
> > > > > most recent instructions available (and enabled) when compiling.
> > > > > It is not adapting the instructions to the run-time CPU.
> > > > > There is no need to downgrade at run-time the instruction set as
> > > > > it is obviously not a supported case, but it would be nice to be
> > > > > able to upgrade a "default compilation" at run-time as it is done in
> rte_acl.
> > > > > I explain this case more clearly for reference:
> > > > >
> > > > > We can have AVX512 supported in the compiler but disable it when
> > > > > compiling
> > > > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running
> > > > > almost everywhere.
> > > > > When running this binary on a CPU having AVX512 support, it will
> > > > > not benefit of the AVX512 improvement.
> > > > > Though, we can compile an AVX512 version of some functions and
> > > > > use them only if the running CPU is capable.
> > > > > This kind of miracle can be achieved in two ways:
> > > > >
> > > > > 1/ For generic C code compiled with a recent GCC, a function can
> > > > > be built for several CPUs thanks to the attribute target_clones.
> > > > >
> > > > > 2/ For manually optimized functions using CPU-specific
> > > > > intrinsics or asm, it is possible to build them with non-default
> > > > > flags thanks to the
> > > attribute target.
> > > > >
> > > > > 3/ For manually optimized files using CPU-specific intrinsics or
> > > > > asm, we use specifics flags in the makefile.
> > > > >
> > > > > The function clone in case 1/ is dynamically chosen at run-time
> > > > > through ifunc resolver.
> > > > > The specific functions in cases 2/ and 3/ must chosen at
> > > > > run-time by initializing a function pointer thanks to
> rte_cpu_get_flag_enabled().
> > > > >
> > > > > Note that rte_hash and software crypto PMDs have a run-time
> > > > > check with
> > > > > rte_cpu_get_flag_enabled() but do not override CFLAGS in the
> Makefile.
> > > > > Next step for these libraries?
> > > > >
> > > > > Back to rte_memset, I think you should try the solution 2/.
> > > >
> > > > I have read the ACL code, if I understand well , for complex algo
> > > > implementation, it is good idea, but Choosing functions at run
> > > > time will bring some overhead. For frequently  called function
> > > > Which consumes small cycles, the overhead maybe is more than  the
> > > > gains
> > > optimizations brings For example, for most applications in dpdk,
> > > memset only set N = 10 or 12bytes. It consumes fewer cycles.
> > >
> > > But then what is the point of having an rte_memset() using vector
> > > instructions at all?
> > > From what you are saying, the most common case is even less than the
> > > SSE register size.
> > > Konstantin
> >
> > For most cases, memset is used such as memset(address, 0,
> > sizeof(struct xxx));
> 
> Ok then I suppose for such cases you don't need any special function and
> memset() would still be the best choice, right?
> 

In fact, the bad performance drop was found on IVB; please refer to
http://dpdk.org/ml/archives/dev/2016-October/048628.html
The following code causes the perf issue:
memset((void *)(uintptr_t)&(virtio_hdr->hdr), 0, dev->vhost_hlen);
vhost_hlen is 10 or 12 bytes, so glibc memset is not a good fit here.

> > The use case here is small by accident, I only give an example here.
> > but rte_memset is introduced to need consider generic case.
> 
> We can have rte_memset_huge() or so instead, and document that it should
> be used for sizes greater than some cutoff point.
> Inside it you can just call a function pointer installed at startup (same as
> rte_acl_classify() does).
> For big sizes, I suppose the price of extra function pointer call would not
> affect performance much.
> For sizes smaller then this cutoff point you still can use either
> rte_memset_scalar() or just normal rte_memset().
> Something like that:
> 
> extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> 
> static inline void*
> rte_memset_huge(void *s, int c, size_t n) {
>    return __rte_memset_vector(s, c, n);
> }
> 
> static inline void *
> rte_memset(void *s, int c, size_t n)
> {
> 	if (n < XXX)
> 		return rte_memset_scalar(s, c, n);
> 	else
> 		return rte_memset_huge(s, c, n);
> }
> 
> XXX could be either a define, or could also be a variable, so it can be set up
> at startup, depending on the architecture.
> 
> Would that work?
> Konstantin
> 
The idea sounds good; it may be more feasible for both rte_memcpy and rte_memset.
If I understand well, the idea from Bruce is similar, right?

> > sizeof(struct xxx) is not limited to very small sizes, such as less than the
> > SSE register size. I just want to say that the size in most use cases is not
> > very large, so the cycles consumed are not large either. Choosing the
> > function at run time is not well suited here once the overhead is considered.
> >
> > thanks
> > Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-11 12:32             ` Yang, Zhiyong
@ 2016-12-15  6:51               ` Yang, Zhiyong
  2016-12-15 10:12                 ` Bruce Richardson
  2016-12-15 10:53                 ` Ananyev, Konstantin
  0 siblings, 2 replies; 44+ messages in thread
From: Yang, Zhiyong @ 2016-12-15  6:51 UTC (permalink / raw)
  To: Yang, Zhiyong, Ananyev, Konstantin, Thomas Monjalon
  Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo

Hi, Thomas, Konstantin:

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong
> Sent: Sunday, December 11, 2016 8:33 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> Monjalon <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> Hi, Konstantin, Bruce:
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Thursday, December 8, 2016 6:31 PM
> > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > on IA platform
> >
> >
> >
> > > -----Original Message-----
> > > From: Yang, Zhiyong
> > > Sent: Thursday, December 8, 2016 9:53 AM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > Monjalon <thomas.monjalon@6wind.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > on IA platform
> > >
> > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> >
> > static inline void*
> > rte_memset_huge(void *s, int c, size_t n) {
> >    return __rte_memset_vector(s, c, n); }
> >
> > static inline void *
> > rte_memset(void *s, int c, size_t n)
> > {
> > 	if (n < XXX)
> > 		return rte_memset_scalar(s, c, n);
> > 	else
> > 		return rte_memset_huge(s, c, n);
> > }
> >
> > XXX could be either a define, or could also be a variable, so it can
> > be set up at startup, depending on the architecture.
> >
> > Would that work?
> > Konstantin
> >
I have implemented the code for choosing the functions at run time.
rte_memcpy is used more frequently, so I tested it with run-time selection.

typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
extern rte_memcpy_vector_t rte_memcpy_vector;
static inline void *
rte_memcpy(void *dst, const void *src, size_t n)
{
        return rte_memcpy_vector(dst, src, n);
}
In order to reduce the overhead at run time,
I assign the function address to the variable rte_memcpy_vector before main() starts, using a constructor to initialize it.

static void __attribute__((constructor))
rte_memcpy_init(void)
{
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
		rte_memcpy_vector = rte_memcpy_avx2;
	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
		rte_memcpy_vector = rte_memcpy_sse;
	else
		rte_memcpy_vector = memcpy;
}
I ran the same virtio/vhost loopback tests without a NIC.
I can see a throughput drop when choosing functions at run time,
compared to the original code, as follows on the same platform (my machine is Haswell):
	Packet size	perf drop
	64		-4%
	256		-5.4%
	1024		-5%
	1500		-2.5%
Another thing: I ran memcpy_perf_autotest. When N <= 128,
the rte_memcpy perf gains almost disappear when choosing functions at run time.
For other values of N, the perf gains become narrower.

Thanks
Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-15  6:51               ` Yang, Zhiyong
@ 2016-12-15 10:12                 ` Bruce Richardson
  2016-12-16 10:19                   ` Yang, Zhiyong
  2016-12-15 10:53                 ` Ananyev, Konstantin
  1 sibling, 1 reply; 44+ messages in thread
From: Bruce Richardson @ 2016-12-15 10:12 UTC (permalink / raw)
  To: Yang, Zhiyong
  Cc: Ananyev, Konstantin, Thomas Monjalon, dev, yuanhan.liu,
	De Lara Guarch, Pablo

On Thu, Dec 15, 2016 at 06:51:08AM +0000, Yang, Zhiyong wrote:
> Hi, Thomas, Konstantin:
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong
> > Sent: Sunday, December 11, 2016 8:33 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > Monjalon <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> > IA platform
> > 
> > Hi, Konstantin, Bruce:
> > 
> > > -----Original Message-----
> > > From: Ananyev, Konstantin
> > > Sent: Thursday, December 8, 2016 6:31 PM
> > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > > <thomas.monjalon@6wind.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > on IA platform
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Yang, Zhiyong
> > > > Sent: Thursday, December 8, 2016 9:53 AM
> > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > > Monjalon <thomas.monjalon@6wind.com>
> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.guarch@intel.com>
> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > > on IA platform
> > > >
> > > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> > >
> > > static inline void*
> > > rte_memset_huge(void *s, int c, size_t n) {
> > >    return __rte_memset_vector(s, c, n); }
> > >
> > > static inline void *
> > > rte_memset(void *s, int c, size_t n)
> > > {
> > > 	if (n < XXX)
> > > 		return rte_memset_scalar(s, c, n);
> > > 	else
> > > 		return rte_memset_huge(s, c, n);
> > > }
> > >
> > > XXX could be either a define, or could also be a variable, so it can
> > > be set up at startup, depending on the architecture.
> > >
> > > Would that work?
> > > Konstantin
> > >
> I have implemented the code for  choosing the functions at run time.
> rte_memcpy is used more frequently, So I test it at run time. 
> 
> typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
> extern rte_memcpy_vector_t rte_memcpy_vector;
> static inline void *
> rte_memcpy(void *dst, const void *src, size_t n)
> {
>         return rte_memcpy_vector(dst, src, n);
> }
> In order to reduce the overhead at run time, 
> I assign the function address to var rte_memcpy_vector before main() starts to init the var.
> 
> static void __attribute__((constructor))
> rte_memcpy_init(void)
> {
> 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> 	{
> 		rte_memcpy_vector = rte_memcpy_avx2;
> 	}
> 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> 	{
> 		rte_memcpy_vector = rte_memcpy_sse;
> 	}
> 	else
> 	{
> 		rte_memcpy_vector = memcpy;
> 	}
> 
> }
> I run the same virtio/vhost loopback tests without NIC.
> I can see the  throughput drop  when running choosing functions at run time
> compared to original code as following on the same platform(my machine is haswell) 
> 	Packet size	perf drop
> 	64 		-4%
> 	256 		-5.4%
> 	1024		-5%
> 	1500		-2.5%
> Another thing, I run the memcpy_perf_autotest,  when N= <128, 
> the rte_memcpy perf gains almost disappears
> When choosing functions at run time.  For N=other numbers, the perf gains will become narrow.
> 
How narrow? How significant is the improvement we gain from having
to maintain our own copy of memcpy? If the libc version is nearly as
good, we should just use that.

/Bruce

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-15  6:51               ` Yang, Zhiyong
  2016-12-15 10:12                 ` Bruce Richardson
@ 2016-12-15 10:53                 ` Ananyev, Konstantin
  2016-12-16  2:15                   ` Yang, Zhiyong
  1 sibling, 1 reply; 44+ messages in thread
From: Ananyev, Konstantin @ 2016-12-15 10:53 UTC (permalink / raw)
  To: Yang, Zhiyong, Thomas Monjalon
  Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo

Hi Zhiyong,

> -----Original Message-----
> From: Yang, Zhiyong
> Sent: Thursday, December 15, 2016 6:51 AM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas Monjalon
> <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform
> 
> Hi, Thomas, Konstantin:
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong
> > Sent: Sunday, December 11, 2016 8:33 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > Monjalon <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> > IA platform
> >
> > Hi, Konstantin, Bruce:
> >
> > > -----Original Message-----
> > > From: Ananyev, Konstantin
> > > Sent: Thursday, December 8, 2016 6:31 PM
> > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > > <thomas.monjalon@6wind.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > on IA platform
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Yang, Zhiyong
> > > > Sent: Thursday, December 8, 2016 9:53 AM
> > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > > Monjalon <thomas.monjalon@6wind.com>
> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.guarch@intel.com>
> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > > on IA platform
> > > >
> > > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> > >
> > > static inline void*
> > > rte_memset_huge(void *s, int c, size_t n) {
> > >    return __rte_memset_vector(s, c, n); }
> > >
> > > static inline void *
> > > rte_memset(void *s, int c, size_t n)
> > > {
> > > 	if (n < XXX)
> > > 		return rte_memset_scalar(s, c, n);
> > > 	else
> > > 		return rte_memset_huge(s, c, n);
> > > }
> > >
> > > XXX could be either a define, or could also be a variable, so it can
> > > be set up at startup, depending on the architecture.
> > >
> > > Would that work?
> > > Konstantin
> > >
> I have implemented the code for  choosing the functions at run time.
> rte_memcpy is used more frequently, So I test it at run time.
> 
> typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
> extern rte_memcpy_vector_t rte_memcpy_vector;
> static inline void *
> rte_memcpy(void *dst, const void *src, size_t n)
> {
>         return rte_memcpy_vector(dst, src, n);
> }
> In order to reduce the overhead at run time,
> I assign the function address to var rte_memcpy_vector before main() starts to init the var.
> 
> static void __attribute__((constructor))
> rte_memcpy_init(void)
> {
> 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> 	{
> 		rte_memcpy_vector = rte_memcpy_avx2;
> 	}
> 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> 	{
> 		rte_memcpy_vector = rte_memcpy_sse;
> 	}
> 	else
> 	{
> 		rte_memcpy_vector = memcpy;
> 	}
> 
> }

I thought we discussed a bit different approach,
in which rte_memcpy_vector() (rte_memset_vector()) would be called only above some cutoff point, i.e.:

void
rte_memcpy(void *dst, const void *src, size_t len)
{
	if (len < N) memcpy(dst, src, len);
	else rte_memcpy_vector(dst, src, len);
}

If you just always call rte_memcpy_vector() for every len,
then the compiler most likely has to generate a real function call every time
(no inlining happening).
For small lengths, the price of the extra function call would probably outweigh any
potential gain from the SSE/AVX2 implementation.

Konstantin 

> I run the same virtio/vhost loopback tests without NIC.
> I can see the  throughput drop  when running choosing functions at run time
> compared to original code as following on the same platform(my machine is haswell)
> 	Packet size	perf drop
> 	64 		-4%
> 	256 		-5.4%
> 	1024		-5%
> 	1500		-2.5%
> Another thing, I run the memcpy_perf_autotest,  when N= <128,
> the rte_memcpy perf gains almost disappears
> When choosing functions at run time.  For N=other numbers, the perf gains will become narrow.
> 
> Thanks
> Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-15 10:53                 ` Ananyev, Konstantin
@ 2016-12-16  2:15                   ` Yang, Zhiyong
  2016-12-16 11:47                     ` Ananyev, Konstantin
  0 siblings, 1 reply; 44+ messages in thread
From: Yang, Zhiyong @ 2016-12-16  2:15 UTC (permalink / raw)
  To: Ananyev, Konstantin, Thomas Monjalon
  Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo

Hi,Konstantin:

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Thursday, December 15, 2016 6:54 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> Hi Zhiyong,
> 
> > -----Original Message-----
> > From: Yang, Zhiyong
> > Sent: Thursday, December 15, 2016 6:51 AM
> > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Thomas Monjalon
> > <thomas.monjalon@6wind.com>
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > on IA platform
> >
> > Hi, Thomas, Konstantin:
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong
> > > Sent: Sunday, December 11, 2016 8:33 PM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > Monjalon <thomas.monjalon@6wind.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> rte_memset
> > > on IA platform
> > >
> > > Hi, Konstantin, Bruce:
> > >
> > > > -----Original Message-----
> > > > From: Ananyev, Konstantin
> > > > Sent: Thursday, December 8, 2016 6:31 PM
> > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > > > <thomas.monjalon@6wind.com>
> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.guarch@intel.com>
> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > > rte_memset on IA platform
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Yang, Zhiyong
> > > > > Sent: Thursday, December 8, 2016 9:53 AM
> > > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > > > Monjalon <thomas.monjalon@6wind.com>
> > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > > > <pablo.de.lara.guarch@intel.com>
> > > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > > > rte_memset on IA platform
> > > > >
> > > > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> > > >
> > > > static inline void*
> > > > rte_memset_huge(void *s, int c, size_t n) {
> > > >    return __rte_memset_vector(s, c, n); }
> > > >
> > > > static inline void *
> > > > rte_memset(void *s, int c, size_t n) {
> > > > 	if (n < XXX)
> > > > 		return rte_memset_scalar(s, c, n);
> > > > 	else
> > > > 		return rte_memset_huge(s, c, n); }
> > > >
> > > > XXX could be either a define, or could also be a variable, so it
> > > > can be set up at startup, depending on the architecture.
> > > >
> > > > Would that work?
> > > > Konstantin
> > > >
> > I have implemented the code for  choosing the functions at run time.
> > rte_memcpy is used more frequently, So I test it at run time.
> >
> > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src,
> > size_t n); extern rte_memcpy_vector_t rte_memcpy_vector; static inline
> > void * rte_memcpy(void *dst, const void *src, size_t n) {
> >         return rte_memcpy_vector(dst, src, n); } In order to reduce
> > the overhead at run time, I assign the function address to var
> > rte_memcpy_vector before main() starts to init the var.
> >
> > static void __attribute__((constructor))
> > rte_memcpy_init(void)
> > {
> > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > 	{
> > 		rte_memcpy_vector = rte_memcpy_avx2;
> > 	}
> > 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > 	{
> > 		rte_memcpy_vector = rte_memcpy_sse;
> > 	}
> > 	else
> > 	{
> > 		rte_memcpy_vector = memcpy;
> > 	}
> >
> > }
> 
> I thought we discussed a bit different approach.
> In which rte_memcpy_vector() (rte_memeset_vector) would be called  only
> after some cutoff point, i.e:
> 
> void
> rte_memcpy(void *dst, const void *src, size_t len) {
> 	if (len < N) memcpy(dst, src, len);
> 	else rte_memcpy_vector(dst, src, len);
> }
> 
> If you just always call rte_memcpy_vector() for every len, then it means that
> compiler most likely has always to generate a proper call (not inlining
> happening).

> For small length(s) price of extra function would probably overweight any
> potential gain with SSE/AVX2 implementation.
> 
> Konstantin

Yes. In fact, from my tests, for small lengths rte_memset is far better than glibc memset,
while for large lengths rte_memset is only a bit better,
because memset uses AVX2/SSE too. Of course, it will use AVX512 on future machines.

> For small length(s) price of extra function would probably overweight any
> potential gain.
This is the key point. I think it applies to the scalar optimization, not only the vector optimization.

The value of rte_memset is that it is always inlined, so for small lengths it will be better,
whereas in some cases we are not sure that memset is inlined by the compiler.
It seems that choosing the function at run time would lose those gains.
The following was tested on Haswell with the patch code.
** rte_memset() - memset perf tests
        (C = compile-time constant) **
======== ================ ==============
   Size   memset in cache  memset in mem
(bytes)           (ticks)        (ticks)
------- ----------------- --------------
============= 32B aligned ===============
      3           3 -   8     19 -  128
      4           4 -   8     13 -  128
      8           2 -   7     19 -  128
      9           2 -   7     19 -  127
     12           2 -   7     19 -  127
     17           3 -   8     19 -  132
     64           3 -   8     28 -  168
    128           7 -  13     54 -  200
    255           8 -  20    100 -  223
    511          14 -  20    187 -  314
   1024          24 -  29    328 -  379
   8192         198 - 225   1829 - 2193

Thanks
Zhiyong


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-15 10:12                 ` Bruce Richardson
@ 2016-12-16 10:19                   ` Yang, Zhiyong
  2016-12-19  6:27                     ` Yuanhan Liu
  0 siblings, 1 reply; 44+ messages in thread
From: Yang, Zhiyong @ 2016-12-16 10:19 UTC (permalink / raw)
  To: Richardson, Bruce
  Cc: Ananyev, Konstantin, Thomas Monjalon, dev, yuanhan.liu,
	De Lara Guarch, Pablo

Hi, Bruce:

> -----Original Message-----
> From: Richardson, Bruce
> Sent: Thursday, December 15, 2016 6:13 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> Monjalon <thomas.monjalon@6wind.com>; dev@dpdk.org;
> yuanhan.liu@linux.intel.com; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> On Thu, Dec 15, 2016 at 06:51:08AM +0000, Yang, Zhiyong wrote:
> > Hi, Thomas, Konstantin:
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong
> > > Sent: Sunday, December 11, 2016 8:33 PM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > Monjalon <thomas.monjalon@6wind.com>
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> rte_memset
> > > on IA platform
> > >
> > > Hi, Konstantin, Bruce:
> > >
> > > > -----Original Message-----
> > > > From: Ananyev, Konstantin
> > > > Sent: Thursday, December 8, 2016 6:31 PM
> > > > To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> > > > <thomas.monjalon@6wind.com>
> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.guarch@intel.com>
> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > > rte_memset on IA platform
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Yang, Zhiyong
> > > > > Sent: Thursday, December 8, 2016 9:53 AM
> > > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Thomas
> > > > > Monjalon <thomas.monjalon@6wind.com>
> > > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> > > > > <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> > > > > <pablo.de.lara.guarch@intel.com>
> > > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > > > rte_memset on IA platform
> > > > >
> > > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t n);
> > > >
> > > > static inline void*
> > > > rte_memset_huge(void *s, int c, size_t n) {
> > > >    return __rte_memset_vector(s, c, n); }
> > > >
> > > > static inline void *
> > > > rte_memset(void *s, int c, size_t n) {
> > > > 	If (n < XXX)
> > > > 		return rte_memset_scalar(s, c, n);
> > > > 	else
> > > > 		return rte_memset_huge(s, c, n); }
> > > >
> > > > XXX could be either a define, or could also be a variable, so it
> > > > can be setuped at startup, depending on the architecture.
> > > >
> > > > Would that work?
> > > > Konstantin
> > > >
> > I have implemented the code for  choosing the functions at run time.
> > rte_memcpy is used more frequently, So I test it at run time.
> >
> > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
> > extern rte_memcpy_vector_t rte_memcpy_vector;
> >
> > static inline void *
> > rte_memcpy(void *dst, const void *src, size_t n)
> > {
> >         return rte_memcpy_vector(dst, src, n);
> > }
> >
> > In order to reduce the overhead at run time, I assign the function
> > address to the variable rte_memcpy_vector before main() starts.
> >
> > static void __attribute__((constructor))
> > rte_memcpy_init(void)
> > {
> > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > 	{
> > 		rte_memcpy_vector = rte_memcpy_avx2;
> > 	}
> > 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > 	{
> > 		rte_memcpy_vector = rte_memcpy_sse;
> > 	}
> > 	else
> > 	{
> > 		rte_memcpy_vector = memcpy;
> > 	}
> >
> > }
> > I run the same virtio/vhost loopback tests without NIC.
> > I can see the  throughput drop  when running choosing functions at run
> > time compared to original code as following on the same platform(my
> machine is haswell)
> > 	Packet size	perf drop
> > 	64 		-4%
> > 	256 		-5.4%
> > 	1024		-5%
> > 	1500		-2.5%
> > Another thing, I run the memcpy_perf_autotest,  when N= <128, the
> > rte_memcpy perf gains almost disappears When choosing functions at run
> > time.  For N=other numbers, the perf gains will become narrow.
> >
> How narrow. How significant is the improvement that we gain from having to
> maintain our own copy of memcpy. If the libc version is nearly as good we
> should just use that.
> 
> /Bruce

Zhihong sent a patch for rte_memcpy. From that patch, we can see that the
memcpy optimization work brings clear performance improvements over glibc
for DPDK.
http://www.dpdk.org/dev/patchwork/patch/17753/
The git log reads as follows:
This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
from 64 to 1500 bytes.

thanks
Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-16  2:15                   ` Yang, Zhiyong
@ 2016-12-16 11:47                     ` Ananyev, Konstantin
  2016-12-20  9:31                       ` Yang, Zhiyong
  0 siblings, 1 reply; 44+ messages in thread
From: Ananyev, Konstantin @ 2016-12-16 11:47 UTC (permalink / raw)
  To: Yang, Zhiyong, Thomas Monjalon
  Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo

Hi Zhiyong,

> > > > > >
> > > > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t n);
> > > > >
> > > > > static inline void*
> > > > > rte_memset_huge(void *s, int c, size_t n) {
> > > > >    return __rte_memset_vector(s, c, n); }
> > > > >
> > > > > static inline void *
> > > > > rte_memset(void *s, int c, size_t n) {
> > > > > 	If (n < XXX)
> > > > > 		return rte_memset_scalar(s, c, n);
> > > > > 	else
> > > > > 		return rte_memset_huge(s, c, n); }
> > > > >
> > > > > XXX could be either a define, or could also be a variable, so it
> > > > > can be setuped at startup, depending on the architecture.
> > > > >
> > > > > Would that work?
> > > > > Konstantin
> > > > >
> > > I have implemented the code for  choosing the functions at run time.
> > > rte_memcpy is used more frequently, So I test it at run time.
> > >
> > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src,
> > > size_t n); extern rte_memcpy_vector_t rte_memcpy_vector; static inline
> > > void * rte_memcpy(void *dst, const void *src, size_t n) {
> > >         return rte_memcpy_vector(dst, src, n); } In order to reduce
> > > the overhead at run time, I assign the function address to var
> > > rte_memcpy_vector before main() starts to init the var.
> > >
> > > static void __attribute__((constructor))
> > > rte_memcpy_init(void)
> > > {
> > > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > > 	{
> > > 		rte_memcpy_vector = rte_memcpy_avx2;
> > > 	}
> > > 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > > 	{
> > > 		rte_memcpy_vector = rte_memcpy_sse;
> > > 	}
> > > 	else
> > > 	{
> > > 		rte_memcpy_vector = memcpy;
> > > 	}
> > >
> > > }
> >
> > I thought we discussed a bit different approach.
> > In which rte_memcpy_vector() (rte_memeset_vector) would be called  only
> > after some cutoff point, i.e:
> >
> > void
> > rte_memcpy(void *dst, const void *src, size_t len) {
> > 	if (len < N) memcpy(dst, src, len);
> > 	else rte_memcpy_vector(dst, src, len);
> > }
> >
> > If you just always call rte_memcpy_vector() for every len, then it means that
> > compiler most likely has always to generate a proper call (not inlining
> > happening).
> 
> > For small length(s) price of extra function would probably overweight any
> > potential gain with SSE/AVX2 implementation.
> >
> > Konstantin
> 
> Yes, in fact,  from my tests, For small length(s)  rte_memset is far better than glibc memset,
> For large lengths, rte_memset is only a bit better than memset.
> because memset use the AVX2/SSE, too. Of course, it will use AVX512 on future machine.

Ok, thanks for clarification.
From previous mails I got the wrong impression that on big lengths
rte_memset_vector() is significantly faster than memset().

> 
> >For small length(s) price of extra function would probably overweight any
>  >potential gain.
> This is the key point. I think it should include the scalar optimization, not only vector optimization.
> 
> The value of rte_memset is always inlined and for small lengths it will be better.
> when in some case We are not sure that memset is always inlined by compiler.

Ok, so do you know in what cases memset() does not get inlined?
Is it when the len parameter can't be precomputed by the compiler
(i.e. is not a constant)?

So to me it sounds like:
- We don't need an optimized version of rte_memset() for big sizes.
- Which probably means we don't need arch-specific versions of rte_memset_vector() at all -
   for small sizes (<= 32B) a scalar version would be good enough.
- For big sizes we can just rely on memset().
Is that so?
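The cutoff scheme summarized above could look roughly like the following sketch. RTE_MEMSET_SCALAR_MAX is a hypothetical cutoff name, not a value from the patch; small lengths take an inlinable byte loop while big lengths fall through to libc memset, which is already vectorized:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff; in practice it would be tuned per architecture. */
#define RTE_MEMSET_SCALAR_MAX 32

/* Small n: a trivially inlinable scalar loop (no call overhead).
 * Big n: delegate to memset(), whose vectorized path amortizes the call. */
static inline void *
rte_memset(void *s, int c, size_t n)
{
	if (n <= RTE_MEMSET_SCALAR_MAX) {
		unsigned char *p = s;

		while (n--)
			*p++ = (unsigned char)c;
		return s;
	}
	return memset(s, c, n);
}
```

With this shape the compiler can drop the branch entirely when n is a compile-time constant, keeping the small-size path free.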

> It seems that choosing function at run time will lose the gains.
> The following is tested on haswell by patch code.

I'm not sure what columns 2 and 3 in the table below mean.
Konstantin

> ** rte_memset() - memset perf tests
>         (C = compile-time constant) **
> ======== ======= ======== ======= ========
>    Size memset in cache  memset in mem
> (bytes)        (ticks)        (ticks)
> ------- -------------- ---------------
> ============= 32B aligned ================
>       3            3 -    8       19 -  128
>       4            4 -    8       13 -  128
>       8            2 -    7       19 -  128
>       9            2 -    7       19 -  127
>      12           2 -    7       19 -  127
>      17          3 -    8        19 -  132
>      64          3 -    8        28 -  168
>     128        7 -   13       54 -  200
>     255        8 -   20       100 -  223
>     511        14 -   20     187 -  314
>    1024      24 -   29     328 -  379
>    8192     198 -  225   1829 - 2193
> 
> Thanks
> Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-16 10:19                   ` Yang, Zhiyong
@ 2016-12-19  6:27                     ` Yuanhan Liu
  2016-12-20  2:41                       ` Yao, Lei A
  0 siblings, 1 reply; 44+ messages in thread
From: Yuanhan Liu @ 2016-12-19  6:27 UTC (permalink / raw)
  To: Yang, Zhiyong
  Cc: Richardson, Bruce, Ananyev, Konstantin, Thomas Monjalon, dev,
	De Lara Guarch, Pablo, Wang, Zhihong

On Fri, Dec 16, 2016 at 10:19:43AM +0000, Yang, Zhiyong wrote:
> > > I run the same virtio/vhost loopback tests without NIC.
> > > I can see the  throughput drop  when running choosing functions at run
> > > time compared to original code as following on the same platform(my
> > machine is haswell)
> > > 	Packet size	perf drop
> > > 	64 		-4%
> > > 	256 		-5.4%
> > > 	1024		-5%
> > > 	1500		-2.5%
> > > Another thing, I run the memcpy_perf_autotest,  when N= <128, the
> > > rte_memcpy perf gains almost disappears When choosing functions at run
> > > time.  For N=other numbers, the perf gains will become narrow.
> > >
> > How narrow. How significant is the improvement that we gain from having to
> > maintain our own copy of memcpy. If the libc version is nearly as good we
> > should just use that.
> > 
> > /Bruce
> 
> Zhihong sent a patch about rte_memcpy,  From the patch,  
> we can see the optimization job for memcpy will bring obvious perf improvements
> than glibc for DPDK.

Just a clarification: it's better than the __original DPDK__ rte_memcpy,
not the glibc one. That makes me wonder: has anyone tested memcpy
with big packets? Does the DPDK version outperform the glibc one,
even for big packets?

	--yliu

> http://www.dpdk.org/dev/patchwork/patch/17753/
> git log as following:
> This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
> up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
> from 64 to 1500 bytes.
> 
> thanks
> Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-19  6:27                     ` Yuanhan Liu
@ 2016-12-20  2:41                       ` Yao, Lei A
  0 siblings, 0 replies; 44+ messages in thread
From: Yao, Lei A @ 2016-12-20  2:41 UTC (permalink / raw)
  To: Yuanhan Liu, Yang, Zhiyong
  Cc: Richardson, Bruce, Ananyev, Konstantin, Thomas Monjalon, dev,
	De Lara Guarch, Pablo, Wang, Zhihong

> On Fri, Dec 16, 2016 at 10:19:43AM +0000, Yang, Zhiyong wrote:
> > > > I run the same virtio/vhost loopback tests without NIC.
> > > > I can see the  throughput drop  when running choosing functions at run
> > > > time compared to original code as following on the same platform(my
> > > machine is haswell)
> > > > 	Packet size	perf drop
> > > > 	64 		-4%
> > > > 	256 		-5.4%
> > > > 	1024		-5%
> > > > 	1500		-2.5%
> > > > Another thing, I run the memcpy_perf_autotest,  when N= <128, the
> > > > rte_memcpy perf gains almost disappears When choosing functions at
> run
> > > > time.  For N=other numbers, the perf gains will become narrow.
> > > >
> > > How narrow. How significant is the improvement that we gain from
> having to
> > > maintain our own copy of memcpy. If the libc version is nearly as good we
> > > should just use that.
> > >
> > > /Bruce
> >
> > Zhihong sent a patch about rte_memcpy,  From the patch,
> > we can see the optimization job for memcpy will bring obvious perf
> improvements
> > than glibc for DPDK.
> 
> Just a clarification: it's better than the __original DPDK__ rte_memcpy
> but not the glibc one. That makes me think have any one tested the memcpy
> with big packets? Does the one from DPDK outweigh the one from glibc,
> even for big packets?
> 
> 	--yliu
> 
I have tested the loopback performance of rte_memcpy and glibc memcpy. For both small
and big packets, rte_memcpy has better performance. My test environment is as follows:
CPU: BDW (Broadwell)
OS: Ubuntu 16.04
Kernel: 4.4.0
gcc: 5.4.0
Path: mergeable
Size       rte_memcpy performance gain
64           31%
128         35%
260         27%
520         33%
1024      18%
1500      12%

--Lei
> > http://www.dpdk.org/dev/patchwork/patch/17753/
> > git log as following:
> > This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
> > up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
> > from 64 to 1500 bytes.
> >
> > thanks
> > Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-16 11:47                     ` Ananyev, Konstantin
@ 2016-12-20  9:31                       ` Yang, Zhiyong
  0 siblings, 0 replies; 44+ messages in thread
From: Yang, Zhiyong @ 2016-12-20  9:31 UTC (permalink / raw)
  To: Ananyev, Konstantin, Thomas Monjalon
  Cc: dev, yuanhan.liu, Richardson, Bruce, De Lara Guarch, Pablo

Hi, Konstantin:

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Friday, December 16, 2016 7:48 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>; Thomas Monjalon
> <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce
> <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> Hi Zhiyong,
> 
> > > > > > >
> > > > > > extern void *(*__rte_memset_vector)( (void *s, int c, size_t
> > > > > > n);
> > > > > >
> > > > > > static inline void*
> > > > > > rte_memset_huge(void *s, int c, size_t n) {
> > > > > >    return __rte_memset_vector(s, c, n); }
> > > > > >
> > > > > > static inline void *
> > > > > > rte_memset(void *s, int c, size_t n) {
> > > > > > 	If (n < XXX)
> > > > > > 		return rte_memset_scalar(s, c, n);
> > > > > > 	else
> > > > > > 		return rte_memset_huge(s, c, n); }
> > > > > >
> > > > > > XXX could be either a define, or could also be a variable, so
> > > > > > it can be setuped at startup, depending on the architecture.
> > > > > >
> > > > > > Would that work?
> > > > > > Konstantin
> > > > > >
> > > > I have implemented the code for  choosing the functions at run time.
> > > > rte_memcpy is used more frequently, So I test it at run time.
> > > >
> > > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src,
> > > > size_t n); extern rte_memcpy_vector_t rte_memcpy_vector; static
> > > > inline void * rte_memcpy(void *dst, const void *src, size_t n) {
> > > >         return rte_memcpy_vector(dst, src, n); } In order to
> > > > reduce the overhead at run time, I assign the function address to
> > > > var rte_memcpy_vector before main() starts to init the var.
> > > >
> > > > static void __attribute__((constructor))
> > > > rte_memcpy_init(void)
> > > > {
> > > > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > > > 	{
> > > > 		rte_memcpy_vector = rte_memcpy_avx2;
> > > > 	}
> > > > 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > > > 	{
> > > > 		rte_memcpy_vector = rte_memcpy_sse;
> > > > 	}
> > > > 	else
> > > > 	{
> > > > 		rte_memcpy_vector = memcpy;
> > > > 	}
> > > >
> > > > }
> > >
> > > I thought we discussed a bit different approach.
> > > In which rte_memcpy_vector() (rte_memeset_vector) would be called
> > > only after some cutoff point, i.e:
> > >
> > > void
> > > rte_memcpy(void *dst, const void *src, size_t len) {
> > > 	if (len < N) memcpy(dst, src, len);
> > > 	else rte_memcpy_vector(dst, src, len); }
> > >
> > > If you just always call rte_memcpy_vector() for every len, then it
> > > means that compiler most likely has always to generate a proper call
> > > (not inlining happening).
> >
> > > For small length(s) price of extra function would probably
> > > overweight any potential gain with SSE/AVX2 implementation.
> > >
> > > Konstantin
> >
> > Yes, in fact,  from my tests, For small length(s)  rte_memset is far
> > better than glibc memset, For large lengths, rte_memset is only a bit better
> than memset.
> > because memset use the AVX2/SSE, too. Of course, it will use AVX512 on
> future machine.
> 
> Ok, thanks for clarification.
> From previous mails I got a wrong  impression that on big lengths
> rte_memset_vector() is significantly faster than memset().
> 
> >
> > >For small length(s) price of extra function would probably overweight
> > >any
> >  >potential gain.
> > This is the key point. I think it should include the scalar optimization, not
> only vector optimization.
> >
> > The value of rte_memset is always inlined and for small lengths it will be
> better.
> > when in some case We are not sure that memset is always inlined by
> compiler.
> 
> Ok, so do you know in what cases memset() is not get inlined?
> Is it when len parameter can't be precomputed by the compiler (is not a
> constant)?
> 
> So to me it sounds like:
> - We don't need to have an optimized verision of rte_memset() for big sizes.
> - Which probably means we don't need an arch specific versions of
> rte_memset_vector() at all -
>    for small sizes (<= 32B) scalar version would be good enough.
> - For big sizes we can just rely on memset().
> Is that so?

Using memset has actually run into trouble in some cases, for example:
http://dpdk.org/ml/archives/dev/2016-October/048628.html

> 
> > It seems that choosing function at run time will lose the gains.
> > The following is tested on haswell by patch code.
> 
> Not sure what columns 2 and 3 in the table below mean?
> Konstantin

Column 1 shows the size (bytes).
Column 2 shows rte_memset vs. memset perf results in cache.
Column 3 shows rte_memset vs. memset perf results in memory.
The data was collected using rte_rdtsc().
The test can be run using [PATCH 3/4] app/test: add performance autotest for rte_memset.

Thanks
Zhiyong
> 
> > ** rte_memset() - memset perf tests
> >         (C = compile-time constant) ** ======== ======= ========
> > ======= ========
> >    Size memset in cache  memset in mem
> > (bytes)        (ticks)        (ticks)
> > ------- -------------- --------------- ============= 32B aligned
> > ================
> >       3            3 -    8       19 -  128
> >       4            4 -    8       13 -  128
> >       8            2 -    7       19 -  128
> >       9            2 -    7       19 -  127
> >      12           2 -    7       19 -  127
> >      17          3 -    8        19 -  132
> >      64          3 -    8        28 -  168
> >     128        7 -   13       54 -  200
> >     255        8 -   20       100 -  223
> >     511        14 -   20     187 -  314
> >    1024      24 -   29     328 -  379
> >    8192     198 -  225   1829 - 2193
> >
> > Thanks
> > Zhiyong

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 0/4] eal/common: introduce rte_memset and related test
  2016-12-02  8:36 ` [PATCH 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang
  2016-12-02 10:25   ` Thomas Monjalon
@ 2016-12-27 10:04   ` Zhiyong Yang
  2016-12-27 10:04     ` [PATCH v2 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang
                       ` (4 more replies)
  1 sibling, 5 replies; 44+ messages in thread
From: Zhiyong Yang @ 2016-12-27 10:04 UTC (permalink / raw)
  To: dev
  Cc: yuanhan.liu, thomas.monjalon, bruce.richardson,
	konstantin.ananyev, pablo.de.lara.guarch

DPDK code has suffered bad performance drops in some cases when calling the
glibc function memset. See the discussions about memset at
http://dpdk.org/ml/archives/dev/2016-October/048628.html
It is necessary to introduce a more efficient function to fix this.
One important property of rte_memset is that we get clear control
over which instruction flow is used.

This patchset introduces rte_memset as a more efficient
implementation, bringing an obvious perf improvement, especially
for small N bytes in most application scenarios.

Patch 1 implements rte_memset in the file rte_memset.h on the IA platform.
The file supports three types of instruction sets: sse & avx
(128 bits), avx2 (256 bits) and avx512 (512 bits). rte_memset makes use of
vectorization and inline functions to improve the perf on IA. In addition,
cache line and memory alignment are fully taken into consideration.

Patch 2 implements a functional autotest to validate that the function
works correctly.

Patch 3 implements performance autotests separately in cache and in memory.
We can see that the perf of rte_memset is clearly better than glibc memset,
especially for small N bytes.

Patch 4 uses rte_memset instead of copy_virtio_net_hdr, which brings 3%~4%
performance improvement on the IA platform in virtio/vhost non-mergeable
loopback testing.

Changes in V2:

Patch 1:
Rename rte_memset.h -> rte_memset_64.h and create a file rte_memset.h
for each arch.

Patch 3:
add perf comparison data between rte_memset and memset on Haswell.

Patch 4:
Modify release_17_02.rst description.

Zhiyong Yang (4):
  eal/common: introduce rte_memset on IA platform
  app/test: add functional autotest for rte_memset
  app/test: add performance autotest for rte_memset
  lib/librte_vhost: improve vhost perf using rte_memset

 app/test/Makefile                                  |   3 +
 app/test/test_memset.c                             | 158 +++++++++
 app/test/test_memset_perf.c                        | 348 +++++++++++++++++++
 doc/guides/rel_notes/release_17_02.rst             |   7 +
 .../common/include/arch/arm/rte_memset.h           |  36 ++
 .../common/include/arch/ppc_64/rte_memset.h        |  36 ++
 .../common/include/arch/tile/rte_memset.h          |  36 ++
 .../common/include/arch/x86/rte_memset.h           |  51 +++
 .../common/include/arch/x86/rte_memset_64.h        | 378 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_memset.h |  52 +++
 lib/librte_vhost/virtio_net.c                      |  18 +-
 11 files changed, 1116 insertions(+), 7 deletions(-)
 create mode 100644 app/test/test_memset.c
 create mode 100644 app/test/test_memset_perf.c
 create mode 100644 lib/librte_eal/common/include/arch/arm/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/ppc_64/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/tile/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset_64.h
 create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h

-- 
2.7.4

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 1/4] eal/common: introduce rte_memset on IA platform
  2016-12-27 10:04   ` [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
@ 2016-12-27 10:04     ` Zhiyong Yang
  2016-12-27 10:04     ` [PATCH v2 2/4] app/test: add functional autotest for rte_memset Zhiyong Yang
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 44+ messages in thread
From: Zhiyong Yang @ 2016-12-27 10:04 UTC (permalink / raw)
  To: dev
  Cc: yuanhan.liu, thomas.monjalon, bruce.richardson,
	konstantin.ananyev, pablo.de.lara.guarch, Zhiyong Yang

Performance drops have been seen in some cases when DPDK code calls the
glibc function memset. Please refer to the discussions about memset at
http://dpdk.org/ml/archives/dev/2016-October/048628.html
It is necessary to introduce a more efficient function to fix this.
One important property of rte_memset is that we get clear control
over which instruction flow is used. This patch supports instruction sets
such as sse & avx (128 bits), avx2 (256 bits) and avx512 (512 bits).
rte_memset makes full use of vectorization and inline functions to improve
the perf on IA. In addition, cache line and memory alignment are fully
taken into consideration.
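The 128-bit (sse) variant described above can be sketched like this. It is illustrative only, under the assumption of a simple unaligned-store loop; the actual patch also provides avx2/avx512 paths, aligned stores, and small-size special cases, and the function name here is invented:

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>

/* Broadcast the fill byte into an XMM register with _mm_set1_epi8 and
 * store 16 bytes per iteration with an unaligned store; finish the tail
 * byte-by-byte. */
static inline void *
rte_memset16(void *s, int c, size_t n)
{
	__m128i v = _mm_set1_epi8((char)c);
	unsigned char *p = s;

	while (n >= 16) {
		_mm_storeu_si128((__m128i *)(void *)p, v);
		p += 16;
		n -= 16;
	}
	while (n--)
		*p++ = (unsigned char)c;
	return s;
}
```

The same structure widens naturally: the avx2 path would use 32-byte `_mm256_storeu_si256` stores and the avx512 path 64-byte stores, which is why the patch keys the implementation off the available instruction set.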

Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
---

Changes in V2:

Rename rte_memset.h -> rte_memset_64.h and create a file rte_memset.h
for each arch.

 .../common/include/arch/arm/rte_memset.h           |  36 ++
 .../common/include/arch/ppc_64/rte_memset.h        |  36 ++
 .../common/include/arch/tile/rte_memset.h          |  36 ++
 .../common/include/arch/x86/rte_memset.h           |  51 +++
 .../common/include/arch/x86/rte_memset_64.h        | 378 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_memset.h |  52 +++
 6 files changed, 589 insertions(+)
 create mode 100644 lib/librte_eal/common/include/arch/arm/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/ppc_64/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/tile/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset_64.h
 create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h

diff --git a/lib/librte_eal/common/include/arch/arm/rte_memset.h b/lib/librte_eal/common/include/arch/arm/rte_memset.h
new file mode 100644
index 0000000..6945f6d
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/arm/rte_memset.h
@@ -0,0 +1,36 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of RehiveTech nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMSET_ARM_H_
+#define _RTE_MEMSET_ARM_H_
+
+#define rte_memset memset
+
+#endif /* _RTE_MEMSET_ARM_H_ */
diff --git a/lib/librte_eal/common/include/arch/ppc_64/rte_memset.h b/lib/librte_eal/common/include/arch/ppc_64/rte_memset.h
new file mode 100644
index 0000000..0d73f05
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/ppc_64/rte_memset.h
@@ -0,0 +1,36 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of IBM Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifndef _RTE_MEMSET_PPC_64_H_
+#define _RTE_MEMSET_PPC_64_H_
+
+#define rte_memset memset
+
+#endif /* _RTE_MEMSET_PPC_64_H_ */
diff --git a/lib/librte_eal/common/include/arch/tile/rte_memset.h b/lib/librte_eal/common/include/arch/tile/rte_memset.h
new file mode 100644
index 0000000..e8a1aa1
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/tile/rte_memset.h
@@ -0,0 +1,36 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of EZchip Semiconductor nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMSET_TILE_H_
+#define _RTE_MEMSET_TILE_H_
+
+#define rte_memset memset
+
+#endif /* _RTE_MEMSET_TILE_H_ */
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memset.h b/lib/librte_eal/common/include/arch/x86/rte_memset.h
new file mode 100644
index 0000000..86e0812
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memset.h
@@ -0,0 +1,51 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMSET_X86_H_
+#define _RTE_MEMSET_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#ifdef RTE_ARCH_X86_64
+#include "rte_memset_64.h"
+#else
+#define rte_memset memset
+#endif
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMSET_X86_H_ */
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memset_64.h b/lib/librte_eal/common/include/arch/x86/rte_memset_64.h
new file mode 100644
index 0000000..f25d344
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memset_64.h
@@ -0,0 +1,378 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMSET_X86_64_H_
+#define _RTE_MEMSET_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file
+ *
+ * Functions for vectorised implementation of memset().
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <rte_vect.h>
+
+static inline void *
+rte_memset(void *dst, int a, size_t n) __attribute__((always_inline));
+
+static inline void
+rte_memset_less16(void *dst, int a, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+
+	if (n >= 8) {
+		uint8_t b = (uint8_t)a;
+		uint16_t c = b | ((uint16_t)b << 8);
+		uint32_t d = c | ((uint32_t)c << 16);
+		uint64_t e = d | ((uint64_t)d << 32);
+
+		*(uint64_t *)dstu = e;
+		*(uint64_t *)((uint8_t *)dstu + n - 8) = e;
+	} else {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = (uint8_t)a;
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = (uint8_t)a | (((uint8_t)a) << 8);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			uint16_t b = ((uint8_t)a | (((uint8_t)a) << 8));
+
+			*(uint32_t *)dstu = (uint32_t)b | ((uint32_t)b << 16);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+	}
+}
+
+static inline void
+rte_memset16(uint8_t *dst, int8_t a)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_set1_epi8(a);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+static inline void
+rte_memset_17to32(void *dst, int a, size_t n)
+{
+	rte_memset16((uint8_t *)dst, a);
+	rte_memset16((uint8_t *)dst - 16 + n, a);
+}
+
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+
+/**
+ * AVX512 implementation below
+ */
+
+static inline void
+rte_memset32(uint8_t *dst, int8_t a)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_set1_epi8(a);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+static inline void
+rte_memset64(uint8_t *dst, int8_t a)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_set1_epi8(a);
+	_mm512_storeu_si512((void *)dst, zmm0);
+}
+
+static inline void
+rte_memset128blocks(uint8_t *dst, int8_t a, size_t n)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_set1_epi8(a);
+	while (n >= 128) {
+		n -= 128;
+		_mm512_store_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_store_si512((void *)(dst + 1 * 64), zmm0);
+		dst = dst + 128;
+	}
+}
+
+static inline void *
+rte_memset(void *dst, int a, size_t n)
+{
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	if (n < 16) {
+		rte_memset_less16(dst, a, n);
+		return ret;
+	} else if (n == 16) {
+		rte_memset16((uint8_t *)dst, a);
+		return ret;
+	}
+	if (n <= 32) {
+		rte_memset_17to32(dst, a, n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_memset32((uint8_t *)dst, a);
+		rte_memset32((uint8_t *)dst - 32 + n, a);
+		return ret;
+	}
+	if (n >= 256) {
+		dstofss = ((uintptr_t)dst & 0x3F);
+		if (dstofss > 0) {
+			dstofss = 64 - dstofss;
+			n -= dstofss;
+			rte_memset64((uint8_t *)dst, a);
+			dst = (uint8_t *)dst + dstofss;
+		}
+		rte_memset128blocks((uint8_t *)dst, a, n);
+		bits = n;
+		n = n & 127;
+		bits -= n;
+		dst = (uint8_t *)dst + bits;
+	}
+	if (n > 128) {
+		n -= 128;
+		rte_memset64((uint8_t *)dst, a);
+		rte_memset64((uint8_t *)dst + 64, a);
+		dst = (uint8_t *)dst + 128;
+	}
+	if (n > 64) {
+		rte_memset64((uint8_t *)dst, a);
+		rte_memset64((uint8_t *)dst - 64 + n, a);
+		return ret;
+	}
+	if (n > 0)
+		rte_memset64((uint8_t *)dst - 64 + n, a);
+	return ret;
+}
+
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
+
+/**
+ *  AVX2 implementation below
+ */
+
+static inline void
+rte_memset32(uint8_t *dst, int8_t a)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_set1_epi8(a);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+static inline void
+rte_memset_33to64(void *dst, int a, size_t n)
+{
+	rte_memset32((uint8_t *)dst, a);
+	rte_memset32((uint8_t *)dst - 32 + n, a);
+}
+
+static inline void
+rte_memset64blocks(uint8_t *dst, int8_t a, size_t n)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_set1_epi8(a);
+	while (n >= 64) {
+		n -= 64;
+		_mm256_store_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+		_mm256_store_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm0);
+		dst = (uint8_t *)dst + 64;
+	}
+}
+
+static inline void *
+rte_memset(void *dst, int a, size_t n)
+{
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	if (n < 16) {
+		rte_memset_less16(dst, a, n);
+		return ret;
+	} else if (n == 16) {
+		rte_memset16((uint8_t *)dst, a);
+		return ret;
+	}
+	if (n <= 32) {
+		rte_memset_17to32(dst, a, n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_memset_33to64(dst, a, n);
+		return ret;
+	}
+	if (n > 64) {
+		dstofss = (uintptr_t)dst & 0x1F;
+		if (dstofss > 0) {
+			dstofss = 32 - dstofss;
+			n -= dstofss;
+			rte_memset32((uint8_t *)dst, a);
+			dst = (uint8_t *)dst + dstofss;
+		}
+		rte_memset64blocks((uint8_t *)dst, a, n);
+		bits = n;
+		n = n & 63;
+		bits -= n;
+		dst = (uint8_t *)dst + bits;
+	}
+	if (n > 32) {
+		rte_memset_33to64(dst, a, n);
+		return ret;
+	}
+	if (n > 0)
+		rte_memset32((uint8_t *)dst - 32 + n, a);
+	return ret;
+}
+
+#else /* RTE_MACHINE_CPUFLAG */
+
+/**
+ * SSE && AVX implementation below
+ */
+
+static inline void
+rte_memset32(uint8_t *dst, int8_t a)
+{
+	__m128i xmm0 = _mm_set1_epi8(a);
+
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+	_mm_storeu_si128((__m128i *)(dst + 16), xmm0);
+}
+
+static inline void
+rte_memset16blocks(uint8_t *dst, int8_t a, size_t n)
+{
+	__m128i xmm0 = _mm_set1_epi8(a);
+
+	while (n >= 16) {
+		n -= 16;
+		_mm_store_si128((__m128i *)(dst + 0 * 16), xmm0);
+		dst = (uint8_t *)dst + 16;
+	}
+}
+
+static inline void
+rte_memset64blocks(uint8_t *dst, int8_t a, size_t n)
+{
+	__m128i xmm0 = _mm_set1_epi8(a);
+
+	while (n >= 64) {
+		n -= 64;
+		_mm_store_si128((__m128i *)(dst + 0 * 16), xmm0);
+		_mm_store_si128((__m128i *)(dst + 1 * 16), xmm0);
+		_mm_store_si128((__m128i *)(dst + 2 * 16), xmm0);
+		_mm_store_si128((__m128i *)(dst + 3 * 16), xmm0);
+		dst = (uint8_t *)dst + 64;
+	}
+}
+
+static inline void *
+rte_memset(void *dst, int a, size_t n)
+{
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	if (n < 16) {
+		rte_memset_less16(dst, a, n);
+		return ret;
+	} else if (n == 16) {
+		rte_memset16((uint8_t *)dst, a);
+		return ret;
+	}
+	if (n <= 32) {
+		rte_memset_17to32(dst, a, n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_memset32((uint8_t *)dst, a);
+		rte_memset16((uint8_t *)dst - 16 + n, a);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_memset32((uint8_t *)dst, a);
+		rte_memset16((uint8_t *)dst + 32, a);
+		rte_memset16((uint8_t *)dst - 16 + n, a);
+		return ret;
+	}
+	if (n > 64) {
+		dstofss = (uintptr_t)dst & 0xF;
+		if (dstofss > 0) {
+			dstofss = 16 - dstofss;
+			n -= dstofss;
+			rte_memset16((uint8_t *)dst, a);
+			dst = (uint8_t *)dst + dstofss;
+		}
+		rte_memset64blocks((uint8_t *)dst, a, n);
+		bits = n;
+		n &= 63;
+		bits -= n;
+		dst = (uint8_t *)dst + bits;
+		rte_memset16blocks((uint8_t *)dst, a, n);
+		bits = n;
+		n &= 0xf;
+		bits -= n;
+		dst = (uint8_t *)dst + bits;
+		if (n > 0) {
+			rte_memset16((uint8_t *)dst - 16 + n, a);
+			return ret;
+		}
+	}
+	return ret;
+}
+
+#endif /* RTE_MACHINE_CPUFLAG */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMSET_X86_64_H_ */
diff --git a/lib/librte_eal/common/include/generic/rte_memset.h b/lib/librte_eal/common/include/generic/rte_memset.h
new file mode 100644
index 0000000..b03a7d0
--- /dev/null
+++ b/lib/librte_eal/common/include/generic/rte_memset.h
@@ -0,0 +1,52 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMSET_H_
+#define _RTE_MEMSET_H_
+
+/**
+ * @file
+ *
+ * Functions for vectorised implementation of memset().
+ */
+#ifdef _RTE_MEMSET_X86_64_H_
+
+static void *
+rte_memset(void *dst, int a, size_t n);
+
+#else
+
+#define rte_memset memset
+
+#endif
+#endif /* _RTE_MEMSET_H_ */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 2/4] app/test: add functional autotest for rte_memset
  2016-12-27 10:04   ` [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
  2016-12-27 10:04     ` [PATCH v2 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang
@ 2016-12-27 10:04     ` Zhiyong Yang
  2016-12-27 10:04     ` [PATCH v2 3/4] app/test: add performance " Zhiyong Yang
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 44+ messages in thread
From: Zhiyong Yang @ 2016-12-27 10:04 UTC (permalink / raw)
  To: dev
  Cc: yuanhan.liu, thomas.monjalon, bruce.richardson,
	konstantin.ananyev, pablo.de.lara.guarch, Zhiyong Yang

The file implements the functional autotest for rte_memset, which
validates that the new rte_memset function works correctly. The
implementation of test_memcpy.c is used as a reference.

Usage:
step 1: run ./x86_64-native-linuxapp-gcc/app/test
step 2: run the memset_autotest command at the test prompt.

Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
---
 app/test/Makefile      |   2 +
 app/test/test_memset.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 160 insertions(+)
 create mode 100644 app/test/test_memset.c

diff --git a/app/test/Makefile b/app/test/Makefile
index 5be023a..82da3f3 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -123,6 +123,8 @@ SRCS-y += test_logs.c
 SRCS-y += test_memcpy.c
 SRCS-y += test_memcpy_perf.c
 
+SRCS-y += test_memset.c
+
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash.c
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_thash.c
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_perf.c
diff --git a/app/test/test_memset.c b/app/test/test_memset.c
new file mode 100644
index 0000000..c9020bf
--- /dev/null
+++ b/app/test/test_memset.c
@@ -0,0 +1,158 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <rte_common.h>
+#include <rte_random.h>
+#include <rte_memset.h>
+#include "test.h"
+
+/*
+ * Set this to the maximum buffer size you want to test. If it is 0, then the
+ * values in the buf_sizes[] array below will be used.
+ */
+#define TEST_VALUE_RANGE 0
+#define MAX_INT8 127
+#define MIN_INT8 -128
+/* List of buffer sizes to test */
+#if TEST_VALUE_RANGE == 0
+static size_t buf_sizes[] = {
+	0, 1, 7, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129,
+	255, 256, 257, 320, 384, 511, 512, 513, 1023, 1024, 1025, 1518,
+	1522, 1600, 2048, 3072, 4096, 5120, 6144, 7168, 8192
+};
+/* MUST be as large as largest packet size above */
+#define BUFFER_SIZE       8192
+#else /* TEST_VALUE_RANGE != 0 */
+static size_t buf_sizes[TEST_VALUE_RANGE];
+#define BUFFER_SIZE       TEST_VALUE_RANGE
+#endif /* TEST_VALUE_RANGE == 0 */
+
+/* Data is aligned on this many bytes (power of 2) */
+#define ALIGNMENT_UNIT 32
+
+/*
+ * Create two buffers: one (ref_buff) is initialized as the reference
+ * buffer with random values, and the other (dest_buff) is copied from
+ * it. Fill a region of dest_buff with ch using rte_memset and then
+ * compare to see if the rte_memset was successful. The bytes outside
+ * the set region are also checked to make sure they were not changed.
+ */
+static int
+test_single_memset(unsigned int off_dst, int ch, size_t size)
+{
+	unsigned int i;
+	uint8_t dest_buff[BUFFER_SIZE + ALIGNMENT_UNIT];
+	uint8_t ref_buff[BUFFER_SIZE + ALIGNMENT_UNIT];
+	void *ret;
+
+	/* Setup buffers */
+	for (i = 0; i < BUFFER_SIZE + ALIGNMENT_UNIT; i++) {
+		ref_buff[i] = (uint8_t) rte_rand();
+		dest_buff[i] = ref_buff[i];
+	}
+	/* Do the rte_memset */
+	ret = rte_memset(dest_buff + off_dst, ch, size);
+	if (ret != (dest_buff + off_dst)) {
+		printf("rte_memset() returned %p, not %p\n",
+		       ret, dest_buff + off_dst);
+		return -1;
+	}
+	/* Check nothing before offset was affected */
+	for (i = 0; i < off_dst; i++) {
+		if (dest_buff[i] != ref_buff[i]) {
+			printf("rte_memset() failed for %u bytes (offset=%u): "
+			       "[modified before start of dst].\n",
+			       (unsigned int)size, off_dst);
+			return -1;
+		}
+	}
+	/* Check every byte was set */
+	for (i = 0; i < size; i++) {
+		if (dest_buff[i + off_dst] != (uint8_t)ch) {
+			printf("rte_memset() failed for %u bytes (offset=%u): "
+			       "[didn't memset byte %u].\n",
+			       (unsigned int)size, off_dst, i);
+			return -1;
+		}
+	}
+	/* Check nothing after memset was affected */
+	for (i = off_dst + size; i < BUFFER_SIZE + ALIGNMENT_UNIT; i++) {
+		if (dest_buff[i] != ref_buff[i]) {
+			printf("rte_memset() failed for %u bytes (offset=%u): "
+			       "[memset too many].\n",
+			       (unsigned int)size, off_dst);
+			return -1;
+		}
+	}
+	return 0;
+}
+
+/*
+ * Check functionality for various buffer sizes and data offsets/alignments.
+ */
+static int
+func_test(void)
+{
+	unsigned int off_dst, i;
+	unsigned int num_buf_sizes = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+	int ret;
+	int j;
+
+	for (j = MIN_INT8; j <= MAX_INT8; j++) {
+		for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
+			for (i = 0; i < num_buf_sizes; i++) {
+				ret = test_single_memset(off_dst, j,
+							 buf_sizes[i]);
+				if (ret != 0)
+					return -1;
+			}
+		}
+	}
+	return 0;
+}
+
+static int
+test_memset(void)
+{
+	int ret;
+
+	ret = func_test();
+	if (ret != 0)
+		return -1;
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(memset_autotest, test_memset);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 3/4] app/test: add performance autotest for rte_memset
  2016-12-27 10:04   ` [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
  2016-12-27 10:04     ` [PATCH v2 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang
  2016-12-27 10:04     ` [PATCH v2 2/4] app/test: add functional autotest for rte_memset Zhiyong Yang
@ 2016-12-27 10:04     ` Zhiyong Yang
  2016-12-27 10:04     ` [PATCH v2 4/4] lib/librte_vhost: improve vhost perf using rte_memset Zhiyong Yang
  2017-01-09  9:48     ` [PATCH v2 0/4] eal/common: introduce rte_memset and related test Yang, Zhiyong
  4 siblings, 0 replies; 44+ messages in thread
From: Zhiyong Yang @ 2016-12-27 10:04 UTC (permalink / raw)
  To: dev
  Cc: yuanhan.liu, thomas.monjalon, bruce.richardson,
	konstantin.ananyev, pablo.de.lara.guarch, Zhiyong Yang

The file implements the perf autotest for rte_memset. Running it
reports comparative performance data for rte_memset and memset.
The results show that rte_memset clearly outperforms glibc memset,
especially for small N bytes.
The first column shows the size N passed to memset and rte_memset.
The second column lists the rte_memset vs. memset cycle counts for
data in cache.
The third column lists the rte_memset vs. memset cycle counts for
data in memory.

The following data was collected on Haswell.

** rte_memset() - memset perf tests
        (C = compile-time constant) **
======== ================= =================
   Size   in cache (ticks)    in mem (ticks)
(bytes)  rte_memset-memset  rte_memset-memset
-------- ----------------- -----------------
============= 32B aligned ================
      1       3 -    8      14 -  115
      3       4 -    8      19 -  125
      6       3 -    7      19 -  125
      8       3 -    6      19 -  124
     12       3 -    6      19 -  124
     15       3 -    6      19 -  125
     16       3 -    8      13 -  125
     32       3 -    7      19 -  133
     64       3 -    7      28 -  162
     65       6 -    8      41 -  182
    128       6 -   13      54 -  199
    192       8 -   13      77 -  273
    255       8 -   16     100 -  222
    512      17 -   14     187 -  247
    768      22 -   20     270 -  362
   1024      29 -   28     329 -  377
   2048      63 -   57     564 -  601
   4096     104 -  102     993 - 1025
   8192     200 -  211    1831 - 2270
------ -------------- -------------- ------
C     6       2 -    2      19 -   19
C    64       2 -    6      28 -   33
C   128       3 -   12      54 -   59
C   192       5 -   29      77 -   83
C   256       6 -   35     100 -  105
C   512      12 -   60     188 -  195
C   768      18 -   20     271 -  362
C  1024      24 -   29     329 -  377

Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
---

Change in V2:

Add perf comparison data between rte_memset and memset on Haswell.

 app/test/Makefile           |   1 +
 app/test/test_memset_perf.c | 348 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 349 insertions(+)
 create mode 100644 app/test/test_memset_perf.c

diff --git a/app/test/Makefile b/app/test/Makefile
index 82da3f3..1c3e7f1 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -124,6 +124,7 @@ SRCS-y += test_memcpy.c
 SRCS-y += test_memcpy_perf.c
 
 SRCS-y += test_memset.c
+SRCS-y += test_memset_perf.c
 
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash.c
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_thash.c
diff --git a/app/test/test_memset_perf.c b/app/test/test_memset_perf.c
new file mode 100644
index 0000000..83b15b5
--- /dev/null
+++ b/app/test/test_memset_perf.c
@@ -0,0 +1,348 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_random.h>
+#include <rte_malloc.h>
+#include <rte_memset.h>
+#include "test.h"
+
+/*
+ * Set this to the maximum buffer size you want to test. If it is 0, then the
+ * values in the buf_sizes[] array below will be used.
+ */
+#define TEST_VALUE_RANGE        0
+
+/* List of buffer sizes to test */
+#if TEST_VALUE_RANGE == 0
+static size_t buf_sizes[] = {
+	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65,
+	70, 85, 96, 105, 115, 127, 128, 129, 161, 191, 192, 193, 255, 256,
+	257, 319, 320, 321, 383, 384, 385, 447, 448, 449, 511, 512, 513,
+	767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600, 2048, 2560,
+	3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
+};
+/* MUST be as large as largest packet size above */
+#define SMALL_BUFFER_SIZE 8192
+#else /* TEST_VALUE_RANGE != 0 */
+static size_t buf_sizes[TEST_VALUE_RANGE];
+#define SMALL_BUFFER_SIZE       TEST_VALUE_RANGE
+#endif /* TEST_VALUE_RANGE == 0 */
+
+/*
+ * Arrays of this size are used for measuring uncached memory accesses by
+ * picking a random location within the buffer. Make this smaller if there are
+ * memory allocation errors.
+ */
+#define LARGE_BUFFER_SIZE       (100 * 1024 * 1024)
+
+/* How many times to run timing loop for performance tests */
+#define TEST_ITERATIONS         1000000
+#define TEST_BATCH_SIZE         100
+
+/* Data is aligned on this many bytes (power of 2) */
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+#define ALIGNMENT_UNIT          64
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
+#define ALIGNMENT_UNIT          32
+#else /* RTE_MACHINE_CPUFLAG */
+#define ALIGNMENT_UNIT          16
+#endif /* RTE_MACHINE_CPUFLAG */
+
+/*
+ * Pointers used in performance tests. The two large buffers are for uncached
+ * access where random addresses within the buffer are used for each
+ * memset. The two small buffers are for cached access.
+ */
+static uint8_t *large_buf_read, *large_buf_write;
+static uint8_t *small_buf_read, *small_buf_write;
+
+/* Initialise data buffers. */
+static int
+init_buffers(void)
+{
+	unsigned int i;
+
+	large_buf_read = rte_malloc("memset", LARGE_BUFFER_SIZE
+					+ ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	if (large_buf_read == NULL)
+		goto error_large_buf_read;
+
+	large_buf_write = rte_malloc("memset", LARGE_BUFFER_SIZE
+					+ ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	if (large_buf_write == NULL)
+		goto error_large_buf_write;
+
+	small_buf_read = rte_malloc("memset", SMALL_BUFFER_SIZE
+					+ ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	if (small_buf_read == NULL)
+		goto error_small_buf_read;
+
+	small_buf_write = rte_malloc("memset", SMALL_BUFFER_SIZE
+					+ ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	if (small_buf_write == NULL)
+		goto error_small_buf_write;
+
+	for (i = 0; i < LARGE_BUFFER_SIZE; i++)
+		large_buf_read[i] = rte_rand();
+	for (i = 0; i < SMALL_BUFFER_SIZE; i++)
+		small_buf_read[i] = rte_rand();
+
+	return 0;
+
+error_small_buf_write:
+	rte_free(small_buf_read);
+error_small_buf_read:
+	rte_free(large_buf_write);
+error_large_buf_write:
+	rte_free(large_buf_read);
+error_large_buf_read:
+	printf("ERROR: not enough memory\n");
+	return -1;
+}
+
+/* Cleanup data buffers */
+static void
+free_buffers(void)
+{
+	rte_free(large_buf_read);
+	rte_free(large_buf_write);
+	rte_free(small_buf_read);
+	rte_free(small_buf_write);
+}
+
+/*
+ * Get a random offset into the large array, with enough space left to
+ * perform the largest memset size. The offset itself is aligned;
+ * uoffset adds a deliberate misalignment on top of it.
+ */
+static inline size_t
+get_rand_offset(size_t uoffset)
+{
+	return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
+			~(ALIGNMENT_UNIT - 1)) + uoffset;
+}
+
+/* Fill in destination addresses. */
+static inline void
+fill_addr_arrays(size_t *dst_addr, int is_dst_cached, size_t dst_uoffset)
+{
+	unsigned int i;
+
+	for (i = 0; i < TEST_BATCH_SIZE; i++)
+		dst_addr[i] = (is_dst_cached) ? dst_uoffset :
+					get_rand_offset(dst_uoffset);
+}
+
+/*
+ * WORKAROUND: For some reason the first test doing an uncached write
+ * takes a very long time (~25 times longer than is expected). So we do
+ * it once without timing.
+ */
+static void
+do_uncached_write(uint8_t *dst, int is_dst_cached, size_t size)
+{
+	unsigned int i, j;
+	size_t dst_addrs[TEST_BATCH_SIZE];
+	int ch = rte_rand() & 0xff;
+
+	for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
+		fill_addr_arrays(dst_addrs, is_dst_cached, 0);
+		for (j = 0; j < TEST_BATCH_SIZE; j++)
+			rte_memset(dst+dst_addrs[j], ch, size);
+	}
+}
+
+/*
+ * Run a single memset performance test. This is a macro to ensure that if
+ * the "size" parameter is a constant it won't be converted to a variable.
+ */
+#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset, size)             \
+do {                                                                        \
+	unsigned int iter, t;                                               \
+	size_t dst_addrs[TEST_BATCH_SIZE];                                  \
+	uint64_t start_time, total_time = 0;                                \
+	uint64_t total_time2 = 0;                                           \
+	int ch = rte_rand() & 0xff;                                         \
+									    \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {\
+	fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset);            \
+	start_time = rte_rdtsc();                                           \
+	for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
+		rte_memset(dst+dst_addrs[t], ch, size);                      \
+	total_time += rte_rdtsc() - start_time;                             \
+	}                                                                   \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {\
+	fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset);            \
+	start_time = rte_rdtsc();                                           \
+	for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
+		memset(dst+dst_addrs[t], ch, size);                         \
+	total_time2 += rte_rdtsc() - start_time;                            \
+	}                                                                   \
+	printf("%8.0f -",  (double)total_time / TEST_ITERATIONS);           \
+	printf("%5.0f",  (double)total_time2 / TEST_ITERATIONS);            \
+} while (0)
+
+/* Run aligned memset tests. */
+#define ALL_PERF_TESTS_FOR_SIZE(n)                                       \
+do {                                                                     \
+	if (__builtin_constant_p(n))                                     \
+		printf("\nC%6u", (unsigned int)n);                       \
+	else                                                             \
+		printf("\n%7u", (unsigned int)n);                        \
+	SINGLE_PERF_TEST(small_buf_write, 1, 0, n);                      \
+	SINGLE_PERF_TEST(large_buf_write, 0, 0, n);                      \
+} while (0)
+
+/* Run unaligned memset tests */
+#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)                             \
+do {                                                                     \
+	if (__builtin_constant_p(n))                                     \
+		printf("\nC%6u", (unsigned int)n);                       \
+	else                                                             \
+		printf("\n%7u", (unsigned int)n);                        \
+	SINGLE_PERF_TEST(small_buf_write, 1, 1, n);                      \
+	SINGLE_PERF_TEST(large_buf_write, 0, 1, n);                      \
+} while (0)
+
+/* Run memset tests for constant length */
+#define ALL_PERF_TEST_FOR_CONSTANT                                       \
+do {                                                                     \
+	TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);      \
+	TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);   \
+	TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U); \
+} while (0)
+
+/* Run all memset tests for aligned constant cases */
+static inline void
+perf_test_constant_aligned(void)
+{
+#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE
+	ALL_PERF_TEST_FOR_CONSTANT;
+#undef TEST_CONSTANT
+}
+
+/* Run all memset tests for unaligned constant cases */
+static inline void
+perf_test_constant_unaligned(void)
+{
+#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE_UNALIGNED
+	ALL_PERF_TEST_FOR_CONSTANT;
+#undef TEST_CONSTANT
+}
+
+/* Run all memset tests for aligned variable cases */
+static inline void
+perf_test_variable_aligned(void)
+{
+	unsigned int n = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
+}
+
+/* Run all memset tests for unaligned variable cases */
+static inline void
+perf_test_variable_unaligned(void)
+{
+	unsigned int n = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
+}
+
+/* Run all memset tests */
+static int
+perf_test(void)
+{
+	int ret;
+
+	ret = init_buffers();
+	if (ret != 0)
+		return ret;
+
+#if TEST_VALUE_RANGE != 0
+	/* Set up buf_sizes array, if required */
+	unsigned int i;
+
+	for (i = 0; i < TEST_VALUE_RANGE; i++)
+		buf_sizes[i] = i;
+#endif
+
+	/* See function comment */
+	do_uncached_write(large_buf_write, 0, SMALL_BUFFER_SIZE);
+
+	printf("\n** rte_memset() - memset perf tests "
+		"(C = compile-time constant) **\n"
+		"======= ============== ==============\n"
+		"   Size memset in cache  memset in mem\n"
+		"(bytes)        (ticks)        (ticks)\n"
+		"------- -------------- --------------");
+
+	printf("\n============= %2dB aligned ================", ALIGNMENT_UNIT);
+	/* Do aligned tests where size is a variable */
+	perf_test_variable_aligned();
+	printf("\n------ -------------- -------------- ------");
+	/* Do aligned tests where size is a compile-time constant */
+	perf_test_constant_aligned();
+	printf("\n============= Unaligned ===================");
+	/* Do unaligned tests where size is a variable */
+	perf_test_variable_unaligned();
+	printf("\n------ -------------- -------------- ------");
+	/* Do unaligned tests where size is a compile-time constant */
+	perf_test_constant_unaligned();
+	printf("\n====== ============== ============== =======\n\n");
+
+	free_buffers();
+
+	return 0;
+}
+
+static int
+test_memset_perf(void)
+{
+	int ret;
+
+	ret = perf_test();
+	if (ret != 0)
+		return -1;
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(memset_perf_autotest, test_memset_perf);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 4/4] lib/librte_vhost: improve vhost perf using rte_memset
  2016-12-27 10:04   ` [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
                       ` (2 preceding siblings ...)
  2016-12-27 10:04     ` [PATCH v2 3/4] app/test: add performance " Zhiyong Yang
@ 2016-12-27 10:04     ` Zhiyong Yang
  2017-01-09  9:48     ` [PATCH v2 0/4] eal/common: introduce rte_memset and related test Yang, Zhiyong
  4 siblings, 0 replies; 44+ messages in thread
From: Zhiyong Yang @ 2016-12-27 10:04 UTC (permalink / raw)
  To: dev
  Cc: yuanhan.liu, thomas.monjalon, bruce.richardson,
	konstantin.ananyev, pablo.de.lara.guarch, Zhiyong Yang

Using rte_memset instead of copy_virtio_net_hdr brings a 3%~4%
performance improvement on the IA platform in virtio/vhost
non-mergeable loopback testing.

Two key points have been considered:
1. One variable initialization is saved, which avoids a memory store.
2. copy_virtio_net_hdr involves both a load (from the stack, the
virtio_hdr variable) and a store (to virtio driver memory), while
rte_memset involves only the store.

Signed-off-by: Zhiyong Yang <zhiyong.yang@intel.com>
---

Changes in V2:

Modify release_17_02.rst description.

 doc/guides/rel_notes/release_17_02.rst |  7 +++++++
 lib/librte_vhost/virtio_net.c          | 18 +++++++++++-------
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/doc/guides/rel_notes/release_17_02.rst b/doc/guides/rel_notes/release_17_02.rst
index 180af82..3d39cde 100644
--- a/doc/guides/rel_notes/release_17_02.rst
+++ b/doc/guides/rel_notes/release_17_02.rst
@@ -52,6 +52,13 @@ New Features
   See the :ref:`Generic flow API <Generic_flow_API>` documentation for more
   information.
 
+* **Introduced rte_memset on IA platform.**
+
+  A performance drop was observed in some cases on Ivybridge when DPDK code
+  called the glibc function memset, so a more efficient replacement was
+  needed. The rte_memset function supports three instruction sets, namely
+  sse & avx (128 bits), avx2 (256 bits) and avx512 (512 bits), and performs
+  better than glibc memset.
 
 Resolved Issues
 ---------------
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 595f67c..392b31b 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -37,6 +37,7 @@
 
 #include <rte_mbuf.h>
 #include <rte_memcpy.h>
+#include <rte_memset.h>
 #include <rte_ether.h>
 #include <rte_ip.h>
 #include <rte_virtio_net.h>
@@ -194,7 +195,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vring_desc *descs,
 	uint32_t cpy_len;
 	struct vring_desc *desc;
 	uint64_t desc_addr;
-	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
+	struct virtio_net_hdr *virtio_hdr;
 
 	desc = &descs[desc_idx];
 	desc_addr = gpa_to_vva(dev, desc->addr);
@@ -208,8 +209,9 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vring_desc *descs,
 
 	rte_prefetch0((void *)(uintptr_t)desc_addr);
 
-	virtio_enqueue_offload(m, &virtio_hdr.hdr);
-	copy_virtio_net_hdr(dev, desc_addr, virtio_hdr);
+	virtio_hdr = (struct virtio_net_hdr *)(uintptr_t)desc_addr;
+	rte_memset(virtio_hdr, 0, sizeof(*virtio_hdr));
+	virtio_enqueue_offload(m, virtio_hdr);
 	vhost_log_write(dev, desc->addr, dev->vhost_hlen);
 	PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
 
@@ -459,7 +461,6 @@ static inline int __attribute__((always_inline))
 copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m,
 			    struct buf_vector *buf_vec, uint16_t num_buffers)
 {
-	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
 	uint32_t vec_idx = 0;
 	uint64_t desc_addr;
 	uint32_t mbuf_offset, mbuf_avail;
@@ -480,7 +481,6 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m,
 	hdr_phys_addr = buf_vec[vec_idx].buf_addr;
 	rte_prefetch0((void *)(uintptr_t)hdr_addr);
 
-	virtio_hdr.num_buffers = num_buffers;
 	LOG_DEBUG(VHOST_DATA, "(%d) RX: num merge buffers %d\n",
 		dev->vid, num_buffers);
 
@@ -512,8 +512,12 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m,
 		}
 
 		if (hdr_addr) {
-			virtio_enqueue_offload(hdr_mbuf, &virtio_hdr.hdr);
-			copy_virtio_net_hdr(dev, hdr_addr, virtio_hdr);
+			struct virtio_net_hdr_mrg_rxbuf *hdr =
+			(struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)hdr_addr;
+
+			rte_memset(&(hdr->hdr), 0, sizeof(hdr->hdr));
+			hdr->num_buffers = num_buffers;
+			virtio_enqueue_offload(hdr_mbuf, &(hdr->hdr));
 			vhost_log_write(dev, hdr_phys_addr, dev->vhost_hlen);
 			PRINT_PACKET(dev, (uintptr_t)hdr_addr,
 				     dev->vhost_hlen, 0);
-- 
2.7.4


* Re: [PATCH v2 0/4] eal/common: introduce rte_memset and related test
  2016-12-27 10:04   ` [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
                       ` (3 preceding siblings ...)
  2016-12-27 10:04     ` [PATCH v2 4/4] lib/librte_vhost: improve vhost perf using rte_memset Zhiyong Yang
@ 2017-01-09  9:48     ` Yang, Zhiyong
  2017-01-17  6:24       ` Yang, Zhiyong
  4 siblings, 1 reply; 44+ messages in thread
From: Yang, Zhiyong @ 2017-01-09  9:48 UTC (permalink / raw)
  To: thomas.monjalon, Richardson, Bruce, Ananyev, Konstantin
  Cc: yuanhan.liu, De Lara Guarch, Pablo, dev

Hi, Thomas, Bruce, Konstantin:

	Any comments about the patchset?  Do I need to modify anything?

Thanks
Zhiyong 

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhiyong Yang
> Sent: Tuesday, December 27, 2016 6:05 PM
> To: dev@dpdk.org
> Cc: yuanhan.liu@linux.intel.com; thomas.monjalon@6wind.com; Richardson,
> Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Subject: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset and
> related test
> 
> DPDK code has met performance drop badly in some case when calling glibc
> function memset. Reference to discussions about memset in
> http://dpdk.org/ml/archives/dev/2016-October/048628.html
> It is necessary to introduce more high efficient function to fix it.
> One important thing about rte_memset is that we can get clear control on
> what instruction flow is used.
> 
> This patchset introduces rte_memset to bring more high efficient
> implementation, and will bring obvious perf improvement, especially for
> small N bytes in the most application scenarios.
> 
> Patch 1 implements rte_memset in the file rte_memset.h on IA platform The
> file supports three types of instruction sets including sse & avx (128bits),
> avx2(256bits) and avx512(512bits). rte_memset makes use of vectorization
> and inline function to improve the perf on IA. In addition, cache line and
> memory alignment are fully taken into consideration.
> 
> Patch 2 implements functional autotest to validates the function whether to
> work in a right way.
> 
> Patch 3 implements performance autotest separately in cache and memory.
> We can see the perf of rte_memset is obviously better than glibc memset
> especially for small N bytes.
> 
> Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring 3%~4%
> performance improvements on IA platform from virtio/vhost non-mergeable
> loopback testing.
> 
> Changes in V2:
> 
> Patch 1:
> Rename rte_memset.h -> rte_memset_64.h and create a file rte_memset.h
> for each arch.
> 
> Patch 3:
> add the perf comparation data between rte_memset and memset on
> haswell.
> 
> Patch 4:
> Modify release_17_02.rst description.
> 
> Zhiyong Yang (4):
>   eal/common: introduce rte_memset on IA platform
>   app/test: add functional autotest for rte_memset
>   app/test: add performance autotest for rte_memset
>   lib/librte_vhost: improve vhost perf using rte_memset
> 
>  app/test/Makefile                                  |   3 +
>  app/test/test_memset.c                             | 158 +++++++++
>  app/test/test_memset_perf.c                        | 348 +++++++++++++++++++
>  doc/guides/rel_notes/release_17_02.rst             |   7 +
>  .../common/include/arch/arm/rte_memset.h           |  36 ++
>  .../common/include/arch/ppc_64/rte_memset.h        |  36 ++
>  .../common/include/arch/tile/rte_memset.h          |  36 ++
>  .../common/include/arch/x86/rte_memset.h           |  51 +++
>  .../common/include/arch/x86/rte_memset_64.h        | 378
> +++++++++++++++++++++
>  lib/librte_eal/common/include/generic/rte_memset.h |  52 +++
>  lib/librte_vhost/virtio_net.c                      |  18 +-
>  11 files changed, 1116 insertions(+), 7 deletions(-)  create mode 100644
> app/test/test_memset.c  create mode 100644 app/test/test_memset_perf.c
> create mode 100644
> lib/librte_eal/common/include/arch/arm/rte_memset.h
>  create mode 100644
> lib/librte_eal/common/include/arch/ppc_64/rte_memset.h
>  create mode 100644 lib/librte_eal/common/include/arch/tile/rte_memset.h
>  create mode 100644
> lib/librte_eal/common/include/arch/x86/rte_memset.h
>  create mode 100644
> lib/librte_eal/common/include/arch/x86/rte_memset_64.h
>  create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h
> 
> --
> 2.7.4


* Re: [PATCH v2 0/4] eal/common: introduce rte_memset and related test
  2017-01-09  9:48     ` [PATCH v2 0/4] eal/common: introduce rte_memset and related test Yang, Zhiyong
@ 2017-01-17  6:24       ` Yang, Zhiyong
  2017-01-17 20:14         ` Thomas Monjalon
  0 siblings, 1 reply; 44+ messages in thread
From: Yang, Zhiyong @ 2017-01-17  6:24 UTC (permalink / raw)
  To: thomas.monjalon, Richardson, Bruce, Ananyev, Konstantin
  Cc: yuanhan.liu, De Lara Guarch, Pablo, dev

Hi, Thomas:
	Does this patchset have a chance to be applied for the 17.02 release?
Thanks
Zhiyong

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong
> Sent: Monday, January 9, 2017 5:49 PM
> To: thomas.monjalon@6wind.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Cc: yuanhan.liu@linux.intel.com; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset
> and related test
> 
> Hi, Thomas, Bruce, Konstantin:
> 
> 	Any comments about the patchset?  Do I need to modify anything?
> 
> Thanks
> Zhiyong
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhiyong Yang
> > Sent: Tuesday, December 27, 2016 6:05 PM
> > To: dev@dpdk.org
> > Cc: yuanhan.liu@linux.intel.com; thomas.monjalon@6wind.com;
> > Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>
> > Subject: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset
> > and related test
> >
> > DPDK code has met performance drop badly in some case when calling
> > glibc function memset. Reference to discussions about memset in
> > http://dpdk.org/ml/archives/dev/2016-October/048628.html
> > It is necessary to introduce more high efficient function to fix it.
> > One important thing about rte_memset is that we can get clear control
> > on what instruction flow is used.
> >
> > This patchset introduces rte_memset to bring more high efficient
> > implementation, and will bring obvious perf improvement, especially
> > for small N bytes in the most application scenarios.
> >
> > Patch 1 implements rte_memset in the file rte_memset.h on IA platform
> > The file supports three types of instruction sets including sse & avx
> > (128bits),
> > avx2(256bits) and avx512(512bits). rte_memset makes use of
> > vectorization and inline function to improve the perf on IA. In
> > addition, cache line and memory alignment are fully taken into
> consideration.
> >
> > Patch 2 implements functional autotest to validates the function
> > whether to work in a right way.
> >
> > Patch 3 implements performance autotest separately in cache and memory.
> > We can see the perf of rte_memset is obviously better than glibc
> > memset especially for small N bytes.
> >
> > Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring
> > 3%~4% performance improvements on IA platform from virtio/vhost
> > non-mergeable loopback testing.
> >
> > Changes in V2:
> >
> > Patch 1:
> > Rename rte_memset.h -> rte_memset_64.h and create a file
> rte_memset.h
> > for each arch.
> >
> > Patch 3:
> > add the perf comparation data between rte_memset and memset on
> > haswell.
> >
> > Patch 4:
> > Modify release_17_02.rst description.
> >
> > Zhiyong Yang (4):
> >   eal/common: introduce rte_memset on IA platform
> >   app/test: add functional autotest for rte_memset
> >   app/test: add performance autotest for rte_memset
> >   lib/librte_vhost: improve vhost perf using rte_memset
> >
> >  app/test/Makefile                                  |   3 +
> >  app/test/test_memset.c                             | 158 +++++++++
> >  app/test/test_memset_perf.c                        | 348 +++++++++++++++++++
> >  doc/guides/rel_notes/release_17_02.rst             |   7 +
> >  .../common/include/arch/arm/rte_memset.h           |  36 ++
> >  .../common/include/arch/ppc_64/rte_memset.h        |  36 ++
> >  .../common/include/arch/tile/rte_memset.h          |  36 ++
> >  .../common/include/arch/x86/rte_memset.h           |  51 +++
> >  .../common/include/arch/x86/rte_memset_64.h        | 378
> > +++++++++++++++++++++
> >  lib/librte_eal/common/include/generic/rte_memset.h |  52 +++
> >  lib/librte_vhost/virtio_net.c                      |  18 +-
> >  11 files changed, 1116 insertions(+), 7 deletions(-)  create mode
> > 100644 app/test/test_memset.c  create mode 100644
> > app/test/test_memset_perf.c create mode 100644
> > lib/librte_eal/common/include/arch/arm/rte_memset.h
> >  create mode 100644
> > lib/librte_eal/common/include/arch/ppc_64/rte_memset.h
> >  create mode 100644
> > lib/librte_eal/common/include/arch/tile/rte_memset.h
> >  create mode 100644
> > lib/librte_eal/common/include/arch/x86/rte_memset.h
> >  create mode 100644
> > lib/librte_eal/common/include/arch/x86/rte_memset_64.h
> >  create mode 100644
> lib/librte_eal/common/include/generic/rte_memset.h
> >
> > --
> > 2.7.4


* Re: [PATCH v2 0/4] eal/common: introduce rte_memset and related test
  2017-01-17  6:24       ` Yang, Zhiyong
@ 2017-01-17 20:14         ` Thomas Monjalon
  2017-01-18  0:15           ` Vincent JARDIN
  2017-01-18  2:42           ` Yang, Zhiyong
  0 siblings, 2 replies; 44+ messages in thread
From: Thomas Monjalon @ 2017-01-17 20:14 UTC (permalink / raw)
  To: Yang, Zhiyong
  Cc: Richardson, Bruce, Ananyev, Konstantin, yuanhan.liu,
	De Lara Guarch, Pablo, dev

2017-01-17 06:24, Yang, Zhiyong:
> Hi, Thomas:
> 	Does this patchset have chance to be applied for 1702 release? 

It could be part of 17.02 but there are some issues:

The x86 part did not receive any ack from x86 maintainers.

checkpatch reports some warnings, especially about counting elements
of an array. Please use RTE_DIM.
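For reference, RTE_DIM is DPDK's array-element-count macro (defined in rte_common.h); a minimal local copy of its standard definition shows the intended replacement for the open-coded sizeof division:

```c
#include <assert.h>
#include <stddef.h>

/* Local copy of DPDK's macro from rte_common.h. */
#define RTE_DIM(a) (sizeof(a) / sizeof((a)[0]))

/* Sample sizes, like the buf_sizes array in the perf test. */
static const int buf_sizes[] = {0, 1, 7, 8, 63, 64, 255, 256, 1024};

static size_t
count_open_coded(void)
{
	/* the pattern checkpatch warns about */
	return sizeof(buf_sizes) / sizeof(buf_sizes[0]);
}

static size_t
count_rte_dim(void)
{
	return RTE_DIM(buf_sizes);	/* preferred spelling */
}
```

Both forms compute the same count; RTE_DIM just names the intent.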

The file in generic/ is for doxygen only.
Please check how it is done for other files.

The description is "Functions for vectorised implementation of memset()."
Does it mean memset from glibc does not use vector instructions?

The functional autotest is not integrated in the basic test suite.

I wish this kind of review would be done by someone else.
As it has not a big performance impact, this series could wait the next release.
By the way, have you tried to work on glibc, as I had suggested?


* Re: [PATCH v2 0/4] eal/common: introduce rte_memset and related test
  2017-01-17 20:14         ` Thomas Monjalon
@ 2017-01-18  0:15           ` Vincent JARDIN
  2017-01-18  2:42           ` Yang, Zhiyong
  1 sibling, 0 replies; 44+ messages in thread
From: Vincent JARDIN @ 2017-01-18  0:15 UTC (permalink / raw)
  To: Thomas Monjalon, Yang, Zhiyong
  Cc: Richardson, Bruce, Ananyev, Konstantin, yuanhan.liu,
	De Lara Guarch, Pablo, dev

Le 17/01/2017 à 21:14, Thomas Monjalon a écrit :
> By the way, have you tried to work on glibc, as I had suggested?

Nothing here:
 
https://sourceware.org/cgi-bin/search.cgi?wm=wrd&form=extended&m=all&s=D&ul=%2Fml%2Flibc-alpha%2F%25&q=memset


* Re: [PATCH v2 0/4] eal/common: introduce rte_memset and related test
  2017-01-17 20:14         ` Thomas Monjalon
  2017-01-18  0:15           ` Vincent JARDIN
@ 2017-01-18  2:42           ` Yang, Zhiyong
  2017-01-18  7:42             ` Thomas Monjalon
  1 sibling, 1 reply; 44+ messages in thread
From: Yang, Zhiyong @ 2017-01-18  2:42 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Richardson, Bruce, Ananyev, Konstantin, yuanhan.liu,
	De Lara Guarch, Pablo, dev

Hi, Thomas:
	Thanks for your reply.

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Wednesday, January 18, 2017 4:14 AM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; yuanhan.liu@linux.intel.com; De Lara
> Guarch, Pablo <pablo.de.lara.guarch@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset
> and related test
> 
> 2017-01-17 06:24, Yang, Zhiyong:
> > Hi, Thomas:
> > 	Does this patchset have chance to be applied for 1702 release?
> 
> It could be part of 17.02 but there are some issues:
> 
> The x86 part did not receive any ack from x86 maintainers.

Ok

> 
> checkpatch reports some warnings, especially about counting elements of an
> array. Please use RTE_DIM.

OK, I ignored these warnings, taking the current release code as a reference.
Cleaner code will be sent in the future.

> 
> The file in generic/ is for doxygen only.
> Please check how it is done for other files.

OK.  I didn't know this before. :)  Thank you.

> 
> The description is "Functions for vectorised implementation of memset()."
> Does it mean memset from glibc does not use vector instructions?
> 

Sorry for the misleading wording.
Glibc memset() does, of course, use vector instructions for its optimizations.
I just meant "functions implementing the same functionality as glibc
memset()".  My bad English.  :)

> The functional autotest is not integrated in the basic test suite.
> 

I can run the command line "memset_autotest"; it seems I left something out.

> I wish this kind of review would be done by someone else.
> As it has not a big performance impact, this series could wait the next release.

Ok.
Maybe memset() accounts for only a small share of the current DPDK data path.

> By the way, have you tried to work on glibc, as I had suggested?

I'm not familiar with the glibc contribution process; as far as I know, glibc
uses x86 asm rather than intrinsics.  I will consider your suggestion.
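For illustration, a store-only fill using SSE2 intrinsics, the general style an intrinsics-based rte_memset takes on x86, might look like the sketch below. It is not the patch's actual implementation: it handles only a bulk 16-byte loop plus a byte tail, with no alignment handling or avx2/avx512 paths.

```c
#include <assert.h>
#include <emmintrin.h>	/* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal vectorized fill: unaligned 16-byte stores plus a byte tail. */
static void *
sse_memset(void *dst, int c, size_t n)
{
	uint8_t *p = dst;
	__m128i v = _mm_set1_epi8((char)c);	/* broadcast the fill byte */

	while (n >= 16) {
		_mm_storeu_si128((__m128i *)(void *)p, v);
		p += 16;
		n -= 16;
	}
	while (n--)				/* remaining 0..15 bytes */
		*p++ = (uint8_t)c;
	return dst;
}
```

The result must match plain memset byte for byte; the speedup comes from issuing wide stores instead of a byte loop.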

 


* Re: [PATCH v2 0/4] eal/common: introduce rte_memset and related test
  2017-01-18  2:42           ` Yang, Zhiyong
@ 2017-01-18  7:42             ` Thomas Monjalon
  2017-01-19  1:36               ` Yang, Zhiyong
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Monjalon @ 2017-01-18  7:42 UTC (permalink / raw)
  To: Yang, Zhiyong
  Cc: Richardson, Bruce, Ananyev, Konstantin, yuanhan.liu,
	De Lara Guarch, Pablo, dev

2017-01-18 02:42, Yang, Zhiyong:
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > The functional autotest is not integrated in the basic test suite.
> 
> I can run command line "memset_autotest",  It seems that I leave something out.

Please check app/test/autotest_data.py
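In that era's app/test/autotest_data.py, each autotest is registered as a dict inside a test group list. The sketch below is a hypothetical entry only; the exact field names and group layout should be checked against the file in your tree:

```python
# Hypothetical registration for memset_autotest, modeled loosely on
# entries in app/test/autotest_data.py. Field names ("Name", "Command",
# "Func", "Report") and the group layout are assumptions.

def default_autotest(child, test_name):
    # Placeholder for the runner callback real entries use: it would
    # drive the test app and return (result_code, result_string).
    return 0, "Success"

memset_autotest_entry = {
    "Name":    "Memset autotest",
    "Command": "memset_autotest",   # command typed at the test prompt
    "Func":    default_autotest,
    "Report":  None,
}

# The entry would be appended to one of the test group lists, e.g.:
parallel_test_group = {
    "Prefix": "group_memset",
    "Memory": "16",
    "Tests":  [memset_autotest_entry],
}
```

Without such an entry, the test runs manually via its command but is skipped by the automated suite.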


* Re: [PATCH v2 0/4] eal/common: introduce rte_memset and related test
  2017-01-18  7:42             ` Thomas Monjalon
@ 2017-01-19  1:36               ` Yang, Zhiyong
  0 siblings, 0 replies; 44+ messages in thread
From: Yang, Zhiyong @ 2017-01-19  1:36 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Richardson, Bruce, Ananyev, Konstantin, yuanhan.liu,
	De Lara Guarch, Pablo, dev


> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Wednesday, January 18, 2017 3:43 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; yuanhan.liu@linux.intel.com; De Lara
> Guarch, Pablo <pablo.de.lara.guarch@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 0/4] eal/common: introduce rte_memset
> and related test
> 
> 2017-01-18 02:42, Yang, Zhiyong:
> > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > The functional autotest is not integrated in the basic test suite.
> >
> > I can run command line "memset_autotest",  It seems that I leave
> something out.
> 
> Please check app/test/autotest_data.py

Thanks, Thomas.


end of thread, other threads:[~2017-01-19  1:36 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
2016-12-02  8:36 [PATCH 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
2016-12-02  8:36 ` [PATCH 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang
2016-12-02 10:25   ` Thomas Monjalon
2016-12-08  7:41     ` Yang, Zhiyong
2016-12-08  9:26       ` Ananyev, Konstantin
2016-12-08  9:53         ` Yang, Zhiyong
2016-12-08 10:27           ` Bruce Richardson
2016-12-08 10:30           ` Ananyev, Konstantin
2016-12-11 12:32             ` Yang, Zhiyong
2016-12-15  6:51               ` Yang, Zhiyong
2016-12-15 10:12                 ` Bruce Richardson
2016-12-16 10:19                   ` Yang, Zhiyong
2016-12-19  6:27                     ` Yuanhan Liu
2016-12-20  2:41                       ` Yao, Lei A
2016-12-15 10:53                 ` Ananyev, Konstantin
2016-12-16  2:15                   ` Yang, Zhiyong
2016-12-16 11:47                     ` Ananyev, Konstantin
2016-12-20  9:31                       ` Yang, Zhiyong
2016-12-08 15:09       ` Thomas Monjalon
2016-12-11 12:04         ` Yang, Zhiyong
2016-12-27 10:04   ` [PATCH v2 0/4] eal/common: introduce rte_memset and related test Zhiyong Yang
2016-12-27 10:04     ` [PATCH v2 1/4] eal/common: introduce rte_memset on IA platform Zhiyong Yang
2016-12-27 10:04     ` [PATCH v2 2/4] app/test: add functional autotest for rte_memset Zhiyong Yang
2016-12-27 10:04     ` [PATCH v2 3/4] app/test: add performance " Zhiyong Yang
2016-12-27 10:04     ` [PATCH v2 4/4] lib/librte_vhost: improve vhost perf using rte_memset Zhiyong Yang
2017-01-09  9:48     ` [PATCH v2 0/4] eal/common: introduce rte_memset and related test Yang, Zhiyong
2017-01-17  6:24       ` Yang, Zhiyong
2017-01-17 20:14         ` Thomas Monjalon
2017-01-18  0:15           ` Vincent JARDIN
2017-01-18  2:42           ` Yang, Zhiyong
2017-01-18  7:42             ` Thomas Monjalon
2017-01-19  1:36               ` Yang, Zhiyong
2016-12-02  8:36 ` [PATCH 2/4] app/test: add functional autotest for rte_memset Zhiyong Yang
2016-12-02  8:36 ` [PATCH 3/4] app/test: add performance " Zhiyong Yang
2016-12-02  8:36 ` [PATCH 4/4] lib/librte_vhost: improve vhost perf using rte_memset Zhiyong Yang
2016-12-02  9:46   ` Thomas Monjalon
2016-12-06  8:04     ` Yang, Zhiyong
2016-12-02 10:00 ` [PATCH 0/4] eal/common: introduce rte_memset and related test Maxime Coquelin
2016-12-06  6:33   ` Yang, Zhiyong
2016-12-06  8:29     ` Maxime Coquelin
2016-12-07  9:28       ` Yang, Zhiyong
2016-12-07  9:37         ` Yuanhan Liu
2016-12-07  9:43           ` Yang, Zhiyong
2016-12-07  9:48             ` Yuanhan Liu
