DPDK-dev Archive on lore.kernel.org
* [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64
@ 2019-06-30 16:21 Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal Gavin Hu
                   ` (17 more replies)
  0 siblings, 18 replies; 42+ messages in thread
From: Gavin Hu @ 2019-06-30 16:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd, gavin.hu

DPDK has multiple use cases where the core repeatedly polls a location in
memory. This polling results in many cache and memory transactions.

The Arm architecture provides the WFE (Wait For Event) instruction, which
allows the CPU core to enter a low-power state until it is woken up by an
update to the memory location being polled, thus reducing cache and memory
transactions.

x86 has the PAUSE hint instruction to reduce such overhead.

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

For non-Arm platforms, these APIs are just wrappers around a do-while loop
with rte_pause(), so there is no performance difference.
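As a rough illustration of that fallback, the sketch below writes the generic wait as a typed inline function instead of a macro. It assumes GCC/Clang `__atomic` builtins; `wait_until_equal_u32` and `cpu_relax` are hypothetical stand-ins for the DPDK API and `rte_pause()`, not code from this series.

```c
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(): a spin-wait hint where available. */
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();			/* x86 PAUSE hint */
#elif defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");	/* YIELD hint on arm64 */
#endif
}

/*
 * Poll *addr until it equals expected. memorder is applied to every load,
 * e.g. __ATOMIC_RELAXED or __ATOMIC_ACQUIRE.
 */
static inline void
wait_until_equal_u32(volatile uint32_t *addr, uint32_t expected, int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		cpu_relax();
}
```

Because this version checks the value before the first pause, callers would not need the `if` pre-check that the ticketlock and ring patches place in front of the macro; the RFC's `do { rte_pause(); } while (...)` form always pauses at least once.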

For Arm platforms, use of WFE can be configured using CONFIG_RTE_USE_WFE
option. It is disabled by default.

Currently, use of WFE is supported only on aarch64 platforms. armv7
platforms do support the WFE instruction, but they require explicit wake-up
events (SEV) and are less performant.

Testing shows that performance varies across platforms, with some
showing degradation.

CONFIG_RTE_USE_WFE should be enabled based on the performance measured
on the target platform.

Gavin Hu (5):
  eal: add the APIs to wait until equal
  ticketlock: use new API to reduce contention on aarch64
  ring: use wfe to wait for ring tail update on aarch64
  spinlock: use wfe to reduce contention on aarch64
  config: add WFE config entry for aarch64

 config/arm/meson.build                             |   1 +
 config/common_armv8a_linux                         |   6 +
 .../common/include/arch/arm/rte_pause_64.h         | 143 +++++++++++++++++++++
 .../common/include/arch/arm/rte_spinlock.h         |  25 ++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  20 +++
 .../common/include/generic/rte_spinlock.h          |   2 +-
 .../common/include/generic/rte_ticketlock.h        |   4 +-
 lib/librte_ring/rte_ring_c11_mem.h                 |   5 +-
 lib/librte_ring/rte_ring_generic.h                 |   4 +-
 9 files changed, 203 insertions(+), 7 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
@ 2019-06-30 16:21 ` Gavin Hu
  2019-06-30 20:27   ` Stephen Hemminger
  2019-07-01  9:58   ` Pavan Nikhilesh Bhagavatula
  2019-06-30 16:21 ` [dpdk-dev] [RFC 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 42+ messages in thread
From: Gavin Hu @ 2019-06-30 16:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd, gavin.hu

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 .../common/include/arch/arm/rte_pause_64.h         | 143 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  20 +++
 2 files changed, 163 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..0095da6 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -17,6 +17,149 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_USE_WFE
+#define rte_wait_until_equal_relaxed(addr, expected) do {\
+		typeof(*addr) tmp;  \
+		if (__builtin_constant_p((expected))) \
+			do { \
+				if (sizeof(*(addr)) == 2)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldxrh  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "i"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 4)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldxr  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "i"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 8)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldxr  %x0, %1\n"  \
+						"cmp	%x0, %x2\n"  \
+						"bne	1b\n"  \
+						: "=&r" (tmp)  \
+						: "Q"(*addr), "i"(expected)  \
+						: "cc", "memory"); \
+			} while (0); \
+		else \
+			do { \
+				if (sizeof(*(addr)) == 2)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldxrh  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "r"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 4)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldxr  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "r"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 8)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldxr  %x0, %1\n"  \
+						"cmp	%x0, %x2\n"  \
+						"bne	1b\n"  \
+						: "=&r" (tmp)  \
+						: "Q"(*addr), "r"(expected)  \
+						: "cc", "memory");  \
+		} while (0); \
+} while (0)
+
+#define rte_wait_until_equal_acquire(addr, expected) do {\
+		typeof(*addr) tmp;  \
+		if (__builtin_constant_p((expected))) \
+			do { \
+				if (sizeof(*(addr)) == 2)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldaxrh  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "i"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 4)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldaxr  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "i"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 8)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldaxr  %x0, %1\n"  \
+						"cmp	%x0, %x2\n"  \
+						"bne	1b\n"  \
+						: "=&r" (tmp)  \
+						: "Q"(*addr), "i"(expected)  \
+						: "cc", "memory"); \
+			} while (0); \
+		else \
+			do { \
+				if (sizeof(*(addr)) == 2)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldaxrh  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "r"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 4)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldaxr  %w0, %1\n"  \
+						"cmp	%w0, %w2\n"  \
+						"bne	1b\n"  \
+						: "=&r"(tmp)  \
+						: "Q"(*addr), "r"(expected)  \
+						: "cc", "memory");  \
+				else if (sizeof(*(addr)) == 8)\
+					asm volatile(  \
+						"sevl\n"  \
+						"1:	 wfe\n"  \
+						"ldaxr  %x0, %1\n"  \
+						"cmp	%x0, %x2\n"  \
+						"bne	1b\n"  \
+						: "=&r" (tmp)  \
+						: "Q"(*addr), "r"(expected)  \
+						: "cc", "memory");  \
+		} while (0); \
+} while (0)
+
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..c115b61 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -20,4 +20,24 @@
  */
 static inline void rte_pause(void);
 
+#if !defined(RTE_USE_WFE)
+#define rte_wait_until_equal_relaxed(addr, expected) do {\
+		rte_pause();	\
+	} while (*(addr) != (expected))
+
+#ifdef RTE_USE_C11_MEM_MODEL
+#define rte_wait_until_equal_acquire(addr, expected) do {\
+		rte_pause();	\
+	} while (__atomic_load_n((addr), __ATOMIC_ACQUIRE) != (expected))
+#else
+#define rte_wait_until_equal_acquire(addr, expected) do {\
+		do {\
+			rte_pause(); \
+		} while (*(addr) != (expected)); \
+		rte_smp_rmb(); \
+	} while (0)
+#endif
+#endif
+
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4
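The non-C11 branch of the generic fallback above polls with plain loads and then issues rte_smp_rmb(). In C11 terms that is roughly a relaxed polling loop followed by one acquire fence. The sketch below expresses that pattern with GCC `__atomic` builtins instead of DPDK's barrier macros, together with a release-store publisher; `struct msg`, `msg_publish()` and `msg_consume()` are hypothetical names used only for illustration.

```c
#include <stdint.h>

static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#endif
}

/* Spin with cheap relaxed loads, then pay for acquire ordering once. */
static inline void
wait_until_equal_u32_fence(volatile uint32_t *addr, uint32_t expected)
{
	while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected)
		cpu_relax();
	/* Later reads cannot be satisfied before the load that saw expected. */
	__atomic_thread_fence(__ATOMIC_ACQUIRE);
}

/* Hypothetical message slot: payload published under a ready flag. */
struct msg {
	uint32_t payload;
	volatile uint32_t ready;
};

static inline void msg_publish(struct msg *m, uint32_t value)
{
	m->payload = value;				/* plain store ... */
	__atomic_store_n(&m->ready, 1, __ATOMIC_RELEASE); /* ... published by release */
}

static inline uint32_t msg_consume(struct msg *m)
{
	wait_until_equal_u32_fence(&m->ready, 1);
	return m->payload;	/* ordered after the flag by the acquire fence */
}
```

The release store pairs with the acquire fence, so a consumer that observes `ready == 1` also observes the payload written before it.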


* [dpdk-dev] [RFC 2/5] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-06-30 16:21 ` Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu @ 2019-06-30 16:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd, gavin.hu

When acquiring a ticket lock, cores repeatedly poll the lock variable.
This polling is replaced by the rte_wait_until_equal_acquire API.

Running ticketlock_autotest on ThunderX2 with different numbers of cores
and ring depths showed performance gains of 3%~8%.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index 191146f..6820f01 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -64,8 +64,8 @@ static inline __rte_experimental void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	if (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
+		rte_wait_until_equal_acquire(&tl->s.current, me);
 }
 
 /**
-- 
2.7.4
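The ticket-lock change can be pictured with a small self-contained sketch. It loosely mirrors the `s.current`/`s.next` fields from the diff, but it is a hypothetical, simplified lock (a plain struct and GCC `__atomic` builtins), not `rte_ticketlock_t` itself. The polling loop in `ticketlock_lock()` is the spot the patch hands over to `rte_wait_until_equal_acquire()`.

```c
#include <stdint.h>

static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#endif
}

/* Hypothetical ticket lock with the same two 16-bit counters. */
typedef struct {
	uint16_t current;	/* ticket currently being served */
	uint16_t next;		/* next ticket to hand out */
} ticketlock_t;

static inline void ticketlock_lock(ticketlock_t *tl)
{
	/* Take a ticket; 16-bit wrap-around is harmless. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	/* Wait for our turn; acquire pairs with the release in unlock. */
	while (__atomic_load_n(&tl->current, __ATOMIC_ACQUIRE) != me)
		cpu_relax();
}

static inline void ticketlock_unlock(ticketlock_t *tl)
{
	/* Only the owner writes current, so a relaxed read is enough. */
	uint16_t cur = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	__atomic_store_n(&tl->current, (uint16_t)(cur + 1), __ATOMIC_RELEASE);
}
```

Every waiter polls the same `current` word and only one of them can proceed per unlock, which is why putting those waiters to sleep (WFE) rather than spinning can pay off under contention.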


* [dpdk-dev] [RFC 3/5] ring: use wfe to wait for ring tail update on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
@ 2019-06-30 16:21 ` " Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 4/5] spinlock: use wfe to reduce contention " Gavin Hu
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu @ 2019-06-30 16:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd, gavin.hu

Instead of polling for the tail to be updated, use the WFE instruction.

A 50%~70% performance gain was measured by running ring_perf_autotest on
ThunderX2.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

---
 lib/librte_ring/rte_ring_c11_mem.h | 5 +++--
 lib/librte_ring/rte_ring_generic.h | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a3..f1de79c 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -2,6 +2,7 @@
  *
  * Copyright (c) 2017,2018 HXT-semitech Corporation.
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * Copyright (c) 2019 Arm Limited
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
  * Used as BSD-3 Licensed with permission from Kip Macy.
@@ -21,8 +22,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		if (unlikely(ht->tail != old_val))
+			rte_wait_until_equal_relaxed(&ht->tail, old_val);
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 953cdbb..bb0dce0 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -23,8 +23,8 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		if (unlikely(ht->tail != old_val))
+			rte_wait_until_equal_relaxed(&ht->tail, old_val);
 
 	ht->tail = new_val;
 }
-- 
2.7.4
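To see why update_tail() has to wait at all, the sketch below models a multi-producer head/tail pair: each producer reserves a range by advancing `head`, and may only publish its range once every earlier range has been published. It is a simplified illustration with hypothetical names; reservation here is a plain fetch-and-add, whereas the real ring uses compare-and-swap together with a free-space check.

```c
#include <stdint.h>

static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#endif
}

/* Hypothetical head/tail pair, loosely modelled on rte_ring_headtail. */
struct headtail {
	volatile uint32_t head;	/* next slot to reserve */
	volatile uint32_t tail;	/* slots [0, tail) are published */
};

/* Reserve n slots and return the first one (simplified, no bounds check). */
static inline uint32_t reserve(struct headtail *ht, uint32_t n)
{
	return __atomic_fetch_add(&ht->head, n, __ATOMIC_RELAXED);
}

/* Publish [old_val, new_val) once all earlier reservations are published. */
static inline void
update_tail(struct headtail *ht, uint32_t old_val, uint32_t new_val)
{
	/* Producers that reserved earlier slots must finish first. */
	while (__atomic_load_n(&ht->tail, __ATOMIC_RELAXED) != old_val)
		cpu_relax();
	/* Release: slot contents become visible before the new tail. */
	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}
```

If producer B called update_tail() before producer A, it would keep polling until A published its slots; that in-order wait is exactly the loop the patch converts to rte_wait_until_equal_relaxed().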


* [dpdk-dev] [RFC 4/5] spinlock: use wfe to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (2 preceding siblings ...)
  2019-06-30 16:21 ` [dpdk-dev] [RFC 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
@ 2019-06-30 16:21 ` " Gavin Hu
  2019-06-30 16:21 ` [dpdk-dev] [RFC 5/5] config: add WFE config entry for aarch64 Gavin Hu
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu @ 2019-06-30 16:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd, gavin.hu

When acquiring a spinlock, cores repeatedly poll the lock variable.
This polling is replaced by a WFE-based wait loop.

A 20% performance gain was measured by running spinlock_autotest on 14
isolated cores of ThunderX2.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 .../common/include/arch/arm/rte_spinlock.h         | 25 ++++++++++++++++++++++
 .../common/include/generic/rte_spinlock.h          |  2 +-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
index 1a6916b..b7e8521 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
@@ -16,6 +16,31 @@ extern "C" {
 #include <rte_common.h>
 #include "generic/rte_spinlock.h"
 
+/* armv7a does support WFE, but an explicit wake-up signal using SEV is
+ * required (must be preceded by DSB to drain the store buffer) and
+ * this is less performant, so keep armv7a implementation unchanged.
+ */
+#if defined(RTE_USE_WFE) && defined(RTE_ARCH_ARM64)
+static inline void
+rte_spinlock_lock(rte_spinlock_t *sl)
+{
+	unsigned int tmp;
+	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
+	 * faqs/ka16809.html
+	 */
+	asm volatile(
+		"sevl\n"
+		"1:	 wfe\n"
+		"2:	 ldaxr   %w0, %1\n"
+		"cbnz   %w0, 1b\n"
+		"stxr   %w0, %w2, %1\n"
+		"cbnz   %w0, 2b\n"
+		: "=&r" (tmp), "+Q"(sl->locked)
+		: "r" (1)
+		: "cc", "memory");
+}
+#endif
+
 static inline int rte_tm_supported(void)
 {
 	return 0;
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index 87ae7a4..cf4f15b 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -57,7 +57,7 @@ rte_spinlock_init(rte_spinlock_t *sl)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl);
 
-#ifdef RTE_FORCE_INTRINSICS
+#if defined(RTE_FORCE_INTRINSICS) && !defined(RTE_USE_WFE)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl)
 {
-- 
2.7.4
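The assembly loop in this patch is a test-and-test-and-set lock: it waits (in WFE) while the lock word is non-zero and only then attempts the exclusive store. A portable sketch of the same structure is shown below, assuming GCC `__atomic` builtins and a hypothetical `spinlock_t` shaped like `rte_spinlock_t` (a single `volatile int locked`). It spins with a pause hint where the patch sleeps in WFE.

```c
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#endif
}

typedef struct {
	volatile int locked;	/* 0 = free, 1 = held */
} spinlock_t;

static inline void spinlock_lock(spinlock_t *sl)
{
	for (;;) {
		/* "Test": wait with plain loads so the line stays shared. */
		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
			cpu_relax();
		/* "Test-and-set": the lock looked free, try to take it. */
		if (!__atomic_exchange_n(&sl->locked, 1, __ATOMIC_ACQUIRE))
			return;
		/* Lost the race to another core: go back to waiting. */
	}
}

static inline void spinlock_unlock(spinlock_t *sl)
{
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
```

In the arm64 version, the failed `stxr` retry (`cbnz %w0, 2b`) plays the role of the lost exchange here, and `ldaxr` provides the acquire ordering.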


* [dpdk-dev] [RFC 5/5] config: add WFE config entry for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (3 preceding siblings ...)
  2019-06-30 16:21 ` [dpdk-dev] [RFC 4/5] spinlock: use wfe to reduce contention " Gavin Hu
@ 2019-06-30 16:21 ` Gavin Hu
  2019-06-30 20:29 ` [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Stephen Hemminger
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu @ 2019-06-30 16:21 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd, gavin.hu

Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
It can be enabled selectively based on performance benchmarking.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 config/arm/meson.build     | 1 +
 config/common_armv8a_linux | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 6fa06a1..939d60e 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa, machine_args_generic]
 impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
 
 dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
+dpdk_conf.set('RTE_USE_WFE', false)
 
 if not dpdk_conf.get('RTE_ARCH_64')
 	dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
diff --git a/config/common_armv8a_linux b/config/common_armv8a_linux
index 72091de..ae87a87 100644
--- a/config/common_armv8a_linux
+++ b/config/common_armv8a_linux
@@ -12,6 +12,12 @@ CONFIG_RTE_ARCH_64=y
 
 CONFIG_RTE_FORCE_INTRINSICS=y
 
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs.
+# Calling these APIs puts the cores into a low-power state while waiting
+# for the memory location to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_USE_WFE=n
+
 # Maximum available cache line size in arm64 implementations.
 # Setting to maximum available cache line size in generic config
 # to address minimum DMA alignment across all arm64 implementations.
-- 
2.7.4
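Patches 1 and 4 test the option with `#ifdef RTE_USE_WFE` and `defined(RTE_USE_WFE)`, which only check whether the macro exists, not its value. Meson's configuration data, for example, writes an integer 0 as `#define RTE_USE_WFE 0`, while a boolean `false` becomes `#undef RTE_USE_WFE`. The sketch below simulates a generated header with a plain `#define` to show the difference; the helper functions are hypothetical and it does not model either build system itself.

```c
/* Simulates a generated config header that sets the option to 0. */
#define RTE_USE_WFE 0

/* Returns 1 when the #ifdef branch is compiled in. */
static int wfe_ifdef_branch(void)
{
#ifdef RTE_USE_WFE
	return 1;	/* taken: the macro exists, whatever its value */
#else
	return 0;
#endif
}

/* Returns 1 when the value-based test is true. */
static int wfe_value_branch(void)
{
#if RTE_USE_WFE
	return 1;
#else
	return 0;	/* taken: the value is 0 */
#endif
}
```

So "disabled" has to mean "not defined" for the `#ifdef` checks in this series to fall back to the generic code.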


* Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 ` [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-06-30 20:27   ` Stephen Hemminger
  2019-07-01  7:16     ` Gavin Hu (Arm Technology China)
  2019-07-01  9:58   ` Pavan Nikhilesh Bhagavatula
  1 sibling, 1 reply; 42+ messages in thread
From: Stephen Hemminger @ 2019-06-30 20:27 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd

On Mon,  1 Jul 2019 00:21:12 +0800
Gavin Hu <gavin.hu@arm.com> wrote:

> +#ifdef RTE_USE_WFE
> +#define rte_wait_until_equal_relaxed(addr, expected) do {\
> +		typeof(*addr) tmp;  \
> +		if (__builtin_constant_p((expected))) \
> +			do { \
> +				if (sizeof(*(addr)) == 16)\
> +					asm volatile(  \
> +						"sevl\n"  \
> +						"1:	 wfe\n"  \
> +						"ldxrh  %w0, %1\n"  \
> +						"cmp	%w0, %w2\n"  \
> +						"bne	1b\n"  \
> +						: "=&r"(tmp)  \
> +						: "Q"(*addr), "i"(expected)  \
> +						: "cc", "memory");  \
> +				else if (sizeof(*(addr)) == 32)\
> +					asm volatile(  \
> +						"sevl\n"  \
> +						"1:	 wfe\n"  \
> +						"ldxr  %w0, %1\n"  \
> +						"cmp	%w0, %w2\n"  \
> +						"bne	1b\n"  \
> +						: "=&r"(tmp)  \
> +						: "Q"(*addr), "i"(expected)  \
> +						: "cc", "memory");  \
> +				else if (sizeof(*(addr)) == 64)\
> +					asm volatile(  \
> +						"sevl\n"  \
> +						"1:	 wfe\n"  \
> +						"ldxr  %x0, %1\n"  \
> +						"cmp	%x0, %x2\n"  \
> +						"bne	1b\n"  \
> +						: "=&r" (tmp)  \
> +						: "Q"(*addr), "i"(expected)  \
> +						: "cc", "memory"); \
> +			} while (0); \
> +		else \
> +			do { \
> +				if (sizeof(*(addr)) == 16)\
> +					asm volatile(  \
> +						"sevl\n"  \
> +						"1:	 wfe\n"  \
> +						"ldxrh  %w0, %1\n"  \
> +						"cmp	%w0, %w2\n"  \
> +						"bne	1b\n"  \
> +						: "=&r"(tmp)  \
> +						: "Q"(*addr), "r"(expected)  \
> +						: "cc", "memory");  \
> +				else if (sizeof(*(addr)) == 32)\
> +					asm volatile(  \
> +						"sevl\n"  \
> +						"1:	 wfe\n"  \
> +						"ldxr  %w0, %1\n"  \
> +						"cmp	%w0, %w2\n"  \
> +						"bne	1b\n"  \
> +						: "=&r"(tmp)  \
> +						: "Q"(*addr), "r"(expected)  \
> +						: "cc", "memory");  \
> +				else if (sizeof(*(addr)) == 64)\
> +					asm volatile(  \
> +						"sevl\n"  \
> +						"1:	 wfe\n"  \
> +						"ldxr  %x0, %1\n"  \
> +						"cmp	%x0, %x2\n"  \
> +						"bne	1b\n"  \
> +						: "=&r" (tmp)  \
> +						: "Q"(*addr), "r"(expected)  \
> +						: "cc", "memory");  \
> +		} while (0); \
> +} while (0)

That is a hot mess.
Macros are harder to maintain and offer no benefit over inline functions.

* Re: [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (4 preceding siblings ...)
  2019-06-30 16:21 ` [dpdk-dev] [RFC 5/5] config: add WFE config entry for aarch64 Gavin Hu
@ 2019-06-30 20:29 ` Stephen Hemminger
  2019-07-01  9:12   ` Gavin Hu (Arm Technology China)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 " Gavin Hu
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Stephen Hemminger @ 2019-06-30 20:29 UTC (permalink / raw)
  To: Gavin Hu
  Cc: dev, thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa.Nagarahalli, nd

On Mon,  1 Jul 2019 00:21:11 +0800
Gavin Hu <gavin.hu@arm.com> wrote:

> DPDK has multiple use cases where the core repeatedly polls a location in
> memory. This polling results in many cache and memory transactions.
> 
> Arm architecture provides WFE (Wait For Event) instruction, which allows
> the cpu core to enter a low power state until woken up by the update to the
> memory location being polled. Thus reducing the cache and memory
> transactions.
> 
> x86 has the PAUSE hint instruction to reduce such overhead.
> 
> The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
> for a memory location to become equal to a given value'.
> 
> For non-Arm platforms, these APIs are just wrappers around do-while loop
> with rte_pause, so there are no performance differences.
> 
> For Arm platforms, use of WFE can be configured using CONFIG_RTE_USE_WFE
> option. It is disabled by default.
> 
> Currently, use of WFE is supported only for aarch64 platforms. armv7
> platforms do support the WFE instruction, but they require explicit wake up
> events(sev) and are less performannt.
> 
> Testing shows that, performance varies across different platforms, with
> some showing degradation.
> 
> CONFIG_RTE_USE_WFE should be enabled depending on the performance on the
> target platforms.

How does this work if the process is preempted?

* Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-06-30 20:27   ` Stephen Hemminger
@ 2019-07-01  7:16     ` Gavin Hu (Arm Technology China)
  2019-07-01  7:43       ` Thomas Monjalon
  0 siblings, 1 reply; 42+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-01  7:16 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa Nagarahalli, nd, gaetan.rivet,
	Gavin Hu (Arm Technology China)

Hi Stephen,

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Monday, July 1, 2019 4:28 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; thomas@monjalon.net; jerinj@marvell.com;
> hemant.agrawal@nxp.com; bruce.richardson@intel.com;
> chaozhu@linux.vnet.ibm.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
> 
> On Mon,  1 Jul 2019 00:21:12 +0800
> Gavin Hu <gavin.hu@arm.com> wrote:
> 
> > +#ifdef RTE_USE_WFE
> > +#define rte_wait_until_equal_relaxed(addr, expected) do {\
> > +		typeof(*addr) tmp;  \
> > +		if (__builtin_constant_p((expected))) \
> > +			do { \
> > +				if (sizeof(*(addr)) == 16)\
> > +					asm volatile(  \
> > +						"sevl\n"  \
> > +						"1:	 wfe\n"  \
> > +						"ldxrh  %w0, %1\n"  \
> > +						"cmp	%w0, %w2\n"  \
> > +						"bne	1b\n"  \
> > +						: "=&r"(tmp)  \
> > +						: "Q"(*addr), "i"(expected)  \
> > +						: "cc", "memory");  \
> > +				else if (sizeof(*(addr)) == 32)\
> > +					asm volatile(  \
> > +						"sevl\n"  \
> > +						"1:	 wfe\n"  \
> > +						"ldxr  %w0, %1\n"  \
> > +						"cmp	%w0, %w2\n"  \
> > +						"bne	1b\n"  \
> > +						: "=&r"(tmp)  \
> > +						: "Q"(*addr), "i"(expected)  \
> > +						: "cc", "memory");  \
> > +				else if (sizeof(*(addr)) == 64)\
> > +					asm volatile(  \
> > +						"sevl\n"  \
> > +						"1:	 wfe\n"  \
> > +						"ldxr  %x0, %1\n"  \
> > +						"cmp	%x0, %x2\n"  \
> > +						"bne	1b\n"  \
> > +						: "=&r" (tmp)  \
> > +						: "Q"(*addr), "i"(expected)  \
> > +						: "cc", "memory"); \
> > +			} while (0); \
> > +		else \
> > +			do { \
> > +				if (sizeof(*(addr)) == 16)\
> > +					asm volatile(  \
> > +						"sevl\n"  \
> > +						"1:	 wfe\n"  \
> > +						"ldxrh  %w0, %1\n"  \
> > +						"cmp	%w0, %w2\n"  \
> > +						"bne	1b\n"  \
> > +						: "=&r"(tmp)  \
> > +						: "Q"(*addr), "r"(expected)  \
> > +						: "cc", "memory");  \
> > +				else if (sizeof(*(addr)) == 32)\
> > +					asm volatile(  \
> > +						"sevl\n"  \
> > +						"1:	 wfe\n"  \
> > +						"ldxr  %w0, %1\n"  \
> > +						"cmp	%w0, %w2\n"  \
> > +						"bne	1b\n"  \
> > +						: "=&r"(tmp)  \
> > +						: "Q"(*addr), "r"(expected)  \
> > +						: "cc", "memory");  \
> > +				else if (sizeof(*(addr)) == 64)\
> > +					asm volatile(  \
> > +						"sevl\n"  \
> > +						"1:	 wfe\n"  \
> > +						"ldxr  %x0, %1\n"  \
> > +						"cmp	%x0, %x2\n"  \
> > +						"bne	1b\n"  \
> > +						: "=&r" (tmp)  \
> > +						: "Q"(*addr), "r"(expected)  \
> > +						: "cc", "memory");  \
> > +		} while (0); \
> > +} while (0)
> 
> That is a hot mess.
> Macro's are harder to maintain and offer no benefit over inline functions.
During our internal review, I used C11 _Generic to generalize the API to take arguments of different types.
That made the API much simpler and cleaner, but it imposes a hard requirement on C11 and GCC 4.9+.
That means Gaetan's patch, shown below, would have to be reverted; otherwise there are compilation errors.
https://gcc.gnu.org/wiki/C11Status 
$ git show ea7726a6
commit ea7726a6ee4b2b63313c4a198522a8dcea70c13d
Author: Gaetan Rivet <gaetan.rivet@6wind.com>
Date:   Thu Jul 20 14:27:53 2017 +0200

    net/failsafe: fix build on FreeBSD 10 with GCC 4.8

diff --git a/drivers/net/failsafe/Makefile b/drivers/net/failsafe/Makefile
index 32aaaa2..d516d36 100644
--- a/drivers/net/failsafe/Makefile
+++ b/drivers/net/failsafe/Makefile
@@ -50,7 +50,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_PMD_FAILSAFE) += failsafe_flow.c
 # No exported include files

 # Basic CFLAGS:
-CFLAGS += -std=c11 -Wextra
+CFLAGS += -std=gnu99 -Wextra
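A minimal sketch of the `_Generic` idea mentioned above, assuming GCC/Clang `__atomic` builtins and hypothetical function names (this is not the version from the internal review). The type-generic macro is only a thin dispatcher, and the waiting logic lives in ordinary typed inline functions, which also removes any need to branch on `sizeof`. It requires a compiler with `_Generic` support, which is the C11/GCC 4.9+ constraint discussed above.

```c
#include <stdint.h>

static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#endif
}

static inline void wait_until_equal_16(volatile uint16_t *a, uint16_t e)
{
	while (__atomic_load_n(a, __ATOMIC_ACQUIRE) != e)
		cpu_relax();
}

static inline void wait_until_equal_32(volatile uint32_t *a, uint32_t e)
{
	while (__atomic_load_n(a, __ATOMIC_ACQUIRE) != e)
		cpu_relax();
}

static inline void wait_until_equal_64(volatile uint64_t *a, uint64_t e)
{
	while (__atomic_load_n(a, __ATOMIC_ACQUIRE) != e)
		cpu_relax();
}

/* Pick the implementation from the pointer type at compile time. */
#define wait_until_equal_acquire(addr, expected)		\
	_Generic((addr),					\
		uint16_t *: wait_until_equal_16,		\
		volatile uint16_t *: wait_until_equal_16,	\
		uint32_t *: wait_until_equal_32,		\
		volatile uint32_t *: wait_until_equal_32,	\
		uint64_t *: wait_until_equal_64,		\
		volatile uint64_t *: wait_until_equal_64)((addr), (expected))
```

Passing a pointer of any other type fails to compile, which is arguably safer than a runtime size check that silently matches nothing.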

* Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-07-01  7:16     ` Gavin Hu (Arm Technology China)
@ 2019-07-01  7:43       ` Thomas Monjalon
  2019-07-02 14:07         ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Monjalon @ 2019-07-01  7:43 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China)
  Cc: Stephen Hemminger, dev, jerinj, hemant.agrawal, bruce.richardson,
	chaozhu, Honnappa Nagarahalli, nd, gaetan.rivet

01/07/2019 09:16, Gavin Hu (Arm Technology China):
> From: Stephen Hemminger <stephen@networkplumber.org>
> > Gavin Hu <gavin.hu@arm.com> wrote:
> > 
> > > +#ifdef RTE_USE_WFE
> > > +#define rte_wait_until_equal_relaxed(addr, expected) do {\
[...]
> > That is a hot mess.
> > Macro's are harder to maintain and offer no benefit over inline functions.
> 
> During our internal review, I ever used C11 _Generic to generalize the API to take different types of arguments. 

Gavin, the question is about macros versus functions.
Please, could you convert it into an inline function?
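As a hedged sketch of what such an inline function could look like for the 32-bit acquire case, assuming GCC-style inline asm, see below. The function name is hypothetical and this is not the code that was eventually merged. A fixed-width parameter removes the need to dispatch on `sizeof`, which reports bytes (2, 4, 8), not bits. Because WFE can also return on interrupts or unrelated events, the compare-and-branch re-checks the value after every wake-up.

```c
#include <stdint.h>

/*
 * Wait until *addr == expected, with acquire ordering on the final load.
 * On aarch64 the core sleeps in WFE between checks; elsewhere it spins.
 */
static inline void
wait_until_equal_32_acquire(volatile uint32_t *addr, uint32_t expected)
{
#if defined(__aarch64__)
	uint32_t value;

	__asm__ volatile(
		"	sevl\n"			/* set the local event register ... */
		"1:	wfe\n"			/* ... so this first WFE returns at once */
		"	ldaxr	%w[val], %[mem]\n"	/* load-acquire, arms the exclusive monitor */
		"	cmp	%w[val], %w[exp]\n"
		"	b.ne	1b\n"			/* not equal yet: wait for the next event */
		: [val] "=&r" (value)
		: [mem] "Q" (*addr), [exp] "r" (expected)
		: "cc", "memory");
	(void)value;
#else
	while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != expected) {
#if defined(__x86_64__) || defined(__i386__)
		__builtin_ia32_pause();
#endif
	}
#endif
}
```

Passing `expected` in a register (`"r"`) rather than as an immediate keeps one code path for constant and non-constant arguments.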




* Re: [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64
  2019-06-30 20:29 ` [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Stephen Hemminger
@ 2019-07-01  9:12   ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-01  9:12 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa Nagarahalli, nd

Hi Stephen,

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Monday, July 1, 2019 4:30 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; thomas@monjalon.net; jerinj@marvell.com;
> hemant.agrawal@nxp.com; bruce.richardson@intel.com;
> chaozhu@linux.vnet.ibm.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64
> 
> On Mon,  1 Jul 2019 00:21:11 +0800
> Gavin Hu <gavin.hu@arm.com> wrote:
> 
> > DPDK has multiple use cases where the core repeatedly polls a location in
> > memory. This polling results in many cache and memory transactions.
> >
> > Arm architecture provides WFE (Wait For Event) instruction, which allows
> > the cpu core to enter a low power state until woken up by the update to the
> > memory location being polled. Thus reducing the cache and memory
> > transactions.
> >
> > x86 has the PAUSE hint instruction to reduce such overhead.
> >
> > The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
> > for a memory location to become equal to a given value'.
> >
> > For non-Arm platforms, these APIs are just wrappers around do-while loop
> > with rte_pause, so there are no performance differences.
> >
> > For Arm platforms, use of WFE can be configured using CONFIG_RTE_USE_WFE
> > option. It is disabled by default.
> >
> > Currently, use of WFE is supported only for aarch64 platforms. armv7
> > platforms do support the WFE instruction, but they require explicit wake up
> > events(sev) and are less performannt.
> >
> > Testing shows that, performance varies across different platforms, with
> > some showing degradation.
> >
> > CONFIG_RTE_USE_WFE should be enabled depending on the performance on the
> > target platforms.
> 
> How does this work if process is preempted?
WFE does not prevent preemption by the kernel, since preemption is driven by a timer/rescheduling interrupt.
Software using the WFE mechanism must tolerate spurious wake-up events, including timer/rescheduling interrupts, so the condition must be re-checked after exiting WFE (this is already done in the patch).

* Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 ` [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal Gavin Hu
  2019-06-30 20:27   ` Stephen Hemminger
@ 2019-07-01  9:58   ` Pavan Nikhilesh Bhagavatula
  2019-07-02 14:08     ` Gavin Hu (Arm Technology China)
  1 sibling, 1 reply; 42+ messages in thread
From: Pavan Nikhilesh Bhagavatula @ 2019-07-01  9:58 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, hemant.agrawal,
	bruce.richardson, chaozhu, Honnappa.Nagarahalli, nd

Hi Gavin,

>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
>Sent: Sunday, June 30, 2019 9:51 PM
>To: dev@dpdk.org
>Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran
><jerinj@marvell.com>; hemant.agrawal@nxp.com;
>bruce.richardson@intel.com; chaozhu@linux.vnet.ibm.com;
>Honnappa.Nagarahalli@arm.com; nd@arm.com; gavin.hu@arm.com
>Subject: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
>
>The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
>for a memory location to become equal to a given value'.
>
>Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>Reviewed-by: Steve Capper <steve.capper@arm.com>
>Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
>Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>---
> .../common/include/arch/arm/rte_pause_64.h         | 143 +++++++++++++++++++++
> lib/librte_eal/common/include/generic/rte_pause.h  |  20 +++
> 2 files changed, 163 insertions(+)
>
>diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
>index 93895d3..0095da6 100644
>--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
>+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
>@@ -17,6 +17,149 @@ static inline void rte_pause(void)
> 	asm volatile("yield" ::: "memory");
> }
>
>+#ifdef RTE_USE_WFE
>+#define rte_wait_until_equal_relaxed(addr, expected) do {\
>+		typeof(*addr) tmp;  \
>+		if (__builtin_constant_p((expected))) \
>+			do { \
>+				if (sizeof(*(addr)) == 16)\
>+					asm volatile(  \
>+						"sevl\n"  \
>+						"1:	 wfe\n"  \
>+						"ldxrh  %w0, %1\n"  \
>+						"cmp	%w0, %w2\n"  \
>+						"bne	1b\n"  \
>+						: "=&r"(tmp)  \
>+						: "Q"(*addr), "i"(expected)  \
>+						: "cc", "memory");  \

Can we have an early exit here? That is, instead of going directly to wfe, can we first check the condition and only then fall through to the wait loop?
Something like:
		asm volatile("	ldxrh	%w0, %1		\n"
			     "	cmp	%w0, %w2	\n"
			     "	b.eq	2f		\n"
			     "1:	wfe			\n"
			     "	ldxrh	%w0, %1		\n"
			     "	cmp	%w0, %w2	\n"
			     "	b.ne	1b		\n"
			     "2:				\n"
			     : "=&r"(tmp)
			     : "Q"(*addr), "r"(expected)
			     : "cc", "memory");

Regards,
Pavan.
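The same early-exit shape in portable C, as a sketch only: `cpu_relax()` is a hypothetical helper (PAUSE on x86, YIELD on aarch64, a compiler barrier elsewhere) standing in for WFE, and the function name is illustrative. An uncontended caller pays for a single load and compare and never enters the wait loop.

```c
#include <stdint.h>

/* Hypothetical helper: a polite spin hint, standing in for WFE. */
static inline void
cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#elif defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#else
	__asm__ volatile("" ::: "memory");
#endif
}

/* Returns the number of slow-path iterations (0 on the early exit). */
static unsigned
wait_until_equal16_early_exit(volatile uint16_t *addr, uint16_t expected)
{
	unsigned spins = 0;

	/* Fast path: check once before arming any wait mechanism. */
	if (__atomic_load_n(addr, __ATOMIC_RELAXED) == expected)
		return 0;

	/* Slow path: wait, then reload and compare again. */
	do {
		cpu_relax();
		spins++;
	} while (__atomic_load_n(addr, __ATOMIC_RELAXED) != expected);

	return spins;
}
```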

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-07-01  7:43       ` Thomas Monjalon
@ 2019-07-02 14:07         ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-02 14:07 UTC (permalink / raw)
  To: thomas
  Cc: Stephen Hemminger, dev, jerinj, hemant.agrawal, bruce.richardson,
	chaozhu, Honnappa Nagarahalli, nd, gaetan.rivet

Hi Thomas,
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Monday, July 1, 2019 3:43 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>; dev@dpdk.org;
> jerinj@marvell.com; hemant.agrawal@nxp.com;
> bruce.richardson@intel.com; chaozhu@linux.vnet.ibm.com; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>;
> gaetan.rivet@6wind.com
> Subject: Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
> 
> 01/07/2019 09:16, Gavin Hu (Arm Technology China):
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > > Gavin Hu <gavin.hu@arm.com> wrote:
> > >
> > > > +#ifdef RTE_USE_WFE
> > > > +#define rte_wait_until_equal_relaxed(addr, expected) do {\
> [...]
> > > That is a hot mess.
> > > Macro's are harder to maintain and offer no benefit over inline functions.
> >
> > During our internal review, I used C11 _Generic to generalize the API
> > to take different types of arguments.
> 
> Gavin, the question is about macros versus functions.
> Please, could you convert it into an inline function?
Sure, I will do it in next version.
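One way to combine both ideas, sketched under assumptions: each width is an ordinary, type-checked inline function, and an optional C11 `_Generic` front end picks the width from the pointer type. All names here (`wait_until_equal16/32/64`, `wait_until_equal`, `cpu_relax`) are illustrative, not the names the next revision uses, and the loop uses GCC/Clang `__atomic` builtins instead of WFE.

```c
#include <stdint.h>

/* Hypothetical spin hint standing in for rte_pause()/WFE. */
static inline void
cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#elif defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#else
	__asm__ volatile("" ::: "memory");
#endif
}

/* One type-checked inline function per width. */
static inline void
wait_until_equal16(volatile uint16_t *addr, uint16_t expected, int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		cpu_relax();
}

static inline void
wait_until_equal32(volatile uint32_t *addr, uint32_t expected, int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		cpu_relax();
}

static inline void
wait_until_equal64(volatile uint64_t *addr, uint64_t expected, int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		cpu_relax();
}

/* Optional C11 front end: the pointer type selects the function, and
 * the arguments are still checked against that function's prototype. */
#define wait_until_equal(addr, expected, memorder)		\
	_Generic((addr),					\
		volatile uint16_t *: wait_until_equal16,	\
		volatile uint32_t *: wait_until_equal32,	\
		volatile uint64_t *: wait_until_equal64		\
	)((addr), (expected), (memorder))
```

Unlike a statement macro, a wrong pointer type is rejected at compile time, and the functions show up normally in debuggers and profiles.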


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
  2019-07-01  9:58   ` Pavan Nikhilesh Bhagavatula
@ 2019-07-02 14:08     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-02 14:08 UTC (permalink / raw)
  To: Pavan Nikhilesh Bhagavatula, dev
  Cc: thomas, jerinj, hemant.agrawal, bruce.richardson, chaozhu,
	Honnappa Nagarahalli, nd

Hi Pavan,

> -----Original Message-----
> From: Pavan Nikhilesh Bhagavatula <pbhagavatula@marvell.com>
> Sent: Monday, July 1, 2019 5:59 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; jerinj@marvell.com; hemant.agrawal@nxp.com;
> bruce.richardson@intel.com; chaozhu@linux.vnet.ibm.com; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
> 
> Hi Gavin,
> 
> >-----Original Message-----
> >From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
> >Sent: Sunday, June 30, 2019 9:51 PM
> >To: dev@dpdk.org
> >Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran
> ><jerinj@marvell.com>; hemant.agrawal@nxp.com;
> >bruce.richardson@intel.com; chaozhu@linux.vnet.ibm.com;
> >Honnappa.Nagarahalli@arm.com; nd@arm.com; gavin.hu@arm.com
> >Subject: [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal
> >
> >The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
> >for a memory location to become equal to a given value'.
> >
> >Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> >Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >Reviewed-by: Steve Capper <steve.capper@arm.com>
> >Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> >Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >---
> > .../common/include/arch/arm/rte_pause_64.h         | 143 +++++++++++++++++++++
> > lib/librte_eal/common/include/generic/rte_pause.h  |  20 +++
> > 2 files changed, 163 insertions(+)
> >
> >diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> >index 93895d3..0095da6 100644
> >--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> >+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> >@@ -17,6 +17,149 @@ static inline void rte_pause(void)
> > 	asm volatile("yield" ::: "memory");
> > }
> >
> >+#ifdef RTE_USE_WFE
> >+#define rte_wait_until_equal_relaxed(addr, expected) do {\
> >+		typeof(*addr) tmp;  \
> >+		if (__builtin_constant_p((expected))) \
> >+			do { \
> >+				if (sizeof(*(addr)) == 16)\
> >+					asm volatile(  \
> >+						"sevl\n"  \
> >+						"1:	 wfe\n"  \
> >+						"ldxrh  %w0, %1\n"  \
> >+						"cmp	%w0, %w2\n"  \
> >+						"bne	1b\n"  \
> >+						: "=&r"(tmp)  \
> >+						: "Q"(*addr), "i"(expected)  \
> >+						: "cc", "memory");  \
> 
> Can we have early exit here i.e. instead of going directly to wfe can we first
> check the condition and then fallthrough?
> Something like:
> 		asm volatile("	ldxrh	%w0, %1		\n"
> 			     "	cmp	%w0, %w2	\n"
> 			     "	b.eq	2f		\n"
> 			     "1:	wfe			\n"
> 			     "	ldxrh	%w0, %1		\n"
> 			     "	cmp	%w0, %w2	\n"
> 			     "	b.ne	1b		\n"
> 			     "2:				\n"
> 			     : "=&r"(tmp)
> 			     : "Q"(*addr), "r"(expected)
> 			     : "cc", "memory");
> 
> Regards,
> Pavan.
Ok, I will do it in next version.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [dpdk-dev] [RFC v2 0/5] use WFE for locks and ring on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (5 preceding siblings ...)
  2019-06-30 20:29 ` [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Stephen Hemminger
@ 2019-07-03  8:58 ` " Gavin Hu
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 1/5] eal: add the APIs to wait until equal Gavin Hu
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu @ 2019-07-03  8:58 UTC (permalink / raw)
  To: dev; +Cc: nd

DPDK has multiple use cases where the core repeatedly polls a location in
memory. This polling results in many cache and memory transactions.

Arm architecture provides the WFE (Wait For Event) instruction, which allows
the CPU core to enter a low power state until woken up by the update to the
memory location being polled, thus reducing the cache and memory
transactions.

x86 has the PAUSE hint instruction to reduce such overhead.

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

For non-Arm platforms, these APIs are just wrappers around a polling loop
with rte_pause(), so there is no performance difference.

For Arm platforms, use of WFE can be configured using CONFIG_RTE_USE_WFE
option. It is disabled by default.

Currently, use of WFE is supported only for aarch64 platforms. armv7
platforms do support the WFE instruction, but they require explicit wake-up
events (SEV) and are less performant.

Testing shows that performance varies across different platforms, with
some showing degradation.

CONFIG_RTE_USE_WFE should be enabled depending on the performance on the
target platforms.

V2:
* Use inline functions instead of macros
* Add a load and compare at the beginning of the APIs (early exit)
* Fix some style errors in the inline asm

V1:
* Add the new APIs and use them for ring and locks

Gavin Hu (5):
  eal: add the APIs to wait until equal
  ticketlock: use new API to reduce contention on aarch64
  ring: use wfe to wait for ring tail update on aarch64
  spinlock: use wfe to reduce contention on aarch64
  config: add WFE config entry for aarch64

 config/arm/meson.build                             |   1 +
 config/common_armv8a_linux                         |   6 ++
 .../common/include/arch/arm/rte_atomic_64.h        |   4 +
 .../common/include/arch/arm/rte_pause_64.h         | 106 +++++++++++++++++++++
 .../common/include/arch/arm/rte_spinlock.h         |  25 +++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
 .../common/include/generic/rte_spinlock.h          |   2 +-
 .../common/include/generic/rte_ticketlock.h        |   3 +-
 lib/librte_ring/rte_ring_c11_mem.h                 |   4 +-
 lib/librte_ring/rte_ring_generic.h                 |   3 +-
 10 files changed, 185 insertions(+), 8 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [dpdk-dev] [RFC v2 1/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (6 preceding siblings ...)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 " Gavin Hu
@ 2019-07-03  8:58 ` Gavin Hu
  2019-07-20  6:46   ` [dpdk-dev] [EXT] " Pavan Nikhilesh Bhagavatula
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Gavin Hu @ 2019-07-03  8:58 UTC (permalink / raw)
  To: dev; +Cc: nd

The rte_wait_until_equalxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 .../common/include/arch/arm/rte_atomic_64.h        |   4 +
 .../common/include/arch/arm/rte_pause_64.h         | 106 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
 3 files changed, 148 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 97060e4..8d742c6 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -15,8 +15,12 @@ extern "C" {
 
 #include "generic/rte_atomic.h"
 
+#ifndef dsb
 #define dsb(opt) asm volatile("dsb " #opt : : : "memory")
+#endif
+#ifndef dmb
 #define dmb(opt) asm volatile("dmb " #opt : : : "memory")
+#endif
 
 #define rte_mb() dsb(sy)
 
diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..1f7be0a 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -17,6 +17,112 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_USE_WFE
+/* Wait for *addr to be updated with expected value */
+static __rte_always_inline void
+rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected, int memorder)
+{
+	uint16_t tmp;
+	if (memorder == __ATOMIC_RELAXED)
+		asm volatile(
+			"ldxrh	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldxrh	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+	else
+		asm volatile(
+			"ldaxrh %w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldaxrh	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+}
+
+static __rte_always_inline void
+rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected, int memorder)
+{
+	uint32_t tmp;
+	if (memorder == __ATOMIC_RELAXED)
+		asm volatile(
+			"ldxr	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldxr	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+	else
+		asm volatile(
+			"ldaxr  %w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldaxr  %w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+}
+
+static __rte_always_inline void
+rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected, int memorder)
+{
+	uint64_t tmp;
+	if (memorder == __ATOMIC_RELAXED)
+		asm volatile(
+			"ldxr	%x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldxr	%x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+	else
+		asm volatile(
+			"ldaxr  %x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldaxr  %x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+}
+
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..8f5f025 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -4,7 +4,6 @@
 
 #ifndef _RTE_PAUSE_H_
 #define _RTE_PAUSE_H_
-
 /**
  * @file
  *
@@ -12,6 +11,10 @@
  *
  */
 
+#include <stdint.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +23,38 @@
  */
 static inline void rte_pause(void);
 
+#if !defined(RTE_USE_WFE)
+#ifdef RTE_USE_C11_MEM_MODEL
+#define __rte_wait_until_equal(addr, expected, memorder) do {\
+	while (__atomic_load_n(addr, memorder) != expected) \
+		rte_pause();\
+} while (0)
+#else
+#define __rte_wait_until_equal(addr, expected, memorder) do {\
+	while (*addr != expected)\
+		rte_pause();\
+	if (memorder != __ATOMIC_RELAXED)\
+		rte_smp_rmb();\
+} while (0)
+#endif
+
+static __rte_always_inline void
+rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected, int memorder)
+{
+	__rte_wait_until_equal(addr, expected, memorder);
+}
+
+static __rte_always_inline void
+rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected, int memorder)
+{
+	__rte_wait_until_equal(addr, expected, memorder);
+}
+
+static __rte_always_inline void
+rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected, int memorder)
+{
+	__rte_wait_until_equal(addr, expected, memorder);
+}
+#endif /* RTE_USE_WFE */
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4
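For readers comparing the two non-WFE branches of the generic header, here is a portable sketch of their intended semantics, using GCC/Clang `__atomic` builtins and a hypothetical `cpu_relax()` in place of `rte_pause()`. In the first variant the ordering is carried by each load; in the second, plain volatile loads are followed by an acquire fence when the caller asks for more than relaxed ordering, which is the role `rte_smp_rmb()` plays in the patch. The fence form is an approximation of acquire semantics, not a statement about DPDK's exact barrier mapping, and the function names are illustrative.

```c
#include <stdint.h>

/* Hypothetical spin hint standing in for rte_pause(). */
static inline void
cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#elif defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#else
	__asm__ volatile("" ::: "memory");
#endif
}

/* Variant A (C11-style builds): each load carries the requested order. */
static inline void
wait_until_equal32_c11(volatile uint32_t *addr, uint32_t expected,
		       int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		cpu_relax();
}

/* Variant B (non-C11 builds): plain volatile loads, then an acquire
 * fence once the value matches, unless relaxed ordering was requested. */
static inline void
wait_until_equal32_fence(volatile uint32_t *addr, uint32_t expected,
			 int memorder)
{
	while (*addr != expected)
		cpu_relax();
	if (memorder != __ATOMIC_RELAXED)
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
}
```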


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [dpdk-dev] [RFC v2 2/5] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (7 preceding siblings ...)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 1/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-07-03  8:58 ` Gavin Hu
  2019-07-20  6:57   ` Pavan Nikhilesh Bhagavatula
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Gavin Hu @ 2019-07-03  8:58 UTC (permalink / raw)
  To: dev; +Cc: nd

When acquiring a ticket lock, cores repeatedly poll the lock variable.
This polling is replaced by the rte_wait_until_equal16 API.

Running ticketlock_autotest on ThunderX2, with different numbers of cores
and depths of rings, 3%~8% performance gains were measured.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index 191146f..8fa1f62 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -64,8 +64,7 @@ static inline __rte_experimental void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal16(&tl->s.current, me, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.7.4
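A self-contained sketch of a ticket lock built on such a wait helper, using GCC/Clang `__atomic` builtins. The struct layout and the `demo_*`, `wait_until_equal16` and `cpu_relax` names are simplified and hypothetical, not DPDK's `rte_ticketlock_t`; only the lock path mirrors the change above: a relaxed fetch-add takes a ticket, then the caller waits with acquire ordering until `current` equals it.

```c
#include <stdint.h>

/* Hypothetical spin hint standing in for rte_pause()/WFE. */
static inline void
cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#elif defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#else
	__asm__ volatile("" ::: "memory");
#endif
}

static inline void
wait_until_equal16(volatile uint16_t *addr, uint16_t expected, int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		cpu_relax();
}

/* Simplified, hypothetical layout. */
typedef struct {
	uint16_t current;	/* ticket now being served */
	uint16_t next;		/* next ticket to hand out */
} demo_ticketlock_t;

static inline void
demo_ticketlock_init(demo_ticketlock_t *tl)
{
	__atomic_store_n(&tl->current, 0, __ATOMIC_RELAXED);
	__atomic_store_n(&tl->next, 0, __ATOMIC_RELAXED);
}

static inline void
demo_ticketlock_lock(demo_ticketlock_t *tl)
{
	/* Take a ticket; no ordering is needed yet. */
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	/* Acquire: the critical section cannot start before our turn. */
	wait_until_equal16(&tl->current, me, __ATOMIC_ACQUIRE);
}

static inline void
demo_ticketlock_unlock(demo_ticketlock_t *tl)
{
	uint16_t cur = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	/* Release: hand the lock to the holder of ticket cur + 1. */
	__atomic_store_n(&tl->current, (uint16_t)(cur + 1), __ATOMIC_RELEASE);
}

static inline int
demo_ticketlock_is_locked(demo_ticketlock_t *tl)
{
	return __atomic_load_n(&tl->current, __ATOMIC_ACQUIRE) !=
	       __atomic_load_n(&tl->next, __ATOMIC_ACQUIRE);
}
```

Every waiter polls the same `current` word, which is why replacing the spin with a low-power wait on that one location is attractive under contention.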


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [dpdk-dev] [RFC v2 3/5] ring: use wfe to wait for ring tail update on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (8 preceding siblings ...)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
@ 2019-07-03  8:58 ` " Gavin Hu
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 4/5] spinlock: use wfe to reduce contention " Gavin Hu
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu @ 2019-07-03  8:58 UTC (permalink / raw)
  To: dev; +Cc: nd

Instead of polling for the ring tail to be updated, use the
rte_wait_until_equal32 API, which uses the WFE instruction on aarch64
when enabled.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 4 ++--
 lib/librte_ring/rte_ring_generic.h | 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a3..037811e 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -2,6 +2,7 @@
  *
  * Copyright (c) 2017,2018 HXT-semitech Corporation.
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * Copyright (c) 2019 Arm Limited
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
  * Used as BSD-3 Licensed with permission from Kip Macy.
@@ -21,8 +22,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 953cdbb..570765c 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -23,8 +23,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
 	ht->tail = new_val;
 }
-- 
2.7.4
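A simplified sketch of the multi-producer tail update touched by this patch, with a hypothetical `struct headtail` and helper names rather than the real `rte_ring_headtail`: a producer that reserved slots [old_val, new_val) waits until all earlier producers have published (tail == old_val), then publishes its own range with a release store so its slot writes are visible before the new tail.

```c
#include <stdint.h>

/* Hypothetical spin hint standing in for rte_pause()/WFE. */
static inline void
cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#elif defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#else
	__asm__ volatile("" ::: "memory");
#endif
}

static inline void
wait_until_equal32(volatile uint32_t *addr, uint32_t expected, int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		cpu_relax();
}

/* Hypothetical, reduced version of the ring head/tail pair. */
struct headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
};

static inline void
update_tail(struct headtail *ht, uint32_t old_val, uint32_t new_val,
	    int single)
{
	/* Multi-producer: earlier reservations must be published first. */
	if (!single)
		wait_until_equal32(&ht->tail, old_val, __ATOMIC_RELAXED);

	/* Release: the slot contents become visible before the new tail. */
	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}
```

The wait itself can stay relaxed because it only orders this producer behind earlier ones; the release store is what publishes the data to consumers.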


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [dpdk-dev] [RFC v2 4/5] spinlock: use wfe to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (9 preceding siblings ...)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
@ 2019-07-03  8:58 ` " Gavin Hu
  2019-07-20  6:59   ` Pavan Nikhilesh Bhagavatula
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64 Gavin Hu
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Gavin Hu @ 2019-07-03  8:58 UTC (permalink / raw)
  To: dev; +Cc: nd

In acquiring a spinlock, cores repeatedly poll the lock variable.
This polling is replaced by a WFE-based acquisition loop on aarch64.

5~10% performance gain was measured by running spinlock_autotest on
14 isolated cores of ThunderX2.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 .../common/include/arch/arm/rte_spinlock.h         | 25 ++++++++++++++++++++++
 .../common/include/generic/rte_spinlock.h          |  2 +-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
index 1a6916b..f25d17f 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
@@ -16,6 +16,31 @@ extern "C" {
 #include <rte_common.h>
 #include "generic/rte_spinlock.h"
 
+/* armv7a does support WFE, but an explicit wake-up signal using SEV is
+ * required (must be preceded by DSB to drain the store buffer) and
+ * this is less performant, so keep armv7a implementation unchanged.
+ */
+#if defined(RTE_USE_WFE) && defined(RTE_ARCH_ARM64)
+static inline void
+rte_spinlock_lock(rte_spinlock_t *sl)
+{
+	unsigned int tmp;
+	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
+	 * faqs/ka16809.html
+	 */
+	asm volatile(
+		"sevl\n"
+		"1:	wfe\n"
+		"2:	ldaxr %w[tmp], %w[locked]\n"
+		"cbnz   %w[tmp], 1b\n"
+		"stxr   %w[tmp], %w[one], %w[locked]\n"
+		"cbnz   %w[tmp], 2b\n"
+		: [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
+		: [one] "r" (1)
+		: "cc", "memory");
+}
+#endif
+
 static inline int rte_tm_supported(void)
 {
 	return 0;
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index 87ae7a4..cf4f15b 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -57,7 +57,7 @@ rte_spinlock_init(rte_spinlock_t *sl)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl);
 
-#ifdef RTE_FORCE_INTRINSICS
+#if defined(RTE_FORCE_INTRINSICS) && !defined(RTE_USE_WFE)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl)
 {
-- 
2.7.4
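The aarch64 assembly above is a test-and-test-and-set loop: wait (with WFE) while the lock looks taken, then try to claim it with an exclusive store. A portable C sketch of the same shape, with hypothetical `demo_*` names and `cpu_relax()` in place of WFE, is below; the inner read-only loop is the part the patch turns into a low-power wait.

```c
#include <stdint.h>

/* Hypothetical spin hint standing in for rte_pause()/WFE. */
static inline void
cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#elif defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#else
	__asm__ volatile("" ::: "memory");
#endif
}

/* Hypothetical type, not DPDK's rte_spinlock_t. */
typedef struct {
	volatile int locked;	/* 0 = free, 1 = taken */
} demo_spinlock_t;

static inline void
demo_spinlock_lock(demo_spinlock_t *sl)
{
	int exp = 0;

	/* Try to claim the lock; acquire ordering on success. */
	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
					    __ATOMIC_ACQUIRE,
					    __ATOMIC_RELAXED)) {
		/* Taken: wait with loads only (no stores, so the cache line
		 * is not bounced) until it looks free. This read-only wait
		 * is what the patch's ldaxr/wfe loop sleeps through. */
		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
			cpu_relax();
		exp = 0;	/* the failed CAS overwrote it */
	}
}

static inline int
demo_spinlock_trylock(demo_spinlock_t *sl)
{
	int exp = 0;

	return __atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
					   __ATOMIC_ACQUIRE,
					   __ATOMIC_RELAXED);
}

static inline void
demo_spinlock_unlock(demo_spinlock_t *sl)
{
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
```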


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (10 preceding siblings ...)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 4/5] spinlock: use wfe to reduce contention " Gavin Hu
@ 2019-07-03  8:58 ` Gavin Hu
  2019-07-20  7:03   ` Pavan Nikhilesh Bhagavatula
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Gavin Hu @ 2019-07-03  8:58 UTC (permalink / raw)
  To: dev; +Cc: nd

Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
It can be enabled selectively based on performance benchmarking on the target platform.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 config/arm/meson.build     | 1 +
 config/common_armv8a_linux | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 6fa06a1..939d60e 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa, machine_args_generic]
 impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
 
 dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
+dpdk_conf.set('RTE_USE_WFE', 0)
 
 if not dpdk_conf.get('RTE_ARCH_64')
 	dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
diff --git a/config/common_armv8a_linux b/config/common_armv8a_linux
index 72091de..ae87a87 100644
--- a/config/common_armv8a_linux
+++ b/config/common_armv8a_linux
@@ -12,6 +12,12 @@ CONFIG_RTE_ARCH_64=y
 
 CONFIG_RTE_FORCE_INTRINSICS=y
 
+# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs;
+# calling these APIs puts the cores into a low power state while waiting
+# for the memory location to become equal to the expected value.
+# This is supported only by aarch64.
+CONFIG_RTE_USE_WFE=n
+
 # Maximum available cache line size in arm64 implementations.
 # Setting to maximum available cache line size in generic config
 # to address minimum DMA alignment across all arm64 implementations.
-- 
2.7.4
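A note on the "disabled by default" meson entry, stated as an assumption to double-check: the patches guard the WFE code with `#ifdef RTE_USE_WFE`, and as far as I know meson's `configure_file` emits an integer value such as `dpdk_conf.set('RTE_USE_WFE', 0)` as `#define RTE_USE_WFE 0`, which `#ifdef` treats as enabled (a boolean `false` is emitted as `#undef` instead). The stand-in header line below is hand-written, not generated output; it only shows how the two guard styles diverge for a macro defined as 0.

```c
/* Hand-written stand-in for what a generated config header would
 * contain if the option were emitted as an integer 0 (assumption). */
#define RTE_USE_WFE 0

static int
wfe_path_ifdef(void)
{
#ifdef RTE_USE_WFE
	return 1;	/* taken: the macro is defined, even though it is 0 */
#else
	return 0;
#endif
}

static int
wfe_path_if(void)
{
#if defined(RTE_USE_WFE) && RTE_USE_WFE
	return 1;
#else
	return 0;	/* taken: the value 0 keeps the WFE path off */
#endif
}
```

If the assumption holds, either guarding with `#if RTE_USE_WFE` or setting the meson value to `false` would keep the make and meson builds consistent.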


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [dpdk-dev] [EXT] [RFC v2 1/5] eal: add the APIs to wait until equal
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 1/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-07-20  6:46   ` " Pavan Nikhilesh Bhagavatula
  0 siblings, 0 replies; 42+ messages in thread
From: Pavan Nikhilesh Bhagavatula @ 2019-07-20  6:46 UTC (permalink / raw)
  To: Gavin Hu, dev; +Cc: nd, Jerin Jacob Kollanukkaran



>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
>Sent: Wednesday, July 3, 2019 2:29 PM
>To: dev@dpdk.org
>Cc: nd@arm.com
>Subject: [EXT] [dpdk-dev] [RFC v2 1/5] eal: add the APIs to wait until equal
>
>The rte_wait_until_equalxx APIs abstract the functionality of 'polling
>for a memory location to become equal to a given value'.
>
>Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>Reviewed-by: Steve Capper <steve.capper@arm.com>
>Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
>Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>

>---
> .../common/include/arch/arm/rte_atomic_64.h        |   4 +
> .../common/include/arch/arm/rte_pause_64.h         | 106 +++++++++++++++++++++
> lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
> 3 files changed, 148 insertions(+), 1 deletion(-)
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [dpdk-dev] [RFC v2 2/5] ticketlock: use new API to reduce contention on aarch64
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
@ 2019-07-20  6:57   ` Pavan Nikhilesh Bhagavatula
  0 siblings, 0 replies; 42+ messages in thread
From: Pavan Nikhilesh Bhagavatula @ 2019-07-20  6:57 UTC (permalink / raw)
  To: Gavin Hu, dev; +Cc: nd



>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
>Sent: Wednesday, July 3, 2019 2:29 PM
>To: dev@dpdk.org
>Cc: nd@arm.com
>Subject: [dpdk-dev] [RFC v2 2/5] ticketlock: use new API to reduce contention on aarch64
>
>While using ticket lock, cores repeatedly poll the lock variable.
>This is replaced by rte_wait_until_equal API.
>
>Running ticketlock_autotest on ThunderX2, with different numbers of cores
>and depths of rings, 3%~8% performance gains were measured.

Tested on octeontx2 board.

>
>Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
>---
> lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
>diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
>index 191146f..8fa1f62 100644
>--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
>+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
>@@ -64,8 +64,7 @@ static inline __rte_experimental void
> rte_ticketlock_lock(rte_ticketlock_t *tl)
> {
> 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
>-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
>-		rte_pause();
>+	rte_wait_until_equal16(&tl->s.current, me, __ATOMIC_ACQUIRE);
> }
>
> /**
>--
>2.7.4


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [dpdk-dev] [RFC v2 4/5] spinlock: use wfe to reduce contention on aarch64
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 4/5] spinlock: use wfe to reduce contention " Gavin Hu
@ 2019-07-20  6:59   ` Pavan Nikhilesh Bhagavatula
  0 siblings, 0 replies; 42+ messages in thread
From: Pavan Nikhilesh Bhagavatula @ 2019-07-20  6:59 UTC (permalink / raw)
  To: Gavin Hu, dev; +Cc: nd



>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
>Sent: Wednesday, July 3, 2019 2:29 PM
>To: dev@dpdk.org
>Cc: nd@arm.com
>Subject: [dpdk-dev] [RFC v2 4/5] spinlock: use wfe to reduce contention on aarch64
>
>In acquiring a spinlock, cores repeatedly poll the lock variable.
>This is replaced by rte_wait_until_equal API.
>
>5~10% performance gain was measured by running spinlock_autotest on
>14 isolated cores of ThunderX2.

Tested on octeontx2 board.

>
>Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>Reviewed-by: Phil Yang <phil.yang@arm.com>
>Reviewed-by: Steve Capper <steve.capper@arm.com>
>Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
>Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>

>---
> .../common/include/arch/arm/rte_spinlock.h         | 25 ++++++++++++++++++++++
> .../common/include/generic/rte_spinlock.h          |  2 +-
> 2 files changed, 26 insertions(+), 1 deletion(-)
>
>diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
>index 1a6916b..f25d17f 100644
>--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
>+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
>@@ -16,6 +16,31 @@ extern "C" {
> #include <rte_common.h>
> #include "generic/rte_spinlock.h"
>
>+/* armv7a does support WFE, but an explicit wake-up signal using SEV is
>+ * required (must be preceded by DSB to drain the store buffer) and
>+ * this is less performant, so keep armv7a implementation unchanged.
>+ */
>+#if defined(RTE_USE_WFE) && defined(RTE_ARCH_ARM64)
>+static inline void
>+rte_spinlock_lock(rte_spinlock_t *sl)
>+{
>+	unsigned int tmp;
>+	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
>+	 * faqs/ka16809.html
>+	 */
>+	asm volatile(
>+		"sevl\n"
>+		"1:	wfe\n"
>+		"2:	ldaxr %w[tmp], %w[locked]\n"
>+		"cbnz   %w[tmp], 1b\n"
>+		"stxr   %w[tmp], %w[one], %w[locked]\n"
>+		"cbnz   %w[tmp], 2b\n"
>+		: [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
>+		: [one] "r" (1)
>+		: "cc", "memory");
>+}
>+#endif
>+
> static inline int rte_tm_supported(void)
> {
> 	return 0;
>diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
>b/lib/librte_eal/common/include/generic/rte_spinlock.h
>index 87ae7a4..cf4f15b 100644
>--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
>+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
>@@ -57,7 +57,7 @@ rte_spinlock_init(rte_spinlock_t *sl)
> static inline void
> rte_spinlock_lock(rte_spinlock_t *sl);
>
>-#ifdef RTE_FORCE_INTRINSICS
>+#if defined(RTE_FORCE_INTRINSICS) && !defined(RTE_USE_WFE)
> static inline void
> rte_spinlock_lock(rte_spinlock_t *sl)
> {
>--
>2.7.4



* Re: [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64 Gavin Hu
@ 2019-07-20  7:03   ` Pavan Nikhilesh Bhagavatula
  2019-07-23 15:47     ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 42+ messages in thread
From: Pavan Nikhilesh Bhagavatula @ 2019-07-20  7:03 UTC (permalink / raw)
  To: Gavin Hu, dev; +Cc: nd



>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
>Sent: Wednesday, July 3, 2019 2:29 PM
>To: dev@dpdk.org
>Cc: nd@arm.com
>Subject: [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64
>
>Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
>It can be enabled selectively based on the performance benchmarking.
>
>Signed-off-by: Gavin Hu <gavin.hu@arm.com>
>Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>Reviewed-by: Steve Capper <steve.capper@arm.com>
>Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>

>---
> config/arm/meson.build     | 1 +
> config/common_armv8a_linux | 6 ++++++
> 2 files changed, 7 insertions(+)
>



* [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (11 preceding siblings ...)
  2019-07-03  8:58 ` [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64 Gavin Hu
@ 2019-07-23 15:43 ` Gavin Hu
  2019-07-23 19:15   ` Honnappa Nagarahalli
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 1/5] eal: add the APIs to wait until equal Gavin Hu
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Gavin Hu @ 2019-07-23 15:43 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	gavin.hu

DPDK has multiple use cases where the core repeatedly polls a location in
memory. This polling results in many cache and memory transactions.

The Arm architecture provides the WFE (Wait For Event) instruction, which allows
a CPU core to enter a low-power state until it is woken up by an update to the
memory location being polled, thus reducing cache and memory transactions.

x86 has the PAUSE hint instruction to reduce such overhead.

The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.

For non-Arm platforms, these APIs are just wrappers around a polling loop
with rte_pause(), so there is no performance difference.
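
As a rough, portable illustration of that non-WFE fallback, the C11 sketch below spins on a load with a caller-chosen memory order. `wait_until_equal32`, `pause_hint` and the demo are hypothetical names; the real APIs take a `volatile` pointer and an `__ATOMIC_*` constant rather than a C11 `_Atomic` object, and `sched_yield()` merely stands in for `rte_pause()` so the sketch runs anywhere.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(); a real build would issue a CPU
 * hint such as x86 PAUSE or Arm YIELD instead of a syscall. */
static inline void pause_hint(void)
{
	sched_yield();
}

/* Spin until *addr == expected, loading with the caller's memory order
 * (memory_order_relaxed or memory_order_acquire). */
static inline void
wait_until_equal32(_Atomic uint32_t *addr, uint32_t expected,
		   memory_order order)
{
	while (atomic_load_explicit(addr, order) != expected)
		pause_hint();
}

/* Demo: a writer fills a payload, then sets a flag with release order;
 * the reader waits on the flag with acquire order, then reads the payload. */
static _Atomic uint32_t flag;
static uint32_t payload;

static void *writer(void *arg)
{
	(void)arg;
	payload = 42;
	atomic_store_explicit(&flag, 1, memory_order_release);
	return NULL;
}

static uint32_t demo_wait(void)
{
	pthread_t t;

	atomic_store_explicit(&flag, 0, memory_order_relaxed);
	payload = 0;
	int rc = pthread_create(&t, NULL, writer, NULL);
	assert(rc == 0);
	(void)rc;
	wait_until_equal32(&flag, 1, memory_order_acquire);
	uint32_t seen = payload;	/* ordered after the acquire load */
	pthread_join(t, NULL);
	return seen;
}
```

The acquire variant is what a lock acquisition needs; the relaxed variant suffices when a later release store provides the ordering, as in the ring tail update.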

For Arm platforms, use of WFE can be configured using CONFIG_RTE_USE_WFE
option. It is disabled by default.

Currently, use of WFE is supported only on aarch64 platforms. armv7
platforms do support the WFE instruction, but they require explicit wake-up
events (SEV) and are less performant.

Testing shows that performance varies across platforms, with
some showing degradation.

CONFIG_RTE_USE_WFE should be enabled depending on the performance on the
target platforms.

V3:
* Convert RFCs to patches
V2:
* Use inline functions instead of macros
* Add load and compare in the beginning of the APIs
* Fix some style errors in asm inline
V1:
* Add the new APIs and use them for ring and locks

Gavin Hu (5):
  eal: add the APIs to wait until equal
  ticketlock: use new API to reduce contention on aarch64
  ring: use wfe to wait for ring tail update on aarch64
  spinlock: use wfe to reduce contention on aarch64
  config: add WFE config entry for aarch64

 config/arm/meson.build                             |   1 +
 config/common_armv8a_linux                         |   6 ++
 .../common/include/arch/arm/rte_atomic_64.h        |   4 +
 .../common/include/arch/arm/rte_pause_64.h         | 106 +++++++++++++++++++++
 .../common/include/arch/arm/rte_spinlock.h         |  25 +++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
 .../common/include/generic/rte_spinlock.h          |   2 +-
 .../common/include/generic/rte_ticketlock.h        |   3 +-
 lib/librte_ring/rte_ring_c11_mem.h                 |   4 +-
 lib/librte_ring/rte_ring_generic.h                 |   3 +-
 10 files changed, 185 insertions(+), 8 deletions(-)

-- 
2.7.4



* [dpdk-dev] [PATCH v3 1/5] eal: add the APIs to wait until equal
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (12 preceding siblings ...)
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64 Gavin Hu
@ 2019-07-23 15:43 ` Gavin Hu
  2019-07-24 11:52   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Gavin Hu @ 2019-07-23 15:43 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	gavin.hu

The rte_wait_until_equal16/32/64 APIs abstract the functionality of 'polling
for a memory location to become equal to a given value'.
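
A minimal sketch of a WFE-based 32-bit wait with acquire semantics on aarch64, with a plain polling loop on other targets. It follows the same LDAXR/SEVL/WFE pattern as the patch below, but, as an assumption of this sketch, passes the address in a register with `[%x[addr]]` addressing instead of a `"Q"` memory operand. `wait_until_equal32_acquire` and `demo_gate` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Wait until *addr == expected, with acquire semantics.
 * On aarch64, WFE parks the core until an event arrives; clearing the
 * exclusive monitor armed by LDAXR (e.g. another core storing to *addr)
 * generates such an event. */
static inline void
wait_until_equal32_acquire(volatile uint32_t *addr, uint32_t expected)
{
#if defined(__aarch64__)
	uint32_t tmp;

	__asm__ volatile(
		"ldaxr	%w[tmp], [%x[addr]]\n"	/* load-acquire, arm monitor */
		"cmp	%w[tmp], %w[exp]\n"
		"b.eq	2f\n"			/* already equal: skip waiting */
		"sevl\n"			/* first WFE falls through */
		"1:	wfe\n"			/* sleep until an event */
		"ldaxr	%w[tmp], [%x[addr]]\n"	/* reload, re-arm monitor */
		"cmp	%w[tmp], %w[exp]\n"
		"b.ne	1b\n"
		"2:\n"
		: [tmp] "=&r" (tmp)
		: [addr] "r" (addr), [exp] "r" (expected)
		: "cc", "memory");
#else
	/* Portable fallback: acquire-load polling loop. */
	while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) != expected)
		;
#endif
}

/* Demo: another thread opens the gate with a release store. */
static volatile uint32_t gate;

static void *opener(void *arg)
{
	(void)arg;
	__atomic_store_n(&gate, 1, __ATOMIC_RELEASE);
	return NULL;
}

static int demo_gate(void)
{
	pthread_t t;

	gate = 0;
	int rc = pthread_create(&t, NULL, opener, NULL);
	assert(rc == 0);
	(void)rc;
	wait_until_equal32_acquire(&gate, 1);
	pthread_join(t, NULL);
	return (int)gate;
}
```

The initial load and compare lets the common "already equal" case return without executing SEVL/WFE at all, which matches the V2 change "Add load and compare in the beginning of the APIs".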

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 .../common/include/arch/arm/rte_atomic_64.h        |   4 +
 .../common/include/arch/arm/rte_pause_64.h         | 106 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
 3 files changed, 148 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 97060e4..8d742c6 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -15,8 +15,12 @@ extern "C" {
 
 #include "generic/rte_atomic.h"
 
+#ifndef dsb
 #define dsb(opt) asm volatile("dsb " #opt : : : "memory")
+#endif
+#ifndef dmb
 #define dmb(opt) asm volatile("dmb " #opt : : : "memory")
+#endif
 
 #define rte_mb() dsb(sy)
 
diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
index 93895d3..1f7be0a 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
@@ -17,6 +17,112 @@ static inline void rte_pause(void)
 	asm volatile("yield" ::: "memory");
 }
 
+#ifdef RTE_USE_WFE
+/* Wait for *addr to be updated with expected value */
+static __rte_always_inline void
+rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected, int memorder)
+{
+	uint16_t tmp;
+	if (memorder == __ATOMIC_RELAXED)
+		asm volatile(
+			"ldxrh	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldxrh	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+	else
+		asm volatile(
+			"ldaxrh %w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldaxrh	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+}
+
+static __rte_always_inline void
+rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected, int memorder)
+{
+	uint32_t tmp;
+	if (memorder == __ATOMIC_RELAXED)
+		asm volatile(
+			"ldxr	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldxr	%w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+	else
+		asm volatile(
+			"ldaxr  %w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldaxr  %w[tmp], %w[addr]\n"
+			"cmp	%w[tmp], %w[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+}
+
+static __rte_always_inline void
+rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected, int memorder)
+{
+	uint64_t tmp;
+	if (memorder == __ATOMIC_RELAXED)
+		asm volatile(
+			"ldxr	%x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldxr	%x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+	else
+		asm volatile(
+			"ldaxr  %x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"b.eq	2f\n"
+			"sevl\n"
+			"1:	wfe\n"
+			"ldaxr  %x[tmp], %x[addr]\n"
+			"cmp	%x[tmp], %x[expected]\n"
+			"bne	1b\n"
+			"2:\n"
+			: [tmp] "=&r" (tmp)
+			: [addr] "Q"(*addr), [expected] "r"(expected)
+			: "cc", "memory");
+}
+
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
index 52bd4db..8f5f025 100644
--- a/lib/librte_eal/common/include/generic/rte_pause.h
+++ b/lib/librte_eal/common/include/generic/rte_pause.h
@@ -4,7 +4,6 @@
 
 #ifndef _RTE_PAUSE_H_
 #define _RTE_PAUSE_H_
-
 /**
  * @file
  *
@@ -12,6 +11,10 @@
  *
  */
 
+#include <stdint.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+
 /**
  * Pause CPU execution for a short while
  *
@@ -20,4 +23,38 @@
  */
 static inline void rte_pause(void);
 
+#if !defined(RTE_USE_WFE)
+#ifdef RTE_USE_C11_MEM_MODEL
+#define __rte_wait_until_equal(addr, expected, memorder) do {\
+	while (__atomic_load_n(addr, memorder) != expected) \
+		rte_pause();\
+} while (0)
+#else
+#define __rte_wait_until_equal(addr, expected, memorder) do {\
+	while (*addr != expected)\
+		rte_pause();\
+	if (memorder != __ATOMIC_RELAXED)\
+		rte_smp_rmb();\
+} while (0)
+#endif
+
+static __rte_always_inline void
+rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected, int memorder)
+{
+	__rte_wait_until_equal(addr, expected, memorder);
+}
+
+static __rte_always_inline void
+rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected, int memorder)
+{
+	__rte_wait_until_equal(addr, expected, memorder);
+}
+
+static __rte_always_inline void
+rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected, int memorder)
+{
+	__rte_wait_until_equal(addr, expected, memorder);
+}
+#endif /* RTE_USE_WFE */
+
 #endif /* _RTE_PAUSE_H_ */
-- 
2.7.4



* [dpdk-dev] [PATCH v3 2/5] ticketlock: use new API to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (13 preceding siblings ...)
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 1/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-07-23 15:43 ` Gavin Hu
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu @ 2019-07-23 15:43 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	gavin.hu

While acquiring a ticket lock, cores repeatedly poll the lock variable.
This polling is replaced by the rte_wait_until_equal API.

3%~8% performance gains were measured running ticketlock_autotest on
ThunderX2 with different numbers of cores and depths of rings.
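
The ticket-lock change can be pictured with the sketch below, which uses GCC `__atomic` builtins. The struct, the function names and the `sched_yield()` stand-in for `rte_pause()` are assumptions for illustration; this is not DPDK's `rte_ticketlock_t`.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

/* Hypothetical ticket lock with the same two counters as the patch. */
typedef struct {
	uint16_t current;	/* ticket currently being served */
	uint16_t next;		/* next ticket to hand out */
} ticketlock_sketch_t;

static inline void
wait_until_equal16(volatile uint16_t *addr, uint16_t expected, int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected)
		sched_yield();	/* stand-in for rte_pause() */
}

static void ticket_lock(ticketlock_sketch_t *tl)
{
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);
	/* Acquire pairs with the release store in ticket_unlock(). */
	wait_until_equal16(&tl->current, me, __ATOMIC_ACQUIRE);
}

static void ticket_unlock(ticketlock_sketch_t *tl)
{
	/* Only the holder writes current, so a relaxed load is enough. */
	uint16_t cur = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);
	__atomic_store_n(&tl->current, (uint16_t)(cur + 1), __ATOMIC_RELEASE);
}

/* Demo: several threads increment a shared counter under the lock. */
static ticketlock_sketch_t lock;
static long counter;

static void *worker(void *arg)
{
	int iters = *(const int *)arg;

	for (int i = 0; i < iters; i++) {
		ticket_lock(&lock);
		counter++;
		ticket_unlock(&lock);
	}
	return NULL;
}

static long run_ticketlock_demo(int nthreads, int iters)
{
	pthread_t t[8];

	assert(nthreads > 0 && nthreads <= 8);
	counter = 0;
	for (int i = 0; i < nthreads; i++)
		if (pthread_create(&t[i], NULL, worker, &iters) != 0)
			return -1;
	for (int i = 0; i < nthreads; i++)
		pthread_join(t[i], NULL);
	return counter;
}
```

Only the waiting step differs from a plain polling ticket lock, which is why the patch can swap the `while`/`rte_pause()` loop for a single `rte_wait_until_equal16()` call.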

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 lib/librte_eal/common/include/generic/rte_ticketlock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/librte_eal/common/include/generic/rte_ticketlock.h b/lib/librte_eal/common/include/generic/rte_ticketlock.h
index d9bec87..f0821f2 100644
--- a/lib/librte_eal/common/include/generic/rte_ticketlock.h
+++ b/lib/librte_eal/common/include/generic/rte_ticketlock.h
@@ -66,8 +66,7 @@ static inline void
 rte_ticketlock_lock(rte_ticketlock_t *tl)
 {
 	uint16_t me = __atomic_fetch_add(&tl->s.next, 1, __ATOMIC_RELAXED);
-	while (__atomic_load_n(&tl->s.current, __ATOMIC_ACQUIRE) != me)
-		rte_pause();
+	rte_wait_until_equal16(&tl->s.current, me, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.7.4



* [dpdk-dev] [PATCH v3 3/5] ring: use wfe to wait for ring tail update on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (14 preceding siblings ...)
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
@ 2019-07-23 15:43 ` " Gavin Hu
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 4/5] spinlock: use wfe to reduce contention " Gavin Hu
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64 Gavin Hu
  17 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu @ 2019-07-23 15:43 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	gavin.hu

Instead of polling for the tail to be updated, use the WFE instruction.
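
The producer-side tail update touched by this patch can be sketched as follows. It assumes a simplified reservation step (`__atomic_fetch_add` on `head` instead of the compare-and-swap loop the ring uses) and hypothetical names; only the wait followed by a release store in `update_tail_sketch` corresponds to the changed lines.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

/* Hypothetical head/tail pair, loosely modelled on struct rte_ring_headtail. */
struct headtail_sketch {
	uint32_t head;	/* next slot to reserve */
	uint32_t tail;	/* slots published to consumers */
};

static struct headtail_sketch prod;
static uint32_t slots[1 << 16];

static void
update_tail_sketch(struct headtail_sketch *ht, uint32_t old_val,
		   uint32_t new_val)
{
	/* Earlier producers may still be copying: wait for them to publish.
	 * A relaxed load suffices; the release store below orders our writes. */
	while (__atomic_load_n(&ht->tail, __ATOMIC_RELAXED) != old_val)
		sched_yield();	/* stand-in for rte_pause() or a WFE wait */
	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}

static void *producer(void *arg)
{
	uint32_t id = (uint32_t)(uintptr_t)arg;

	for (int i = 0; i < 1000; i++) {
		uint32_t old = __atomic_fetch_add(&prod.head, 1, __ATOMIC_RELAXED);
		slots[old & 0xffff] = id;	/* fill the reserved slot */
		update_tail_sketch(&prod, old, old + 1);
	}
	return NULL;
}

static uint32_t run_producers(int n)
{
	pthread_t t[8];

	assert(n > 0 && n <= 8);
	prod.head = prod.tail = 0;
	for (int i = 0; i < n; i++)
		if (pthread_create(&t[i], NULL, producer, (void *)(uintptr_t)i) != 0)
			return 0;
	for (int i = 0; i < n; i++)
		pthread_join(t[i], NULL);
	return __atomic_load_n(&prod.tail, __ATOMIC_ACQUIRE);
}
```

Tails are published strictly in reservation order, so after all producers finish the tail equals the total number of reserved slots.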

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 4 ++--
 lib/librte_ring/rte_ring_generic.h | 3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a3..037811e 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -2,6 +2,7 @@
  *
  * Copyright (c) 2017,2018 HXT-semitech Corporation.
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * Copyright (c) 2019 Arm Limited
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
  * Used as BSD-3 Licensed with permission from Kip Macy.
@@ -21,8 +22,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 953cdbb..570765c 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -23,8 +23,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 	 * we need to wait for them to complete
 	 */
 	if (!single)
-		while (unlikely(ht->tail != old_val))
-			rte_pause();
+		rte_wait_until_equal32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
 	ht->tail = new_val;
 }
-- 
2.7.4



* [dpdk-dev] [PATCH v3 4/5] spinlock: use wfe to reduce contention on aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (15 preceding siblings ...)
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
@ 2019-07-23 15:43 ` " Gavin Hu
  2019-07-24 12:17   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64 Gavin Hu
  17 siblings, 1 reply; 42+ messages in thread
From: Gavin Hu @ 2019-07-23 15:43 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	gavin.hu

In acquiring a spinlock, cores repeatedly poll the lock variable.
On aarch64 with RTE_USE_WFE, this polling is replaced by a WFE-based wait.

5~10% performance gain was measured by running spinlock_autotest on
14 isolated cores of ThunderX2.
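
For comparison, a portable test-and-test-and-set spinlock looks roughly like the sketch below: a compare-and-swap with acquire order, plus a read-only inner loop while the lock is held. That inner loop is the polling which the aarch64 LDAXR/WFE/STXR sequence in this patch replaces with WFE. The names are hypothetical and `sched_yield()` stands in for `rte_pause()`.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

typedef struct {
	volatile int locked;	/* 0 = free, 1 = held */
} spinlock_sketch_t;

static void spin_lock(spinlock_sketch_t *sl)
{
	int exp = 0;

	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
					    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
		/* Read-only wait until the lock looks free, then retry. */
		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
			sched_yield();	/* stand-in for rte_pause() */
		exp = 0;	/* the failed CAS overwrote exp */
	}
}

static void spin_unlock(spinlock_sketch_t *sl)
{
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}

/* Demo: several threads increment a shared counter under the lock. */
static spinlock_sketch_t sl;
static long counter;

static void *worker(void *arg)
{
	int iters = *(const int *)arg;

	for (int i = 0; i < iters; i++) {
		spin_lock(&sl);
		counter++;
		spin_unlock(&sl);
	}
	return NULL;
}

static long run_spinlock_demo(int nthreads, int iters)
{
	pthread_t t[8];

	assert(nthreads > 0 && nthreads <= 8);
	counter = 0;
	for (int i = 0; i < nthreads; i++)
		if (pthread_create(&t[i], NULL, worker, &iters) != 0)
			return -1;
	for (int i = 0; i < nthreads; i++)
		pthread_join(t[i], NULL);
	return counter;
}
```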

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 .../common/include/arch/arm/rte_spinlock.h         | 25 ++++++++++++++++++++++
 .../common/include/generic/rte_spinlock.h          |  2 +-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
index 1a6916b..f25d17f 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
@@ -16,6 +16,31 @@ extern "C" {
 #include <rte_common.h>
 #include "generic/rte_spinlock.h"
 
+/* armv7a does support WFE, but an explicit wake-up signal using SEV is
+ * required (must be preceded by DSB to drain the store buffer) and
+ * this is less performant, so keep armv7a implementation unchanged.
+ */
+#if defined(RTE_USE_WFE) && defined(RTE_ARCH_ARM64)
+static inline void
+rte_spinlock_lock(rte_spinlock_t *sl)
+{
+	unsigned int tmp;
+	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
+	 * faqs/ka16809.html
+	 */
+	asm volatile(
+		"sevl\n"
+		"1:	wfe\n"
+		"2:	ldaxr %w[tmp], %w[locked]\n"
+		"cbnz   %w[tmp], 1b\n"
+		"stxr   %w[tmp], %w[one], %w[locked]\n"
+		"cbnz   %w[tmp], 2b\n"
+		: [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
+		: [one] "r" (1)
+		: "cc", "memory");
+}
+#endif
+
 static inline int rte_tm_supported(void)
 {
 	return 0;
diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
index 87ae7a4..cf4f15b 100644
--- a/lib/librte_eal/common/include/generic/rte_spinlock.h
+++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
@@ -57,7 +57,7 @@ rte_spinlock_init(rte_spinlock_t *sl)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl);
 
-#ifdef RTE_FORCE_INTRINSICS
+#if defined(RTE_FORCE_INTRINSICS) && !defined(RTE_USE_WFE)
 static inline void
 rte_spinlock_lock(rte_spinlock_t *sl)
 {
-- 
2.7.4



* [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64
  2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
                   ` (16 preceding siblings ...)
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 4/5] spinlock: use wfe to reduce contention " Gavin Hu
@ 2019-07-23 15:43 ` Gavin Hu
  2019-07-23 18:05   ` Stephen Hemminger
  2019-07-24 12:25   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  17 siblings, 2 replies; 42+ messages in thread
From: Gavin Hu @ 2019-07-23 15:43 UTC (permalink / raw)
  To: dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula, Honnappa.Nagarahalli,
	gavin.hu

Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
It can be enabled selectively based on the performance benchmarking.
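
Because the option is resolved at build time, the header decides which implementation to compile. The sketch below uses a hypothetical flag `USE_WFE_SKETCH` in place of RTE_USE_WFE. One design point it illustrates: if the build system always emits the macro with a 0/1 value (meson's `configuration_data.set('NAME', 0)` writes `#define NAME 0`), an `#ifdef` test would treat the disabled value as enabled, whereas testing the value with `#if` does not.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical build flag standing in for RTE_USE_WFE. */
#ifndef USE_WFE_SKETCH
#define USE_WFE_SKETCH 0
#endif

/* Test the value, not mere definedness: "#define X 0" is still defined. */
#if USE_WFE_SKETCH && defined(__aarch64__)
#define WAIT_IMPL "wfe"
#else
#define WAIT_IMPL "pause-loop"
#endif

static const char *wait_impl_name(void)
{
	return WAIT_IMPL;
}
```

Building with `-DUSE_WFE_SKETCH=1` on aarch64 would select the WFE path; every other combination falls back to the pause loop.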

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
 config/arm/meson.build     | 1 +
 config/common_armv8a_linux | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 979018e..496813a 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa, machine_args_generic]
 impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
 
 dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
+dpdk_conf.set('RTE_USE_WFE', 0)
 
 if not dpdk_conf.get('RTE_ARCH_64')
 	dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
diff --git a/config/common_armv8a_linux b/config/common_armv8a_linux
index 481712e..48c7ab5 100644
--- a/config/common_armv8a_linux
+++ b/config/common_armv8a_linux
@@ -12,6 +12,12 @@ CONFIG_RTE_ARCH_64=y
 
 CONFIG_RTE_FORCE_INTRINSICS=y
 
+# Use WFE instructions to implement the rte_wait_until_equal16/32/64 APIs.
+# Calling these APIs puts the core into a low-power state while waiting
+# for the memory location to become equal to the expected value.
+# This is supported only on aarch64.
+CONFIG_RTE_USE_WFE=n
+
 # Maximum available cache line size in arm64 implementations.
 # Setting to maximum available cache line size in generic config
 # to address minimum DMA alignment across all arm64 implementations.
-- 
2.7.4



* Re: [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64
  2019-07-20  7:03   ` Pavan Nikhilesh Bhagavatula
@ 2019-07-23 15:47     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-23 15:47 UTC (permalink / raw)
  To: Pavan Nikhilesh Bhagavatula, dev, Stephen Hemminger, thomas; +Cc: nd

> -----Original Message-----
> From: Pavan Nikhilesh Bhagavatula <pbhagavatula@marvell.com>
> Sent: Saturday, July 20, 2019 3:03 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for
> aarch64
> 
> 
> 
> >-----Original Message-----
> >From: dev <dev-bounces@dpdk.org> On Behalf Of Gavin Hu
> >Sent: Wednesday, July 3, 2019 2:29 PM
> >To: dev@dpdk.org
> >Cc: nd@arm.com
> >Subject: [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for
> >aarch64
> >
> >Add the RTE_USE_WFE configuration entry for aarch64, disabled by
> >default.
> >It can be enabled selectively based on the performance benchmarking.
> >
> >Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> >Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >Reviewed-by: Steve Capper <steve.capper@arm.com>
> >Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Hi Stephen,
I just converted the RFCs to patches in V3. Could you check whether the comments you made on the RFCs have been addressed?
Thanks Pavan for review and testing!
Best regards,
Gavin
> 
> >---
> > config/arm/meson.build     | 1 +
> > config/common_armv8a_linux | 6 ++++++
> > 2 files changed, 7 insertions(+)
> >



* Re: [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64 Gavin Hu
@ 2019-07-23 18:05   ` Stephen Hemminger
  2019-07-23 19:10     ` Honnappa Nagarahalli
  2019-07-24 12:25   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  1 sibling, 1 reply; 42+ messages in thread
From: Stephen Hemminger @ 2019-07-23 18:05 UTC (permalink / raw)
  To: Gavin Hu; +Cc: dev, nd, thomas, jerinj, pbhagavatula, Honnappa.Nagarahalli

On Tue, 23 Jul 2019 23:43:46 +0800
Gavin Hu <gavin.hu@arm.com> wrote:

> Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
> It can be enabled selectively based on the performance benchmarking.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
>  config/arm/meson.build     | 1 +
>  config/common_armv8a_linux | 6 ++++++
>  2 files changed, 7 insertions(+)
> 
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 979018e..496813a 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa, machine_args_generic]
>  impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
>  
>  dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
> +dpdk_conf.set('RTE_USE_WFE', 0)
>  
>  if not dpdk_conf.get('RTE_ARCH_64')
>  	dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
> diff --git a/config/common_armv8a_linux b/config/common_armv8a_linux
> index 481712e..48c7ab5 100644
> --- a/config/common_armv8a_linux
> +++ b/config/common_armv8a_linux
> @@ -12,6 +12,12 @@ CONFIG_RTE_ARCH_64=y
>  
>  CONFIG_RTE_FORCE_INTRINSICS=y
>  
> +# Use WFE instructions to implement the rte_wait_until_equal16/32/64 APIs.
> +# Calling these APIs puts the core into a low-power state while waiting
> +# for the memory location to become equal to the expected value.
> +# This is supported only on aarch64.
> +CONFIG_RTE_USE_WFE=n
> +
>  # Maximum available cache line size in arm64 implementations.
>  # Setting to maximum available cache line size in generic config
>  # to address minimum DMA alignment across all arm64 implementations.

Introducing config options is a maintenance nightmare.
How are distributions supposed to ship a package?
Does full regression test get done on both options?

The user should not be able to change this.


* Re: [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64
  2019-07-23 18:05   ` Stephen Hemminger
@ 2019-07-23 19:10     ` Honnappa Nagarahalli
  2019-07-24 17:59       ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 42+ messages in thread
From: Honnappa Nagarahalli @ 2019-07-23 19:10 UTC (permalink / raw)
  To: Stephen Hemminger, Gavin Hu (Arm Technology China)
  Cc: dev, nd, thomas, jerinj, pbhagavatula, Honnappa Nagarahalli, nd

> 
> On Tue, 23 Jul 2019 23:43:46 +0800
> Gavin Hu <gavin.hu@arm.com> wrote:
> 
> > Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
> > It can be enabled selectively based on the performance benchmarking.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > ---
> >  config/arm/meson.build     | 1 +
> >  config/common_armv8a_linux | 6 ++++++
> >  2 files changed, 7 insertions(+)
> >
> > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > index 979018e..496813a 100644
> > --- a/config/arm/meson.build
> > +++ b/config/arm/meson.build
> > @@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa, machine_args_generic]
> >  impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
> >
> >  dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
> > +dpdk_conf.set('RTE_USE_WFE', 0)
> >
> >  if not dpdk_conf.get('RTE_ARCH_64')
> >  	dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
> > diff --git a/config/common_armv8a_linux b/config/common_armv8a_linux
> > index 481712e..48c7ab5 100644
> > --- a/config/common_armv8a_linux
> > +++ b/config/common_armv8a_linux
> > @@ -12,6 +12,12 @@ CONFIG_RTE_ARCH_64=y
> >
> >  CONFIG_RTE_FORCE_INTRINSICS=y
> >
> > +# Use WFE instructions to implement the rte_wait_until_equal16/32/64 APIs.
> > +# Calling these APIs puts the core into a low-power state while waiting
> > +# for the memory location to become equal to the expected value.
> > +# This is supported only on aarch64.
> > +CONFIG_RTE_USE_WFE=n
> > +
> >  # Maximum available cache line size in arm64 implementations.
> >  # Setting to maximum available cache line size in generic config
> >  # to address minimum DMA alignment across all arm64 implementations.
> 
> Introducing config options is a maintenance nightmare.
> How are distributions supposed to ship a package?
> Does full regression test get done on both options?
> 
> The user should not be able to change this.
Agree with these concerns here. In our tests, we are finding that this patch does not result in performance improvements on all micro-architectures. Maybe these micro-architectures will evolve in the future, knowing that WFE is being used in DPDK. But at this point, it does not make sense to enable this by default. This means additional testing/regression with the flag enabled. We could add this to the Travis build (Travis yml file).

Currently, this patch will address use cases where the target hardware/environment is known during compilation.


* Re: [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64 Gavin Hu
@ 2019-07-23 19:15   ` Honnappa Nagarahalli
  2019-07-23 21:27     ` Thomas Monjalon
  0 siblings, 1 reply; 42+ messages in thread
From: Honnappa Nagarahalli @ 2019-07-23 19:15 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), dev
  Cc: nd, thomas, stephen, jerinj, pbhagavatula,
	Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd

Hi Gavin,
	I think this should have been V1 (I mean, no versioning, just 'PATCH'), since it was converted from RFCs to a patch. I think we should be able to resend it as V1 and mark this V3 as 'superseded'.

Hi Thomas,
	Please let us know whether it is worth fixing the version.

Thanks,
Honnappa

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Tuesday, July 23, 2019 10:44 AM
> To: dev@dpdk.org
> Cc: nd <nd@arm.com>; thomas@monjalon.net;
> stephen@networkplumber.org; jerinj@marvell.com;
> pbhagavatula@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>
> Subject: [PATCH v3 0/5] use WFE for locks and ring on aarch64
> 
> DPDK has multiple use cases where the core repeatedly polls a location in
> memory. This polling results in many cache and memory transactions.
> 
> The Arm architecture provides the WFE (Wait For Event) instruction, which allows
> a CPU core to enter a low-power state until it is woken up by an update to the
> memory location being polled, thus reducing cache and memory transactions.
> 
> x86 has the PAUSE hint instruction to reduce such overhead.
> 
> The rte_wait_until_equal_xxx APIs abstract the functionality of 'polling for a
> memory location to become equal to a given value'.
> 
> For non-Arm platforms, these APIs are just wrappers around a polling loop
> with rte_pause(), so there is no performance difference.
> 
> For Arm platforms, use of WFE can be configured using
> CONFIG_RTE_USE_WFE option. It is disabled by default.
> 
> Currently, use of WFE is supported only for aarch64 platforms. armv7
> platforms do support the WFE instruction, but they require explicit wake up
> events(sev) and are less performannt.
> 
> Testing shows that, performance varies across different platforms, with some
> showing degradation.
> 
> CONFIG_RTE_USE_WFE should be enabled depending on the performance on
> the target platforms.
> 
> V3:
> * Convert RFCs to patches
> V2:
> * Use inline functions instead of macros
> * Add load and compare in the beginning of the APIs
> * Fix some style errors in asm inline
> V1:
> * Add the new APIs and use it for ring and locks
> 
> Gavin Hu (5):
>   eal: add the APIs to wait until equal
>   ticketlock: use new API to reduce contention on aarch64
>   ring: use wfe to wait for ring tail update on aarch64
>   spinlock: use wfe to reduce contention on aarch64
>   config: add WFE config entry for aarch64
> 
>  config/arm/meson.build                             |   1 +
>  config/common_armv8a_linux                         |   6 ++
>  .../common/include/arch/arm/rte_atomic_64.h        |   4 +
>  .../common/include/arch/arm/rte_pause_64.h         | 106 +++++++++++++++++++++
>  .../common/include/arch/arm/rte_spinlock.h         |  25 +++++
>  lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
>  .../common/include/generic/rte_spinlock.h          |   2 +-
>  .../common/include/generic/rte_ticketlock.h        |   3 +-
>  lib/librte_ring/rte_ring_c11_mem.h                 |   4 +-
>  lib/librte_ring/rte_ring_generic.h                 |   3 +-
>  10 files changed, 185 insertions(+), 8 deletions(-)
> 
> --
> 2.7.4
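
The ticketlock patch in this series replaces a polling loop with the new API. As a minimal sketch of that usage pattern, assuming GCC/Clang `__atomic` builtins: `sketch_wait_until_equal16()` is a hypothetical stand-in for the proposed `rte_wait_until_equal16()` that simply polls with a pause hint (no WFE), and the `sketch_ticketlock_*` names are illustrative, not DPDK's API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_wait_until_equal16(): poll until
 * *addr == expected, using 'memorder' for each load. */
static inline void
sketch_wait_until_equal16(volatile uint16_t *addr, uint16_t expected,
			  int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected) {
#if defined(__x86_64__) || defined(__i386__)
		__builtin_ia32_pause();                        /* x86 PAUSE hint */
#elif defined(__aarch64__)
		__asm__ __volatile__("yield" : : : "memory");  /* arm64 hint */
#endif
	}
}

/* 'next' hands out tickets; 'current' is the ticket now being served. */
typedef struct {
	uint16_t current;
	uint16_t next;
} sketch_ticketlock_t;

static inline void
sketch_ticketlock_lock(sketch_ticketlock_t *tl)
{
	uint16_t me = __atomic_fetch_add(&tl->next, 1, __ATOMIC_RELAXED);

	/* Acquire order: the critical section cannot start before we
	 * observe that our ticket is being served. */
	sketch_wait_until_equal16(&tl->current, me, __ATOMIC_ACQUIRE);
}

static inline void
sketch_ticketlock_unlock(sketch_ticketlock_t *tl)
{
	/* Only the lock holder writes 'current', so a relaxed read is enough. */
	uint16_t i = __atomic_load_n(&tl->current, __ATOMIC_RELAXED);

	__atomic_store_n(&tl->current, (uint16_t)(i + 1), __ATOMIC_RELEASE);
}

/* Single-threaded usage check. */
static void
sketch_ticketlock_demo(void)
{
	sketch_ticketlock_t tl = { 0, 0 };

	sketch_ticketlock_lock(&tl);     /* ticket 0, served immediately */
	sketch_ticketlock_unlock(&tl);
	sketch_ticketlock_lock(&tl);     /* ticket 1 */
	assert(tl.next == 2 && tl.current == 1);
	sketch_ticketlock_unlock(&tl);
	assert(tl.current == 2);
}
```

Waiting with acquire order is what lets the lock path skip a separate barrier after the loop; the release store in unlock pairs with it.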



* Re: [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64
  2019-07-23 19:15   ` Honnappa Nagarahalli
@ 2019-07-23 21:27     ` Thomas Monjalon
  2019-07-24  2:44       ` Honnappa Nagarahalli
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Monjalon @ 2019-07-23 21:27 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Gavin Hu (Arm Technology China), dev, nd, stephen, jerinj, pbhagavatula

23/07/2019 21:15, Honnappa Nagarahalli:
> Hi Gavin,
> 	I think this should have been V1 (I mean, no versioning, just 'PATCH'), since it has been converted to a patch. I think we should be able to resend it as V1 and mark this V3 as 'superseded'.
> 
> Hi Thomas,
> 	Please let us know if it is worth/helps fixing the version.

I don't follow why it should be v1.





* Re: [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64
  2019-07-23 21:27     ` Thomas Monjalon
@ 2019-07-24  2:44       ` Honnappa Nagarahalli
  2019-07-24  7:43         ` Thomas Monjalon
  0 siblings, 1 reply; 42+ messages in thread
From: Honnappa Nagarahalli @ 2019-07-24  2:44 UTC (permalink / raw)
  To: thomas
  Cc: Gavin Hu (Arm Technology China),
	dev, nd, stephen, jerinj, pbhagavatula, Honnappa Nagarahalli, nd

> 
> 23/07/2019 21:15, Honnappa Nagarahalli:
> > Hi Gavin,
> > 	I think this should have been V1 (I mean, no versioning, just 'PATCH'),
> > since it has been converted to a patch. I think we should be able to resend it as V1 and
> > mark this V3 as 'superseded'.
> >
> > Hi Thomas,
> > 	Please let us know if it is worth/helps fixing the version.
> 
> I don't follow why it should be v1.
This patch series was an RFC (RFC V1 and RFC V2). Since it has been converted to a patch, I thought it should start with V1.

> 
> 



* Re: [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64
  2019-07-24  2:44       ` Honnappa Nagarahalli
@ 2019-07-24  7:43         ` Thomas Monjalon
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Monjalon @ 2019-07-24  7:43 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Gavin Hu (Arm Technology China), dev, nd, stephen, jerinj, pbhagavatula

24/07/2019 04:44, Honnappa Nagarahalli:
> > 23/07/2019 21:15, Honnappa Nagarahalli:
> > > Hi Gavin,
> > > 	I think this should have been V1 (I mean, no versioning, just 'PATCH'),
> > > since it has been converted to a patch. I think we should be able to resend it as V1 and
> > > mark this V3 as 'superseded'.
> > >
> > > Hi Thomas,
> > > 	Please let us know if it is worth/helps fixing the version.
> > 
> > I don't follow why it should be v1.
> 
> This patch series was an RFC (RFC V1 and RFC V2). Since it has been converted to a patch, I thought it should start with V1.

No, it can keep incrementing; that is OK and clear.





* Re: [dpdk-dev] [EXT] [PATCH v3 1/5] eal: add the APIs to wait until equal
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 1/5] eal: add the APIs to wait until equal Gavin Hu
@ 2019-07-24 11:52   ` " Jerin Jacob Kollanukkaran
  2019-07-24 18:10     ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 42+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-07-24 11:52 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, thomas, stephen, Pavan Nikhilesh Bhagavatula, Honnappa.Nagarahalli

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Tuesday, July 23, 2019 9:14 PM
> To: dev@dpdk.org
> Cc: nd@arm.com; thomas@monjalon.net; stephen@networkplumber.org;
> Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Pavan Nikhilesh
> Bhagavatula <pbhagavatula@marvell.com>;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com
> Subject: [EXT] [PATCH v3 1/5] eal: add the APIs to wait until equal
> 
> The rte_wait_until_equalxx APIs abstract the functionality of 'polling for a
> memory location to become equal to a given value'.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
>  .../common/include/arch/arm/rte_atomic_64.h        |   4 +
>  .../common/include/arch/arm/rte_pause_64.h         | 106 +++++++++++++++++++++
>  lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
>  3 files changed, 148 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> index 97060e4..8d742c6 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> @@ -15,8 +15,12 @@ extern "C" {
> 
>  #include "generic/rte_atomic.h"
> 
> +#ifndef dsb
>  #define dsb(opt) asm volatile("dsb " #opt : : : "memory")
> +#endif
> +#ifndef dmb
>  #define dmb(opt) asm volatile("dmb " #opt : : : "memory")
> +#endif

Is this change required? Please fix the root cause.
I also see a build error:

In file included from /home/jerin/dpdk.org/build/include/rte_pause_64.h:13,
                 from /home/jerin/dpdk.org/build/include/rte_pause.h:13,
                 from /home/jerin/dpdk.org/build/include/generic/rte_spinlock.h:25,
                 from /home/jerin/dpdk.org/build/include/rte_spinlock.h:17,
                 from /home/jerin/dpdk.org/drivers/bus/fslmc/mc/mc_sys.c:10:
/home/jerin/dpdk.org/build/include/generic/rte_pause.h: In function 'rte_wait_until_equal16':
/home/jerin/dpdk.org/build/include/generic/rte_pause.h:44:49: error: macro "dmb" passed 1 arguments, but takes just 0
   44 |  __rte_wait_until_equal(addr, expected, memorder);

Command to reproduce (gcc 9.1):

rm -rf build && unset RTE_KERNELDIR && make -j  T=arm64-thunderx-linux-gcc  CROSS=aarch64-linux-gnu- && sed -ri    's,(CONFIG_RTE_KNI_KMOD=)y,\1n,' build/.config && sed -ri  's,(CONFIG_RTE_LIBRTE_VHOST_NUMA=)y,\1n,' build/.config &&  sed -ri  's,(CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=)y,\1n,' build/.config && sed -ri  's,(CONFIG_RTE_EAL_IGB_UIO=)y,\1n,' build/.config && CC="ccache gcc" make -j  test-build CROSS=aarch64-linux-gnu-
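
The error message suggests that a header included earlier by mc_sys.c already defines a zero-argument `dmb()`, so the `#ifndef dmb` guard silently drops the one-argument arm64 definition and the later `dmb(...)` use (presumably the `rte_smp_rmb()` in the non-C11 `__rte_wait_until_equal()` path) expands against the wrong macro. One possible fix for that root cause, sketched below as an assumption rather than what the series eventually did, is to give the arch-private barrier a prefixed name and define it unconditionally. `sketch_arm_dmb()`, `sketch_smp_rmb()` and the demo are hypothetical; the non-aarch64 branch exists only so the example compiles elsewhere.

```c
#include <assert.h>

/* Stand-in for a header, included earlier, that defines its own
 * zero-argument dmb(); this is what the error message points at. */
#define dmb() __atomic_thread_fence(__ATOMIC_SEQ_CST)

/* Arch-private barrier with a prefixed name (hypothetical), defined
 * unconditionally instead of behind "#ifndef dmb", so it can neither be
 * skipped nor clash with the macro above. */
#if defined(__aarch64__)
#define sketch_arm_dmb(opt) __asm__ __volatile__("dmb " #opt : : : "memory")
#else
/* Fallback only so this sketch also builds off aarch64. */
#define sketch_arm_dmb(opt) __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

/* What an rte_smp_rmb()-style macro would expand to under this naming. */
#define sketch_smp_rmb() sketch_arm_dmb(ishld)

static int sketch_flag;
static int sketch_data;

static int
sketch_read_after_flag(void)
{
	if (!__atomic_load_n(&sketch_flag, __ATOMIC_RELAXED))
		return -1;
	sketch_smp_rmb();   /* order the data load after the flag load */
	return sketch_data;
}

static void
sketch_barrier_demo(void)
{
	dmb();              /* the foreign zero-argument macro still works */
	sketch_data = 42;
	__atomic_store_n(&sketch_flag, 1, __ATOMIC_RELEASE);
	assert(sketch_read_after_flag() == 42);
}
```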


> 
>  #define rte_mb() dsb(sy)
> 
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> index 93895d3..1f7be0a 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> @@ -17,6 +17,112 @@ static inline void rte_pause(void)
>  	asm volatile("yield" ::: "memory");
>  }
> 
> +#ifdef RTE_USE_WFE

Do we need it to be under RTE_USE_WFE? If there is no harm, there is no need to add
conditional compilation; building it unconditionally helps detect build errors, especially since the config is disabled by default.

> +/* Wait for *addr to be updated with expected value */
> +static __rte_always_inline void
> +rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected,
> +		int memorder)
> +{
> +	uint16_t tmp;
> +	if (memorder == __ATOMIC_RELAXED)
> +		asm volatile(
> +			"ldxrh	%w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"b.eq	2f\n"
> +			"sevl\n"
> +			"1:	wfe\n"
> +			"ldxrh	%w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"bne	1b\n"
> +			"2:\n"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "Q"(*addr), [expected] "r"(expected)
> +			: "cc", "memory");
> +	else
> +		asm volatile(
> +			"ldaxrh %w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"b.eq	2f\n"
> +			"sevl\n"
> +			"1:	wfe\n"
> +			"ldaxrh	%w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"bne	1b\n"
> +			"2:\n"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "Q"(*addr), [expected] "r"(expected)
> +			: "cc", "memory");
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected,
> +		int memorder)
> +{
> +	uint32_t tmp;
> +	if (memorder == __ATOMIC_RELAXED)
> +		asm volatile(
> +			"ldxr	%w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"b.eq	2f\n"
> +			"sevl\n"
> +			"1:	wfe\n"
> +			"ldxr	%w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"bne	1b\n"
> +			"2:\n"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "Q"(*addr), [expected] "r"(expected)
> +			: "cc", "memory");
> +	else
> +		asm volatile(
> +			"ldaxr  %w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"b.eq	2f\n"
> +			"sevl\n"
> +			"1:	wfe\n"
> +			"ldaxr  %w[tmp], %w[addr]\n"
> +			"cmp	%w[tmp], %w[expected]\n"
> +			"bne	1b\n"
> +			"2:\n"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "Q"(*addr), [expected] "r"(expected)
> +			: "cc", "memory");
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected,
> +		int memorder)
> +{
> +	uint64_t tmp;
> +	if (memorder == __ATOMIC_RELAXED)
> +		asm volatile(
> +			"ldxr	%x[tmp], %x[addr]\n"
> +			"cmp	%x[tmp], %x[expected]\n"
> +			"b.eq	2f\n"
> +			"sevl\n"
> +			"1:	wfe\n"
> +			"ldxr	%x[tmp], %x[addr]\n"
> +			"cmp	%x[tmp], %x[expected]\n"
> +			"bne	1b\n"
> +			"2:\n"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "Q"(*addr), [expected] "r"(expected)
> +			: "cc", "memory");
> +	else
> +		asm volatile(
> +			"ldaxr  %x[tmp], %x[addr]\n"
> +			"cmp	%x[tmp], %x[expected]\n"
> +			"b.eq	2f\n"
> +			"sevl\n"
> +			"1:	wfe\n"
> +			"ldaxr  %x[tmp], %x[addr]\n"
> +			"cmp	%x[tmp], %x[expected]\n"
> +			"bne	1b\n"
> +			"2:\n"
> +			: [tmp] "=&r" (tmp)
> +			: [addr] "Q"(*addr), [expected] "r"(expected)
> +			: "cc", "memory");

Duplication of code. Please introduce a macro for the assembly skeleton,
something like:

http://patches.dpdk.org/patch/56949/
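
A minimal sketch of such a skeleton, assuming GCC/Clang extended asm: a single macro generates the 16/32/64-bit variants from the relaxed and acquire exclusive-load mnemonics plus the register-width modifier, so the load/SEVL/WFE sequence is written once. All `SKETCH_*`/`sketch_*` names are hypothetical and this is not the code at the link above; off aarch64 the skeleton falls back to plain atomic polling so the example still builds.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__)
/* One exclusive load of *addr into dst. 'ld' is the mnemonic and 'r'
 * the register-width modifier: "w" for 16/32-bit, "x" for 64-bit. */
#define SKETCH_LDX(ld, r, addr, dst) \
	__asm__ __volatile__(ld " %" r "[v], [%x[a]]" \
			     : [v] "=&r" (dst) \
			     : [a] "r" (addr) \
			     : "memory")

/* Relaxed or acquire exclusive load; expects addr, memorder and v
 * to be in scope (they are, inside SKETCH_DEFINE_WAIT below). */
#define SKETCH_LOAD(ldr, ldar, r) do { \
	if (memorder == __ATOMIC_RELAXED) \
		SKETCH_LDX(ldr, r, addr, v); \
	else \
		SKETCH_LDX(ldar, r, addr, v); \
} while (0)

/* Load; if not yet equal, SEVL sets the local event register so the
 * first WFE falls through, and each later WFE may sleep until an event,
 * such as another core's write clearing the exclusive monitor. */
#define SKETCH_WAIT(ldr, ldar, r) do { \
	SKETCH_LOAD(ldr, ldar, r); \
	if (v != expected) { \
		__asm__ __volatile__("sevl" : : : "memory"); \
		do { \
			__asm__ __volatile__("wfe" : : : "memory"); \
			SKETCH_LOAD(ldr, ldar, r); \
		} while (v != expected); \
	} \
} while (0)
#else
/* Portable fallback (assumption): plain atomic polling. */
#define SKETCH_WAIT(ldr, ldar, r) do { \
	while ((v = __atomic_load_n(addr, memorder)) != expected) \
		; \
} while (0)
#endif

/* The skeleton: one definition per width, no duplicated asm bodies. */
#define SKETCH_DEFINE_WAIT(bits, ldr, ldar, r) \
static inline void \
sketch_wait_until_equal##bits(volatile uint##bits##_t *addr, \
			      uint##bits##_t expected, int memorder) \
{ \
	uint##bits##_t v; \
	SKETCH_WAIT(ldr, ldar, r); \
}

SKETCH_DEFINE_WAIT(16, "ldxrh", "ldaxrh", "w")
SKETCH_DEFINE_WAIT(32, "ldxr", "ldaxr", "w")
SKETCH_DEFINE_WAIT(64, "ldxr", "ldaxr", "x")

static void
sketch_skeleton_demo(void)
{
	volatile uint16_t a = 7;
	volatile uint32_t b = 8;
	volatile uint64_t c = 9;

	/* Values already match, so each call returns after one load. */
	sketch_wait_until_equal16(&a, 7, __ATOMIC_RELAXED);
	sketch_wait_until_equal32(&b, 8, __ATOMIC_ACQUIRE);
	sketch_wait_until_equal64(&c, 9, __ATOMIC_ACQUIRE);
	assert(a == 7 && b == 8 && c == 9);
}
```

Keeping the width-specific parts (load mnemonics and the `w`/`x` modifier) as macro arguments is what removes the six near-identical asm blocks.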

> +}
> +
> +#endif
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/common/include/generic/rte_pause.h b/lib/librte_eal/common/include/generic/rte_pause.h
> index 52bd4db..8f5f025 100644
> --- a/lib/librte_eal/common/include/generic/rte_pause.h
> +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> @@ -4,7 +4,6 @@
> 
>  #ifndef _RTE_PAUSE_H_
>  #define _RTE_PAUSE_H_
> -
>  /**
>   * @file
>   *
> @@ -12,6 +11,10 @@
>   *
>   */
> 
> +#include <stdint.h>
> +#include <rte_common.h>
> +#include <rte_atomic.h>
> +
>  /**
>   * Pause CPU execution for a short while
>   *
> @@ -20,4 +23,38 @@
>   */
>  static inline void rte_pause(void);
> 
> +#if !defined(RTE_USE_WFE)
> +#ifdef RTE_USE_C11_MEM_MODEL
> +#define __rte_wait_until_equal(addr, expected, memorder) do {\
> +	while (__atomic_load_n(addr, memorder) != expected) \
> +		rte_pause();\
> +} while (0)
> +#else
> +#define __rte_wait_until_equal(addr, expected, memorder) do {\
> +	while (*addr != expected)\
> +		rte_pause();\
> +	if (memorder != __ATOMIC_RELAXED)\
> +		rte_smp_rmb();\

Is this correct with respect to all memorder values?
If there is no specific gain from the non-C11 memory model, let's have only the C11 memory model,
i.e. remove RTE_USE_C11_MEM_MODEL.

> +} while (0)
> +#endif
> +

This is a public API. Let's have a prototype with very good documentation on the
API details.
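
A rough sketch combining the two suggestions above, i.e. a fallback that uses only the C11-style `__atomic` builtins, plus a documented prototype. `sketch_wait_until_equal32()` is a hypothetical name, the Doxygen text is illustrative rather than the final DPDK wording, and the pause hint is selected with compiler-defined architecture macros:

```c
#include <assert.h>
#include <stdint.h>

/**
 * Wait until *addr becomes equal to 'expected' (hypothetical sketch).
 *
 * @param addr
 *   Address of the 32-bit location to poll.
 * @param expected
 *   Value to wait for.
 * @param memorder
 *   Memory order for each load: __ATOMIC_RELAXED or __ATOMIC_ACQUIRE
 *   (__ATOMIC_SEQ_CST is also valid for a load; __ATOMIC_RELEASE and
 *   __ATOMIC_ACQ_REL are not).
 */
static inline void
sketch_wait_until_equal32(volatile uint32_t *addr, uint32_t expected,
			  int memorder);

static inline void
sketch_wait_until_equal32(volatile uint32_t *addr, uint32_t expected,
			  int memorder)
{
	/* The builtin applies 'memorder' itself, so no extra
	 * rte_smp_rmb()-style barrier is needed after the loop. */
	while (__atomic_load_n(addr, memorder) != expected) {
#if defined(__x86_64__) || defined(__i386__)
		__builtin_ia32_pause();
#elif defined(__aarch64__)
		__asm__ __volatile__("yield" : : : "memory");
#endif
	}
}

static void
sketch_c11_demo(void)
{
	volatile uint32_t ready = 1;

	/* Already equal, so this returns after a single acquire load. */
	sketch_wait_until_equal32(&ready, 1, __ATOMIC_ACQUIRE);
	assert(ready == 1);
}
```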


> +static __rte_always_inline void
> +rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected,
> +		int memorder)
> +{
> +	__rte_wait_until_equal(addr, expected, memorder);
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected,
> +		int memorder)
> +{
> +	__rte_wait_until_equal(addr, expected, memorder);
> +}
> +
> +static __rte_always_inline void
> +rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected,
> +		int memorder)
> +{
> +	__rte_wait_until_equal(addr, expected, memorder);
> +}
> +#endif /* RTE_USE_WFE */
> +
>  #endif /* _RTE_PAUSE_H_ */
> --
> 2.7.4



* Re: [dpdk-dev] [EXT] [PATCH v3 4/5] spinlock: use wfe to reduce contention on aarch64
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 4/5] spinlock: use wfe to reduce contention " Gavin Hu
@ 2019-07-24 12:17   ` " Jerin Jacob Kollanukkaran
  0 siblings, 0 replies; 42+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-07-24 12:17 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, thomas, stephen, Pavan Nikhilesh Bhagavatula, Honnappa.Nagarahalli

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Tuesday, July 23, 2019 9:14 PM
> To: dev@dpdk.org
> Cc: nd@arm.com; thomas@monjalon.net; stephen@networkplumber.org;
> Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Pavan Nikhilesh
> Bhagavatula <pbhagavatula@marvell.com>;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com
> Subject: [EXT] [PATCH v3 4/5] spinlock: use wfe to reduce contention on
> aarch64
> 
> In acquiring a spinlock, cores repeatedly poll the lock variable.
> This is replaced by rte_wait_until_equal API.
> 
> 5~10% performance gain was measured by running spinlock_autotest on
> 14 isolated cores of ThunderX2.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Tested-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
>  .../common/include/arch/arm/rte_spinlock.h         | 25 ++++++++++++++++++++++
>  .../common/include/generic/rte_spinlock.h          |  2 +-
>  2 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> index 1a6916b..f25d17f 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_spinlock.h
> @@ -16,6 +16,31 @@ extern "C" {
>  #include <rte_common.h>
>  #include "generic/rte_spinlock.h"
> 
> +/* armv7a does support WFE, but an explicit wake-up signal using SEV is
> + * required (must be preceded by DSB to drain the store buffer) and
> + * this is less performant, so keep armv7a implementation unchanged.
> + */
> +#if defined(RTE_USE_WFE) && defined(RTE_ARCH_ARM64)

See below. Please avoid complicated conditional compilation logic for scalability and readability.

> +static inline void
> +rte_spinlock_lock(rte_spinlock_t *sl)
> +{
> +	unsigned int tmp;
> +	/* http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.
> +	 * faqs/ka16809.html
> +	 */
> +	asm volatile(
> +		"sevl\n"
> +		"1:	wfe\n"
> +		"2:	ldaxr %w[tmp], %w[locked]\n"
> +		"cbnz   %w[tmp], 1b\n"
> +		"stxr   %w[tmp], %w[one], %w[locked]\n"
> +		"cbnz   %w[tmp], 2b\n"
> +		: [tmp] "=&r" (tmp), [locked] "+Q"(sl->locked)
> +		: [one] "r" (1)
> +		: "cc", "memory");
> +}
> +#endif
> +
>  static inline int rte_tm_supported(void)
>  {
>  	return 0;
> diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h b/lib/librte_eal/common/include/generic/rte_spinlock.h
> index 87ae7a4..cf4f15b 100644
> --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> @@ -57,7 +57,7 @@ rte_spinlock_init(rte_spinlock_t *sl)
>  static inline void
>  rte_spinlock_lock(rte_spinlock_t *sl);
> 
> -#ifdef RTE_FORCE_INTRINSICS
> +#if defined(RTE_FORCE_INTRINSICS) && !defined(RTE_USE_WFE)

I would like to avoid adjusting generic code to meet an architecture-specific requirement.
For example, if someone enables RTE_USE_WFE for armv7, the build will break,
and it will be painful for new architectures to use RTE_FORCE_INTRINSICS.

It looks like the time has come to disable RTE_FORCE_INTRINSICS for arm64.

Since this patch is targeted for the next release, how about enabling native
implementations on arm64, as x86 does, for the code that currently relies on RTE_FORCE_INTRINSICS, such as spinlock and ticketlock?
If you don't have the bandwidth to convert all blocks, let us know; we can collaborate,
and Marvell can take up some of the RTE_FORCE_INTRINSICS conversion for the next release.
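
For comparison, a minimal sketch of a generic test-and-test-and-set spinlock written only with GCC/Clang `__atomic` builtins, in which the waiting step is a single wait-until-equal call on the lock word; an architecture could then supply its own wait primitive (for example a WFE-based one on arm64) without special-casing the generic lock code. All names are hypothetical, and the wait helper here is plain polling with a pause hint.

```c
#include <assert.h>

typedef struct {
	volatile int locked; /* 0 = unlocked, 1 = locked */
} sketch_spinlock_t;

/* Hypothetical per-architecture hook; a WFE-based version could replace it. */
static inline void
sketch_wait_until_equal_int(volatile int *addr, int expected, int memorder)
{
	while (__atomic_load_n(addr, memorder) != expected) {
#if defined(__x86_64__) || defined(__i386__)
		__builtin_ia32_pause();
#elif defined(__aarch64__)
		__asm__ __volatile__("yield" : : : "memory");
#endif
	}
}

static inline void
sketch_spinlock_lock(sketch_spinlock_t *sl)
{
	int exp = 0;

	/* Try to take the lock; on failure, wait with plain loads until it
	 * looks free instead of hammering the line with CAS attempts. */
	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
					    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
		sketch_wait_until_equal_int(&sl->locked, 0, __ATOMIC_RELAXED);
		exp = 0;
	}
}

static inline int
sketch_spinlock_trylock(sketch_spinlock_t *sl)
{
	int exp = 0;

	return __atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
					   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static inline void
sketch_spinlock_unlock(sketch_spinlock_t *sl)
{
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}

static void
sketch_spinlock_demo(void)
{
	sketch_spinlock_t sl = { 0 };

	sketch_spinlock_lock(&sl);
	assert(sl.locked == 1);
	assert(!sketch_spinlock_trylock(&sl));   /* already held */
	sketch_spinlock_unlock(&sl);
	assert(sketch_spinlock_trylock(&sl));    /* free again */
	sketch_spinlock_unlock(&sl);
}
```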


>  static inline void
>  rte_spinlock_lock(rte_spinlock_t *sl)
>  {
> --
> 2.7.4



* Re: [dpdk-dev] [EXT] [PATCH v3 5/5] config: add WFE config entry for aarch64
  2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64 Gavin Hu
  2019-07-23 18:05   ` Stephen Hemminger
@ 2019-07-24 12:25   ` " Jerin Jacob Kollanukkaran
  1 sibling, 0 replies; 42+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-07-24 12:25 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, thomas, stephen, Pavan Nikhilesh Bhagavatula, Honnappa.Nagarahalli

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Tuesday, July 23, 2019 9:14 PM
> To: dev@dpdk.org
> Cc: nd@arm.com; thomas@monjalon.net; stephen@networkplumber.org;
> Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Pavan Nikhilesh
> Bhagavatula <pbhagavatula@marvell.com>;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com
> Subject: [EXT] [PATCH v3 5/5] config: add WFE config entry for aarch64
> 
> Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
> It can be enabled selectively based on the performance benchmarking.
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> 
> +# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs;
> +# calling these APIs puts the core into a low-power state while waiting
> +# for the memory address to become equal to the expected value.
> +# This is supported only by aarch64.
> +CONFIG_RTE_USE_WFE=n

# If it is specific to Arm and none of the other architectures support it, then
I would like to rename the config to CONFIG_RTE_ARM_USE_WFE.
# Even if it is disabled, have the =n entry in config/common_base so that
all supported configs in DPDK are listed in one place.
# Arrange all CONFIG_RTE_ARM_* entries together in config/common_base.



* Re: [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64
  2019-07-23 19:10     ` Honnappa Nagarahalli
@ 2019-07-24 17:59       ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-24 17:59 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Stephen Hemminger
  Cc: dev, nd, thomas, jerinj, pbhagavatula, nd

Hi Stephen,
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Wednesday, July 24, 2019 3:10 AM
> To: Stephen Hemminger <stephen@networkplumber.org>; Gavin Hu (Arm
> Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; thomas@monjalon.net;
> jerinj@marvell.com; pbhagavatula@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 5/5] config: add WFE config entry for aarch64
> 
> >
> > On Tue, 23 Jul 2019 23:43:46 +0800
> > Gavin Hu <gavin.hu@arm.com> wrote:
> >
> > > Add the RTE_USE_WFE configuration entry for aarch64, disabled by default.
> > > It can be enabled selectively based on the performance benchmarking.
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > > ---
> > >  config/arm/meson.build     | 1 +
> > >  config/common_armv8a_linux | 6 ++++++
> > >  2 files changed, 7 insertions(+)
> > >
> > > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > > index 979018e..496813a 100644
> > > --- a/config/arm/meson.build
> > > +++ b/config/arm/meson.build
> > > @@ -116,6 +116,7 @@ impl_dpaa = ['NXP DPAA', flags_dpaa, machine_args_generic]
> > >  impl_dpaa2 = ['NXP DPAA2', flags_dpaa2, machine_args_generic]
> > >
> > >  dpdk_conf.set('RTE_FORCE_INTRINSICS', 1)
> > > +dpdk_conf.set('RTE_USE_WFE', 0)
> > >
> > >  if not dpdk_conf.get('RTE_ARCH_64')
> > >  dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
> > > diff --git a/config/common_armv8a_linux b/config/common_armv8a_linux
> > > index 481712e..48c7ab5 100644
> > > --- a/config/common_armv8a_linux
> > > +++ b/config/common_armv8a_linux
> > > @@ -12,6 +12,12 @@ CONFIG_RTE_ARCH_64=y
> > >
> > >  CONFIG_RTE_FORCE_INTRINSICS=y
> > >
> > > +# Use WFE instructions to implement the rte_wait_until_equal_xxx APIs;
> > > +# calling these APIs puts the core into a low-power state while
> > > +# waiting for the memory address to become equal to the expected value.
> > > +# This is supported only by aarch64.
> > > +CONFIG_RTE_USE_WFE=n
> > > +
> > >  # Maximum available cache line size in arm64 implementations.
> > >  # Setting to maximum available cache line size in generic config
> > >  # to address minimum DMA alignment across all arm64 implementations.
> >
> > Introducing config options is a maintenance nightmare.
> > How are distributions supposed to ship a package?
> > Does full regression test get done on both options?
> >
> > The user should not be able to change this.
> Agree with these concerns here. In our tests, we are finding that this patch
> does not result in performance improvements on all micro-architectures.
> Maybe these micro-architectures will evolve in the future, knowing that
> WFE is being used in DPDK. But at this point, it does not make sense to
> enable this by default. This means additional testing/regression with the flag
> enabled. We could add this to Travis build (Travis yml file).
> 
> Currently, this patch will address use cases where the target
> hardware/environment is known during compilation.
In our testing, e.g. running testpmd and packet_ordering (for WFE ring benchmarking), it showed neither improvement nor degradation in performance.
Some micro-benchmarks sometimes showed slight improvements, and no degradation was seen.
The added benefit of the patch set is power saving, but that is not a primary concern in DPDK, and we lack a way to measure and benchmark it.



* Re: [dpdk-dev] [EXT] [PATCH v3 1/5] eal: add the APIs to wait until equal
  2019-07-24 11:52   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
@ 2019-07-24 18:10     ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 42+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-07-24 18:10 UTC (permalink / raw)
  To: jerinj, dev
  Cc: nd, thomas, stephen, Pavan Nikhilesh Bhagavatula, Honnappa Nagarahalli

Hi Jerin,
> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Wednesday, July 24, 2019 7:53 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: nd <nd@arm.com>; thomas@monjalon.net;
> stephen@networkplumber.org; Pavan Nikhilesh Bhagavatula
> <pbhagavatula@marvell.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: RE: [EXT] [PATCH v3 1/5] eal: add the APIs to wait until equal
> 
> > -----Original Message-----
> > From: Gavin Hu <gavin.hu@arm.com>
> > Sent: Tuesday, July 23, 2019 9:14 PM
> > To: dev@dpdk.org
> > Cc: nd@arm.com; thomas@monjalon.net; stephen@networkplumber.org;
> > Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Pavan Nikhilesh
> > Bhagavatula <pbhagavatula@marvell.com>;
> > Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com
> > Subject: [EXT] [PATCH v3 1/5] eal: add the APIs to wait until equal
> >
> > The rte_wait_until_equalxx APIs abstract the functionality of 'polling for a
> > memory location to become equal to a given value'.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Acked-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > ---
> >  .../common/include/arch/arm/rte_atomic_64.h        |   4 +
> >  .../common/include/arch/arm/rte_pause_64.h         | 106
> > +++++++++++++++++++++
> >  lib/librte_eal/common/include/generic/rte_pause.h  |  39 +++++++-
> >  3 files changed, 148 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > index 97060e4..8d742c6 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> > @@ -15,8 +15,12 @@ extern "C" {
> >
> >  #include "generic/rte_atomic.h"
> >
> > +#ifndef dsb
> >  #define dsb(opt) asm volatile("dsb " #opt : : : "memory")
> > +#endif
> > +#ifndef dmb
> >  #define dmb(opt) asm volatile("dmb " #opt : : : "memory")
> > +#endif
> 
> Is this change required? Please fix the root cause.
> I do see some build error too.
> 
> In file included from /home/jerin/dpdk.org/build/include/rte_pause_64.h:13,
>                  from /home/jerin/dpdk.org/build/include/rte_pause.h:13,
>                  from
> /home/jerin/dpdk.org/build/include/generic/rte_spinlock.h:25,
>                  from /home/jerin/dpdk.org/build/include/rte_spinlock.h:17,
>                  from /home/jerin/dpdk.org/drivers/bus/fslmc/mc/mc_sys.c:10:
> /home/jerin/dpdk.org/build/include/generic/rte_pause.h: In function
> 'rte_wait_until_equal16':
> /home/jerin/dpdk.org/build/include/generic/rte_pause.h:44:49: error:
> macro "dmb" passed 1 arguments, but takes just 0
>    44 |  __rte_wait_until_equal(addr, expected, memorder);
> 
> Command to reproduce(gcc 9.1)
> 
> rm -rf build && unset RTE_KERNELDIR && make -j  T=arm64-thunderx-linux-
> gcc  CROSS=aarch64-linux-gnu- && sed -ri
> 's,(CONFIG_RTE_KNI_KMOD=)y,\1n,' build/.config && sed -ri
> 's,(CONFIG_RTE_LIBRTE_VHOST_NUMA=)y,\1n,' build/.config &&  sed -ri
> 's,(CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=)y,\1n,' build/.config &&
> sed -ri  's,(CONFIG_RTE_EAL_IGB_UIO=)y,\1n,' build/.config && CC="ccache
> gcc" make -j  test-build CROSS=aarch64-linux-gnu-
> 
> 
> >
> >  #define rte_mb() dsb(sy)
> >
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > index 93895d3..1f7be0a 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_pause_64.h
> > @@ -17,6 +17,112 @@ static inline void rte_pause(void)
> >  	asm volatile("yield" ::: "memory");
> >  }
> >
> > +#ifdef RTE_USE_WFE
> 
> Do we need it to be under RTE_USE_WFE? If there is no harm, no need to
> add
> Conditional compilation to detect build errors, especially config is disabled
> by default.
> 
> > +/* Wait for *addr to be updated with expected value */ static
> > +__rte_always_inline void rte_wait_until_equal16(volatile uint16_t
> > +*addr, uint16_t expected, int memorder) {
> > +	uint16_t tmp;
> > +	if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile(
> > +			"ldxrh	%w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"b.eq	2f\n"
> > +			"sevl\n"
> > +			"1:	wfe\n"
> > +			"ldxrh	%w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"bne	1b\n"
> > +			"2:\n"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "Q"(*addr), [expected] "r"(expected)
> > +			: "cc", "memory");
> > +	else
> > +		asm volatile(
> > +			"ldaxrh %w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"b.eq	2f\n"
> > +			"sevl\n"
> > +			"1:	wfe\n"
> > +			"ldaxrh	%w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"bne	1b\n"
> > +			"2:\n"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "Q"(*addr), [expected] "r"(expected)
> > +			: "cc", "memory");
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected, int
> > +memorder) {
> > +	uint32_t tmp;
> > +	if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile(
> > +			"ldxr	%w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"b.eq	2f\n"
> > +			"sevl\n"
> > +			"1:	wfe\n"
> > +			"ldxr	%w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"bne	1b\n"
> > +			"2:\n"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "Q"(*addr), [expected] "r"(expected)
> > +			: "cc", "memory");
> > +	else
> > +		asm volatile(
> > +			"ldaxr  %w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"b.eq	2f\n"
> > +			"sevl\n"
> > +			"1:	wfe\n"
> > +			"ldaxr  %w[tmp], %w[addr]\n"
> > +			"cmp	%w[tmp], %w[expected]\n"
> > +			"bne	1b\n"
> > +			"2:\n"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "Q"(*addr), [expected] "r"(expected)
> > +			: "cc", "memory");
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected, int
> > +memorder) {
> > +	uint64_t tmp;
> > +	if (memorder == __ATOMIC_RELAXED)
> > +		asm volatile(
> > +			"ldxr	%x[tmp], %x[addr]\n"
> > +			"cmp	%x[tmp], %x[expected]\n"
> > +			"b.eq	2f\n"
> > +			"sevl\n"
> > +			"1:	wfe\n"
> > +			"ldxr	%x[tmp], %x[addr]\n"
> > +			"cmp	%x[tmp], %x[expected]\n"
> > +			"bne	1b\n"
> > +			"2:\n"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "Q"(*addr), [expected] "r"(expected)
> > +			: "cc", "memory");
> > +	else
> > +		asm volatile(
> > +			"ldaxr  %x[tmp], %x[addr]\n"
> > +			"cmp	%x[tmp], %x[expected]\n"
> > +			"b.eq	2f\n"
> > +			"sevl\n"
> > +			"1:	wfe\n"
> > +			"ldaxr  %x[tmp], %x[addr]\n"
> > +			"cmp	%x[tmp], %x[expected]\n"
> > +			"bne	1b\n"
> > +			"2:\n"
> > +			: [tmp] "=&r" (tmp)
> > +			: [addr] "Q"(*addr), [expected] "r"(expected)
> > +			: "cc", "memory");
> 
> Duplication of code. Please introduce a macro for assembly Skelton.
> Something like
> 
> http://patches.dpdk.org/patch/56949/
> 
> > +}
> > +
> > +#endif
> > +
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/librte_eal/common/include/generic/rte_pause.h
> > b/lib/librte_eal/common/include/generic/rte_pause.h
> > index 52bd4db..8f5f025 100644
> > --- a/lib/librte_eal/common/include/generic/rte_pause.h
> > +++ b/lib/librte_eal/common/include/generic/rte_pause.h
> > @@ -4,7 +4,6 @@
> >
> >  #ifndef _RTE_PAUSE_H_
> >  #define _RTE_PAUSE_H_
> > -
> >  /**
> >   * @file
> >   *
> > @@ -12,6 +11,10 @@
> >   *
> >   */
> >
> > +#include <stdint.h>
> > +#include <rte_common.h>
> > +#include <rte_atomic.h>
> > +
> >  /**
> >   * Pause CPU execution for a short while
> >   *
> > @@ -20,4 +23,38 @@
> >   */
> >  static inline void rte_pause(void);
> >
> > +#if !defined(RTE_USE_WFE)
> > +#ifdef RTE_USE_C11_MEM_MODEL
> > +#define __rte_wait_until_equal(addr, expected, memorder) do {\
> > +	while (__atomic_load_n(addr, memorder) != expected) \
> > +		rte_pause();\
> > +} while (0)
> > +#else
> > +#define __rte_wait_until_equal(addr, expected, memorder) do {\
> > +	while (*addr != expected)\
> > +		rte_pause();\
> > +	if (memorder != __ATOMIC_RELAXED)\
> > +		rte_smp_rmb();\
> 
> Is this correct with respect to all memorder values?
> If there is no specific gain from the non-C11 memory model, let's have
> only the C11 memory model, i.e. remove RTE_USE_C11_MEM_MODEL.

I am looking into all your comments (thanks!) and will submit a new version.

> > +} while (0)
> > +#endif
> > +
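A minimal sketch of the C11-only variant the reviewer suggests is shown below, together with a release/acquire handoff that uses it. The names cpu_pause(), wait_until_equal, producer and handoff_demo are hypothetical; cpu_pause() stands in for rte_pause(). Because the ordering is carried by the atomic load itself, there is no separate rte_smp_rmb() to get right. Note that only __ATOMIC_RELAXED, __ATOMIC_CONSUME, __ATOMIC_ACQUIRE and __ATOMIC_SEQ_CST are valid orders for a load, so __ATOMIC_RELEASE and __ATOMIC_ACQ_REL would need to be rejected by the API.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_pause(): a spin-wait hint where one exists. */
static inline void cpu_pause(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#elif defined(__aarch64__)
	asm volatile("yield" ::: "memory");
#endif
}

/* C11-only wait: each poll is an atomic load with the caller's memorder. */
#define wait_until_equal(addr, expected, memorder) do {            \
	while (__atomic_load_n((addr), (memorder)) != (expected))  \
		cpu_pause();                                       \
} while (0)

static uint32_t payload;   /* plain data published by the producer */
static uint32_t ready;     /* flag the consumer waits on */

static void *producer(void *arg)
{
	(void)arg;
	payload = 42;                                   /* plain store */
	__atomic_store_n(&ready, 1, __ATOMIC_RELEASE);  /* publish it */
	return NULL;
}

/* Returns the payload seen by the consumer after the acquire wait. */
static uint32_t handoff_demo(void)
{
	pthread_t t;
	uint32_t seen;

	payload = 0;
	__atomic_store_n(&ready, 0, __ATOMIC_RELAXED);
	if (pthread_create(&t, NULL, producer, NULL) != 0)
		return 0;
	wait_until_equal(&ready, 1u, __ATOMIC_ACQUIRE);
	/* The acquire load pairs with the release store in producer(),
	 * so the write of 42 is guaranteed to be visible here. */
	seen = payload;
	pthread_join(t, NULL);
	return seen;
}
```

With __ATOMIC_RELAXED instead of __ATOMIC_ACQUIRE in handoff_demo(), the wait would still terminate, but reading payload afterwards would no longer be ordered after the flag.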
> 
> Spotted a public API. Let's have a prototype with very good documentation
> of the API details.
> 
> 
> > +static __rte_always_inline void
> > +rte_wait_until_equal16(volatile uint16_t *addr, uint16_t expected,
> > +		int memorder)
> > +{
> > +	__rte_wait_until_equal(addr, expected, memorder);
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal32(volatile uint32_t *addr, uint32_t expected,
> > +		int memorder)
> > +{
> > +	__rte_wait_until_equal(addr, expected, memorder);
> > +}
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal64(volatile uint64_t *addr, uint64_t expected,
> > +		int memorder)
> > +{
> > +	__rte_wait_until_equal(addr, expected, memorder);
> > +}
> > +#endif /* RTE_USE_WFE */
> > +
> >  #endif /* _RTE_PAUSE_H_ */
> > --
> > 2.7.4
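For the reviewer's request for a well-documented public prototype, one possible shape is sketched below for the 32-bit variant; the 16- and 64-bit variants would be documented the same way. The name wait_until_equal32, the Doxygen wording and the assert() on memorder are illustrative assumptions, not text from the patch. The body is a plain atomic polling loop so the sketch is self-contained.

```c
#include <assert.h>
#include <stdint.h>

/**
 * Wait until the 32-bit value at @p addr becomes equal to @p expected.
 *
 * Hypothetical documentation sketch; the wording is illustrative only.
 *
 * @param addr
 *   Address of the location to poll. Another thread is expected to
 *   update it; the function does not return until it does.
 * @param expected
 *   Value to wait for.
 * @param memorder
 *   Memory order of each load of *addr: __ATOMIC_RELAXED or
 *   __ATOMIC_ACQUIRE. With __ATOMIC_ACQUIRE, memory accesses after the
 *   call cannot be reordered before the load that observed @p expected.
 */
static inline void
wait_until_equal32(volatile uint32_t *addr, uint32_t expected, int memorder)
{
	/* Reject orders that are not meaningful for a load. */
	assert(memorder == __ATOMIC_RELAXED || memorder == __ATOMIC_ACQUIRE);

	while (__atomic_load_n(addr, memorder) != expected)
		; /* a pause hint (or WFE on aarch64) would go here */
}
```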




Thread overview: 42+ messages
2019-06-30 16:21 [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Gavin Hu
2019-06-30 16:21 ` [dpdk-dev] [RFC 1/5] eal: add the APIs to wait until equal Gavin Hu
2019-06-30 20:27   ` Stephen Hemminger
2019-07-01  7:16     ` Gavin Hu (Arm Technology China)
2019-07-01  7:43       ` Thomas Monjalon
2019-07-02 14:07         ` Gavin Hu (Arm Technology China)
2019-07-01  9:58   ` Pavan Nikhilesh Bhagavatula
2019-07-02 14:08     ` Gavin Hu (Arm Technology China)
2019-06-30 16:21 ` [dpdk-dev] [RFC 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
2019-06-30 16:21 ` [dpdk-dev] [RFC 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
2019-06-30 16:21 ` [dpdk-dev] [RFC 4/5] spinlock: use wfe to reduce contention " Gavin Hu
2019-06-30 16:21 ` [dpdk-dev] [RFC 5/5] config: add WFE config entry for aarch64 Gavin Hu
2019-06-30 20:29 ` [dpdk-dev] [RFC 0/5] use WFE for locks and ring on aarch64 Stephen Hemminger
2019-07-01  9:12   ` Gavin Hu (Arm Technology China)
2019-07-03  8:58 ` [dpdk-dev] [RFC v2 " Gavin Hu
2019-07-03  8:58 ` [dpdk-dev] [RFC v2 1/5] eal: add the APIs to wait until equal Gavin Hu
2019-07-20  6:46   ` [dpdk-dev] [EXT] " Pavan Nikhilesh Bhagavatula
2019-07-03  8:58 ` [dpdk-dev] [RFC v2 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
2019-07-20  6:57   ` Pavan Nikhilesh Bhagavatula
2019-07-03  8:58 ` [dpdk-dev] [RFC v2 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
2019-07-03  8:58 ` [dpdk-dev] [RFC v2 4/5] spinlock: use wfe to reduce contention " Gavin Hu
2019-07-20  6:59   ` Pavan Nikhilesh Bhagavatula
2019-07-03  8:58 ` [dpdk-dev] [RFC v2 5/5] config: add WFE config entry for aarch64 Gavin Hu
2019-07-20  7:03   ` Pavan Nikhilesh Bhagavatula
2019-07-23 15:47     ` Gavin Hu (Arm Technology China)
2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 0/5] use WFE for locks and ring on aarch64 Gavin Hu
2019-07-23 19:15   ` Honnappa Nagarahalli
2019-07-23 21:27     ` Thomas Monjalon
2019-07-24  2:44       ` Honnappa Nagarahalli
2019-07-24  7:43         ` Thomas Monjalon
2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 1/5] eal: add the APIs to wait until equal Gavin Hu
2019-07-24 11:52   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
2019-07-24 18:10     ` Gavin Hu (Arm Technology China)
2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 2/5] ticketlock: use new API to reduce contention on aarch64 Gavin Hu
2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 3/5] ring: use wfe to wait for ring tail update " Gavin Hu
2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 4/5] spinlock: use wfe to reduce contention " Gavin Hu
2019-07-24 12:17   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
2019-07-23 15:43 ` [dpdk-dev] [PATCH v3 5/5] config: add WFE config entry for aarch64 Gavin Hu
2019-07-23 18:05   ` Stephen Hemminger
2019-07-23 19:10     ` Honnappa Nagarahalli
2019-07-24 17:59       ` Gavin Hu (Arm Technology China)
2019-07-24 12:25   ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
